
#226 Large file ergonomics

Closed on August 4, 2021
Ralith on December 12, 2020

Domains where large, typically binary, files must be version controlled are a rare case where git has failed to achieve much market share, and they represent an opportunity for Pijul to gain significant traction by finally bringing them DVCS-era conveniences. Examples include game development and scientific data. Work-arounds like git-lfs and git-annex exist, but they have major drawbacks that make them ultimately unsatisfying.

Git fails in these cases because it downloads the entire history of large files into every clone and retains it there, even though this is almost never required. Pijul’s data structure reportedly supports omitting the actual file data, suggesting that fast clones and efficient repository manipulation, with the actual data fetched lazily, are just a matter of user interface.

A compelling solution could identify files to handle shallowly based on a configuration that moves with the repository, ensuring consistent behavior when collaborating. The configuration could select files with a size threshold and/or path pattern matching, perhaps reusing the ignore file syntax. When cloning or manipulating a repository, only the data strictly necessary to construct a correct working tree would be fetched; future operations like unrecord or channel switch would fetch new data from the remote as needed. Some mechanism for recovering storage for long-unused states is probably also needed; perhaps a GC.
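
As a rough illustration of such a selection predicate, here is a minimal sketch, assuming a hypothetical ShallowConfig and using the glob crate as a stand-in for ignore-file syntax (none of these names exist in Pijul today):

    // Hypothetical sketch: decide whether a file's contents should be
    // handled "shallowly" (fetched lazily). `ShallowConfig` and its fields
    // are invented for illustration; `glob` stands in for ignore syntax.
    use glob::Pattern;

    struct ShallowConfig {
        size_threshold: u64,    // e.g. 10 * 1024 * 1024 (10 MiB)
        patterns: Vec<Pattern>, // e.g. compiled from "*.psd" or "assets/**"
    }

    impl ShallowConfig {
        fn is_shallow(&self, path: &str, size: u64) -> bool {
            size >= self.size_threshold
                || self.patterns.iter().any(|p| p.matches(path))
        }
    }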

Good support for large binary files will lead to people storing them on the Nest, so quotas or throttling for storage and bandwidth will likely be needed. Care should be taken that quotas are reasonable and charged sensibly, e.g. by finding an affordable storage provider and by not counting resource use against the quotas of users who are not responsible for incurring it.

I’m interested in helping make this happen, but I’ll need some mentoring through the pijul code base to get started.

pmeunier on December 13, 2020

I’m interested in helping make this happen, but I’ll need some mentoring through the pijul code base to get started.

I’m also eager to get this working, and I’d like to mentor people as well.

The good news is, this is already more or less working, in the sense that the patch format splits the data into an “edit” part and a “contents” part. When doing pijul pull, only the edit part is pulled and applied. Then, Pijul checks which parts of the file are alive, and downloads the relevant contents. The “edit” part contains a hash of the contents, which is verified when the full contents are downloaded.
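
As a rough sketch of that pull flow (every name here is invented for illustration, not Pijul’s actual API; assumes the blake3 crate):

    // Illustrative outline of the lazy pull described above.
    struct Change {
        edits: Vec<u8>,              // the mandatory "edit" part
        contents_hash: blake3::Hash, // hash of the optional "contents" part
    }

    // Applies the edit part, then downloads and verifies the contents only
    // if some part introduced by this change is still alive.
    fn pull_one(
        change: &Change,
        apply_edits: impl Fn(&[u8]) -> bool, // true if anything is alive
        fetch_contents: impl Fn() -> Vec<u8>,
    ) -> Result<Option<Vec<u8>>, String> {
        if !apply_edits(&change.edits) {
            return Ok(None); // nothing alive: skip the download entirely
        }
        let contents = fetch_contents();
        if blake3::hash(&contents) != change.contents_hash {
            return Err("contents hash mismatch".into());
        }
        Ok(Some(contents))
    }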

The main downside of this approach in my opinion is that if a patch touches a large number of large files, all of them are downloaded at the same time, since we need to verify the hash of the contents.

I’m also not totally sure what the implications of the lazy contents verification are: it allows a server to send a sequence of two changes A and B, where A introduces some contents deleted by B, in which case the contents of A are never downloaded or verified. I don’t believe this could cause any problem, though.

Some mechanism for recovering storage for long-unused states is probably also needed; perhaps a GC.

This isn’t implemented yet, but would be very useful. For unrecord, you can use pull --full before unrecording (which is rather counter-intuitive).

Ralith on December 13, 2020

When doing pijul pull, only the edit part is pulled and applied.

Oh, awesome! Does that mean that unrecord and channel switch already have the logic to fetch new data from the remote if needed? Or is more than just what’s needed for the current working copy downloaded on pull/clone?

For unrecord, you can use pull --full before unrecording (which is rather counter-intuitive).

Before unrecording? I’m confused as to how that would be able to free any storage.

pmeunier on December 14, 2020

Oh, awesome! Does that mean that unrecord and channel switch already have the logic to fetch new data from the remote if needed? Or is more than just what’s needed for the current working copy downloaded on pull/clone?

No (contributions welcome!), and that is actually the answer to your next question:

Before unrecording? I’m confused as to how that would be able to free any storage.

Sorry, I was wrong. The way to do it is: unrecord, and if some patch contents aren’t found, pull the patches.

Ralith on December 14, 2020

To clarify: under the existing behavior, unrecording or switching channels can leave the working tree in an inconsistent state until you pull, because the file contents may not be locally available?

pmeunier on December 14, 2020

Yes, I believe so. Channel switch (which I usually call reset) isn’t really a problem without unrecord. First, pull always pulls the full patches it needs, so if you never unrecord anything, you can always reset, since all the states are computed by pull.

But unrecord can indeed undelete a vertex, and reset will return an error if we don’t have the contents for that vertex.
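
To illustrate that invariant with an invented sketch: after a plain pull, every alive vertex has its contents locally, so reset can always materialise the working copy; unrecord may flip a vertex back to alive without its contents being present, and reset then has to fail:

    // Invented illustration, not Pijul's actual code: reset must
    // materialise every alive vertex, so it errors out on an alive vertex
    // whose contents were never fetched.
    struct Vertex {
        alive: bool,
        contents: Option<Vec<u8>>, // None if the contents were never fetched
    }

    fn reset(graph: &[Vertex]) -> Result<Vec<u8>, &'static str> {
        let mut out = Vec::new();
        for v in graph.iter().filter(|v| v.alive) {
            match &v.contents {
                Some(c) => out.extend_from_slice(c),
                None => return Err("missing contents for an alive vertex"),
            }
        }
        Ok(out)
    }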

Ralith on December 15, 2020

pull always pulls the full patches it needs

Where “needs” is defined exactly as “required by the current state for any channel”? Would it make sense to weaken this to “the current channel”?

When doing pijul pull, only the edit part is pulled and applied. Then, Pijul checks which parts of the file are alive, and downloads the relevant contents

I’m not sure I understand this correctly. If the edits are applied and the expected checksum is known, can’t the checksum be verified against local data only? What does it mean for only part of a file to be alive?

pmeunier on December 15, 2020

Where “needs” is defined exactly as “required by the current state for any channel”? Would it make sense to weaken this to “the current channel”?

No, and yes.

The way this works is, the pristine is essentially a graph of binary chunks, with some vertices marked “alive” and others marked “dead”. When pulling, all changes to the graph are applied, and then, if a vertex introduced by a newly pulled change is still alive, the entire change is fetched.
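
In sketch form (types invented for illustration, not the real pristine):

    // Sketch of the liveness check described above; these types are invented.
    #[derive(PartialEq)]
    enum Status {
        Alive,
        Dead,
    }

    struct Vertex {
        introduced_by: u64, // id of the change that introduced this chunk
        status: Status,
    }

    // After applying all graph edits, the full change (with its contents)
    // is fetched only if it still has at least one alive vertex.
    fn must_fetch(graph: &[Vertex], change_id: u64) -> bool {
        graph
            .iter()
            .any(|v| v.introduced_by == change_id && v.status == Status::Alive)
    }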

I’m not sure I understand this correctly. If the edits are applied and the expected checksum is known, can’t the checksum be verified against local data only?

Not sure I understand the question. The checksum for the contents of a state is not always known; only the hashes of the patches are known. As an optimisation, deciding whether two sets of changes are equal is done with “commutative hashes” (which you can display with pijul log --states).
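
The actual construction isn’t shown here, but the property can be illustrated with any order-independent combination of per-change hashes, for example:

    // Toy illustration of a commutative hash, not Pijul's real
    // construction: combining per-change hashes with wrapping addition
    // makes the result independent of the order in which changes were
    // applied, so two channels holding the same set of changes compare equal.
    fn state_hash(change_hashes: &[u64]) -> u64 {
        change_hashes.iter().fold(0u64, |acc, h| acc.wrapping_add(*h))
    }

    // state_hash(&[a, b]) == state_hash(&[b, a]) for any a and b.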

Ralith on December 31, 2020

some vertices marked “alive” and others marked “dead”.

What are the criteria for a vertex being marked alive? How does this relate to channels?

Not sure I understand the question.

To back up a bit, I’m confused by this previous statement:

if a patch touches a large number of large files, all of them are downloaded at the same time, since we need to verify the hash of the contents.

Is Pijul downloading complete files, or just diffs? Why would you need to download a complete file to verify the checksum of data that’s already on disk, as opposed to just downloading a checksum directly? Is this just a missing optimization?

Thanks for your patience with all these questions, and apologies if it’s all documented somewhere I haven’t found!

pmeunier on December 31, 2020

What are the criteria for a vertex being marked alive? How does this relate to channels?

The dead vertices are the ones that have been deleted; the alive vertices are the ones that have not.

Is Pijul downloading complete files, or just diffs? Why would you need to download a complete file to verify the checksum of data that’s already on disk, as opposed to just downloading a checksum directly?

Just diffs, but it stores them in files, hence the confusion. Let’s talk about “repository files” for the files tracked by Pijul (as printed by pijul ls), and “diff files” for files representing diffs (as found in .pijul/changes).

Diff files have essentially two sections: the first is mandatory; the second, containing the contents introduced by the diff, is optional. However, if you refer to that content at some point, you must have the entire diff file, not just the tiny fraction you’re looking at.
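
In other words, something like this (field names invented for the sketch):

    // Rough shape of a diff file as described above; names are invented.
    struct DiffFile {
        edits: Vec<u8>,            // mandatory: the edits, incl. the contents hash
        contents: Option<Vec<u8>>, // optional: all contents introduced by this diff
    }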

Now, about my explanation above: Pijul is not optimised for that use case (patches modifying lots of huge files in significant ways, where all of these files but one are deleted later), but I don’t think optimising for it would make much sense anyway, since:

  1. this would mean adding a finer per-file hash, complicating a security-critical thing (hash verification), whereas now hash verification can be done easily with 10 lines of Rust depending only on the blake3 and zstd-seekable crates (see the sketch after this list).
  2. if someone makes huge patches touching a lot of huge files in significant ways, and later only needs a tiny part of these files because they deleted the rest, I don’t really know what they’re doing, so I’d rather understand their exact use case before introducing such optimisations.
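
For reference, a hedged sketch of what that verification could look like, assuming (for this sketch only) that the recorded hash covers the contents section as stored in the diff file; the real code also relies on zstd-seekable on the compression side:

    // Sketch of the simple verification mentioned in point 1, using the
    // blake3 crate; the function name and chunked input are illustrative.
    fn verify_contents<'a>(
        section: impl Iterator<Item = &'a [u8]>, // the section, in chunks
        expected: &blake3::Hash,
    ) -> bool {
        let mut hasher = blake3::Hasher::new();
        for chunk in section {
            hasher.update(chunk); // hash incrementally, chunk by chunk
        }
        hasher.finalize() == *expected
    }
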
pmeunier added tag Beta on January 12, 2021
pmeunier on June 7, 2021

Btw, I don’t remember whether I already told you, but I added my “bindiff” crate there, if you are still interested in the issue: https://nest.pijul.com/pmeunier/bindiff

I’m interested in your input on that very basic attempt to fix this.

pmeunier on August 4, 2021

This has now been merged into the current version. Thanks for the input!

pmeunier closed this discussion on August 4, 2021