The sound distributed version control system

#847 On the danger of forced linearization

Opened by dfranke on December 21, 2023
dfranke on December 21, 2023

I just discovered pijul yesterday and promptly did a deep dive into its data model and theory of patches. There I came across something that alarmed me. Channels in pijul impose on changes a total ordering compatible with the partial ordering that the theory of patches derives from their dependencies. This differs from git’s model, in which there is no such total ordering because merge commits have multiple predecessors. I think this aspect of pijul’s model is a big mistake, and here’s why.

Consider a single-file repository with the following initial commit to the main channel:

#![allow(unused_variables)]
fn main() {
    let mut x = 0;
    let mut y = 0;

    // x += 1;             // 1B
    // x -= 1;             // 1C
    // y += 1;             // 2B
    // y -= 1;             // 2C
    // assert_eq!(x, 0);   // 2A
    // assert_eq!(y, 0);   // 1A
}

// This comment is updated by every change on channel 1.

// This comment is updated by every change on channel 2.

We then create two channels, #1 and #2, forked from main, and then record three changes to each channel. Changes 1A, 1B, and 1C, made in that order on channel 1, uncomment the respective lines in main, and also modify the comment at the bottom. Ditto 2A, 2B, and 2C on channel 2.

Now let’s consider what happens if we want to merge everything from both these channels back into main.

Textually, according to the theory of patches, we have dependencies

  • 1A < 1B < 1C
  • 2A < 2B < 2C

because of changes to the comments at the bottom. However, there are also semantic constraints which the theory of patches doesn’t capture:

  • !(2A < 1C)
  • !(1A < 2C)

Here we take x < y to be false if either x or y is entirely absent. When these semantic constraints are violated, the program crashes with an assertion failure.

There is no total ordering which simultaneously satisfies all of these constraints. The program will compile and run normally at every point in the history of channel 1 and of channel 2. When these two channels are merged together back into main, the program also will compile and run normally at the tip of the channel where all six changes are present. But there is no way to perform this merge without introducing somewhere into the main channel’s history a crash bug which never actually existed at any point in the program’s development!
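To make the impossibility concrete, here’s a small brute-force check (my own throwaway modelling, nothing to do with Pijul’s code; the names are mine): it enumerates all twenty total orders compatible with the two dependency chains and confirms that every one of them passes through a crashing state, even though the final merged state is fine.

```rust
// Enumerate every total order compatible with the dependency chains
// 1A < 1B < 1C and 2A < 2B < 2C, and verify that each one passes
// through a state that crashes.

const CH1: [&str; 3] = ["1A", "1B", "1C"];
const CH2: [&str; 3] = ["2A", "2B", "2C"];

// Does the program crash when exactly these changes are applied?
// assert_eq!(x, 0) is uncommented by 2A and fails iff x += 1 (1B)
// is applied without x -= 1 (1C); symmetrically for 1A, 2B, 2C.
fn crashes(applied: &[&str]) -> bool {
    let has = |c: &str| applied.contains(&c);
    (has("2A") && has("1B") && !has("1C"))
        || (has("1A") && has("2B") && !has("2C"))
}

fn main() {
    let (mut total, mut crashing) = (0, 0);
    // A 6-bit mask with three bits set picks the slots taken by
    // channel 1; each chain is internally ordered already.
    for mask in 0u32..64 {
        if mask.count_ones() != 3 {
            continue;
        }
        let (mut i1, mut i2) = (0, 0);
        let mut order: Vec<&str> = Vec::new();
        for slot in 0..6 {
            if mask & (1 << slot) != 0 {
                order.push(CH1[i1]);
                i1 += 1;
            } else {
                order.push(CH2[i2]);
                i2 += 1;
            }
        }
        total += 1;
        // Does some intermediate state along this order crash?
        if (1..=6).any(|k| crashes(&order[..k])) {
            crashing += 1;
        }
    }
    assert_eq!(total, 20);
    assert_eq!(crashing, 20); // every linearization hits a crash,
    assert!(!crashes(&["1A", "1B", "1C", "2A", "2B", "2C"])); // yet the merged result is fine
    println!("all {total} linearizations crash somewhere");
}
```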

This sort of misrepresentation seems to run against some fundamental desiderata of what a version control system is supposed to do. It becomes concretely painful if you’re trying to track down bugs using something akin to git bisect. Channels need to be rethought and generalized so that they are capable of representing nonlinear history.

pmeunier on December 23, 2023

This is a really interesting and insightful comment, and I agree with everything. Congrats on the deep dive! I find the wording “big mistake” a little strong though: the current behaviour is actually something we might want in some cases, some projects/repos.

The alternative (snapshot-based systems) has a different kind of “big mistake”, which is to believe that the 3-way merge problem (try and merge two HEADs using a common ancestor) can be solved efficiently in a transparent (associative) way.

As you mentioned, this could be solved in Pijul by adding metadata to patches to group them by “topic”. This would IMHO yield the best of both worlds. This was actually suggested on our Zulip a little while ago as a way to fix another problem we have (the current inability to edit patches with dependents).

The best part of this is that if we were to implement it, we would still keep the current workflows and add new ones. A “channel” would still be the linear log of one possible history (a bit like a Git reflog, but scoped to a “flavour” of the project in one specific repo), but you would be able to organise some of your features as branches, and not all of them.

Of course, some projects would be able to enforce good practices by forcing all patches to be tagged.

dfranke on December 24, 2023

I do think there’s an elegant synthesis to be had here. I don’t think the topic concept is quite it.

I think “patch-based” versus “snapshot-based” is a false dichotomy. They represent two distinct, but non-conflicting, goals for what you want out of a version control system. Snapshots are the “what”: they capture the time-evolution of the state of your project. Patches are the “why”: each patch is an expression of programmer intent, and by representing a transition from one snapshot to another as a series of patches, you see what intentions went into that change. Tracking patches from one branch/channel to another therefore functions as a means of confirming that a merge was carried out as intended and that no code was inadvertently dropped or introduced. Notice that I’m edging close to something like a Curry-Howard paradigm here, where snapshots are types and patches are terms that “prove” those types.

In the synthesis I have in mind, channels consist of a linear succession of snapshots, and the transition to each snapshot from its predecessor (a “commit”) consists of a sequence of patches along with a justification for them. What the justification must show depends on the type of commit being made.

  • A commit which is new development or a “fast-forward”, meaning that it applies one or more patches against the same snapshot that they were originally written against, is axiomatically self-justifying.
  • A commit which cherry-picks patches from another channel needs to show that those patches commute freely from one point to the other, or else carry a conflict-resolution at each point where they can’t.
  • Justifying a merge commit is the same as justifying a cherry-pick of each patch being merged.
  • Justifying a revert is like justifying a cherry-pick of an inverse patch from the point immediately after the original patch was introduced.

The justification logic should also include identity and composition laws: it’s always legal to add a no-op patch, and it’s always legal to split a patch into two patches which compose into it.

In my original example, the history of the main channel would reflect the initial commit, then a merge of branch 1, and then a merge of branch 2. The initial commit would be new development and therefore require no justification. The merge of branch 1 would be justified by noting that it’s just a fast-forward, since branch 1 takes off from the same initial commit that the main branch is still on. The merge of branch 2 would be justified by showing that 2A, 2B, and 2C all commute cleanly across 1A, 1B, and 1C (which they do, because the patch algebra doesn’t know anything about the semantic constraints created by the assert statements). If you were to construct the intermediate state reached part-way through the justification of the second merge, you’d still witness the assertion failure, but the channel history would clearly reflect that the merge was carried out as a single transaction and that this state never really occurred.

Now, suppose that you’re an especially careful developer and that as a matter of policy, you want to enforce that your test suite passes even at the intermediate steps of a merge. This is where the identity and composition laws become useful. One approach could be to decompose each patch into the hunk that changes the comment at the bottom and the hunk that uncomments a line in the main function, so that the two hunks that uncomment the assertions can commute to the end of the sequence. Another approach could be to introduce a no-op patch at the beginning of the sequence, decompose that no-op into a patch which adds a return statement above the assertions and a patch which deletes that statement, and then commute the deletion to the end of the sequence.
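To illustrate the first approach on the earlier example (again my own toy modelling, not Pijul code): split each change into its comment hunk and its code hunk, commute the two assertion hunks to the end of the merge sequence, and every intermediate state passes.

```rust
// Toy model of the hunk-splitting trick: each change is split into a
// comment hunk and a code hunk, and the assertion hunks go last.

#[derive(Clone, Copy, PartialEq)]
enum Hunk {
    Comment,  // edits a comment at the bottom; no semantic effect
    IncX,     // x += 1 (from 1B)
    DecX,     // x -= 1 (from 1C)
    IncY,     // y += 1 (from 2B)
    DecY,     // y -= 1 (from 2C)
    AssertX,  // assert_eq!(x, 0) (from 2A)
    AssertY,  // assert_eq!(y, 0) (from 1A)
}

// Would the program crash with exactly these hunks applied?
fn crashes(applied: &[Hunk]) -> bool {
    let has = |h: Hunk| applied.contains(&h);
    let x = has(Hunk::IncX) as i32 - has(Hunk::DecX) as i32;
    let y = has(Hunk::IncY) as i32 - has(Hunk::DecY) as i32;
    (has(Hunk::AssertX) && x != 0) || (has(Hunk::AssertY) && y != 0)
}

fn main() {
    use Hunk::*;
    // Merge sequence with both assertion hunks commuted to the end:
    let seq = [Comment, IncX, DecX, Comment, IncY, DecY, AssertX, AssertY];
    for k in 1..=seq.len() {
        assert!(!crashes(&seq[..k]), "prefix of length {k} crashes");
    }
    // Without the split, 2A's assertion can land between 1B and 1C:
    assert!(crashes(&[IncX, AssertX]));
    println!("every prefix of the reordered sequence is safe");
}
```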

Some elaboration on your homomorphic hashing scheme could turn it into a system for justifying commits in zero knowledge, but I’m not sure if this is useful beyond being fun to think about.

pmeunier on December 25, 2023

First, I understand why you disagree with the patch/snapshot dichotomy, but it is important for algorithms. When your primary objects are snapshots, I don’t know of any efficient algorithm to merge them that has the properties I’m interested in: associativity is the most basic one; commutativity between conflict resolutions and merges (which Git tries to emulate using git rerere, when that command works) is next. This becomes obvious with office documents: in the tiny company I work for, I keep insisting that others send me comments and changes on existing office documents, rather than entire new versions, because it is hard for me to understand how the documents have evolved.

Your comment is actually close to something I’ve been wanting to implement for ages, as a way to make history browsing efficient: at the moment there’s only one data structure, which gets edited with each patch. However, if we split this data structure into a stack of “light tags”, where each “light tag” is a compressed version of the parts that differ from the rest of the stack, this could become the hybrid system we’re thinking about.

The main question becomes: what UI do we want for that? You seem to imply that a “light tag” should be created implicitly on every merge. One issue is that merges aren’t explicit at the moment; pijul pull and pijul apply can each be used to mean many different things in your context. Maybe adding a --create-merge-tag flag to pijul pull would work? Did you have another way in mind?

The exact UI might become clearer once light tags are implemented, I should definitely start working on that.

dfranke on December 25, 2023

I’m observing some convergent evolution happening here :-). In git’s case, it presents its data model as being snapshot-based even at the level of plumbing commands, but then when you look a little deeper under the hood into the format of its packfiles you see that it actually represents a lot of things as patches for the sake of efficiency. And then much later, to work around the inadequacies of three-way merge, it grew rerere with its first-class notion of merge conflicts. Pijul had patches and first-class merge conflicts from the very beginning, and now we’re here discussing how best to represent snapshots.

Anyway, I think what you want is, yes, a “light tag” created by every command that mutates a channel, but not necessarily any explicit ontology of operation types. Instead, have the tag record a provenance for each patch added to or removed from the channel since the previous tag, where the provenance is a pointer to another light tag, or nil when the patch is new development. This would subsume concepts such as cherry-pick, merge, and revert, as they’d simply become descriptive of information that can be gleaned from provenance.
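As a rough sketch of the shape of that record (all types and names hypothetical, just to fix ideas; this is not Pijul’s actual data model):

```rust
// Hypothetical provenance record attached to each light tag.

type TagId = u64;
type PatchId = u64;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Provenance {
    New,          // new development: no prior home
    From(TagId),  // taken from (or, for removals, undone relative to)
                  // the state recorded by another light tag
}

#[derive(Debug)]
struct LightTag {
    prev: Option<TagId>,                 // previous tag on this channel
    added: Vec<(PatchId, Provenance)>,   // patches added since `prev`
    removed: Vec<(PatchId, Provenance)>, // patches removed since `prev`
}

// "Merge", "cherry-pick" and "revert" become readings of the
// provenance data rather than distinct operation types.
fn merged_from(tag: &LightTag, src: TagId) -> usize {
    tag.added
        .iter()
        .filter(|(_, p)| *p == Provenance::From(src))
        .count()
}

fn main() {
    // A tag that merges two patches from tag 7 and adds one new patch:
    let tag = LightTag {
        prev: Some(3),
        added: vec![
            (100, Provenance::From(7)),
            (101, Provenance::From(7)),
            (102, Provenance::New),
        ],
        removed: vec![],
    };
    assert_eq!(merged_from(&tag, 7), 2);
    assert_eq!(merged_from(&tag, 9), 0);
    println!("tag merges {} patches from tag 7", merged_from(&tag, 7));
}
```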