pijul_org / manual

More theory, format, and licenses

By pmeunier on February 27, 2019
This patch is not signed.
8acQniDBXbPDipeqmc3PjMu135yFgJJQ4rEXg7qyaLabNeCJ4hQX5k2jwEUb1fVkNWmrYW5xykrjUtgtchPSzsTM
This patch is in the following branches:
master
In file LICENSE
1
2
3
4
This work is licensed under the Creative Commons Attribution-ShareAlike 3.0 France License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0 or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.


Code exerpts taken from Pijul and reproduced here are licensed under the same license as Pijul, i.e. the GPL-2.0 or any later version at your convenience.
44
45
46
    - [pijul tag](./reference/tag.md)
    - [pijul unrecord](./reference/unrecord.md)
- [Licenses](./licenses.md)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
The `partials` table maps branch names to the subset of the repository tracked by this branch.
# Format of a Pijul repository

When running `tree` in a Pijul repository with one patch, we get the following result:

~~~
.pijul
├── changes.wUAdAoNq
├── hooks
├── id
├── local
│   └── ignore
├── meta.toml
├── patches
│   ├── AKm4JUd3Z5LRLpg1jezgmhLWGyzR27Qv9ATDFSb7QquqvtMwqEoNiiuVP1u2ho7V8qbprRCRPsmGY6bumYjoH8JZ.gz
│   └── AKm4JUd3Z5LRLpg1jezgmhLWGyzR27Qv9ATDFSb7QquqvtMwqEoNiiuVP1u2ho7V8qbprRCRPsmGY6bumYjoH8JZ.sig
├── pristine
│   ├── db
│   └── db.lock
└── version

4 directories, 9 files
~~~

There are a number of files:


## Pristine

This directory contains two files, one called `db` and another one called `db.lock`.
File `db` is a [Sanakirja](https://crates.io/crates/sanakirja) database, which is essentially a collection of B trees stored in a file. Sanakirja databases have the following properties:

- They are transactional: committing a transaction is an atomic operation, and transactions can be cancelled entirely.

- They achieve transactionality using the copy-on-write strategy, which enables readers to read in parallel to one writer. Before a transaction is committed, readers only see the pre-commit version of the database. Moreover, multiple writers still exclude each other.

- B trees of size n are clonable in time and space O(log n). In practice, the complexity of cloning is often much lower (i.e. independent from the database size), depending on the history of the database.

The `.lock` file is used to exclude multiple processes to access `db` concurrently, since there is no inter-process transaction synchronisation in Sanakirja at the moment.

Pijul stores a number of tables, defined in module `libpijul::backend`:

~~~ rust
pub struct Dbs {
    /// A map of the files in the working copy.
    tree: sanakirja::Db<self::file_id::UnsafeFileId, self::inode::Inode>,
    /// The reverse of tree.
    revtree: sanakirja::Db<self::inode::Inode, self::file_id::UnsafeFileId>,
    /// A map from inodes (in tree) to keys in branches.
    inodes: sanakirja::Db<self::inode::Inode, self::file_header::FileHeader>,
    /// The reverse of inodes, minus the header.
    revinodes: sanakirja::Db<self::key::Key<PatchId>, self::inode::Inode>,
    /// Text contents of keys.
    contents: sanakirja::Db<self::key::Key<PatchId>, sanakirja::value::UnsafeValue>,
    /// A map from external patch hashes to internal ids.
    internal: sanakirja::Db<self::hash::UnsafeHash, self::patch_id::PatchId>,
    /// The reverse of internal.
    external: sanakirja::Db<self::patch_id::PatchId, self::hash::UnsafeHash>,
    /// A reverse map of patch dependencies, i.e. (k,v) is in this map
    /// means that v depends on k.
    revdep: sanakirja::Db<self::patch_id::PatchId, self::patch_id::PatchId>,
    /// A map from branch names to graphs.
    branches:
        sanakirja::Db<self::small_string::UnsafeSmallStr, (NodesDb, PatchSet, RevPatchSet, u64)>,
    /// A map of edges to patches that remove them.
    cemetery: sanakirja::Db<(self::key::Key<PatchId>, self::edge::Edge), self::patch_id::PatchId>,
    /// Dependencies
    dep: sanakirja::Db<self::patch_id::PatchId, self::patch_id::PatchId>,
    /// Files touched by patches.
    touched_files: sanakirja::Db<self::key::Key<PatchId>, self::patch_id::PatchId>,
    /// Partial checkouts: branch -> partial
    partials: sanakirja::Db<self::small_string::UnsafeSmallStr, self::key::Key<PatchId>>,
}
~~~

Let's look at each of these tables:

~~~ rust
    /// A map of the files in the working copy.
    tree: sanakirja::Db<self::file_id::UnsafeFileId, self::inode::Inode>,
    /// The reverse of tree.
    revtree: sanakirja::Db<self::inode::Inode, self::file_id::UnsafeFileId>,
~~~

The "tree" and "revtree" tables are used to represent the working directory tree, and assign "inodes", which are just arbitrary integers, to files and directories.

~~~ rust
    /// A map from inodes (in tree) to keys in branches.
    inodes: sanakirja::Db<self::inode::Inode, self::file_header::FileHeader>,
    /// The reverse of inodes, minus the header.
    revinodes: sanakirja::Db<self::key::Key<PatchId>, self::inode::Inode>,
    /// Text contents of keys.
~~~

The `inodes` and `revinodes` table map inodes (i.e. files in the working directory) to vertices in the graph, plus states of the file. The values in the `inodes` table have type `FileHeader`, which contains an inode, plus the pre-record "state" of the file, i.e. what the user did with the file recently.

~~~ rust
    contents: sanakirja::Db<self::key::Key<PatchId>, sanakirja::value::UnsafeValue>,
~~~

The `contents` table just maps nodes in the graph to their binary contents.

~~~ rust
    /// A map from external patch hashes to internal ids.
    internal: sanakirja::Db<self::hash::UnsafeHash, self::patch_id::PatchId>,
    /// The reverse of internal.
    external: sanakirja::Db<self::patch_id::PatchId, self::hash::UnsafeHash>,
~~~

The `internal` and `external` tables map internal patch ids (64-bit integers) to external patches (SHA512 hashes), in both directions.

~~~ rust
    /// A reverse map of patch dependencies, i.e. (k,v) is in this map
    /// means that v depends on k.
    revdep: sanakirja::Db<self::patch_id::PatchId, self::patch_id::PatchId>,
~~~

The `revdep` table contains a reverse mapping of the dependencies. Dependencies of a patch are stored in the patch file itself, but the `unrecord` command needs to know whether a patch is still depended upon before accepting to remove it. This table provides that information.

~~~ rust
    /// A map from branch names to graphs.
    branches:
        sanakirja::Db<self::small_string::UnsafeSmallStr, (NodesDb, PatchSet, RevPatchSet, u64)>,
~~~

The `branches` table is a map from branch names to a triple of:

- A `NodesDB`, representing the graph of the branch.
- A `PatchSet`, mapping internal patch identifiers to their sequence number (a 64-bit integer) on the branch.
- A `RevPatchSet`, mapping sequence numbers to internal patch identifiers. In practice, this table is implementing a randomly-addressable stack.


~~~ rust
    /// A map of edges to patches that remove them.
    cemetery: sanakirja::Db<(self::key::Key<PatchId>, self::edge::Edge), self::patch_id::PatchId>,
~~~

The `cemetery` table is used in `unrecord`, and is used to indicate what to do in the situation where two patches change the label of the same edge, and we unrecord one of them. What should be the label of the resulting edge? The `cemetery` table stores which patch deletes which edge.

This table duplicates information contained in the patches, but makes things more self-contained and faster to read when needed.

~~~ rust
    /// Dependencies
    dep: sanakirja::Db<self::patch_id::PatchId, self::patch_id::PatchId>,
~~~

The `dep` table maps internal patch identifiers to their dependencies. This piece of information is also contained in the patches, but having it here makes the database more self-contained and faster to access.

~~~ rust
    /// Files touched by patches.
    touched_files: sanakirja::Db<self::key::Key<PatchId>, self::patch_id::PatchId>,
~~~

The `touched_files` table maps vertices in the graph (vertices representing files) to patch identifiers. It is used to tell which patches to pull if we only want the subset of patches that touched a certain file or directory in the repository.

~~~ rust
    /// Partial checkouts: branch -> partial
    partials: sanakirja::Db<self::small_string::UnsafeSmallStr, self::key::Key<PatchId>>,
~~~

The `partials` table maps branch names to the subset of the repository tracked by this branch. When we do a partial clone, we fully apply all the patches related to the part of the repository we want to get. However, these patches might also touch other parts that we don't want to output. This table tells what parts of the repository we want to work on in that case.
1
2
3
4
5
6
7
# Licenses

Code exerpts taken from Pijul and reproduced in this manual are licensed under the same license as Pijul, i.e. the GPL-2.0 or any later version at your convenience.
See [https://www.gnu.org/licenses/old-licenses/gpl-2.0.txt](https://www.gnu.org/licenses/old-licenses/gpl-2.0.txt) for the full text of the license.


All other parts of this manual is licensed under the Creative Commons Attribution-ShareAlike 3.0 License. To view a copy of this license, visit [http://creativecommons.org/licenses/by-sa/3.0](http://creativecommons.org/licenses/by-sa/3.0) or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA. The full text of the license is available at [https://creativecommons.org/licenses/by-sa/3.0/legalcode](https://creativecommons.org/licenses/by-sa/3.0/legalcode).
3
4

5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
The theory of Pijul is actually quite simple, even though it requires great care in the implementation, in order to get the correct algorithmic complexity.

Files are generalised to arbitrary graphs of lines, with different kinds of labels on the edges. The global data structure in a repository is *append-only*, in the sense that patches can never delete lines: line deletions from a file are implemented as changes in the labels of the edges pointing to that line.
In Pijul, **files are generalised to arbitrary graphs of lines**, with different kinds of labels on the edges. In this representation, an edge from line a to line b means that a comes before b in the file.
The global data structure in a repository is *append-only*, in the sense that patches can never delete lines: line deletions from a file are implemented as changes in the labels of the edges pointing to that line.

The hidden Pijul subcommand `info` (`pijul info --debug`) dumps the entire graph into a file in graphviz' dot format, one for each branch of the repository, called `debug_branch` (usually `debug_master` by default).

## Motivation of this representation

This idea is motivated by a notion of category theory called a [pushout](https://en.wikipedia.org/wiki/Pushout_(category_theory) (see also [the ncat lab](https://ncatlab.org/nlab/show/pushout)).

A pushout can be seen as a merge between two operations. Originally, many of the ideas behind the theory of Pijul come from [a paper by Mimram and Di Giusto](https://arxiv.org/abs/1311.3903) showing how to generalise flat files to make them have all pushouts. In other words, they show that a close relative of the representation of files in Pijul is the smallest generalisation of standard files that handles all the conflicts, modulo some caveats (that paper doesn't handle line deletions or multiple files).

If you want to learn more about category theory, a good starting point is to try and understand the [Yoneda lemma](https://ncatlab.org/nlab/show/Yoneda+lemma), which roughly shows how to relate any category to operations related to sets. This is important in particular because sets are a well-understood category. A few concepts are required to understand that lemma, mostly [categories](https://ncatlab.org/nlab/show/category+theory), the important example of [the "Set" category](https://ncatlab.org/nlab/show/Set), [functors](https://ncatlab.org/nlab/show/functor) and  [natural transformations](https://ncatlab.org/nlab/show/natural+transformation) between them, and finally [presheaves](https://ncatlab.org/nlab/show/presheaf).


## Outputting a generalised file

One of the important algorithms in Pijul is the one that outputs generalised files into actual on-disk files.

### Conflicting and non-conflicting files

If the two following conditions are both met:

- the graph is acyclic, and contains a path passing through all the vertices.
- for each vertex v, all incoming edges of v have the same label ("alive" vs. "deleted").

Then the task of presenting the file to the user is relatively straightforward: just go along the path and output all alive vertices. Else, we say the file has a *conflict*, and presenting the state of the file to the user in an intelligible way is not obvious.

The major benefit of this approach, compared to the valid-state-only approach of Git, or to the patches-only approach of Darcs, is that conflicts are just like any normal state. If we remove one side of the conflict (for instance using [unrecord](reference/unrecord.html), the conflict disappears.
If we create a patch that solves a conflict in one context, the same patch will solve the same conflict in any other context. This implies in particular that conflicts do not reappear.

### Zombies

When Alice deletes a block of text, and Bob writes inside that same block in parallel, the merge results in a "deleted context" when Alice tries to apply Bob's patch. On Bob's side, things are a little more complicated, because the situation looks no different than if Alice had been aware of Bob's line when she deleted her lines.

In this case, Pijul detects the conflict on both sides, and reconstructs a minimal context of one line around Bob's line, and produces completely equivalent repositories on each side (which is one of the fundamental promises of patch commutation). Because these context lines will now have both alive and deleted incoming edges, they are called "zombies" in Pijul's source code.

### Solving conflicts

Solving "zombie conflicts" is relatively straightforward, as we just have to produce a patch setting all the labels of incoming edges to a vertex to the same value.

However, in order to solve ordering conflicts, we need to add new edges. For complicated reasons related to algorithmic complexity, Pijul chooses to break such an edge in two halves, adding a new vertex in the middle. This also coincidentally yields a simpler patch format, since patches that just add vertices and change edge labels are enough.

### Cycles

Sometimes, solving a conflict involves ordering lines that were previously unordered. If Alice and Bob have a conflict between two lines A and B, and solve the conflict with opposite orders (i.e. Alice writes AB and Bob BA), this union of their edits will be cyclic, since Alice adds an edge from A to B, and Bob adds an edge from B to A.


## Producing a patch