# Fossil Versus Git

## 1.0 Don't Stress!

The feature sets of Fossil and Git overlap in many ways. Both are distributed version control systems which store a tree of check-in objects to a local repository clone. In both systems, the local clone starts out as a full copy of the remote parent. New content gets added to the local clone and then later optionally pushed up to the remote, and changes to the remote can be pulled down to the local clone at will. Both systems offer diffing, patching, branching, merging, cherry-picking, bisecting, private branches, a stash, etc.

Fossil has inbound and outbound Git conversion features, so if you start out using one DVCS and later decide you like the other better, you can easily move your version-controlled file content.¹

In this document, we set all of that similarity and interoperability aside and focus on the important differences between the two, especially those that impact the user experience.

Keep in mind that you are reading this on a Fossil website, and though we try to be fair, the information here might be biased in favor of Fossil, if only because we spend most of our time using Fossil, not Git. Ask around for second opinions from people who have used both Fossil and Git.

If you want a more practical, less philosophical guide to moving from Git to Fossil, see our Git to Fossil Translation Guide.

## 2.0 Differences Between Fossil And Git

Differences between Fossil and Git are summarized by the following table, with further description in the text that follows.

| GIT | FOSSIL | more |
|-----|--------|------|
| File versioning only | VCS, tickets, wiki, docs, notes, forum, chat, UI, RBAC | 2.1 ↓ |
| A federation of many small programs | One self-contained, stand-alone executable | 2.2 ↓ |
| Custom key/value data store | The most used SQL database in the world | 2.3 ↓ |
| Runs natively on POSIX systems | Runs natively on both POSIX and Windows | 2.4 ↓ |
| Bazaar-style development | Cathedral-style development | 2.5.1 ↓ |
| Designed for Linux kernel development | Designed for SQLite development | 2.5 ↓ |
| Many contributors | Select contributors | 2.5.2 ↓ |
| Focus on individual branches | Focus on the entire tree of changes | 2.5.3 ↓ |
| One check-out per repository | Many check-outs per repository | 2.6 ↓ |
| Remembers what you should have done | Remembers what you actually did | 2.7 ↓ |
| Commit first | Test first | 2.8 ↓ |
| SHA-1 or SHA-2 | SHA-1 and/or SHA-3, in the same repository | 2.9 ↓ |

## 2.1 Featureful

Git provides file versioning services only, whereas Fossil adds an integrated wiki, ticketing and bug tracking, embedded documentation, technical notes, a web forum, and a chat service, all within a single nicely-designed, skinnable web UI, protected by a fine-grained role-based access control system. These additional capabilities are available for Git as 3rd-party add-ons, but with Fossil they are integrated into the design, to the point that it approximates "GitHub-in-a-box."

Even if you only want straight version control, Fossil has affordances not available in Git.

For instance, Fossil can do operations over all local repo clones and check-out directories with a single command. You can say "fossil all sync" on a laptop prior to taking it off the network hosting those repos, such as before going on a trip. It doesn't matter if those repos are private and restricted to your company network or public Internet-hosted repos: you get synced up with everything you need while off-network. You get the same capability with several other Fossil sub-commands as well, such as "fossil all changes" to get a list of files that you forgot to commit prior to the end of your working day, across all repos.
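A minimal sketch of that pre-trip routine at a shell prompt; both sub-commands exist as shown, while the scenario in the comments is hypothetical:

```
# Before going offline: sync every repository this machine knows about.
fossil all sync

# At the end of the working day: list uncommitted edits across all check-outs.
fossil all changes
```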
Whenever Fossil is told to modify the local checkout in some destructive way (fossil rm, fossil update, fossil revert, etc.), Fossil remembers the prior state and is able to return the check-out directory to that state with a fossil undo command. While you cannot undo a commit in Fossil — on purpose! — as long as the change remains confined to the local check-out directory only, Fossil makes undo easier than Git does.

For developers who choose to self-host projects rather than rely on a 3rd-party service such as GitHub, Fossil is much easier to set up: the stand-alone Fossil executable together with a 2-line CGI script suffice to instantiate a full-featured developer website. To accomplish the same using Git requires locating, installing, configuring, integrating, and managing a wide assortment of separate tools. Standing up a developer website using Fossil can be done in minutes, whereas doing the same using Git requires hours or days.
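For a sense of what "2-line CGI script" means, here is a sketch of the entire server-side glue, following the form in Fossil's server documentation; the paths are hypothetical, and the script simply needs to be executable and placed where a CGI-capable web server can run it:

```
#!/usr/bin/fossil
repository: /home/www/myproject.fossil
```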
Fossil is small, complete, and self-contained. If you clone Git's self-hosting repository, you get just Git's source code. If you clone Fossil's self-hosting repository, you get the entire Fossil website — source code, documentation, ticket history, and so forth.² That means you get a copy of this very article and all of its historical versions, plus the same for all of the other public content on this site.

## 2.2 Self Contained

Git is actually a collection of many small tools, each doing one small part of the job, which can be recombined (by experts) to perform powerful operations. Git has a lot of complexity and many dependencies, so most people end up installing it via some kind of package manager, simply because the creation of complicated binary packages is best delegated to people skilled in their creation. Normal Git users are not expected to build Git from source and install it themselves.

Fossil is a single self-contained stand-alone executable which depends only on common platform libraries in its default configuration. To install one of our precompiled binaries, unpack the executable from the archive and put it somewhere in your PATH. To uninstall it, delete the executable.

This policy is particularly useful when running Fossil inside a restrictive container, anything from classic chroot jails to modern OS-level virtualization mechanisms such as Docker. Our stock container image is under 8 MB when uncompressed and running. It contains nothing but a single statically-linked binary.

If you build a dynamically linked binary instead, Fossil's on-disk size drops to around 6 MB, and it's dependent only on widespread platform libraries with stable ABIs such as glibc, zlib, and openssl.

Full static linking is easier on Windows, so our precompiled Windows binaries are just a ZIP archive containing only "fossil.exe". There is no "setup.exe" to run.

Fossil is easy to build from sources. Just run "./configure && make" on POSIX systems and "nmake /f Makefile.msc" on Windows.

Contrast a basic installation of Git, which takes up about 15 MiB on Debian 10 across 230 files, not counting the contents of /usr/share/doc or /usr/share/locale. If you need to deploy to any platform where you cannot count on facilities like the POSIX shell, Perl interpreter, and Tcl/Tk platform needed to fully use Git being part of the base system, the full footprint of a Git installation extends to more like 45 MiB and thousands of files. This complicates several common scenarios: Git for Windows, chrooted Git servers, Docker images...

Some say that Git more closely adheres to the Unix philosophy, summarized as "many small tools, loosely joined," but we have many examples of other successful Unix software that violates that principle to good effect, from Apache to Python to ZFS. We can infer from this that it is not an absolute principle of good software design. Sometimes "many features, tightly-coupled" works better. What actually matters is effectiveness and efficiency. We believe Fossil achieves this.

The above size comparisons aren't apples-to-apples anyway. We've compared the size of Fossil with all of its many built-in features to a fairly minimal Git installation. You must add a lot of third-party software to Git to give it a Fossil-equivalent feature set. Consider GitLab, a third-party extension to Git wrapping it in many features, making it roughly Fossil-equivalent, though much more resource hungry and hence more costly to run than the equivalent Fossil setup. The official GitLab Community Edition container currently clocks in at 2.66 GiB!

GitLab's requirements are easy to accept when you're dedicating a local rack server or blade to it, since its minimum requirements are more or less a description of the smallest thing you could call a "server" these days, but when you go to host that in the cloud, you can expect to pay about 8 times as much to comfortably host GitLab as for Fossil.³ This difference is largely due to basic technology choices: Ruby and PostgreSQL vs. C and SQLite.

The Fossil project itself is hosted on a small and inexpensive VPS. A bare-bones $5/month VPS or a spare Raspberry Pi is sufficient to run a full-up project site, complete with tickets, wiki, chat, and forum, in addition to being a code repository.

## 2.3 Query Language

The baseline data structures for Fossil and Git are the same, modulo formatting details. Both systems manage a directed acyclic graph (DAG) of Merkle tree structured check-in objects. Check-ins are identified by a cryptographic hash of the check-in contents, and each check-in refers to its parent via the parent's hash.

The difference is that Git stores its objects as individual files in the .git folder or compressed into bespoke key/value pack-files, whereas Fossil stores its objects in a SQLite database file which provides ACID transactions and a high-level query language. This difference is more than an implementation detail. It has important practical consequences.

One notable consequence is that it is difficult to find the descendants of check-ins in Git. One can easily locate the ancestors of a particular Git check-in by following the pointers embedded in the check-in object, but it is difficult to go the other direction and locate the descendants of a check-in. It is so difficult, in fact, that neither native Git nor GitHub provide this capability short of crawling the commit log. With Fossil, on the other hand, finding descendants is a simple SQL query. It is common in Fossil to ask to see all check-ins since the last release. Git lets you see "what came before". Fossil makes it just as easy to also see "what came after".
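As a sketch of what such a query can look like: Fossil's repository schema includes a plink table recording each parent-to-child check-in link, and the fossil sql command opens the repository database directly. Assuming a hypothetical hash prefix, the immediate children of a check-in can be listed like this:

```
fossil sql "SELECT b.uuid
              FROM plink AS p JOIN blob AS b ON b.rid = p.cid
             WHERE p.pid = (SELECT rid FROM blob
                             WHERE uuid LIKE 'abcd1234%');"
```

Walking the full set of descendants is the same idea applied repeatedly, e.g. with a recursive common table expression.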
Leaf check-ins in Git that lack a "ref" become "detached," making them difficult to locate and subject to garbage collection. This detached head state problem has caused grief for many Git users. With Fossil, detached heads are simply impossible because we can always find our way back into the Merkle tree using one or more of the relations in the SQL database.

The SQL query capabilities of Fossil make it easier to track the changes for one particular file within a project. For example, you can easily find the complete edit history of this one document, or even the same history color-coded by committer. Both are simple SQL queries in Fossil, with procedural code only being used to format the result for display. The same result could be obtained from Git, but because the data is in a key/value store, much more procedural code has to be written to walk the data and compute the result. And since that is a lot more work, the question is seldom asked.

The ease of querying Fossil data using SQL means that status or history information about the project under management is easier to obtain. Being easier means that it is more likely to happen. Fossil reports tend to be more detailed and useful. Compare this Fossil timeline to its closest equivalent in GitHub. Judge for yourself: which of those reports is more useful to a developer trying to understand what happened?

The bottom line is that even though Fossil and Git are built around the same low-level data structure, the use of SQL to query this data makes the data more accessible in Fossil, resulting in more detailed information being available to the user. This improves situational awareness and makes working on the project easier.
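End users don't have to write SQL to benefit from this: the same queries surface through ordinary commands and web pages. Two command-line sketches, where the tag and file names are hypothetical (check your version's online help for the exact timeline filter syntax):

```
# Every check-in that came after the check-in tagged "release"
fossil timeline descendants release

# The complete change history of a single file
fossil finfo www/fossil-v-git.wiki
```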
## 2.4 Portable

Fossil is largely written in ISO C, almost purely conforming to the original 1989 standard. We make very little use of C99, and we do not knowingly make any use of C11. Fossil does call POSIX and Windows APIs where necessary, but it's about as portable as you can ask given that ISO C doesn't define all of the facilities Fossil needs to do its thing. (Network sockets, file locking, etc.) There are certainly well-known platforms Fossil hasn't been ported to yet, but that's most likely due to lack of interest rather than inherent difficulties in doing the port. We believe the most stringent limit on its portability is that it assumes at least a 32-bit CPU and several megs of flat-addressed memory.⁴ Fossil isn't quite as portable as SQLite, but it's close.

Over half of the C code in Fossil is actually an embedded copy of the current version of SQLite. Much of what is Fossil-specific, after you set SQLite itself aside, is SQL code calling into SQLite. The number of lines of SQL code in Fossil isn't large by percentage, but since SQL is such an expressive, declarative language, it makes an outsized contribution to Fossil's user-visible functionality.

Fossil isn't entirely C and SQL code. Its web UI uses JavaScript where necessary. The server-side UI scripting uses a custom minimal Tcl dialect called TH1, which is embedded into Fossil itself. Fossil's build system and test suite are largely based on Tcl.⁵ All of this is quite portable.

About half of Git's code is POSIX C, and about a third is POSIX shell code. This is largely why the so-called "Git for Windows" distributions (both first-party and third-party) are actually an MSYS POSIX portability environment bundled with all of the Git stuff, because it would be too painful to port Git natively to Windows. Git is a foreign citizen on Windows, speaking to it only through a translator.⁶

While Fossil does lean toward POSIX norms when given a choice — LF-only line endings are treated as first-class citizens over CR+LF, for example — the Windows build of Fossil is truly native.

The third-party extensions to Git tend to follow this same pattern. GitLab isn't portable to Windows at all, for example. For that matter, GitLab isn't even officially supported on macOS, the BSDs, or uncommon Linuxes! We have many users who regularly build and run Fossil on all of these systems.

## 2.5 Linux vs. SQLite

Fossil and Git promote different development styles because each one was specifically designed to support the creator's main software development project: Linus Torvalds designed Git to support development of the Linux kernel, and D. Richard Hipp designed Fossil to support the development of SQLite. Both projects must rank high on any objective list of "most important FOSS projects," yet these two projects are almost entirely unlike one another, so it is natural that the DVCSes created to support these projects also differ in many ways.

In the following sections, we will explain how four key differences between the Linux and SQLite software development projects dictated the design of each DVCS's low-friction usage path.

When deciding between these two DVCSes, you should ask yourself, "Is my project more like Linux or more like SQLite?"

### 2.5.1 Development Organization

Eric S. Raymond's seminal essay-turned-book "The Cathedral and the Bazaar" details the two major development organization styles found in FOSS projects. As it happens, Linux and SQLite fall on opposite sides of this dichotomy. Differing development organization styles dictate a different design and low-friction usage path in the tools created to support each project.

Git promotes the Linux kernel's bazaar development style, in which a loosely-associated mass of developers contribute their work through a hierarchy of lieutenants who manage and clean up these contributions for consideration by Linus Torvalds, who has the power to cherry-pick individual contributions into his version of the Linux kernel. Git allows an anonymous developer to rebase and push specific locally-named private branches, so that a Git repo clone often isn't really a clone at all: it may have an arbitrary number of differences relative to the repository it originally cloned from. Git encourages siloed development. Select work in a developer's local repository may remain private indefinitely.

All of this is exactly what one wants when doing bazaar-style development.

Fossil's normal mode of operation differs on every one of these points, with the specific designed-in goal of promoting SQLite's cathedral development model:

- **Personal engagement:** SQLite's developers know each other by name and work together daily on the project.

- **Trust over hierarchy:** SQLite's developers check changes into their local repository, and these are immediately and automatically synchronized up to the central repository; there is no "dictator and lieutenants" hierarchy as with Linux kernel contributions. D. Richard Hipp rarely overrides decisions made by those he has trusted with commit access on his repositories. Fossil allows you to give some users more power over what they can do with the repository, but Fossil only loosely supports the enforcement of a development organization's social and power hierarchies. Fossil is a great fit for flat organizations.
- **No easy drive-by contributions:** Git pull requests offer a low-friction path to accepting drive-by contributions. Fossil's closest equivalents are its unique bundle and patch features, which require higher engagement than firing off a PR.⁷ This difference comes directly from the initial designed purpose for each tool: the SQLite project doesn't accept outside contributions from previously-unknown developers, but the Linux kernel does.

- **No rebasing:** When your local repo clone syncs changes up to its parent, those changes are sent exactly as they were committed locally. There is no rebasing mechanism in Fossil, on purpose.

- **Sync over push:** Explicit pushes are uncommon in Fossil-based projects: the default is to rely on autosync mode instead, in which each commit syncs immediately to its parent repository. Autosync is a mode, so you can turn it off temporarily when needed, such as when working offline; see the sketch after this list. Fossil is still a truly distributed version control system; it's just that its starting default is to assume you're rarely out of communication with the parent repo.

  This is not merely a reflection of modern always-connected computing environments. It is a conscious decision in direct support of SQLite's cathedral development model: we don't want developers going dark, then showing up weeks later with a massive bolus of changes for us to integrate all at once. Jim McCarthy put it well in his book on software project management, *Dynamics of Software Development*: "Beware of a guy in a room."

- **Branch names sync:** Unlike in Git, branch names in Fossil are not purely local labels. They sync along with everything else, so everyone sees the same set of branch names. Fossil's design choice here is a direct reflection of the Linux vs. SQLite project outlook: SQLite's developers collaborate closely on a single coherent project, whereas Linux's developers go off on tangents and occasionally send selected change sets to each other.

- **Private branches are rare:** Private branches exist in Fossil, but they're normally used to handle rare exception cases, whereas in many Git projects, they're part of the straight-line development process.

- **Identical clones:** Fossil's autosync system tries to keep each local clone identical to the repository it cloned from.
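As promised in the "Sync over push" item above, a sketch of toggling autosync around a stretch of offline work; the settings sub-command is real, while the scenario in the comments is hypothetical:

```
fossil settings autosync off   # e.g. before boarding a flight
# ...work and commit locally; nothing is pushed...
fossil settings autosync on    # back online: new commits sync again
fossil sync                    # push the accumulated backlog up now
```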
Where Git encourages siloed development, Fossil fights against it. Fossil places a lot of emphasis on synchronizing everyone's work and on reporting on the state of the project and the work of its developers, so that everyone — especially the project leader — can maintain a better mental picture of what is happening, leading to better situational awareness.

By contrast, "…forking is at the core of social coding at GitHub". As of January 2022, GitHub hosts 47 million distinct software projects, most of which were created by forking a previously-existing project. Since this is roughly twice the number of developers in the world, it beggars belief that most of these forks are still under active development. The vast bulk of these must be abandoned one-off efforts. This is part of the nature of bazaar-style development.

You can think about this difference in terms of feedback loop size, which we know from the mathematics of control theory to directly affect the speed at which any system can safely make changes. The larger the feedback loop, the slower the whole system must run in order to avoid loss of control. The same concept shows up in other contexts, such as in the OODA loop concept. Committing your changes to private branches in order to delay a public push to the parent repo increases the size of your collaborators' control loops, either causing them to slow their work in order to safely react to your work, or to over-correct in response to each change.

Each DVCS can be used in the opposite style, but doing so works against their low-friction paths.

### 2.5.2 Scale

The Linux kernel has a far bigger developer community than that of SQLite: there are thousands and thousands of contributors to Linux, most of whom do not know each other's names. These thousands are responsible for producing roughly 89× more code than is in SQLite. (10.7 MLOC vs. 0.12 MLOC according to SLOCCount.) The Linux kernel and its development process were already uncommonly large back in 2005 when Git was designed, specifically to support the consequences of having such a large set of developers working on such a large code base.

95% of the code in SQLite comes from just four programmers, and 64% of it is from the lead developer alone. The SQLite developers know each other well and interact daily. Fossil was designed for this development model.

When choosing your DVCS, we think you should ask yourself whether the scale of your software configuration management problems is closer to those Linus Torvalds designed Git to cope with or whether your work's scale is closer to that of SQLite, for which D. Richard Hipp designed Fossil. An automotive air impact wrench running at 8000 RPM driving an M8 socket-cap bolt at 16 cm/s is not the best way to hang a picture on the living room wall.

Fossil works well for projects several times the size of SQLite, such as Tcl, with a repository over twice the size and with many more core committers.

### 2.5.3 Individual Branches vs. The Entire Change History

Both Fossil and Git store history as a directed acyclic graph (DAG) of changes, but Git tends to focus more on individual branches of the DAG, whereas Fossil puts more emphasis on the entire DAG.

For example, the default behavior in Git is to only synchronize a single branch, whereas with Fossil the only sync option is to sync the entire DAG; see the sketch at the end of this section. Git commands, GitHub, and GitLab tend to show only a single branch at a time, whereas Fossil usually shows all parallel branches at once. Git has commands like "rebase" that help keep all relevant changes on a single branch, whereas Fossil encourages a style of many concurrent branches constantly springing into existence, undergoing active development in parallel for a few days or weeks, then merging back into the main line and disappearing.

This difference in emphasis arises from the different purposes of the two systems. Git focuses on individual branches, because that is exactly what you want for a highly-distributed bazaar-style project such as Linux. Linus Torvalds does not want to see every check-in by every contributor to Linux: such extreme visibility does not scale well. Contrast Fossil, which was written for the cathedral-style SQLite project and its handful of active committers. Seeing all changes on all branches all at once helps keep the whole team up-to-date with what everybody else is doing, resulting in a more tightly focused and cohesive implementation.
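The difference in sync scope shows up right at the command line. A sketch, with hypothetical remote and branch names:

```
# Git: fetches only the named branch (or whatever refspec is configured)
git fetch origin master

# Fossil: there is no narrower granularity; sync always moves the whole DAG
fossil sync
```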
## 2.6 One vs. Many Check-outs per Repository

Because Git commingles the repository data with the initial checkout of that repository, the default mode of operation in Git is to stick to that single work/repo tree, even when that's a shortsighted way of working.

Fossil doesn't work that way. A Fossil repository is a SQLite database file which is normally stored outside the working checkout directory. You can open a Fossil repository any number of times into any number of working directories. A common usage pattern is to have one working directory per active working branch, so that switching branches is done with a cd command rather than by checking out the branches successively in a single working directory.

Fossil does allow you to switch branches within a working checkout directory, and this is also often done. It is simply that there is no inherent penalty to either choice in Fossil as there is in Git. The standard advice is to use a switch-in-place workflow in Fossil when the disturbance from switching branches is small, and to use multiple checkouts when you have long-lived working branches that are different enough that switching in place is disruptive.

While you can use Git in the Fossil style, Git's default tie between working directory and repository means the standard method for working with a Git repo is to have one working directory only. Most Git tutorials teach this style, so it is how most people learn to use Git. Because relatively few people use Git with multiple working directories per repository, there are several known problems with that way of working, problems which don't happen in Fossil because of the clear separation between a Fossil repository and each working directory.

This distinction matters because switching branches inside a single working directory loses local context on each switch.

For instance, in any software project where the runnable program must be built from source files, you invalidate build objects on each switch, artificially increasing the time required to switch versions. Most obviously, this affects software written in statically-compiled programming languages such as C, Java, and Haskell, but it can even affect programs written in dynamic languages like JavaScript. A typical SPA build process involves several passes: Browserify to convert Node packages so they'll run in a web browser, SASS to CSS translation, transpilation of TypeScript to JavaScript, uglification, etc. Once all that processing work is done for a given input file in a given working directory, why re-do that work just to switch versions? If most of the files that differ between versions don't change very often, you can save substantial time by switching branches with cd rather than swapping versions in-place within a working checkout directory.

For another example, you might have an active long-running test grinding away in a working directory, then get a call from a customer requiring that you switch to a stable branch to answer questions in terms of the version that customer is running. You don't want to stop the test in order to switch your lone working directory to the stable branch.

Disk space is cheap. Having several working directories — each with its own local state — makes switching versions cheap and fast.

Plus, cd is faster to type than git checkout or fossil update.
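A sketch of the one-directory-per-branch pattern; the URL, paths, and branch names are hypothetical:

```
fossil clone https://example.org/myproject myproject.fossil

mkdir trunk release
(cd trunk   && fossil open ../myproject.fossil trunk)
(cd release && fossil open ../myproject.fossil release-1.0)

# From here on, "switching branches" is just changing directories:
cd trunk
```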
## 2.7 What you should have done vs. What you actually did

Git puts a lot of emphasis on maintaining a "clean" check-in history. Extraneous and experimental branches by individual developers often never make it into the main repository. Branches may be rebased before being pushed to make it appear as if development had been linear, or "squashed" to make it appear that multiple commits were made as a single commit. There are other history rewriting mechanisms in Git as well. Git strives to record what the development of a project should have looked like had there been no mistakes.

Fossil, in contrast, puts more emphasis on recording exactly what happened, including all of the messy errors, dead-ends, experimental branches, and so forth. One might argue that this makes the history of a Fossil project "messy," but another point of view is that this makes the history "accurate." In actual practice, the superior reporting tools available in Fossil mean that this incidental mess is not a factor.

Like Git, Fossil has an amend command for modifying prior commits, but unlike in Git, this works not by replacing data in the repository, but by adding a correction record to the repository that affects how later Fossil operations present the corrected data. The old information is still there in the repository; it is just overridden from the amendment point forward.
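A sketch of that correction-record mechanism in action, with a hypothetical hash and comment:

```
# Rewrites nothing: this adds a control artifact that supersedes
# the old check-in comment from this point forward.
fossil amend abcd1234 --comment "Fix off-by-one in the pager"
```

The superseded comment remains in the repository and can still be retrieved.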
Fossil lacks almost every other history rewriting mechanism listed on the Git documentation page linked above. There is no rebase in Fossil, on purpose, thus no way to reorder or copy commits around in the commit hash tree. There is no commit squashing, dropping, or interactive patch-based cherry-picking of commit elements in Fossil. There is nothing like Git's filter-branch in Fossil.

The lone exception is deleting commits. Fossil has two methods for doing that, both of which have stringent limitations, on purpose.

The first is shunning. See that document for details, but briefly, you only get mandatory compliance for shun requests within a single repository. Shun requests do not propagate automatically between repository clones. A Fossil repository administrator can cooperatively pull another repo's shun requests across a sync boundary, so that two admins can get together and agree to shun certain committed artifacts, but a person cannot force their local shun requests into another repo without having admin-level control over the receiving repo as well. Fossil's shun feature isn't for fixing up everyday bad commits; it's for dealing with extreme situations: public commits of secret material, ticket/wiki/forum spam, law enforcement takedown demands, etc.

There is also the experimental purge command, which differs from shunning in ways that aren't especially important in the context of this document. At a 30,000-foot level, you can think of purging as useful only when you've turned off Fossil's autosync feature and want to pluck artifacts out of its hash tree before they get pushed. In that sense, it's approximately the same as the "drop" action in git rebase -i. However, given that Fossil defaults to having autosync enabled for good reason, the purge command isn't very useful in practice: once a commit has been pushed into another repo, shunning is more useful if you need to delete it from history.

If these accommodations strike you as incoherent with respect to Fossil's philosophy of durable, unchanging commits, realize that if shunning and purging were removed from Fossil, you could still remove artifacts from the repository with SQL DELETE statements; the repository database file is, after all, directly modifiable, being writable by your user. Where the Fossil philosophy really takes hold is in making it difficult to violate the integrity of the hash tree. It's somewhat tangential, but the document "Is Fossil a Blockchain?" touches on this and related topics.

One commentator characterized Git as recording history according to the victors, whereas Fossil records history as it actually happened.

## 2.8 Test Before Commit

One of the things that falls out of Git's default separation of commit from push is that there are several Git sub-commands that jump straight to the commit step before a change could possibly be tested. Fossil, by contrast, makes the equivalent change to the local working check-out only, requiring a separate check-in step to commit the change. This design difference falls naturally out of Fossil's default-enabled autosync feature and its philosophy of not offering history rewriting features.

The prime example in Git is rebasing: the change happens to the local repository immediately if successful, even though you haven't tested the change yet. It's possible to argue for such a design in a tool like Git, since it lacks an autosync feature: you can still test the change before pushing local changes to the parent repo, but in the meantime you've made a durable change to your local Git repository. You must do something drastic like git reset --hard to revert that rebase or rewrite history before pushing it if the rebase causes a problem. If you push your rebased local repo up to the parent without testing first, you cannot fix it without violating the golden rule of rebasing.

Lesser examples are the Git merge, cherry-pick, and revert commands, all of which apply work from one branch onto another, and all of which commit their change to the local repository immediately without giving you an opportunity to test the change first, unless you give the --no-commit option. Otherwise, you're back in the same boat: reset the local repository or rewrite history to fix things, then maybe retry.

Fossil cannot sensibly work that way because of its default-enabled autosync feature and its purposeful paucity of commands for modifying commits, as discussed in the prior section.

Instead of jumping straight to the commit step, Fossil applies the proposed merge to the local working directory only, requiring a separate check-in step before the change is committed to the repository. This gives you a chance to test the change first, either manually or by running your software's automatic tests. (Ideally, both!) Thus, Fossil doesn't need rebase, squashing, reset --hard, or other Git commit mutating mechanisms.

Because Fossil requires an explicit commit for a merge, it has the nice side benefit that it makes you give an explicit commit message for each merge, whereas Git writes that commit message itself by default unless you give the optional --edit flag to override it.

We don't look at this difference as a workaround in Fossil for autosync, but instead as a test-first philosophical difference: fossil commit is a commitment. When every commit is pushed to the parent repo by default, it encourages a working style in which every commit is tested first. It encourages thinking before acting. We believe this is an inherently good thing.

Incidentally, this is a good example of Git's messy command design. These three commands:

```
$ git merge HASH
$ git cherry-pick HASH
$ git revert HASH
```

...are all the same command in Fossil:

```
$ fossil merge HASH
$ fossil merge --cherrypick HASH
$ fossil merge --backout HASH
```

If you think about it, they're all the same function: apply work done on one branch to another. All that changes between these commands is how much work gets applied — just one check-in or a whole branch — and the merge direction.
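Putting the test-first loop together as a sketch, where the hash is hypothetical and "make test" stands in for whatever your project's test entry point is:

```
fossil merge --cherrypick abcd1234   # lands in the check-out only
make test                            # exercise the merged result first
fossil commit -m "Cherry-pick abcd1234: fix pager off-by-one"
```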
This is the sort of thing we mean when we point out that Fossil's command interface is simpler than Git's: there are fewer concepts to keep track of in your mental model of Fossil's internal operation.

Fossil's implementation of the feature is also simpler to describe. The brief online help for fossil merge is currently 41 lines long, to which you can add the 600 lines of the branching document. The equivalent documentation in Git is the aggregation of the man pages for the above three commands, which is over 1000 lines, much of it mutually redundant. (e.g. Git's --edit and --no-commit options get described three times, each time differently.) Fossil's documentation is not only more concise, it gives a nice split of brief online help and full online documentation.

## 2.9 Hash Algorithm: SHA-3 vs. SHA-2 vs. SHA-1

Fossil started out using 160-bit SHA-1 hashes to identify check-ins, just as in Git. That changed in early 2017 when news of the SHAttered attack broke, demonstrating that SHA-1 collisions were now practical to create. Two weeks later, the creator of Fossil delivered a new release allowing a clean migration to 256-bit SHA-3 with full backwards compatibility to old SHA-1 based repositories.

In October 2019, after the last of the major binary package repos offering Fossil upgraded to Fossil 2.x, we switched the default hash mode so that the conversion to SHA-3 is fully automatic. This not only solves the SHAttered problem, it should prevent a recurrence of similar problems for the foreseeable future.

Meanwhile, the Git community took until August 2018 to publish their first plan for solving the same problem by moving to SHA-256, a variant of the older SHA-2 algorithm. As of this writing in February 2020, that plan hasn't been implemented, as far as this author is aware, but there is now a competing SHA-256 based plan which requires complete repository conversion from SHA-1 to SHA-256, breaking all public hashes in the repo. One way to characterize such a massive upheaval in Git terms is a whole-project rebase, which violates Git's own golden rule of rebasing.

Regardless of the eventual implementation details, we fully expect Git to move off SHA-1 eventually, and for the changes to take years more to percolate through the community.

Almost three years after Fossil solved this problem, the SHAmbles attack was published, further weakening the case for continuing to use SHA-1.

The practical impact of attacks like SHAttered and SHAmbles on the Git and Fossil Merkle trees isn't clear, but you want to have your repositories moved over to a stronger hash algorithm before someone figures out how to make use of the weaknesses in the old one. Fossil has had this covered for years now, so that the solution is now almost universally deployed.
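In Fossil, the migration mechanics are exposed through a single setting. A sketch, assuming a repository that still contains legacy SHA-1 artifacts; the hash-policy command exists, though the exact policy names should be checked against your Fossil version's online help:

```
fossil hash-policy          # report the current artifact naming policy
fossil hash-policy sha3     # name all new artifacts with SHA-3
```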
(See "How the Download Page Works" for details.) Chat history is deliberately not synced as chat messages are intended to be ephemeral. There may also be some purely static elements of the web site served via D. Richard Hipp's own lightweight web server, althttpd, which is configured as a front end to Fossil running in CGI mode on these sites.That estimate is based on pricing at Digital Ocean in mid-2019: Fossil will run just fine on the smallest instance they offer, at US $5/month, but the closest match to GitLab's minimum requirements among Digital Ocean's offerings currently costs $40/month.This means you can give up waiting for Fossil to be ported to the PDP-11, but we remain hopeful that someone may eventually port it to z/OS."Why is there all this Tcl in and around Fossil?" you may ask. It is because D. Richard Hipp is a long-time Tcl user and contributor. SQLite started out as an embedded database for Tcl specifically. ([Reference]) When he then created Fossil to manage the development of SQLite, it was natural for him to use Tcl-based tools for its scripting, build system, test system, etc. It came full circle in 2011 when the Tcl and Tk projects moved from CVS to Fossil.A minority of the pieces of the Git core software suite are written in other languages, primarily Perl, Python, and Tcl. (e.g. git-send-mail, git-p4, and gitk, respectively.) Although these interpreters are quite portable, they aren't installed by default everywhere, and on some platforms you can't count on them at all. (Not just Windows, but also the BSDs and many other non-Linux platforms.) This expands the dependency footprint of Git considerably. It is why the current Git for Windows distribution is 44.7 MiB but the current fossil.exe zip file for Windows is 2.24 MiB. Fossil is much smaller despite using a roughly similar amount of high-level scripting code because its interpreters are compact and built into Fossil itself.Both Fossil and Git support patch(1) files — unified diff formatted output — for accepting drive-by contributions, but it's a lossy contribution path for both systems. Unlike Git PRs and Fossil bundles, patch files collapse multiple checkins together, they don't include check-in comments, and they cannot encode changes made above the individual file content layer: you lose branching decisions, tag changes, file renames, and more when using patch files. The fossil patch command also solves these problems, but it is because it works like a Fossil bundle, only for uncommitted changes; it doesn't use Larry Wall's patch tool to apply unified diff output to the receiving Fossil checkout.This page was generated in about 0.008s by Fossil 2.25 [390e00134e] 2024-06-25 06:30:55
---

Welcome to another episode. My name is Helen Scott, and today I'm going to be interviewing Anna. Anna is a creative: she uses her communication and storytelling skills, and she's used them to explain Git in a simple way. So, Anna, welcome.

Thanks for having me, Helen. I'm excited.

I'm excited too. I'm just gonna see how many stories of my Git frustration I can weave into this interview. So, Anna has published her first book, "Learning Git." It's on the O'Reilly platform. You published it June last year, Anna?

June last year.

June last year. So, I've got a copy of the book. I've been very efficient and put it back on my bookshelf over here. For the purposes of this interview, let's start right at the beginning. Tell us a little bit more about you and what you enjoy doing, before we get started on the book.

I think what I enjoy doing is taking a complicated topic, or a topic that confuses people, and making it simpler and more approachable, and presenting the information in a way that it's easier to learn for beginners. So that's what I did with Git. And that's what I also do in my day job, because I work as a technical writer. So in my day job, I also take various topics and explain them in a simpler way, and present information in a simple way, so that people can consume it better.

Fantastic. And that's so important. Having a technical writing background myself as well, it's just super important that when a user comes... looks for the documentation, that it is written in such a way that you can actually help them to move forward and get past the problem that's made them go and look at the documentation in the first place, right?

Yes.

Okay. So, I'm going to try not to derail this interview completely with how annoyed I can get at Git sometimes, but I think maybe some of our audience might share some of these frustrations. What were your primary motivators behind writing this book?

Okay. So, the reason I wrote this book is because I needed this book. I'm just going to backtrack a little and say that my entryway into the world of tech was through the world of UX design, so user experience design. At some point in my life, I did a UX design bootcamp, and I worked as a UX designer. And then when I was working as a UX designer, I realized that I knew how to design apps and websites, sort of like an architect. But I didn't know how to actually build those apps and websites. I wasn't like the construction company that gets the blueprints and actually builds them. I got really curious about how these things are built. So I ended up doing a coding bootcamp, which is kind of a three-to-four-month intensive program, learned the basics of web development, and then worked as a front-end developer.

The first time that I got introduced to Git was in that coding bootcamp. But it was one hour where they just kind of told us: Git add, Git commit, Git push, go. Off you go. And obviously, that may have been sort of enough when you were just working on your own projects in the coding bootcamp. But once I got my first job as a junior front-end developer, and I had to work with a team of developers and senior developers, I was terrified of Git. I mean, every time I had to do something that I deemed was complicated, which was almost everything, I would call on the senior developers and ask them to help me. And I was always worried I was about to destroy the repository, and take down the entire platform and the entire website. And this was like a massive ecommerce platform, so that would not have been good. Little did I know that that was not the case, and that that was never gonna happen.
But anyways, at some point during that job, I realized: I want to conquer this fear. I want to learn how to use Git, and I want to understand how it works, so that I can be an independent developer and not have to ask for help all the time. So I started looking for online resources to learn Git. And what I realized was that there weren't really any online resources that were designed for people like me, that were really new to tech, that had transitioned into tech from non-technical backgrounds, and that explained things in a simple way.

Then at some point, this creative idea came to me of how I could teach Git using colors, and storytelling, and visuals. I mean, this was after I'd kind of understood some of the basics. So the first thing that I did was actually make an online course teaching Git. And that's still available online. At the moment, it's on Udemy. Who knows where it will be in the future. But that journey... When I was making the online course, I still wanted to write a book. But I felt that the barrier to entry to write the book was higher than to make an online course. Because with online courses, you just kind of record your voice, make some slides, record them. So I could do that a lot easier and publish it online a lot easier. But once I released that online course, I started getting reviews, I started getting feedback. I realized my way of teaching Git really resonated with a lot of people. There were a lot of people that, just like me, had not been served by the Git teaching resources out there up until now. My approach to organizing information and presenting concepts worked for them. And then I was like, all right, since it works, let's write this book. I am also a writer in my personal time, and I love to journal. So, writing is my medium of choice. That's kind of how this book came about. This is the book that I wish I had had in my first week of that coding bootcamp, or especially in that first week of my new job as a front-end developer.

A hundred percent. 100%. And just in hearing that story, there's so much that resonated with me. People have talked about, regardless of how you ended up in the profession, whether you're coming through a degree, you're coming through a bootcamp, you're self-taught, whatever it is, how much you learn about version control, and, you know, Git is part of that, is really variable. In my experience, sometimes it's purely conceptual. It's like, there is this thing that you can do; you will learn about it on your job. And then you turn up at your job and you're like, "Oh, I'm terrified of destroying the whole repository." So I think we've all been there. And I think we've all had the experience of knowing who, or even being at times, that expert in Git on the team, that people go to when they've gone beyond the Git pull or, you know, Git update, Git push. They've gone beyond the basics, and they're like, "Oh, I'm in trouble now. Something's not working." So I think certainly I identified with a lot of what you said there, and I expect our audience did as well. So much frustration.

The other thing that actually has been very surprising is that it's not just developers that use Git; there are so many other people that work with developers, or that do other jobs, that use Git. And this I've discovered since publishing the book. Game artists. Mechanical engineers also use Git. Sometimes UX designers have to collaborate with devs, and share assets or whatever. Even product managers. And actually one of the biggest audiences my book got was technical writers, because they often have this thing that we call the Docs as Code approach. And they use Git to manage the repositories for the documentation websites.
So some technical writers come from a technical background, but some don't. And so, technical concepts don't come naturally to them. My book has really served various different audiences, including junior developers and experienced developers, but also just so many other professions. Which, yeah, has been very eye-opening.

And again, I identified with that, because it was back in my technical writing career that I started using Git. And I needed that Git expert on the team, that developer, and I was like, "I've got it in reverse. Please help me." So let's move swiftly on to talking about the book itself, which is why the majority of the audience will be here. Can you give us an overview of the book and talk about its structure a little bit more, please?

Definitely. So, the book is for absolute, complete beginners. If you have a bit of experience, you can maybe try to skip a chapter. Normally, you should do it from the beginning to the end, though, because it's a hands-on learning experience, where you're working on a specific repository throughout the entire book, which is actually called the rainbow repository, because you're listing the colors of the rainbow. I'll explain that a little bit more later. But the first chapter actually just starts off with installing Git and an introduction to the command line, because some people actually haven't worked in the command line and aren't familiar with it. So, it really is a book for absolute beginners.

Then I build this thing that I call the Git diagram, which is my way of creating a mental model of the different areas of Git and how they fit together. So, you have your project directory, the working directory, the local repository. Then inside the local repository, you have the staging area and the commit history. When I was learning Git, I came across diagrams that tried to kind of depict these different areas and how they interact. They didn't really make sense to me. So I've actually created my own representation of these areas. This is really key to my teaching methodology, because the main part of my teaching methodology is creating a mental model of how things work, and making things tangible.

So, we build the Git diagram. Once we have that, we go over the process of making commits. We introduce branches and what they are. And we create visualizations for all of these things. So I visualize what branches look like. They're just pointers to commits. Then what else? I have a list here. I go over merging, and I introduce the two types of merges: fast-forward merges and three-way merges. We're at chapter five now. And there we just go over the experience of doing a fast-forward merge, not yet a three-way merge. That's a bit more spicy, and it comes later on in the learning journey.

And then chapter six is when we actually go from working just in the local repository, just on your computer, on your own. We introduce hosting services, so GitHub, GitLab, Bitbucket, and remote repositories, basically. One thing I should mention right now, just a quick parenthesis, is that the book is not prescriptive. You can use whichever hosting service you want. You can use whichever text editor you want. I didn't want to exclude anyone that uses maybe a different technology, and I wanted to make the book accessible to, yeah, anything. So yeah, that's closing parenthesis now.

Moving on: chapter seven, we jump into creating and pushing to a remote repository. So we're really into, you know, local repository, remote repository. Chapter eight, we go over cloning and fetching data.
So, chapter eight is where, in the learning experience, we simulate that you're working with another person. In the book, we say that a friend of yours wants to join you on your rainbow project. So they clone your remote repository, and they create a local repository called FriendRainbow. And I mean, if you have a friend that you can actually do all the exercises with, then that's ideal. But the most realistic thing is that you just create a second local repository on your computer, and you just kind of pretend it's on someone else's computer. But this is really important, because at the end of the day, Git is a collaboration tool, right? It's version control and collaboration. So, if you don't have any representation of how the collaboration happens, then that leaves out basically more than half of Git. So in chapter eight, you're learning how to clone, how to fetch data from a remote repository.

And chapter nine, finally, we get into the spicy topic of... Well, not that spicy. Three-way merges are pretty simple. But then chapter 10, we get into merge conflicts, which is the spicier topic, and the thing that a lot of people are afraid of. Chapter 11: rebasing. Rebasing, I'd say, is the most advanced thing that I cover in my book. Like I said, this book is a real basics and beginners book. So, rebasing is, yeah, at the end. And finally, the last chapter is pull requests, or merge requests, whatever terminology you want to use. And obviously, pull requests, merge requests, they're not actually a feature of Git itself. They're a feature of the hosting services: GitHub, GitLab, Bitbucket, and others. There are others; I'm just not going to start naming 20 different hosting services. But I thought that they were so important because they really... Yeah, they're essential, almost, for everyone's workflow. So I thought, okay, I'll make an exception and include them, even though this is a book about Git.

That's kind of an overview. And like I mentioned, you're working on this rainbow project, and it's hands-on. You are with your computer, doing the exercises. So you are supposed to do the book from chapter 1 to chapter 12, because if you don't, then you'll miss out; you won't be able to follow along. But I have created an appendix, where I've offered instructions on how someone can create the minimum setup for their project to start off on any chapter. Because, yeah, maybe you read the book once, and then you just want to review chapter eight, or you just want to review chapter nine. And you don't have to go from chapter one all the way to that chapter just to be able to review it. But yeah, that's the kind of overview of the book.

Brilliant. So for my next question, I'm going to use the book as my demo, so the audience can see what I'm talking about when I ask this next question. For example, you've made extensive use of color in this book, and you've mentioned the rainbow project repeatedly. What made you choose that theme? And why?

When my creative idea of how I could teach Git in a simple way came to me, it was all about using color. Because I thought to myself, one of the really confusing or difficult things with Git, when you're teaching it, is that there are commit hashes. So every commit... A commit is a version of your project. Every commit has a commit hash: 40 characters, letters and numbers, that's unique. And it's like a name for the commit. But if you're teaching Git, and you're having to refer to, well, "remember commit six-nine-Z-blah-blah-blah," that is so confusing. Who wants to learn like that? So I thought to myself, how can I use color instead?
And so, let me give an example. In the rainbow project, the very first thing you add to your project... you create a rainbow.txt file. So a txt file, very simple. I keep everything really simple. And I'll make a comment about that in a second. And the first thing you add is just... "Red is the first color of the rainbow." You just add that sentence, first line of your file, and you add that to the staging area, and you make a commit. And then I represent that commit in the diagrams as a circle that is red. And so from then on, I can just say "the red commit". And that just simplifies the learning experience. It makes it a lot more memorable, and also very much more visual, because I'm not having to include, like, a little commit hash in my diagram to try to refer to the commit. That's why I use color in my teaching methodology.

And the rainbow was just a really nice way to structure it. You know, we all... well, many of us, or most of us, know the order of the colors of the rainbow. So it's a very familiar thing. It was easy to then, yeah, structure the whole book that way. Although at the end, I ran out of colors. I just had to add some random colors. At the end, you add another file to your project called othercolors.txt, and you start adding, like, "Pink is not a color in the rainbow." And "Gray is not a color in the rainbow." Because I literally ran out of colors. But also because I wanted to show, you know, how you add new files to your project.

But the other thing I wanted to say about keeping things simple is that one of the decisions I made with this book is that it would have no code in it. So, the files in your project are just .txt files, which are just really plain text files. They're not even Markdown files. Like, it is so simple. Because I thought, if I make the project that you work on in the book a web development project, or a data science project, a Java project, a Python project, anything, it will exclude some people for whom that is not familiar. And let's say, yeah, fine, well, those people can go and look it up. It just complicates things. It's not necessary. I wanted someone to just be able to focus on learning the concepts of Git, rather than having to also learn other tech concepts which are not relevant to just learning Git. So, yeah, that was my way of approaching how to teach this stuff.

That's great. And I think what's really helpful and insightful is the very deliberate decisions that you made along the way. You know, this didn't just happen by accident: you made a very deliberate choice that "I'm going to represent hashes with blobs of color, and therefore I'm going to refer to those." And you made a very deliberate choice to use txt files, a concept that is going to be familiar to your entire audience. So I really liked that you made those conscious decisions upfront to create the learning journey that, you know, you said yourself you wanted when you first started working with Git.

One more I can add is screenshots. I decided: not a single screenshot in my book. Because I thought to myself, the minute I add a screenshot, the book is out of date. And since I was able to do without...

So true. Helen knows this because she wrote a book with lots of screenshots. But you have to have screenshots in yours. Sorry, off topic.

I think that was another really, really conscious decision of mine: since it's not necessary, don't include screenshots. Because again, they're not relevant to everyone. Everyone has a different operating system, and a different version, and UI changes. And the minute that the book goes to print, it would be out of date.
So, that was another really conscious decision I made for this book.

Good call out. Good call out. And yes, the pain is real. Just ask my co-author, Trisha Gee, who updated them all. Okay, so the next question is kind of a double question. You can answer it in any order you like. But it's: who should read this book? And equally as importantly, who is this book not for?

So who should read this book? I think anyone that wants to learn Git. So they've never used Git, and someone's told them they have to, or they realize they have to for their work or for their personal project. So anyone that wants to learn Git. Anyone that's confused by Git. I have talked to developers with 10 years' experience that still are afraid of Git and don't have a mental model that they can reliably, like, use and feel confident with. And they even tell me, this book helped me to put things together. So, yeah, junior developers, anyone that doesn't yet really understand how the pieces come together. Because what happens is, when you don't have a good mental model of Git, once you start... Like, maybe you're okay with doing the git add, git commit, git push. But once you start going to the more advanced topics, you don't really understand how they work. And that's where you start getting really confused and scared. And it all becomes a bit challenging. So that's who I think could benefit from this book. Also, I would say anyone that's a visual learner, and anyone that's kind of more like a tactile, like, wants to see... kind of make things tangible type of learner. I do have to say, you know, this book isn't for everyone. Not everyone's a visual learner. And I totally appreciate that. And that means that this book will appeal to a certain type of learner, but not to another.

Now, let's get to the topic of who this book is not for. It's not for anyone that uses Git, and really has their own mental model of how it works, and it works for them. And they never really struggle with understanding things. I mean, they don't need it. It's not for anyone that's looking for an advanced guide to Git, you know, that goes over advanced features of Git, more niche commands. It's not for them. That stuff is not in the book, so they'll not get anything from it. It's also not for anyone that's looking for a resource that will teach them what their Git workflow should be, or what best practices should be. In the book, I don't teach you, like, oh, well, the best branching workflow is this, and this, and this. Or, yeah, I don't know, this is how you should set things up. Like I said, the book is not at all prescriptive. And actually, to be honest, the rainbow project is not a really realistic project of a software project. I mean, you're listing the colors of the rainbow. That's not really what you're usually doing when you're building software. So for a lot of developers, for example, the example in the book is not so realistic. It's more about building that mental model. Although I do have a second example in the book, which is called the example book project, which is a little bit more realistic, because it uses storytelling to provide a bit more of a realistic use of Git. But again, it's not prescriptive. And the other thing is, anyone looking for other best practices, like, how should I be naming my branches? What should I be including in a commit? Let me think, what else? So those kinds of things, I don't provide any guidance on that. Because, like I said, I focus on teaching a mental model. And those things are really up to... they're really kind of dependent on your opinion, on which company you work in, which sector you work in, what your background is, what you personally like in branch names, and in, yeah... or commit messages.
So, it was not something that I wanted to confuse people with and clutter up the book with. I think there's plenty of other resources in the world that provide guidance on that. And the final thing is that this book is not a reference. So it's really kind of a learning journey. It's a hands-on learning journey. But it's not the kind of book that you would be like, oh, you know, each chapter... I don't know, it's not a reference guide. So, to be honest, the Git website is the best reference. I mean, it is a reference. They have a reference. So, git-scm.com, you've got a reference there. And other people have built a reference. So yeah, I think that's kind of who the book is for and who the book isn't for.

Okay. So, we've mentioned previously that the book is designed to be a sequential learning experience. Start at the beginning, progress, especially if you're new to Git. But there's going to be people out there that will definitely have the question of: how much Git experience do I need if I'm going to buy this book? And what's the answer to that one?

Zero. That's an easy answer. I could just, like... zero. I have nothing else to say. No, zero, really. Like I mentioned, in the very first chapter we go over the process of downloading Git, and kind of setting it up, setting up like the basics. And I even introduce the command line. Like, I tell you, this is the app that is the command line, open the command line. This is the command prompt. This is where you enter commands. You write a command and you press enter. And I introduce some, like, very basic commands like cd, change directory, or ls, like, list the files, the visible files. I introduce the concept of visible files and hidden files. I introduce, like, I don't know, mkdir, or make directory. Just a couple of super basic commands. So, yeah, and we go from the very start. Like, chapter two is, you know, introducing that Git diagram. And so, zero, zero, zero. I user tested. We'll get into that later. But I literally gave this book to my brother, my dad, people that are... well, at least my brother, not at all in the tech space. Or at least, you know, not developers of any kind, at the moment, at least. So, yeah, it's zero.

Brilliant. We're gonna get to that now. So people who get value from this book: no Git experience is necessary, none whatsoever. Road tested with your dad and brother, amongst other people. And it tells a sequential journey. Anybody who is looking to understand the mental model of Git, anybody who perhaps has been using Git, but is a little bit less confident around some of the operations. Whether they're the more advanced ones like rebase, or the spicy three-way merge, which is absolutely what I'm always going to call the three-way merge from this point forward in my career, always going to prefix it with spicy. And anybody who just needs to brush up on some of those underlying concepts. Because Git is very... what are the words? You need to build on top of the basics. If you don't understand the basics in Git, then the more advanced stuff, the more advanced commands, tend to be more challenging than perhaps they would be if you had a good grasp of the underlying mental model.

It's true.

Awesome. Okay. So let's stick with advanced topics. Is there anything that you really, really, really wanted to get into the book, but just, you know, you've got to draw the line somewhere? Is there anything that you're like, "Oh, I really wanted to put that in, but I just..." You know, you had a cut off for it.
That's a really good question. And I've been asked this before. But to be honest, I think I am in quite a unique position, that unlike many other authors, who had a lot more in their books and needed to cut down, or just, yeah, the books got way too big, I had a very clear idea of what the basics were, maybe because I'd already made the online course, which already had the basics. And so, I didn't really have that situation. I didn't have anything that I was like, "Oh, but I really wish I could fit this in." The only thing I would say maybe that I was considering squeezing into one of the chapters was stashing. But I mean, it's not like a huge, massive thing that I really, really, you know, was like, "Oh, I can't believe this won't fit in." Because to be honest, you know, the book is still pretty lean. It's very minimalist. So, if I had ultimately decided that this is essential, I would have included it. Actually, in my case, I think the pull request, merge request chapter was actually not even part of my original plan. And I think later on, I realized, no, this is really important, I need to add this on. So actually, my book was, like, too lean. And I was, like...

I think you've been rebasing, I was kind of...

No, rebasing, I think I had from the beginning. But yeah, I was, like, super lean. And I was, like, okay, but maybe I should add a little bit extra. So, I think, I felt that the most generous thing that I could do for my readers, and for my learners, was to be as simple and minimalist as possible, and just give the bare basics. And in that way, just make it less overwhelming. Because tech is so overwhelming. There's just so much information, and so much going on. So if I could just create one learning experience which would not be overwhelming, I was like, yes, I shall do this.

Fantastic. Thank you. So I'm wondering if in our audience today, we might have some budding authors, other people out there who perhaps want to author a book. I know I've co-authored a book. So I absolutely appreciate the effort that goes into the process and then some. Do you have any advice for anybody who might be listening, who's thinking, "Oh, yeah, I know a thing. I'm gonna write a book about that thing."

So anything I'm about to say is just gonna come from my own experience. So you never know, it might not apply to others. And anything I'm gonna say is probably just going to refer to technical book authors, because I have not yet written another kind of book. So I don't want to say that I can speak on that. But for technical book potential authors, I would say one of the things that I really appreciated in my journey was that I sort of wrote this book as if it was an app. Like, I made a prototype and then I user tested that prototype. And then I took all the feedback and I iterated. And then I user tested again. And then I took all the feedback, and I iterated. And the first prototype was ugly. It was very, very ugly. And it was very rough. And it was not at all the finished product. I actually user tested when I just had four chapters ready, because I thought, well, let me check if this is making any sense to anyone. Because if I've done four chapters, and it doesn't make any sense, there's no point in writing the other eight chapters. Or, you know, it's better for me to figure out what they are, what works for my audience, before I write those other eight chapters. So I had up to, like, 30 user testers throughout the two years that I was working on this book. And I think that was invaluable. The experience of having a diverse user testing audience. I mean, it went from the ages of 18 to 60. And from various professions, from all over the world.
It was all remote user testing. Wait, actually no, I also did some in person. But most of it, almost all of it, was remote. So I got people from all over the world. And it really did make my book a lot better. There were a lot of things I needed to change. I think sometimes, yeah, we just want to think that we're, like, this genius that can create the best thing from the outset. And, like, oh, my God, my ideas are so good. But sometimes, our creations need to have that chemical reaction with the audience in order to really become what they need.

Just one funny story, since we're at it. At the beginning, I thought that my book would actually... people would have, like, a pencil case with them and paper, and they would actually draw out the commits. So in the beginning of the preface, I was like, "Oh, you have to buy colored pencils, and you should, like, follow along, and you should draw the commits." And when I did that user testing, nobody did that. And I had the colored pencils with me, I brought them, and nobody did it. So then I was like, okay, well, I guess this isn't something... You know, if somebody does want to do it, they can. But it wasn't something that made sense. I mean, that's just a non-technical example. But there were plenty of other things where, yeah, I got feedback of, this doesn't make sense to me, or you forgot about this. I'm a Mac user. So there were a lot of Windows users that told me, like, "Hey, you need to mention this, you need to mention that, because this doesn't make sense for me, or this doesn't apply to me." Or, you know, you need to simplify this. You know, I'm not aware of this concept. So, yeah, my monologue about user testing and how good it is, is over now. But yeah, I really recommend it.

The importance of user testing. And yeah, colored pencils, or not, in your case. Fantastic. So we've spoken a little bit about your Udemy course, we've spoken a little... well, a lot, about your book. What's next for you? Is there anything else in the pipeline?

That's a good question, Helen. I don't know what's next. I do know that I've started working on a book. I cannot share anything about it. It shall remain a mystery. People can... You know, at the end, we'll talk about links where people can find me. But it may not end up being a technical book. So I'm not sure yet. I'm still exploring... Well, I've started working on something. But until I commit to a creative idea or a creative project, I flirt with a lot of different creative ideas and creative projects. So, I definitely want to keep creating. I love the process of creating something that helps people and that explains things in a simple way. But what that next thing is going to be is still under wraps. So, we'll see. For now, I'm just adjusting to my new technical writer position, and enjoying kind of sharing the journey I had with this book.

Wonderful. A mystery.

Yes, a mystery.

I do like a mystery.

Maybe that's how I get people to follow me. It's kind of self-serving. If I make it mysterious, people have to come and follow me to find out what it is when I'm ready to share it.

I think that is the perfect segue. So, if people want to find out about this mystery in time, or learn more about your book, or your courses, or whatever is coming next for you, Anna, where should they go?

So the first thing I'll say is that I'm the only Anna Skoulikari on this planet. So, just look me up: Anna Skoulikari, two Ns, A-N-N-A. Last name S-K-O-U-L-I-K-A-R-I. Just google me. But other than that, I have a website, annaskoulikari.com. So that is one good place. The platform that I'm currently most active on is LinkedIn.
So you can connect with me or follow me there. And then other than that, you can find my book all over the place: Amazon, and plenty of other places. You can find my online course on Udemy, at the moment. It's difficult to tell people where to find you, because you always think, well, the social media landscape changes all the time. So this might not be relevant in a year or two years from now. And maybe I'll start using something else. So I do have a Twitter. I don't use it much. Oh, I'm sorry, X. Anyways, so I'd say my website and LinkedIn are currently the best places to find out more about me, and what I'm doing. And then other than that, I just want to share with the audience that I do have a link that for the rest of 2024 is still going to be active, which we can leave in the notes, which gives people 30 day free access to the O'Reilly online learning platform. You can read my book, basically, in 30 days. We can leave that in the notes. And you actually have access to all the resources on there. If you want to take a peek at anything else, feel free. And it doesn't automatically subscribe you; you just get those 30 days, and then you can choose whether you want to continue. I think those are the best places.

Perfect. And we will, of course, put all of that information in the show notes, including that link for 30 day access to the learning platform. So I think that just about brings us to the end of this interview. So, Anna, do you have any final words that you'd like to share today?

Oh, that's a good question. Well, since the audience, there's going to be a mix of junior developers, senior developers, and various other people in the tech profession, I'd say: if you yourself maybe understand Git, but you have a friend or anyone you know that is struggling with it, feel free to recommend them to take a peek at the book and see whether it is the right learning journey and learning resource for them. It might be, it might not. But if you do have anyone that is struggling with Git, which many of us do, feel free to kind of just share it with them, in case you think it can serve them.

Fantastic. Fantastic. And I'd second that, especially with the 30 day free access. I mean, that's a win-win. You can check out the book. I am fortunate enough to have this lovely physical copy of it. But, Anna, thank you. Thank you for coming to this interview, for sharing your knowledge and writing this book. I've got a lot of value from it. And I know that our audience will as well. Thanks to the GOTO platform as well. And yeah, thank you to you, the audience, for tuning in and coming on this journey. All the show notes will be available to you. And that's it from us. Thank you very much. Bye.

Thanks, Helen.
# CRDT Survey, Part 3: Algorithmic Techniques

Matthew Weidner | Feb 13th, 2024

Keywords: CRDTs, optimization, state-based CRDTs

This blog post is Part 3 of a series.

- Part 1: Introduction
- Part 2: Semantic Techniques
- Part 3: Algorithmic Techniques
- Part 4: Further Topics

# Algorithmic Techniques

Let's say you have chosen your collaborative app's semantics in the style of Part 2. That is, you've chosen a pure function that inputs an operation history and outputs the intended state of a user who is aware of those operations.

Here is a simple protocol to turn those semantics into an actual collaborative app:

- Each user's state is a literal operation history, i.e., a set of operations, with each operation labeled by a Unique ID (UID).
- When a user performs an operation, they generate a new UID, then add the pair (id, op) to their local operation history.
- To synchronize their states, users share the pairs (id, op) however they like. For example, users could broadcast pairs as soon as they are created, periodically share entire histories peer-to-peer, or run a clever protocol to send a peer only the pairs that it is missing. Recipients always ignore redundant pairs (duplicate UIDs).
- Whenever a user's local operation history is updated - either by a local operation or a remote message - they apply the semantics (pure function) to their new history, to yield the current app-visible state.

Technicalities:

- It is the translated operations that get stored in operation histories and sent over the network. E.g., convert list indices to list CRDT positions before storing & sending.
- The history should also include causal ordering metadata - the arrows in Part 2's operation histories. When sharing an operation, also share its incoming arrows, i.e., the UIDs of its immediate causal predecessors.
- Optionally enforce causal order delivery, by waiting to add a received operation to your local operation history until after you have added all of its immediate causal predecessors.

This post describes CRDT algorithmic techniques that help you implement more efficient versions of the simple protocol above. We start with some Prerequisites that are also useful in general distributed systems. Then Sync Strategies describes the traditional "types" of CRDTs - op-based, state-based, and others - and how they relate to the simple protocol.

The remaining sections describe specific algorithms. These algorithms are largely independent of each other, so you can skip to whatever interests you. Misc Techniques fills in some gaps from Part 2, e.g., how to generate logical timestamps. Optimized CRDTs describes nontrivial optimized algorithms, including classic state-based CRDTs.

# Table of Contents

- Prerequisites: Replicas and Replica IDs • Unique IDs: Dots • Tracking Operations: Vector Clocks 1
- Sync Strategies: Op-Based CRDTs • State-Based CRDTs • Other Sync Strategies
- Misc Techniques: LWW: Lamport Timestamps • LWW: Hybrid Logical Clocks • Querying the Causal Order: Vector Clocks 2
- Optimized CRDTs: List CRDTs • Formatting Marks (Rich Text) • State-Based Counter • Delta-State Based Counter • State-Based Unique Set • Delta-State Based Unique Set

# Prerequisites

# Replicas and Replica IDs

A replica is a single copy of a collaborative app's state, in a single thread on a single device. For web-based apps, there is usually one replica per browser tab; when the user (re)loads a tab, a new replica is created.

You can also call a replica a client, session, actor, etc. However, a replica is not synonymous with a device or a user. Indeed, a user can have multiple devices, and a device can have multiple independently-updating replicas - for example, a user may open the same collaborative document in multiple browser tabs.
In previous posts, I often said "user" out of laziness - e.g., "two users concurrently do X and Y". But technically, I always meant "replica" in the above sense. Indeed, a single user might perform concurrent operations across different devices.

The importance of a replica is that everything inside a replica happens in a sequential order, without any concurrency between its own operations. This is the fundamental principle behind the next two techniques.

It is usually convenient to assign each replica a unique replica ID (client ID, session ID, actor ID), by generating a random string when the replica is created. The replica ID must be unique among all replicas of the same collaborative state, including replicas created concurrently, which is why they are usually random instead of "the highest replica ID so far plus 1". Random UUIDs (v4) are a safe choice. You can potentially use fewer random bits (shorter replica IDs) if you are willing to tolerate a higher chance of accidental non-uniqueness (cf. the birthday problem).

For reference, a UUID v4 is 122 random bits, a Collabs replicaID is 60 random bits (10 base64 chars), and a Yjs clientID is 32 random bits (a uint32).

Avoid the temptation to reuse a replica ID across replicas on the same device, e.g., by storing it in window.localStorage. That can cause problems if the user opens multiple tabs, or if there is a crash failure and the old replica did not record all of its actions to disk.

In Collabs: ReplicaIDs

# Unique IDs: Dots

Recall from Part 2 that to refer to a piece of content, you should assign it an immutable Unique ID (UID). UUIDs work, but they are long (32 chars) and don't compress well.

Instead, you can use dot IDs: pairs of the form (replicaID, counter), where counter is a local variable that is incremented each time. So a replica with ID "n48BHnsi" uses the dot IDs ("n48BHnsi", 1), ("n48BHnsi", 2), ("n48BHnsi", 3), …

In pseudocode:

```
// Local replica ID.
const replicaID = <sufficiently long random string>;
// Local counter value (specific to this replica). Integer.
let counter = 0;

function newUID() {
  counter++;
  return (replicaID, counter);
}
```

The advantage of dot IDs is that they compress well together, either using plain GZIP or a dot-aware encoding. For example, a vector clock (below) represents a sequence of dots ("n48BHnsi", 1), ("n48BHnsi", 2), ..., ("n48BHnsi", 17) as the single map entry { "n48BHnsi": 17 }.

You have some flexibility for how you assign the counter values. For example, you can use a logical clock value instead of a counter, so that your UIDs are also logical timestamps for LWW. Or, you can use a separate counter for each component CRDT in a composed construction, instead of one for the whole replica. The important thing is that you never reuse a UID in the same context, where the two uses could be confused.

Example: An append-only log could choose to use its own counter when assigning events' dot IDs. That way, it can store its state as a Map<ReplicaID, T[]>, mapping each replica ID to an array of that replica's events indexed by (counter - 1). If you instead used a counter shared by all component CRDTs, then the arrays could have gaps, making a Map<ReplicaID, Map<number, T>> preferable.
In previous blog posts, I called these "causal dots", but I cannot find that name used elsewhere; instead, CRDT papers just use "dot".

Collabs: Each transaction has an implicit dot ID (senderID, senderCounter).

Refs: Preguiça et al. 2010

# Tracking Operations: Vector Clocks 1

Vector clocks are a theoretical technique with multiple uses. In this section, I'll focus on the simplest one: tracking a set of operations. (Vector Clocks 2 is later.)

Suppose a replica is aware of the following operations - i.e., this is its current local view of the operation history:

Figure: An operation history with labels ("A84nxi", 1), ("A84nxi", 2), ("A84nxi", 3), ("A84nxi", 4), ("bu2nVP", 1), ("bu2nVP", 2).

I've labeled each operation with a dot ID, like ("A84nxi", 2).

In the future, the replica might want to know which operations it is already aware of. For example:

- When the replica receives a new operation in a network message, it will look up whether it has already received that operation, and if so, ignore it.
- When a replica syncs with another collaborator (or a storage server), it might first send a description of the operations it's already aware of, so that the collaborator can skip sending redundant operations.

One way to track the operation history is to store the operations' unique IDs as a set: { ("A84nxi", 1), ("A84nxi", 2), ("A84nxi", 3), ("A84nxi", 4), ("bu2nVP", 1), ("bu2nVP", 2) }. But it is cheaper to store the "compressed" representation

```
{
  "A84nxi": 4,
  "bu2nVP": 2
}
```

This representation is called a vector clock. Formally, a vector clock is a Map<ReplicaID, number> that sends a replica ID to the maximum counter value received from that replica, where each replica assigns counters to its own operations in order starting at 1 (like a dot ID). Missing replica IDs implicitly map to 0: we haven't received any operations from those replicas. The above example shows that a vector clock efficiently summarizes a set of operation IDs.

The previous paragraph implicitly assumes that you process operations from each other replica in order: first ("A84nxi", 1), then ("A84nxi", 2), etc. That always holds when you enforce causal-order delivery. If you don't, then a replica's counter values might have gaps (1, 2, 4, 7, ...); you can still encode those efficiently, using a run-length encoding, or a vector clock plus a set of "extra" dot IDs (known as a dotted vector clock).

Like dot IDs, vector clocks are flexible. For example, instead of a per-replica counter, you could store the most recent logical timestamp received from each replica. That is a reasonable choice if each operation already contains a logical timestamp for LWW.

Collabs: CausalMessageBuffer

Refs: Baquero and Preguiça 2016; Wikipedia
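As a concrete illustration, here is a minimal TypeScript sketch of this bookkeeping. It is illustrative only: the class and method names are my own, not from any particular library, and it assumes causal-order delivery, so that counters from each replica arrive in order.

```ts
type ReplicaID = string;
type Dot = { replicaID: ReplicaID; counter: number };

// Tracks which operations we have already received.
class OperationTracker {
  private readonly entries = new Map<ReplicaID, number>();

  // Max counter received from r (0 if none).
  get(r: ReplicaID): number {
    return this.entries.get(r) ?? 0;
  }

  // Has the operation with this dot ID already been received?
  has(dot: Dot): boolean {
    return dot.counter <= this.get(dot.replicaID);
  }

  // Record a received operation. With causal-order delivery,
  // dot.counter is always exactly get(dot.replicaID) + 1.
  add(dot: Dot): void {
    this.entries.set(dot.replicaID, Math.max(this.get(dot.replicaID), dot.counter));
  }
}
```

A replica answers "have I seen this operation?" with has(dot) before processing a message, then records it with add(dot).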
# Sync Strategies

We now turn to sync strategies: ways to keep collaborators in sync with each other, so that they eventually see the same states.

# Op-Based CRDTs

An operation-based (op-based) CRDT keeps collaborators in sync by broadcasting operations as they happen. This sync strategy is especially useful for live collaboration, where users would like to see each others' operations quickly.

Current state (op history) + single operation -> new state (op history including operation).

In the simple protocol at the top of this post, a user processes an individual operation by adding it to their operation history, then re-running the semantic function to update their app-visible state. An op-based CRDT instead stores a state that is (usually) smaller than the complete operation history, but it still contains enough information to render the app-visible state, and it can be updated incrementally in response to a received operation.

Example: The op-based counter CRDT from Part 1 has an internal state that is merely the current count. When a user receives (or performs) an inc() operation, they increment the count.

Formally, an op-based CRDT consists of:

- A set of allowed CRDT states that a replica can be in.
- A query that returns the app-visible state. (The CRDT state often includes extra metadata that is not visible to the rest of the app, such as LWW timestamps.)
- For each operation (insert, set, etc.):
  - A prepare function that inputs the operation's parameters and outputs a message describing that operation. An external protocol promises to broadcast this message to all collaborators. (Usually this message is just the translated form of the operation. prepare is allowed to read the current state but not mutate it.)
  - An effect function that processes a message, updating the local state. An external protocol promises to call effect:
    - Immediately for each locally-prepared message, so that local operations update the state immediately.
    - Eventually for each remotely-prepared message, exactly once and (optionally) in causal order.

In addition to updating their internal state, CRDT libraries' effect functions usually also emit events that describe how the state changed. Cf. Views in Part 2.

To claim that an op-based CRDT implements a given CRDT semantics, you must prove that the app-visible state always equals the semantics applied to the set of operations effected so far.

As an example, let's repeat the op-based unique set CRDT from Part 2.

- Per-user CRDT state: A set of pairs (id, x).
- Query: Return the CRDT state directly, since in this case, it coincides with the app-visible state.
- Operation add:
  - prepare(x): Generate a new UID id, then return the message ("add", (id, x)).
  - effect("add", (id, x)): Add (id, x) to your local state.
- Operation delete:
  - prepare(id): Return the message ("delete", id).
  - effect("delete", id): Delete the pair with the given id from your local state, if it is still present.

It is easy to check that this op-based CRDT has the desired semantics: at any time, the query returns the set of pairs (id, x) such that you have effected an add(id, x) operation but no delete(id) operations. Observe that the CRDT state is a lossy representation of the operation history: we don't store any info about delete operations or deleted add operations.
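To make this concrete, here is a TypeScript sketch of that op-based unique set. This is a sketch under stated assumptions: newUID() stands in for the dot-ID generator from earlier (serialized to a string), and the broadcast and delivery guarantees come from the external protocol, which is not shown.

```ts
type UID = string; // E.g., a serialized dot ID.
type SetMessage<T> =
  | { type: "add"; id: UID; value: T }
  | { type: "delete"; id: UID };

// Assumed: the dot-ID generator from the "Unique IDs: Dots" section.
declare function newUID(): UID;

class OpBasedUniqueSet<T> {
  // CRDT state: the set of pairs (id, x), keyed by id.
  private readonly state = new Map<UID, T>();

  // Query: here the CRDT state coincides with the app-visible state.
  get elements(): ReadonlyMap<UID, T> {
    return this.state;
  }

  // prepare functions: create a message for the external protocol to
  // broadcast. They read but do not mutate the state.
  prepareAdd(value: T): SetMessage<T> {
    return { type: "add", id: newUID(), value };
  }

  prepareDelete(id: UID): SetMessage<T> {
    return { type: "delete", id };
  }

  // effect: called exactly once per message, immediately for local
  // messages, eventually (in causal order) for remote ones.
  effect(message: SetMessage<T>): void {
    if (message.type === "add") this.state.set(message.id, message.value);
    else this.state.delete(message.id); // No-op if already deleted.
  }
}
```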
How can the "external protocol" (i.e., the rest of the app) guarantee that messages are effected at-most-once and in causal order? Using a history-tracking vector clock:

- On each replica, store a vector clock tracking the operations that have been effected so far.
- When a replica asks to send a prepared message, attach a new dot ID to that message before broadcasting it. Also attach the dot IDs of its immediate causal predecessors.
- When a replica receives a message:
  1. Check if its dot ID is redundant, according to the local vector clock. If so, stop.
  2. Check if the immediate causal predecessors' dot IDs have been effected, according to the local vector clock. If not, block until they are.
  3. Deliver the message to effect and update the local vector clock. Do the same for any newly-unblocked messages.

To ensure that messages are eventually delivered at-least-once to each replica (the other half of exactly-once), you generally need some help from the network. E.g., have a server store all messages and retry delivery until every client confirms receipt.

As a final note, suppose that two users concurrently perform operations o and p. You are allowed to deliver their op-based messages to effect in either order without violating the causal-order delivery guarantee. Semantically, the two delivery orders must result in equivalent internal states: both results correspond to the same operation history, containing both o and p. Thus for an op-based CRDT, concurrent messages commute.

Conversely, you can prove that if an algorithm has the API of an op-based CRDT and concurrent messages commute, then its behavior corresponds to some CRDT semantics (i.e., some pure function of the operation history). This leads to the traditional definition of an op-based CRDT in terms of commuting concurrent operations. Of course, if you only prove commutativity, there is no guarantee that the corresponding semantics are reasonable in the eyes of your users.

Collabs: sendCRDT and receiveCRDT in PrimitiveCRDT

Refs: Shapiro et al. 2011a

# State-Based CRDTs

A state-based CRDT keeps users in sync by occasionally exchanging entire states, "merging" their operation histories. This sync strategy is useful in peer-to-peer networks (peers occasionally exchange states in order to bring each other up-to-date) and for the initial sync between a client and a server (the client merges its local state with the server's latest state, and vice-versa).

Current state (op history) + other state (overlapping op history) -> merged state (union of op histories).

In the simple protocol at the top of this post, the "entire state" is the literal operation history, and merging is just the set-union of operations (using the UIDs to filter duplicates). A state-based CRDT instead stores a state that is (usually) smaller than the complete operation history, but it still contains enough information to render the app-visible state, and it can be "merged" with another state.

Formally, a state-based CRDT consists of:

- A set of allowed CRDT states that a replica can be in.
- A query that returns the app-visible state. (The CRDT state often includes extra metadata that is not visible to the rest of the app, such as LWW timestamps.)
- For each operation, a state mutator that updates the local state.
- A merge function that inputs two states and outputs a "merged" state. The app using the CRDT may set local state = merge(local state, other state) at any time, where the other state usually comes from a remote collaborator or storage.

To claim that a state-based CRDT implements a given CRDT semantics, you must prove that the app-visible state always equals the semantics applied to the set of operations that contribute to the current state. Here an operation "contributes" to the output of its state mutator, plus future states resulting from that state (e.g., the merge of that state with another).

As an example, let's repeat just the state-based part of the LWW Register from Part 2.

- Per-user state: state = { value, time }, where time is a logical timestamp.
- Query: Return state.value.
- Operation set(newValue): Set state = { value: newValue, time: newTime }, where newTime is the current logical time.
- Merge in other state: Pick the state with the greatest logical timestamp. That is, if other.time > state.time, set state = other.

It is easy to check that this state-based CRDT has the desired semantics: at any time, the query returns the value corresponding to the set operation with the greatest logical timestamp that contributes to the current state.
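In TypeScript, the whole register might look like this (a sketch only; I use Lamport timestamps, described below, as the logical timestamps, with ties broken by replica ID):

```ts
type Timestamp = { time: number; replicaID: string };

// Total order on timestamps: greater time wins; ties broken by replicaID.
function timestampLess(a: Timestamp, b: Timestamp): boolean {
  return a.time < b.time || (a.time === b.time && a.replicaID < b.replicaID);
}

class StateBasedLWWRegister<T> {
  constructor(private state: { value: T; time: Timestamp }) {}

  // Query: the app-visible state.
  get value(): T {
    return this.state.value;
  }

  // State mutator: overwrite with the current logical time.
  set(newValue: T, newTime: Timestamp): void {
    this.state = { value: newValue, time: newTime };
  }

  // Merge: keep whichever state has the greater timestamp.
  merge(other: { value: T; time: Timestamp }): void {
    if (timestampLess(this.state.time, other.time)) this.state = other;
  }
}
```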
As a final note, observe that for any CRDT states s, t, u, the following algebraic rules hold, because the same set of operations contributes to both sides of each equation:

- (Idempotence) merge(s, s) = s.
- (Commutativity) merge(s, t) = merge(t, s).
- (Associativity) merge(s, merge(t, u)) = merge(merge(s, t), u).

Thus for a state-based CRDT, the merge function is Associative, Commutative, and Idempotent (ACI).

Conversely, you can prove that if an algorithm has the API of a state-based CRDT and it satisfies ACI, then its behavior corresponds to some CRDT semantics (i.e., some pure function of the operation history). This leads to the traditional definition of a state-based CRDT in terms of an ACI merge function. Of course, if you only prove these algebraic rules, there is no guarantee that the corresponding semantics are reasonable in the eyes of your users.

Collabs: saveCRDT and loadCRDT in PrimitiveCRDT

Refs: Shapiro et al. 2011a

# Other Sync Strategies

In a real collaborative app, it is inconvenient to choose op-based or state-based synchronization. Instead, it's nice to use both, potentially within the same session.

Example: When the user launches your app, first do a state-based sync with a storage server to become in sync. Then use op-based messages over TCP to stay in sync until the connection drops.

Thus hybrid op-based/state-based CRDTs that support both sync strategies are popular in practice. Typically, these look like either state-based CRDTs with op-based messages tacked on (Yjs, Collabs), or they use an op-based CRDT alongside a complete operation history (Automerge). To perform a state-based merge in the latter approach, you look through the received state's history for operations that are not already in your history, and deliver those to the op-based CRDT. This approach is simple, and it comes with a built-in version history, but it requires more effort to make efficient (as Automerge has been pursuing).

Other sync strategies use optimized peer-to-peer synchronization. Traditionally, peer-to-peer synchronization uses state-based CRDTs: each peer sends a copy of its own state to the other peer, then merges in the received state. This is inefficient if the states overlap a lot - e.g., the two peers just synced one minute ago and have only updated their states slightly since then. Optimized protocols like Yjs's sync protocol or Byzantine causal broadcast instead use back-and-forth messages to determine what info the other peer is missing and send just that.

For academic work on hybrid or optimized sync strategies, look up delta-state based CRDTs (also called delta CRDTs). These are like hybrid CRDTs, with the added technical requirement that op-based messages are themselves states (in particular, they are input to the state-based merge function instead of a separate effect function). Note that some papers focus on novel sync strategies, while others focus on the orthogonal problem of how to tolerate non-causally-ordered messages.

Collabs: Updates and Sync - Patterns

Refs: Yjs document updates; Almeida, Shoker, and Baquero 2016; Enes et al. 2019

# Misc Techniques

The rest of this post describes specific algorithms. We start with miscellaneous techniques that are needed to implement some of the semantics from Part 2: two kinds of logical timestamps for LWW, and ways to query the causal order. These are all traditional distributed systems techniques that are not specific to CRDTs.
# LWW: Lamport Timestamps

Recall that you should use a logical timestamp instead of wall-clock time for Last-Writer Wins (LWW) values. A Lamport timestamp is a simple and common logical timestamp, defined by:

- Each replica stores a clock value time, an integer that is initially 0.
- Whenever you perform an LWW set operation, increment time and attach its new value to the operation, as part of a pair (time, replicaID). This pair is called a Lamport timestamp.
- Whenever you receive a Lamport timestamp from another replica as part of an operation, set time = max(time, received time).

The total order on Lamport timestamps is given by: (t1, replica1) < (t2, replica2) if t1 < t2, or t1 = t2 and replica1 < replica2. That is, the operation with a greater time "wins", with ties broken using an arbitrary order on replica IDs.

Figure 1. An operation history with each operation labeled by a Lamport timestamp: operations A-F with arrows A to B, A to E, B to C, C to D, C to F, labeled (1, "A84nxi"); (2, "A84nxi"); (3, "A84nxi"); (5, "A84nxi"); (2, "bu2nVP"); (4, "bu2nVP").

Lamport timestamps have two important properties (mentioned in Part 2):

1. If o < p in the causal order, then (o's Lamport timestamp) < (p's Lamport timestamp). Thus a new LWW set always wins over all causally-prior sets. Note that the converse does not hold: it's possible that (q's Lamport timestamp) < (r's Lamport timestamp) but q and r are concurrent.
2. Lamport timestamps created by different users are always distinct (because of the replicaID tiebreaker). Thus the winner is never ambiguous.

Collabs: lamportTimestamp

Refs: Lamport 1978; Wikipedia
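A small TypeScript sketch of the clock update rules (names are illustrative; the comparison function is the timestampLess shown earlier):

```ts
type LamportTimestamp = { time: number; replicaID: string };

class LamportClock {
  private time = 0;

  constructor(private readonly replicaID: string) {}

  // On each LWW set: increment time and attach the pair (time, replicaID).
  next(): LamportTimestamp {
    this.time++;
    return { time: this.time, replicaID: this.replicaID };
  }

  // On receiving a timestamp as part of another replica's operation.
  receive(ts: LamportTimestamp): void {
    this.time = Math.max(this.time, ts.time);
  }
}
```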
# LWW: Hybrid Logical Clocks

Hybrid logical clocks are another kind of logical timestamp that combine features of Lamport timestamps and wall-clock time. I am not qualified to write about these, but Jared Forsyth gives a readable description here: https://jaredforsyth.com/posts/hybrid-logical-clocks/.

# Querying the Causal Order: Vector Clocks 2

One of Part 2's "Other Techniques" was Querying the Causal Order. For example, an access-control CRDT could include a rule like "If Alice performs an operation but an admin banned her concurrently, then treat Alice's operation as if it had not happened".

I mentioned in Part 2 that I find this technique too complicated for practical use, except in some special cases. Nevertheless, here are some ways to implement causal-order queries.

Formally, our goal is: given two operations o and p, answer the query "Is o < p in the causal order?". More narrowly, a CRDT might query whether o and p are concurrent, i.e., neither o < p nor p < o.

Recall from above that a vector clock is a map that sends a replica ID to the maximum counter value received from that replica, where each replica assigns counters to its own operations in order starting at 1 (like a dot ID). Besides storing a vector clock on each replica, we can also attach a vector clock to each operation: namely, the sender's vector clock at the time of sending. (The sender's own entry is incremented to account for the operation itself.)

Figure 2. An operation history with each operation labeled by its dot ID (blue italics) and vector clock (red normal text): operations A-F with arrows A to B, A to E, B to C, C to D, C to F, labeled ("A84nxi", 1) / { A84nxi: 1 }; ("A84nxi", 2) / { A84nxi: 2 }; ("A84nxi", 3) / { A84nxi: 3 }; ("A84nxi", 4) / { A84nxi: 4, bu2nVP: 2 }; ("bu2nVP", 1) / { A84nxi: 1, bu2nVP: 1 }; ("bu2nVP", 2) / { A84nxi: 3, bu2nVP: 2 }.

Define a partial order on vector clocks by: v < w if for every replica ID r, v[r] <= w[r], and for at least one replica ID, v[r] < w[r]. (If r is not present in a map, treat its value as 0.) Then it is a classic result that the causal order on operations matches the partial order on their vector clocks. Thus storing each operation's vector clock lets you query the causal order later.

Example: In the above diagram, { A84nxi: 1, bu2nVP: 1 } < { A84nxi: 4, bu2nVP: 2 }, matching the causal order on their operations. Meanwhile, { A84nxi: 1, bu2nVP: 1 } and { A84nxi: 2 } are incomparable, matching the fact that their operations are concurrent.
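In code, the partial-order check is a direct translation (a TypeScript sketch; missing entries are treated as 0):

```ts
type VectorClock = { [replicaID: string]: number };

// Is v < w in the partial order (= the causal order on their operations)?
function vcLess(v: VectorClock, w: VectorClock): boolean {
  let strictlyLessSomewhere = false;
  const replicaIDs = new Set([...Object.keys(v), ...Object.keys(w)]);
  for (const r of replicaIDs) {
    const vr = v[r] ?? 0;
    const wr = w[r] ?? 0;
    if (vr > wr) return false;
    if (vr < wr) strictlyLessSomewhere = true;
  }
  return strictlyLessSomewhere;
}

// Operations are concurrent iff their vector clocks are incomparable.
function vcConcurrent(v: VectorClock, w: VectorClock): boolean {
  return !vcLess(v, w) && !vcLess(w, v);
}
```

Running this on the example above: vcLess({ A84nxi: 1, bu2nVP: 1 }, { A84nxi: 4, bu2nVP: 2 }) returns true, while vcConcurrent({ A84nxi: 1, bu2nVP: 1 }, { A84nxi: 2 }) also returns true.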
Often you only need to query the causal order on new operations. That is, you just received an operation p, and you want to compare it to an existing operation o. For this, it suffices to know o's dot ID (replicaID, counter): if counter <= p.vc[replicaID], then o < p, else they are concurrent. Thus in this case, you don't need to store each operation's vector clock, just their dot IDs (though you must still send vector clocks over the network).

The above discussion changes slightly if you do not assume causal order delivery. See Wikipedia's update rules.

We have a performance problem: the size of a vector clock is proportional to the number of past replicas. In a collaborative app, this number tends to grow without bound: each browser tab creates a new replica, including refreshes. Thus if you attach a vector clock to each op-based message, your network usage also grows without bound.

Some workarounds:

1. Instead of attaching the whole vector clock to each op-based message, just attach the UIDs of the operation's immediate causal predecessors. (You are probably attaching these anyway for causal-order delivery.) Then on the receiver side, look up the predecessors' vector clocks in your stored state, take their entry-wise max, add one to the sender's entry, and store that as the operation's vector clock.
2. Same as 1, but instead of storing the whole vector clock for each operation, just store its immediate causal predecessors - the arrows in the operation history. This uses less space, and it still contains enough information to answer causal order queries: o < p if and only if there is a path of arrows from o to p. However, I don't know of a way to perform those queries quickly.
3. Instead of referencing the causal order directly, the sender can list just the "relevant" o < p as part of p's op-based message. For example, when you set the value of a multi-value register, instead of using a vector clock to indicate which set(x) operations are causally prior, just list the UIDs of the current multi-values (cf. the multi-value register on top of a unique set).

Collabs: vectorClock

Refs: Baquero and Preguiça 2016; Wikipedia; Automerge issue discussing workarounds

# Optimized CRDTs

We now turn to optimizations. I focus on algorithmic optimizations that change what state you store and how you access it, as opposed to low-level code tricks. Usually, the optimizations reduce the amount of metadata that you need to store in memory and on disk, at least in the common case.

These optimizations are the most technical part of the blog series. You may wish to skip them for now and come back only when you are implementing one, or trust someone else to implement them in a library.

# List CRDTs

I assume you've understood Lists and Text Editing from Part 2.

There are too many list CRDT algorithms and optimizations to survey here, but I want to briefly introduce one key problem and solution.

When you use a text CRDT to represent a collaborative text document, the easy way to represent the state is as an ordered map (list CRDT position) -> (text character). Concretely, this map could be a tree with one node per list CRDT position, like in Fugue: A Basic List CRDT.

Figure 3. Example of a Fugue tree with corresponding text "abcde". Each node's UID is a dot.

In such a tree, each tree node contains at minimum (1) a UID and (2) a pointer to its parent node. That is a lot of metadata for a single text character! Plus, you often need to store this metadata even for deleted characters (tombstones).

Here is an optimization that dramatically reduces the metadata overhead in practice:

- When a replica inserts a sequence of characters from left to right (the common case), instead of creating a new UID for each character, only create a UID for the leftmost character.
- Store the whole sequence as a single object (id, parentId etc, [char0, char1, ..., charN]). So instead of one tree node per character, your state has one tree node per sequence, storing an array of characters.
- To address individual characters, use list CRDT positions of the form (id, 0), (id, 1), …, (id, N).

It's possible to later insert characters in the middle of a sequence, e.g., between char1 and char2. That's fine; the new characters just need to indicate the corresponding list CRDT positions (e.g. "I am a left child of (id, 2)").

Applying this optimization to Fugue gives you trees like so, where only the filled nodes are stored explicitly (together with their children's characters):

Figure 4. Example of an optimized Fugue tree with corresponding text "abcdefg".

Collabs: Waypoints in CTotalOrder

Refs: Yu 2012; Jahns 2020 (Yjs blog post)

# Formatting Marks (Rich Text)

Recall the inline formatting CRDT from Part 2. Its internal CRDT state is an append-only log of formatting marks:

```ts
type Mark = {
  key: string;
  value: any;
  timestamp: LogicalTimestamp;
  start: { pos: Position, type: "before" | "after" }; // type Anchor
  end: { pos: Position, type: "before" | "after" }; // type Anchor
};
```

Its app-visible state is the view of this log given by: for each character c, for each format key key, find the mark with the largest timestamp satisfying

- mark.key = key, and
- the interval (mark.start, mark.end) contains c's position.

Then c's format value at key is mark.value.

For practical use, we would like a view that represents the same state but uses less memory. In particular, instead of storing per-character formatting info, it should look more like a Quill delta. E.g., the Quill delta representing "Quick brown fox" is

```ts
{
  ops: [
    { insert: "Quick " },
    { insert: "brow", attributes: { bold: true, italic: true } },
    { insert: "n fox", attributes: { italic: true } }
  ]
}
```

Here is such a view. Its state is a map: Map<Anchor, Mark[]>, given by: for each anchor that appears as the start or end of any mark in the log, the value at anchor contains pointers to all marks that start at or strictly contain anchor. That is, map.get(anchor) = { mark in log | mark.start <= anchor < mark.end }.

Given this view, it is easy to look up the format of any particular character. You just need to go left until you reach an anchor that is in map, then interpret map.get(anchor) in the usual way: for each key, find the LWW winner at key and use its value. (If you reach the beginning of the list, the character has no formatting.)
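For that last step, resolving LWW among the marks stored at one anchor might look like this (a TypeScript sketch; Mark is the type above, and timestampLess is an assumed total-order comparison on logical timestamps, not part of the original):

```ts
// Assumed: a total-order comparison on logical timestamps.
declare function timestampLess(a: LogicalTimestamp, b: LogicalTimestamp): boolean;

// Interpret the marks at an anchor: for each key, the mark with the
// largest timestamp wins; a null value means "unformatted".
function resolveFormat(marks: Mark[]): { [key: string]: any } {
  const winners = new Map<string, Mark>();
  for (const mark of marks) {
    const current = winners.get(mark.key);
    if (current === undefined || timestampLess(current.timestamp, mark.timestamp)) {
      winners.set(mark.key, mark);
    }
  }
  const format: { [key: string]: any } = {};
  for (const [key, mark] of winners) {
    if (mark.value !== null) format[key] = mark.value;
  }
  return format;
}
```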
I claim that with sufficient coding effort, you can also do the following tasks efficiently:

1. Convert the whole view into a Quill delta or similar rich-text-editor state.
2. Update the view to reflect a new mark added to the log.
3. After updating the view, emit events describing what changed, in a form like Collabs's RichTextFormatEvent:

```ts
type RichTextFormatEvent = {
  // The formatted range is [startIndex, endIndex).
  startIndex: number;
  endIndex: number;
  key: string;
  value: any; // null if unformatted
  previousValue: any;
  // The range's complete new format.
  format: { [key: string]: any };
};
```

Let's briefly discuss Task 2; there's more detail in the Peritext essay. When a new mark mark is added to the log:

- If mark.start is not present in map, go to the left of it until you reach an anchor prev that is in map, then do map.set(mark.start, copy of map.get(prev)). (If you reach the beginning of the list, do map.set(mark.start, []).)
- If mark.end is not present in map, do likewise.
- For each entry (anchor, array) in map such that mark.start <= anchor < mark.end, append mark to array.

Some variations on this section:

- Collabs represents the Map<Anchor, Mark[]> literally, using a LocalList - a local data structure that lets you build an ordered map on top of a separate list CRDT's positions. Alternatively, you can store each Mark[] inline with the list CRDT, at its anchor's location; that is how the Peritext essay does it.
- When saving the state to disk, you can choose whether to save the view, or forget it and recompute it from the mark log on next load. Likewise for state-based merging: you can try to merge the views directly, or just process the non-redundant marks one at a time.
- In each map value (a Mark[]), you can safely forget marks that have LWW-lost to another mark in the same array. Once a mark has been deleted from every map value, you can safely forget it from the log.

Collabs: CRichText; its source code shows all three tasks

Refs: Litt et al. 2021 (Peritext)
# State-Based Counter

The easy way to count events in a collaborative app is to store the events in an append-only log or unique set. This uses more space than the count alone, but you often want that extra info anyway - e.g., to display who liked a post, in addition to the like count.

Nonetheless, the optimized state-based counter CRDT is both interesting and traditional, so let's see it.

The counter's semantics are as in Part 1's passenger counting example: its value is the number of +1 operations in the history, regardless of concurrency.

You can obviously achieve this semantics by storing an append-only log of +1 operations. To merge two states, take the union of log entries, skipping duplicate UIDs. In other words, store the entire operation history, following the simple protocol from the top of this post.

Suppose the append-only log uses dot IDs as its UIDs. Then the log's state will always look something like this:

```
[
  ((a6X7fx, 1), "+1"), ((a6X7fx, 2), "+1"), ((a6X7fx, 3), "+1"), ((a6X7fx, 4), "+1"),
  ((bu91nD, 1), "+1"), ((bu91nD, 2), "+1"), ((bu91nD, 3), "+1"),
  ((yyn898, 1), "+1"), ((yyn898, 2), "+1")
]
```

You can compress this state by storing, for each replicaID, only the range of dot IDs received from that replica. For example, the above log compresses to

```
{
  a6X7fx: 4,
  bu91nD: 3,
  yyn898: 2
}
```

This is the same trick we used in Vector Clocks 1.

Compressing the log in this way leads to the following algorithm, the state-based counter CRDT.

- Per-user state: A Map<ReplicaID, number>, mapping each replicaID to the number of +1 operations received from that replica. (Traditionally, this state is called a vector instead of a map.)
- App-visible state (the count): Sum of the map values.
- Operation +1: Add 1 to your own map entry, treating a missing entry as 0.
- Merge in other state: Take the entry-wise max of values, treating missing entries as 0. That is, for all r, set this.state[r] = max(this.state[r] ?? 0, other.state[r] ?? 0).

For example:

```
// Starting local state:
{
  a6X7fx: 2, // Implies ops ((a6X7fx, 1), "+1"), ((a6X7fx, 2), "+1")
  bu91nD: 3,
}
// Other state:
{
  a6X7fx: 4, // Implies ops ((a6X7fx, 1), "+1"), ..., ((a6X7fx, 4), "+1")
  bu91nD: 1,
  yyn898: 2
}
// Merged result:
{
  a6X7fx: 4, // Implies ops ((a6X7fx, 1), "+1"), ..., ((a6X7fx, 4), "+1"): union of inputs
  bu91nD: 3,
  yyn898: 2
}
```

You can generalize the state-based counter to handle +x operations for arbitrary positive values x. However, to handle both positive and negative additions, you need to use two counters: P for positive additions and N for negative additions. The actual value is (P's value) - (N's value). This is the state-based PN-counter.

Collabs: CCounter

Refs: Shapiro et al. 2011a
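A TypeScript sketch of this counter (illustrative only; the replica ID is fixed at construction):

```ts
class StateBasedCounter {
  // Per-user state: replicaID -> number of +1 ops received from that replica.
  readonly state = new Map<string, number>();

  constructor(private readonly replicaID: string) {}

  // App-visible state: the sum of the map values.
  get value(): number {
    let sum = 0;
    for (const n of this.state.values()) sum += n;
    return sum;
  }

  // Operation +1: bump our own entry, treating a missing entry as 0.
  inc(): void {
    this.state.set(this.replicaID, (this.state.get(this.replicaID) ?? 0) + 1);
  }

  // Merge: entry-wise max, treating missing entries as 0.
  merge(other: ReadonlyMap<string, number>): void {
    for (const [r, n] of other) {
      this.state.set(r, Math.max(this.state.get(r) ?? 0, n));
    }
  }
}
```

Usage mirrors the example above: a.merge(b.state) never loses +1 operations, because each entry only grows.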
# Delta-State Based Counter

You can modify the state-based counter CRDT to also support op-based messages:

- When a user performs a +1 operation, broadcast its dot ID (r, c).
- Recipients add this dot to their compressed log map, by setting map[r] = max(map[r], c).

This hybrid op-based/state-based CRDT is called the delta-state based counter CRDT.

Technically, the delta-state based counter CRDT assumes causal-order delivery for op-based messages. Without this assumption, a replica's uncompressed log might contain gaps like

```
[((a6X7fx, 1), "+1"), ((a6X7fx, 2), "+1"), ((a6X7fx, 3), "+1"), ((a6X7fx, 6), "+1")]
```

which we can't represent as a Map<ReplicaID, number>.

You could argue that the operation ((a6X7fx, 6), "+1") lets you "infer" the prior operations ((a6X7fx, 4), "+1") and ((a6X7fx, 5), "+1"), hence you can just set the map entry to { a6X7fx: 6 }. However, this will give an unexpected counter value if those prior operations were deliberately undone, or if they're tied to some other change that can't be inferred (e.g., a count of comments vs the actual list of comments).

Luckily, you can still compress ranges within the dot IDs that you have received. For example, you could use a run-length encoding:

```
{
  a6X7fx: [1 through 3, 6 through 6]
}
```

or a map plus a set of "extra" dot IDs:

```
{
  map: { a6X7fx: 3 },
  dots: [["a6X7fx", 6]]
}
```

This idea leads to a second delta-state based counter CRDT. Its state-based merge algorithm is somewhat complicated, but it has a simple spec: decompress both inputs, take the union, and re-compress.

Refs: Almeida, Shoker, and Baquero 2016

# State-Based Unique Set

Here is a straightforward state-based CRDT for the unique set:

- Per-user state:
  - A set of elements (id, x), which is the set's literal (app-visible) state.
  - A set of tombstones id, which are the UIDs of all deleted elements.
- App-visible state: Return elements.
- Operation add(x): Generate a new UID id, then add (id, x) to elements.
- Operation delete(id): Delete the pair with the given id from elements, and add id to tombstones.
- State-based merge: To merge in another user's state other = { elements, tombstones }:
  - For each (id, x) in other.elements, if it is not already present in this.elements and id is not in this.tombstones, add (id, x) to this.elements.
  - For each id in other.tombstones, if it is not already present in this.tombstones, add id to this.tombstones and delete the pair with the given id from this.elements (if present).

For example:

```
// Starting local state:
{
  elements: [ (("A84nxi", 1), "milk"), (("A84nxi", 3), "eggs") ],
  tombstones: [ ("A84nxi", 2), ("bu2nVP", 1), ("bu2nVP", 2) ]
}
// Other state:
{
  elements: [
    (("A84nxi", 3), "eggs"), (("bu2nVP", 1), "bread"), (("bu2nVP", 2), "butter"),
    (("bu2nVP", 3), "cereal")
  ],
  tombstones: [ ("A84nxi", 1), ("A84nxi", 2) ]
}
// Merged result:
{
  elements: [ (("A84nxi", 3), "eggs"), (("bu2nVP", 3), "cereal") ],
  tombstones: [ ("A84nxi", 1), ("A84nxi", 2), ("bu2nVP", 1), ("bu2nVP", 2) ]
}
```

The problem with this straightforward algorithm is the tombstone set: it stores a UID for every deleted element, potentially making the CRDT state much larger than the app-visible state.

Luckily, when your UIDs are dot IDs, you can use a "compression" trick similar to the state-based counter CRDT: in place of the tombstone set, store the range of dot IDs received from each replica (deleted or not), as a Map<ReplicaID, number>. That is, store a modified vector clock that only counts add operations. Any dot ID that lies within this range, but is not present in elements, must have been deleted.

For example, the three states above compress to:

```
// Starting local state:
{
  elements: [ (("A84nxi", 1), "milk"), (("A84nxi", 3), "eggs") ],
  vc: { A84nxi: 3, bu2nVP: 2 }
}
// Other state:
{
  elements: [
    (("A84nxi", 3), "eggs"), (("bu2nVP", 1), "bread"), (("bu2nVP", 2), "butter"),
    (("bu2nVP", 3), "cereal")
  ],
  vc: { A84nxi: 3, bu2nVP: 3 }
}
// Merged result:
{
  elements: [ (("A84nxi", 3), "eggs"), (("bu2nVP", 3), "cereal") ],
  vc: { A84nxi: 3, bu2nVP: 3 }
}
```

Compressing the tombstone set in this way leads to the following algorithm, the optimized state-based unique set:

- Per-user state:
  - A set of elements (id, x) = ((r, c), x).
  - A vector clock vc: Map<ReplicaID, number>.
  - A local counter counter, used for dot IDs.
- App-visible state: Return elements.
- Operation add(x): Generate a new dot id = (local replicaID, ++counter), then add (id, x) to elements. Also increment vc[local replicaID].
- Operation delete(id): Merely delete the pair with the given id from elements.
- State-based merge: To merge in another user's state other = { elements, vc }:
  - For each ((r, c), x) in other.elements, if it is not already present in this.elements and c > this.vc[r], add ((r, c), x) to this.elements. (It's new and has not been deleted locally.)
  - For each entry (r, c) in other.vc, for each pair ((r, c'), x) in this.elements, if c >= c' and the pair is not present in other.elements, delete it from this.elements. (It must have been deleted from the other state.)
  - Set this.vc to the entry-wise max of this.vc and other.vc, treating missing entries as 0. That is, for all r, set this.vc[r] = max(this.vc[r] ?? 0, other.vc[r] ?? 0).

You can re-use the same vector clock that you use for tracking operations, if your unique set's dot IDs use the same counter. Nitpicks:

- Your unique set might skip over some dot IDs, because they are used for other operations; that is fine.
- If a single operation can add multiple elements at once, consider using UIDs of the form (dot ID, within-op counter) = (replicaID, per-op counter, within-op counter).

The optimized unique set is especially important because you can implement many other CRDTs on top, which are then also optimized (they avoid tombstones). In particular, Part 2 describes unique set algorithms for the multi-value register, multi-value map, and add-wins set (all assuming causal-order delivery). You can also adapt the optimized unique set to manage deletions in the unique set of CRDTs.

Collabs: CMultiValueMap, CSet

Refs: Based on the Optimized OR-Set from Bieniusa et al. 2012
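Here is a TypeScript sketch of the optimized unique set (illustrative only; dots are keyed by a serialized "replicaID:counter" string for easy lookup):

```ts
type Dot = { r: string; c: number }; // (replicaID, counter)
const dotKey = (dot: Dot) => `${dot.r}:${dot.c}`;

class OptimizedUniqueSet<T> {
  private readonly elements = new Map<string, { dot: Dot; value: T }>();
  // Modified vector clock: counts add operations per replica.
  private readonly vc = new Map<string, number>();
  private counter = 0;

  constructor(private readonly replicaID: string) {}

  add(value: T): Dot {
    const dot: Dot = { r: this.replicaID, c: ++this.counter };
    this.elements.set(dotKey(dot), { dot, value });
    this.vc.set(dot.r, dot.c);
    return dot;
  }

  delete(dot: Dot): void {
    // No tombstone: merely remove the pair.
    this.elements.delete(dotKey(dot));
  }

  merge(other: OptimizedUniqueSet<T>): void {
    // 1. Add other's elements that are new here (never seen locally).
    for (const [key, elt] of other.elements) {
      if (!this.elements.has(key) && elt.dot.c > (this.vc.get(elt.dot.r) ?? 0)) {
        this.elements.set(key, elt);
      }
    }
    // 2. Remove local elements that the other state saw but deleted.
    for (const [key, elt] of this.elements) {
      if (elt.dot.c <= (other.vc.get(elt.dot.r) ?? 0) && !other.elements.has(key)) {
        this.elements.delete(key);
      }
    }
    // 3. vc = entry-wise max.
    for (const [r, c] of other.vc) {
      this.vc.set(r, Math.max(this.vc.get(r) ?? 0, c));
    }
  }
}
```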
(It must have been deleted from the other state.)Set this.vc to the entry-wise max of this.vc and other.vc, treating missing entries as 0. That is, for all r, set this.vc[r] = max(this.vc[r] ?? 0, other.vc[r] ?? 0).You can re-use the same vector clock that you use for tracking operations, if your unique set’s dot IDs use the same counter. Nitpicks:Your unique set might skip over some dot IDs, because they are used for other operations; that is fine.If a single operation can add multiple elements at once, consider using UIDs of the form (dot ID, within-op counter) = (replicaID, per-op counter, within-op counter).The optimized unique set is especially important because you can implement many other CRDTs on top, which are then also optimized (they avoid tombstones). In particular, Part 2 describes unique set algorithms for the multi-value register, multi-value map, and add-wins set (all assuming causal-order delivery). You can also adapt the optimized unique set to manage deletions in the unique set of CRDTs.Collabs: CMultiValueMap, CSetRefs: Based on the Optimized OR-Set from Bieniusa et al. 2012# Delta-State Based Unique SetSimilar to the second delta-state based counter CRDT, you can create a delta-state based unique set that is a hybrid op-based/state-based CRDT and allows non-causal-order message delivery. I’ll leave it as an exercise.Refs: Causal δ-CRDTs in Almeida, Shoker, and Baquero 2016# ConclusionThe last two posts surveyed the two “topics” from Part 1:Semantics: An abstract description of what a collaborative app’s state should be, given its concurrency-aware operation history.Algorithms: Algorithms to efficiently compute the app’s state in specific, practical situations.Together, they covered much of the CRDT theory that I know and use. I hope that you now know it too!However, there are additional CRDT ideas outside my focus area. I’ll give a bibliography for those in the next and final post, Part 4: Further Topics.This blog post is Part 3 of a series.Part 1: IntroductionPart 2: Semantic TechniquesPart 3: Algorithmic TechniquesPart 4: Further TopicsHome • Matthew Weidner • PhD student at CMU CSD • mweidner037 [at] gmail.com • @MatthewWeidner3 • LinkedIn • GitHub
Sym·poly·mathesy, by Chris Krycho

# jj init

What if we actually could replace Git? Jujutsu might give us a real shot.

Assumed audience: People who have worked with Git or other modern version control systems like Mercurial, Darcs, Pijul, Bazaar, etc., and have at least a basic idea of how they work.

Jujutsu is a new version control system from a software engineer at Google, where it is on track to replace Google’s existing version control systems (historically: Perforce, Piper, and Mercurial). I find it interesting both for the approach it takes and for its careful design choices in terms of both implementation details and user interface. It offers one possible answer to a question I first started asking most of a decade ago: What might a next-gen version control system look like — one which actually learned from the best parts of all of this generation’s systems, including Mercurial, Git, Darcs, Fossil, etc.?

To answer that question, it is important to have a sense of what those lessons are. This is trickier than it might seem. Git has substantially the most “mind-share” in the current generation; most software developers learn it and use it not because they have done any investigation of the tool and its alternatives but because it is a de facto standard: a situation which arose in no small part because of its “killer app” in the form of GitHub. Developers who have been around for more than a decade or so have likely seen more than one version control system — but there are many, many developers for whom Git was their first and, so far, last VCS.

The problems with Git are many, though. Most of all, its infamously terrible command line interface results in a terrible user experience. In my experience, very few working developers have a good mental model for Git. Instead, they have a handful of commands they have learned over the years: enough to get by, and little more. The common rejoinder is that developers ought to learn how Git works internally — that everything will make more sense that way.

This is nonsense. Git’s internals are interesting on an implementation level, but frankly add up to an incoherent mess in terms of a user mental model. This is a classic mistake for software developers, and one I have fallen prey to myself any number of times. I do not blame the Git developers for it, exactly. No one should have to understand the internals of the system to use it well, though; that is a simple failure of software design. Moreover, even those internals do not particularly cohere. The index, the number of things labeled “-ish” in the glossary, the way that a “detached HEAD” interacts with branches, the distinction between tags and branches, the important distinctions between commits, refs, and objects… It is not that any one of those things is bad in isolation, but as a set they do not amount to a mental model I can describe charitably. Put in programming language terms: One of the reasons the “surface syntax” of Git is so hard is that its semantics are a bit confused, and that inevitably shows up in the interface to users.

Still, a change in a system so deeply embedded in the software development ecosystem is not cheap. Is it worth the cost of adoption? Well, Jujutsu has a trick up its sleeve: there is no adoption cost. You just install it — brew install jj will do the trick on macOS — and run a single command in an existing Git repository, and… that’s it.
(“There is no step 3.”) I expect that mode will always work, even though there will be a migration step at some point in the future, when Jujutsu’s own, non-Git backend becomes a viable — and ultimately the recommended — option. I am getting ahead of myself though. The first thing to understand is what Jujutsu is, and is not.

Jujutsu is two things:

1. It is a new front-end to Git. This is by far the less interesting of the two things, but in practice it is a substantial part of the experience of using the tool today. In this regard, it sits in the same notional space as something like gitoxide. Jujutsu’s jj is far more usable for day to day work than gitoxide’s gix and ein so far, though, and it also has very different aims. That takes us to:

2. It is a new design for distributed version control. This is by far the more interesting part. In particular, Jujutsu brings to the table a few key concepts — none of which are themselves novel, but the combination of which is really nice to use in practice:

   - Changes are distinct from revisions: an idea borrowed from Mercurial, but quite different from Git’s model.
   - Conflicts are first-class items: an idea borrowed from Pijul and Darcs.
   - The user interface is not only reasonable but actually really good: an idea borrowed from… literally every VCS other than Git.

The combo of those means that you can use it today in your existing Git repos, as I have been for the past six months, and that it is a really good experience using it that way. (Better than Git!) Moreover, given it is being actively developed at and by Google for use as a replacement for its current custom VCS setup, it seems like it has a good future ahead of it. Net: at a minimum you get a better experience for using Git with it. At a maximum, you get an incredibly smooth and shallow on-ramp to what I earnestly hope is the future of version control.

Jujutsu is not trying to do every interesting thing that other Git-alternative DVCS systems out there do. Unlike Pijul, for example, it does not work from a theory of patches such that the order changes are applied is irrelevant. However, as I noted above and show in detail below, jj does distinguish between changes and revisions, and has first-class support for conflicts, which means that many of the benefits of Pijul’s handling come along anyway. Unlike Fossil, Jujutsu is also not trying to be an all-in-one tool. Accordingly: It does not come with a replacement for GitHub or other such “forges”. It does not include bug tracking. It does not support chat or a forum or a wiki. Instead, it is currently aimed at just doing the base VCS operations well.

Finally, there is a thing Jujutsu is not yet: a standalone VCS ready to use without Git. It supports its own, “native” backend for the sake of keeping that door open for future capabilities, and the test suite exercises both the Git and the “native” backend, but the “native” one is not remotely ready for regular use. That said, this one I do expect to see change over time!

One of the really interesting bits about picking up Jujutsu is realizing just how weirdly Git has wired your brain, and re-learning how to think about how a version control system can work. It is one thing to believe — very strongly, in my case! — that Git’s UI design is deeply janky (and its underlying model just so-so); it is something else to experience how much better a VCS UI can be (even without replacing the underlying model!).

[Image: Yoda saying “You must unlearn what you have learned.”]

Time to become a Jedi Knight. Jujutsu Knight? Jujutsu Master?
Jujutsu apprentice, at least. Let’s dig in!

# Using Jujutsu

That is all interesting enough philosophically, but for a tool that, if successful, will end up being one of a software developer’s most-used tools, there is an even more important question: What is it actually like to use?

Setup is painless. Running brew install jj did everything I needed. As with most modern Rust-powered CLI tools,1 Jujutsu comes with great completions right out of the box. I did make one post-install tweak, since I am going to be using this on existing Git projects: I updated my ~/.gitignore_global to ignore .jj directories anywhere on disk.2

Using Jujutsu in an existing Git project is also quite easy.3 You just run jj git init --git-repo <path to repo>.4 That’s the entire flow. After that you can use git and jj commands alike on the repository, and everything Just Works™, right down to correctly handling .gitignore files. I have since run jj git init in every Git repository I am actively working on, and have had no issues in many months. It is also possible to initialize a Jujutsu copy of a Git project without having an existing Git repo, using jj git clone, which I have also done, and which works well.

[Image: Cloning true-myth and initializing it as a Jujutsu repo]

Once a project is initialized, working on it is fairly straightforward, though there are some significant adjustments required if you have deep-seated habits from Git!

# Revisions and revsets

One of the first things to wrap your head around when first coming to Jujutsu is its approach to revisions and revsets, i.e. “sets of revisions”. Revisions are the fundamental elements of changes in Jujutsu, not “commits” as in Git. Revsets are then expressions in a functional language for selecting a set of revisions. Both the idea and the terminology are borrowed directly from Mercurial, though the implementation is totally new. (Many things about Jujutsu borrow from Mercurial — a decision which makes me quite happy.) The vast majority of Jujutsu commands take a --revision/-r option to select a revision. So far that might not sound particularly different from Git’s notion of commits and commit ranges, and they are indeed similar at a surface level. However, the differences start showing up pretty quickly, both in terms of working with revisions and in terms of how revisions are a different notion of change than a Git commit.

The first place you are likely to experience how revisions and revsets are different — and neat! — is with the log command, since looking at the commit log is likely to be something you do pretty early in using a new version control tool. (Certainly it was for me.) When you clone a repo and initialize Jujutsu in it and then run jj log, you will see something rather different from what git log would show you — indeed, rather different from anything I even know how to get git log to show you.
For example, here’s what I see today when running jj log on the Jujutsu repository, limiting it to show just the last 10 revisions:

```
> jj log --limit 10
@  ukvtttmt hello@chriskrycho.com 2024-02-03 09:37:24.000 -07:00 1a0b8773
│  (empty) (no description set)
◉  qppsqonm essiene@google.com 2024-02-03 15:06:09.000 +00:00 main* HEAD@git bcdb9beb
·  cli: Move git_init() from init.rs to git.rs
· ◉  rzwovrll ilyagr@users.noreply.github.com 2024-02-01 14:25:17.000 -08:00
┌─┘  ig/contributing@origin 01e0739d
│  Update contributing.md
◉  nxskksop 49699333+dependabot[bot]@users.noreply.github.com 2024-02-01 08:56:08.000
·  -08:00 fb6c834f
·  cargo: bump the cargo-dependencies group with 3 updates
· ◉  tlsouwqs jonathantanmy@google.com 2024-02-02 21:26:23.000 -08:00
· │  jt/missingop@origin missingop@origin 347817c6
· │  workspace: recover from missing operation
· ◉  zpkmktoy jonathantanmy@google.com 2024-02-02 21:16:32.000 -08:00 2d0a444e
· │  workspace: inline is_stale()
· ◉  qkxullnx jonathantanmy@google.com 2024-02-02 20:58:21.000 -08:00 7abf1689
┌─┘  workspace: refactor for_stale_working_copy
◉  yyqlyqtq yuya@tcha.org 2024-01-31 09:40:52.000 +09:00 976b8012
·  index: on reinit(), delete all segment files to save disk space
· ◉  oqnvqzzq martinvonz@google.com 2024-01-23 10:34:16.000 -08:00
┌─┘  push-oznkpsskqyyw@origin 54bd70ad
│  working_copy: make reset() take a commit instead of a tree
◉  rrxuwsqp stephen.g.jennings@gmail.com 2024-01-23 08:59:43.000 -08:00 57d5abab
·  cli: display which file's conflicts are being resolved
```

Here’s the output for the same basic command in Git — note that I am not trying to get a similar output from Git, just asking what it shows by default (and warning: wall of log output!):

```
> git log -10
commit: bcdb9beb6ce5ba625ae73d4839e4574db3d9e559 HEAD -> main, origin/main
date:   Mon, 15 Jan 2024 22:31:33 +0000
author: Essien Ita Essien

    cli: Move git_init() from init.rs to git.rs

    * Move git_init() to cli/src/commands/git.rs and call it from there.
    * Move print_trackable_remote_branches into cli_util since it's not git specific,
      but would apply to any backend that supports remote branches.
    * A no-op change. A follow up PR will make use of this.

commit: 31e4061bab6cfc835e8ac65d263c29e99c937abf
date:   Mon, 8 Jan 2024 10:41:07 +0000
author: Essien Ita Essien

    cli: Refactor out git_init() to encapsulate all git related work.

    * Create a git_init() function in cli/src/commands/init.rs where all git related
      work is done. This function will be moved to cli/src/commands/git.rs in a
      subsequent PR.

commit: 8423c63a0465ada99c81f87e06f833568a22cb48
date:   Mon, 8 Jan 2024 10:41:07 +0000
author: Essien Ita Essien

    cli: Refactor workspace root directory creation

    * Add file_util::create_or_reuse_dir() which is needed by all init
      functionality regardless of the backend.

commit: b3c47953e807bef202d632c4e309b9a8eb814fde
date:   Wed, 31 Jan 2024 20:53:23 -0800
author: Ilya Grigoriev

    config.md docs: document `jj config edit` and `jj config path`

    This changes the intro section to recommend using `jj config edit` to
    edit the config instead of looking for the files manually.

commit: e9c482c0176d5f0c0c28436f78bd6002aa23a5e2
date:   Wed, 31 Jan 2024 20:53:23 -0800
author: Ilya Grigoriev

    docs: mention in `jj help config edit` that the command can create a file

commit: 98948554f72d4dc2d5f406da36452acb2868e6d7
date:   Wed, 31 Jan 2024 20:53:23 -0800
author: Ilya Grigoriev

    cli `jj config`: add `jj config path` command

commit: 8a4b3966a6ff6b9cc1005c575d71bfc7771bced1
date:   Fri, 2 Feb 2024 22:08:00 -0800
author: Ilya Grigoriev

    test_global_opts: make test_version just a bit nicer when it fails

commit: 42e61327718553fae6b98d7d96dd786b1f050e4c
date:   Fri, 2 Feb 2024 22:03:26 -0800
author: Ilya Grigoriev

    test_global_opts: extract --version to its own test

commit: 42c85b33c7481efbfec01d68c0a3b1ea857196e0
date:   Fri, 2 Feb 2024 15:23:56 +0000
author: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

    cargo: bump the cargo-dependencies group with 1 update

    Bumps the cargo-dependencies group with 1 update: [tokio](https://github.com/tokio-rs/tokio).

    Updates `tokio` from 1.35.1 to 1.36.0
    - [Release notes](https://github.com/tokio-rs/tokio/releases)
    - [Commits](https://github.com/tokio-rs/tokio/compare/tokio-1.35.1...tokio-1.36.0)

    ---
    updated-dependencies:
    - dependency-name: tokio
      dependency-type: direct:production
      update-type: version-update:semver-minor
      dependency-group: cargo-dependencies
    ...

    Signed-off-by: dependabot[bot]

commit: 32c6406e5f04d2ecb6642433b0faae2c6592c151
date:   Fri, 2 Feb 2024 15:22:21 +0000
author: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

    github: bump the github-dependencies group with 1 update

    Bumps the github-dependencies group with 1 update:
    [DeterminateSystems/magic-nix-cache-action](https://github.com/determinatesystems/magic-nix-cache-action).

    Updates `DeterminateSystems/magic-nix-cache-action` from
    1402a2dd8f56a6a6306c015089c5086f5e1ca3ef to eeabdb06718ac63a7021c6132129679a8e22d0c7
    - [Release notes](https://github.com/determinatesystems/magic-nix-cache-action/releases)
    - [Commits](https://github.com/determinatesystems/magic-nix-cache-action/compare/1402a2dd8f56a6a6306c015089c5086f5e1ca3ef...eeabdb06718ac63a7021c6132129679a8e22d0c7)

    ---
    updated-dependencies:
    - dependency-name: DeterminateSystems/magic-nix-cache-action
      dependency-type: direct:production
      dependency-group: github-dependencies
    ...

    Signed-off-by: dependabot[bot]
```

What’s happening in the Jujutsu log output? Per the tutorial’s note on the log command specifically:

> By default, jj log lists your local commits, with some remote commits added for context. The ~ indicates that the commit has parents that are not included in the graph.
> We can use the -r flag to select a different set of revisions to list.

What jj log shows by default was still a bit non-obvious to me, even after that. Which remote commits get added for context, and why? The answer is in the help output for jj log’s -r/--revisions option:

> Which revisions to show. Defaults to the ui.default-revset setting, or @ | ancestors(immutable_heads().., 2) | heads(immutable_heads()) if it is not set

I will come back to this revset in a moment to explain it in detail. First, though, this shows a couple other interesting features of Jujutsu’s approach to revsets and thus the log command. First, it treats some of these operations as functions (ancestors(), immutable_heads(), etc.). There is a whole list of these functions! This is not a surprise if you think about what “expressions in a functional language” implies… but it was a surprise to me because I had not yet read that bit of documentation. Second, it makes “operators” a first-class idea. Git has operators, but this goes a fair bit further:

- It includes - for the parent and + for a child, and these stack and compose, so writing @-+-+ is the same as @ as long as the history is linear. (That is an important distinction!)
- It supports union |, intersection &, and difference ~ operators.
- A leading :: means “ancestors”, and a trailing :: means “descendants”. Using :: between commits gives a view of the directed acyclic graph range between two commits. Notably, <id1>::<id2> is just <id1>:: & ::<id2>.
- There is also a .. operator, which also composes appropriately (and, smartly, is the same as .. in Git when used between two commits, <id1>..<id2>). The trailing version, <id>.., is interesting: it is “revisions that are not ancestors of <id>”. Likewise, the leading version ..<id> is all revisions which are ancestors of <id>.

Now, I used <id> here, but throughout these actually operate on revsets, so you could use them with any revset. For example, ..tags() will give you the ancestors of all tags. This strikes me as extremely interesting: I think it will dodge a lot of pain in dealing with Git histories, because it lets you ask questions about the history in a compositional way using normal set logic. To make that concrete: back in October, Jujutsu contributor @aseipp pointed out how easy it is to use this to get a log which excludes gh-pages. (Anyone who has worked on a repo with a gh-pages branch knows how annoying it is to have it cluttering up your view of the rest of your Git history!) First, you define an alias for the revset that only includes the gh-pages branch: 'gh-pages' = 'remote_branches(exact:"gh-pages")'. Then you can exclude it from other queries with the ~ negation operator: jj log -r "all() ~ ancestors(gh-pages)" would give you a log view for every revision with all() and then exclude every ancestor of the gh-pages branch.

Jujutsu also provides a really capable templating system, which uses “a functional language to customize output of commands”. That functional language is built on top of the functional language that the whole tool uses for describing revisions (described in brief above!), so you can use the same kinds of operators in templates for output as you do for navigating and manipulating the repository. The template format is still evolving, but you can use it to customize the output today… while being aware that you may have to update it in the future. Keywords include things like description and change_id, and these can be customized in Jujutsu’s config.
For example, I made this tweak to mine, overriding the built-in format_short_id alias:

```
[template-aliases]
'format_short_id(id)' = 'id.shortest()'
```

This gives me super short names for changes and commits, which makes for a much nicer experience when reading and working with both in the log output: Jujutsu will give me the shortest unique identifier for a given change or commit, which I can then use with commands like jj new. Additionally, there are a number of built-in templates. For example, to see the equivalent of Git’s log --pretty you can use Jujutsu’s log -T builtin_log_detailed (-T for “template”; you can also use the long form --template). You can define your own templates in a [templates] section, or add your own [template-aliases] block, using the template language and any combination of further functions you define yourself.

That’s all well and good, but even with reading the docs for the revset language and the templating language, it still took me a bit to actually quite make sense out of the default output, much less to get a handle on how to customize the output. Right now, the docs have a bit of a flavor of explanations for people who already have a pretty good handle on version control systems, and the description of what you get from jj log is a good example of that. As the project gains momentum, it will need other kinds of more-introductory material, but the current status is totally fair and reasonable for the stage the project is at. And, to be fair to Jujutsu, both the revset language and the templating language are far easier to understand and work with than the corresponding Git materials.

Returning to the difference between the default output from jj log and git log, the key is that unless you pass -r, Jujutsu uses the ui.default-revset selector to provide a much more informative view than git log does. Again, the default is @ | ancestors(immutable_heads().., 2) | heads(immutable_heads()). Walking through that:

- The @ operator selects the current head revision.
- The | union operator says “or this other revset”, so this will show @ itself and the result of the other two queries.
- The immutable_heads() function gets the list of head revisions which are, well, immutable. By default, this is trunk() | tags(), so whatever the trunk branch is (most commonly main or master) and also any tags in the repository.
- Adding .. to the first immutable_heads() function selects revisions which are not ancestors of those immutable heads. This is basically asking for branches which are not the trunk and which do not end at a tag.
- Then ancestors(immutable_heads().., 2) requests the ancestors of those branches, but only two deep.
- Finally, heads() gets the tips of all branches which appear in the revset passed to it: a head is a commit with no children. Thus, heads(immutable_heads()) gets just the branch tips for the list of revisions computed by immutable_heads().5

When you put those all together, your log view will always show your current head change, all the open branches which have not been merged into your trunk branch, and whatever you have configured to be immutable — out of the box, trunk and all tags. That is vastly more informative than git log’s default output, even if it is a bit surprising the first time you see it. Nor is it particularly possible to get that in a single git log command. By contrast, getting the equivalent of git log is trivial.

To show the full history for a given change, you can use the :: ancestors operator.
Since jj log always gives you the identifier for a revision, you can follow it up with jj log --revision ::<change id>, or jj log -r ::<change id> for short. For example, in one repo where I am trying this, the most recent commit identifier starts with mwoq (Jujutsu helpfully highlights the segment of the change identifier you need to use), so I could write jj log -r ::mwoq, and this will show all the ancestors of mwoq, or jj log -r ..mwoq to get all the ancestors of the commit except the root. (The root is uninteresting.) Net, the equivalent command for “show me all the history for this commit” is:

```
$ jj log -r ..@
```

Revsets are very powerful, very flexible, and yet much easier to use than Git’s operators. That is in part because of the language used to express them. It is also in part because revsets build on a fundamentally different view of the world than Git commits: Jujutsu’s idea of changes.

# Changes

In Git, as in Subversion and Mercurial and other version control systems before them, when you finish with a change, you commit it. In Jujutsu, there is no first-class notion of “committing” code. This took me a fair bit to wrap my head around! Instead, Jujutsu has two discrete operations: describe and new. jj describe lets you provide a descriptive message for any change. jj new starts a new change. You can think of git commit --message "something I did" as being equivalent to jj describe --message "something I did" && jj new. This falls out of the fact that jj describe and jj new are orthogonal, and much more capable than git commit as a result.

The describe command works on any commit. It defaults to the commit that is the current working copy. If you want to rewrite a message earlier in your commit history, though, that is not a special operation like it is in Git, where you have to perform an interactive rebase to do it. You just call jj describe with a --revision (or -r for short, as everywhere in Jujutsu) argument. For example:

```
# long version
$ jj describe --revision abcd --message "An updated message."

# short version
$ jj describe -r abcd -m "An updated message."
```

That’s it. How you choose to integrate that into your workflow is a matter for you and your team to decide, of course. Jujutsu understands that some branches should not have their history rewritten this way, though, and lets you specify what the “immutable heads” revset should be accordingly. This actually makes it safer than Git, where the tool itself does not understand that kind of immutability and we rely on forges to protect certain branches from being targeted by a force push.

The new command is the core of creating any new change, and it does not require there to be only a single parent. You can create a new change with as many parents as is appropriate! Is a given change logically the child of four other changes, with identifiers a, b, c, and d? jj new a b c d. That’s it. One neat consequence that falls out of this: a merge in Jujutsu is just jj new with the requirement that it have at least two parents. (“At least two parents” because having multiple parents for a merge is not a special case as with Git’s “octopus” merges.) Likewise, you do not need a commit command, because you can describe a given change at any time with describe, and you can create a new change at any time with new.
If you already know the next thing you are going to do, you can even describe it by passing -m/--message to new when creating the new change!6

[Image: A demo of using jj new to create a three-parent merge]

Most of the time with Git, I am doing one of two things when I go to commit a change:

1. Committing everything that is in my working copy: git commit --all7 is an extremely common operation for me.
2. Committing a subset of it, not by using Git’s -p to do it via that atrocious interface, but instead opening Fork and doing it with Fork’s staging UI.

In the first case, Jujutsu’s choice to skip Git’s “index” looks like a very good one. In the second case, I was initially skeptical. Once I got the hang of working this way, though, I started to come around. My workflow with Fork looks an awful lot like the workflow that Jujutsu pushes you toward with actually using a diff tool. With Jujutsu, though, any diff tool can work. Want to use Vim? Go for it.

What is more, Jujutsu’s approach to the working copy results in a really interesting shift. In every version control system I have worked with previously (including CVS, PVCS, SVN), the workflow has been some variation on:

- Make a bunch of changes.
- Create a commit and write a message to describe it.

With both Mercurial and Git, it also became possible to rewrite history in various ways. I use Git’s rebase --interactive command extensively when working on large sets of changes. (I did the same with Mercurial’s history rewriting when I was using it a decade ago.) That expanded the list of common operations to include two more:

- Possibly directly amend that set of changes and/or its description.
- Possibly restructure history: breaking apart changes, reordering them, rewriting their message, changing what commit they land on top of, and more.

Jujutsu flips all of that on its head. A change, not a commit, is the fundamental element of the mental and working model. That means that you can describe a change that is still “in progress” as it were. I discovered this while working on a little example code for a blog post I plan to publish later this month: you can describe the change you are working on and then keep working on it. The act of describing the change is distinct from the act of “committing” and thus starting a new change. This falls out naturally from the fact that the working copy state is something you can operate on directly: akin to Git’s index, but without its many pitfalls. (This simplification affects a lot of things, as I will discuss further below; but it is especially important for new learners. Getting my head around the index was one of those things I found quite challenging initially with Git a decade ago.)

When you are ready to start a new change, you use either jj commit to “finalize” this commit with a message, or jj new to “Create a new, empty change and edit it in the working copy”. Implied: jj commit is just a convenience for jj describe followed by jj new, as the sketch below shows.
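To make that equivalence concrete, here is a minimal sketch of the two flows (the message text is hypothetical):

```
# One step:
$ jj commit -m "Add the new welcome screen"

# Equivalent two steps:
$ jj describe -m "Add the new welcome screen"
$ jj new
```

The decomposed form is the interesting one: describe can target any revision and new can start a change anywhere, so “committing” stops being a special moment and becomes two small, orthogonal operations.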
And a bonus: this means that rewording a message earlier in history does not involve some kind of rebase operation; you just jj describe --revision <target>.

What is more, jj new lets you create a new commit anywhere in the history of your project, trivially:

```
-A, --insert-after
        Insert the new change between the target commit(s) and their children
        [aliases: after]

-B, --insert-before
        Insert the new change between the target commit(s) and their parents
        [aliases: before]
```

You can do this using interactive rebasing with Git (or with history rewriting with Mercurial, though I am afraid my hg is rusty enough that I do not remember the details). What you cannot do in Git specifically is say “Start a new change at point x” unless you are in the middle of a rebase operation, which makes it inherently somewhat fragile. To be extra clear: Git allows you to check out and make a new change at any point in your graph, but it creates a branch at that point, and none of the descendants of that original point in your commit graph will come along without explicitly rebasing. Moreover, even once you do an explicit rebase and cherry-pick in the commit, the original commit is still hanging out, so you likely need to delete that branch. With jj new -A <some change ID>, you just insert the change directly into the history. Jujutsu will rebase every child in the history, including any merges if necessary; it “just works”. That does not guarantee you will not have conflicts, of course, but Jujutsu also handles conflicts better — way better — than Git. More on that below.

I never use git reflog so much as when doing interactive rebases. Once I got the hang of Jujutsu’s ability to jj new anywhere, it basically obviates most of the places I have needed Git’s interactive rebase mode, especially when combined with Jujutsu’s aforementioned support for “first-class conflicts”. There is still an escape hatch for mistakes, though: jj op log shows all the operations you have performed on the repo — and frankly, it is much more useful and powerful than git reflog, because it logs all the operations, including whenever Jujutsu updates its view of your working copy via jj status, and when it fetches new revisions from a remote.

Additionally, Jujutsu allows you to see how any change has evolved over time. This handily solves multiple pain points in Git. For example, if you have made changes in your working copy, and would like to split it into multiple changes, Git only has a binary state to let you tease those apart: staged, or not. As a result, that kind of operation ranges in difficulty from merely painful to outright impossible. With its obslog command,8 Jujutsu allows you to see how a change has evolved over time. Since the working copy is just one more kind of “change”, you can very easily retrieve earlier state — any time you did a jj status check, or ran any other command which snapshotted the state of the repository (which is most of them). That applies equally to earlier changes. If you just rebased, for example, and realize you moved some changes to code into the wrong revision, you can use the combination of obslog and new and restore (or move) to pull it back apart into the desired sequence of changes. (This one is hard to describe, so I may put up a video of it later!)

# Split

This also leads to another significant difference with Git: around breaking up your current set of changes on disk. As I noted above, Jujutsu treats the working copy itself as a commit instead of having an “index” like Git.
Git really only lets you break apart a set of changes with the index, using git add --patch. Jujutsu instead has a split command, which launches a diff editor and lets you select what you want to incorporate — rather like git add --patch does. As with all of its commands, though, jj split works exactly the same way on any commit; the working copy commit gets it “for free”.

Philosophically, I really like this. Practically, though, it is a slightly bumpier experience for me than the Git approach at the moment. Recall that I do not use git add --patch directly. Instead, I always stage changes into the Git index using a graphical tool like Fork. That workflow is slightly nicer than editing a diff — at least, as Jujutsu does it today. In Fork (and similar tools), you start with no changes and add what you want to the change set you want. By contrast, jj split launches a diff view with all the changes from a given commit present: splitting the commit involves removing changes from the right side of the diff so that it has only the changes you want to be present in the first of two new commits; whatever is not present in the final version of the right side when you close your diff editor ends up in the second commit.

If this sounds a little complicated, that is because it is — at least for today. That qualifier is important, because a lot of this is down to tooling, and we have about as much dedicated tooling for Jujutsu as Git had in 2007, which is to say: not much. Qualifier notwithstanding, and philosophical elegance notwithstanding, the complexity is still real here in early 2024. There are two big downsides as things stand. First, I find it comes with more cognitive load. It requires thinking in terms of negation rather than addition, and the “second commit” becomes less and less visible over time as you remove it from the first commit. Second, it requires you to repeat the operation when breaking up something into more than two commits. I semi-regularly take a single bucket of changes on disk and chunk it up into many more than just two commits, though! That significantly multiplies the cognitive overhead.

Now, since I started working with Jujutsu, the team has switched the default view for working with these kinds of diffs to scm-diff-editor, a TUI which has a first-class notion of this kind of workflow.9 That TUI works reasonably well, but is much less pleasant to use than something like the nice GUIs of Fork or Tower.

The net is: when I want to break apart changes, at least for the moment I find myself quite tempted to go back to Fork and Git’s index. I do not think this problem is intractable, and I think the idea of jj split is right. It just — “just”! — needs some careful design work. Preferably, the split command would make it straightforward to generate an arbitrary number of commits from one initial commit, and it would allow progressive creation of each commit from a “vs. the previous commit” baseline. This is the upside of the index in Git: it does actually reflect the reality that there are three separate “buckets” in view when splitting apart a change: the baseline before all changes, the set of all the changes, and the set you want to include in the commit. Existing diff tools do not really handle this — other than the integrated index-aware diff tools in Git clients, which then have their own oddities when interacting with Jujutsu, since it ignores the index.

# First-class conflicts

Another huge feature of Jujutsu is its support for first-class conflicts.
Instead of a conflict resulting in a nightmare that has to be resolved before you can move on, Jujutsu can incorporate both the merge and its resolution (whether manual or automatic) directly into commit history. Just having the conflicts in history does not seem that weird. “Okay, you committed the text conflict markers from git, neat.” But: having the conflict and its resolution in history, especially when Jujutsu figured out how to do that resolution for you, as part of a rebase operation? That is just plain wild.

A while back, I was working on a change to a library I maintain10 and decided to flip the order in which I landed two changes to package.json. Unfortunately, those changes were adjacent to each other in the file, and so flipping the order they would land in seemed likely to be painfully difficult. It was actually trivial. First of all, the flow itself was great: instead of launching an editor for interactive rebase, I just explicitly told Jujutsu to do the rebases: jj rebase --revision <source> --destination <target>. I did that for each of the items I wanted to reorder and I was done. (I could also have rebased a whole series of commits; I just did not need to in this case.) Literally, that was it: Jujutsu agreed with me that JSON is a terrible format for changes like this, committed a merge conflict, resolved that conflict via the next rebase command, and simply carried on. (A sketch of that reordering flow appears at the end of this section.)

At a mechanical level, Jujutsu will add conflict markers to a file, not unlike those Git adds in merge conflicts. However, unlike Git, those are not just markers in a file. They are part of a system which understands what conflicts are semantically, and therefore also what resolving a conflict is semantically. This not only produces nice automatic outcomes like the one I described with my library above; it also means that you have more options for how to accomplish a resolution, and for how to treat a conflict. Git trains you to see a conflict between two branches as a problem. It requires you to solve that problem before moving on. Jujutsu allows you to treat a conflict as a problem which must be resolved, but it does not require it. Resolving conflicts in merges in Git is often quite messy. It is even worse when rebasing. I have spent an incredible amount of time attempting merges only to give up and git reset --hard <before the merge>, and possibly even more time trying to resolve a conflict in a rebase only to bail with git rebase --abort. Jujutsu allows you to create a merge, leave the conflict in place, and then introduce a resolution in the next commit, telling the whole story with your change history.

[Image: Conflict resolution with merges]

Likewise with a rebase: depending on whether you require all your intermediate revisions to be able to be built or would rather show a history including conflicts, you could choose to rebase, leave all the intermediate changes conflicted, and resolve it only at the end.

[Image: Conflict resolution with rebases]

Conflicts are inevitable when you have enough people working on a repository. Honestly: conflicts happen even when I am working alone in a repository, as suggested by my anecdote above.
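Here is a rough sketch of that reordering flow, under the same assumptions as the anecdote above: two adjacent changes whose order you want to flip. The change IDs a and b are hypothetical stand-ins for whatever jj log shows you.

```
# History: main … <- a <- b, and we want b to land before a.
$ jj rebase --revision b --destination a-   # move b onto a's parent
$ jj rebase --revision a --destination b    # move a on top of b
# If the two changes touch adjacent lines, the intermediate state may be a
# conflict; Jujutsu records it and carries the resolution through the next
# rebase rather than stopping the world.
```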
Having this ability to keep working with the repository even in a conflicted state, as well as to resolve the conflicts in a more interactive and iterative way, is something I now find difficult to live without.

# Changing changes

There are a few other niceties which fall out of Jujutsu’s distinction between changes and commits, especially when combined with first-class conflicts.

First up, jj squash takes all the changes in a given commit and, well, squashes them into the parent of that commit.11 Given a working copy with a bunch of changes, you can move them straight into the parent by just typing jj squash. If you want to squash some change besides the one you are currently editing, you just pass the -r/--revision flag, as with most Jujutsu commands: jj squash -r abc will squash the change identified by abc into its parent. You can also use the --interactive (-i for short) argument to move just a part of a change into its parent. Using that flag will pop up your configured diff editor just like jj split will and allow you to select which items you want to move into the parent and which you want to keep separate. Or, for an even faster option, if you have specific files to move while leaving others alone, and you do not need to handle subsections of those files, you can pass them as the final arguments to the command, like jj squash ./path/a ./path/c.

As it turns out, this ability to move part of one change into a different change is a really useful thing to be able to do in general. I find it particularly handy when building up a set of changes where I want each one to be coherent — say, for the sake of having a commit history which is easy for others to review. You could do that by doing some combination of jj split and jj new --after <some change ID> and then doing jj rebase to move around the changes… but as usual, Jujutsu has a better way. The squash command is actually just a shortcut for Jujutsu’s move command with some arguments filled in. The move command has --from and --to arguments which let you specify which revisions you want to move between. When you run jj squash with no other arguments, that is the equivalent of jj move --from @ --to @-. When you run jj squash -r abc, that is the equivalent of jj move --from abc --to abc-. Since it takes those arguments explicitly, though, move lets you move changes around between any changes. They do not need to be anywhere near each other in history.

[Image: A demo of using jj move]

This eliminates another entire category of places I have historically had to reach for git rebase --interactive. While there are still a few times where I think Jujutsu could use something akin to Git’s interactive rebase mode, they are legitimately few, and mostly to do with wanting to be able to do batch reordering of commits. To be fair, though, I only want to do that perhaps a few times a year.

# Branches

Branches are another of the very significant differences between Jujutsu and Git — another place where Jujutsu acts a bit more like Mercurial, in fact. In Git, everything happens on named branches. You can operate on anonymous branches in Git, but it will yell at you constantly about being on a “detached HEAD”. Jujutsu inverts this. The normal working mode in Jujutsu is just to make a series of changes, which then naturally form “branches” in the change graph, but which do not require a name out of the gate. You can give a branch a name any time, using jj branch create.
That name is just a pointer to the change you pointed it at, though; it does not automatically “follow” you as you do jj new to create new changes. (Readers familiar with Mercurial may recognize that this is very similar to its bookmarks, though without the notion of “active” and “inactive” bookmarks.)

To update what a branch name points to, you use the branch set command. To completely get rid of a branch, including removing it from any remotes you have pushed the branch to, you use the branch delete command. Handily, if you want to forget all your local branch operations (though not the changes they apply to), you can use the branch forget command. That can come in useful when your local copy of a branch has diverged from what is on the remote, and you don’t want to reconcile the changes but just want to get back to whatever is on the remote for that branch. No need for git reset --hard origin/<branch name>; just jj branch forget <branch name> and then the next time you pull from the remote, you will get back its view of the branch!

(It’s not just me who wants this!)

Jujutsu’s defaulting to anonymous branches took me a bit to get used to, after a decade of doing all of my work in Git and of necessity having to do my work on named branches. As with so many things about Jujutsu, though, I have very much come to appreciate this default. In particular, I find this approach makes really good sense for all the steps where I am not yet sharing a set of changes with others. Even once I am sharing the changes with others, Git’s requirement of a branch name can start to feel kind of silly at times. Especially for the case where I am making some small and self-contained change, the name of a given branch is often just some short, snake-case-ified version of the commit message. The default log template shows me the current set of branches, and their commit messages are usually sufficiently informative that I do not need anything else.

However, there are some downsides to this approach in practice, at least given today’s ecosystem. First, the lack of a “current branch” makes for some extra friction when working with tools like GitHub, GitLab, Gitea, and so on. The GitHub model (which other tools have copied) treats branches as the basis for all work. GitHub displays warning messages about commits which are not on a branch, and will not allow you to create a pull request from an anonymous branch. In many ways, this is simply because Git itself treats branches as special and important. GitHub is just following Git’s example of loud warnings about being on a “detached HEAD” commit, after all.

What this means in practice, though, is that there is an extra operation required any time you want to push your changes to GitHub or a similar forge. With Git, you simply git push after making your changes. (More on Git interop below.) Since Git keeps the current branch pointing at the current HEAD, Git aliases git push with no arguments to git push <configured remote for current branch> <current branch>. Jujutsu does not do this, and given how its branching model works today, cannot do this, because named branches do not “follow” your operations. Instead, you must first explicitly set the branch to the commit you want to push. In the most common case, where you are pushing your latest set of changes, that is just jj branch set <branch name>; it takes the current change automatically. Only then can you run jj git push to actually get an update. This is only a paper cut, but it is a paper cut.
It is one extra command every single time you go to push a change to share with others, or even just to get it off of your machine.12 That might not seem like a lot, but it adds up.

There is a real tension in the design space here, though. On the one hand, the main time I use branches in Jujutsu at this point is for pushing to a Git forge like GitHub. I rarely feel the need for them for just working on a set of changes, where jj log and jj new <some revision> give me everything I need. In that sense, it seems like having the branch “follow along” with my work would be natural: if I have gone to the trouble of creating a name for a branch and pushing it to some remote, then it is very likely I want to keep it up to date as I add changes to the branch I named. On the other hand, there is a big upside to not doing that automatically: pushing changes becomes an intentional act. I cannot count the number of times I have been working on what is essentially just an experiment in a Git repo, forgotten to change from the foo-feature branch to a new foo-feature-experiment branch, and then done a git push. Especially if I am collaborating with others on foo-feature, now I have to force push the branch back to its previous state to reset things, and let others know to wait for that, etc. That never happens with the Jujutsu model. Since updating a named branch is always an intentional act, you can experiment to your heart’s content, and know you will never accidentally push changes to a branch that way. I go back and forth: Maybe the little bit of extra friction when you do want to push a branch is worth it for all the times you do not have to consciously move a branch backwards to avoid pushing changes you are not yet ready to share.

(As you might expect, the default of anonymous branches has some knock-on effects for how it interacts with Git tooling in general; I say more on this below.)

Jujutsu also has a handy little feature for when you have done a bunch of work on an anonymous branch and are ready to push it to a Git forge. The jj git push subcommand takes an optional --change/-c flag, which creates a branch based on your current change ID. It works really well when you only have a single change you are going to push and then continually work on, or any time you are content that your current change will remain the tip of the branch. It works a little less well when you are going to add further changes later, because you need to then actually use the branch name with jj branch set push/<change ID> -r <revision>. (A sketch of the overall push flow appears at the end of this section.)

Taking a step back, though, working with branches in Jujutsu is great overall. The branch command is a particularly good lens for seeing what a well-designed CLI is like and how it can make your work easier. Notice that the various commands there are all of the form jj branch <do something>. There are a handful of other branch subcommands not mentioned so far: list, rename, track, and untrack. Git has slowly improved its design here over the past few years, but still lacks the straightforward coherence of Jujutsu’s design. For one thing, all of these are subcommands in Jujutsu, not like Git’s mishmash of flags which can be combined in some cases but not others, and have different meanings depending on where they are deployed. For another, as with the rest of Jujutsu’s CLI structure, they use the same options to mean the same things. If you want to list all the branches which point to a given set of revisions, you use the -r/--revisions flag, exactly like you do with any other command involving revisions in Jujutsu.
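Here is the push flow sketched as a shell session, based on the commands described above (the branch name is hypothetical):

```
# Share the current change on an existing branch:
$ jj branch set my-feature   # point my-feature at the current change
$ jj git push                # push it to the configured Git remote

# Or create and push a generated branch from the current change ID:
$ jj git push --change @
```

The extra jj branch set step is exactly the “paper cut” described above: pushing is always an intentional, two-step act.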
In general, Jujutsu has a very strong and careful distinction between commands (including subcommands) and options. Git does not. The track and untrack subcommands are a perfect example. In Jujutsu, you track a remote branch by running a command like jj branch track <branch>@<remote>. The corresponding Git command is git branch --set-upstream-to <remote>/<branch>. But to list and filter branches in Git, you also pass flags, e.g. git branch --all is the equivalent of jj branch list --all. The Git one is shorter, but also notably less coherent; there is no way to build a mental model for it. With Jujutsu, the mental model is obvious and consistent: jj <command> <options> or jj <context> <command> <options>, where <context> is something like branch or workspace or op (for operation).

# Git interop

Jujutsu’s native backend exists, and every feature has to work with it, so it will some day be a real feature of the VCS. Today, though, the Git backend is the only one you should use. So much so that if you try to run jj init without passing --git, Jujutsu won’t let you by default:

```
> jj init
Error: The native backend is disallowed by default.
Hint: Did you mean to pass `--git`?
Set `ui.allow-init-native` to allow initializing a repo with the native backend.
```

In practice, you are going to be using the Git backend. In practice, I have been using the Git backend for the last seven months, full time, on every one of my personal repositories and all the open source projects I have contributed to. With the sole exception of someone watching me while we pair, no one has noticed, because the Git integration is that solid and robust. This interop means that adoption can be very low friction. Any individual can simply run jj git init --git-repo . in a given Git repository, and start doing their work with Jujutsu instead of Git, and all that work gets translated directly into operations on the Git repository.

Interoperating with Git also means that there is a two-way street between Jujutsu and Git. You can do a bunch of work with jj commands, and then if you hit something you don’t know how to do with Jujutsu yet, you can flip over and do it the way you already know with a git command. When you next run a jj command, like jj status, it will (very quickly!) import the updates from Git and go back about its normal business. The same thing happens when you run commands like jj git fetch to get the latest updates from a Git remote. All the explicit Git interop commands live under a git subcommand: jj git push, jj git fetch, etc. There are a handful of these, including the ability to explicitly ask to synchronize with the Git repository, but the only ones I use on a day to day basis are jj git push and jj git fetch. Notably, there is no jj git pull, because Jujutsu keeps a distinction between getting the latest changes from the server and changing your local copy’s state. I have not missed git pull at all.

This clean interop does not mean that Git sees everything Jujutsu sees, though. Initializing a Jujutsu repo adds a .jj directory to your project, which is where it stores its extra metadata. This, for example, is where Jujutsu keeps track of its own representation of changes, including how any given change has evolved, in terms of the underlying revisions.
In the case of a Git repository, those revisions just are the Git commits, and although you rarely need to work with or name them directly, they have the same SHAs, so any time you would name a specific Git commit, you can reference it directly as a Jujutsu revision as well. (This is particularly handy when bouncing between jj commands and Git-aware tools which know nothing of Jujutsu’s change identifiers.) The .jj directory also includes the operation log, and in the case of a fresh Jujutsu repo (not one created from an existing Git repository), is where the backing Git repo lives.

This Git integration currently runs on libgit2, so there is effectively no risk of breaking your repo because of a Jujutsu–Git interop issue. To be sure, there can be bugs in Jujutsu itself, and you can do things using Jujutsu that will leave you in a bit of a mess, but the same is true of any tool which works on your Git repository. The risk might be very slightly higher here than with your average GUI Git client, since Jujutsu is mapping different semantics onto the repository, but I have extremely high confidence in the project at this point, and I think you can too.

# Is it ready?

Unsurprisingly, given the scale of the problem domain, there are still some rough edges and gaps. For example: commit signing with GPG or SSH does not yet work. There is an open PR for the basics of the feature with GPG support, and SSH support will be straightforward to add once the basics land, but landed it has not.13 The list of actual gaps or missing features is getting short, though. When I started using Jujutsu back in July 2023, there was not yet any support for sparse checkouts or for workspaces (analogous to Git worktrees). Both of those landed in the interval, and there is consistent forward motion from both Google and non-Google contributors. In fact, the biggest gap I see as a regular user in Jujutsu itself is the lack of the kinds of capabilities that will hopefully come once work starts in earnest on the native backend.

The real gaps and rough edges at this point are down to the lack of an ecosystem of tools around Jujutsu, and the ways that existing Git tools interact with Jujutsu’s design for Git interop. The lack of tooling is obvious: no one has built the equivalent of Fork or Tower, and there is no native integration in IDEs like IntelliJ or Visual Studio or in editors like VS Code or Vim. Since Jujutsu currently works primarily in terms of Git, you will get some useful feedback. All of those tools expect to be working in terms of Git’s index and not in terms of a Jujutsu-style working copy, though. Moreover, most of them (unsurprisingly!) share Git’s own confusion about why you are working on a detached HEAD nearly all the time. On the upside, viewing the history of a repo generally works well, with the exception that some tools will not show anonymous branches/detached HEADs other than one you have actively checked out. Detached heads also tend to confuse tools like GitHub’s gh; you will often need to do a bit of extra manual argument-passing to get them to work. (gh pr create --web --head <name> has been showing up in my history a lot for exactly this reason.)

Some of Jujutsu’s very nice features also make other parts of working on mainstream Git forges a bit wonky. For example, notice what each of these operations has in common:

- Inserting changes at arbitrary points.
- Rewording a change description.
- Rebasing a series of changes.
- Splitting apart commits.
- Combining existing commits.

They are all changes to history.
Is it ready?

Unsurprisingly, given the scale of the problem domain, there are still some rough edges and gaps. For example: commit signing with GPG or SSH does not yet work. There is an open PR for the basics of the feature with GPG support, and SSH support will be straightforward to add once the basics land, but landed it has not.13 The list of actual gaps or missing features is getting short, though. When I started using Jujutsu back in July 2023, there was not yet any support for sparse checkouts or for workspaces (analogous to Git worktrees). Both of those landed in the interval, and there is consistent forward motion from both Google and non-Google contributors. In fact, the biggest gap I see as a regular user in Jujutsu itself is the lack of the kinds of capabilities that will hopefully come once work starts in earnest on the native backend.

The real gaps and rough edges at this point are down to the lack of an ecosystem of tools around Jujutsu, and the ways that existing Git tools interact with Jujutsu’s design for Git interop. The lack of tooling is obvious: no one has built the equivalent of Fork or Tower, and there is no native integration in IDEs like IntelliJ or Visual Studio or in editors like VS Code or Vim. Since Jujutsu currently works primarily in terms of Git, you do get some useful feedback from those tools. All of them expect to be working in terms of Git’s index and not in terms of a Jujutsu-style working copy, though. Moreover, most of them (unsurprisingly!) share Git’s own confusion about why you are working on a detached HEAD nearly all the time. On the upside, viewing the history of a repo generally works well, with the exception that some tools will not show anonymous branches/detached HEADs other than the one you have actively checked out. Detached heads also tend to confuse tools like GitHub’s gh; you will often need to do a bit of extra manual argument-passing to get them to work. (gh pr create --web --head <name> has been showing up in my history a lot for exactly this reason.)

Some of Jujutsu’s very nice features also make other parts of working on mainstream Git forges a bit wonky. For example, notice what each of these operations has in common:

- Inserting changes at arbitrary points.
- Rewording a change description.
- Rebasing a series of changes.
- Splitting apart commits.
- Combining existing commits.

They are all changes to history. If you have pushed a branch to a remote, doing any of these operations with changes on that branch and pushing to the remote again will be a force push. Most mainstream Git forges handle force pushing pretty badly. In particular, GitHub has some support for showing diffs between force pushes, but it is very basic and loses all conversational context. As a result, any workflow which makes heavy use of force pushes will be bumpy. Jujutsu is not to blame for the gaps in those tools, but it certainly does expose them.14 Nor do I blame GitHub for the quirks in interop, though. It is not JujutsuLab, after all, and Jujutsu is doing things which do not perfectly map onto the Git model. Since most open source software development happens on forges like GitHub and GitLab, though, these things do regularly come up and cause some friction.

The biggest place I feel this today is in the lack of tools designed to work with Jujutsu around splitting, moving, and otherwise interactively editing changes. Other than @arxanas’ excellent scm-diff-editor (the TUI which Jujutsu bundles for editing diffs on the command line), there are zero good tools for those operations. I mean it when I say scm-diff-editor is excellent, but I also do not love working in a TUI for this kind of thing, so I have cajoled both Kaleidoscope and BBEdit into working to some degree. As I noted when describing how jj split works, though, it is not a particularly good experience. These tools are simply not designed for this workflow. They understand an index, and they do not understand splitting apart changes. Net, we are going to want new tooling which actually understands Jujutsu.

There are opportunities here beyond implementing the same kinds of capabilities that many editors, IDEs, and dedicated VCS viewers provide today for Git. Given a tool in which rebasing, merging, re-describing changes, and so on are all normal and easy operations, GUI tools could make all of those much easier. Any number of the Git GUIs have tried, but Git’s underlying model simply makes it clunky. That does not have to be the case with Jujutsu. Likewise, surfacing things like Jujutsu’s operation and change evolution logs should be much easier than surfacing the Git reflog, and should provide easier ways to recover lost work or simply to change one’s mind.

Conclusion

Jujutsu has become my version control tool of choice since I picked it up over the summer. The rough edges and gaps I described throughout this write-up notwithstanding, I much prefer it to working with Git directly. I do not hesitate to recommend that you try it out on personal or open source projects. Indeed, I actively recommend it! I have used Jujutsu almost exclusively for the past seven months, and I am not sure what would make me go back to using Git other than Jujutsu being abandoned entirely. Given its apparently-bright future at Google, that seems unlikely.15 Moreover, because using it in existing Git repositories is transparent, there is no inherent reason individual developers or teams cannot use it today. (Your corporate security policy might be a different story.)

Is Jujutsu ready for you to roll out at your Fortune 500 company? Probably not. While it is improving at a steady clip — most of the rough edges I hit in mid-2023 are long since fixed — it is still undergoing breaking changes in design here and there, and there is effectively no material out there about how to use it yet. (This essay exists, in part, as an attempt to change that!)
Beyond Jujutsu itself, there is a lot of work to be done to build an ecosystem around it. Most of the remaining rough edges are squarely to do with the lack of understanding from other tools. The project is marching steadily toward a 1.0 release… someday. As for when that might be, there are as far as I know no plans: there is still too much to do. Above all, I am very eager to see what a native Jujutsu backend would look like. Today, it is “just” a much better model for working with Git repos. A world where the same level of smarts being applied to the front end goes into the backend too is a world well worth looking forward to.

Appendix: Kaleidoscope setup and tips

As alluded to above, I have done my best to make it possible to use Kaleidoscope, my beloved diff-and-merge tool, with Jujutsu. I have had only mixed success. The appropriate setup that gives the best results so far:

Add the following to your Jujutsu config (jj config edit --user) to configure Kaleidoscope for the various diff and merge operations:

    [ui]
    diff-editor = ["ksdiff", "--wait", "$left", "--no-snapshot", "$right", "--no-snapshot"]
    merge-editor = ["ksdiff", "--merge", "--output", "$output", "--base", "$base", "--", "$left", "--snapshot", "$right", "--snapshot"]

I will note, however, that I have still not been 100% successful using Kaleidoscope this way. In particular, jj split does not give me the desired results; it often ends up reporting “Nothing changed” when I close Kaleidoscope.

When opening a file diff, you must Option⎇-double-click, not do a normal double-click, so that it will preserve the --no-snapshot behavior. That --no-snapshot argument to ksdiff is what makes the resulting diff editable, which is what Jujutsu needs for its just-edit-a-diff workflow. I have been in touch with the Kaleidoscope folks about this, which is how I even know about this workaround; they are evaluating whether it is possible to make the normal double-click flow preserve the --no-snapshot in this case so you do not have to do the workaround.

Notes

1. Yes, it is written in Rust, and it is pretty darn fast. But Git is written in C, and is also pretty darn fast. There are of course some safety upsides to using Rust here, but Rust is not particularly core to Jujutsu’s “branding”. It was just a fairly obvious choice for a project like this at this point — which is exactly what I have long hoped Rust would become! ↩︎

2. Pro tip for Mac users: add .DS_Store to your ~/.gitignore_global and live a much less annoyed life — whether using Git or Jujutsu. ↩︎

3. I did have one odd hiccup along the way due to a bug (already fixed, though not in a released version) in how Jujutsu handles a failure when initializing in a directory. While confusing, the problem was fixed in the next release… and this is what I expected of still-relatively-early software. ↩︎

4. The plain jj init command is reserved for initializing with the native backend… which is currently turned off. This is absolutely the right call for now, until the native backend is ready, but it is a mild bit of extra friction (and makes the title of this essay a bit amusing until the native backend comes online…). ↩︎

5. This is not quite the same as Git’s HEAD or as Mercurial’s “tip” — there is only one of either of those, and they are not the same as each other! ↩︎

6. If you look at the jj help output today, you will notice that Jujutsu has checkout, merge, and commit commands.
Each is just an alias for a behavior using new, describe, or both, though:

- checkout is just an alias for new.
- commit is just a shortcut for jj describe -m "<some message>" && jj new.
- merge is just jj new with an implicit @ as the first argument.

All of these are going to go away in the medium term, with both documentation and output from the CLI that teach people to use new instead. ↩︎

7. Actually it is normally git ci -am "<message>", with -a for “all” (--all) and -m for the message, smashed together to avoid any needless extra typing. ↩︎

8. The name is from Mercurial’s evolution feature, where it refers to changes which have become obsolescent, thus obslog is the “obsolescent changes log”. I recently suggested to the Jujutsu maintainers that renaming this might be helpful, because it took me six months of daily use to discover this incredibly helpful tool. ↩︎

9. They also enabled support for a three-pane view in Meld, which allegedly makes it somewhat better. However, Meld is pretty janky on macOS (as GTK apps basically always are), and it has a terrible startup time for reasons that are unclear at this point, which means this was not a great experience in the first place… and Meld crashes on launch on the current version of macOS. ↩︎

10. Yes, this is what I do for fun on my time off. At least: partially. ↩︎

11. For people coming from Git, there is also an amend alias, so you can use jj amend instead, but it does the same thing as squash, and in fact the help text for jj amend makes it clear that it just is squash. ↩︎

12. If that sounds like paranoia, well, you only have to lose everything on your machine once, due to someone spilling a whole cup of water on it at a coffee shop, to learn to be a bit paranoid about having off-machine backups of everything. I git push all the time. ↩︎

13. I care about this feature and have some hopes of helping get it across the line myself here in February 2024, but we will see! ↩︎

14. There are plenty of interesting arguments out there about the GitHub collaboration design, alternatives represented by the Phabricator or Gerrit review models, and so on. This piece is long enough without them! ↩︎

15. Google is famous for killing products, but less so developer tools. ↩︎

Thanks:

Waleed Khan (@arxanas), Joy Reynolds (@joyously), and Isabella Basso (@isinyaaa) all took time to read and comment on earlier drafts of this mammoth essay, and it is substantially better for their feedback!

Posted:

This entry was originally published in Essays on February 2, 2024, and last updated on February 8, 2024 (you can see the full revision history here); it was started on July 1, 2023.

Meaningful changes since creating this page:

- February 8, 2024: Updated to use jj git init instead of plain jj init, to match the 0.14 release.
- February 4, 2024: Filled out the section on Git interop. (How did I miss that before publishing?!?)
- February 3, 2024: Added an example of the log format right up front in that section.
- February 2, 2024: Reworked section on revsets and prepared for publication!
- February 1, 2024: Finished a draft!
  Added one and updated another asciinema for some of the basics, finished up the tooling section, and made a bunch of small edits.
- January 31, 2024: Finished describing first-class conflicts and added asciinema recordings showing conflicts with merges and rebases.
- January 30, 2024: Added a ton of material on branches and on CLI design.
- January 29, 2024: Wrote up a section on “changing changes”, focused on the squash and move commands.
- January 29, 2024: Added a section on obslog, and made sure the text was consistent on use of “Jujutsu” vs. jj for the name of the tool vs. command-line invocations.
- January 18, 2024: Made some further structural revisions, removing some now-defunct copy about the original plan, and substantially expanding the conclusion.
- January 16, 2024: jj init is an essay, and I am rewriting it—not a dev journal, but an essay introduction to the tool.
- November 2, 2023: Added a first pass at a conclusion, and started on the restructuring this needs.
- November 1, 2023: Describing a bit about how jj new -A works and integrates with its story for clean rebases.
- October 31, 2023: Filling in what makes jj interesting, and explaining templates a bit.
- August 7, 2023: A first pass at jj describe and jj new.
- August 7, 2023: YODA! And an introduction to the “Rewiring your Git brain” section.
- August 7, 2023: Adding more structure to the piece, and identifying the next pieces to write.
- July 31, 2023: Starting to do some work on the introduction.
- July 24, 2023: Correcting my description of revision behavior per discussion with the maintainer.
- July 24, 2023: Describing my current feelings about the jj split and auto-committed working copy vs. git add --patch (as mediated by a UI).
- July 13, 2023: Elaborated on the development of version control systems (both personally and in general!)… and added a bunch of <abbr> tags.
- July 12, 2023: Added a section on the experience of having first-class merging Just Work™, added an appendix about Kaleidoscope setup and usage, rewrote the paragraph where I previously mentioned the issues about Kaleidoscope, and iterated on the commit-vs.-change distinction.
- July 9, 2023: Rewrote the jj log section to incorporate info about revsets, rewrote a couple of the existing sections now that I have significantly more experience, and added a bunch of notes to myself about what to tackle next in this write-up.
- July 3, 2023: Wrote up some experience notes on actually using jj describe and jj new: this is pretty wild, and I think I like it?
- July 3, 2023: Reorganized and clarified the existing material a bit.
- July 2, 2023: Added some initial notes about initial setup bumps. And a lot of notes on the things I learned in trying to fix those!
# Document TitleSteno & PLAboutPatch terminologyJan 11, 2024Intended audienceDevelopers of version control systems, specifically jj.Those interested in the version control pedagogy.OriginReprint of research originally published on Google Docs.Surveyed five participants from various "big tech" companies (>$1B valuation).Mood Investigative.MethodologyResultsRemarksConclusionsRelated postsCommentsMethodologyQ: I am doing research for a source control project, can you answer what the following nouns mean to you in the context of source control, if anything? (Ordered alphabetically)a “change”a “commit”a “patch”a “revision”ResultsP1 (Google, uses Piper + CLs):Change: a difference in code. What isn’t committed yet.Commit: a cl. Code that is ready to push and has a description along with itPatch: a commit number that that someone made that may or may not be pushed yet […] A change that’s not yoursRevision: a change to a commit?P2 (big tech, uses GitLab + MRs):Change: added/removed/updated filesCommit: a group of related changes with a descriptionPatch: a textual representation of changes between two versionsRevision: a version of the repository, like the state of all files after a set of commitsP3 (Google, uses Fig + CLs):Change: A change to me is any difference in code. Uncommitted to pushed. I’ve heard people say I’ve pushed the change.Commit: A commit is a saved code diff with a description.Patch: A patch is a diff between any two commits how to turn commit a into into b.Revision: Revisions idk. I think at work they are snapshots of a code base so all changes at a point in time.P4 (Microsoft, uses GitHub + PRs):Change: the entire change I want to check into the codebase this can be multiple commits but it’s what I’m putting up for reviewCommit: a portion of my changePatch: a group of commits or a change I want to use on another repo/branchRevision: An id for a group of commits or a single commitP5 (big tech, uses GitHub + PRs):Change: your update to source files in a repositoryCommit: description of changePatch: I don’t really use this but I would think a quick fix (image, imports, other small changes etc)Revision: some number or set of numbers corresponding to changeRemarksTake-aways:Change: People largely don’t think of a “change” as an physical object, rather just a diff or abstract object.It can potentially range from uncommitted to committed to pushed (P1–P5).Unlike others, P4 thinks of it as a larger unit than a commit (more like a “review”), probably due to the GitHub PR workflow.Commit: Universally, commits are considered to have messages. However, the interpretation of a commit as a snapshot vs diff appears to be implicit (compare P2’s “commit” vs “revision”).Patch: Split between interpretations:Either it represents a diff between two versions of the code (P2, P3).Or it’s a higher-level interpretation of a patch as a transmissible change. Particularly for getting a change from someone else (P1), but can also refer to a change that you want to use on a different branch (P4).P5 merely considers a “patch” to be a “small fix”, which is also a generally accepted meaning, although a little imprecise in terms of source control (refers to the intention of the patch, rather than the mechanics of the patch itself).Revision: This is really interesting. The underlying mental models are very different, but the semantic implications end up aligning, more so than for the term “commit”!P1: Not a specific source control term, just “the effect of revising”.P2, P3: Effect of “applying all commits”. 
    This implies that they consider “commits” as diffs and “revisions” as snapshots.
  - P4, P5: Some notion that it’s specifically the identifier of a change/commit. It’s something that you can reference or send to others.

Surprisingly to me, P2–P5 actually all essentially agree that “revision” means a snapshot of the codebase. The mental models are quite different (“accumulation of diffs” vs “stable identifier”) but they refer to the same ultimate result: a specific state of the codebase (…or a way to refer to it — what’s in a name?). This is essentially the opposite of “commit”, where everyone thinks that they agree on what they are, but they’re actually split — roughly evenly? — into snapshot vs diff mental models.

Conclusions

Conclusions for jj:

- We already knew that “change” is a difficult term, syntactically speaking. It’s also now apparent that it’s semantically unclear. Only P4 thought of it as a “reviewable unit”, which would probably most closely match the jj interpretation. We should switch away from this term.
- People are largely settled on what “commits” are in the ways that we thought.
  - There are two main mental models, where participants appear to implicitly consider them to be either snapshots or diffs, as we know.
  - They have to have messages according to participants (unlike in jj, where a commit/change may not yet have a message).
  - It’s possible this is an artifact of the Git mental model, rather than fundamental. We don’t see a lot of confusion when we tell people “your commits can have empty messages”.
  - I think the real implication is that the set of changes is packaged/finalized into one unit, as opposed to “changes”, which might be in flux or not properly packaged into one unit for publishing/sharing.
- Half of respondents think that “patch” primarily refers to a diff, while half think that it refers to a transmissible change.
  - In my opinion, the “transmissible change” interpretation aligns most closely with jj changes at present. In particular, you put those up for review and people can download them.
  - I also think the “diff” interpretation aligns with the jj interpretation (as you can rebase patches around, and the semantic content of the patch doesn’t change); however, there is a great deal of discussion on Discord suggesting that people think of “patches” as immutable, and this doesn’t match the jj semantics where you can rebase them around (IIUC).
  - Overall, I think “patch” is still the best term we have as a replacement for jj “changes” (unless somebody can propose a better one), and it’s clear that we should move away from “change” as a term.
- “Revision” is much more semantically clear than I thought it was. This means that we can adopt/co-opt the existing term and ascribe the specific “snapshot” meaning that we do today.
  - We already do use “revision” in many places, most notably “revsets”. For consistency, we likely want to standardize on “revision” instead of “commit” as a term.

Related posts

The following are hand-curated posts which you might find interesting.

- 19 Jun 2021: git undo: We can do better
- 12 Oct 2021: Lightning-fast rebases with git-move
- 19 Oct 2022: Build-aware sparse checkouts
- 16 Nov 2022: Bringing revsets to Git
- 05 Jan 2023: Where are my Git UI features from the future?
- 11 Jan 2024: (this post) Patch terminology
# Document Title

Mounting git commits as folders with NFS

• git •

Hello! The other day, I started wondering: has anyone ever made a FUSE filesystem for a git repository where every commit is a folder? It turns out the answer is yes! There’s giblefs, GitMounter, and git9 for Plan 9.

But FUSE is pretty annoying to use on Mac – you need to install a kernel extension, and Mac OS seems to be making it harder and harder to install kernel extensions for security reasons. Also I had a few ideas for how to organize the filesystem differently than those projects.

So I thought it would be fun to experiment with ways to mount filesystems on Mac OS other than FUSE, so I built a project that does that called git-commit-folders. It works (at least on my computer) with both FUSE and NFS, and there’s a broken WebDav implementation too.

It’s pretty experimental (I’m not sure if this is actually a useful piece of software to have or just a fun toy to think about how git works) but it was fun to write and I’ve enjoyed using it myself on small repositories, so here are some of the problems I ran into while writing it.

goal: show how commits are like folders

The main reason I wanted to make this was to give folks some intuition for how git works under the hood. After all, git commits really are very similar to folders – every Git commit contains a directory listing of the files in it, and that directory can have subdirectories, etc.

It’s just that git commits aren’t actually implemented as folders, to save disk space.

So in git-commit-folders, every commit is actually a folder, and if you want to explore your old commits, you can do it just by exploring the filesystem! For example, if I look at the initial commit for my blog, it looks like this:

    $ ls commits/8d/8dc0/8dc0cb0b4b0de3c6f40674198cb2bd44aeee9b86/
    README

and a few commits later, it looks like this:

    $ ls /tmp/git-homepage/commits/c9/c94e/c94e6f531d02e658d96a3b6255bbf424367765e9/
    _config.yml  config.rb  Rakefile  rubypants.rb  source

branches are symlinks

In the filesystem mounted by git-commit-folders, commits are the only real folders – everything else (branches, tags, etc) is a symlink to a commit. This mirrors how git works under the hood.

    $ ls -l branches/
    lr-xr-xr-x 59 bork bazil-fuse -> ../commits/ff/ff56/ff563b089f9d952cd21ac4d68d8f13c94183dcd8
    lr-xr-xr-x 59 bork follow-symlink -> ../commits/7f/7f73/7f73779a8ff79a2a1e21553c6c9cd5d195f33030
    lr-xr-xr-x 59 bork go-mod-branch -> ../commits/91/912d/912da3150d9cfa74523b42fae028bbb320b6804f
    lr-xr-xr-x 59 bork mac-version -> ../commits/30/3008/30082dcd702b59435f71969cf453828f60753e67
    lr-xr-xr-x 59 bork mac-version-debugging -> ../commits/18/18c0/18c0db074ec9b70cb7a28ad9d3f9850082129ce0
    lr-xr-xr-x 59 bork main -> ../commits/04/043e/043e90debbeb0fc6b4e28cf8776e874aa5b6e673

    $ ls -l tags/
    lr-xr-xr-x - bork 31 Dec 1969 test-tag -> ../commits/16/16a3/16a3d776dc163aa8286fb89fde51183ed90c71d0

This definitely doesn’t completely explain how git works (there’s a lot more to it than just “a commit is like a folder!”), but my hope is that it makes the idea that “every commit is like a folder with an old version of your code” feel a little more concrete.
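Because branches are plain symlinks, ordinary shell tools can inspect them; a small sketch using the listings above:

```sh
$ readlink branches/main      # where does main point?
../commits/04/043e/043e90debbeb0fc6b4e28cf8776e874aa5b6e673
$ ls branches/main/           # following the symlink drops you into that commit's tree
```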
why might this be useful?

Before I get into the implementation, I want to talk about why having a filesystem with a folder for every git commit in it might be useful. A lot of my projects I end up never really using at all (like dnspeep), but I did find myself using this project a little bit while I was working on it.

The main uses I’ve found so far are:

- searching for a function I deleted – I can run grep someFunction branch_histories/main/*/commit.go to find an old version of it
- quickly looking at a file on another branch to copy a line from it, like vim branches/other-branch/go.mod
- searching every branch for a function, like grep someFunction branches/*/commit.go

All of these are through symlinks to commits instead of referencing commits directly.

None of these are the most efficient way to do this (you can use git show and git log -S or maybe git grep to accomplish something similar), but personally I always forget the syntax, and navigating a filesystem feels easier to me. git worktree also lets you have multiple branches checked out at the same time, but to me it feels weird to set up an entire worktree just to look at 1 file.

Next I want to talk about some problems I ran into.

problem 1: webdav or NFS?

The two filesystems I could find that were natively supported by Mac OS were WebDav and NFS. I couldn’t tell which would be easier to implement, so I just tried both.

At first webdav seemed easier, and it turns out that golang.org/x/net has a webdav implementation, which was pretty easy to set up.

But that implementation doesn’t support symlinks, I think because it uses the io/fs interface and io/fs doesn’t support symlinks yet. Looks like that’s in progress though. So I gave up on webdav and decided to focus on the NFS implementation, using this go-nfs NFSv3 library.

Someone also mentioned that there’s FileProvider on Mac, but I didn’t look into that.

problem 2: how to keep all the implementations in sync?

I was implementing 3 different filesystems (FUSE, NFS, and WebDav), and it wasn’t clear to me how to avoid a lot of duplicated code.

My friend Dave suggested writing one core implementation and then writing adapters (like fuse2nfs and fuse2dav) to translate it into the NFS and WebDav versions. What this looked like in practice is that I needed to implement 3 filesystem interfaces:

- fs.FS for FUSE
- billy.Filesystem for NFS
- webdav.Filesystem for webdav

So I put all the core logic in the fs.FS interface, and then wrote two functions:

    func Fuse2Dav(fs fs.FS) webdav.FileSystem
    func Fuse2NFS(fs fs.FS) billy.Filesystem

All of the filesystems were kind of similar, so the translation wasn’t too hard, there were just 1 million annoying bugs to fix.

problem 3: I didn’t want to list every commit

Some git repositories have thousands or millions of commits. My first idea for how to address this was to make commits/ appear empty, so that it works like this:

    $ ls commits/
    $ ls commits/80210c25a86f75440110e4bc280e388b2c098fbd/
    fuse  fuse2nfs  go.mod  go.sum  main.go  README.md

So every commit would be available if you reference it directly, but you can’t list them. This is a weird thing for a filesystem to do, but it actually works fine in FUSE. I couldn’t get it to work in NFS though.
I assume what’s going on here is that if you tell NFS that a directory is empty, it’ll interpret that as the directory actually being empty, which is fair.

I ended up handling this by:

- organizing the commits by their 2-character prefix the way .git/objects does (so that ls commits shows 0b 03 05 06 07 09 1b 1e 3e 4a), but doing 2 levels of this, so that 18d46e76d7c2eedd8577fae67e3f1d4db25018b0 is at commits/18/18d4/18d46e76d7c2eedd8577fae67e3f1d4db25018b0
- listing all the packed commit hashes only once at the beginning, caching them in memory, and then only updating the loose objects afterwards. The idea is that almost all of the commits in the repo should be packed, and git doesn’t repack its commits very often.

This seems to work okay on the Linux kernel, which has ~1 million commits. It takes maybe a minute to do the initial load on my machine, and then after that it just needs to do fast incremental updates.

Each commit hash is only 20 bytes, so caching 1 million commit hashes isn’t a big deal, it’s just 20MB.

I think a smarter way to do this would be to load the commit listings lazily – Git sorts its packfiles by commit ID, so you can pretty easily do a binary search to find all commits starting with 1b or 1b8c. The git library I was using doesn’t have great support for this though, because listing all commits in a Git repository is a really weird thing to do. I spent maybe a couple of days trying to implement it, but I didn’t manage to get the performance I wanted, so I gave up.

problem 4: “not a directory”

I kept getting this error:

    "/tmp/mnt2/commits/59/59167d7d09fd7a1d64aa1d5be73bc484f6621894/": Not a directory (os error 20)

This really threw me off at first, but it turns out that it just means that there was an error while listing the directory, and the way the NFS library handles that error is with “Not a directory”. This happened a bunch of times, and I just needed to track the bug down every time.

There were a lot of weird errors like this. I also got cd: system call interrupted, which was pretty upsetting, but ultimately it was just some other bug in my program.

Eventually I realized that I could use Wireshark to look at all the NFS packets being sent back and forth, which made some of this stuff easier to debug.

problem 5: inode numbers

At first I was accidentally setting all my directory inode numbers to 0. This was bad, because if you run find on a directory where the inode number of every directory is 0, it’ll complain about filesystem loops and give up, which is very fair.

I fixed this by defining an inode(string) function which hashed a string to get the inode number, and using the tree ID / blob ID as the string to hash.

problem 6: stale file handles

I kept getting this “Stale NFS file handle” error. The problem is that I need to be able to take an opaque 64-byte NFS “file handle” and map it to the right directory.

The way the NFS library I’m using works is that it generates a file handle for every file and caches those references with a fixed-size cache. This works fine for small repositories, but if there are too many files then it’ll overflow the cache and you’ll start getting stale file handle errors.

This is still a problem and I’m not sure how to fix it. I don’t understand how real NFS servers do this, maybe they just have a really big cache?

The NFS file handle is 64 bytes (64 bytes! not bits!), which is pretty big, so it does seem like you could just encode the entire file path in the handle a lot of the time and not cache it at all.
Maybe I’ll try to implement that at some point.

problem 7: branch histories

The branch_histories/ directory only lists the latest 100 commits for each branch right now. Not sure what the right move is there – it would be nice to be able to list the full history of the branch somehow. Maybe I could use a similar subfolder trick to the commits/ directory.

problem 8: submodules

Git repositories sometimes have submodules. I don’t understand anything about submodules, so right now I’m just ignoring them. So that’s a bug.

problem 9: is NFSv4 better?

I built this with NFSv3 because the only Go library I could find at the time was an NFSv3 library. After I was done, I discovered that the buildbarn project has an NFSv4 server in it. Would it be better to use that?

I don’t know if this is actually a problem or how big of an advantage it would be to use NFSv4. I’m also a little unsure about using the buildbarn NFS library because it’s not clear if they expect other people to use it or not.

that’s all!

There are probably more problems I forgot, but that’s all I can think of for now. I may or may not fix the NFS stale file handle problem or the “it takes 1 minute to start up on the Linux kernel” problem, who knows!

Thanks to my friend vasi who explained one million things about filesystems to me.
# Document Title

Dependencies Belong in Version Control

November 25th, 2023

I believe that all project dependencies belong in version control. Source code, binary assets, third-party libraries, and even compiler toolchains. Everything.

The process of building any project should be trivial. Clone repo, invoke build command, and that's it. It shouldn't require a complex configure script, downloading Strawberry Perl, installing Conda, or any of that bullshit.

In fact I'll go one step further. A user should be able to perform a clean OS install, download a zip of master, disconnect from the internet, and build. The build process shouldn't require installing any extra tools or content. If it's something the build needs, then it belongs in version control.

First Instinct

Your gut reaction may be revulsion. That it's not possible. Or that it's unreasonable.

You're not totally wrong. If you're using Git for version control, then committing ten gigabytes of cross-platform compiler toolchains is infeasible.

That doesn't change my claim. Dependencies do belong in version control, even if it's not practical today due to Git's limitations. More on that later.

Why

Why do dependencies belong in version control? I'll give a few reasons:

- Usability
- Reliability
- Reproducibility
- Sustainability

Usability

Committing dependencies makes projects trivial to build and run. I have regularly failed to build open source projects and given up in a fit of frustrated rage.

My background is C++ gamedev. C++ infamously doesn't have a standard build system. Which means every project has its own bullshit build system, project generator, dependency manager, scripting runtimes, etc.

ML and GenAI projects are a god damned nightmare to build. They're so terrible to build that there are countless meta-projects that exist solely to provide one-click installers (example: EasyDiffusion). These installers are fragile and sometimes need to be run several times to succeed.

Commit your dependencies and everything "just works". My extreme frustration with trying, and failing, to build open source projects is what inspired this post.

Reliability

Have you ever had a build fail because of a network error on some third-party server? Commit your dependencies and that will never happen.

There's a whole class of problems that simply disappear when dependencies are committed. Builds won't break because of an OS update. Network errors don't exist. You eliminate "works on my machine" issues because someone didn't have the right version of CUDA installed.

Reproducibility

Builds are much easier to reproduce when version control contains everything. Great build systems are hermetic and allow for deterministic builds. This is only possible when your build doesn't depend on your system environment.

Lockfiles are only a partial solution to reproducibility. Docker images are a poor man's VCS.

Sustainability

Committing dependencies makes it trivial to recreate old builds. God help you if you try to build a webdev stack from 2013.

In video games it's not uncommon to release old games on new platforms. These games can easily be 10 or 20 years old. How many modern projects will be easy to build in 20 years? Hell, how many will be easy to build in 5?

Commit your dependencies and ancient code bases will be as easy to rebuild as possible. Although new platforms will require new code, of course.

Proof of Life

To prove that this isn't completely crazy, I built a proof of life C++ demo.
My program is exceedingly simple:

    #include <fmt/core.h>

    int main() {
        fmt::print("Hello world from C++ 👋\n");
        fmt::print("goodbye cruel world from C++ ☠️\n");
        return 0;
    }

The folder structure looks like this:

    \root
        \sample_cpp_app
            - main.cpp
        \thirdparty
            \fmt (3 MB)
        \toolchains
            \win
                \cmake (106 MB)
                \LLVM (2.5 GB)
                \mingw64 (577 MB)
                \ninja (570 KB)
                \Python311 (20.5 MB)
        - CMakeLists.txt
        - build.bat
        - build.py

The toolchains folder contains five dependencies – CMake, LLVM, MinGW-w64, Ninja, and Python 3.11. Their combined size is 3.19 gigabytes. No effort was made to trim these folders down in size.

The build.bat file nukes all environment variables and sets PATH=C:\Windows\System32;. This ensures only the included toolchains are used to compile.

The end result is a C++ project that "just works".

But Wait There's More

Here's where it gets fun. I wrote a Python script that scans the directory for "last file accessed time" to track "touched files". This lets me check how many toolchain files are actually needed by the build. It produces this output:

    Checking initial file access times... 🥸👨🔬🔬
    Building... 👷♂️💪🛠️
    Compile success! 😁
    Checking new file access times... 🥸👨🔬🔬
    File Access Stats
        Touched 508 files. Total Size: 272.00 MB
        Untouched 23138 files. Total Size: 2.93 GB
        Touched 2.1% of files
        Touched 8.3% of bytes
    Running program...
        Target exe: c:\temp\code\toolchain_vcs\bin\main.exe
    Hello world from C++ 👋
    goodbye cruel world from C++ ☠️
    Built and ran successfully! 😍

Well will you look at that!

Despite committing 3 gigabytes of toolchains we only actually needed a mere 272 megabytes. Well under 10%! Even better, we touched just 2.1% of repo files.

The largest files touched were:

    clang++.exe      [116.04 MB]
    ld.lld.exe       [86.05 MB]
    llvm-ar.exe      [28.97 MB]
    cmake.exe        [11.26 MB]
    libgcc.a         [5.79 MB]
    libstdc++.dll.a  [5.32 MB]
    libmsvcrt.a      [2.00 MB]
    libstdc++-6.dll  [1.93 MB]
    libkernel32.a    [1.27 MB]

My key takeaway is this: toolchain file sizes are tractable for version control if you can trim the fat.

This sparks my joy. Imagine cloning a repo, clicking build, and having it just work. What a wonderful and delightful world that would be!

A Vision for the Future

I'd like to paint a small dream for what I will call Next Gen Version Control Software (NGVCS). This is my vision for a Git/Perforce successor. Here are some of the key features I want NGVCS to have:

- virtual file system to fetch only files a user touches
- copy-on-write file storage
- system cache for NGVCS files

Let's pretend for a moment that every open source project commits their dependencies. Each one contains a full copy of Python, CUDA, Clang, MSVC, libraries, etc. What would happen?

First, the user clones a random GenAI repo. This is near instantaneous, as files are not prefetched. The user then invokes the build script. As files are accessed they're downloaded. The very first build may download a few hundred megabytes of data. Notably, it does NOT download the entire repo. If the user is on Linux it won't download any binaries for macOS or Windows.

Second, the user clones another GenAI repo and builds. Does this need to re-download gigabytes of duplicated toolchain content? No! Both projects use NGVCS, which has a system-wide file cache. Since we're also using a copy-on-write file system, these files instantly materialize in the second repo at zero cost.

The end result is beautiful. Every project is trivial to fetch, build, and run. And users only have to download the minimum set of files to do so.
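To make the scenario concrete, here is what that flow might look like with a hypothetical ngvcs command line; the tool name, subcommands, and URLs are all invented for illustration:

```sh
# hypothetical NGVCS workflow; "ngvcs" does not exist
$ ngvcs clone example.com/genai-repo-a    # instant: no file contents prefetched
$ cd genai-repo-a && ./build.sh           # faults in ~272 MB of toolchain on first use
$ ngvcs clone example.com/genai-repo-b    # second repo: toolchain already in the system cache
$ cd genai-repo-b && ./build.sh           # copy-on-write: shared files materialize at no cost
```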
The Real World and Counter Arguments

Hopefully I've convinced some of you that committing dependencies is at least a good idea in an ideal world.

Now let's consider the real world and a few counter arguments.

The Elephant in the Room – Git

Unfortunately, I must admit that committing dependencies is not practical today. The problem is Git. One of my unpopular opinions is that Git isn't very good. Among its many sins is terrible support for large files and large repositories.

The root issue is that Git's architecture and default behavior expects all users to have a full copy of the entire repo history. Which means every version of every binary toolchain for every platform. Yikes!

There are various workarounds – Git LFS, Git Submodules, shallow clones, partial clones, etc. The problem is these aren't first-class features. They are, imho, second-class hacks. 😓

In theory Git could be updated to more properly support large projects. I believe Git should be shallow and partial by default. Almost all software projects are de facto centralized. Needing full history isn't the default, it's an edge case. Users should opt in to full history only if they need it.

Containers

An alternative to committing dependencies is to use containers. If you build out of a container you get most, if not all, of the benefits. You can even maintain an archive of docker images that reliably re-build tagged releases.

Congrats, you're now using Docker as your VCS!

My snarky opinion is that Docker and friends primarily exist because modern build systems are so god damned fragile that the only way to reliably build and deploy is to create a full OS image. This is insanity!

Containers shouldn't be required simply to build and run projects. It's embarrassing that's the world we live in.

Licensing

Not all dependencies are authorized for redistribution. I believe MSVC and Xcode both disallow redistribution of compiler toolchains? Game consoles like Sony PlayStation and Nintendo Switch don't publicly release headers, libs, or compilers.

This is mostly ok. If you're working on a console project, then you're already working on a closed source project. Developers already use permission controls to gate access.

The lack of redistribution rights for "normal" toolchains is annoying. However, permissive options are available. If committing dependencies becomes common practice, then I think it's likely that toolchain licenses will update to accommodate.

Updating Dependencies

Committing library dependencies to version control means they need to be updated. If you have lots of repos to update, this could be a moderate pain in the ass.

This is also the opposite of how Linux works. In Linux land you use a hot mess of system libraries sprinkled chaotically across the search path. That way, when there is a security fix, you update a single .so (or three) and your system is safe.

I think this is largely a non-issue. Are you building and running your services out of Docker? Do you have a fleet of machines? Do you have lockfiles? Do you compile any third-party libraries from source? If the answer to any of these questions is yes, and it is, then you already have a non-trivial procedure to apply security fixes.

Committing dependencies to VCS doesn't make security updates much harder. In fact, having a monorepo source of truth can make things easier!
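As a sketch of what that procedure might look like with committed dependencies (the vendored path and version here are hypothetical):

```sh
$ git log --oneline -- thirdparty/zlib      # hypothetical vendored library path
$ rm -rf thirdparty/zlib && cp -r ~/src/zlib-1.3.1 thirdparty/zlib
$ git commit -am "Update vendored zlib to 1.3.1 (security fix)"
$ git push    # every checkout now builds against the patched library
```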
DVCS

One of Git's claims to fame is its distributed nature. At long last, developers can commit work from an internetless cafe or airplane!

My NGVCS dream implies de facto centralization, especially for large projects with large histories. Does that mean an internet connection is required? Absolutely not! Even Perforce, the king of centralized VCS, supports offline mode. Git continues to function locally even when working with shallow and partial Git clones.

Offline mode and decentralization are independent concepts. I don't know why so many people get this wrong.

Libraries

Do I really think that every library, such as fmt, should commit gigabytes of compilers to version control?

That's a good question. For languages like Rust, which have a universal build system, probably not. For languages like C++ and Python, maybe yes! It'd be a hell of a lot easier to contribute to open source projects if step 0 wasn't "spend 8 hours configuring environment to build".

For libraries the answer may be "it depends". For executables I think the answer is "yes, commit everything".

Dreams vs Reality

NGVCS is obviously a dream. It doesn't exist today. Actually, that's not quite true. This is exactly how Google and Meta operate today. In fact, numerous large companies have custom NGVCS equivalents for internal use. Unfortunately there isn't a good solution in the public sphere.

Is committing dependencies reasonable for Git users today? The answer is... almost? It's at least closer than most people realize! A full Python deployment is merely tens to hundreds of megabytes. Clang is only a few gigabytes. A 2TB SSD is only $100. I would enthusiastically donate a few gigabytes of hard drive space in exchange for builds that "just work".

Committing dependencies to Git might be possible to do cleanly today with shallow, sparse, and LFS clones. Maybe. It'd be great if you could run git clone --depth=1 --sparse=windows. Maybe someday.

Conclusion

I strongly believe that dependencies belong in version control. I believe it is "The Right Thing". There are significant benefits to usability, reliability, reproducibility, sustainability, and more.

Committing all dependencies to a Git repo may be more practical than you realize. The actual file size is very reasonable.

Improvements to VCS software can allow repos to commit cross-platform dependencies while allowing users to download the bare minimum amount of content. It's the best of everything.

I hope that I have convinced you that committing dependencies and toolchains is "The Right Thing". I hope that version control systems evolve to accommodate this as a best practice.

Thank you.

Bonus Section

If you read this far, thank you! Here are some extra thoughts I wanted to share but couldn't squeeze into the main article.

Sample Project

The sample project can be downloaded via Dropbox as a 636 MB .7zip file. It should be trivial to download and build! Linux and macOS toolchains aren't included because I only have a Windows machine to test on. It's not on GitHub because they have an unnecessary file size limit.

Git LFS

My dream NGVCS has first-class support for all the features I mentioned and more.

Git LFS is, imho, a hacky, second-class citizen. It works and people use it. But it requires a bunch of extra effort and running extra commands.

Deployment

I have a related rant that not only should all dependencies be checked into the build system, but that deployments should also include all dependencies. Yes, deploy 2 GB+ of CUDA DLLs so your exe will reliably run. No, don't force me to use Docker to run your simple Python project.
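Related to the earlier point about shallow and sparse clones: here is a minimal sketch of what Git's real partial-clone and sparse-checkout features can already do today; the repository URL and paths are hypothetical:

```sh
# fetch no blobs up front, then check out only the directories you need
$ git clone --filter=blob:none --no-checkout https://example.com/big-repo.git
$ cd big-repo
$ git sparse-checkout set sample_cpp_app toolchains/win   # hypothetical paths
$ git checkout master    # blobs for just those paths are fetched on demand
```

It is not the single git clone --depth=1 --sparse=windows command wished for above, but it is in the same spirit.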
No, don't force me to use Docker to run your simple Python project.Git AlternativesThere are a handful of interesting Git alternatives in the pipeline.Jujutsu - Git but betterPijul - Somewhat academic patch-based VCSSapling - Open source version of Meta's VCS. Not fully usable outside of Meta infra.Xethub - Git at 100Tb scale to support massive ML modelsGit isn't going to be replaced anytime soon, unfortunately. But there are a variety of projects exploring different ideas. VCS is far from a solved problem. Be open minded!Package ManagersPackage managers are not necessarily a silver bullet. Rust's Cargo is pretty good. NPM is fine I guess. Meanwhile Python's package ecosystem is an absolute disaster. There may be a compile-time vs run-time distinction here.A good package manager is a decent solution. However package managers exist on a largely per-language basis. And sometimes per-platform. Committing dependencies is a guaranteed good solution for all languages on all platforms.Polyglot projects that involve multiple languages need multiple package managers. Yuck.
# Document Titlegit branches: intuition & reality• git •Hello! I’ve been working on writing a zine about git so I’ve been thinking about git branches a lot. I keep hearing from people that they find the way git branches work to be counterintuitive. It got me thinking: what might an “intuitive” notion of a branch be, and how is it different from how git actually works?So in this post I want to briefly talk aboutan intuitive mental model I think many people havehow git actually represents branches internally (“branches are a pointer to a commit” etc)how the “intuitive model” and the real way it works are actually pretty closely relatedsome limits of the intuitive model and why it might cause problemsNothing in this post is remotely groundbreaking so I’m going to try to keep it pretty short.an intuitive model of a branchOf course, people have many different intuitions about branches. Here’s the one that I think corresponds most closely to the physical “a branch of an apple tree” metaphor.My guess is that a lot of people think about a git branch like this: the 2 commits in pink in this picture are on a “branch”.I think there are two important things about this diagram:the branch has 2 commits on itthe branch has a “parent” (main) which it’s an offshoot ofThat seems pretty reasonable, but that’s not how git defines a branch – most importantly, git doesn’t have any concept of a branch’s “parent”. So how does git define a branch?in git, a branch is the full historyIn git, a branch is the full history of every previous commit, not just the “offshoot” commits. So in our picture above both branches (main and branch) have 4 commits on them.I made an example repository at https://github.com/jvns/branch-example which has its branches set up the same way as in the picture above. Let’s look at the 2 branches:main has 4 commits on it:$ git log --oneline main70f727a df654888 c3997a46 ba74606f aand mybranch has 4 commits on it too. The bottom two commits are shared between both branches.$ git log --oneline mybranch13cb960 y9554dab x3997a46 ba74606f aSo mybranch has 4 commits on it, not just the 2 commits 13cb960 and 9554dab that are “offshoot” commits.You can get git to draw all the commits on both branches like this:$ git log --all --oneline --graph* 70f727a (HEAD -> main, origin/main) d* f654888 c| * 13cb960 (origin/mybranch, mybranch) y| * 9554dab x|/* 3997a46 b* a74606f aa branch is stored as a commit IDInternally in git, branches are stored as tiny text files which have a commit ID in them. That commit is the latest commit on the branch. This is the “technically correct” definition I was talking about at the beginning.Let’s look at the text files for main and mybranch in our example repo:$ cat .git/refs/heads/main70f727acbe9ea3e3ed3092605721d2eda8ebb3f4$ cat .git/refs/heads/mybranch13cb960ad86c78bfa2a85de21cd54818105692bcThis makes sense: 70f727 is the latest commit on main and 13cb96 is the latest commit on mybranch.The reason this works is that every commit contains a pointer to its parent(s), so git can follow the chain of pointers to get every commit on the branch.Like I mentioned before, the thing that’s missing here is any relationship at all between these two branches. 
Now that we’ve talked about how the intuitive notion of a branch is “wrong”, I want to talk about how it’s also right in some very important ways.

people’s intuition is usually not that wrong

I think it’s pretty popular to tell people that their intuition about git is “wrong”. I find that kind of silly – in general, even if people’s intuition about a topic is technically incorrect in some ways, people usually have the intuition they do for very legitimate reasons! “Wrong” models can be super useful.

So let’s talk about 3 ways the intuitive “offshoot” notion of a branch matches up very closely with how we actually use git in practice.

rebases use the “intuitive” notion of a branch

Now let’s go back to our original picture.

When you rebase mybranch on main, it takes the commits on the “intuitive” branch (just the 2 pink commits) and replays them onto main.

The result is that just the 2 commits (x and y) get copied. Here’s what that looks like:

    $ git switch mybranch
    $ git rebase main
    $ git log --oneline mybranch
    952fa64 (HEAD -> mybranch) y
    7d50681 x
    70f727a (origin/main, main) d
    f654888 c
    3997a46 b
    a74606f a

Here git rebase has created two new commits (952fa64 and 7d50681) whose information comes from the previous two x and y commits.

So the intuitive model isn’t THAT wrong! It tells you exactly what happens in a rebase.

But because git doesn’t know that mybranch is an offshoot of main, you need to tell it explicitly where to rebase the branch.

merges use the “intuitive” notion of a branch too

Merges don’t copy commits, but they do need a “base” commit: the way merges work is that it looks at two sets of changes (starting from the shared base) and then merges them.

Let’s undo the rebase we just did and then see what the merge base is.

    $ git switch mybranch
    $ git reset --hard 13cb960   # undo the rebase
    $ git merge-base main mybranch
    3997a466c50d2618f10d435d36ef12d5c6f62f57

This gives us the “base” commit where our branch branched off, 3997a4. That’s exactly the commit you would think it might be based on our intuitive picture.

github pull requests also use the intuitive idea

If we create a pull request on GitHub to merge mybranch into main, it’ll also show us 2 commits: the commits x and y. That makes sense and also matches our intuitive notion of a branch.

I assume if you make a merge request on GitLab it shows you something similar.

intuition is pretty good, but it has some limits

This leaves our intuitive definition of a branch looking pretty good actually! The “intuitive” idea of what a branch is matches exactly with how merges and rebases and GitHub pull requests work.

You do need to explicitly specify the other branch when merging or rebasing or making a pull request (like git rebase main), because git doesn’t know what branch you think your offshoot is based on.

But the intuitive notion of a branch has one fairly serious problem: the way you intuitively think about main and an offshoot branch are very different, and git doesn’t know that.

So let’s talk about the different kinds of git branches.

trunk and offshoot branches

To a human, main and mybranch are pretty different, and you probably have pretty different intentions around how you want to use them.

I think it’s pretty normal to think of some branches as being “trunk” branches, and some branches as being “offshoots”.
Also you can have an offshoot of an offshoot.

Of course, git itself doesn’t make any such distinctions (the term “offshoot” is one I just made up!), but what kind of a branch it is definitely affects how you treat it.

For example:

- you might rebase mybranch onto main, but you probably wouldn’t rebase main onto mybranch – that would be weird!
- in general people are much more careful around rewriting the history on “trunk” branches than short-lived offshoot branches

git lets you do rebases “backwards”

One thing I think throws people off about git is – because git doesn’t have any notion of whether a branch is an “offshoot” of another branch, it won’t give you any guidance about if/when it’s appropriate to rebase branch X on branch Y. You just have to know.

For example, you can do either:

    $ git checkout main
    $ git rebase mybranch

or

    $ git checkout mybranch
    $ git rebase main

Git will happily let you do either one, even though in this case git rebase main is extremely normal and git rebase mybranch is pretty weird. A lot of people said they found this confusing, so here’s a picture of the two kinds of rebases:

Similarly, you can do merges “backwards”, though that’s much more normal than doing a backwards rebase – merging mybranch into main and main into mybranch are both useful things to do for different reasons.

Here’s a diagram of the two ways you can merge:

git’s lack of hierarchy between branches is a little weird

I hear the statement “the main branch is not special” a lot, and I’ve been puzzled about it – in most of the repositories I work in, main is pretty special! Why are people saying it’s not?

I think the point is that even though branches do have relationships between them (main is often special!), git doesn’t know anything about those relationships.

You have to tell git explicitly about the relationship between branches every single time you run a git command like git rebase or git merge, and if you make a mistake things can get really weird.

I don’t know whether git’s design here is “right” or “wrong” (it definitely has some pros and cons, and I’m very tired of reading endless arguments about it), but I do think it’s surprising to a lot of people for good reason.

git’s UI around branches is weird too

Let’s say you want to look at just the “offshoot” commits on a branch, which as we’ve discussed is a completely normal thing to want.

Here’s how to see just the 2 offshoot commits on our branch with git log:

    $ git switch mybranch
    $ git log main..mybranch --oneline
    13cb960 (HEAD -> mybranch, origin/mybranch) y
    9554dab x

You can look at the combined diff for those same 2 commits with git diff like this:

    $ git diff main...mybranch

So to see the 2 commits x and y with git log, you need to use 2 dots (..), but to look at the same commits with git diff, you need to use 3 dots (...).
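One way to remember the asymmetry: two dots in git log means "commits reachable from the right side but not the left", and three dots in git diff means "diff starting from the merge base". A quick sketch in the same example repo (decorations omitted):

```sh
$ git log --oneline mybranch..main    # commits on main that mybranch doesn't have
70f727a d
f654888 c
$ git diff main...mybranch            # same as: git diff 3997a46 mybranch
```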
Personally, I can never remember what .. and ... mean, so I just avoid them completely, even though in principle they seem useful.

in GitHub, the default branch is special

Also, it’s worth mentioning that GitHub does have a “special branch”: every github repo has a “default branch” (in git terms, it’s what HEAD points at), which is special in the following ways:

- it’s what you check out when you git clone the repository
- it’s the default destination for pull requests
- github will suggest that you protect the default branch from force pushes

and probably even more that I’m not thinking of.

that’s all!

This all seems extremely obvious in retrospect, but it took me a long time to figure out what a more “intuitive” idea of a branch even might be, because I was so used to the technical “a branch is a reference to a commit” definition.

I also hadn’t really thought about how git makes you tell it about the hierarchy between your branches every time you run a git rebase or git merge command – for me it’s second nature to do that and it’s not a big deal, but now that I’m thinking about it, it’s pretty easy to see how somebody could get mixed up.
# Document Title

RosettaStone

Translations with other distributed VCS

Basic distributed version control

The following commands have the same name in darcs, git and hg, with minor differences due to differences in concepts: init, clone, pull, push, log.

Branching

| concept | darcs | git | hg |
| branch | na | branch | branch |
| switch branch | na [1] | checkout | update |

[1] No in-repo branching yet, see issue555.

Adding, moving, removing files

| concept | darcs | git | hg |
| track file | add | add | add |
| copy file | na | na | copy |
| move/rename file | move | mv | rename |

Inspecting the working directory

| concept | darcs | git | hg |
| working dir status | whatsnew -s | status | status |
| high-level local diff | whatsnew | na | na |
| diff local | diff [1] | diff | diff |

[1] we tend to use the high-level local diff (darcs whatsnew) instead. This displays the patches themselves (eg ‘mv foo bar’) and not just their effects (eg ‘rm foo’ followed by ‘add bar’).

Committing

| concept | darcs | git | hg |
| commit locally | record | commit | commit |
| amend commit | amend | commit --amend | commit --amend |
| tag changes/revisions | tag | tag | tag |

Inspecting the repository history

| concept | darcs | git | hg |
| log | log | log | log |
| log with diffs | log -v | log -p | log -p |
| manifest | show files | ls-files | manifest |
| summarise outgoing changes | push --dry-run | log origin/master.. | outgoing |
| summarise incoming changes | pull --dry-run | log ..origin/master | incoming |
| diff repos or versions | diff | diff | incoming/outgoing/diff -r |
| blame/annotate | annotate | blame | annotate |

Undoing

| concept | darcs | git | hg |
| revert a file | revert foo | checkout foo | revert foo |
| revert full working copy | revert | reset --hard | revert -a |
| undo commit (leaving working copy untouched) | unrecord | reset --soft | rollback |
| amend commit | amend | commit --amend | commit --amend |
| destroy last patch/changeset | obliterate | delete the commit | strip [1] |
| destroy any patch/changeset | obliterate | rebase -i, delete the commit | strip [1] |
| create anti-changeset | rollback | revert | backout |

[1] requires extension (mq for strip)

Collaborating with others

| concept | darcs | git | hg |
| send by mail | send | send-email | email [1] |

[1] requires extension (patchbomb for email)

Advanced usage

| concept | darcs | git | hg |
| port commit to X | rebase | rebase/cherry-pick | transplant |

Translations from Subversion

- svn checkout → darcs clone
- svn update → darcs pull
- svn status -u → darcs pull --dry-run (summarize remote changes)
- svn status → darcs whatsnew --summary (summarize local changes)
- svn status | grep '?' → darcs whatsnew -ls | grep ^a (list potential files to add)
- svn revert foo.txt → darcs revert foo.txt (revert to foo.txt from repo)
- svn diff → darcs whatsnew (for local changes)
- svn diff → darcs diff (for local and recorded changes)
- svn commit → darcs record + darcs push
- svn diff | mail → darcs send
- svn add → darcs add
- svn log → darcs log

Discrepancies between DVCS

Git has the notion of an index (which affects the meanings of some of the commands); Darcs just has its simple branch-is-repo-is-workspace model.

See also

- Features
- Differences from Git
- Differences from Subversion
- Wikipedia’s comparison of revision control systems

Wiki source: darcs get --lazy http://darcs.net/darcs-wiki

Powered by: gitit + darcs
# How git cherry-pick and revert use 3-way merge

Hello! I was trying to explain to someone how `git cherry-pick` works the other day, and I found myself getting confused.

What went wrong was: I thought that `git cherry-pick` was basically applying a patch, but when I tried to actually do it that way, it didn’t work!

Let’s talk about what I thought cherry-pick did (applying a patch), why that’s not quite true, and what it actually does instead (a “3-way merge”).

This post is extremely in the weeds and you definitely don’t need to understand this stuff to use git effectively. But if you (like me) are curious about git’s internals, let’s talk about it!

## cherry-pick isn’t applying a patch

The way I previously understood `git cherry-pick COMMIT_ID` is:

1. calculate the diff for `COMMIT_ID`, like `git show COMMIT_ID --patch > out.patch`
2. apply the patch to the current branch, like `git apply out.patch`

Before we get into this – I want to be clear that this model is mostly right, and if that’s your mental model that’s fine. But it’s wrong in some subtle ways and I think that’s kind of interesting, so let’s see how it works.

If I try to do the “calculate the diff and apply the patch” thing in a case where there’s a merge conflict, here’s what happens:

```
$ git show 10e96e46 --patch > out.patch
$ git apply out.patch
error: patch failed: content/post/2023-07-28-why-is-dns-still-hard-to-learn-.markdown:17
error: content/post/2023-07-28-why-is-dns-still-hard-to-learn-.markdown: patch does not apply
```

This just fails – it doesn’t give me any way to resolve the conflict or figure out how to solve the problem.

This is quite different from what actually happens when I run `git cherry-pick`, which is that I get a merge conflict:

```
$ git cherry-pick 10e96e46
error: could not apply 10e96e46... wip
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
```

So it seems like the “git is applying a patch” model isn’t quite right. But the error message literally does say “could not apply 10e96e46”, so it’s not quite wrong either. What’s going on?

## so what is cherry-pick doing?

I went digging through git’s source code to see how cherry-pick works, and ended up at this line of code:

```
res = do_recursive_merge(r, base, next, base_label, next_label, &head, &msgbuf, opts);
```

So a cherry-pick is a… merge? What? How? What is it even merging? And how does merging even work in the first place?

I realized that I didn’t really know how git’s merge worked, so I googled it and found out that git does a thing called “3-way merge”. What’s that?

## how git merges files: the 3-way merge

Let’s say I want to merge these 2 files. We’ll call them `v1.py` and `v2.py`.

```
def greet():
    greeting = "hello"
    name = "julia"
    return greeting + " " + name
```

```
def say_hello():
    greeting = "hello"
    name = "aanya"
    return greeting + " " + name
```

There are two lines that differ: we have

- `def greet()` and `def say_hello`
- `name = "aanya"` and `name = "julia"`

How do we know what to pick? It seems impossible!

But what if I told you that the original function was this (`base.py`)?

```
def say_hello():
    greeting = "hello"
    name = "julia"
    return greeting + " " + name
```

Suddenly it seems a lot clearer! v1 changed the function’s name to `greet` and v2 set `name = "aanya"`.
So to merge, we should make both those changes:

```
def greet():
    greeting = "hello"
    name = "aanya"
    return greeting + " " + name
```

We can ask git to do this merge with `git merge-file`, and it gives us exactly the result we expected: it picks `def greet()` and `name = "aanya"`.

```
$ git merge-file v1.py base.py v2.py -p
def greet():
    greeting = "hello"
    name = "aanya"
    return greeting + " " + name
```

This way of merging, where you merge 2 files + their original version, is called a 3-way merge.

If you want to try it out yourself in a browser, I made a little playground at jvns.ca/3-way-merge/. I made it very quickly so it’s not mobile friendly.

## git merges changes, not files

The way I think about the 3-way merge is – git merges changes, not files. We have an original file and 2 possible changes to it, and git tries to combine both of those changes in a reasonable way. Sometimes it can’t (for example if both changes change the same line), and then you get a merge conflict.

Git can also merge more than 2 possible changes: you can have an original file and 8 possible changes, and it can try to reconcile all of them. That’s called an octopus merge, but I don’t know much more than that, I’ve never done one.

## how git uses 3-way merge to apply a patch

Now let’s get a little weird! When we talk about git “applying a patch” (as you do in a rebase or revert or cherry-pick), it’s not actually creating a patch file and applying it. Instead, it’s doing a 3-way merge.

Here’s how applying commit X as a patch to your current commit corresponds to the v1, v2, and base setup from before:

- The version of the file in your current commit is v1.
- The version of the file before commit X is base.
- The version of the file in commit X: call that v2.
- Run `git merge-file v1 base v2` to combine them (technically git does not actually run `git merge-file`, it runs a C function that does it).

Together, you can think of base and v2 as being the “patch”: the diff between them is the change that you want to apply to v1.

## how cherry-pick works

Let’s say we have this commit graph, and we want to cherry-pick Y on to main:

```
A - B (main)
 \
  X - Y - Z
```

How do we turn that into a 3-way merge? Here’s how it translates into our v1, v2 and base from earlier:

- B is v1
- X is the base, Y is v2

So together X and Y are the “patch”.

And `git rebase` is just like `git cherry-pick`, but repeated a bunch of times.

## how revert works

Now let’s say we want to run `git revert Y` on this commit graph:

```
X - Y - Z - A - B
```

- B is v1
- Y is the base, X is v2

This is exactly like a cherry-pick, but with X and Y reversed. We have to flip them because we want to apply a “reverse patch”.

Revert and cherry-pick are so closely related in git that they’re actually implemented in the same file: `revert.c`.

## this “3-way patch” is a really cool trick

This trick of using a 3-way merge to apply a commit as a patch seems really clever and cool and I’m surprised that I’d never heard of it before! I don’t know of a name for it, but I kind of want to call it a “3-way patch”.

The idea is that with a 3-way patch, you specify the patch as 2 files: the file before the patch and after (base and v2 in our language in this post).

So there are 3 files involved: 1 for the original and 2 for the patch.

The point is that the 3-way patch is a much better way to patch than a normal patch, because you have a lot more context for merging when you have both full files.

Here’s more or less what a normal patch for our example looks like:

```
@@ -1,1 +1,1 @@
- def greet():
+ def say_hello():
  greeting = "hello"
```

and a 3-way patch.
This “3-way patch” is not a real file format, it’s just something I made up.

BEFORE: (the full file)

```
def greet():
    greeting = "hello"
    name = "julia"
    return greeting + " " + name
```

AFTER: (the full file)

```
def say_hello():
    greeting = "hello"
    name = "julia"
    return greeting + " " + name
```

## “Building Git” talks about this

The book Building Git by James Coglan is the only place I could find, other than the git source code, explaining how git cherry-pick actually uses 3-way merge under the hood (I thought Pro Git might talk about it, but it didn’t seem to as far as I could tell).

I actually went to buy it and it turned out that I’d already bought it in 2019, so it was a good reference to have here :)

## merging is actually much more complicated than this

There’s more to merging in git than the 3-way merge – there’s something called a “recursive merge” that I don’t understand, and there are a bunch of details about how to deal with handling file deletions and moves, and there are also multiple merge algorithms.

My best idea for where to learn more about this stuff is Building Git, though I haven’t read the whole thing.

## so what does git apply do?

I also went looking through git’s source to find out what `git apply` does, and it seems to (unsurprisingly) be in `apply.c`. That code parses a patch file, and then hunts through the target file to figure out where to apply it. The core logic seems to be around here: I think the idea is to start at the line number that the patch suggested and then hunt forwards and backwards from there to try to find it:

```
/*
 * There's probably some smart way to do this, but I'll leave
 * that to the smart and beautiful people. I'm simple and stupid.
 */
backwards = current;
backwards_lno = line;
forwards = current;
forwards_lno = line;
current_lno = line;

for (i = 0; ; i++) {
	...
```

That all seems pretty intuitive and about what I’d naively expect.

## how git apply --3way works

`git apply` also has a `--3way` flag that does a 3-way merge. So we actually could have more or less implemented `git cherry-pick` with `git apply` like this:

```
$ git show 10e96e46 --patch > out.patch
$ git apply out.patch --3way
Applied patch to 'content/post/2023-07-28-why-is-dns-still-hard-to-learn-.markdown' with conflicts.
U content/post/2023-07-28-why-is-dns-still-hard-to-learn-.markdown
```

`--3way` doesn’t just use the contents of the patch file though! The patch file starts with:

```
index d63ade04..65778fc0 100644
```

`d63ade04` and `65778fc0` are the IDs of the old/new versions of that file in git’s object database, so git can retrieve them to do a 3-way patch application. This won’t work if someone emails you a patch and you don’t have the files for the new/old versions of the file though: if you’re missing the blobs you’ll get this error:

```
$ git apply out.patch
error: repository lacks the necessary blob to perform 3-way merge.
```

## 3-way merge is old

A couple of people pointed out that 3-way merge is much older than git, it’s from the late 70s or something. Here’s a paper from 2007 talking about it.

## that’s all!

I was pretty surprised to learn that I didn’t actually understand the core way that git applies patches internally – it was really cool to learn about!

I have lots of issues with git’s UI but I think this particular thing is not one of them.
The 3-way merge seems like a nice unified way to solve a bunch of different problems, and it’s pretty intuitive for people (the idea of “applying a patch” is one that a lot of programmers are used to thinking about, and the fact that it’s implemented as a 3-way merge under the hood is an implementation detail that nobody actually ever needs to think about).

Also a very quick plug: I’m working on writing a zine about git; if you’re interested in getting an email when it comes out, you can sign up to my very infrequent announcements mailing list.
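If you want to see the cherry-pick-as-merge correspondence for yourself, you can reproduce it by hand with `git merge-file`. This is a sketch under assumptions: a file `file.py` that exists in all three commits, and the B/X/Y names from the graphs above standing in for real commit IDs on your machine:

```
# v1 = the file as of B (our current commit)
$ git show B:file.py > v1.py
# base = the file just before the cherry-picked commit Y (i.e., X)
$ git show Y^:file.py > base.py
# v2 = the file as of Y
$ git show Y:file.py > v2.py
# combine them; on a real conflict you get conflict markers, like cherry-pick
$ git merge-file v1.py base.py v2.py -p
```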
# Document Titlegit rebase: what can go wrong?Hello! While talking with folks about Git, I’ve been seeing a comment over and over to the effect of “I hate rebase”. People seemed to feel pretty strongly about this, and I was really surprised because I don’t run into a lot of problems with rebase and I use it all the time.I’ve found that if many people have a very strong opinion that’s different from mine, usually it’s because they have different experiences around that thing from me.So I asked on Mastodon:today I’m thinking about the tradeoffs of using git rebase a bit. I think the goal of rebase is to have a nice linear commit history, which is something I like.but what are the costs of using rebase? what problems has it caused for you in practice? I’m really only interested in specific bad experiences you’ve had here – not opinions or general statements like “rewriting history is bad”I got a huge number of incredible answers to this, and I’m going to do my best to summarize them here. I’ll also mention solutions or workarounds to those problems in cases where I know of a solution. Here’s the list:fixing the same conflict repeatedly is annoyingrebasing a lot of commits is hardundoing a rebase is hardforce pushing to shared branches can cause lost workforce pushing makes code reviews harderlosing commit metadatamore difficult revertsrebasing can break intermediate commitsaccidentally run git commit –amend instead of git rebase –continuesplitting commits in an interactive rebase is hardcomplex rebases are hardrebasing long lived branches can be annoyingrebase and commit disciplinea “squash and merge” workflowmiscellaneous problemsMy goal with this isn’t to convince anyone that rebase is bad and you shouldn’t use it (I’m certainly going to keep using rebase!). But seeing all these problems made me want to be more cautious about recommending rebase to newcomers without explaining how to use it safely. It also makes me wonder if there’s an easier workflow for cleaning up your commit history that’s harder to accidentally mess up.my git workflow assumptionsFirst, I know that people use a lot of different Git workflows. I’m going to be talking about the workflow I’m used to when working on a team, which is:the team uses a central Github/Gitlab repo to coordinatethere’s one central main branch. It’s protected from force pushes.people write code in feature branches and make pull requests to mainThe web service is deployed from main every time a pull request is merged.the only way to make a change to main is by making a pull request on Github/Gitlab and merging itThis is not the only “correct” git workflow (it’s a very “we run a web service” workflow and open source project or desktop software with releases generally use a slightly different workflow). But it’s what I know so that’s what I’ll talk about.two kinds of rebaseAlso before we start: one big thing I noticed is that there were 2 different kinds of rebase that kept coming up, and only one of them requires you to deal with merge conflicts.rebasing on an ancestor, like git rebase -i HEAD^^^^^^^ to squash many small commits into one. As long as you’re just squashing commits, you’ll never have to resolve a merge conflict while doing this.rebasing onto a branch that has diverged, like git rebase main. 
This can cause merge conflicts.

I think it’s useful to make this distinction because sometimes I’m thinking about rebase type 1 (which is a lot less likely to cause problems), but people who are struggling with it are thinking about rebase type 2.

Now let’s move on to all the problems!

## fixing the same conflict repeatedly is annoying

If you make many tiny commits, sometimes you end up in a hellish loop where you have to fix the same merge conflict 10 times. You can also end up fixing merge conflicts totally unnecessarily (like dealing with a merge conflict in code that a future commit deletes).

There are a few ways to make this better:

- first do a `git rebase -i HEAD^^^^^^^^^^^` to squash all of the tiny commits into 1 big commit, and then a `git rebase main` to rebase onto a different branch. Then you only have to fix the conflicts once.
- use `git rerere` to automate repeatedly resolving the same merge conflicts (“rerere” stands for “reuse recorded resolution”: it’ll record your previous merge conflict resolutions and replay them). I’ve never tried this but I think you need to set `git config rerere.enabled true` and then it’ll automatically help you.

Also if I find myself resolving merge conflicts more than once in a rebase, I’ll usually run `git rebase --abort` to stop it and then squash my commits into one and try again.

## rebasing a lot of commits is hard

Generally when I’m doing a rebase onto a different branch, I’m rebasing 1-2 commits. Maybe sometimes 5! Usually there are no conflicts and it works fine.

Some people described rebasing hundreds of commits by many different people onto a different branch. That sounds really difficult and I don’t envy that task.

## undoing a rebase is hard

I heard from several people that when they were new to rebase, they messed up a rebase and permanently lost a week of work that they then had to redo.

The problem here is that undoing a rebase that went wrong is much more complicated than undoing a merge that went wrong (you can undo a bad merge with something like `git reset --hard HEAD^`). Many newcomers to rebase don’t even realize that undoing a rebase is possible, and I think it’s pretty easy to understand why.

That said, it is possible to undo a rebase that went wrong. Here’s an example of how to undo a rebase using `git reflog`.

step 1: Do a bad rebase (for example run `git rebase -i HEAD^^^^^` and just delete 3 commits).

step 2: Run `git reflog`. You should see something like this:

```
ee244c4 (HEAD -> main) HEAD@{0}: rebase (finish): returning to refs/heads/main
ee244c4 (HEAD -> main) HEAD@{1}: rebase (pick): test
fdb8d73 HEAD@{2}: rebase (start): checkout HEAD^^^^^^^
ca7fe25 HEAD@{3}: commit: 16 bits by default
073bc72 HEAD@{4}: commit: only show tooltips on desktop
```

step 3: Find the entry immediately before `rebase (start)`. In my case that’s `ca7fe25`.

step 4: Run `git reset --hard ca7fe25`.

A couple of other ways to undo a rebase:

- Apparently `@` always refers to your current branch in git, so you can run `git reset --hard @{1}` to reset your branch to its previous location.
- Another solution folks mentioned that avoids having to use the reflog is to make a “backup branch” with `git switch -c backup` before rebasing, so you can easily get back to the old commit.

## force pushing to shared branches can cause lost work

A few people mentioned the following situation:

- You’re collaborating on a branch with someone.
- You push some changes.
- They rebase the branch and run `git push --force` (maybe by accident).
- Now when you run `git pull`, it’s a mess – you get a `fatal: Need to specify how to reconcile divergent branches` error.
- While trying to deal with the fallout you might lose some commits, especially if some of the people involved aren’t very comfortable with git.

This is an even worse situation than the “undoing a rebase is hard” situation, because the missing commits might be split across many different people’s machines, and the only thing worse than having to hunt through the reflog is multiple different people having to hunt through the reflog.

This has never happened to me because the only branch I’ve ever collaborated on is main, and main has always been protected from force pushing (in my experience the only way you can get something into main is through a pull request). So I’ve never even really been in a situation where this could happen. But I can definitely see how this would cause problems.

The main tools I know to avoid this are:

- don’t rebase on shared branches
- use `--force-with-lease` when force pushing, to make sure that nobody else has pushed to the branch since your last fetch

Apparently the “since your last fetch” is important here – if you run `git fetch` immediately before running `git push --force-with-lease`, the `--force-with-lease` won’t protect you at all.

I was curious about why people would run `git push --force` on a shared branch. Some reasons people gave were:

- they’re working on a collaborative feature branch, and the feature branch needs to be rebased onto main. The idea here is that you’re just really careful about coordinating the rebase so nothing gets lost.
- as an open source maintainer, sometimes they need to rebase a contributor’s branch to fix a merge conflict
- they’re new to git, read some instructions online that suggested `git rebase` and `git push --force` as a solution, and followed them without understanding the consequences
- they’re used to doing `git push --force` on a personal branch and ran it on a shared branch by accident

## force pushing makes code reviews harder

The situation here is:

- You make a pull request on GitHub.
- People leave some comments.
- You update the code to address the comments, rebase to clean up your commits, and force push.
- Now when the reviewer comes back, it’s hard for them to tell what you changed since the last time they saw it – all the commits show up as “new”.

One way to avoid this is to push new commits addressing the review comments, and then after the PR is approved do a rebase to reorganize everything.

I think some reviewers are more annoyed by this problem than others, it’s kind of a personal preference. Also this might be a GitHub-specific issue; other code review tools might have better tools for managing this.

## losing commit metadata

If you’re rebasing to squash commits, you can lose important commit metadata like `Co-Authored-By`.
Also if you GPG sign your commits, rebase loses the signatures.There’s probably other commit metadata that you can lose that I’m not thinking of.I haven’t run into this one so I’m not sure how to avoid it. I think GPG signing commits isn’t as popular as it used to be.more difficult revertsSomeone mentioned that it’s important for them to be able to easily revert merging any branch (in case the branch broke something), and if the branch contains multiple commits and was merged with rebase, then you need to do multiple reverts to undo the commits.In a merge workflow, I think you can revert merging any branch just by reverting the merge commit.rebasing can break intermediate commitsIf you’re trying to have a very clean commit history where the tests pass on every commit (very admirable!), rebasing can result in some intermediate commits that are broken and don’t pass the tests, even if the final commit passes the tests.Apparently you can avoid this by using git rebase -x to run the test suite at every step of the rebase and make sure that the tests are still passing. I’ve never done that though.accidentally run git commit --amend instead of git rebase --continueA couple of people mentioned issues with running git commit --amend instead of git rebase --continue when resolving a merge conflict.The reason this is confusing is that there are two reasons when you might want to edit files during a rebase:editing a commit (by using edit in git rebase -i), where you need to write git commit --amend when you’re donea merge conflict, where you need to run git rebase --continue when you’re doneIt’s very easy to get these two cases mixed up because they feel very similar. I think what goes wrong here is that you:Start a rebaseRun into a merge conflictResolve the merge conflict, and run git add file.txtRun git commit because that’s what you’re used to doing after you run git addBut you were supposed to run git rebase --continue! Now you have a weird extra commit, and maybe it has the wrong commit message and/or authorsplitting commits in an interactive rebase is hardThe whole point of rebase is to clean up your commit history, and combining commits with rebase is pretty easy. But what if you want to split up a commit into 2 smaller commits? It’s not as easy, especially if the commit you want to split is a few commits back! I actually don’t really know how to do it even though I feel very comfortable with rebase. I’d probably just do git reset HEAD^^^ or something and use git add -p to redo all my commits from scratch.One person shared their workflow for splitting commits with rebase.complex rebases are hardIf you try to do too many things in a single git rebase -i (reorder commits AND combine commits AND modify a commit), it can get really confusing.To avoid this, I personally prefer to only do 1 thing per rebase, and if I want to do 2 different things I’ll do 2 rebases.rebasing long lived branches can be annoyingIf your branch is long-lived (like for 1 month), having to rebase repeatedly gets painful. 
It might be easier to just do 1 merge at the end and only resolve the conflicts once.The dream is to avoid this problem by not having long-lived branches but it doesn’t always work out that way in practice.miscellaneous problemsA few more issues that I think are not that common:Stopping a rebase wrong: If you try to abort a rebase that’s going badly with git reset --hard instead of git rebase --abort, things will behave weirdly until you stop it properlyWeird interactions with merge commits: A couple of quotes about this: “If you rebase your working copy to keep a clean history for a branch, but the underlying project uses merges, the result can be ugly. If you do rebase -i HEAD~4 and the fourth commit back is a merge, you can see dozens of commits in the interactive editor.“, “I’ve learned the hard way to never rebase if I’ve merged anything from another branch”rebase and commit disciplineI’ve seen a lot of people arguing about rebase. I’ve been thinking about why this is and I’ve noticed that people work at a few different levels of “commit discipline”:Literally anything goes, “wip”, “fix”, “idk”, “add thing”When you make a pull request (on github/gitlab), squash all of your crappy commits into a single commit with a reasonable message (usually the PR title)Atomic Beautiful Commits – every change is split into the appropriate number of commits, where each one has a nice commit message and where they all tell a story around the change you’re makingOften I think different people inside the same company have different levels of commit discipline, and I’ve seen people argue about this a lot. Personally I’m mostly a Level 2 person. I think Level 3 might be what people mean when they say “clean commit history”.I think Level 1 and Level 2 are pretty easy to achieve without rebase – for level 1, you don’t have to do anything, and for level 2, you can either press “squash and merge” in github or run git switch main; git merge --squash mybranch on the command line.But for Level 3, you either need rebase or some other tool (like GitUp) to help you organize your commits to tell a nice story.I’ve been wondering if when people argue about whether people “should” use rebase or not, they’re really arguing about which minimum level of commit discipline should be required.I think how this plays out also depends on how big the changes folks are making – if folks are usually making pretty small pull requests anyway, squashing them into 1 commit isn’t a big deal, but if you’re making a 6000-line change you probably want to split it up into multiple commits.a “squash and merge” workflowA couple of people mentioned using this workflow that doesn’t use rebase:make commitsRun git merge main to merge main into the branch periodically (and fix conflicts if necessary)When you’re done, use GitHub’s “squash and merge” feature (which is the equivalent of running git checkout main; git merge --squash mybranch) to squash all of the changes into 1 commit. 
This gets rid of all the “ugly” merge commits.I originally thought this would make the log of commits on my branch too ugly, but apparently git log main..mybranch will just show you the changes on your branch, like this:$ git log main..mybranch756d4af (HEAD -> mybranch) Merge branch 'main' into mybranch20106fd Merge branch 'main' into mybranchd7da423 some commit on my branch85a5d7d some other commit on my branchOf course, the goal here isn’t to force people who have made beautiful atomic commits to squash their commits – it’s just to provide an easy option for folks to clean up a messy commit history (“add new feature; wip; wip; fix; fix; fix; fix; fix;“) without having to use rebase.I’d be curious to hear about other people who use a workflow like this and if it works well.there are more problems than I expectedI went into this really feeling like “rebase is fine, what could go wrong?” But many of these problems actually have happened to me in the past, it’s just that over the years I’ve learned how to avoid or fix all of them.And I’ve never really seen anyone share best practices for rebase, other than “never force push to a shared branch”. All of these honestly make me a lot more reluctant to recommend using rebase.To recap, I think these are my personal rebase rules I follow:stop a rebase if it’s going badly instead of letting it finish (with git rebase --abort)know how to use git reflog to undo a bad rebasedon’t rebase a million tiny commits (instead do it in 2 steps: git rebase -i HEAD^^^^ and then git rebase main)don’t do more than one thing in a git rebase -i. Keep it simple.never force push to a shared branchnever rebase commits that have already been pushed to mainThanks to Marco Rogers for encouraging me to think about the problems people have with rebase, and to everyone on Mastodon who helped with this.
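Putting those rules together, here is roughly what a cautious rebase of a feature branch looks like. This is a sketch, not gospel; the branch names are made up:

```
$ git switch mybranch
$ git switch -c mybranch-backup     # cheap insurance before rewriting history
$ git switch mybranch
$ git rebase -i HEAD~8              # step 1: squash the tiny commits first
$ git rebase main                   # step 2: rebase onto main, fix conflicts once
$ git push --force-with-lease origin mybranch   # never plain --force
```

And if a rebase starts going badly partway through: `git rebase --abort`, not `git reset --hard`.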
# Confusing git terminology

Hello! I’m slowly working on explaining git. One of my biggest problems is that after almost 15 years of using git, I’ve become very used to git’s idiosyncrasies and it’s easy for me to forget what’s confusing about it.

So I asked people on Mastodon:

> what git jargon do you find confusing? thinking of writing a blog post that explains some of git’s weirder terminology: “detached HEAD state”, “fast-forward”, “index/staging area/staged”, “ahead of ‘origin/main’ by 1 commit”, etc

I got a lot of GREAT answers and I’ll try to summarize some of them here. Here’s a list of the terms:

- HEAD and “heads”
- “detached HEAD state”
- “ours” and “theirs” while merging or rebasing
- “Your branch is up to date with ‘origin/main’”
- HEAD^, HEAD~, HEAD^^, HEAD~~, HEAD^2, HEAD~2
- .. and …
- “can be fast-forwarded”
- “reference”, “symbolic reference”
- refspecs
- “tree-ish”
- “index”, “staged”, “cached”
- “reset”, “revert”, “restore”
- “untracked files”, “remote-tracking branch”, “track remote branch”
- checkout
- reflog
- merge vs rebase vs cherry-pick
- rebase --onto
- commit
- more confusing terms

I’ve done my best to explain what’s going on with these terms, but they cover basically every single major feature of git, which is definitely too much for a single blog post, so it’s pretty patchy in some places.

## HEAD and “heads”

A few people said they were confused by the terms HEAD and refs/heads/main, because it sounds like it’s some complicated technical internal thing.

Here’s a quick summary:

- “heads” are “branches”. Internally in git, branches are stored in a directory called `.git/refs/heads`. (Technically the official git glossary says that the branch is all the commits on it and the head is just the most recent commit, but they’re 2 different ways to think about the same thing.)
- HEAD is the current branch. It’s stored in `.git/HEAD`.

I think that “a head is a branch, HEAD is the current branch” is a good candidate for the weirdest terminology choice in git, but it’s definitely too late for a clearer naming scheme, so let’s move on.

There are some important exceptions to “HEAD is the current branch”, which we’ll talk about next.

## “detached HEAD state”

You’ve probably seen this message:

```
$ git checkout v0.1
You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.
[...]
```

Here’s the deal with this message:

- In Git, usually you have a “current branch” checked out, for example main.
- The place the current branch is stored is called HEAD.
- Any new commits you make will get added to your current branch, and if you run `git merge other_branch`, that will also affect your current branch.

But HEAD doesn’t have to be a branch!
Instead it can be a commit ID.Git calls this state (where HEAD is a commit ID instead of a branch) “detached HEAD state”For example, you can get into detached HEAD state by checking out a tag, because a tag isn’t a branchif you don’t have a current branch, a bunch of things break:git pull doesn’t work at all (since the whole point of it is to update your current branch)neither does git push unless you use it in a special waygit commit, git merge, git rebase, and git cherry-pick do still work, but they’ll leave you with “orphaned” commits that aren’t connected to any branch, so those commits will be hard to findYou can get out of detached HEAD state by either creating a new branch or switching to an existing branch“ours” and “theirs” while merging or rebasingIf you have a merge conflict, you can run git checkout --ours file.txt to pick the version of file.txt from the “ours” side. But which side is “ours” and which side is “theirs”?I always find this confusing and I never use git checkout --ours because of that, but I looked it up to see which is which.For merges, here’s how it works: the current branch is “ours” and the branch you’re merging in is “theirs”, like this. Seems reasonable.$ git checkout merge-into-ours # current branch is "ours"$ git merge from-theirs # branch we're merging in is "theirs"For rebases it’s the opposite – the current branch is “theirs” and the target branch we’re rebasing onto is “ours”, like this:$ git checkout theirs # current branch is "theirs"$ git rebase ours # branch we're rebasing onto is "ours"I think the reason for this is that under the hood git rebase main is repeatedly merging commits from the current branch into a copy of the main branch (you can see what I mean by that in this weird shell script the implements git rebase using git merge. But I still find it confusing.This nice tiny site explains the “ours” and “theirs” terms.A couple of people also mentioned that VSCode calls “ours”/“theirs” “current change”/“incoming change”, and that it’s confusing in the exact same way.“Your branch is up to date with ‘origin/main’”This message seems straightforward – it’s saying that your main branch is up to date with the origin!But it’s actually a little misleading. You might think that this means that your main branch is up to date. It doesn’t. What it actually means is – if you last ran git fetch or git pull 5 days ago, then your main branch is up to date with all the changes as of 5 days ago.So if you don’t realize that, it can give you a false sense of security.I think git could theoretically give you a more useful message like “is up to date with the origin’s main as of your last fetch 5 days ago” because the time that the most recent fetch happened is stored in the reflog, but it doesn’t.HEAD^, HEAD~ HEAD^^, HEAD~~, HEAD^2, HEAD~2I’ve known for a long time that HEAD^ refers to the previous commit, but I’ve been confused for a long time about the difference between HEAD~ and HEAD^.I looked it up, and here’s how these relate to each other:HEAD^ and HEAD~ are the same thing (1 commit ago)HEAD^^^ and HEAD~~~ and HEAD~3 are the same thing (3 commits ago)HEAD^3 refers the the third parent of a commit, and is different from HEAD~3This seems weird – why are HEAD~ and HEAD^ the same thing? And what’s the “third parent”? Is that the same thing as the parent’s parent’s parent? (spoiler: it isn’t) Let’s talk about it!Most commits have only one parent. But merge commits have multiple parents – they’re merging together 2 or more commits. 
In Git, HEAD^ means “the parent of the HEAD commit”. But what if HEAD is a merge commit? What does HEAD^ refer to?

The answer is that HEAD^ refers to the first parent of the merge, HEAD^2 is the second parent, HEAD^3 is the third parent, etc.

But I guess they also wanted a way to refer to “3 commits ago”, so HEAD^3 is the third parent of the current commit (which may have many parents if it’s a merge commit), and HEAD~3 is the parent’s parent’s parent.

I think in the context of the merge commit ours/theirs discussion earlier, HEAD^ is “ours” and HEAD^2 is “theirs”.

## .. and …

Here are two commands:

```
git log main..test
git log main...test
```

What’s the difference between `..` and `...`? I never use these so I had to look it up in `man git-range-diff`. It seems like the answer is that in this case:

```
A - B (main)
 \
  C - D (test)
```

- `main..test` is commits C and D
- `test..main` is commit B
- `main...test` is commits B, C, and D

But it gets worse: apparently `git diff` also supports `..` and `...`, but they do something completely different than they do with `git log`? I think the summary is:

- `git log test..main` shows changes on main that aren’t on test, whereas `git log test...main` shows changes on both sides.
- `git diff test..main` shows test changes and main changes (it diffs B and D), whereas `git diff test...main` diffs A and D (it only shows you the diff on one side).

this blog post talks about it a bit more.

## “can be fast-forwarded”

Here’s a very common message you’ll see in `git status`:

```
$ git status
On branch main
Your branch is behind 'origin/main' by 2 commits, and can be fast-forwarded.
  (use "git pull" to update your local branch)
```

What does “fast-forwarded” mean? Basically it’s trying to say that the two branches look something like this (newest commits are on the right):

```
main:        A - B - C
origin/main: A - B - C - D - E
```

or visualized another way:

```
A - B - C - D - E (origin/main)
        |
        main
```

Here origin/main just has 2 extra commits that main doesn’t have, so it’s easy to bring main up to date – we just need to add those 2 commits. Literally nothing can possibly go wrong – there’s no possibility of merge conflicts. A fast forward merge is a very good thing! It’s the easiest way to combine 2 branches.

After running `git pull`, you’ll end up in this state:

```
main:        A - B - C - D - E
origin/main: A - B - C - D - E
```

Here’s an example of a state which can’t be fast-forwarded:

```
A - B - C - X (main)
        |
        - - D - E (origin/main)
```

Here main has a commit that origin/main doesn’t have (X). So you can’t do a fast forward. In that case, `git status` would say:

```
$ git status
Your branch and 'origin/main' have diverged,
and have 1 and 2 different commits each, respectively.
```

## “reference”, “symbolic reference”

I’ve always found the term “reference” kind of confusing. There are at least 3 things that get called “references” in git:

- branches and tags like `main` and `v0.2`
- HEAD, which is the current branch
- things like `HEAD^^^`, which git will resolve to a commit ID.
Technically these are probably not “references”, I guess git calls them “revision parameters” but I’ve never used that term.“symbolic reference” is a very weird term to me because personally I think the only symbolic reference I’ve ever used is HEAD (the current branch), and HEAD has a very central place in git (most of git’s core commands’ behaviour depends on the value of HEAD), so I’m not sure what the point of having it as a generic concept is.refspecsWhen you configure a git remote in .git/config, there’s this +refs/heads/main:refs/remotes/origin/main thing.[remote "origin"]url = git@github.com:jvns/pandas-cookbookfetch = +refs/heads/main:refs/remotes/origin/mainI don’t really know what this means, I’ve always just used whatever the default is when you do a git clone or git remote add, and I’ve never felt any motivation to learn about it or change it from the default.“tree-ish”The man page for git checkout says:git checkout [-f|--ours|--theirs|-m|--conflict=<style>] [<tree-ish>] [--] <pathspec>...What’s tree-ish??? What git is trying to say here is when you run git checkout THING ., THING can be either:a commit ID (like 182cd3f)a reference to a commit ID (like main or HEAD^^ or v0.3.2)a subdirectory inside a commit (like main:./docs)I think that’s it????Personally I’ve never used the “directory inside a commit” thing and from my perspective “tree-ish” might as well just mean “commit or reference to commit”.“index”, “staged”, “cached”All of these refer to the exact same thing (the file .git/index, which is where your changes are staged when you run git add):git diff --cachedgit rm --cachedgit diff --stagedthe file .git/indexEven though they all ultimately refer to the same file, there’s some variation in how those terms are used in practice:Apparently the flags --index and --cached do not generally mean the same thing. I have personally never used the --index flag so I’m not going to get into it, but this blog post by Junio Hamano (git’s lead maintainer) explains all the gnarly detailsthe “index” lists untracked files (I guess for performance reasons) but you don’t usually think of the “staging area” as including untracked files”“reset”, “revert”, “restore”A bunch of people mentioned that “reset”, “revert” and “restore” are very similar words and it’s hard to differentiate them.I think it’s made worse becausegit reset --hard and git restore . on their own do basically the same thing. (though git reset --hard COMMIT and git restore --source COMMIT . are completely different from each other)the respective man pages don’t give very helpful descriptions:git reset: “Reset current HEAD to the specified state”git revert: “Revert some existing commits”git restore: “Restore working tree files”Those short descriptions do give you a better sense for which noun is being affected (“current HEAD”, “some commits”, “working tree files”) but they assume you know what “reset”, “revert” and “restore” mean in this context.Here are some short descriptions of what they each do:git revert COMMIT: Create a new commit that’s the “opposite” of COMMIT on your current branch (if COMMIT added 3 lines, the new commit will delete those 3 lines)git reset --hard COMMIT: Force your current branch back to the state it was at COMMIT, erasing any new changes since COMMIT. 
Very dangerous operation.git restore --source=COMMIT PATH: Take all the files in PATH back to how they were at COMMIT, without changing any other files or commit history.“untracked files”, “remote-tracking branch”, “track remote branch”Git uses the word “track” in 3 different related ways:Untracked files: in the output of git status. This means those files aren’t managed by Git and won’t be included in commits.a “remote tracking branch” like origin/main. This is a local reference, and it’s the commit ID that main pointed to on the remote origin the last time you ran git pull or git fetch.“branch foo set up to track remote branch bar from origin”The “untracked files” and “remote tracking branch” thing is not too bad – they both use “track”, but the context is very different. No big deal. But I think the other two uses of “track” are actually quite confusing:main is a branch that tracks a remoteorigin/main is a remote-tracking branchBut a “branch that tracks a remote” and a “remote-tracking branch” are different things in Git and the distinction is pretty important! Here’s a quick summary of the differences:main is a branch. You can make commits to it, merge into it, etc. It’s often configured to “track” the remote main in .git/config, which means that you can use git pull and git push to push/pull changes.origin/main is not a branch. It’s a “remote-tracking branch”, which is not a kind of branch (I’m sorry). You can’t make commits to it. The only way you can update it is by running git pull or git fetch to get the latest state of main from the remote.I’d never really thought about this ambiguity before but I think it’s pretty easy to see why folks are confused by it.checkoutCheckout does two totally unrelated things:git checkout BRANCH switches branchesgit checkout file.txt discards your unstaged changes to file.txtThis is well known to be confusing and git has actually split those two functions into git switch and git restore (though you can still use checkout if, like me, you have 15 years of muscle memory around git checkout that you don’t feel like unlearning)Also personally after 15 years I still can’t remember the order of the arguments to git checkout main file.txt for restoring the version of file.txt from the main branch.I think sometimes you need to pass -- to checkout as an argument somewhere to help it figure out which argument is a branch and which ones are paths but I never do that and I’m not sure when it’s needed.reflogLots of people mentioning reading reflog as re-flog and not ref-log. I won’t get deep into the reflog here because this post is REALLY long but:“reference” is an umbrella term git uses for branches, tags, and HEADthe reference log (“reflog”) gives you the history of everything a reference has ever pointed toIt can help get you out of some VERY bad git situations, like if you accidentally delete an important branchI find it one of the most confusing parts of git’s UI and I try to avoid needing to use it.merge vs rebase vs cherry-pickA bunch of people mentioned being confused about the difference between merge and rebase and not understanding what the “base” in rebase was supposed to be.I’ll try to summarize them very briefly here, but I don’t think these 1-line explanations are that useful because people structure their workflows around merge / rebase in pretty different ways and to really understand merge/rebase you need to understand the workflows. Also pictures really help. 
That could really be its whole own blog post though so I’m not going to get into it.merge creates a single new commit that merges the 2 branchesrebase copies commits on the current branch to the target branch, one at a time.cherry-pick is similar to rebase, but with a totally different syntax (one big difference is that rebase copies commits FROM the current branch, cherry-pick copies commits TO the current branch)rebase --ontogit rebase has an flag called onto. This has always seemed confusing to me because the whole point of git rebase main is to rebase the current branch onto main. So what’s the extra onto argument about?I looked it up, and --onto definitely solves a problem that I’ve rarely/never actually had, but I guess I’ll write down my understanding of it anyway.A - B - C (main)\D - E - F - G (mybranch)|otherbranchImagine that for some reason I just want to move commits F and G to be rebased on top of main. I think there’s probably some git workflow where this comes up a lot.Apparently you can run git rebase --onto main otherbranch mybranch to do that. It seems impossible to me to remember the syntax for this (there are 3 different branch names involved, which for me is too many), but I heard about it from a bunch of people so I guess it must be useful.commitSomeone mentioned that they found it confusing that commit is used both as a verb and a noun in git.for example:verb: “Remember to commit often”noun: “the most recent commit on main“My guess is that most folks get used to this relatively quickly, but this use of “commit” is different from how it’s used in SQL databases, where I think “commit” is just a verb (you “COMMIT” to end a transaction) and not a noun.Also in git you can think of a Git commit in 3 different ways:a snapshot of the current state of every filea diff from the parent commita history of every previous commitNone of those are wrong: different commands use commits in all of these ways. For example git show treats a commit as a diff, git log treats it as a history, and git restore treats it as a snapshot.But git’s terminology doesn’t do much to help you understand in which sense a commit is being used by a given command.more confusing termsHere are a bunch more confusing terms. I don’t know what a lot of these mean.things I don’t really understand myself:“the git pickaxe” (maybe this is git log -S and git log -G, for searching the diffs of previous commits?)submodules (all I know is that they don’t work the way I want them to work)“cone mode” in git sparse checkout (no idea what this is but someone mentioned it)things that people mentioned finding confusing but that I left out of this post because it was already 3000 words:blob, treethe direction of “merge”“origin”, “upstream”, “downstream”that push and pull aren’t oppositesthe relationship between fetch and pull (pull = fetch + merge)git porcelainsubtreesworktreesthe stash“master” or “main” (it sounds like it has a special meaning inside git but it doesn’t)when you need to use origin main (like git push origin main) vs origin/maingithub terms people mentioned being confused by:“pull request” (vs “merge request” in gitlab which folks seemed to think was clearer)what “squash and merge” and “rebase and merge” do (I’d never actually heard of git merge --squash until yesterday, I thought “squash and merge” was a special github feature)it’s genuinely “every git term”I was surprised that basically every other core feature of git was mentioned by at least one person as being confusing in some way. 
I’d be interested in hearing more examples of confusing git terms that I missed too.There’s another great post about this from 2012 called the most confusing git terminology. It talks more about how git’s terminology relates to CVS and Subversion’s terminology.If I had to pick the 3 most confusing git terms, I think right now I’d pick:a head is a branch, HEAD is the current branch“remote tracking branch” and “branch that tracks a remote” being different thingshow “index”, “staged”, “cached” all refer to the same thingthat’s all!I learned a lot from writing this – I learned a few new facts about git, but more importantly I feel like I have a slightly better sense now for what someone might mean when they say that everything in git is confusing.I really hadn’t thought about a lot of these issues before – like I’d never realized how “tracking” is used in such a weird way when discussing branches.Also as usual I might have made some mistakes, especially since I ended up in a bunch of corners of git that I hadn’t visited before.Also a very quick plug: I’m working on writing a zine about git, if you’re interested in getting an email when it comes out you can sign up to my very infrequent announcements mailing list.
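One concrete way to demystify a few of these terms at once is to look at what HEAD actually is in your own repository. A small sketch (the `main` branch and `v0.1` tag are examples; output will differ in your repo):

```
$ cat .git/HEAD                # HEAD is literally a one-line file
ref: refs/heads/main
$ git symbolic-ref -q HEAD     # prints the branch; fails quietly if detached
refs/heads/main
$ git checkout v0.1            # checking out a tag detaches HEAD
$ cat .git/HEAD                # now it's a bare commit ID, not a ref
f414f31c8a14d79e4f3ab6ba4b9b3454c47cba52
```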
# Unified Versus Split Diff

Oct 23, 2023

Which is better for code reviews, a unified diff or a split diff?

A split diff looks like this for me:

*(screenshot: a side-by-side split diff)*

And this is a unified one:

*(screenshot: a unified diff)*

If the changes are simple and small, both views are good. But for larger, more complex changes neither works for me.

For a large change, I don’t want to do a “diff review”, I want to do a proper code review of a codebase at a particular instant in time, paying specific attention to the recently changed areas, but mostly just doing general review, as if I am writing the code. I need to run tests, use goto definition and other editor navigation features, apply local changes to check if some things could have been written differently, look at the wider context to notice things that should have been changed, and in general notice anything that might be not quite right with the codebase, irrespective of the historical path to the current state of the code.

So, for me, the ideal diff view would look rather like this:

*(screenshot: editor with the current code on the left, changes marked in the margin, and a unified diff of the visible region on the right)*

On the left, the current state of the code (which is also the on-disk state), with changes subtly highlighted in the margins. On the right, the unified diff for the portion of the codebase currently visible on the left.

Sadly, this format of review isn’t well supported by the tools — everyone seems to be happy reviewing diffs, rather than the actual code?

I have a low-tech and pretty inefficient workflow for this style of review. A `gpr` script for checking out a pull request locally:

```
$ gpr 1234 --review
```

Internally, it does roughly

```
$ git fetch upstream refs/pull/1234/head
$ git switch --detach FETCH_HEAD
$ git reset $(git merge-base HEAD main)
```

The last line is the key — it erases all the commits from the pull request, but keeps all of the changes. This lets me abuse my workflow for staging&committing to do a code review — edamagit shows the list of changed files, I get “go to next/previous change” shortcuts in the editor, I can even use the staging area to mark hunks I have reviewed.

The only thing I don’t get is automatic synchronization between the magit status buffer and the file that’s currently open in the editor. That is, to view the current file and the diff on the side, I have to manually open the diff and scroll it to the point I am currently looking at.

I wish it was easier to get this close to the code without building custom ad-hoc tools!

P.S. This post talks about how to review code, but reviewing the code is not necessarily the primary goal of code review. See this related post: Two Kinds of Code Review.
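The post describes `gpr` only in terms of the three commands it runs, so here is a hypothetical minimal version of it as a shell function (the `upstream` remote name and `main` base branch are assumptions, and this is not the author’s actual script):

```sh
# Sketch of the `gpr` helper described above.
gpr() {
  pr="$1"
  # fetch the PR head without creating a local branch
  git fetch upstream "refs/pull/${pr}/head"
  git switch --detach FETCH_HEAD
  # drop the PR's commits but keep their changes in the working tree
  git reset "$(git merge-base HEAD main)"
}
```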
# Some miscellaneous git facts

I’ve been very slowly working on writing about how Git works. I thought I already knew Git pretty well, but as usual when I try to explain something I’ve been learning some new things.

None of these things feel super surprising in retrospect, but I hadn’t thought about them clearly before.

The facts are:

- the “index”, “staging area” and “--cached” are all the same thing
- the stash is a bunch of commits
- not all references are branches or tags
- merge commits aren’t empty

Let’s talk about them!

## the “index”, “staging area” and “--cached” are all the same thing

When you run `git add file.txt`, and then `git status`, you’ll see something like this:

```
$ git add content/post/2023-10-20-some-miscellaneous-git-facts.markdown
$ git status
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	new file:   content/post/2023-10-20-some-miscellaneous-git-facts.markdown
```

People usually call this “staging a file” or “adding a file to the staging area”.

When you stage a file with `git add`, behind the scenes git adds the file to its object database (in `.git/objects`) and updates a file called `.git/index` to refer to the newly added file.

This “staging area” actually gets referred to by 3 different names in Git. All of these refer to the exact same thing (the file `.git/index`):

- `git diff --cached`
- `git diff --staged`
- the file `.git/index`

I felt like I should have realized this earlier, but I didn’t, so there it is.

## the stash is a bunch of commits

When I run `git stash` to stash my changes, I’ve always been a bit confused about where those changes actually went. It turns out that when you run `git stash`, git makes some commits with your changes and labels them with a reference called `stash` (in `.git/refs/stash`).

Let’s stash this blog post and look at the log of the stash reference:

```
$ git log stash --oneline
6cb983fe (refs/stash) WIP on main: c6ee55ed wip
2ff2c273 index on main: c6ee55ed wip
... some more stuff
```

Now we can look at the commit `2ff2c273` to see what it contains:

```
$ git show 2ff2c273 --stat
commit 2ff2c273357c94a0087104f776a8dd28ee467769
Author: Julia Evans <julia@jvns.ca>
Date:   Fri Oct 20 14:49:20 2023 -0400

    index on main: c6ee55ed wip

 content/post/2023-10-20-some-miscellaneous-git-facts.markdown | 40 ++++++++++++++++++++++++++++++++++++++++
```

Unsurprisingly, it contains this blog post. Makes sense!

`git stash` actually creates 2 separate commits: one for the index, and one for your changes that you haven’t staged yet. I found this kind of heartening, because I’ve been working on a tool to snapshot and restore the state of a git repository (that I may or may not ever release) and I came up with a very similar design, so that made me feel better about my choices.

Apparently older commits in the stash are stored in the reflog.

## not all references are branches or tags

Git’s documentation often refers to “references” in a generic way that I find a little confusing sometimes. Personally 99% of the time when I deal with a “reference” in Git it’s a branch or HEAD, and the other 1% of the time it’s a tag. I actually didn’t know ANY examples of references that weren’t branches or tags or HEAD.

But now I know one example – the stash is a reference, and it’s not a branch or tag! So that’s cool.

Here are all the references in my blog’s git repository (other than HEAD):

```
$ find .git/refs -type f
.git/refs/heads/main
.git/refs/remotes/origin/HEAD
.git/refs/remotes/origin/main
.git/refs/stash
```

Some other references people mentioned in responses to this post:

- `refs/notes/*`, from `git notes`
- `refs/pull/123/head` and `refs/pull/123/merge` for GitHub pull requests (which you can get with `git fetch origin refs/pull/123/merge`)
- `refs/bisect/*`, from `git bisect`

## merge commits aren’t empty

Here’s a toy git repo where I created two branches `x` and `y`, each with 1 file (`x.txt` and `y.txt`), and merged them. Let’s look at the merge commit.

```
$ git log --oneline
96a8afb (HEAD -> y) Merge branch 'x' into y
0931e45 y
1d8bd2d (x) x
```

If I run `git show 96a8afb`, the commit looks “empty”: there’s no diff!

```
$ git show 96a8afb
commit 96a8afbf776c2cebccf8ec0dba7c6c765ea5d987 (HEAD -> y)
Merge: 0931e45 1d8bd2d
Author: Julia Evans <julia@jvns.ca>
Date:   Fri Oct 20 14:07:00 2023 -0400

    Merge branch 'x' into y
```

But if I diff the merge commit against each of its two parent commits separately, you can see that of course there is a diff:

```
$ git diff 0931e45 96a8afb --stat
 x.txt | 1 +
 1 file changed, 1 insertion(+)
$ git diff 1d8bd2d 96a8afb --stat
 y.txt | 1 +
 1 file changed, 1 insertion(+)
```

It seems kind of obvious in retrospect that merge commits aren’t actually “empty” (they’re snapshots of the current state of the repo, just like any other commit), but I’d never thought about why they appear to be empty.

Apparently the reason that these merge diffs are empty is that merge diffs only show conflicts – if I instead create a repo with a merge conflict (one branch added `x` and another branch added `y` to the same file), and show the merge commit where I resolved the conflict, it looks like this:

```
$ git show HEAD
commit 3bfe8311afa4da867426c0bf6343420217486594
Merge: 782b3d5 ac7046d
Author: Julia Evans <julia@jvns.ca>
Date:   Fri Oct 20 15:29:06 2023 -0400

    Merge branch 'x' into y

diff --cc file.txt
index 975fbec,587be6b..b680253
--- a/file.txt
+++ b/file.txt
@@@ -1,1 -1,1 +1,1 @@@
- y
 -x
++z
```

It looks like this is trying to tell me that one branch added `x`, another branch added `y`, and the merge commit resolved it by putting `z` instead. But in the earlier example, there was no conflict, so Git didn’t display a diff at all.

(thanks to Jordi for telling me how merge diffs work)

## that’s all!

I’ll keep this post short, maybe I’ll write another blog post with more git facts as I learn them.
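You can also poke at the stash’s commit structure directly. A small sketch, reusing the hashes from the example above (yours will differ): `git cat-file -p` prints the raw commit object, so you can see its two parents, your HEAD commit and the separate index commit:

```
$ git cat-file -p stash
tree 4b825dc6...
parent c6ee55ed...   # the commit you were on when you stashed
parent 2ff2c273...   # the "index on main" commit
author ...

WIP on main: c6ee55ed wip
$ git reflog stash   # older stash entries live in the stash reflog
6cb983fe stash@{0}: WIP on main: c6ee55ed wip
```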
# Document TitleCRDT Survey, Part 2: Semantic TechniquesMatthew Weidner | Oct 17th, 2023Home | RSS FeedKeywords: CRDTs, collaborative apps, semanticsThis blog post is Part 2 of a series.Part 1: IntroductionPart 2: Semantic TechniquesPart 3: Algorithmic TechniquesPart 4: Further Topics# Semantic TechniquesIn Part 1, I defined a collaborative app’s semantics as an abstract definition of what the app’s state should be, given the operations that users have performed.Your choice of semantics should be informed by users’ intents and expectations: if one user does X while an offline user concurrently does Y, what do the users want to happen when they sync up? Even after you figure out specific scenarios, though, it is tricky to design a strategy that is well-defined in every situation (multi-way concurrency, extensive offline work, etc.).CRDT semantic techniques help you with this goal. Like the data structures and design patterns that you learn about when programming single-user apps, these techniques provide valuable guidance, but they are not a replacement for deciding what your app should actually do.The techniques come in various forms:Specific building blocks - e.g., list CRDT positions. (Single-user app analogy: specific data structures like a hash map.)General-purpose ideas that must be applied wisely - e.g., unique IDs. (Single-user analogy: object-oriented programming techniques.)Example semantics for specific parts of a collaborative app - e.g., a list with a move operation. (Single-user analogy: Learning from an existing app’s architecture.)Some of these techniques will be familiar if you’ve read Designing Data Structures for Collaborative Apps, but I promise there are new ones here as well.# Table of ContentsThis post is meant to be usable as a reference. However, some techniques build on prior techniques. I recommend reading linearly until you reach Composed Examples, then hopping around to whatever interests you.Describing SemanticsCausal OrderBasic TechniquesUnique IDs (UIDs) • Append-Only Log • Unique Set • Lists and Text Editing • Last Writer Wins (LWW) • LWW Map • Multi-Value Register • Multi-Value MapComposition TechniquesViews • Objects • Nested Objects • Map-Like Object • Unique Set of CRDTs • List of CRDTsComposed ExamplesAdd-Wins Set • List-with-Move • Internally-Mutable Register • CRDT-Valued Map • Archiving Collections • Update-Wins Collections • Spreadsheet GridAdvanced TechniquesFormatting Marks (Rich Text) • Spreadsheet Formatting • Global Modifiers • Forests and Trees • Undo/RedoOther TechniquesRemove-Wins Set • PN-Set • Observed-Reset Operations • Querying the Causal Order • Topological SortCapstonesRecipe Editor • Block-Based Rich Text# Describing SemanticsI’ll describe a CRDT’s semantics by specifying a pure function of the operation history: a function that inputs the history of operations that users have performed, and outputs the current app-visible state.A box with six "+1"s labeled "Operation history", an arrow labeled "Semantic function", and a large 6 labeled "App state".Note that I don’t expect you to implement a literal “operation history + pure function”; that would be inefficient. Instead, you are supposed to implement an algorithm that gives the same result. E.g., an op-based CRDT that satisfies: whenever a user has received the messages corresponding to operations S, the user’s state matches the pure function applied to S. 
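To make that definition concrete, here is one way to write it down as a formula. The notation is mine, not the survey's: let \(O_u\) be the set of operations user \(u\) has received, and let \(<\) be the causal order on those operations (defined in the next section). Then a CRDT's semantics is a pure function \(f\), and

\[ \mathrm{state}(u) \;=\; f\bigl(O_u,\; <\bigr) \]

Strong convergence is exactly the statement that the right-hand side depends only on the set \(O_u\) and the causal order, not on the order in which the operations happened to arrive.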
I’ll give a few of these algorithms below, and more in Part 3.

More precisely, I’ll describe a CRDT’s semantics as:

1. A collection of operations that users are allowed to perform on the CRDT. Example: Call inc() to increment a counter.
2. For each operation, a translated version that gets stored in the (abstract) operation history. Example: When a user deletes the ingredient at index 0 in an ingredients list, we might instead store the operation Delete the ingredient with unique ID <xyz>.
3. A pure function that inputs a set of translated operations and some ordering metadata (next paragraph), and outputs the intended state of a user who is aware of those operations. Example: A counter’s semantic function inputs the set of inc() operations and outputs its size, ignoring ordering metadata.

The “ordering metadata” is a collection of arrows indicating which operations were aware of each other. E.g., here is a diagram representing the operation history from Part 1:

[Figure: Operations A-G with arrows A to B, B to C, C to D, D to E, C to F, F to G. The labels are: "Add ingr 'Broc: 1 ct' w/ UID <xyz>"; "Add ingr 'Oil: 15 mL' w/ UID <abc>"; "Add ingr 'Salt: 2 mL' w/ UID <123>"; "Delete ingr <xyz>"; "Set amt <123> to 3 mL"; "Prepend 'Olive ' to ingr <abc>"; "Halve the recipe".]

- One user performs a sequence of operations to create the initial recipe.
- After seeing those operations, two users concurrently do Delete ingredient <xyz> and Prepend "Olive " to ingredient <abc>.
- After seeing each other’s operations, the two users do two more concurrent operations.

I’ll use diagrams like this throughout the post to represent operation histories. You can think of them like git commit graphs, except that each point is labeled with its operation instead of its state/hash, and parallel “heads” (the rightmost points) are implicitly merged.

Example: A user who has received the above operation history already sees the result of both heads "Set amt <123> to 3 mL" and "Halve the recipe", even though there is no “merge commit”. If that user performs another operation, it will get arrows from both heads, like an explicit merge commit:

[Figure: Previous figure with an additional operation H labeled "Delete ingr <abc>" and arrows E to H, G to H.]

Describing semantics in terms of a pure function of the operation history lets us sidestep the usual CRDT rules like “concurrent messages must commute” and “the merge function must be idempotent”. Indeed, the point of those rules is to guarantee that a given CRDT algorithm corresponds to some pure function of the operation history (cf. Part 1’s definition of a CRDT). We instead directly say what pure function we want, then define CRDTs to match (or trust you to do so).

Strong convergence is the property that a CRDT’s state is a pure function of the operation history - i.e., users who have received the same set of ops are in equivalent states. Strong Eventual Consistency (SEC) additionally requires that two users who stop performing operations will eventually be in equivalent states; it follows from strong convergence in any network where users eventually exchange operations (Shapiro et al. 2011b).

These properties are necessary for collaborative apps, but they are not sufficient: you still need to check that your CRDT’s specific semantics are reasonable for your app. It is easy to forget this if you get bogged down in e.g. a proof that concurrent messages commute.

# Causal Order

Formally, arrows in our operation history diagrams indicate the “causal order” on operations.
We will use the causal order to define the multi-value register and some later techniques, so if you want a formal definition, read this section first (else you can skip ahead).

The causal order is the partial order < on pairs of operations defined by:

- If a user had received operation o before performing their own operation p, then o < p. This includes the case that they performed both o and p in that order.
- (Transitivity) If o < p and p < q, then o < q.

Our operation histories indicate o < p by drawing an arrow from o to p. Except, we omit arrows that are implied by transitivity - equivalently, by following a sequence of other arrows.

[Figure: Operations A, B, C, D, with arrows from A to B, B to C, A to D, and D to C.]

Figure 1. One user performs operations A, B, C in sequence. After receiving A but not B, another user performs D; the first user receives that before performing C. The causal order is then A < B, A < C, A < D, B < C, D < C. In the figure, the arrow for A < C is implied.

Some derived terms:

- When o < p, we say that o is causally prior to p / o is a causal predecessor of p, and p is causally greater than o.
- When we neither have o < p nor p < o, we say that o and p are concurrent.
- When o < p, you may also see the phrases “o happened-before p” or “p is causally aware of o”.
- o is an immediate causal predecessor of p if o < p and there is no r such that o < r < p. These are precisely the pairs (o, p) connected by an arrow in our operation histories: all non-immediate causal predecessors are implied by transitivity.

In the above figure, B is causally greater than A, causally prior to C, and concurrent to D. The immediate causal predecessors of C are B and D; A is a causal predecessor, but not an immediate one.

It is easy to track the causal order in a CRDT setting: label each operation by IDs for its immediate causal predecessors (the tails of its incoming arrows). Thus when choosing our “pure function of the operation history”, it is okay if that function queries the causal order. We will see an example of this in the multi-value register.

Often, CRDT-based apps choose to enforce causal-order delivery: a user’s app will not process an operation (updating the app’s state) until after processing all causally-prior operations. (An op may be processed in any order relative to concurrent operations.) In other words, operations are processed in causal order. This simplifies programming and makes sense to users, by providing a guarantee called causal consistency. For example, it ensures that if one user adds an ingredient to a recipe and then writes instructions for it, all users will see the ingredient before the instructions. However, there are times when you might choose to forgo causal-order delivery - e.g., when there are undone operations. (More examples in Part 4.)

In Collabs: CRuntime (causal-order delivery), vectorClock (causal order access)

Refs: Lamport 1978

# Basic Techniques

We begin with basic semantic techniques. Most of these were not invented as CRDTs; instead, they are database techniques or programming folklore. It is often easy to implement them yourself or use them outside of a traditional CRDT framework.

# Unique IDs (UIDs)

To refer to a piece of content, assign it an immutable Unique ID (UID). Use that UID in operations involving the content, instead of using a mutable descriptor like its index in a list.

Example: In a recipe editor, assign each ingredient a UID. When a user edits an ingredient’s amount, indicate which ingredient using its UID.
This solves Part 1’s example.

By “piece of content”, I mean anything that the user views as a distinct “thing”, with its own long-lived identity: an ingredient, a spreadsheet cell, a document, etc. Note that the content may be internally mutable. Other analogies:

- Anything that would be its own object in a single-user app. Its UID is the distributed version of a “pointer” to that object.
- Anything that would get its own row in a normalized database table. Its UID functions as the primary key.

To ensure that all users agree on a piece of content’s UID, the content’s creator should assign the UID at creation time and broadcast it. E.g., include a new ingredient’s UID in the corresponding “Add Ingredient” operation. The assigned UID must be unique even if multiple users create UIDs concurrently; you can ensure that by using UUIDs, or Part 3’s dot IDs.

UIDs are useful even in non-collaborative contexts. For example, a single-user spreadsheet formula that references cell B3 should store the UIDs of its column (B) and row (3) instead of the literal string “B3”. That way, the formula still references “the same cell” even if a new row shifts the cell to B4.

# Append-Only Log

Use an append-only log to record events indefinitely. This is a CRDT with a single operation add(x), where x is an immutable value to store alongside the event. Internally, add(x) gets translated to an operation add(id, x), where id is a new UID; this lets you distinguish events with the same values. Given an operation history made of these add(id, x) events, the current state is just the set of all pairs (id, x).

Example: In a delivery tracking system, each package’s history is an append-only log of events. Each event’s value describes what happened (scanned, delivered, etc.) and the wall-clock time. The app displays the events directly to the user in wall-clock time order. Conflicting concurrent operations indicate a real-world conflict and must be resolved manually.

I usually think of an append-only log as unordered, like a set (despite the word “append”). If you do want to display events in a consistent order, you can include a timestamp in the value and sort by that, or use a list CRDT (below) instead of an append-only log. Consider using a logical timestamp like in LWW, so that the order is compatible with the causal order: o < p implies o appears before p.

Refs: Log in Shapiro et al. 2011b

# Unique Set

A unique set is like an append-only log, but it also allows deletes. It is the basis for any collection that grows and shrinks dynamically: sets, lists, certain maps, etc.

Its operations are:

- add(x): Adds an operation add(id, x) to the history, where id is a new UID. (This is the same as the append-only log’s add(x), except that we call the entry an element instead of an event.)
- delete(id), where id is the UID of the element to be deleted.

Given an operation history, the unique set’s state is the set of pairs (id, x) such that there is an add(id, x) operation but no delete(id) operations.

[Figure: Operations A-F with arrows A to B, A to D, B to C, B to E, D to E, E to F. The labels are: add(ac63, "doc/Hund"); add(x72z, "cat/Katze"); delete(ac63); delete(ac63); add(8f8x, "chicken/Huhn"); delete(x72z).]

Figure 2. In a collaborative flash card app, you could represent the deck of cards as a unique set, using x to hold the flash card's value (its front and back strings). Users can edit the deck by adding a new card or deleting an existing one, and duplicate cards are allowed.
Given the above operation history, the current state is { (8f8x, "chicken/Huhn") }.

You can think of the unique set as an obvious way of working with UID-labeled content. It is analogous to a database table with operations to insert and delete rows, using the UID (= primary key) to identify rows. Or, thinking of UIDs like distributed pointers, add and delete are the distributed versions of new and free.

It’s easy to convert the unique set’s semantics to an op-based CRDT.

- Per-user state: The literal state, which is a set of pairs (id, x).
- Operation add(x): Generate a new UID id, then broadcast add(id, x). Upon receiving this message, each user (including the initiator) adds the pair (id, x) to their local state.
- Operation delete(id): Broadcast delete(id). Upon receiving this message, each user deletes the pair with the given id, if it is still present. Note: this assumes causal-order delivery - otherwise, you might receive delete(id) before add(id, x), then forget that the element is deleted.

A state-based CRDT is more difficult; Part 3 will give a nontrivial optimized algorithm.

Refs: U-Set in Shapiro et al. 2011a

# Lists and Text Editing

In collaborative text editing, users can insert (type) and delete characters in an ordered list. Inserting or deleting a character shifts later characters’ indices, in the style of JavaScript’s Array.splice.

The CRDT way to handle this is: assign each character a unique immutable list CRDT position when it’s typed. These positions are a special kind of UID that are ordered: given two positions p and q, you can ask whether p < q or q < p. Then the text’s state is given by:

1. Sort all list elements (position, char) by position.
2. Display the characters in that order.

Classic list CRDTs have operations insert and delete, which are like the unique set’s add and delete operations, except using positions instead of generic UIDs. A text CRDT is the same but with individual text characters for values. See a previous blog post for details.

But the real semantic technique is the positions themselves. Abstractly, they are “opaque things that are immutable and ordered”. To match users’ expectations, list CRDT positions must satisfy a few rules (Attiya et al. 2016):

1. The order is total: if p and q are distinct positions, then either p < q or q < p, even if p and q were created by different users concurrently.
2. If p < q on one user’s device at one time, then p < q on all users’ devices at all times. Example: characters in a collaborative text document do not reverse order, no matter what happens to characters around them.
3. If p < q and q < r, then p < r. This holds even if q is not currently part of the app’s state.

This definition still gives us some freedom in choosing <. The Fugue paper (myself and Martin Kleppmann, 2023) gives a particular choice of < and motivates why we think you should prefer it over any other. Seph Gentle’s Diamond Types and the Braid group’s Sync9 each independently chose nearly identical semantics (thanks to Michael Toomim and Greg Little for bringing the latter to our attention).

List CRDT positions are our first “real” CRDT technique - they don’t come from databases or programming folklore, and it is not obvious how to implement them. Their algorithms have a reputation for difficulty, but you usually only need to understand the “unique immutable position” abstraction, which is simple.
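To illustrate, here is a minimal sketch of that abstraction (my own toy version; a real list CRDT’s positions are opaque and come with a library-provided total order, not plain strings):

```ts
// Toy stand-in for a real list CRDT's opaque, totally ordered positions.
type Position = string;

function comparePositions(p: Position, q: Position): number {
  return p < q ? -1 : p > q ? 1 : 0;
}

type ListElement = { position: Position; char: string };

// The text's state: sort all elements (position, char) by position,
// then display the characters in that order.
function displayText(elements: ListElement[]): string {
  return [...elements]
    .sort((a, b) => comparePositions(a.position, b.position))
    .map((e) => e.char)
    .join("");
}
```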
You can even use list CRDT positions outside of a traditional CRDT framework, e.g., using my list-positions library.

Collabs: CValueList, Position

Refs: Many - see “Background and Related Work” in the Fugue paper

# Last Writer Wins (LWW)

If multiple users set a value concurrently, and there is no better way to resolve this conflict, just pick the “last” value as the winner. This is the Last Writer Wins (LWW) rule.

Example: Two users concurrently change the color of a pixel in a shared whiteboard. Use LWW to pick the final color.

Traditionally, “last” meant “the last value to reach the central database”. In a CRDT setting, instead, when a user performs an operation, their own device assigns a timestamp for that operation. The operation with the greatest assigned timestamp wins: its value is the one displayed to the user.

Formally, an LWW register is a CRDT representing a single variable, with sole operation set(value, timestamp). Given an operation history made of these set operations, the current state is the value with the largest timestamp.

[Figure: Operations A-E with arrows A to B, A to D, B to C, D to C, and D to E. The labels are: none; set("blue", (3, alice)); set("blue", (6, alice)); set("red", (5, bob)); set("green", (7, bob)).]

Figure 3. Possible operation history for an LWW register using logical timestamps (the pairs (3, alice)). The greatest assigned timestamp is (7, bob), so the current state is "green".

The timestamp should usually be a logical timestamp instead of literal wall-clock time (e.g., a Lamport timestamp). Otherwise, clock skew can cause a confusing situation: you try to overwrite the current local value with a new one, but your clock is behind, so the current value remains the winner. Lamport timestamps also build in a tiebreaker so that the winner is never ambiguous.

Let’s make these semantics concrete by converting them to a hybrid op-based/state-based CRDT. Specifically, we’ll do an LWW register with value type T.

- Per-user state: state = { value: T, time: LogicalTimestamp }.
- Operation set(newValue): Broadcast an op-based CRDT message { newValue, newTime }, where newTime is the current logical time. Upon receiving this message, each user (including the initiator) does: If newTime > state.time, set state = { value: newValue, time: newTime }.
- State-based merge: To merge in another user’s state other = { value, time }, treat it like an op-based message: if other.time > state.time, set state = other.

You can check that state.value always comes from the received operation with the greatest assigned timestamp, matching our semantics above.

When using LWW, pay attention to the granularity of writes. For example, in a slide editor, suppose one user moves an image while another user resizes it concurrently. If you implement both actions as writes to a single LWWRegister<{ x, y, width, height }>, then one action will overwrite the other - probably not what you want. Instead, use two different LWW registers, one for { x, y } and one for { width, height }, so that both actions can take effect.

Collabs: lamportTimestamp

Refs: Johnson and Thomas 1976; Shapiro et al. 2011a
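Here is a sketch of that hybrid algorithm in code (my own illustration; generating newTime, by incrementing a local Lamport clock, is elided):

```ts
// Lamport-style timestamps: a counter plus a user ID as a tiebreaker.
type LogicalTimestamp = { counter: number; userID: string };

function greater(a: LogicalTimestamp, b: LogicalTimestamp): boolean {
  if (a.counter !== b.counter) return a.counter > b.counter;
  return a.userID > b.userID; // arbitrary but consistent tiebreaker
}

class LWWRegister<T> {
  constructor(private state: { value: T; time: LogicalTimestamp }) {}

  // Op-based: the setter broadcasts { newValue, newTime }; every user
  // (including the initiator) calls receive() on it.
  receive(newValue: T, newTime: LogicalTimestamp): void {
    if (greater(newTime, this.state.time)) {
      this.state = { value: newValue, time: newTime };
    }
  }

  // State-based merge: treat the other state like an op-based message.
  merge(other: { value: T; time: LogicalTimestamp }): void {
    this.receive(other.value, other.time);
  }

  get value(): T {
    return this.state.value;
  }
}
```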
# LWW Map

An LWW map applies the last-writer-wins rule to each value in a map. Formally, its operations are set(key, value, timestamp) and delete(key, timestamp). The current state is given by:

- For each key, find the operation on key with the largest timestamp.
- If the operation is set(key, value, timestamp), then the value at key is value.
- Otherwise (i.e., the operation is delete(key, timestamp) or there are no operations on key), key is not present in the map.

Observe that a delete operation behaves just like a set operation with a special value. In particular, when implementing the LWW map, it is not safe to forget about deleted keys: you have to remember their latest timestamps as usual, for future LWW comparisons. Otherwise, your semantics might be ill-defined (not a pure function of the operation history), as pointed out by Kleppmann (2022).

In the next section, we’ll see an alternative semantics that does let you forget about deleted keys: the multi-value map.

# Multi-Value Register

This is another “real” CRDT technique, and our first technique that explicitly references the arrows in an operation history (formally, the causal order).

When multiple users set a value concurrently, sometimes you want to preserve all of the conflicting values, instead of just applying LWW.

Example: One user enters a complex, powerful formula in a spreadsheet cell. Concurrently, another user figures out the intended value by hand and enters that. The first user will be annoyed if the second user’s write erases their hard work.

The multi-value register does this, by following the rule: its current value is the set of all values that have not yet been overwritten. Specifically, it has a single operation set(x). Its current state is the set of all values whose operations are at the heads of the operation history (formally, the maximal operations in the causal order). For example, here the current state is { "gray", "blue" }:

[Figure: Operations A-F with arrows A to B, B to C, B to F, C to D, E to B. The labels are: set("green"); set("red"); set("green"); set("gray"); set("purple"); set("blue").]

Multi-values (also called conflicts) are hard to display, so you should have a single value that you show by default. This displayed value can be chosen arbitrarily (e.g. LWW), or by some semantic rule. For example:

- In a bug tracking system, if a bug has multiple conflicting priorities, display the highest one (Zawirski et al. 2016).
- For a boolean value, if any of the multi-values are true, display true. This yields the enable-wins flag CRDT. Alternatively, if any of the multi-values are false, display false (a disable-wins flag).

Other multi-values can be shown on demand, like in Pixelpusher, or just hidden.

As with LWW, pay attention to the granularity of writes.

The multi-value register sounds hard to implement because it references the causal order. But actually, if your app enforces causal-order delivery, then you can easily implement a multi-value register on top of a unique set.

- Per-user state: A unique set uSet of pairs (id, x). The multi-values are all of the x’s.
- Operation set(x): Locally, loop over uSet calling uSet.delete(id) on every existing element. Then call uSet.add(x).

Convince yourself that this gives the same semantics as above.

Collabs: CVar

Refs: Shapiro et al. 2011a; Zawirski et al. 2016
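A sketch of that construction, eliding the network layer (my own illustration): set() shows the ops the local user would broadcast; add()/delete() are the unique set’s ops, applied by every user. newUID() is a hypothetical UID source.

```ts
declare function newUID(): string;

class MultiValueRegister<T> {
  // The unique set's state: pairs (id, x).
  private uSet = new Map<string, T>();

  // Operation set(x): delete every existing element, then add x.
  // Two concurrent set() ops delete each other's predecessors but not
  // each other's new elements - so both values survive as multi-values.
  set(x: T): void {
    for (const id of [...this.uSet.keys()]) this.delete(id);
    this.add(newUID(), x);
  }

  add(id: string, x: T): void {
    this.uSet.set(id, x);
  }
  delete(id: string): void {
    this.uSet.delete(id);
  }

  // The multi-values are all of the x's.
  get values(): T[] {
    return [...this.uSet.values()];
  }
}
```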
# Multi-Value Map

Like the LWW map, a multi-value map applies the multi-value register semantics to each value in a map. Formally, its operations are set(key, value) and delete(key). The current state is given by:

- For each key, consider the operation history restricted to set(key, value) and delete(key) operations.
- Among those operations, restrict to the heads of the operation history. Equivalently, these are the operations that have not been overwritten by another key operation. Formally, they are the maximal operations in the causal order.
- The value at key is the set of all values appearing among the heads’ set operations. If this set is empty, then key is not present in the map.

[Figure: Operations A-G with arrows A to B, C to D, C to F, E to D, E to F, F to G. The labels are: set("display", "block"); delete("display"); set("margin", "0"); set("margin", "10px"); set("margin", "20px"); set("height", "auto"); delete("margin").]

Figure 4. Multi-value map operations on a CSS class. Obviously key "height" maps to the single value "auto", while key "display" is not present in the map. For key "margin", observe that when restricting to its operations, only set("margin", "10px") and delete("margin") are heads of the operation history (i.e., not overwritten); thus "margin" maps to the single value "10px".

As with the multi-value register, each present key can have a displayed value that you show by default. For example, you could apply LWW to the multi-values. That gives a semantics similar to the LWW map, but when you implement it as an op-based CRDT, you can forget about deleted values. (Hint: Implement the multi-value map on top of a unique set like above.)

Collabs: CValueMap

Refs: Kleppmann 2022

# Composition Techniques

We next move on to composition techniques. These create new CRDTs from existing ones.

Composition has several benefits over making a CRDT from scratch:

- Semantically, you are guaranteed that the composed output is actually a CRDT: its state is always a pure function of the operation history (i.e., users who have received the same set of ops are in equivalent states).
- Algorithmically, you get op-based and state-based CRDT algorithms “for free” from the components. Those components are probably already optimized and tested.
- It is much easier to add a new system feature (e.g., undo/redo) to a few basic CRDTs and composition techniques, than to add it to your app’s top-level state directly.

In particular, it is safe to use a composed algorithm that appears to work well in the situations you care about (e.g., all pairs of concurrent operations), even if you are not sure what it will do in arbitrarily complex scenarios. You are guaranteed that it will at least satisfy strong convergence and have equivalent op-based vs state-based behaviors.

Like most of our basic techniques, these composition techniques are not really CRDT-specific, and you can easily use them outside of a traditional CRDT framework. Figma’s collaboration system is a good example of this.

# Views

Not all app states have a good CRDT representation. But often you can store some underlying state as a CRDT, then compute your app’s state as a view (pure function) of that CRDT state.

Example: Suppose a collaborative text editor represents its state as a linked list of characters. Storing the linked list directly as a CRDT would cause trouble: concurrent operations can easily cause broken links, partitions, and cycles. Instead, store a traditional list CRDT, then construct the linked list representation as a view of that at runtime.

At runtime, one way to obtain the view is to apply a pure function to your CRDT state each time that CRDT state changes, or each time the view is requested. This should sound familiar to web developers (React, Elm, …).

Another way is to “maintain” the view, updating it incrementally each time the CRDT state changes. CRDT libraries usually emit “events” that make this possible.
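As a minimal sketch of the first (recompute) approach, with a hypothetical change-event API standing in for a real library’s events:

```ts
// Hypothetical CRDT API, for illustration: a list CRDT that exposes its
// current values and emits a "change" event after each received op.
declare const listCRDT: {
  values: string[];
  on(event: "change", handler: () => void): void;
};

type LinkedListNode = { char: string; next: LinkedListNode | null };

// Pure function from the CRDT's state to the app's linked-list view.
function computeView(chars: string[]): LinkedListNode | null {
  let head: LinkedListNode | null = null;
  for (let i = chars.length - 1; i >= 0; i--) {
    head = { char: chars[i], next: head };
  }
  return head;
}

// Recompute the whole view each time the CRDT state changes.
let view = computeView(listCRDT.values);
listCRDT.on("change", () => {
  view = computeView(listCRDT.values);
});
```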
View maintenance is a known hard problem, but it is not hard in a CRDT-specific way. Also, it is easy to unit test: you can always compare to the pure-function approach.

Collabs: Events

# Objects

It is natural to wrap a CRDT in an app-specific API: when the user performs an operation in the app, call a corresponding CRDT operation; in the GUI, render an app-specific view of the CRDT’s state.

More generally, you can create a new CRDT by wrapping multiple CRDTs in a single API. I call this object composition. The individual CRDTs (the components) are just used side-by-side; they don’t affect each others’ operations or states.

Example: An ingredient like we saw in Part 1 (reproduced below) can be modeled as the object composition of three CRDTs: a text CRDT for the text, an LWW register for the amount, and another LWW register for the units.

[Figure: An ingredient with contents Olive Oil, 15, mL.]

To distinguish the component CRDTs’ operations, assign each component a distinct name. Then tag each component’s operations with its name:

[Figure: Operations on an ingredient, labeled by component name. "text: insert(...)", "amount: set(15, (5, alice))", "units: set('mL', (6, alice))", "units: set('g', (3, bob))".]

One way to think of the composed CRDT is as a literal CRDT object - a class whose instance fields are the component CRDTs:

```
class IngredientCRDT extends CRDTObject {
  text: TextCRDT;
  amount: LWWRegister<number>;
  units: LWWRegister<Unit>;

  setAmount(newAmount: number) {
    this.amount.set(newAmount);
  }

  ...
}
```

Another way to think of the composed state is as a JSON object mapping names to component states:

```
{
  text: {<text CRDT state...>},
  amount: { value: number, time: LogicalTimestamp },
  units: { value: Unit, time: LogicalTimestamp }
}
```

Collabs: CObject

Refs: See Map-Like Object refs below

# Nested Objects

You can nest objects arbitrarily. This leads to layered object-oriented architectures:

```
class SlideImageCRDT extends CRDTObject {
  dimensions: DimensionCRDT;
  contents: ImageContentsCRDT;
}

class DimensionCRDT extends CRDTObject {
  position: LWWRegister<{ x: number, y: number }>;
  size: LWWRegister<{ width: number, height: number }>;
}

class ImageContentsCRDT ...
```

or to JSON-like trees:

```
{
  dimensions: {
    height: { value: number, time: LogicalTimestamp },
    width: { value: number, time: LogicalTimestamp }
  },
  contents: {
    ...
  }
}
```

Either way, tag each operation with the tree-path leading to its leaf CRDT. For example, to set the width to 75 pixels: { path: "dimensions/width", op: "set('75px', (11, alice))" }.

# Map-Like Object

Instead of a fixed number of component CRDTs with fixed names, you can allow names drawn from some large set (possibly infinite). This gives you a form of CRDT-valued map, which I will call a map-like object. Each map key functions as the name for its own value CRDT.

Example: A geography app lets users add a description to any address on earth. You can model this as a map from address to text CRDT. The map behaves the same as an object that has a text CRDT instance field per address.

The difference from a CRDT object is that in a map-like object, you don’t store every value CRDT explicitly. Instead, each value CRDT exists implicitly, in some default state, until used. In the JSON representation, this leads to behavior like Firebase RTDB, where

```
{ foo: {/* Empty text CRDT */}, bar: {<text CRDT state...>} }
```

is indistinguishable from

```
{ bar: {<text CRDT state...>} }
```

Note that unlike an ordinary map, a map-like object does not have operations to set/delete a key; each key implicitly always exists, with a pre-set value CRDT.
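A sketch of the “implicit value CRDTs” idea (my own illustration): get() lazily instantiates a key’s value CRDT in its default state, so every key behaves as if its CRDT always existed. A real implementation would also avoid storing value CRDTs that are still in their default state.

```ts
class MapLikeObject<K, C> {
  private instantiated = new Map<K, C>();

  constructor(private createDefault: () => C) {}

  get(key: K): C {
    let value = this.instantiated.get(key);
    if (value === undefined) {
      value = this.createDefault();
      this.instantiated.set(key, value);
    }
    return value;
  }
}

// Usage (geography app example), with a hypothetical TextCRDT:
// const descriptions = new MapLikeObject<string, TextCRDT>(() => new TextCRDT());
// descriptions.get("some address").insert(0, "Looks like a school?");
```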
We’ll see a more traditional map with set/delete operations later.

The map-like object and similar CRDTs are often referred to as “map CRDTs” or “CRDT-valued maps” (I’ve done so myself). To avoid confusion, in this blog series, I will reserve those terms for maps with set/delete operations.

Collabs: CLazyMap

Refs: Riak Map; Conway et al. 2012; Kuper and Newton 2013

# Unique Set of CRDTs

Another composition technique uses UIDs as the names of value CRDTs. This gives the unique set of CRDTs.

Its operations are:

- add(initialState): Adds an operation add(id, initialState) to the history, where id is a new UID. This creates a new value CRDT with the given initial state.
- Value CRDT operations: Any user can perform operations on any value CRDT, tagged with the value’s UID. The UID has the same function as object composition’s component names.
- delete(id), where id is the UID of the value CRDT to be deleted.

Given an operation history, the unique set of CRDTs’ current state consists of all added value CRDTs, minus the deleted ones, in their own current states (according to the value CRDT operations). Formally:

```
for each add(id, initialState) operation:
    if there are no delete(id) operations:
        valueOps = all value CRDT operations tagged with id
        currentState = result of value CRDT's semantics applied to valueOps and initialState
        Add (id, currentState) to the set's current state
```

Example: In a collaborative flash card app, you could represent the deck of cards as a unique set of “flash card CRDTs”. Each flash card CRDT is an object containing text CRDTs for the front and back text. Users can edit the deck by adding a new card (with initial text), deleting an existing card, or editing a card’s front/back text. This extends our earlier flash card example.

Observe that once a value CRDT is deleted, it is deleted permanently. Even if another user operates on the value CRDT concurrently, it remains deleted. That allows an implementation to reclaim memory after receiving a delete op - it only needs to store the states of currently-present values. But it is not always the best semantics, so we’ll discuss alternatives below.

Like the unique set of (immutable) values, you can think of the unique set of CRDTs as an obvious way of working with UIDs in a JSON tree. Indeed, Firebase RTDB’s push method works just like add.

```
// JSON representation of the flash card example:
{
  "uid838x": {
    front: {<text CRDT state...>},
    back: {<text CRDT state...>}
  },
  "uid7b9J": {
    front: {<text CRDT state...>},
    back: {<text CRDT state...>}
  },
  ...
}
```

The unique set of CRDTs also matches the semantics you would get from normalized database tables: UIDs in one table; value CRDT operations in another table with the UID as a foreign key. A delete op corresponds to a foreign key cascade-delete.

Firebase RTDB differs from the unique set of CRDTs in that its delete operations are not permanent - concurrent operations on a deleted value are not ignored, although the rest of the value remains deleted (leaving an awkward partial object). You can work around this behavior by tracking the set of not-yet-deleted UIDs separately from the actual values. When displaying the state, loop over the not-yet-deleted UIDs and display the corresponding values (only). Firebase already recommends this for performance reasons.

Collabs: CSet

Refs: Yjs’s Y.Array
# List of CRDTs

By modifying the unique set of CRDTs to use list CRDT positions instead of UIDs, we get a list of CRDTs. Its value CRDTs are ordered.

Example: You can model the list of ingredients from Part 1 as a list of CRDTs, where each value CRDT is an ingredient object from above. Note that operations on a specific ingredient are tagged with its position (a kind of UID) instead of its index, as we anticipated in Part 1.

Collabs: CList

Refs: Yjs’s Y.Array

# Composed Examples

We now turn to semantic techniques that can be described compositionally.

In principle, if your app needed one of these behaviors, you could figure it out yourself: think about the behavior you want, then make it using the above techniques. In practice, it’s good to see examples.

# Add-Wins Set

The add-wins set represents a set of (non-unique) values. Its operations are add(x) and remove(x), where x is an immutable value of type T. Informally, its semantics are:

- Sequential add(x) and remove(x) operations behave in the usual way for a set (e.g. Java’s HashSet).
- If there are concurrent operations add(x) and remove(x), then the add “wins”: x is in the set.

Example: A drawing app includes a palette of custom colors, which users can add or remove. You can model this as an add-wins set of colors.

The informal semantics do not actually cover all cases. Here is a formal description using composition (a code sketch follows at the end of this section):

- For each possible value x, store a multi-value register indicating whether x is currently in the set. Do so using a map-like object whose keys are the values x. In pseudocode: MapLikeObject<T, MultiValueRegister<boolean>>.
- add(x) translates to the operation “set x’s multi-value register to true”.
- remove(x) translates to the operation “set x’s multi-value register to false”.
- If any of x’s multi-values are true, then x is in the set (enable-wins flag semantics). This is how we get the “add-wins” rule.

[Figure: Operations A-F with arrows A to B, A to D, B to C, B to E, D to E, E to F. The labels are: "add('red') -> red: set(true)"; "add('blue') -> blue: set(true)"; "remove('blue') -> blue: set(false)"; "add('blue') -> blue: set(true)"; "remove('red') -> red: set(false)"; "add('gray') -> gray: set(true)".]

Figure 5. Operation history for a color palette's add-wins set of colors, showing (original op) -> (translated op). The current state is { "blue", "gray" }: the bottom add("blue") op wins over the concurrent remove("blue") op.

There is a second way to describe the add-wins set’s semantics using composition, though you must assume causal-order delivery:

- The state is a unique set of entries (id, x). The current state is the set of values x appearing in at least one entry.
- add(x) translates to a unique-set add(x) operation.
- remove(x) translates to: locally, loop over the entries (id, x); for each one, issue delete(id) on the unique set.

The name observed-remove set - a synonym for add-wins set - reflects how this remove(x) operation works: it deletes all entries (id, x) that the local user has “observed”.

Collabs: CValueSet

Refs: Shapiro et al. 2011a; Leijnse, Almeida, and Baquero 2019
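Here is the promised sketch of the first construction, reusing the MultiValueRegister and MapLikeObject sketches from earlier sections (declared here so the block stands alone):

```ts
declare class MultiValueRegister<T> {
  set(x: T): void;
  get values(): T[];
}
declare class MapLikeObject<K, C> {
  constructor(createDefault: () => C);
  get(key: K): C;
}

class AddWinsSet<T> {
  private flags = new MapLikeObject<T, MultiValueRegister<boolean>>(
    () => new MultiValueRegister<boolean>()
  );

  add(x: T): void {
    this.flags.get(x).set(true); // "set x's register to true"
  }
  remove(x: T): void {
    this.flags.get(x).set(false); // "set x's register to false"
  }

  // Enable-wins flag: x is in the set if any of its multi-values are true.
  has(x: T): boolean {
    return this.flags.get(x).values.some((b) => b);
  }
}
```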
# List-with-Move

The lists above fix each element’s position when it is inserted. This is fine for text editing, but for other collaborative lists, you often want to move elements around. Moving an element shouldn’t interfere with concurrent operations on that element.

Example: In a collaborative recipe editor, users should be able to rearrange the order of ingredients using drag-and-drop. If one user edits an ingredient’s text while someone else moves it concurrently, those edits should show up on the moved ingredient, like the typo fix “Bredd” -> “Bread” here:

[Figure: An ingredients list starts with "Bredd" and "Peanut butter". One user swaps the order of ingredients. Concurrently, another user corrects the typo "Bredd" to "Bread". In the final state, the ingredients list is "Peanut butter", "Bread".]

In the example, intuitively, each ingredient has its own identity. That identity is independent of the ingredient’s current position; instead, position is a mutable property of the ingredient.

Here is a general way to achieve those semantics, the list-with-move:

- Assign each list element a UID, independently of its position. (E.g., store the elements in a unique set of CRDTs, not a list of CRDTs.)
- To each element, add a position property, containing its current position in the list.
- Move an element by setting its position to a new list CRDT position at the intended place. In case of concurrent move operations, apply LWW to their positions.

Sample pseudocode:

```
class IngredientCRDT extends CRDTObject {
  position: LWWRegister<Position>; // List CRDT position
  text: TextCRDT;
  ...
}

class IngredientListCRDT {
  ingredients: UniqueSetOfCRDTs<IngredientCRDT>;

  move(ingr: IngredientCRDT, newIndex: number) {
    const newPos = /* new list CRDT position at newIndex */;
    ingr.position.set(newPos);
  }
}
```

Collabs: CList.move

Refs: Kleppmann 2020

# Internally-Mutable Register

The registers above (LWW register, multi-value register) each represent a single immutable value. But sometimes, you want a value that is internally mutable, but can still be blind-set like a register - overriding concurrent mutations.

Example: A bulletin board has an “Employee of the Month” section that shows the employee’s name, photo, and a text box. Coworkers can edit the text box to give congratulations; it uses a text CRDT to allow simultaneous edits. Managers can change the current employee of the month, overwriting all three fields. If a manager changes the current employee while a coworker concurrently congratulates the previous employee, the latter’s edits should be ignored.

An internally-mutable register supports both set operations and internal mutations. Its state consists of:

- uSet, a unique set of CRDTs, used to create value CRDTs.
- reg, a separate LWW or multi-value register whose value is a UID from uSet.

The register’s visible state is the value CRDT indicated by reg. You internally mutate the value by performing operations on that value CRDT. To blind-set the value to initialState (overriding concurrent mutations), create a new value CRDT using uSet.add(initialState), then set reg to its UID.

Creating a new value CRDT is how we ensure that concurrent mutations are ignored: they apply to the old value CRDT, which is no longer shown. The old value CRDT can even be deleted from uSet to save memory.

The CRDT-valued map (next) is the same idea applied to each value in a map.

Collabs: CVar with CollabID values

Refs: true blind updates in Braun, Bieniusa, and Elberzhager (2021)
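A minimal sketch of this construction (my own illustration), with C as the value CRDT type; LWWRegister and LogicalTimestamp come from the earlier LWW sketch, and newUID() is a hypothetical UID source:

```ts
declare function newUID(): string;
declare type LogicalTimestamp = { counter: number; userID: string };
declare class LWWRegister<T> {
  constructor(state: { value: T; time: LogicalTimestamp });
  receive(newValue: T, newTime: LogicalTimestamp): void;
  get value(): T;
}

class InternallyMutableRegister<C> {
  private uSet = new Map<string, C>(); // unique set of value CRDTs
  private reg: LWWRegister<string>; // points at the current value's UID

  constructor(initial: C, time: LogicalTimestamp) {
    const id = newUID();
    this.uSet.set(id, initial);
    this.reg = new LWWRegister({ value: id, time });
  }

  // Blind-set: create a fresh value CRDT, then point reg at it.
  // Concurrent internal mutations hit the old value CRDT, now hidden.
  set(newState: C, time: LogicalTimestamp): void {
    const id = newUID();
    this.uSet.set(id, newState);
    this.reg.receive(id, time);
  }

  // The register's visible state: the value CRDT indicated by reg.
  get value(): C {
    return this.uSet.get(this.reg.value)!;
  }
}
```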
# CRDT-Valued Map

The map-like object above does not have operations to set/delete a key - it is more like an object than a hash map.

Here is a CRDT-valued map that behaves more like a hash map. Its state consists of:

- uSet, a unique set of CRDTs, used to create the value CRDTs.
- lwwMap, a separate last-writer-wins map whose values are UIDs from uSet.

The map’s visible state is: key maps to the value CRDT with UID lwwMap[key]. (If key is not present in lwwMap, then it is also not present in the CRDT-valued map.)

Operations:

- set(key, initialState) translates to { uSet.add(initialState); lwwMap.set(key, (UID from previous op)); }. That is, the local user creates a new value CRDT, then sets key to its UID.
- delete(key) translates to lwwMap.delete(key). Typically, implementations also call uSet.delete on all existing value CRDTs for key, since they are no longer reachable.

Note that if two users concurrently set a key, then one of their set ops will “win”, and the map will only show that user’s value CRDT. (The other value CRDT still exists in uSet.) This can be confusing if the two users meant to perform operations on “the same” value CRDT, merging their edits.

Example: A geography app lets users add a photo and description to any address on earth. Suppose you model the app as a CRDT-valued map from each address to a CRDT object { photo: LWWRegister<Image>, desc: TextCRDT }. If one user adds a photo to an unused address (necessarily calling map.set first), while another user adds a description concurrently (also calling map.set), then one CRDT object will overwrite the other:

[Figure: The address 5000 Forbes Ave starts with a blank description and photo. One user adds the description "Looks like a school?". Concurrently, another user adds a photo of a building. In the final state, the description is "Looks like a school?" but the photo is blank again.]

To avoid this, consider using a map-like object, like the previous geography app example.

More composed constructions that are similar to the CRDT-valued map:

- Same, except you don’t delete value CRDTs from uSet. Instead, they are kept around in an archive. You can “restore” a value CRDT by calling set(key, id) again later, possibly under a different key.
- A unique set of CRDTs where each value CRDT has a mutable key property, controlled by LWW. That way, you can change a value CRDT’s key - e.g., renaming a document. Note that your display must handle the case where multiple value CRDTs have the same key.

Collabs: CMap, CollabID

Refs: Yjs’s Y.Map; Automerge’s Map

# Archiving Collections

The CRDT-valued collections above (unique set, list, map) all have a delete operation that permanently deletes a value CRDT. It is good to have this option for performance reasons, but you often instead want an archive operation, which merely hides an element until it’s restored. (You can recover most of delete’s performance benefits by swapping archived values to disk/cloud.)

Example: In a notes app with cross-device sync, the user should be able to view and restore deleted notes. That way, they cannot lose a note by accident.

To implement archive/restore, add an isPresent field to each value. Values start with isPresent = true. The operation archive(id) sets it to false, and restore(id) sets it to true. In case of concurrent archive/restore operations, you can apply LWW, or use a multi-value register’s displayed value.

Alternate implementation: Use a separate add-wins set to indicate which values are currently present.

Collabs: CList.archive/restore
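A sketch of archive/restore with an isPresent flag per value, using the LWW option for concurrent archive/restore conflicts (my own illustration; LWWRegister and LogicalTimestamp are from the earlier LWW sketch):

```ts
declare type LogicalTimestamp = { counter: number; userID: string };
declare class LWWRegister<T> {
  constructor(state: { value: T; time: LogicalTimestamp });
  receive(newValue: T, newTime: LogicalTimestamp): void;
  get value(): T;
}

class ArchivingSet<V> {
  private elements = new Map<
    string,
    { value: V; isPresent: LWWRegister<boolean> }
  >();

  add(id: string, value: V, time: LogicalTimestamp): void {
    this.elements.set(id, {
      value,
      isPresent: new LWWRegister({ value: true, time }),
    });
  }

  // Merely hide the element; its state is kept and can be restored.
  archive(id: string, time: LogicalTimestamp): void {
    this.elements.get(id)?.isPresent.receive(false, time);
  }
  restore(id: string, time: LogicalTimestamp): void {
    this.elements.get(id)?.isPresent.receive(true, time);
  }

  get present(): V[] {
    return [...this.elements.values()]
      .filter((e) => e.isPresent.value)
      .map((e) => e.value);
  }
}
```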
# Update-Wins Collections

If one user archives a value that another user is still using, you might choose to “auto-restore” that value.

Example: In a spreadsheet, one user deletes a column while another user edits some of that column’s cells concurrently. The second user probably wants to keep that column, and it’s easier if the column restores automatically (Yanakieva, Bird, and Bieniusa 2023).

To implement this on top of an archiving collection, merely call restore(id) each time the local user edits id’s CRDT. So each local operation translates to two ops in the history: the original (value CRDT) operation and restore(id).

To make sure that these “keep” operations win over concurrent archive operations, use an enable-wins flag to control the isPresent field. (I.e., a multi-value register whose displayed value is true if any of the multi-values are true.) Or, use the last section’s alternate implementation: an add-wins set of present values.

Collabs: supported by CList.restore

Refs: Yanakieva, Bird, and Bieniusa 2023

# Spreadsheet Grid

In a spreadsheet, users can insert, delete, and potentially move rows and columns. That is, the collection of rows behaves like a list, as does the collection of columns. Thus the cell grid is the 2D analog of a list.

It’s tempting to model the grid as a list-of-lists. However, that has the wrong semantics in some scenarios. In particular, if one user creates a new row, while another user creates a new column concurrently, then there won’t be a cell at the intersection.

Instead, you should think of the state as:

- A list of rows.
- A list of columns.
- For each pair (row, column), a single cell, uniquely identified by the pair (row id, column id). row id and column id are unique IDs.

The cells are not explicitly created; instead, the state implicitly contains such a cell as soon as its row and column exist. Of course, until a cell is actually used, it remains in a default state (blank) and doesn’t need to be stored in memory. Once a user’s app learns that the row or column was deleted, it can forget the cell’s state, without an explicit “delete cell” operation - like a foreign key cascade-delete.

In terms of the composition techniques above (objects, list-with-move, map-like object):

```
class CellCRDT extends CRDTObject {
  formula: LWWRegister<string>;
  ...
}

rows: ListWithMoveCRDT<Row>;
columns: ListWithMoveCRDT<Column>;
cells: MapLikeObject<(rowID: UID, columnID: UID), CellCRDT>;
// Note: if you use this compositional construction in an implementation,
// you must do extra work to forget deleted cells' states.
```

# Advanced Techniques

The techniques in this section are more advanced. This really means that they come from more recent papers and I am less comfortable with them; they also have fewer existing implementations, if any.

Except for undo/redo, you can think of these as additional composed examples. However, they have more complex views than the previous composed examples.

# Formatting Marks (Rich Text)

Rich text consists of plain text plus inline formatting: bold, font size, hyperlinks, etc. (This section does not consider block formatting like blockquotes or bulleted lists, discussed later.)

Inline formatting traditionally applies not to individual characters, but to spans of text: all characters from index i to j. E.g., atJSON (not a CRDT) uses the following to represent a bold span that affects characters 5 to 11:

```
{
  type: "-offset-bold",
  start: 5,
  end: 11,
  attributes: {}
}
```

Future characters inserted in the middle of the span should get the same format. Likewise for characters inserted at the end of the span, for certain formats. You can override (part of) the span by applying a new formatting span on top.

The inline formatting CRDT lets you use formatting spans in a CRDT setting.
(I’m using this name for a specific part of the Peritext CRDT (Litt et al. 2021).) It consists of:

- An append-only log of CRDT-ified formatting spans, called marks.
- A view of the mark log that tells you the current formatting at each character.

Each mark has the following form:

```
{
  key: string;
  value: any;
  timestamp: LogicalTimestamp;
  start: { pos: Position, type: "before" | "after" }; // anchor
  end: { pos: Position, type: "before" | "after" }; // anchor
}
```

Here timestamp is a logical timestamp for LWW, while each Position is a list CRDT position. This mark sets key to value (e.g. "bold": true) for all characters between start and end. The endpoints are anchors that exist just before or just after their pos:

[Figure: "Some text" with before and after anchors on each character. The middle text "me te" is bold due to a mark labeled 'Bold mark from { pos: (m's pos), type: "before" } to { pos: (x's pos), type: "before" }'.]

LWW takes effect when multiple marks affect the same character and have the same key: the one with the largest timestamp wins. In particular, new marks override (causally) older ones. Note that a new mark might override only part of an older mark’s range.

Formally, the view of the mark log is given by: for each character c, for each format key key, find the mark with the largest timestamp satisfying

- mark.key = key, and
- the interval (mark.start, mark.end) contains c’s position.

Then c’s format value at key is mark.value.

Remarks:

1. To unformat, apply a formatting mark with a null value, e.g., { key: "bold", value: null, ... }. This competes against other “bold” marks in LWW.
2. A formatting mark affects not just (causally) future characters in its range, but also characters inserted concurrently: [Figure: Text starts as "The cat jumped on table.", unbold. One user highlights the entire range and bolds it. Concurrently, another user inserts " the" after "on". The final state is "The cat jumped on the table.", all bold.]
3. Anchors let you choose whether a mark “expands” to affect future and concurrent characters at the beginning or end of its range. For example, the bold mark pictured above expands at the end: a character typed between e and x will still be within the mark’s range because the mark’s end is attached to x.
4. The view of the mark log is difficult to compute and store efficiently. Part 3 will describe an optimized view that can be maintained incrementally and doesn’t store metadata on every character.
5. Sometimes a new character should be bold (etc.) according to local rules, but existing formatting marks don’t make it bold. E.g., a character inserted at the beginning of a paragraph in MS Word inherits the following character’s formatting, but the inline formatting CRDT doesn’t do that automatically. To handle this, when a user types a new character, compute its formatting according to local rules. (Most rich-text editor libraries already do so.) If the inline formatting CRDT currently assigns different formatting to that character, fix it by adding new marks to the log.
6. Fancy extension to (5): Usually the local rules are “extending” a formatting mark in some direction - e.g., backwards from the paragraph’s previous starting character. You can figure out which mark is being extended, then reuse its timestamp instead of making a new one. That way, LWW behaves identically for your new mark vs the one it’s extending.

Collabs: CRichText

Refs: Litt et al. 2021 (Peritext)

# Spreadsheet Formatting

You can also apply inline formatting to non-text lists.
For example, Google Sheets lets you bold a range of rows, with similar behavior to a range of bold text: new rows at the middle or end of the range are also bold. A cell in a bold row renders as bold, unless you override the formatting for that cell.

In more detail, here’s an idea for spreadsheet formatting. Use two inline formatting CRDTs, one for the rows and one for the columns. Also, for each cell, store an LWW map of format key-value pairs; mutate the map when the user formats that individual cell. To compute the current bold format for a cell, consider:

- The current (largest timestamp) bold mark for the cell’s row.
- The current bold mark for its column.
- The value at key “bold” in the cell’s own LWW map.

Then render the mark/value with the largest timestamp out of these three.

This idea lets you format rows, columns, and cells separately. Sequential formatting ops interact in the expected way: for example, if a user bolds a row and then unbolds a column, the intersecting cell is not bold, since the column op has a larger timestamp.

# Global Modifiers

Often you want an operation to do something “for each” element of a collection, including elements added concurrently.

Example: An inline formatting mark affects each character in its range, including characters inserted concurrently (see above).

Example: Suppose a recipe editor has a “Halve the recipe” button, which halves every ingredient’s amount. This should have the semantics: for each ingredient amount, including amounts set concurrently, halve it. If you don’t halve concurrent set ops, the recipe can get out of proportion:

[Figure: An ingredients list starts with 100 g Flour and 80 g Milk. One user edits the amount of Milk to 90 g. Concurrently, another user halves the recipe (50 g Flour, 40 g Milk). The final state is: 50 g Flour, 90 g Milk.]

I’ve talked about these for-each operations before and co-authored a paper formalizing them (Weidner et al. 2023). However, the descriptions so far query the causal order (below), making them difficult to implement.

Instead, I currently recommend implementing these examples using global modifiers. By “global modifier”, I mean a piece of state that affects all elements of a collection/range: causally prior, concurrent, and (causally) future.

The inline formatting marks above have this form: a mark affects each character in its range, regardless of when it was inserted. If a user decides that a future character should not be affected, that user can override the formatting mark with a new one.

To implement the “Halve the recipe” example (a code sketch follows after this list):

- Store a global scale alongside the recipe. This is a number controlled by LWW, which you can think of as the number of servings.
- Store each ingredient’s amount as a scale-independent number. You can think of this as the amount per serving.
- The app displays the product ingrAmount.value * globalScale.value for each ingredient’s amount.
- To halve the recipe, merely set the global scale to half of its current value: globalScale.set(0.5 * globalScale.value). This halves all displayed amounts, including amounts set concurrently and concurrently-added ingredients.
- When the user sets an amount, locally compute the corresponding scale-independent amount, then set that. E.g. if they change flour from 50 g to 55 g but the global scale is 0.5, instead call ingrAmount.set(110).
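Here is that sketch (my own illustration, reusing the earlier LWWRegister sketch):

```ts
declare type LogicalTimestamp = { counter: number; userID: string };
declare class LWWRegister<T> {
  constructor(state: { value: T; time: LogicalTimestamp });
  receive(newValue: T, newTime: LogicalTimestamp): void;
  get value(): T;
}

class ScaledRecipe {
  private globalScale: LWWRegister<number>;
  // Per-ingredient scale-independent amounts ("per serving"):
  private amounts = new Map<string, LWWRegister<number>>();

  constructor(t0: LogicalTimestamp) {
    this.globalScale = new LWWRegister({ value: 1, time: t0 });
  }

  // Displayed amount = scale-independent amount * global scale.
  displayedAmount(ingrID: string): number {
    return this.amounts.get(ingrID)!.value * this.globalScale.value;
  }

  // "Halve the recipe": one LWW set that affects every displayed amount,
  // including amounts set concurrently and concurrently-added ingredients.
  halve(time: LogicalTimestamp): void {
    this.globalScale.receive(0.5 * this.globalScale.value, time);
  }

  // When the user sets a displayed amount, divide out the current scale.
  setAmount(ingrID: string, displayed: number, time: LogicalTimestamp): void {
    this.amounts.get(ingrID)!.receive(displayed / this.globalScale.value, time);
  }
}
```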
In the recipe editor, you could even make the global scale non-collaborative: each user chooses how many servings to display on their own device. But all collaborative edits affect the same single-serving recipe internally.

Refs: Weidner, Miller, and Meiklejohn 2020; Weidner et al. 2023

# Forests and Trees

Many apps include a tree or forest structure. (A forest is a collection of disconnected trees.) Typical operations are creating a new node, deleting a node, and moving a node (changing its parent).

Examples: A file system is a tree whose leaves are files and inner nodes are folders. A Figma document is a tree of rendered objects.

The CRDT way to represent a tree or forest is: each node has a parent node, set via LWW. The parent can either be another node, a special “root” node (in a tree), or “none” (in a forest). You compute the tree/forest as a view of these child->parent relationships (edges) in the obvious way.

When a user deletes a node - implicitly deleting its whole subtree - don’t actually loop over the subtree deleting nodes. That would have weird results if another user concurrently moved some nodes into or out of the subtree. Instead, only delete the top node (or archive it - e.g., set its parent to a special “trash” node). It’s a good idea to let users view the deleted subtree and move nodes out of it.

Everything I’ve said so far is just an application of basic techniques. Cycles are what make forests and trees advanced: it’s possible that one user sets B.parent = A, while concurrently, another user sets A.parent = B. Then it’s unclear what the computed view should be.

[Figure: A tree starts with root C and children A, B. One user moves A under B (sets A.parent = B). Concurrently, another user moves B under A. The final state has C, an A-B cycle, and "??".]

Figure 6. Concurrent tree-move operations - each valid on their own - may create a cycle. When this happens, what state should the app display, given that cycles are not allowed in a forest/tree?

Some ideas for how to handle cycles:

1. Error. Some desktop file sync apps do this in practice (Kleppmann et al. (2022) give an example).
2. Render the cycle nodes (and their descendants) in a special “time-out” zone. They will stay there until some user manually fixes the cycle.
3. Use a server to process move ops. When the server receives an op, if it would create a cycle in the server’s own state, the server rejects it and tells users to do likewise. This is what Figma does. Users can still process move ops optimistically, but they are tentative until confirmed by the server. (Optimistic updates can cause temporary cycles for users; in that case, Figma uses strategy (2): it hides the cycle nodes.)
4. Similar, but use a topological sort (below) instead of a server’s receipt order. When processing ops in the sort order, if an op would create a cycle, skip it (Kleppmann et al. 2022).
5. For forests: Within each cycle, let B.parent = A be the edge whose set operation has the largest LWW timestamp. At render time, “hide” that edge, instead rendering B.parent = "none", but don’t change the actual CRDT state. This hides one of the concurrent edges that created the cycle. To prevent future surprises, users’ apps should follow the rule: before performing any operation that would create or destroy a cycle involving a hidden edge, first “affirm” that hidden edge, by performing an op that sets B.parent = "none".
6. For trees: Similar, except instead of rendering B.parent = "none", render the previous parent for B - as if the bad operation never happened. More generally, you might have to backtrack several operations. Both Hall et al. (2018) and Nair et al. (2022) describe strategies along these lines.
Refs: Graphs in Shapiro et al. 2011a; Martin, Ahmed-Nacer, and Urso 2011; Hall et al. (2018); Nair et al. 2022; Kleppmann et al. 2022; Wallace 2022

# Undo/Redo

In most apps, users should be able to undo and redo their own operations in a stack, using Ctrl+Z / Ctrl+Shift+Z. You might also allow selective undo (undo operations anywhere in the history) or group undo (users can undo each others’ operations) - e.g., for reverting changes.

A simple way to undo an operation is: perform a new operation whose effect locally undoes the target operation. For example, to undo typing a character, perform a new operation that deletes that character.

However, this “local undo” has sub-optimal semantics. For example, suppose one user posts an image, undoes it, then redoes it; concurrently to the undo/redo, another user comments on the image. If you implement the redo as “make a new post with the same contents”, then the comment will be lost: it attaches to the original post, not the re-done one.

Exact undo instead uses the following semantics:

1. In addition to normal app operations, there are operations undo(opID) and redo(opID), where opID identifies a normal operation.
2. For each opID, consider the history of undo(opID) and redo(opID) operations. Apply some boolean-value CRDT to that operation history to decide whether opID is currently (re)done or undone.
3. The current state is the result of applying your app’s semantics to the (re)done operations only.

[Figure: Top: Operations A-F with arrows A to B, A to D, B to C, B to E, D to E, E to F. The labels are: "op1x7: add(red)"; "op33n: add(blue)"; "undo(op33n)"; "undo(op1x7)"; "op91k: add(green)"; "redo(op1x7)". Bottom: Only operations A and D, with an arrow A to D.]

Figure 7. Top: Operation history for an add-wins set with exact undo. Currently, op1x7 is redone, op33n is undone, and op91k is done. Bottom: We filter the (re)done operations only and pass the filtered operation history to the add-wins set's semantics, yielding state { "red", "green" }.

For the boolean-value CRDT, you can use LWW, or a multi-value register’s displayed value (e.g., redo-wins). Or, you can use the maximum causal length: the first undo operation is undo(opID, 1); redoing it is redo(opID, 2); undoing it again is undo(opID, 3), etc.; and the winner is the op with the largest number. (Equivalently, the head of the longest chain of causally-ordered operations - hence the name.)

Maximum causal length makes sense as a general boolean-value CRDT, but I’ve only seen it used for undo/redo.

Step 3 is more difficult than it sounds. Your app might assume causal-order delivery, then give weird results when undone operations violate it. (E.g., our multi-value register algorithm above will not match the intended semantics after undos.) Also, most algorithms do not support removing past operations from the history. But see Brattli and Yu (2021) for a multi-value register that is compatible with exact undo.

Refs: Weiss, Urso, and Molli 2010; Yu, Elvinger, and Ignat 2019

# Other Techniques

This section mentions other techniques that I personally find less useful. Some are designed for distributed data stores instead of collaborative apps; some give reasonable semantics but are hard to implement efficiently; and some are plausibly useful but I have not yet found a good example app.

# Remove-Wins Set

The remove-wins set is like the add-wins set, except if there are concurrent operations add(x) and remove(x), then the remove wins: x is not in the set.
Refs: Weiss, Urso, and Molli 2010; Yu, Elvinger, and Ignat 2019

# Other Techniques

This section mentions other techniques that I personally find less useful. Some are designed for distributed data stores instead of collaborative apps; some give reasonable semantics but are hard to implement efficiently; and some are plausibly useful but I have not yet found a good example app.

# Remove-Wins Set

The remove-wins set is like the add-wins set, except if there are concurrent operations add(x) and remove(x), then the remove wins: x is not in the set. You can implement this similarly to the add-wins set, using a disable-wins flag instead of an enable-wins flag. (Take care that the set starts empty, instead of containing every possible value.) Or, you can implement the remove-wins set using:

- an append-only log of all values that have ever been added, and
- an add-wins set indicating which of those values are currently removed.

In general, any implementation must store all values that have ever been added; this is a practical reason to prefer the add-wins set instead. I also do not know of an example app where I prefer the remove-wins set’s semantics. The exception is apps that already store all values elsewhere, such as an archiving collection: I think a remove-wins set of present values would give reasonable semantics. That is equivalent to using an add-wins set of archived values, or to using a disable-wins flag for each value’s isPresent field.

Refs: Bieniusa et al. 2012; Baquero, Almeida, and Shoker 2017; Baquero et al. 2017

# PN-Set

The PN-Set (Positive-Negative Set) is another alternative to the add-wins set. Its semantics are: for each value x, count the number of add(x) operations in the operation history, minus the number of remove(x) operations; if it’s positive, x is in the set.

These semantics give strange results in the face of concurrent operations, as described by Shapiro et al. (2011a). For example, if two users call add(x) concurrently, then to remove x from the set, you must call remove(x) twice. If two users do that concurrently, it will interact strangely with further add(x) operations, etc.

Like the maximum causal length semantics, the PN-Set was originally proposed for undo/redo.
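For concreteness, here is a sketch of the PN-Set's membership rule applied directly to an operation history (the types are illustrative; a real implementation would keep running counts instead of scanning the history):

```ts
// PN-Set membership: adds minus removes, evaluated over the whole history.
type SetOp<T> = { type: "add" | "remove"; value: T };

function pnSetContains<T>(history: SetOp<T>[], x: T): boolean {
  let count = 0;
  for (const op of history) {
    if (op.value === x) count += op.type === "add" ? 1 : -1;
  }
  return count > 0;
}
```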
Refs: Weiss, Urso, and Molli 2010; Shapiro et al. 2011a

# Observed-Reset Operations

An observed-reset operation cancels out all causally prior operations. That is, when looking at the operation history, you ignore all operations that are causally prior to any reset operation, then compute the state from the remaining operations.

[Figure: Six +1 operations, and one reset() operation that is causally greater than three of the +1 operations (underlined).]

Figure 8. Operation history for a counter with +1 and observed-reset operations. The reset operation cancels out the underlined +1 operations, so the current state is 3.

Observed-reset operations are a tempting way to add a delete(key) operation to a map-like object: make delete(key) be an observed-reset operation on the value CRDT at key. Thus delete(key) restores a value CRDT to its original, unused state, which you can treat as the “key-not-present” state. However, then if one user calls delete(key) while another user operates on the value CRDT concurrently, you’ll end up with an awkward partial state:

[Figure: In a collaborative todo-list with observed-reset deletes, concurrently deleting an item and marking it done results in a nonsense list item with no text field. Image credit: Figure 6 by Kleppmann and Beresford.]

That paper describes a theoretical JSON CRDT, but Firebase RTDB has the same behavior.

I instead prefer to omit delete(key) from the map-like object entirely. If you need deletions, instead use a CRDT-valued map or similar. Those ultimately treat delete(key) as a permanent deletion (from the unique-set of CRDTs) or as an archive operation.

Refs: Deletes in Riak Map

# Querying the Causal Order

Most of our techniques so far don’t use the causal order on operations (arrows in the operation history). However, the multi-value register does: it queries the set of causally-maximal operations, displaying their values. Observed-reset operations also query the causal order, and the add-wins set/remove-wins set reference it indirectly.

One can imagine CRDTs that query the causal order in many more ways. However, I find these too complicated for practical use:

1. It is expensive to track the causal order on all pairs of operations.
2. Semantics that ask “is there an operation concurrent to this one?” generally need to store operations forever, in case a concurrent operation appears later.
3. It is easy to create semantic rules that don’t behave well in all scenarios.

(The multi-value register and add-wins set occupy a special case that avoids (1) and (2).)

As an example of (3), it is tempting to define an add-wins set by: an add(x) operation overrules any concurrent remove(x) operation, so that the add wins. But then in Figure 9’s operation history, both remove(x) operations get overruled by concurrent add(x) operations. That makes x present in the set when it shouldn’t be.

[Figure: Operations A-D with arrows A to B, C to D. The labels are: add(x), remove(x), add(x), remove(x).]

Figure 9. Operation history for an add-wins set. One user calls add(x) and then remove(x); concurrently, another user does likewise. The correct current state is the empty set: the causally-maximal operations on x are both remove(x).

As another example, you might try to define an add-wins set by: if there are concurrent add(x) and remove(x) operations, apply the remove(x) “first”, so that add(x) wins; otherwise apply operations in causal order. But then in the above operation history, the intended order of operations contains a cycle:

[Figure: Operations A-D with "causal order" arrows A to B, C to D, and "remove-first rule" arrows B to C, D to A. The labels are: add(x), remove(x), add(x), remove(x).]

I always try out this operation history when a paper claims to reproduce/understand the add-wins set in a new way.

# Topological Sort

A topological sort is a general way to “derive” CRDT semantics from an ordinary data structure. Given an operation history made out of ordinary data structure operations, the current state is defined by:

1. Sort the operations into some consistent linear order that is compatible with the causal order (o < p implies o appears before p). E.g., sort them by Lamport timestamp.
2. Apply those operations to the starting state in order, returning the final state. If an operation would be invalid (e.g. it would create a cycle in a tree), skip it.

The problem with these semantics is that you don’t know what result you will get - it depends on the sort order, which is fairly arbitrary.

However, topological sort can be useful as a fallback in complex cases, like tree cycles or group-chat permissions. You can think of it like asking a central server for help: the sort order stands in for “the order in which ops reached the server”. (If you do have a server, you can use that instead.)
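A minimal sketch of these two steps, assuming each op carries a Lamport timestamp plus a replicaID for tie-breaking (all names are illustrative):

```ts
// Topological-sort semantics: sort ops consistently, apply in order, skip
// invalid ops.
type Op<S> = {
  lamport: number;
  replicaID: string;
  apply: (state: S) => S; // throws if invalid in the current state
};

function topologicalSortState<S>(history: Op<S>[], start: S): S {
  // 1. A consistent linear order compatible with the causal order:
  // Lamport timestamps respect causality; replicaID breaks ties.
  const sorted = [...history].sort(
    (a, b) => a.lamport - b.lamport || a.replicaID.localeCompare(b.replicaID)
  );
  // 2. Apply in order, skipping operations that would be invalid.
  let state = start;
  for (const op of sorted) {
    try {
      state = op.apply(state);
    } catch {
      // E.g., a tree-move op that would create a cycle: skip it.
    }
  }
  return state;
}
```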
Ref: Kleppmann et al. 2018

# Capstones

Let’s finish by designing novel semantics for two practical but complex collaborative apps.

# Recipe Editor

I’ve mentioned a collaborative recipe editor in several examples. It’s implemented as a Collabs demo: live demo, talk slides, talk video, source code.

[Figure: Recipe editor screenshot showing a recipe for roast broccoli.]

The app’s semantics can be described compositionally using nested objects. Here is a schematic:

```
{
  ingredients: UniqueSetOfCRDTs<{
    text: TextCRDT,
    amount: LWWRegister<number>, // Scale-independent amount
    units: LWWRegister<Unit>,
    position: LWWRegister<Position>, // List CRDT position, for list-with-move
    isPresent: EnableWinsFlag // For update-wins semantics
  }>,
  globalScale: LWWRegister<number>, // For scale ops
  description: {
    text: TextCRDT,
    formatting: InlineFormattingCRDT
  }
}
```

(Links by class name: UniqueSetOfCRDTs, TextCRDT, LWWRegister, EnableWinsFlag, InlineFormattingCRDT.)

Most GUI operations translate directly to operations on this state, but there are some edge cases.

- The ingredient list is a list-with-move: move operations (the arrow buttons) set position.
- Delete operations (the red X’s) use update-wins semantics: delete sets isPresent to false, while each operation on an ingredient (e.g., setting its amount) additionally sets isPresent to true.
- Ingredient amounts, and the “Double the recipe!” / “Halve the recipe!” buttons, treat the scale as a global modifier.

# Block-Based Rich Text

We described inline rich-text formatting above, like bold and italics. Real rich-text editors also support block formatting: headers, lists, blockquotes, etc. Fancy apps like Notion even let you rearrange the order of blocks using drag-and-drop:

[Figure: Notion screenshot of moving block "Lorem ipsum dolor sit amet" from before to after "consectetur adipiscing elit".]

Let’s see if we can design a CRDT semantics that has all of these features: inline formatting, block formatting, and movable blocks. Like the list-with-move, moving a block should not affect concurrent edits within that block. We’d also like nice behavior in tricky cases - e.g., one user moves a block while a concurrent user splits it into two blocks.

This section is experimental; I’ll update it in the future if I learn of improvements (suggestions are welcome).

Refs: Ignat, André, and Oster 2017 (similar Operational Transformation algorithm); Quill line formatting; unpublished notes by Martin Kleppmann (2022); Notion’s data model

# CRDT State

The CRDT state is an object with several components:

- text: A text CRDT. It stores the plain text characters plus two kinds of invisible characters: block markers and hooks. Each block marker indicates the start of a block, while hooks are used to place blocks that have been split or merged.
- format: An inline formatting CRDT on top of text. It controls the inline formatting of the text characters (bold, italic, links, etc.). It has no effect on invisible characters.
- blockList: A separate list CRDT that we will use to order blocks. It does not actually have content; it just serves as a source of list CRDT positions.
- For each block marker (keyed by its list CRDT position), a block CRDT object with the following components:
  - blockType: An LWW register whose value is the block’s type. This can be “heading 2”, “blockquote”, “unordered list item”, “ordered list item”, etc.
  - indent: An LWW register whose value is the block’s indent level (a nonnegative integer).
  - placement: An LWW register that we will explain later. Its value is one of:
    - { case: "pos", target: <position in blockList> }
    - { case: "origin", target: <a hook's list CRDT position> }
    - { case: "parent", target: <a hook's list CRDT position>, prevPlacement: <a "pos" or "origin" placement value> }
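Collecting the above into TypeScript-style type declarations may help; this is a sketch in the style of the recipe schematic, and the position types and CRDT class stubs are assumptions, not a real library API.

```ts
// Stubs for the CRDT building blocks named earlier (assumed, not real APIs):
declare class LWWRegister<T> {}
declare class TextCRDT {}
declare class InlineFormattingCRDT {}
declare class ListCRDT {}

type BlockListPosition = string; // a position in blockList
type HookPosition = string; // a hook's position in text
type BlockMarkerPosition = string; // a block marker's position in text

type Placement =
  | { case: "pos"; target: BlockListPosition }
  | { case: "origin"; target: HookPosition }
  // prevPlacement is always a "pos" or "origin" value.
  | { case: "parent"; target: HookPosition; prevPlacement: Placement };

type BlockCRDT = {
  blockType: LWWRegister<string>; // "heading 2", "blockquote", ...
  indent: LWWRegister<number>; // nonnegative indent level
  placement: LWWRegister<Placement>;
};

type RichTextState = {
  text: TextCRDT; // characters + invisible block markers and hooks
  format: InlineFormattingCRDT; // inline formatting on top of text
  blockList: ListCRDT; // no content; just a source of positions
  blocks: Map<BlockMarkerPosition, BlockCRDT>; // one per block marker
};
```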
# Rendering the App State

Let’s ignore blockCRDT.placement for now. Then rendering the rich-text state resulting from the CRDT state is straightforward:

- Each block marker defines a block. The block’s contents are the text immediately following that block marker in text, ending at the next block marker.
- The block is displayed according to its blockType and indent. For ordered list items, the leading number (e.g. “3.”) is computed at render time according to how many preceding blocks are ordered list items. Unlike in HTML, the CRDT state does not store an explicit “list start” or “list end”.
- The text inside a block is inline-formatted according to format. Note that formatting marks might cross block boundaries; this is fine.

[Figure: Top: "text: _Hello_Okay" with underscores labeled "Block marker n7b3", "Block marker ttx7". Bottom left: 'n7b3: { blockType: “ordered list item”, indent: 0 }', 'ttx7: { blockType: “blockquote”, indent: 0 }'. Bottom right: two rendered blocks, "1. Hello" and "(blockquote) Okay".]

Figure 10. Sample state and rendered text, omitting blockList, hooks, and blockCRDT.placement.

Now we need to explain blockCRDT.placement. It tells you how to order a block relative to other blocks, and whether it is merged into another block.

- If case is "pos": This block stands on its own. Its order relative to other "pos" blocks is given by target’s position in blockList.
- If case is "origin": This block again stands on its own, but it follows another block (its origin) instead of having its own position. Specifically, let block A be the block containing target (i.e., the last block marker prior to target in text). Render this block immediately after block A.
- If case is "parent": This block has been merged into another block (its parent). Specifically, let block A be the block containing target. Render our text as part of block A, immediately after block A’s own text. (Our blockType and indent are ignored.)

[Figure: Top: "blockList: [p32x, p789]". Middle: "text: _Hel^lo_Ok^_ay" with underscores labeled "Block marker n7b3", "Block marker ttx7", "Block marker x1bc", and carets labeled "Hook @ pbj8", "Hook @ p6v6". Bottom left: 'n7b3.placement: { case: “pos”, target: p32x }', 'ttx7.placement: { case: “origin”, target: pbj8 }', 'x1bc.placement: { case: “parent”, target: p6v6 }'. Bottom right: two rendered blocks, "1. Hello" and "(blockquote) Okay".]

Figure 11. Repeat of Figure 10 showing blockList, hooks, and blockCRDT.placement. Observe that "Okay" is the merger of two blocks, "Ok" and "ay".

You might notice some dangerous edge cases here! We’ll address those shortly.

# Move, Split, and Merge

Now we can implement the three interesting block-level operations:

Move. To move block B so that it immediately follows block A, first create a new position in blockList that is immediately after A’s position (or its origin’s (origin’s…) position). Then set block B’s placement to { case: "pos", target: <new blockList position> }.

- If there are any blocks with origin B that you don’t want to move along with B, perform additional move ops to keep them where they currently appear.
- If there are any blocks with origin A, perform additional move ops to move them after B, so that they aren’t rendered between A and B.
- Edge case: to move block B to the start, create the position at the beginning of blockList.

Split. To split a block in two, insert a new hook and block marker at the splitting position (in that order). Set the new block’s placement to { case: "origin", target: <new hook's position> }, and set its blockType and indent as you like.

Why do we point to a hook instead of the previous block’s block marker?
Basically, we want to follow the text just prior to the split, which might someday end up in a different block. (Consider the case where the previous block splits again, then one half moves without the other.)

Merge. To merge block B into the previous block, first find the previous block A in the current rendered state. (This might not be the previous block marker in text.) Insert a new hook at the end of A’s rendered text, then set block B’s placement to { case: "parent", target: <new hook's position>, prevPlacement: <block B's previous placement value> }. The “end of A’s rendered text” might be in a block that was merged into A.

# Edge Cases

It remains to address some dangerous edge cases during rendering.

First, it is possible for two blocks B and C to have the same origin block A. So according to the above rules, they should both be rendered immediately after block A, which is impossible. Instead, render them one after the other, in the same order that their hooks appear in the rendered text. (This might differ from the hooks’ order in text.)

More generally, the relationships “block B has origin A” form a forest. (I believe there will never be a cycle, so we don’t need our advanced technique above.) For each tree in the forest, render that tree’s blocks consecutively, in depth-first pre-order traversal order.

[Figure: Top: "blockList: [p32x, p789]". Bottom: "Forest of origins" with nodes A-E and edges B to A, C to B, D to A. Nodes A and E have lines to p32x and p789, respectively.]

Figure 12. A forest of origin relationships. This renders as block order A, B, C, D, E. Note that the tree roots A, E are ordered by their positions in blockList.

Second, it is likewise possible for two blocks B and C to have the same parent block A. In that case, render both blocks’ text as part of block A, again in order by their hooks.

More generally, we would like the relationships “block B has parent A” to form a forest. However, this time cycles are possible!

[Figure: Starting state: A then B. Top state: AB. Bottom state: (B then A) with an arrow to (BA). Final state: A and B with arrows in a cycle and "??".]

Figure 13. One user merges block B into A. Concurrently, another user moves B above A, then merges A into B. Now their parent relationships form a cycle.

To solve this, use any of the ideas from forests and trees to avoid/hide cycles. I recommend a variant of idea 5: within each cycle, “hide” the edge with largest LWW timestamp, instead rendering its prevPlacement. (The prevPlacement always has case "pos" or "origin", so this won’t create any new cycles. Also, I believe you can ignore idea 5’s sentence about “affirming” hidden edges.)

Third, to make sure there is always at least one block, the app’s initial state should be: blockList contains a single position; text contains a single block marker with placement = { case: "pos", target: <the blockList position> }.

Why does moving a block set its placement.target to a position within blockList, instead of a hook like Split/Merge? This lets us avoid another cycle case: if block B is moved after block A, while concurrently, block A is moved after block B, then blockList gives them a definite order without any fancy cycle-breaking rules.
# Validation

I can’t promise that these semantics will give a reasonable result in every situation. But following the advice in Composition Techniques, we can check all interesting pairs of concurrent operations, then trust that the general case at least satisfies strong convergence (by composition).

The following figure shows what I believe will happen in various concurrent situations. In each case, it matches my preferred behavior. Lines indicate blocks, while A/B/C indicate chunks of text.

[Figure: Split vs Split: ABC to (A BC / AB C) to A B C. Merge vs Merge: A B C to (A BC / AB C) to ABC. Split vs Merge 1: AB C to (A B C / ABC) to A BC. Split vs Merge 2: A BC to (A B C / ABC) to AB C. Move vs Merge: A B C to (B A C / A BC) to BC A. Move vs Split: A BC to (BC A / A B C) to B C A.]

# Conclusion

A CRDT-based app’s semantics describe what the state of the app should be, given a history of collaborative operations. Choosing semantics is ultimately a per-app problem, but the CRDT literature provides many ideas and examples.

This blog post was long because there are indeed many techniques. However, we saw that they are mostly just a few basic ideas (UIDs, list CRDT positions, LWW, multi-value register) plus composed examples.

It remains to describe algorithms implementing these semantics. Although we gave some basic CRDT algorithms in-line, there are additional nuances and optimizations. Those will be the subject of the next post, Part 3: Algorithmic Techniques.

This blog post is Part 2 of a series.

- Part 1: Introduction
- Part 2: Semantic Techniques
- Part 3: Algorithmic Techniques
- Part 4: Further Topics
# Reorient GitHub Pull Requests Around Changesets

I've had the experience of using GitHub as a maintainer for very large open source projects (1000+ contributors), as an engineer for very large closed source corporate projects, and everything smaller. Through those experiences up to today, GitHub pull requests is where I spend almost all of my time while on GitHub, and to me it's also unfortunately the most frustrating part of GitHub.

There are a lot of improvements I would love to see with pull requests, but a massive chunk of my problems would be solved through one major feature: changesets. This blog post describes this suggestion and what I would love to see.

Disclaimer: My ideas here are not original! I do not claim to have come up with these ideas. My suggestions here are based on well-explored Git workflows and also are partially or in full implemented by other products such as Gerrit, Phabricator, or plain ol' email-based patch review.

# The Problem Today

The lifecycle of a GitHub pull request today is effectively one giant mutable changeset. This is a mess!

Here is a typical PR today: A contributor pushes a set of commits to a branch, opens a PR, and the PR now represents that branch. People discuss the PR through comments. When the contributor pushes new changes, they show up directly on the same PR, updating it immediately. Reviewers can leave comments and the contributor can push changes at the same time, and it all updates the same PR.

This has many problems:

1. A reviewer can leave a review for a previous state the PR was in and it can become immediately outdated because while the review was happening the contributor pushed changes.
2. Worse, a review can become partially outdated and the other feedback may not make sense in the context of the changes a contributor pushed. For example, a line comment may say "same feedback as the previous comment" but the previous comment is now gone/hidden because the contributor pushed a change that moved those lines.¹
3. Reviews don't contain any metadata about the commit they were attached to, only a timestamp they were submitted. A user can roughly correlate timestamp to commit but it isn't totally accurate because if a commit comes in during a review, the timestamp will seem to imply it was after the most recent commit but your review may have been against the prior commit. 😕
4. Work-in-progress commits towards addressing review feedback become visible as soon as the branch is pushed. This forces contributors to address all feedback in a single commit, or for reviewers to deal with partially-addressed feedback.
5. You can't easily scrub to prior states of the PR. If you want to review a set of earlier commits while ignoring later commits, you either have to manually build a "compare" view or use a local checkout (I do the latter). But through either approach you only get the code changes, you don't also get the point-in-time reviewer feedback!
6. Similar to the above, if a contributor pushes multiple new commits, you can't easily compare the new set of commits to the old. You can only really scrub one commit at a time. For this, you again have to fall back to local git to build up a diff manually.

And more... I talk about some more later, but I think I've made my point.

I'm sure I'm wrong about some detail about some of the points above. Someone is likely to say "he could've just done this to solve problem 5(a)". That's helpful! But, the point I'm trying to make is that if you step back, the fundamentals causing these problems are the real issue.
Namely, a single mutable changeset tracking a branch on a per-commit basis.

# Changesets

The solution is changesets: A pull request is versionable through a monotonic number (v1, v2, ...). These versions are often called "changesets."

Each changeset points to the state of a branch at a fixed time. These versions are immutable: when new commits are pushed, they become part of a new changeset. If the contributor force pushes the branch, that also becomes part of a new changeset. The previous changeset is saved forever.

A new changeset can be published immediately (per commit) or it can be deferred until the contributor decides to propose a new version for review. The latter allows a contributor to make multiple commits to address prior feedback and only publish those changes when they feel ready.

In the world of changesets, feedback is attached to a changeset. If a reviewer begins reviewing a changeset and a new changeset is published, that's okay because the review as an atomic unit is attached to the prior changeset.

In future changesets, it is often useful to denote that a file or line has unresolved comments in prior changesets. This ensures that feedback on earlier changesets is not lost and must be addressed before any changeset is accepted.

Typically, each changeset is represented by a different Git ref. For example, GitHub pull requests today are available as refs/pull/1234/head and you can use git locally to check out any pull request this way. A changeset would be something like refs/pull/1234/v2 (hypothetical) so you can also check out individual changesets.

Instead of "approving" a PR and merging, reviewers approve a changeset. This means that the contributor can also post multiple changesets with differing approaches to a problem in a single PR and the maintainer can potentially choose a non-latest changeset as the set of changes they want to merge.

# GitHub, Please!

Changesets are a well-established pattern across many open source projects and companies. They're already a well-explored user experience problem in existing products like Gerrit and Phabricator. I also believe changesets can be introduced in a non-breaking way (since current PRs are like single-mutable-changeset mode).

Changesets would make pull requests so much more scalable for larger projects and organizations. Besides the scalability, they make the review process cleaner and safer for both parties involved in pull requests.

Of course, I can only speak for myself and my experience, but this single major feature would dramatically improve my quality of life and capabilities while using GitHub.²

# Footnotes

1. This is a minor, inconvenient issue, but this issue scales up to a serious problem.
2. "Just don't use GitHub!" I've heard this feedback before. There are many other reasons I use GitHub today, so this is not a viable option for me personally right now. If you can get away with not using GitHub, then yes you can find changeset support in other products.
# CRDT Survey, Part 4: Further Topics

Matthew Weidner | Feb 13th, 2024

Keywords: CRDTs, bibliography

This blog post is Part 4 of a series.

- Part 1: Introduction
- Part 2: Semantic Techniques
- Part 3: Algorithmic Techniques
- Part 4: Further Topics

# Further Topics

This post concludes my CRDT survey by describing further topics that are outside my focus area. It is not a proper survey of those topics, but instead a lightweight bibliography, with links to a relevant resource or two. To find more papers about a given topic, see crdt.tech/papers.html.

My previous posts made several implicit assumptions, which I consider part of the “traditional” CRDT model:

1. All operations only target eventual consistency. There is no mechanism for coordination/strong consistency (e.g. a central server), even for just a subset of operations.
2. Users always wish to synchronize all data and all operations, as quickly as possible. There is no concept of partial access to a document, or of sync strategies that deliberately omit operations.
3. All devices are trustworthy and follow the given protocol.

The further topics below concern collaborative apps that violate at least one of these assumptions. I’ll end with some alternatives to CRDT design and an afterword.

# Table of Contents

- Incorporating Strong Consistency: Mixed Consistency • Server-Assisted CRDTs
- Incomplete Synchronization: Undo • Partial Replication • Versions and Branches
- Security: Byzantine Fault Tolerance • Access Control
- Alternatives to CRDT Design: Operational Transformation • Program Synthesis
- Afterword: Why Learn CRDT Theory?

# Incorporating Strong Consistency

Strong consistency is (roughly) the guarantee that all replicas see all operations in the same order. This contrasts with CRDTs’ eventual consistency guarantee, which allows different replicas to see concurrent operations in different orders. Existing work explores systems that mix CRDTs and strong consistency.

- “Consistency in Non-Transactional Distributed Storage Systems”, Viotti and Vukolić (2016). Provides an overview of various consistency guarantees.

# Mixed Consistency

Some apps need strong consistency for certain operations but allow CRDT-style optimistic updates for others. For example, a calendar app might use strong consistency when reserving a room, to prevent double-booking. Several papers describe mixed consistency systems or languages that are designed for such apps.

- “Making Geo-Replicated Systems Fast as Possible, Consistent when Necessary”, Li et al. (2012). Describes a system that offers “RedBlue consistency”, in which operations can be marked as needing strong consistency (red) or only eventual consistency (blue).

# Server-Assisted CRDTs

Besides providing strong consistency for certain operations, a central server can provide strong consistency for parts of the protocol while clients use CRDTs for optimistic local updates. For example, the server can decide which operations are valid or invalid, assign a final total order to operations, or coordinate “garbage collection” actions that are only possible in the absence of concurrent operations (e.g., deleting tombstones).

- “Replicache”, Rocicorp (2024). Client-side sync framework that uses a server to assign a final total order to operations, but allows clients to perform optimistic local updates using a “rebase” technique.
- “Making Operation-Based CRDTs Operation-Based”, Baquero, Almeida, and Shoker (2014). Defines causal stability: the condition that there can be no further operations concurrent to a given operation.
Causal stability is usually necessary for garbage collection actions, but tricky to ensure in a collaborative app where replicas come and go.

# Incomplete Synchronization

# Undo

Part 2 described exact undo, in which you apply your normal semantics to the operation history less undone operations. The same principle applies to other situations where you deliberately omit operations: rejected/reverted changes, partial replication, “cherry-picking” across branches, etc.

However, applying your normal semantics to “the operation history less undone operations” is more difficult than it sounds, because many semantics and algorithms assume causal-order delivery. For example:

- There are two common ways to describe an Add-Wins Set’s semantics, which are equivalent assuming causal-order delivery but inequivalent without it.
- To compare list CRDT positions, you often need some metadata (e.g., an underlying Fugue tree), which is not guaranteed to exist if you are missing prior operations.

It is an open problem to formulate undo-tolerant variants of various CRDTs.

- “Supporting Undo and Redo for Replicated Registers in Collaborative Applications”, Brattli and Yu (2021). Describes an undo-tolerant variant of the multi-value register.

# Partial Replication

Previous posts assumed that all collaborators always load an entire document and synchronize all operations on that document. In partial replication, replicas only store and synchronize parts of a document. Partial replication can be used as part of access control, or as an optimization for large documents - clients can leave most of the document on disk or on a storage server.

Existing CRDT libraries (Yjs, Automerge, etc.) do not have built-in partial replication, but instead encourage you to split collaborative state into multiple documents if needed, so that you can share or load each document separately. However, you then lose causal consistency guarantees between operations in different documents. It is also challenging to design systems that perform the actual replication, e.g., intelligently deciding which documents to fetch from a storage server.

- “Conflict-Free Partially Replicated Data Types”, Briquemont et al. 2015. Describes a system with explicit support for partial replication of CRDTs.
- “Automerge Repo”, Automerge contributors (2024). “Automerge Repo is a wrapper for the Automerge CRDT library which provides facilities to support working with many documents at once”.

# Versions and Branches

In a traditional collaborative app, all users want to see the same state, although they may temporarily diverge due to network latency. Version control systems instead support multiple branches that evolve independently, except during explicit merges. CRDTs are a natural fit for version control because their concurrency-aware semantics may lead to better merge results (operations in parallel branches are considered concurrent).

The original local-first essay already discussed versions and branches on top of CRDTs. Recent work has started exploring these ideas in practice.

- “Upwelling: Combining real-time collaboration with version control for writers”, McKelvey et al. (2023). A Google Docs-style editor with added version control features that uses CRDTs.
- “Proposal: Versioned Collaborative Documents”, Weidner (2023). A workshop paper where I propose an architecture that combines features of git and Google Docs.

# Security

# Byzantine Fault Tolerance

Traditional CRDTs assume that all devices are trustworthy and follow the given protocol.
A malicious or buggy device can easily deviate from the protocol in a way that confuses other users.

In particular, a malicious device could assign the same UID to two versions of an operation, then broadcast each version to a different subset of collaborators. Collaborators will likely process only the first version they receive, then reject the second as redundant. Thus the group will end up in a permanently inconsistent state.

A few papers explore how to make CRDTs Byzantine fault tolerant so that these inconsistent states do not occur, even in the face of arbitrary behavior by malicious collaborators.

- “Byzantine Eventual Consistency and the Fundamental Limits of Peer-to-Peer Databases”, Kleppmann and Howard (2020).
- “Making CRDTs Byzantine fault tolerant”, Kleppmann (2022).

# Access Control

In a purely local-first setting, access control - controlling who can read and write a collaborative document - is tricky to implement or even define. For example:

- If a user loses access to a collaborative document, what happens to their local copy?
- How should we handle tricky concurrent situations, such as a user who performs an operation concurrent to losing access, or two admins who demote each other concurrently?

A few systems implement protocols that handle these situations.

- “Matrix Decomposition: Analysis of an Access Control Approach on Transaction-based DAGs without Finality”, Jacob et al. (2020). Describes the access control protocol used by the Matrix chat network.
- “@localfirst/auth”, Caudill et al. (2024). “@localfirst/auth is a TypeScript library providing decentralized authentication and authorization for team collaboration, using a secure chain of cryptographic signatures.”

# Alternatives to CRDT Design

# Operational Transformation

In a collaborative text document, when one user inserts or deletes text, they shift later characters’ indices. This would cause trouble for other users’ concurrent editing operations if you interpreted their original indices literally:

[Figure: The *gray* cat jumped on **the** table. Alice typed " the" at index 17, but concurrently, Bob typed " gray" in front of her. From Bob's perspective, Alice's insert should happen at index 22.]

The CRDT way to fix this problem is to use immutable list CRDT positions instead of indices. Another class of algorithms, called Operational Transformation (OT), instead “transform” indices that you receive over the network to account for concurrent operations. So in the above figure, Bob would receive “Index 17” from Alice, but add 5 to it before processing the insertion, to account for the 5 characters that he concurrently inserted in front of it.
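To illustrate the transformation step, here is a minimal sketch for the insert-vs-insert case. The types and index values are my own assumptions; real OT systems also handle deletes, ties between equal indices, and server ordering.

```ts
// Transform a concurrent insert's index (insert-vs-insert case only).
type Insert = { index: number; text: string };

// Transform op `a` so it can be applied after concurrent op `b`.
function transformInsert(a: Insert, b: Insert): Insert {
  if (b.index <= a.index) {
    // b's text landed at or before a's position: shift a to the right.
    return { index: a.index + b.text.length, text: a.text };
  }
  return a;
}

// Bob transforms Alice's insert (" the" at index 17) against his own
// concurrent 5-character insert (" gray"): it becomes an insert at index 22.
const transformed = transformInsert(
  { index: 17, text: " the" },
  { index: 4, text: " gray" }
);
console.log(transformed.index); // 22
```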
I personally consider list CRDT positions to be the simpler mental model, because they stay the same over time. They also work in arbitrary networks. (Most deployed OT algorithms require a central server; decentralized OT exists but is notoriously complicated.) Nonetheless, OT algorithms predate CRDTs and are more widely deployed - in particular, by Google Docs.

- “An Integrating, Transformation-Oriented Approach to Concurrency Control and Undo in Group Editors”, Ressel, Nitsche-Ruhland, and Gunzenhäuser (1996). A classic OT paper.
- “Enhancing rich content wikis with real-time collaboration”, Ignat, André, and Oster (2017). Describes a central-server OT algorithm for block-based rich text.

# Program Synthesis

Traditional CRDT papers are algorithm papers that design a CRDT by hand and prove eventual consistency with pen-and-paper. Some more recent papers instead try to synthesize CRDTs automatically from some specification of their behavior - e.g., a sequential (single-threaded) data structure plus some invariants that it should preserve even in the face of concurrency.

Note that while synthesis may be easier than designing a CRDT from scratch, there is little guarantee that the synthesized semantics are reasonable in the eyes of your users. Indeed, some existing papers choose rather unusual semantics.

- “Katara: Synthesizing CRDTs with Verified Lifting”, Laddad et al. 2022. Synthesizes a CRDT from a specification by searching a space of composed state-based CRDTs until it finds one that works.

# Afterword: Why Learn CRDT Theory?

I hope you’ve enjoyed reading this blog series. Besides satisfying your curiosity, though, you might wonder: Why learn this in the first place?

Indeed, one approach to CRDTs is to let experts implement them in a library, then use those implementations without thinking too hard about how they work. That is the approach we take for ordinary local data structures, like Java Collections. It works well when you use a CRDT library for the task it was built for - e.g., Yjs makes it easy to add central-server live collaboration to various rich-text editors.

However, in my experience so far - both using and implementing CRDT libraries - it is hard to use a CRDT library in any way that the library creator didn’t anticipate. So if your app ends up needing some of the Further Topics above, or if you need to tune the collaborative semantics, you may be forced to develop a custom system. For that, CRDT theory will come in handy. (Though still consider using established tools when possible - especially for tricky parts like list CRDT positions.)

Even when you use an existing CRDT library, you might have practical questions about it, like:

- What invariants can I expect to hold?
- What happens if <…> fails?
- How will it perform under different workloads?

Understanding a bit of CRDT theory will help you answer these questions.

This blog post is Part 4 of a series.

- Part 1: Introduction
- Part 2: Semantic Techniques
- Part 3: Algorithmic Techniques
- Part 4: Further Topics
# CRDT Survey, Part 1: Introduction

Matthew Weidner | Sep 26th, 2023

Keywords: CRDTs, collaborative apps

This blog post is Part 1 of a series.

- Part 1: Introduction
- Part 2: Semantic Techniques
- Part 3: Algorithmic Techniques
- Part 4: Further Topics

# What is a CRDT?

Suppose you’re implementing a collaborative app. You’ve heard that Conflict-free Replicated Data Types (CRDTs) are a good fit and you want to know more about them.

If you look up the definition of CRDT, you will find that there are two main kinds, “op-based” and “state-based”, and these are defined using mathematical terms like “commutativity”, “semilattice”, etc. This is probably already more complicated than your mental model of a collaborative app, and I imagine it can be intimidating.

Let’s step back a bit and think about what you’re trying to accomplish.

In a collaborative app, users expect to see their own operations immediately, without waiting for a round-trip to a central server. This is especially true in a local-first app, where users can make edits even when they are offline, or when there is no central server.

Immediate local edits make it possible for users to perform operations concurrently: logically simultaneously, with no agreement on the order of operations. Those users will temporarily see different states. Eventually, though, they will synchronize with each other, combining their operations.

At that point, the collaborative app must decide: What is the state resulting from these concurrent operations? Because the operations were logically simultaneous, we shouldn’t just pretend that they happened in some sequential order. Instead, we need to combine them in a way that matches users’ expectations.

# Example: Ingredient List

Here is a simple example. The app is a collaborative recipe editor (Collabs demo), which includes an ingredient list:

[Figure: A list of ingredients: "Head of broccoli", "Oil", "Salt".]

Suppose that:

1. One user deletes the first ingredient (“Head of broccoli”).
2. Concurrently, a second user edits “Oil” to read “Olive Oil”.

We could try broadcasting the non-collaborative version of these operations: Delete ingredient 0; Prepend "Olive " to ingredient 1. But if another user applies those operations literally in that order, they’ll end up with “Olive Salt”:

[Figure: The wrong result: "Oil", "Olive Salt".]

Instead, you need to interpret those operations in a concurrency-aware way: Prepend "Olive" applies to the “Oil” ingredient regardless of its current index.

[Figure: The intended result: "Olive Oil", "Salt".]

# Semantics

In the above example, the users’ intended outcome was obvious. You can probably also anticipate how to implement this behavior: identify each ingredient by a unique ID instead of its index.
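A minimal sketch of that idea (the types and helper function are illustrative, not from the Collabs demo):

```ts
// Identify ingredients by unique ID, so a concurrent edit applies to the
// right ingredient even if its index changed (e.g., after a delete).
type Ingredient = { id: string; text: string };

// "Prepend" targets an ID, not an index; it is a no-op if the ingredient
// was deleted concurrently.
function prepend(list: Ingredient[], id: string, prefix: string): void {
  const ing = list.find((i) => i.id === id);
  if (ing !== undefined) ing.text = prefix + ing.text;
}
```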
Other situations can be more interesting. For example, starting from the last recipe above, suppose the two users perform two more operations:

1. The first user increases the amount of salt to 3 mL.
2. Concurrently, the second user clicks a “Halve the recipe” button, which halves all amounts.

We’d like to preserve both users’ edits. Also, since it’s a recipe, the ratio of amounts is more important than their absolute values. Thus you should aim for the following result:

[Figure: An ingredients list starts with 15 mL Olive Oil and 2 mL Salt. One user edits the amount of Salt to 3 mL. Concurrently, another user halves the recipe (7.5 mL Olive Oil, 1 mL Salt). The final state is: 7.5 mL Olive Oil, 1.5 mL Salt.]

In general, a collaborative app’s semantics are an abstract description of what the app’s state should be, given the operations that users have performed. This state must be consistent across users: two users who are aware of the same operations must be in the same state.

Choosing an app’s semantics is more difficult than it sounds, because they must be well-defined in arbitrarily complex situations. For example, if a third user performs a bunch of operations concurrently to the four operations above, the collaborative recipe editor must still end up in some consistent state - hopefully one that makes sense to users.

# Algorithms

Once you’ve chosen your collaborative app’s semantics, you need to implement them, using algorithms on each user device.

For example, suppose train conductors at different doors of a train count the number of passengers boarding. When a conductor sees a passenger board through their door, they increment the collaborative count. Increments don’t always synchronize immediately due to flaky internet; this is fine as long as all conductors eventually converge to the correct total count.

The app’s semantics are obvious: its state should be the number of increment operations so far, regardless of concurrency. Specifically, an individual conductor’s app will display the number of increment operations that it is aware of.

# Op-Based CRDTs

Here is one algorithm that implements these semantics:

- Per-user state: The current count, initially 0.
- Operation inc(): Broadcast a message +1 to all devices. Upon receiving this message, each user increments their own state. (The initiator also processes the message, immediately.)

[Figure: Top user performs inc(), changing their state from 0 to 1. They broadcast a "+1" message to other users. Upon receipt, those two users each change their states from 0 to 1.]

It is assumed that when a user broadcasts a message, it is eventually received by all collaborators, without duplication (i.e., exactly-once). For example, each user could send their messages to a server; the server stores these messages and forwards them to other users, who filter out duplicates.

Algorithms in this style are called op-based CRDTs. They work even if the network has unbounded delays or delivers messages in different orders to different users. Indeed, the above algorithm matches our chosen semantics even under those conditions.
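In code, the op-based counter might look like this minimal sketch, assuming the network layer provides the broadcast callback and exactly-once delivery:

```ts
// Sketch of the op-based counter CRDT described above.
class OpBasedCounter {
  count = 0; // per-user state

  constructor(private broadcast: (msg: "+1") => void) {}

  // Operation inc(): broadcast "+1"; every replica (including this one)
  // processes the message via receive().
  inc(): void {
    this.broadcast("+1");
  }

  // Called exactly once for every "+1" message, local or remote.
  receive(msg: "+1"): void {
    this.count++;
  }
}
```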
# State-Based CRDTs

Op-based CRDTs can also be used in peer-to-peer networks without any servers. However, usually each user needs to store a complete message history, in case another user requests old messages. (It is a good idea to let users forward each other’s messages, not just their own.) This has a high storage cost: O(total count) in our example app.

A state-based CRDT is a different kind of algorithm that sometimes reduces this storage cost. It consists of:

- A per-user state. Implicitly, this state encodes the set of operations that the user is aware of, but it is allowed to be a lossy encoding.
- For each operation (e.g. inc()), a function that updates the local state to reflect that operation.
- A merge function that inputs a second state and updates the local state to reflect the union of sets of operations:

[Figure: (Set of +1s mapping to "Current state") plus (set of +1s mapping to "Other state") becomes (set of +1s mapping to "Merged state"). Each state has a box with color-coded +1 ops: Current state has orange, red, green; Other state has orange, red, purple, blue, yellow; Merged state has orange, red, green, purple, blue, yellow.]

We’ll see an optimized state-based counter CRDT later in this blog series. But briefly, instead of storing the complete message history, you store a map from each device to the number of inc() operations performed by that device. This encodes the operation history in a way that permits merging (entrywise max), but has storage cost O(# devices) instead of O(total count).
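A minimal sketch of that optimized state-based counter (the class and field names are my own):

```ts
// State-based counter: a map from each device to its number of inc() ops,
// merged by entrywise max.
class StateBasedCounter {
  private counts = new Map<string, number>(); // deviceID -> # of inc() ops

  constructor(private readonly deviceID: string) {}

  inc(): void {
    this.counts.set(this.deviceID, (this.counts.get(this.deviceID) ?? 0) + 1);
  }

  // Merge another replica's state: entrywise max over all devices. This
  // reflects the union of the two sets of operations.
  merge(other: StateBasedCounter): void {
    for (const [device, count] of other.counts) {
      this.counts.set(device, Math.max(this.counts.get(device) ?? 0, count));
    }
  }

  get value(): number {
    let total = 0;
    for (const count of this.counts.values()) total += count;
    return total;
  }
}
```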
Unless you need state-based merging, the counting app doesn’t really require specialized knowledge - it just counts in the obvious way. But you can still talk about the “op-based counter CRDT” if you want to sound fancy.

# Defining CRDTs

One way to define a CRDT is as “either an op-based CRDT or a state-based CRDT”. In practice, we can be more flexible. For example, CRDT libraries usually implement hybrid op-based/state-based CRDTs, which let you use both op-based messaging and state-based merging in the same app.

I like this broader, informal definition: A CRDT is a distributed algorithm that computes a collaborative app’s state from its operation history. This state must depend only on the operations that users have performed: it must be the same for all users that are aware of the same operations, and it must be computable without extra coordination or help from a single source-of-truth. (Message forwarding servers are okay, but you’re not allowed to put a central server in charge of the state like in a traditional web app.)

Of course, you can also use CRDT techniques even when you do have a central server. For example, a collaborative text editor could use a CRDT to manage the text, but a server DB to manage permissions.

# Outline of this Survey

In this blog series, I will survey CRDT techniques for collaborative apps. My goal is to demystify and summarize the techniques I know about, so that you can learn them too without a PhD’s worth of effort.

The techniques are divided into two “topics”, corresponding to the sections above:

- Semantic techniques help you decide what a collaborative app’s state should be.
- Algorithmic techniques tell you how to compute that state efficiently.

Part 2 (the next post) covers semantic techniques; I’ll cover algorithmic techniques in Part 3.

Since I’m the lead developer for the Collabs CRDT library, I’ll also mention where various techniques appear in Collabs. You can find a summary in the Collabs docs.

This survey is opinionated: I omit or advise against techniques that I believe don’t translate well to collaborative apps. (CRDTs are also used in distributed data stores, which have different requirements.) I also can’t hope to mention every CRDT paper or topic area; for that, see crdt.tech. If you believe that I’ve treated a technique unfairly - including your own! - please feel free to contact me: mweidner037 [at] gmail.com.

I thank Martin Kleppmann, Jake Teton-Landis, and Florian Jacob for feedback on portions of this survey. Any mistakes or bad ideas are my own.

# Sources

I’ll cite specific references in-line, but here are some sources of general inspiration.

- Pure Operation-Based Replicated Data Types, Carlos Baquero, Paulo Sergio Almeida, and Ali Shoker (2017).
- How Figma’s multiplayer technology works, Evan Wallace (2019).
- Tackling Consistency-related Design Challenges of Distributed Data-Intensive Systems - An Action Research Study, Susanne Braun, Stefan Deßloch, Eberhard Wolff, Frank Elberzhager, and Andreas Jedlitschka (2021).
- A Framework for Convergence, Matt Wonlaw (2023).

Next Post - Part 2: Semantics

This blog post is Part 1 of a series.

- Part 1: Introduction
- Part 2: Semantic Techniques
- Part 3: Algorithmic Techniques
- Part 4: Further Topics
so uh yeah my unpopular opinion is thatgit is awful and it needs to die die diedie wowwhy why why why why yeah well because itdoesn't scale uh among other things alsoit's completely unintuitive and it'shonestly it's it'soh God you're gonna get me ranting uhlook first of all I've seenI don't get the unintuitive part I guessbecause I used SVN and perforce thosedon't feel very intuitive eitheras fian emotionally hurt me I did lovetortoise thoughtortoise SVNand a company that bisco and I bothworked atcomputers unlimited which sounds like a1970s computer repair shop so theyrenamed it Tim's which isnow sounds like a fishing shopumthey had a evenrapper around everything calledtrebuchet which everyone just calledtree bucketand I hated it I hated my life I hatedit I hate it I don't want to go back toSVNthe thing is is that I don't mindspecifically on something like this Idon't mind if you on it but I wantto hear a good alternative all rightlet's let's see some Alternatives herevisions of control systems come and goright I started with RCS and then CVSand then SVN and then pert force and itwent on and on Piper at Google and andthen get in Mercurio and I mean get wasjust another one and it had greatmarketing it had great it had some sortof great virality it's really kind ofgarbage the whole thing is just it'svery powerful and flexible but uh and itdoesn't scale like fundamentally all thecompanies that we work with that use githave you know maybe a hundred thousandgit repos what what are you gonna dowith a hundred thousand git repos youknow Android struggled with thismightily when I was uh on the Androidteam at Google right I mean just we hadall these huge rappers around git todeal with multiple repos because of theopen source problem and you know theinternal stuff and I just learned tohate git but is that a git problem wasthat a git problem or was that not a gitproblem you have part of first off youhave part of your thing that's opensource part of it that's not open sourceby the way the not open source partthat's the spying part that's the NSAbackdoor issue you know what I'm talkingabout that's the back door partum anyway so when you're hiding yourback door now you're trying to have liketwo repos that are going togetherI mean that's always difficult like isthat ever easy I don't think it issub modulesdid you just say the word sub modulesub module is the world's greatest ideathat is emotionally painful every timeit happensnobody likes sub moduleseverybody thinks sub modules are goingto be good nobody loves sub moduleswe still use sub modulesI hate it hereI hate git and I think that there's anunhealthy dependence on both git andGitHub I think that's fair I think theGitHub Reliance thing is actually a realthat that's a fair critique which is nowwe we really just wrapped up a singlepoint of failureum and it's also owned by Microsoft so Idon't trust it at all now I'm justwondering when the next monetization isgoing to come out from it other thanusing all my sweet sweet programswhether or not my license says not tofor co-pilot but besides for that I'mjust sayingGitHub is a monopoly they're closedfundamentally I think that that microMicrosoft under Satya has been a veryopen ecosystem right look at vs codeyeah right look at I mean they've beensome really cool stuff but GitHub Idon't I don't trust this guy's opinionanymore this is getting hard for me toreally love this opinionbecause I'm just saying that's that's ahard opinion to believe that it's justreally open when a copilot X Works onlyfor vs code uh 
...they're building these, like, spaces online and everything. They're just trying to take your computer, take everything, and just charge you monthly for it. I don't trust Microsoft one bit. It's an acquisition, and they're still very "oh, we're just doing free, and we love..." [Music] Look, really: when has a company ever been really great for no reason? Come on. Very, very closed off. And I think developers like GitHub too much. I mean, maybe the alternatives aren't that great, but I see this... you like it just too much. Too much liking going on. It will stop, from here on out.

I do get what he's saying, though. I do get what he's saying. I mean, the thing is, it's a good product. Have you used Stash? Stash is not that much fun, okay? I haven't used GitLab; I don't have enough use with GitLab to really understand it.

"...its attachment to it. And I'm like: stuff changes, and you're not holding a high enough bar. It's not good enough. You should never be satisfied with the state of the art if it's killing you, and right now the state of the art is killing us."

How's it killing us? "Yeah: 100,000 repos, every company." Okay, that's the "killing us", but why is that bad? I gotta understand: why is having multiple repos bad? Why is having Netflix's Python (or whatever the hell the language is) repo, the one designed to help produce algorithmic recommendations, separated from the UI library... why is that good or bad? Right? To me, I don't see why they have to be the same code base. I don't even see why they're related.

"But who's feeling that pain?" I'm not feeling that pain. "Well, you don't have that many repos. Is anybody working on this? Who's working on this? Well, Facebook just launched, what is it, Sapling, right? Which looks kind of promising, although it didn't start getting the acceleration that I was looking at. But changing code hosts is a big deal for people." "You guys want an unpopular opinion? I'm giving you, like, potentially the most..." "No, no, we're with you. This is part of the game here. We're playing the game. I'm enjoying this. I'm considering it. I do like GitHub. I'm wondering: you said maybe they like it too much, and I'm thinking the product is good, though. So that's why I like it. It's good. It's decent." "You haven't used Google's tools." "True."

Checkmate. I'm curious about Google's internal tools. I've never worked at Google. It'd be fun to actually use them: what does a company that builds everything themselves look like, right? "I know you've never used Google tools, i.e. you never passed Google's interview, i.e. wrecked." Draw the little square. Proved. But real talk: I would like to try out Google's tools. I've never worked at Google; I think it'd be a lot of fun. Ooh, could be cool. I think I'd get bored pretty quick at Google though, real talk. I think I'd get pretty, pretty bored.

So if Google's tools are that much better, then why doesn't Google make a better revisioning system? And then why doesn't Google create the hosting for it, and why doesn't Google just simply make all the monies from it? Right? There has to be a lot of monies in it. And you've got Google Bard, the subpar AI. Why not just have a hosting service for free and then slurp up all the data from it?

"Some people need a Sourcegraph for their Sourcegraph." Sourcegraph, I would argue, is... you know, because the folks at Sourcegraph actually love GitHub and haven't used Google's tools. I mean, Sourcegraph is better than GitHub in a lot of ways, but Sourcegraph doesn't try to be GitHub, with all the workflows and all that stuff, right?

TJ: wrecked. TJ is in shambles. TJ, are you in shambles right now? Shambled TJ. Can we get a shout-out for TJ? He works at Sourcegraph. "Why would I be in shambles?" Because you're just, like, a not-as-good GitHub, is what I take from this, and GitHub's killing people, so that means you're killing people faster.

"What I mean by that is: Beyang worked there, and Quinn worked there. The history of knowing Google tools, using Google tools, and then being an expat of Google and then doing something without it; that's what I mean. They were the inspiration." Right. "We need somebody to say: okay, I've been in Google, and I've used Google tooling, and we need a non-GitHub that is Google tooling, but better. A startup that knows Google's tools and can then recreate them." Yeah. Who's gonna do that? Who? I think doing that sounds unfun, I'm just gonna throw it out there. I don't think I would like doing that, personally. Who would you bet on to do that? You mean, to come up with a Google-style tool?

I really wish you would qualify what a Google tool is, because I think that would help a lot of us try to understand exactly what he means about why Google is better. I'd love an example, or a diagram, or something; something so I understand why it's better, because I guess I'm a little confused as to why it's better. "Think Angular." I don't want to think about that. Come on, man. That's my... dreams. Oh, I see. So this is not the beating; this might be the beginning of a new story. Sorry. By the way, his volume is very low; it's been hard, I'm trying to get it as good as I can.

"It could be. I like that. So git isn't good enough, and I think you said GitHub is bad. I'm just trying to think of how..."

I would love a compare feature that doesn't require me to have a pull request and then manipulate the URL. Just saying, that'd be kind of nice. Just give me a button that says compare, give me a button that says diff, if I was going to put this into it.

"We're not going to try to... if you've been trying to read the tea leaves: I'm not trying to tackle GitHub or anything like that right now, Karen. Unfortunately, it's the least bad, right? It's the least of all the bad options." Least bad of all, right there. "But I truly believe it could be a lot better, and that AI is going to make it a lot better. And, you know, I love being at Sourcegraph because we can actually bring that experience across."

"There are still Bitbucket users in the world. I'm one." I mean, to be fair, I use Stash, which is, like... that is Bitbucket, so I get it. Netflix is all on Bitbucket, so, you know, I deal with them. Honestly, Bitbucket's not all that bad. We have a good setup. You know, if I want some automation, I click a bunch of buttons on Jenkins for about 20 minutes until I find the job that I actually want, because their search is really confusing and we have multiple subdomains for different builds. Once I find the build I want, I copy the build and the configuration, then when I get the configuration I update it to point at the new Stash URL, and then boom: I got CI. CI, you know what I mean? It's pretty neat.

"Not that Bitbucket's great either, but have you used Fossil? Fossil SCM, from Richard Hipp of SQLite? It's a completely different way of thinking about it. Maybe give that a look. Fossil: you never commit, right? Like, you commit, but everything's always synchronized, around every piece. It's still distributed, but it's always synchronized. It's never on your machine only; you never have to git push master, it's just there. I don't know if it scales or not, but they use it for SQLite."

"Yeah, I think fundamentally we need somebody who comes at this from the perspective of: we need to make this scale up to world-scale code bases. And I think that will ultimately come out of industry. I think Facebook's Sapling might be the closest, but we'll see. Or ex-Google: somebody who leaves Google and says, I need a company. You know what was good? Google's tools. You know what's bad? Git. And I'm gonna try to tackle this. That could happen."

I still want to know what makes it better. Just give me a three-second explanation: what's the thing? Gosh, I'm so curious. Is this actually... is Google hiring a guy to secretly get people to want to work at Google, by not telling them, but telling them how great their tools are? Is this like a psyop of the greatest amount? Because I'm feeling very psyopped right now. I'm feeling super psyopped. All of a sudden I have this great desire to go work at Google, to go touch their tools. I'm getting foreplayed. This is foreplay.

"What are Google's tools in this case? Like, what do they have that would be better, the scale version of it? Can you describe it, or is it under, like, NDA and you can't tell anything about it forever?" I got foreplayed for five minutes straight and now we're hearing about it. All right.

"Yeah, it's just that the code graph gets exposed across the entire workflow. So on GitHub, if you see a symbol in a pull request, you can't click on it, or hover it, or get graph information about it like in your IDE. They don't have IDE-quality indexing. When you're looking at trace logs or debug views or whatever, all of that stuff is completely and fully instrumented at Google. So in other words, their ability to discover and diagnose problems is just unprecedented. There's a Gestalt to it that's really hard to get across."

That sounds like what Sourcegraph's doing. He works at Sourcegraph, right? I mean, I love that idea. I would like that. But to me, what he just described isn't a git problem specifically. Isn't it a tool on top of git? Because is the revision system the real thing that's needed, is that what causes it, or is this actually just a tool on top of code, to be smarter about everything? It feels like it's a thing we've abstracted over git; you can use other version control systems as well. Yeah, that's what I mean: is the version control really the place this should live to begin with? It kind of seems like this is something bigger, right? This is LSP for the future.

How have there been three golden Kappas in this chat? Why are people getting golden Kappas? I never get a golden Kappa. Why is my Kappa never golden? "Okay, yes, that's also what my team works on. We've talked about this for a while." I know, I know, we've talked about this. But that's what I'm saying: I don't get the dislike for git, I guess. That's where I'm struggling. I don't see the connection between git and what he just described. Maybe one could argue that you could... I mean, could you store language-level features in a revision system? Because that's kind of what I see: there's a revision system which would have to store stuff about what's happening. Because how is it supposed to understand JavaScript, or the linking between two libraries that are internal? Like, why would you want to do that? That's why I love Sourcegraph, right?

I want more Sourcegraph in my life, not less: more. I want more TJ. "But it's a very comfy environment, by the way." TJ, if I ever decide to try to get a different job, I would apply to Sourcegraph. Just let them know that they've won me over, okay? And it was you. If I were to apply somewhere else, I'd apply to Sourcegraph.

"It's like a world-scale IDE, almost, except it's distributed internally." Can I use some of your words, from what you wrote? This is in regards to code search at Google, so I would imagine there's some similarity in satisfaction score, potentially, for this intelligence, I suppose. You said (by the way, this guy's voice is the only normalized voice in this whole thing): "Google Code Search is like the Matrix, except for developers. It has a near-perfect satisfaction score on Google's internal surveys, and pretty much every dev who leaves Google misses it." This is in regards to code search. "This is the reason why Sourcegraph exists: because this was only at Google, and everybody else needs it too." You went on to say Google engineers today can navigate and understand their own multi-billion-line code base better than perhaps any other group of devs in such a large environment. So are you saying that they have a tool, like git or GitHub, that gives them intelligence this much better, and no one else has access to this thing? This is their proverbial secret sauce behind the scenes, to be more efficient as an engineering team, despite everyone now having to work on AI and kind of being behind the ball? "That's right. Okay, that's right. Their tooling environment, in fact Google's entire infrastructure stack, not just the tools but everything you use as a developer there, even the docs and stuff, is just unbelievably good. Stupidly good. Like, you just come in and you're like, what? It just makes no sense. The rest of the world just feels like people banging on rocks compared to Google's stuff."

So, I have long thought that how Google does stuff, their technical artifacts, how they do promotions, creates a lot of perverse incentives. But based on what he's saying, maybe there are some things that are good from it. This is Hooli propaganda. I think everyone is correct, which is: if they're so good at all this, why do they kill everything they create? Netflix tooling: we do not have anything like this. Nothing at all. Nothing. I could never convince them to give it to the rest of the developers.

Killed by Google: Google Domains, Google Optimize, Google Cloud IoT Core, Google Album Archive, YouTube Stories, Grasshopper, Conversational Actions, Google Currents, the Google Street View standalone app, Jacquard, Google Code competitions, Google Stadia (RIP Stadia), Google OnHub... okay, YouTube Originals... wow. Oh my goodness, thank you. Oh gosh, I can't read all this.

Okay, some of these are fair, though. Some of these have to be fair. Like, Google killed the Google Desktop Bar: "Desktop Bar was a small inset window on the Windows toolbar to allow users to perform search without leaving the desktop." I could see why this would be killed, right? I think some of this is a little unfair on some of theirs, right? That was 17 years ago. I know, I'm just saying. I mean, obviously it's fun to make fun of Google killing tons of projects, but let's be serious: Google's market cap is 1.7 trillion dollars. They are shipping stuff that powers such an incredible amount of the world. Yes. And so you have to explore a bunch, and you have to create stuff that people like, but if it doesn't make a material difference, you have to kill it, because you don't want to have staff and people hired around stuff that isn't producing anything. Like, I get it. I'm not against what they're doing, but the point still stands: they kill stuff. I mean, Netflix kills a bunch of features too; you just don't hear about it, because our features are all within a player. If only Steve Ballmer had been there screaming about developers, right? Okay.

Falcor! Falcor. Hey, TJ, you know what problem I had just yesterday? My Falcor client couldn't connect to the database server on the current latest build of the TV UI. So guess what: Falcor's still alive. Falcor's still alive! It hasn't been developed on in literally half a decade, but it's still alive. That would do it, that would do it.

Well, this has been a very good unpopular opinion. Maybe. I like it. Yeah, I appreciate it. I'm still thinking about it. I would love to see it. I'd love to see, like, a demo, you know, for the rest of the world. Because sometimes you don't know you're banging on rocks until you see somebody who has a more sophisticated tool. Kind of like when you're in VS Code and you don't realize how sophisticated it can be, and then you see someone in Neovim and you're just like: wow, I've been banging on rocks this whole time. Like, oh, I could do that.

"Yeah, I mean, you can see Google code search if you just type 'chromium code search'; they've indexed it." I actually did. I do like the Chromium code search stuff. I used it quite a bit to explore the V8 engine. It's very, very good. Their Chromium code search stuff is incredible. "This was, on my way out, my swan song at Google, at least on the code search team: indexing the Android and Chromium code bases. So you can play with it. It doesn't have all of the functionality, but it has a lot, and you can see it's very slick. The navigation, it's just really, really good. But that's only the search stuff, and that's actually not even used that much compared to some of the other things, like their quick editor, Cider, and then CitC, their clients in the cloud; they have basically cloud-based clients. And oh my God, the stuff they have is like science fiction. Still. The stuff they had 15 years ago is still science fiction for the rest of the world. They have high-speed networks, and they can do incredible... it's really nice, it's really nice. So yeah, the rest of the world is kind of hurting, and that's why I'm still in this space: because I think the rest of the world needs to get to where Google's at."

Wow. Thank you, Changelog, that was awesome. Hey, that's me in my own video, being promoted to me. Thank you. That was actually really good. I really liked that. Okay, so I'm still confused by the git thing, still. I don't understand why git is the problem in this stuff. But I understand what he's trying to say, which is that the tools we use are absolutely awful compared to what, say, Google has. Either way, this is very interesting. I would love... I mean, I really want to go experience the Google tools, just to feel them, and then take that experience and go: that's what they mean; this is why it's bad. I would love that experience.

Either way, I don't understand why git gets, like, so battered. "It's killing us." It seems kind of intense, but I guess I haven't seen the other side. Maybe I'm just the one with rocks here. Okay. Again...
# Document Title
In a git repository, where do your files live?
• git •

Hello! I was talking to a friend about how git works today, and we got onto the topic: where does git store your files? We know that it's in your .git directory, but where exactly in there are all the versions of your old files?

For example, this blog is in a git repository, and it contains a file called content/post/2019-06-28-brag-doc.markdown. Where is that in my .git folder? And where are the old versions of that file? Let's investigate by writing some very short Python programs.

git stores files in .git/objects

Every previous version of every file in your repository is in .git/objects. For example, for this blog, .git/objects contains 2700 files.

$ find .git/objects/ -type f | wc -l
2761

note: .git/objects actually has more information than "every previous version of every file in your repository", but we're not going to get into that just yet

Here's a very short Python program (find-git-object.py) that finds out where any given file is stored in .git/objects.

import hashlib
import sys

def object_path(content):
    header = f"blob {len(content)}\0"
    data = header.encode() + content
    digest = hashlib.sha1(data).hexdigest()
    return f".git/objects/{digest[:2]}/{digest[2:]}"

with open(sys.argv[1], "rb") as f:
    print(object_path(f.read()))

What this does is:

read the contents of the file
calculate a header (blob 16673\0) and combine it with the contents
calculate the sha1 sum (8ae33121a9af82dd99d6d706d037204251d41d54 in this case)
translate that sha1 sum into a path (.git/objects/8a/e33121a9af82dd99d6d706d037204251d41d54)

We can run it like this:

$ python3 find-git-object.py content/post/2019-06-28-brag-doc.markdown
.git/objects/8a/e33121a9af82dd99d6d706d037204251d41d54

jargon: "content addressed storage"

The term for this storage strategy (where the filename of an object in the database is the same as the hash of the file's contents) is "content addressed storage".

One neat thing about content addressed storage is that if I have two files (or 50 files!) with the exact same contents, that doesn't take up any extra space in Git's database: if the hash of the contents is aabbbbbbbbbbbbbbbbbbbbbbbbb, they'll both be stored in .git/objects/aa/bbbbbbbbbbbbbbbbbbbbb.
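You can see that for yourself with a tiny sketch that reuses the object_path logic from find-git-object.py above:

import hashlib

def object_path(content):
    header = f"blob {len(content)}\0"
    digest = hashlib.sha1(header.encode() + content).hexdigest()
    return f".git/objects/{digest[:2]}/{digest[2:]}"

# two "files" with byte-identical contents land at exactly the same
# path, so git only ever stores one copy of those contents
print(object_path(b"same bytes\n"))
print(object_path(b"same bytes\n"))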
how are those objects encoded?

If I try to look at this file in .git/objects, it gets a bit weird:

$ cat .git/objects/8a/e33121a9af82dd99d6d706d037204251d41d54
x^A<8D><9B>}s<E3>Ƒ<C6><EF>o|<8A>^Q<9D><EC>ju<92><E8><DD>\<9C><9C>*<89>j<FD>^...

What's going on? Let's run file on it:

$ file .git/objects/8a/e33121a9af82dd99d6d706d037204251d41d54
.git/objects/8a/e33121a9af82dd99d6d706d037204251d41d54: zlib compressed data

It's just compressed! We can write another little Python program called decompress.py that uses the zlib module to decompress the data:

import zlib
import sys

with open(sys.argv[1], "rb") as f:
    content = f.read()
    print(zlib.decompress(content).decode())

Now let's decompress it:

$ python3 decompress.py .git/objects/8a/e33121a9af82dd99d6d706d037204251d41d54
blob 16673
---
title: "Get your work recognized: write a brag document"
date: 2019-06-28T18:46:02Z
url: /blog/brag-documents/
categories: []
---
... the entire blog post ...

So this data is encoded in a pretty simple way: there's this blob 16673\0 thing, and then the full contents of the file.

there aren't any diffs

One thing that surprised me the first time I learned it: there aren't any diffs here! That file is the 9th version of that blog post, but the version git stores in .git/objects is the whole file, not the diff from the previous version.

Git actually sometimes also does store files as diffs (when you run git gc it can combine multiple different files into a "packfile" for efficiency), but I have never needed to think about that in my life so we're not going to get into it. Aditya Mukerjee has a great post called Unpacking Git packfiles about how the format works.

what about older versions of the blog post?

Now you might be wondering: if there are 8 previous versions of that blog post (before I fixed some typos), where are they in the .git/objects directory? How do we find them?

First, let's find every commit where that file changed with git log:

$ git log --oneline content/post/2019-06-28-brag-doc.markdown
c6d4db2d
423cd76a
7e91d7d0
f105905a
b6d23643
998a46dd
67a26b04
d9999f17
026c0f52
72442b67

Now let's pick a previous commit, let's say 026c0f52. Commits are also stored in .git/objects, and we can try to look at it there. But the commit isn't there! ls .git/objects/02/6c* doesn't have any results! You know how we mentioned "sometimes git packs objects to save space but we don't need to worry about it"? I guess now is the time that we need to worry about it. So let's take care of that.

let's unpack some objects

So we need to unpack the objects from the pack files. I looked it up on Stack Overflow and apparently you can do it like this:

$ mv .git/objects/pack/pack-adeb3c14576443e593a3161e7e1b202faba73f54.pack .
$ git unpack-objects < pack-adeb3c14576443e593a3161e7e1b202faba73f54.pack

This is weird repository surgery so it's a bit alarming but I can always just clone the repository from Github again if I mess it up, so I wasn't too worried.

After unpacking all the object files, we end up with way more objects: about 20000 instead of about 2700. Neat.

$ find .git/objects/ -type f | wc -l
20138

back to looking at a commit

Now we can go back to looking at our commit 026c0f52. You know how we said that not everything in .git/objects is a file? Some of them are commits! And to figure out where the old version of our post content/post/2019-06-28-brag-doc.markdown is stored, we need to dig pretty deep into this commit. The first step is to look at the commit in .git/objects.

commit step 1: look at the commit

The commit 026c0f52 is now in .git/objects/02/6c0f5208c5ea10608afc9252c4a56c1ac1d7e4 after doing some unpacking and we can look at it like this:

$ python3 decompress.py .git/objects/02/6c0f5208c5ea10608afc9252c4a56c1ac1d7e4
commit 211
tree 01832a9109ab738dac78ee4e95024c74b9b71c27
parent 72442b67590ae1fcbfe05883a351d822454e3826
author Julia Evans <julia@jvns.ca> 1561998673 -0400
committer Julia Evans <julia@jvns.ca> 1561998673 -0400

brag doc

We can also get the same information with git cat-file -p 026c0f52, which does the same thing but does a better job of formatting the data. (the -p option means "format it nicely please")
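As an aside, the commit format is simple enough that we can pull the interesting fields out of it ourselves. Here's a sketch (commit_info is a made-up helper, building on decompress.py above) that extracts the tree and parent ids from a decompressed commit object:

import zlib

def commit_info(path):
    data = zlib.decompress(open(path, "rb").read()).decode()
    body = data.split("\x00", 1)[1]   # drop the "commit 211" header
    info = {}
    for line in body.splitlines():
        if line == "":
            break                     # blank line: the commit message follows
        key, _, value = line.partition(" ")
        # setdefault keeps only the first value if a key repeats
        # (merge commits have two "parent" lines)
        info.setdefault(key, value)
    return info

print(commit_info(".git/objects/02/6c0f5208c5ea10608afc9252c4a56c1ac1d7e4"))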
commit step 2: look at the tree

This commit has a tree. What's that? Well let's take a look. The tree's ID is 01832a9109ab738dac78ee4e95024c74b9b71c27, and we can use our decompress.py script from earlier to look at that git object. (though I had to remove the .decode() to get the script to not crash)

$ python3 decompress.py .git/objects/01/832a9109ab738dac78ee4e95024c74b9b71c27
b'tree 396\x00100644 .gitignore\x00\xc3\xf7`$8\x9b\x8dO\x19/\x18\xb7}|\xc7\xce\x8e:h\xad100644 README.md\x00~\xba\xec\xb3\x11\xa0^\x1c\xa9\xa4?\x1e\xb9\x0f\x1cfG\x96\x0b

This is formatted in kind of an unreadable way. The main display issue here is that the object hashes (\xc3\xf7$8\x9b\x8dO\x19/\x18\xb7}|\xc7\xce\…) are raw bytes instead of being encoded in hexadecimal. So we see \xc3\xf7$8\x9b\x8d instead of c3f76024389b8d. Let's switch over to using git cat-file -p which formats the data in a friendlier way, because I don't feel like writing a parser for that.

$ git cat-file -p 01832a9109ab738dac78ee4e95024c74b9b71c27
100644 blob c3f76024389b8d4f192f18b77d7cc7ce8e3a68ad    .gitignore
100644 blob 7ebaecb311a05e1ca9a43f1eb90f1c6647960bc1    README.md
100644 blob 0f21dc9bf1a73afc89634bac586271384e24b2c9    Rakefile
100644 blob 00b9d54abd71119737d33ee5d29d81ebdcea5a37    config.yaml
040000 tree 61ad34108a327a163cdd66fa1a86342dcef4518e    content <-- this is where we're going next
040000 tree 6d8543e9eeba67748ded7b5f88b781016200db6f    layouts
100644 blob 22a321a88157293c81e4ddcfef4844c6c698c26f    mystery.rb
040000 tree 8157dc84a37fca4cb13e1257f37a7dd35cfe391e    scripts
040000 tree 84fe9c4cb9cef83e78e90a7fbf33a9a799d7be60    static
040000 tree 34fd3aa2625ba784bced4a95db6154806ae1d9ee    themes

This is showing us all of the files I had in the root directory of the repository as of that commit. Looks like I accidentally committed some file called mystery.rb at some point which I later removed.

Our file is in the content directory, so let's look at that tree: 61ad34108a327a163cdd66fa1a86342dcef4518e

commit step 3: yet another tree

$ git cat-file -p 61ad34108a327a163cdd66fa1a86342dcef4518e
040000 tree 1168078878f9d500ea4e7462a9cd29cbdf4f9a56    about
100644 blob e06d03f28d58982a5b8282a61c4d3cd5ca793005    newsletter.markdown
040000 tree 1f94b8103ca9b6714614614ed79254feb1d9676c    post <-- where we're going next!
100644 blob 2d7d22581e64ef9077455d834d18c209a8f05302    profiler-project.markdown
040000 tree 06bd3cee1ed46cf403d9d5a201232af5697527bb    projects
040000 tree 65e9357973f0cc60bedaa511489a9c2eeab73c29    talks
040000 tree 8a9d561d536b955209def58f5255fc7fe9523efd    zines

Still not done…

commit step 4: one more tree….

The file we're looking for is in the post/ directory, so there's one more tree:

$ git cat-file -p 1f94b8103ca9b6714614614ed79254feb1d9676c
.... MANY MANY lines omitted ...
100644 blob 170da7b0e607c4fd6fb4e921d76307397ab89c1e    2019-02-17-organizing-this-blog-into-categories.markdown
100644 blob 7d4f27e9804e3dc80ab3a3912b4f1c890c4d2432    2019-03-15-new-zine--bite-size-networking-.markdown
100644 blob 0d1b9fbc7896e47da6166e9386347f9ff58856aa    2019-03-26-what-are-monoidal-categories.markdown
100644 blob d6949755c3dadbc6fcbdd20cc0d919809d754e56    2019-06-23-a-few-debugging-resources.markdown
100644 blob 3105bdd067f7db16436d2ea85463755c8a772046    2019-06-28-brag-doc.markdown <-- found it!!!!!

Here the 2019-06-28-brag-doc.markdown is the last file listed because it was the most recent blog post when it was published.

commit step 5: we made it!

Finally we have found the object file where a previous version of my blog post lives! Hooray! It has the hash 3105bdd067f7db16436d2ea85463755c8a772046, so it's in .git/objects/31/05bdd067f7db16436d2ea85463755c8a772046.

We can look at it with decompress.py:

$ python3 decompress.py .git/objects/31/05bdd067f7db16436d2ea85463755c8a772046 | head
blob 15924
---
title: "Get your work recognized: write a brag document"
date: 2019-06-28T18:46:02Z
url: /blog/brag-documents/
categories: []
---
... rest of the contents of the file here ...

This is the old version of the post! If I ran git checkout 026c0f52 content/post/2019-06-28-brag-doc.markdown or git restore --source 026c0f52 content/post/2019-06-28-brag-doc.markdown, that's what I'd get.
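This whole walk can also be automated. Here's a sketch (blob_at is a made-up helper; 026c0f52 is the commit from this post, so substitute your own commit and path) that chains git cat-file -p calls to do steps 1 through 4 for us:

import subprocess

def cat_file(obj):
    return subprocess.run(["git", "cat-file", "-p", obj],
                          capture_output=True, text=True, check=True).stdout

def blob_at(commit, path):
    # the first line of a pretty-printed commit is "tree <hash>"
    obj = cat_file(commit).splitlines()[0].split()[1]
    for part in path.split("/"):
        # each tree entry looks like "100644 blob <hash>\t<name>"
        for line in cat_file(obj).splitlines():
            mode, otype, ohash, name = line.split(None, 3)
            if name == part:
                obj = ohash
                break
    return obj

print(blob_at("026c0f52", "content/post/2019-06-28-brag-doc.markdown"))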
this tree traversal is how git log works

This whole process we just went through (find the commit, go through the various directory trees, search for the filename we wanted) seems kind of long and complicated but this is actually what's happening behind the scenes when we run git log content/post/2019-06-28-brag-doc.markdown. It needs to go through every single commit in your history, check the version (for example 3105bdd067f7db16436d2ea85463755c8a772046 in this case) of content/post/2019-06-28-brag-doc.markdown, and see if it changed from the previous commit.

That's why git log FILENAME is a little slow sometimes: I have 3000 commits in this repository and it needs to do a bunch of work for every single commit to figure out if the file changed in that commit or not.

how many previous versions of files do I have?

Right now I have 1530 files tracked in my blog repository:

$ git ls-files | wc -l
1530

But how many historical files are there? We can list everything in .git/objects to see how many object files there are:

$ find .git/objects/ -type f | grep -v pack | awk -F/ '{print $3 $4}' | wc -l
20135

Not all of these represent previous versions of files though: as we saw before, lots of them are commits and directory trees. But we can write another little Python script called find-blobs.py that goes through all of the objects and checks if it starts with blob or not:

import zlib
import sys

for line in sys.stdin:
    line = line.strip()
    filename = f".git/objects/{line[0:2]}/{line[2:]}"
    with open(filename, "rb") as f:
        contents = zlib.decompress(f.read())
        if contents.startswith(b"blob"):
            print(line)

$ find .git/objects/ -type f | grep -v pack | awk -F/ '{print $3 $4}' | python3 find-blobs.py | wc -l
6713

So it looks like there are 6713 - 1530 = 5183 old versions of files lying around in my git repository that git is keeping around for me in case I ever want to get them back. How nice!

that's all!

Here's the gist with all the code for this post. There's not very much.

I thought I already knew how git worked, but I'd never really thought about pack files before so this was a fun exploration. I also don't spend too much time thinking about how much work git log is actually doing when I ask it to track the history of a file, so that was fun to dig into.

As a funny postscript: as soon as I committed this blog post, git got mad about how many objects I had in my repository (I guess 20,000 is too many!) and ran git gc to compress them all into packfiles. So now my .git/objects directory is very small:

$ find .git/objects/ -type f | wc -l
14
# Document Title
Lazygit Turns 5: Musings on Git, TUIs, and Open Source

Written on August 5, 2023

This post is brought to you by my sponsors. If you would like to support me, consider becoming a sponsor.

Lazygit, the world's coolest terminal UI for git, was released to the world on August 5 2018, five years ago today. I say released but I really mean discovered, because I had taken a few stabs at publicising it in the weeks prior which fell on deaf ears. When I eventually posted to Hacker News I was so sure nothing would come of it that I had already forgotten about it by that afternoon, so when I received an email asking what license the code fell under I was deeply confused. And then the journey began!

In this post I'm going to dive into a bunch of topics directly or tangentially related to Lazygit. In honour of the Hacker News commenters whose flamewar over git UIs vs the git CLI likely boosted the debut post to the frontpage, I've been sure to include plenty of juicy hot-takes on various topics I'm underqualified to comment on. It's a pretty long post so feel free to pick and choose whatever topics interest you.

Contents:
Where are we now?
Lessons learnt
What comes next?
Is git even that good?
Weighing in on the CLI vs UI debate
Weighing in on the terminal renaissance
Credits

Where are we now?

Stars

Lazygit has 37 thousand stars on GitHub, placing it at rank 26 in terms of Go projects and rank 263 across all git repos globally.

What's the secret? The number one factor (I hope) is that people actually like using Lazygit enough to star the repo. But there were two decisions I made that have nothing to do with the app itself that I think helped.

Firstly, I don't have a standalone landing page site or docs site. I keep everything in the repo, which means you're always one click away from starring. You can add a GitHub star button to your external site, but it doesn't actually star the repo; it just links to the repo and it's up to you to realise that you actually need to press the star button again. I suspect that is a big deal.

Secondly, Lazygit shows a popup when you first start it which at the very bottom suggests starring the repo:

Thanks for using lazygit! Seriously you rock. Three things to share with you:

1) If you want to learn about lazygit's features, watch this vid:
https://youtu.be/CPLdltN7wgE

2) Be sure to read the latest release notes at:
https://github.com/jesseduffield/lazygit/releases

3) If you're using git, that makes you a programmer!
With your help we can make lazygit better, so consider becoming a contributor and joining the fun at
https://github.com/jesseduffield/lazygit

You can also sponsor me and tell me what to work on by clicking the donate button at the bottom right.

Or even just star the repo to share the love!

I know this all sounds Machiavellian but at the end of the day, a high star count lends credibility to your project which makes users more likely to use it, and that leads to more contributors, which leads to more features, creating a virtuous cycle.

It's important to note that GitHub stars don't necessarily track real world popularity: magit, the de facto standard git UI for emacs, has only 6.1k stars but has north of 3.8 million downloads which as you'll see below blows Lazygit out of the water.

Downloads

Downloads are harder to measure than stars because there are so many sources from which to download Lazygit, and I don't have any telemetry to lean on.

GitHub tells me we've had 359k total direct downloads.
4.6% of Arch Linux users have installed Lazygit.
Homebrew ranks Lazygit at 294th (two below emacs) with 15k installs-on-request in the last year (ignoring the tap with 5k of its own). For comparison tig, the incumbent standalone git TUI at the time of Lazygit's creation, ranks at 480 with 8k installs.

I'm torn on how to interpret these results: being in the top 300 in Homebrew is pretty cool, but 15k installs feels lower than I would expect for that ranking. On the other hand, having almost 1 in 20 Arch Linux users using Lazygit seems huge.

Lessons Learnt

I've maintained Lazygit for 5 years now and it has been a wild ride. Here's some things I've learnt.

Ask for help

I don't know why this didn't occur to me sooner, but there is something unique and magical about writing open source software whose users are developers: any developer who raises an issue has the capacity to fix the issue themselves. All you need to do is ask! Simply asking 'are you up to the challenge of fixing this yourself?' and offering to provide pointers goes a long way.

I've gotten better over time at identifying easy issues and labelling them with the good-first-issue label so that others can help out, with a chance of becoming regular contributors.

Get feedback

If your repo is popular enough, you'll get plenty of feedback through the issues board. But issues are often of the form 'this is a problem that needs fixing' or 'this is a feature that should be added' and the demand for rigour is a source of friction. There are other ways you can reduce the friction on getting feedback. I pinned a google form to the top of the issues page to get general feedback on what people like/dislike about Lazygit.

Something that the google form made clear was that people wanted to know what commands were being run under the hood, so I decided to add a command log (shown by default) that would tell you which commands were being run. This made a huge difference and it's now one of the things people like best about Lazygit.

Something that surprised me was how big of a barrier the language of the project is in deciding whether somebody contributes. And Go of all languages: the one that's intended to be dead-easy to pick up. Maybe I need to do a rewrite in javascript to attract more contributors ;)

MVP is the MVP

This is not so much a 'lesson learnt' as a 'something I got right'. When I first began work on Lazygit I had a plan: hit MVP (Minimum Viable Product) and then release it to the world to see if the world had an appetite for it.
The MVP was pretty basic: allow staging files, committing, checking out branches, and resolving merge conflicts. But it was enough to satisfy my own basic needs at the time and it was enough for many others as well. Development was accelerated post-release thanks to some early contributors who joined the team (shoutout to Mark Kopenga, Dawid Dziurla, Glenn Vriesman, Anthony Hamon, David Chen, and other OG contributors). This not only sped up development but I personally learned a tonne in the process.

Tech debt is a perennial threat

In your day job, tech debt is to be expected: there are deadlines and customers to appease and competitors to race against. In open source, then, you would think that the lack of urgency would mean less tech debt. But I've found that where time is the limiting factor at my day job, motivation is the limiting factor in my spare time, and the siren song of tech debt is just as alluring. Does anybody want to spend their weekend writing a bunch of tests? Does anybody want to spend a week of annual leave on a mind-numbing refactoring? Not me, but I have done those things in order to improve the health of the codebase (and there is still much to improve upon).

Thankfully, open source has natural incentives against tech debt that are absent from proprietary codebases. Firstly, if your codebase sucks, nobody will want to contribute to it. Contrast this to a company where no matter how broken and contemptible a codebase is, there is an amount you can pay a developer to endure it.

Secondly, because your code is public, anybody who considers hiring you in the future can skim through it to get a feel for whether you suck or not. You want your codebase to be a positive reflection on your own skills and values.

So, tech debt is still a problem, but for different reasons than in a proprietary codebase.

Get your testing patterns right as soon as possible

The sooner you get a good test pattern in place with good coverage, the easier life will be.

In the beginning, I was doing manual regression tests before releasing each feature. Although I had unit tests, they didn't inspire much confidence, and I had no end-to-end tests. Later on I introduced a framework based on recorded sessions: each test would have a bash script to prepare a repo, then you would record yourself doing something in Lazygit, and the resultant repo would be saved as a snapshot to compare against when the test was run and the recording was played back. This was great for writing tests but terrible for maintaining them. Looking at a minified JSON containing a sequence of keypresses, it was impossible to glean the intent, and the only way to make a change to the test was to re-record it.

I've spent a lot of time working on an end-to-end test framework where you define your tests with code, and although I still shiver thinking about the time it took to migrate from the old framework to the new one, every day I see evidence that the effort was worth it.
Contributors find it easy to write the tests and I find it easy to read them which tightens the pull request feedback loop.

Here's an example to give you an idea:

// We call them 'integration tests' but they're really end-to-end tests.
var RewordLastCommit = NewIntegrationTest(NewIntegrationTestArgs{
    Description: "Rewords the last (HEAD) commit",
    SetupRepo: func(shell *Shell) {
        shell.CreateNCommits(2)
    },
    Run: func(t *TestDriver, keys config.KeybindingConfig) {
        t.Views().Commits().
            Focus().
            Lines(
                Contains("commit 02").IsSelected(),
                Contains("commit 01"),
            ).
            Press(keys.Commits.RenameCommit).
            Tap(func() {
                t.ExpectPopup().CommitMessagePanel().
                    Title(Equals("Reword commit")).
                    InitialText(Equals("commit 02")).
                    Clear().
                    Type("renamed 02").
                    Confirm()
            }).
            Lines(
                Contains("renamed 02"),
                Contains("commit 01"),
            )
    },
})

I wish I had come up with that framework from the get-go: it would have saved me a lot of time fixing bugs and migrating tests from the old framework.

What comes next?

If I could flick my wrist and secure funding to go fulltime on Lazygit I'd do it in a heartbeat, but given the limited time available, things move slower than I would like. Here are some things I'm excited for:

Bulk actions (e.g. moving multiple commits at once in a rebase)
Repo actions (e.g. pulling in three different repos at once)
Better integration with forges (github, gitlab) (e.g. view PR numbers against branches)
Improved diff functionality
More flexibility in deciding which args are used in a command
More performance improvements
A million small enhancements

I've just wrapped up worktree support, and my current focus is on improving documentation.

If you want to be part of what comes next, join the team! There are plenty of issues to choose from and we're always up to chat in the discord channel.

Okay, you've listened to me ramble about me and my project for long enough. Now onto the juicy stuff.

Is git even that good?

I'm not old enough to compare git with its predecessors, and from what I've heard from those who are old enough, it was a big improvement.

There are many who criticize git for being unnecessarily complex, in part due to its original purpose in serving the needs of linux development. Fossil is a recent (not-recent: released in 2006 as commenter nathell points out) git alternative that optimises for simplicity; serving the needs of small, high-trust teams. I disagree with a few of its design choices, but it might be perfect for you!

My beef with git is not so much its complexity (I'm fine dealing with multiple remotes and the worktree/index distinction) but its UX, including:

lacking high-level commands
no undo feature
merge conflicts aren't first-class

Lacking high-level commands

Consider the common use case of 'remove this file from my git status output'. Depending on the state of the file, the required command is different: for untracked files you do rm <path>, for tracked it's git checkout -- <path>, and for staged files it's git reset -- <path> && git checkout -- <path>. One of the reasons I made Lazygit was so that I could press 'd' on a file in a 'changed files' view and have it just go away.
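To make that concrete, here's a rough sketch (a hypothetical discard helper, not Lazygit's actual code) of the dispatch that a single 'd' keypress has to do on your behalf:

import subprocess

def git(*args):
    return subprocess.run(["git", *args], capture_output=True, text=True).stdout

def discard(path):
    # "git status --porcelain" prefixes each entry with a two-char state code
    status = git("status", "--porcelain", "--", path)[:2]
    if not status:
        return                           # nothing to discard
    if status == "??":                   # untracked: just delete the file
        subprocess.run(["rm", path])
    elif status[0] != " ":               # staged: unstage it first...
        git("reset", "--", path)
        git("checkout", "--", path)      # ...then throw away the changes
    else:                                # tracked, with unstaged changes
        git("checkout", "--", path)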
No undo feature

Git should have an undo feature, and it should support undoing changes to the working tree. Although Lazygit has an undo feature, it depends on the reflog, so we can't undo anything specific to the working tree. If git treated the working tree like its own commit, we would be able to undo pretty much anything.

Merge conflicts aren't first class

I also dislike how merge conflicts aren't first-class: when they show up you have to choose between resolving all of them or aborting an entire rebase (which may have involved other conflicts), and you can't easily switch to another task mid-conflict (though worktrees make this easier).

One project that addresses these concerns is Jujutsu. I highly recommend reading through its readme to realise how many problems you took for granted and how a few structural tweaks can provide a much better experience.

Unlike Fossil which trades power for simplicity, Jujutsu feels more like a reboot of git, representing what git could have been from the start. It's especially encouraging that Jujutsu can use git as a backend. I hope that regardless of Jujutsu's success, git incorporates some of its ideas.

Weighing in on the CLI-UI debate

If my debut hacker news post hadn't sparked a flamewar on the legitimacy of git UIs, it probably would have gone unnoticed and Lazygit would have been relegated to obscurity forever. So thanks, Moloch!

I've had plenty of time to think about this endless war and I have a few things to say.

Here are the main arguments against using git UIs:

git UIs sometimes do things you didn't expect which gets you in trouble
git UIs rarely give you everything you need and you will sometimes need to fall back to the command line
git UIs make you especially vulnerable when you do need to use the CLI
git UIs obscure what's really happening
The CLI is faster

I'm going to address each of these points.

Git UIs sometimes do things you didn't expect

This is plainly true. Lazygit works around this by logging all the git commands that it runs so that you know what's happening under the hood. Also, over time, lazygit's ethos has changed to be less about compensating for git's shortcomings via magic and more making it easier to do the things that you can naturally do in git, which means there are fewer surprises.

Git UIs don't cover the full API

This is indeed an issue. However, as a git UI matures, it expands to cover more and more of git's API (until you end up like magit). And the fact you need to fall back to git is not really a point against the UI: when given the choice between using the CLI 100% of the time and using it 1% of the time, I pick the latter. If you forgive the shameless plug (is it really a plug given the topic of the post?) Lazygit also works around this with a pretty cool custom commands system that lets you invoke that bespoke git command from the UI; making use of the selection state to spare you from typing everything out yourself.

Git UIs make you vulnerable when you need to use the CLI

I've conceded the first two points. Now I go to war.

What people envision with a seasoned CLI user is that they come across some situation they haven't seen before and, using their strong knowledge of the git API and the git object model, they craft an appropriate solution. The reality that I've experienced is that you instead just look for the answer on stack overflow, copy+paste, and then forget about it until next time when you google it again. With the advent of ChatGPT this will increasingly become the norm.

Whenever a new technology comes along that diminishes the need for the previous one, there are outcries that it will make everybody dumber.
Socrates was famously suspicious of the impact that writing would have on society, saying:

Their trust in writing, produced by external characters which are no part of themselves, will discourage the use of their own memory within them. You have invented an elixir not of memory, but of reminding; and you offer your pupils the appearance of wisdom, not true wisdom, for they will read many things without instruction and will therefore seem to know many things, when they are for the most part ignorant…

The argument perfectly applies to UIs, and is just as misguided. The truth is that some people have good memory and some people (i.e. me) have shockingly bad memory and it has little to do with technology (unless the technology is hard drugs in which case yes that does make a difference). I think that many debates about UX are actually debates between people with differing memory ability who therefore have different UX needs. UIs make things more discoverable so you don't need to remember as much, and people with shocking memory who stick to the git CLI have no guarantee of actually remembering any of it. Yes, all abstractions are leaky, but that doesn't mean that we should go without abstractions, any more than we should all revert to writing code in assembly.

What's especially peculiar is that many complex git commands involve a visual component whether you like it or not: the git CLI by default will open up a text editor to prepare for an interactive rebase, which is visual in the sense that you're shown items whose position is meaningful and you can interact with them (e.g. shuffling commits around). The question is whether that interface is easy to use or not, and I find the default behaviour very difficult to use.

For the record, I'm good at helping colleagues fix their git issues, but if I'm in their terminal trying to update their remote URL I have no idea what the command is. Not to worry: I do know how to run brew install lazygit.

[Harry Potter meme]

Git UIs obscure what's really happening

Again, strong disagree. Compared to the CLI, there's nothing to obscure!

When I create a commit, several things happen:

my staged files disappear from my list of file changes
a new commit is appended to my git log
my branch ends up with a new head commit, diverging from its upstream

If you create a commit from the command line, you see none of this. You can query for any of this information after the fact, for example by running git status, but it only gives you one piece of information. If you're a beginner using the git CLI, you want to be learning the relationship between the different entities, and it's almost impossible to do that without seeing how these entities are changed as a direct result of your actions. Lazygit has helped some people better understand git by providing that visual context.

Perhaps UIs aren't visually obscuring the entities, but they are obscuring the commands. Okay, fine, I concede that point. But my caveats in the Git UIs sometimes do things you didn't expect section above still apply.

The CLI is faster

If you're a CLI die-hard you probably have some aliases that speed you up, but when it gets to complex use cases like splitting an old commit in two or only applying a single file from a stash entry to the index, it helps to have a UI that lets you press a few keys to select what you want and then perform an action with it. In fact I'd love to pit a CLI veteran against a UI veteran in a contrived gauntlet of git challenges and see who reaches the finish line first.
You could also determine the speed of light for each approach, i.e. the minimum number of keypresses required to perform the action, and then see which approach wins. Even if you had a thousand aliases, I still think a keyboard-centric UI (with good git API coverage) would win.

Conclusion

As somebody who maintains a git UI, I'm clearly partial. But I also feel for the devs who I see stage files by typing git status, dragging their mouse over the file they want to add, and then typing git add <path>. It makes my stomach turn. There are some pros out there who are happy using the CLI for everything, but the average CLI user I see is taking painstakingly slow approaches to very simple problems.

Weighing in on the terminal renaissance

The terminal is making a comeback. Various companies have sprung up with the intention of improving the developer experience in the terminal:

Warp: a terminal emulator whose killer feature is allowing you to edit your command as if you were in vscode/sublime
Fig: a suite of tools including autocomplete, terminal plugin manager, and some UI helpers for CLI tools
Charm: various tools and libraries for terminals including a terminal UI ecosystem

Warp and Fig both add original elements to the exterior of a terminal emulator to improve the UX, whereas Charm is all about improving things on the inside. All of these projects have the same general idea: rather than replace the terminal, embrace it.

I'm interested to see where this goes.

TUI vs CLI

I would say I'm pro-terminal, but borderline anti-CLI. When I'm interfacing with something I want to know the current state, the available actions, and once I've performed an action, I want to see how the state changes in response. So it's state -> action -> new state. You can use a CLI to give you all that information, but commands and queries are typically separated and it's left as an exercise for the user to piece it all together. The simplest example is that after you run cd, you have to run ls to know which files are in the directory. Compare this to nnn which mimics your OS's file explorer. Another example is running docker compose restart mycontainer and then having to run a separate command to see whether or not your container died as soon as it started (compared to using Lazydocker). Even programs like npm can benefit from some visualisation when it comes to linking packages (which is why I created Lazynpm). CLI interfaces are great for scripts and composition but as a direct interface, the lack of feedback about state is jarring.

All this to say that when I see demos that show slick auto-complete functionality added to a CLI tool, I can see that it solves the problem of knowing what actions are available, but I'd rather solve the issue of exposing state.

I want to drive home how easy it is to improve on the design of many CLIs. It's not hard to pick an existing CLI and think about what entities are involved and how they could be represented visually. A random example: asdf is an all-in-one version manager that can manage versions of multiple programs. So you have programs, the available versions of each program, the currently selected version, and you have some actions like CRUD operations and setting a given version as the default. This is perfectly suited to a UI!
It just so happens that somebody has gone and made a TUI for it: lazyasdf (I'm proud to have started a trend with the naming convention!).

TUI vs Web

So, I've said my piece about how TUIs can improve upon CLIs, but what about this separate trend of re-imagining web/desktop applications in the terminal?

nsf, the author of the now unmaintained terminal UI framework termbox, says the following at the top of the readme (emphasis mine):

This library is no longer maintained. It's pretty small, if you have a big project that relies on it, just maintain it yourself. Or look for forks. Or look for alternatives. Or better - avoid using terminals for UI.

When you think about it, the only thing that separates terminals and standalone applications is that terminals only render text. The need for terminals was obvious when there were literally no alternatives. Now that we have shiny standalone applications for many things that were once confined to the terminal, it's harder to justify extending our terminals beyond CLI programs. But there are some reasons:

TUIs guarantee a keyboard-centric UX

There is nothing stopping a non-TUI application from having a keyboard-centric UX, but few do. Likewise, TUIs can be mouse-centric, but I've never encountered one that is.

TUIs have minimalistic designs

In a TUI, not only can you only render text, but you're often space-constrained as well. This leads to compact designs with little in the way of superfluous clutter. On the other hand, it's nice when your UI can render bespoke icons and images and render some text in a small font so that the info is there if you need it but it's not taking up space. It is interesting that Charm's UI library seems to go for more whitespace and padding than the typical TUI design: I suspect that trend will be shortlived and in the long run terminal apps will lean compact and utilitarian (No doubt Charm has room for both designs in its ecosystem).

TUIs are often faster than non-TUI counterparts

In one sense, this is a no-brainer: all you're rendering is text, so your computer doesn't need to work as hard to render it. But I don't actually think that's the main factor. Rather, terminal users expect TUIs to be fast, because they value speed more than other people. So TUI devs put extra effort in towards speed in order to satisfy that desire. I've spent enough time on Lazygit's performance to know that it doesn't come for free.

Conclusion

So, let's see where this TUI renaissance goes. Even if the renaissance's only long-term impact is to support more keyboard-centric UIs in web apps, it will have been worth it.

Credits

Well, that wraps up this anniversary mega-post. Now I'd like to thank some people who've helped Lazygit become what it is today.

First of all, a HUGE thankyou to the 206 people who have contributed to Lazygit over the years, and those who have supported me with donations.

I'd like to shoutout contributors who've been part of the journey at different stages: Ryoga, Mark Kopenga, Dawid Dziurla, Glenn Vriesman, Anthony Hamon, David Chen, Flavio Miamoto, and many many others. Thankyou all so much. I also want to thank loyal users who've given lots of useful feedback including Dean Herbert, Oliver Joseph Ash, and others.

I want to give a special shoutout to Stefan Haller and Luka Markušić who currently comprise the core team. You've both been invaluable for Lazygit's development, maintenance, and direction.
I also hereby publicly award Stefan the prize of 'most arguments won against maintainer' ;)

I also want to shoutout Appwrite who generously sponsored me for a year. It warms my heart when companies donate to open source projects.

As for you, dear reader: if you would like to support Lazygit's development you can join the team by picking up an issue or expressing your intent to help out in the discord channel.

And as always, if you want to support me, please consider donating <3

I now leave you with a gif of our new explosion animation when nuking the worktree.

Discussion links
Hacker News

Shameless plug: I recently quit my job to co-found Subble, a web app that helps you manage your company's SaaS subscriptions. Your company is almost certainly wasting time and money on unused subscriptions and Subble can fix that. Check it out at subble.com
thank you

My name is [...]. I work on way too many projects at the same time. Right now I'm working on Cotton X, which is a way to share renewable electricity with your neighbors efficiently, using mathematics to do that, and I've been working for a few years now on Pijul, a version control system also based on mathematics, which is what I'm going to talk about today.

So, I have way too many things to tell you. I'll start by talking about what version control is, and do a brief recap on where it comes from, how it started, and where we're at now. Then I'll talk about our solution and the principles behind it, and then I'll talk about the implementation of that version control system, including one of the fastest database backends in the world, which I've been forced to write in order to implement Pijul correctly. And then, finally, I'll have some announcements, some surprise announcements, to make about the hosting platform for hosting repositories.

First: version control is actually very simple, and it's not specific to coders. It's when one or more co-authors edit trees of documents concurrently. One key feature of version control, compared to things like Google Docs for example, is the ability to do asynchronous edits. This sounds like it should be easier, since you're giving more flexibility to your users; it's actually the opposite. When you allow co-authors to choose when they want to sync, or merge their changes or their work, things become much more complicated. The main reason is that edits may conflict, and conflicts happen in human work. I'm not claiming here to have a universal solution to all the conflicts humans may have; I'm merely trying to help them model their conflicts in a proper way.

Then, not finally, but yet another feature of version control that we might like, is the ability to review a project's history: to tell when a feature or a bug was introduced, who introduced it, and sometimes that gives an indication of how to fix it.

Many of you here, or many people I talk to about this project, think that version control is a solved problem, because of our tools: git, Mercurial, SVN, CVS; people sometimes mention Fossil, Perforce. We have a huge collection of tools. Our tools are probably considered one of the greatest achievements of our industry, and yet nobody outside of us uses them. These days we have silverware provided by NASA, with materials that can go to Mars, but our greatest achievements cannot be used even by editors of legal documents, or parliaments; even the video game industry doesn't use them.

It's not because they're too young: they've been around for quite a number of decades now. Lately there's been a trend of doing distributed version control. That's cool, there's no central server. Except that the tools are unusable if you don't use a central server, and actually worse than a central server: a global central server, universal to all projects.

And our current tools require strong work discipline and planning; you have to plan your things in advance. The picture on the right is a simple example of a "workflow considered useful", and I personally don't understand it. (Well, I actually do, but...) Onboarding people, letting diverse people from outside, without for example formal training in computer science, know about this tool... they'd be like, "are you crazy?" And that's just a small part of what I'm about to say, because there are flows that any engineer in any other industry in the world would just laugh at us for, if they knew about them. So my claim is that by using these tools we are wasting significant human work time at a global scale. I don't know how many millions of engineer-hours are wasted every year on fixing that rebase that didn't work, or re-fixing that conflict again, or, wait, re-re-fixing it. There's actually a command in git called rerere.

Some improvements have been proposed using mathematics, like Darcs for example, but unfortunately they don't really scale. Just a note before I go any further: Pijul is open source, and I'm not considering non-open version control systems, because git is good enough. It's an okay system, it's phenomenal in some ways, but if you're going for a commercial system, then you'd better be at least as good as that, and I don't know any system that achieves that anyway.

So, our demands for a version control system. We want associative merges. Associativity is a mathematical term, and here it means that when you take changes A and B together, they should be the same as A followed by B. That sounds like something absolutely trivial, and I'll give you a picture in a minute that will make it even clearer. Next, we want commutative merges: we want the property that if A and B were produced independently, the order in which you apply them doesn't matter. You have Alice and Bob working together: if Alice pulls Bob's changes, it should result in the same thing as if Bob pulls Alice's changes.

Then we want branches. Well, or maybe not: we have branches in Pijul, but they are less fundamental than in git, in order not to encourage the same kind of "workflows considered useful" that I showed on the previous slide. And obviously we want low algorithmic complexity, and ideally fast implementations; and actually, here, we have both.

More about the two properties, starting with associative merges. This is really easy: you have Alice producing a commit A, and Bob producing commits B in parallel, and Alice wants to first review Bob's first commit and merge it, and then review Bob's second commit and merge it. This should do the same thing as if she had merged both commits at once. I'm not reordering commits here, I'm just merging. And actually, in git or SVN or Mercurial or CVS or RCS, or any system based on three-way merge, this is not the case. Such a simple property isn't even satisfied by three-way merge.

Here's a counterexample to associativity in git. You start with a document with only two lines. Alice (she's following the top path) will start by introducing a G, and then she will later add, in another commit, two new lines above that G: A and B. Bob, in parallel to that, will just add an X between the original A and B. And if you try that scenario on git today, you will see that Bob's new line gets merged into Alice's new lines. That's a giant line reshuffling, and git just does it silently; this isn't even a conflict. The reason for that is that it's trying to run a hack to optimize some metric, and it turns out there may be several solutions sometimes, and it just doesn't know about them: it just picks one and says, "ah, okay, done."
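If you want to poke at this yourself, here's a small sketch (hypothetical file and branch names; the exact merge result depends on your git version and diff/merge settings) that sets up the scenario just described:

import os, subprocess, tempfile

def git(*args, check=True):
    return subprocess.run(["git", *args], check=check,
                          capture_output=True, text=True)

os.chdir(tempfile.mkdtemp())
git("init", "-b", "main")   # -b requires git >= 2.28
git("config", "user.email", "demo@example.com")
git("config", "user.name", "demo")

def commit(lines, msg):
    with open("doc.txt", "w") as f:
        f.write("\n".join(lines) + "\n")
    git("add", "doc.txt")
    git("commit", "-m", msg)

commit(["a", "b"], "base: two lines")
git("checkout", "-b", "alice")
commit(["a", "b", "g"], "alice 1: introduce g")
commit(["a", "b", "a", "b", "g"], "alice 2: add a, b above g")
git("checkout", "main")
git("checkout", "-b", "bob")
commit(["a", "x", "b"], "bob: add x between the original a and b")

# Merge Alice's work into Bob's and see where x ends up. You can also
# merge alice~1 first and then alice, to compare with merging all at once.
git("merge", "alice", "-m", "merge alice", check=False)
print(open("doc.txt").read())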
More about the two properties, starting with associative merges. This one is really easy: you have Alice producing a commit A and Bob producing a commit B in parallel, and Alice wants to first review Bob's first commit and merge it, then review Bob's second commit and merge it. This should do the same thing as if she had merged both commits at once. I'm not reordering commits here, I'm just merging. And actually, in Git or SVN or Mercurial or CVS or RCS, in any system based on three-way merge, this is not the case.

Such a simple property isn't even satisfied by three-way merge. Here's a counter-example. You start with a document with only two lines, A and B. Alice, following the top path, starts by introducing a G, and later adds another commit with two new lines, above that G. Bob, in parallel, just adds an X between the original A and B. If you try that scenario on Git today, you will see that Bob's new line gets merged into Alice's new lines: a giant line reshuffling, and Git just does it silently. This isn't even a conflict. The reason is that three-way merge runs a heuristic to optimize some metric, and it turns out there may sometimes be several solutions; it doesn't know about the ambiguity, it just picks one and says, okay, done.

I don't know about you, but if I were running high-security applications, or writing security-related code, this would absolutely terrify me. The fact that your tool can silently reshuffle your lines, even if it doesn't happen often, is just super scary. It also means that the code you review is not the code that gets merged, so you should review your pull requests before the merge and after the merge: double the work. You should test after the merge anyway, but you shouldn't have to be as careful in your review, and tests don't catch all bugs.

Now, commutative merges. That's a slightly less trivial thing, because all the tools other than Darcs and Pijul explicitly prevent it. Commutativity of merges means exactly what's in this diagram: Alice produces a commit A, Bob produces a change B, and they want to be able to pull each other's changes, and it should just happen transparently. The order is important, in the sense that your local repository history is something super important, but it shouldn't matter, in the sense that there is no global way to order things that happened in parallel, and that should be reflected in how the tool handles parallelism.

All right, so why would we even want that, beyond academic curiosity, especially since the tools we're using right now are never commutative and even explicitly prevent it? One reason is that you might want to unapply old changes. For example, you pull a change and push it into production, because you've tested it thoroughly and it seems to work; then you push more changes; and after a while you realize the initial change was wrong, and you want to unapply it quickly, without having to change the entire sequence of patches that came afterwards. If you disallow commutation, you can't. Commutation allows you to move that buggy change to the top of the list and unapply it simply.

Then you might want to do cherry-picking. Cherry-picking is: my colleague produced a nice bug fix while working on a feature; I want the bug fix, but the feature is not quite ready. How do I take just the fix, without changing its entire identity, and without solving conflicts, and re-solving them, and re-re-solving them? Another reason is partial clones: I have a giant monorepo and I want to pull just the patches related to a tiny sub-project. That's the way we handle monorepos in Pijul: you don't need submodules, you don't need hacks, you don't need LFS, you don't need any of that. It just works, and it's just a standard situation.
Okay, so how do we do that? First we have to change perspective, zoom out, and look at what we're doing, fundamentally, when we work. Let's talk about states, or snapshots, and changes, also called patches. All our tools today store states, snapshots, and they only compute changes when needed; for example, three-way merge computes changes, and changes between lists of changes. But what if we did the opposite? What if we changed perspective and started considering changes as the first-class citizens? Why would we want to do that? Well, my claim, and it's not backed by anything, is that when we work, what we're fundamentally producing is not a new version of the world, or a new version of a project; what we're producing is changes to a project. This seems to match the way we work, or the way we think about work, more closely, so we can probably get some benefits out of it. And what if we did a hybrid system where we stored both? That's actually what we do.

All right, so this has been looked at before. I'll give you two examples of ideas in that space that some of you may already know about. The first one is operational transforms, the idea behind Google Docs, for example. In operational transforms you have transforms, or changes, on an initial state; here, a document with three letters, ABC. When two transforms come in concurrently, each one may be rewritten so that they can be applied in sequence. For example, on the downward path we're inserting an X at the very beginning of the file, that's T1, and on the path to the right we're doing T2, which deletes the letter C. What happens when you combine these two changes? If you follow the top path, you're first deleting the C, and since T1 was at the beginning of the file and the deletion was at the end, you don't need to do anything special. On the other path, going first downwards and then to the right, you first insert something, and that shifts the index of your deletion: instead of deleting the character at position 2, you're now deleting the character at position 3. Darcs, for example, does this: it rewrites its patches as it goes. It actually does something really clever to detect conflicts; I don't have time to get into the details, but what they do there is really cool.

Unfortunately, this technique leads to a quadratic explosion of cases: if you have n different types of changes, you have n(n-1)/2 different cases to consider. When you're just doing insertions and deletions, that's easy; when you're doing anything more complicated, it becomes a nightmare to implement. And here I'm actually quoting: "a nightmare to implement" is a quote from Google engineers who tried to implement it for Google Docs. It actually is a nightmare.
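For illustration, here's a minimal sketch (mine, not Google's or Darcs's code) of the transform step for just these two operation kinds; every pair of kinds needs its own rule, which is where the quadratic growth comes from:

```rust
// Sketch: operational transform for two operation kinds. Each pair of
// kinds needs its own transformation rule, so adding more kinds
// multiplies the number of cases.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Op {
    Insert(usize), // insert one character at this position
    Delete(usize), // delete the character at this position
}

// Transform `b` so it can be applied after `a` has been applied.
fn transform(b: Op, a: Op) -> Op {
    use Op::*;
    match (b, a) {
        (Insert(i), Insert(j)) if i >= j => Insert(i + 1),
        (Insert(i), Delete(j)) if i > j => Insert(i - 1),
        (Delete(i), Insert(j)) if i >= j => Delete(i + 1),
        (Delete(i), Delete(j)) if i > j => Delete(i - 1),
        _ => b, // edits strictly before `a`'s position are unaffected
    }
}

fn main() {
    // T1 inserts at the beginning, T2 deletes the 'C' at index 2.
    let t1 = Op::Insert(0);
    let t2 = Op::Delete(2);
    // After T1, T2's target has shifted from index 2 to index 3.
    assert_eq!(transform(t2, t1), Op::Delete(3));
    // After T2 (at index 2), T1 (at index 0) is unaffected.
    assert_eq!(transform(t1, t2), Op::Insert(0));
}
```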
Another approach some of you may have heard about is CRDTs, or conflict-free replicated data types. The general principle is to design a structure and its operations at the same time, together, so that all operations have the properties we want: they are associative and commutative. The natural examples, the easiest ones you come across when learning about CRDTs, are increment-only counters, where the only operation on the counter is to increment it, and insert-only or append-only sets. Those are easy. Now, when you want to do deletions, you get into the more subtle examples of CRDTs: you start needing tombstones and Lamport clocks and all those things from distributed programming.

So, I've done the natural and the subtle; now let's move on to the useless: if you consider a full Git repository, that's a CRDT. So what are we even doing here? Well, the reason I claim this is useless is that saying a Git repository is a CRDT just means you can clone it, and you can design a protocol to clone it, and that's it. If you consider a head, which is the thing we're actually interested in, the current state of your repository, then that's absolutely not a CRDT, simply because, as I said, concurrent changes don't commute.

That was a really brief look at the literature. Now let's move on to our solution, Pijul. This all started because we were looking at conflicts. The easy cases, where you can just merge and everything goes right, are not super interesting; conflicts are where we need a good tool the most, because conflicts are confusing, and you want to be able to talk about the fundamental thing behind the conflict, "we disagree on something", and not about how your tool models the conflict. The exact definition depends on the tool; different tools have different definitions of what a conflict is. For example, one commonly accepted definition is Alice and Bob writing to the same file at the same place: that's obviously a conflict, there's no way to order their changes. Another example is Alice renaming a file from F to G while Bob, in parallel, renames it to H; that's also a conflict, though again it depends on the tool. Another example, which very few systems handle, and which Pijul doesn't handle either, is Alice renaming a function f while Bob adds a call to f. That's extremely tricky; Darcs tries to do it, but unfortunately it's undecidable to tell whether Bob actually added a call to f or did something else. That's one of the reasons we don't handle it; there are many other reasons, but that one is good enough for me.

Okay, so how did reflecting on conflicts help us shape a new tool? We were inspired by a paper by Samuel Mimram and Cinzia Di Giusto about using category theory to solve that problem. Category theory is a very general theory in mathematics that lets you model many different kinds of proofs in a 2D framework, with points and arrows between the points; that's most of what there is in category theory. It's very simple and very abstract at the same time.

What we want is that for any two patches f and g produced from an initial state X, so f leads to Y and g leads to Z, there is a unique state P such that for any state Q we can reach after both f and g, that is, for anything Alice and Bob could do to reach a common state in the future, they could start by merging, reaching a minimal common state P, and then go on from P to reach Q. In other words, for any two patches you can start by finding a minimal common state, and then do something to reach any other future common state. I realize I'm going a bit fast on this slide, but category theorists have a tool for exactly this: they say that if P exists (and the definition implies its uniqueness), P is called the pushout of f and g.
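For reference, the universal property being invoked is the standard one (textbook notation, not specific to the talk):

```latex
% Universal property of the pushout of f: X -> Y and g: X -> Z.
% P, with maps in_Y and in_Z, is a pushout if every Q that both
% Y and Z map into (compatibly) factors through P uniquely.
\[
\begin{aligned}
&f : X \to Y, \qquad g : X \to Z,\\
&\mathrm{in}_Y : Y \to P,\ \ \mathrm{in}_Z : Z \to P
\quad\text{with}\quad \mathrm{in}_Y \circ f = \mathrm{in}_Z \circ g,\\
&\forall\, (q_Y : Y \to Q,\ q_Z : Z \to Q)\ \text{with}\ q_Y \circ f = q_Z \circ g:\\
&\qquad \exists!\, u : P \to Q \quad\text{such that}\quad
u \circ \mathrm{in}_Y = q_Y \ \text{and}\ u \circ \mathrm{in}_Z = q_Z.
\end{aligned}
\]
```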
Why is this important? Well, as you can imagine, pushouts don't always exist, and that is strictly equivalent to saying that sometimes there are conflicts between our edits. So how do we deal with that? Category theory gives you a new question to look at: the question becomes how to generalize the representation of states (the X, Y, Z, P, Q) so that all pairs of changes f and g have a pushout.

The solution is that the minimal extension of files that can model conflicts is directed graphs, where vertices are bytes, or byte intervals, and edges represent the union of all known orders between bytes. That probably sounds a little abstract, so I'll give you a few examples.

Let's see how we deal with insertions, that is, adding some bytes to an existing file. First, some details: vertices in Pijul are labeled by a change number (the change that introduced the vertex) and an interval within that change, and edges are labeled by the change that introduced them. For example, here we're starting with just one vertex, C0:[0,n), the first n bytes of change C0, and we're trying to insert m bytes between positions i-1 and i of that vertex. What we do is split the vertex in two, giving two vertices C0:[0,i) and C0:[i,n), and insert a new vertex between the two halves of the split. Super easy. And now we can tell from the graph that our file has three blocks: the first i bytes of C0, followed by the m bytes of C1, followed by bytes i to n of C0.

Okay, that was easy enough. Now, how do we delete bytes? A good thing about version control is that we need to keep the entire history of the repository anyway, so it doesn't cost more to just keep the deleted parts, and that's what we do here. Starting from the graph we obtained on the last slide, I'm now deleting a contiguous interval of bytes: bytes j to i from C0 and 0 to k from C1, so starting from j, I'm deleting i - j + k bytes. The way I do it is exactly the same as for insertions: I start by splitting the relevant vertices at the relevant positions, and then the way to mark bytes as deleted is just to modify the labels of the edges. Here I'm marking my edges as deleted by turning them into dashed lines.

And that's all we need. Pijul is a bit more complicated than that, but fundamentally these are the two constructs we need; there's a lot of stuff built on top, but at the very base it's just that: a vertex is deleted in the context of its parents and children, by the change doing the deletion, through the edge labels.
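Here's a rough sketch of that graph model in Rust (my rendering of the idea, not Pijul's actual types), showing a vertex split followed by an insertion:

```rust
// Sketch of the graph model: vertices are byte intervals of changes,
// edges carry the change that introduced them plus a liveness flag.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct Vertex {
    change: u64, // the change that introduced these bytes
    start: usize,
    end: usize, // half-open interval [start, end)
}

#[derive(Clone, Copy, Debug)]
struct Edge {
    from: Vertex,
    to: Vertex,
    introduced_by: u64,
    deleted: bool, // deletion only flips this flag; bytes are kept
}

// Split [start, end) at `at`, so new bytes can be wired in between.
fn split(v: Vertex, at: usize) -> (Vertex, Vertex) {
    assert!(v.start < at && at < v.end);
    (Vertex { end: at, ..v }, Vertex { start: at, ..v })
}

fn main() {
    let c0 = Vertex { change: 0, start: 0, end: 10 };
    let (left, right) = split(c0, 4); // insert after byte 3
    let new = Vertex { change: 1, start: 0, end: 3 }; // 3 new bytes
    let edges = [
        Edge { from: left, to: new, introduced_by: 1, deleted: false },
        Edge { from: new, to: right, introduced_by: 1, deleted: false },
    ];
    // The file now reads: c0[0..4), then c1[0..3), then c0[4..10).
    for e in &edges {
        println!("{:?} -> {:?}", e.from, e.to);
    }
}
```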
How does that handle conflicts? I won't dive too deep into this, for reasons of time, but I'll state the definition of conflicts and stop there. Before getting to conflicts: live vertices, by definition, are the vertices whose incoming edges are all alive; dead vertices are the vertices whose incoming edges are all dead; and all the other vertices, those with both alive and dead edges pointing to them, are called zombies. Now I'm ready to state my definition of conflicts: a graph has no conflicts if and only if it has no zombies and all its alive vertices are totally ordered. And this actually matches what you expect: a conflict-free file is just a sequence of bytes that can be ordered unambiguously, where each byte is either alive or dead, but not both at the same time. There's an extension of this to files and directories and so on, but it's significantly more involved, so I won't talk about it.
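A sketch of that classification in Rust (again my illustration; the names are invented):

```rust
// Liveness classification behind the conflict definition: a vertex is
// alive, dead, or a zombie, depending on its incoming edge labels.
#[derive(Debug, PartialEq)]
enum Status {
    Alive,
    Dead,
    Zombie, // some incoming edges alive, some deleted: a conflict
}

fn classify(incoming_deleted_flags: &[bool]) -> Status {
    let deleted = incoming_deleted_flags.iter().filter(|&&d| d).count();
    match (deleted, incoming_deleted_flags.len()) {
        (0, _) => Status::Alive,
        (d, n) if d == n => Status::Dead,
        _ => Status::Zombie,
    }
}

fn main() {
    assert_eq!(classify(&[false, false]), Status::Alive);
    assert_eq!(classify(&[true]), Status::Dead);
    // One change deleted these bytes while another change still
    // orders content around them: the vertex becomes a zombie.
    assert_eq!(classify(&[true, false]), Status::Zombie);
}
```

The other half of the definition, that the alive vertices are totally ordered, amounts to checking that there is only one way to read the alive bytes off the graph.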
Some concluding remarks on that part. Changes: I said I wanted them to be commutative, and I can get that using this framework. They're not completely commutative, in the sense that changes are partially ordered by their dependencies on other changes: each change explicitly encodes the dependencies that are required in order to apply it. For example, you cannot write to a file before introducing that file, and you cannot delete a line that doesn't exist yet; those are the basic dependencies.

Now, cherry-picking: there isn't even a cherry-pick command in Pijul, because cherry-picking is the same as applying a patch. We don't need to do anything special. There's no "git rerere" either. Git rerere is about solving a conflict, or rather having to solve the same conflict several times; I don't know if many of you have used that command, but the goal (I think it's somewhat automated now) is that once you've solved a conflict, you record the resolution, and Git will maybe replay it sometime in the future, if it works, and it doesn't always work. In Pijul, conflicts are just the normal case; they're solved by changes, and changes can be cherry-picked, so if you've solved a conflict in one context, you don't need to solve it again in another context.

For partial clones and monorepos, which I already mentioned: they're easy to implement as long as wide patches are disallowed. For example, if you do a global reformatting, a patch that reformats your entire repository at once (I don't know why you'd want that), then you obviously introduce unwanted dependencies between changes. If you want a global reformatting, one thing you can do is make one reformatting patch per sub-project, and then you can keep going.

For large files: one thing I haven't really talked about in detail is that patches actually have two parts. One part is the description of what they do, inserting some bytes, deleting some bytes, and the other part is the actual bytes that are inserted or deleted. We handle large files by splitting patches into the description of what they do, the operational part, and the actual contents; the operational part can be exponentially smaller than the contents. For example, if you work at a video game company and one of your artists produced ten versions of a two-gigabyte asset during the day, you don't need to download all ten versions; you only need to download the bytes that are still alive at the end of the day. That lets you handle large files easily: you still need to download some content, but much less than downloading every version.

All right, let's move on to some implementation tricks, some things I like and some things I'm proud of in the implementation of this system. The main challenge was working with large graphs on disk. When you're using any data structure more complicated than plain files, the question arises of how to store it on disk so you don't have to load the entire thing each time, because then the cost would be proportional to the size of the history, and that's just unacceptable. We want it to be logarithmic in the size of history, and that's what we achieve. Since we can't load the entire graph each time, we keep it on disk and manipulate it from there, and the trick is to store the graph in a key/value store, with vertices mapping to their surrounding edges.

Another thing we absolutely want is transactions, with passive crash safety. The goal with Pijul is to be much more intuitive than all the existing tools; my goal is to introduce it to lawyers, artists, maybe Lego builders or Sonic Pi composers, these kinds of people. And these people cannot tolerate non-passive crash safety: they cannot possibly tolerate some repair operation on a log that has to be run after you've unplugged the machine, or after a crash. So we absolutely want that. And the next feature we want is branches. They're not as useful as in Git, but we still want them, so we want an efficiently forkable store: we want to be able to take a database and just clone it without copying a single byte.
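Back to the first of those requirements for a second: stored that way, the working graph is a map from vertices to their surrounding edges, so applying a change only touches the keys near the edit site. A sketch, with a plain BTreeMap standing in for the transactional store:

```rust
// Sketch: adjacency stored in an on-disk key/value store, so the
// graph never has to be fully loaded. A BTreeMap stands in for the
// transactional store (Sanakirja, in Pijul's case).
use std::collections::BTreeMap;

type VertexKey = (u64, usize, usize); // (change, start, end)

#[derive(Debug)]
struct EdgeRecord {
    to: VertexKey,
    introduced_by: u64,
    deleted: bool,
}

fn main() {
    let mut adjacency: BTreeMap<VertexKey, Vec<EdgeRecord>> = BTreeMap::new();
    adjacency.entry((0, 0, 4)).or_default().push(EdgeRecord {
        to: (1, 0, 3),
        introduced_by: 1,
        deleted: false,
    });
    // Applying a change touches only the keys around the edit site,
    // so the cost is logarithmic in history, not linear.
    for (v, edges) in &adjacency {
        println!("{:?} -> {:?}", v, edges);
    }
}
```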
To solve these problems, I've written a library called Sanakirja; it's a Finnish word meaning "dictionary". It's an on-disk transactional key/value store, but it's not just that: it's more of a general-purpose file block allocator. It allocates blocks in the file in a transactional way, so if you unplug the machine at any time, and I really do mean any time, all your allocations will go away and memory will be automatically freed. It gets crash safety from referential transparency and copy-on-write tricks: it never modifies the previous version, it just creates a new version. And that comes at essentially no cost, because when you're working with disk files you already need to read them, and reading is such an expensive operation that one extra copy (I do at most one copy) each time you read a block from a file is negligible compared to the cost of the read itself. Lookups are O(log n), and that logarithmic bound is an absolute worst case in the total number of keys and values. And it's written in Rust, which might make some of you feel it's probably safe to use, and so on. But it actually has a super tricky API, because it's way too generic, and it's actually super hard to use; anyone who wants to use Sanakirja often has to write a layer on top of it just to provide the normal safety guarantees you'd expect from a Rust library. It also uses a generic underlying storage layer, so I can store things in an mmapped file, but I can also do my reads and writes individually and manually, or use io_uring, the fancy new I/O system in Linux, or do the other thing I'll talk about later in this talk.

Now, a really brief description of how I manage crash safety, using this system and multiple B-trees and roots. B-trees are these magical data structures that always stay balanced without your having to do anything special. A B-tree is a search tree with more than one element per node; in Sanakirja, nodes are limited to the size of one memory page or one disk sector, four kilobytes, and I store as many keys and values as I can in each block. For the sake of this example, I've limited my block size to just two elements, to keep the picture simple.

Say I want to insert a five. First I decide where to insert it, routing from the top: I know I need to insert it between three and seven, because five is between three and seven, so I go down to that child, and now I know I need to insert the five between the four and the six. But that node is already full (I told you the limit is two elements), so this causes a split: I get two leaves, 4 and 6, and since I wasn't able to insert the 5 into either of them, I have to insert it into the parent, between the three and the seven. But that node is full again, it's already at maximum capacity, so I need to split it too, and this is what I get. This is magical, because it's a super simple way of doing insertions that keeps the tree balanced: the only way the depth can increase is by splitting the root, and that automatically gives you the guarantee that all paths have the same length. I really love that idea. It's one of the oldest data structures, but it's still really cool, and it's very suitable for storing stuff on disk.
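Here's a compact toy version of that insertion-with-splitting, using the capacity of two elements per node from the slide (illustration only; Sanakirja's nodes are 4 KB pages holding many entries):

```rust
// Sketch of B-tree insertion with node splitting, capacity 2 as in
// the talk's example. An in-memory toy, not Sanakirja's code.
#[derive(Debug)]
struct Node {
    keys: Vec<u32>,      // at most 2 keys per node here
    children: Vec<Node>, // empty for leaves
}

// Insert `k`; if the node overflows, split it and return
// (middle key, new right sibling) for the parent to absorb.
fn insert(node: &mut Node, k: u32) -> Option<(u32, Node)> {
    let pos = node.keys.iter().position(|&x| k < x).unwrap_or(node.keys.len());
    if node.children.is_empty() {
        node.keys.insert(pos, k);
    } else if let Some((mid, right)) = insert(&mut node.children[pos], k) {
        node.keys.insert(pos, mid);
        node.children.insert(pos + 1, right);
    }
    if node.keys.len() <= 2 {
        return None;
    }
    // Overflow: keep the left key, push up the middle key, move the
    // right key (and right children) into a new sibling.
    let right_keys = node.keys.split_off(2);
    let mid = node.keys.pop().unwrap();
    let right_children = if node.children.is_empty() {
        Vec::new()
    } else {
        node.children.split_off(2)
    };
    Some((mid, Node { keys: right_keys, children: right_children }))
}

fn main() {
    let mut root = Node { keys: vec![3, 7], children: vec![
        Node { keys: vec![1, 2], children: vec![] },
        Node { keys: vec![4, 6], children: vec![] },
        Node { keys: vec![8, 9], children: vec![] },
    ]};
    if let Some((mid, right)) = insert(&mut root, 5) {
        // The root itself split: the tree grows one level, the only
        // way a B-tree's depth ever increases.
        root = Node { keys: vec![mid], children: vec![root, right] };
    }
    println!("{:#?}", root);
}
```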
Now, a bit about crash safety: how we use all this to keep our data safe. The way we do it is by having a number of sectors at the beginning of the file, each pointing to one copy of the entire database. For example, here, the first page points to the old version of my B-tree, this version here, and on the next one I'm building the new version by modifying some stuff. For the new version I don't have to copy everything; I copy just the bits I edit, and everything that hasn't been modified is shared with the previous version.

And that's all we do. So what happens when you unplug the machine, at any time, really? That part simply never gets written: the pages at the beginning of the file don't get written, so nothing happens, and the allocations go back to what they were before you started the transaction. The commit of a transaction happens when we change the first eight bytes of the file. Hard drives usually guarantee that you can write a full sector; they have a little battery inside that keeps the write of at least one full sector going. But often they tell you it's "best effort", so there's no actual guarantee that they do it. They guarantee it, but with no actual guarantee; I don't really know what that means. What I do know is that writing eight bytes should be okay: if they make a best effort for 4096 bytes, then there are certainly eight bytes they can write, with high probability.

Another feature of this system is that writers don't block readers, because the old versions are still available: if you start a read-only transaction while you're writing something, you can still read the old version. That's really cool as well; it's not super useful in Pijul itself, well, unless you start running Pijul in the cloud, as I'll show in a minute.

And while this sounds super fancy, with lots of redundancy, crash safety, copy-on-write, and should therefore be super expensive, it's actually the fastest key/value store I've tested. These two curves show how long it takes to retrieve things from, and insert things into, my B-trees. This is not specific to Pijul and not particularly optimized for Pijul; the only Pijul-related thing is that I haven't implemented long values yet, just because I've never needed them. Here I'm comparing four implementations of key/value stores. The slowest is a Rust crate called sled. Sled is slow in this test, but it's also really cool: it uses state-of-the-art technology to do lock-free transactions on the database, so you can have a giant computer with hundreds or, more realistically, thousands of cores, and your transactions won't block each other while still having ACID guarantees. Super cool, but unfortunately still a research prototype, and for the kind of stuff I'm doing on a single core it's not super relevant. The green line is the fastest C library, LMDB; it's battle-tested, and it's claimed in many places to be the fastest possible. Then there's Sanakirja, the system I've just introduced. And the orange line is a benchmark of something that cannot be achieved: the B-tree implementation in the Rust standard library, which doesn't store anything on disk. If you're storing things on disk, it will obviously take more time; the reason I added it is just to see how close we are, and we're not paying a lot for storage (on an SSD, obviously), because we minimize the number of reads and writes to the disk. The put curve shows similar performance. Removing sled, we can see it more clearly: this is about twice as fast as the fastest C equivalent.

And this was actually unexpected: performance was never the goal. The goal was just to be able to fork. Initially I contacted the author of LMDB to get him to introduce a fork primitive, but it was apparently impossible with that design, so I had to write my own.
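The forkability falls out of the same copy-on-write structure. A sketch of the idea (not Sanakirja's API; `Rc` stands in for shared on-disk pages):

```rust
// Sketch: why copy-on-write makes forking a database O(1). Versions
// share unmodified pages; a "fork" is just a new root pointing at
// the same pages.
use std::rc::Rc;

#[derive(Debug)]
enum Page {
    Leaf(Vec<(u32, u32)>),
    Node(Vec<Rc<Page>>),
}

fn main() {
    // Version 1 of the "database": one root sharing two leaves.
    let leaf_a = Rc::new(Page::Leaf(vec![(1, 10), (2, 20)]));
    let leaf_b = Rc::new(Page::Leaf(vec![(3, 30)]));
    let root_v1 = Rc::new(Page::Node(vec![leaf_a.clone(), leaf_b.clone()]));

    // Fork: a second root referencing the same pages. No bytes copied.
    let fork = root_v1.clone();

    // A later write copies only the page it touches (leaf_b here);
    // leaf_a stays shared between all versions.
    let new_leaf_b = Rc::new(Page::Leaf(vec![(3, 30), (4, 40)]));
    let root_v2 = Rc::new(Page::Node(vec![leaf_a.clone(), new_leaf_b]));

    println!("v1 = {:?}", root_v1);
    println!("fork before write = {:?}", fork);
    println!("v2 after write = {:?}", root_v2);
}
```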
All right, now some announcements: a hosting platform. We all like working together, but we don't like setting up servers. So how do we collaborate and share repositories? One way to do it in Pijul, and that's been the case since the beginning, is self-hosted repositories over SSH, but that's not always convenient: you have to set up a machine in order to work together. So I wanted to build a hosting platform, and I actually built it. The first version was released quite a while ago, in 2016. It's written entirely in Rust, just like Pijul, with PostgreSQL to handle user accounts, discussions, text and so on. It ran for a while on a single machine, and it went through all the iterations of the Rust asynchronous ecosystem, which means a lot of refactoring and rewriting, and it's never been really stable.

The worst time for stability was definitely the OVH Strasbourg data center fire in March 2021. I saw a slide yesterday, in one of the talks, where someone talked about your server being on fire, but I don't think they really meant it; here I do really mean it: there was an actual fire in the actual data center, and the machines were down for two weeks. Because it was an experimental prototype, there were no real backups, no replication, nothing of the kind in place. During those two weeks, I took advantage of that little break in my work to rebuild a replicated setup, using the fact that Pijul is itself a CRDT, so it's easy to replicate, and using the Raft protocol to replicate Postgres. At the time this was also convenient because my two largest contributors were using the Southern Cross cable, if you know what that means: they were communicating with the server in Strasbourg by first going from New Zealand and Australia to San Francisco, then across the US, across the Atlantic Ocean, and across France to Strasbourg. They had absolutely unbearable latencies, so this was cool, because I was finally able to give them a proper server with short latencies and short response times.

It's been working okay for a little over two years now, but the problem is that at the moment this is a personal project, totally unfunded, so the machines are really small, and I'm using Postgres in ways that aren't really intended. The core of my database is actually Sanakirja and Pijul; it's not stored in Postgres, so I need to communicate between these two databases, and I need them to be located close to each other. The replicas are not just backups; they're backups and caches at the same time. The consequence is that when the machines are under too high a load, Postgres takes a little more time to answer, the Raft layer interprets that as a total failure, and it triggers a switchover of the cluster leader. That would be okay if it just meant some downtime, but the consequence is actually much worse than downtime: it's data loss. Having small machines is fine; I don't mind if some of my users, on my experimental system, see it crash sometimes, or see it go down for a little while. That doesn't really matter. But when they start losing data, that's a problem. So I've decided to rewrite it.
Because I had been working with Cloudflare Workers and function-as-a-service in other projects, my renewable energy projects, I started thinking about how we could use them for Pijul. Really quickly: function-as-a-service is different from traditional architectures, where you have a big process overhead, or a virtual machine overhead, for each little piece of server you're running. Instead, you share the machine, and you share a single giant JavaScript runtime between lots of different functions, even from other users; Cloudflare runs, on each machine, one giant runtime shared by all its customers. That's really cool, because you can answer from all of Cloudflare's 250 data centers, which gives you optimal latency. It's also very easy to write: the minimal example taken from their documentation just responds "hello worker" to a request.

Now the question becomes: can we run, or at least simulate, a Pijul repository in a pure function-as-a-service framework? The storage options are fairly limited in function-as-a-service: you don't have access to a hard drive, you don't even have an actual machine. So how do you do that? How do you pretend to be a full-fledged Pijul repository when, in fact, you're just some replicated, eventually consistent key/value store in the cloud? That was the main challenge, and it's completely at odds with my hypotheses when I first wrote Sanakirja and Pijul: there is no hard drive at all.

The solution is to compile Sanakirja to WebAssembly, because you can run wasm on Cloudflare Workers, and to store pseudo memory pages in their storage engine: instead of using disk sectors, I'm using keys and values in their storage engine. The main problem now becomes eventual consistency. I solve that using the multiple roots I talked about earlier: I keep the older roots, because I know the changes I'm making to my key/value store may not have propagated to all data centers yet. Cloudflare guarantees, for example, a one-minute propagation time, so that's what I use to decide how long to keep the older roots, in order to avoid the data centers stepping on each other's toes. And we don't need a full Pijul there: checking dependencies and maintaining a list of patches is enough.
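A sketch of that storage indirection (a hypothetical trait, not Sanakirja's actual one):

```rust
// Sketch: the "pseudo memory pages" idea. The store normally talks
// to 4 KB file sectors; behind a trait like this, the same B-tree
// code can run on wasm with pages kept in a cloud key/value store.
use std::collections::HashMap;

const PAGE_SIZE: usize = 4096;

trait PageStore {
    fn read(&self, page_no: u64) -> Option<Vec<u8>>;
    fn write(&mut self, page_no: u64, data: Vec<u8>);
}

// In production this would wrap the provider's KV API; here a
// HashMap stands in for the replicated store.
struct KvPages {
    kv: HashMap<String, Vec<u8>>,
}

impl PageStore for KvPages {
    fn read(&self, page_no: u64) -> Option<Vec<u8>> {
        self.kv.get(&format!("page:{page_no}")).cloned()
    }
    fn write(&mut self, page_no: u64, data: Vec<u8>) {
        assert_eq!(data.len(), PAGE_SIZE);
        self.kv.insert(format!("page:{page_no}"), data);
    }
}

fn main() {
    let mut store = KvPages { kv: HashMap::new() };
    store.write(0, vec![0u8; PAGE_SIZE]); // the "first sector": root pointers
    assert!(store.read(0).is_some());
}
```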
Some technical details, as almost my conclusion. This service uses TypeScript for the web parts and the UI, and Rust compiled to WebAssembly for the Pijul parts. It can be self-hosted, although I've never tested that yet, using Cloudflare's workers runtime, which they've released as an open source project. It's open source, AGPL-licensed, and it will be released progressively, because there's a lot of stuff to release that's currently just experimental and prototypal. And it's starting today: I opened it just before the beginning of this talk, so you can now connect to nest.pijul.org and start creating accounts. There's no documentation yet, things may crash, there are probably lots of bugs, but all of that will come in the next few days or weeks.

As a conclusion: this is a new open source version control system based on proper algorithms, rather than the collections of hacks we've had for some time. It's scalable to monorepos and large files, and it's potentially usable by non-coders. The farthest stretch I've seen in discussions around this project is using it as a tool to help parliaments do their jobs. Parliaments are giant version control systems operated manually by highly qualified and highly paid lawyers, who are paid to check the logical consistency of the law, but actually spend a significant share of their time editing Word documents to apply changes that have been voted by members of parliament. They're doing manual version control, and they're wasting lots of time on it. I've collaborated with the French parliament, which would have been a good test case, except that we're not actually using our parliament at the moment, since the cabinet passes its bills as it wishes; it's like the test mode of an API. It could also be usable by artists; I've talked to lawyers as well; maybe by Sonic Pi composers, we had a really cool discussion last night about that; and why not by Lego builders wanting to build larger projects?

The hosting service is available since today, as I said. And another conclusion, a personal one: I have a tendency to work on way too many things at the same time, and it never works well, until it does. For example, working on electricity sharing at the same time as a version control system helped me see how the two fit together, and share ideas across projects.

To conclude, I would like to acknowledge some of my co-authors and contributors: Florent Becker, for all the discussions, inspiration and early contributions; tankf33der, the most patient tester I've ever seen, who is still there after many years, patiently checking all my bugs, so a huge thanks to him; Rohan Hart and Angus Finch, the two folks using the Southern Cross cable, who have contributed really cool stuff to Pijul; and Chris Bailey, who has bridged the gap between lawyers and legal people and what I'm doing. All right, thanks for your attention.
Hi everyone. First of all, thank you so much for coming. This is a project I've been thinking about and working on for a couple of years, and this is the first time I'm presenting it at a conference, so I'm excited and nervous, and I really appreciate your interest. If you're watching this at home: thank you so much for watching this video; I watch tons of these NDC videos myself.

Before I start, there's this disclaimer, which you've seen on Twitter and elsewhere, and I need to double down on it: yes, I work for GitHub, and yes, I'm talking about source control, but this is a side project, a personal project. I'm not announcing anything on behalf of GitHub today at all. Don't get me in trouble. And my friend here is watching you, so don't get it wrong.

Here's what we're going to talk about. A lot of these tech presentations are either product presentations, about what the product does, or philosophy presentations, or code presentations; this one gives you a little bit of all three. I'll talk about why it's important to be working on source control right now; I'll talk about Grace, of course, and what I've done with it; I'll talk about its architecture and do a little demo; then I'm going to talk philosophy about why I chose F#, which is my favorite language; and then we're going to look at some code at the end, and hopefully it all makes sense.

So why on earth am I doing a source control system? I mean, Git has won, right? What's wrong with me? With all due respect to Git and to its honorable competitors: Git has wiped the floor with everyone. We all use it by default, and it handles almost every source control scenario, though large files are a problem, as you probably know. I think Git won because its branching mechanics are really brilliant compared to the older source control systems; the lightweight branches are great, and I love the ephemeral working directory in Git. GitHub is of course a big part of why Git won, and, you know, Linus Torvalds, I've heard he's famous, so that clearly had a part in it.

In terms of doing a new source control system, I have to talk about some of the things I don't like about Git, so I have to say some mean things about Git on the next slide. But before I do, I want to say really clearly how much I respect Git and its maintainers. I'm privileged to be at GitHub and to know some of the core maintainers of Git, and they are brilliant engineers; one day I hope to grow up and be half as good as they are. They do world-changing engineering, they're super professional, they're just amazing. Their blog posts, if you ever want to read the stuff from Taylor Blau and Derrick Stolee on the GitHub blog, are mandatory reading. They're great. I really, deeply respect Git.

However. Git has terrible UX. Absolutely terrible, no good, horrible UX. It's designed by a kernel developer, and I don't want my UX designed by a kernel developer, ever. I mean, my second programming language, when I was 11, was 6502 assembly; I've done three kinds of assembler; I get it, I love being down at the metal, it's super interesting. But I don't want kernel developers designing my UX. And Git was really designed for 2005's computing conditions: much lower network bandwidth, smaller computers, smaller disks, smaller networks.
We don't have those constraints anymore. If you're here at NDC, or watching this video, odds are you have abundant bandwidth, abundant disk space, and a connection to the cloud; a lot of us can't even do our jobs without an internet connection anymore. So things have moved on. There's also tons of research on how confusing Git is, but what I really want to pound the table on is that we need to stop blaming users for not understanding it. Git is really hard to understand.

There's this really interesting quote from a research paper from Google, from ten years ago or so: even one of the more experienced Git users requested that someone else perform an operation because, quote, it scares the **** out of him. So it's not just me saying this. Mark Russinovich, by the way, is a pretty smart guy: "git is the bane of my development process." Being smart is not the ticket out of understanding Git. Or: "an incantation known only to the git wizards who share their spells." You know who else knows Git is hard? GitHub. This is what you see when you go to download GitHub Desktop: "Focus on what matters instead of fighting with Git." So we know.

Here are some interesting questions from Quora, from about a year ago, when I was doing this research, and some of them are kind of funny. Why is it so hard to learn? What makes you hate Git? Is Git poorly designed? "Git is awful": that's plain enough. They're funny, but there's pain in these questions too, which I really feel. And this one broke my heart: "If I think Git is too hard to learn, does it mean that I don't have the potential to be a developer?" Imagine some young person who's just getting started, messing around with some JavaScript or Python, and they're thinking, wow, I think I could do this, this seems fun. And then someone hands them Git, and they go: what is this? We deserve better. We need to do better. And instead of complaining about it, I've decided to devote pretty much all my free time for the last couple of years to doing something about it.

I also just want to point out that we're in an industry where everything changes; nothing lasts forever. Git right now is so dominant that it's hard to imagine something new coming along and replacing it, but if we don't imagine it, we can't get there. We're driven by trends just as much as we are by good technology, and I've had a little bit of an education in trend analysis. I'm very lucky: my partner was a fashion designer, a clothing designer, and her job for 15 years was to think about what women would want to wear, what women would want to feel like, a year and a half or two years from now, and then design clothes for it. She was very successful; she's very good at it. And just in conversations over all the years we've been together, I've picked up a few things; I've picked up that perspective. She sells way better than I ever will. Some of our trends have shorter cycles, like web UI frameworks, which come and go every six months or so (I mean, it's stabilized now), and some are longer: for 20 years, when you said the word "database", what you meant was a relational database, and now there's key/value, there's document, and with Hadoop we got MapReduce. Things do change. Most importantly, no product that's ever gotten to 90 percent has just gotten there and stayed there forever.
Git is currently, according to the Stack Overflow developer survey (and there should be a new survey coming out in about a month), at 93.9 percent. And something with UX that bad is not going to be the thing that breaks the trend: Git is going to go. There will be source control after Git. And what I want to say about it is that it won't be "Git++". I've had this discussion a lot over the last couple of years: can't you put another layer on top of Git, use Git as a back end and put a new front end on it? That's been tried a number of times over the last many years, and none of those attempts got any market share. I feel like once people go through the challenge of learning Git, they don't want to re-challenge themselves by learning something else just to use the thing they already know. And I want to say that adding features to Git won't prevent Git from being replaced; it's not that if Git makes some changes, it'll extend its lifespan. It won't. It's too late for that. I do think that whatever replaces Git will have to be cloud native, because, oh look, it's 2023. And I think the thing that will attract users to the new thing is that it has to have features that Git doesn't have, or can't have. I've tried to build some.

Okay, let's talk about Grace. "Source control is easy and helpful." Imagine that. That's really what I've tried to do; it's been my north star from the first evening I was sitting on my porch, during pandemic lockdown, dreaming about it, thinking: how can I make something super easy? My inspiration, believe it or not, was the OneDrive sync client. The OneDrive sync client used to be problematic a few years ago, but starting about three or four years ago it's been great, rock solid; it just works. I really like it. Substitute Dropbox, iCloud, whatever you want: those things just work, they're easy, they sync. I was thinking about how to get stuff off my machine and into the cloud. So Grace has features that make being a developer easier, not just source control, but things that try to make your life easier every day, and it feels lightweight and automatic; that's what I'm mostly going for. It hopefully reduces merge conflicts, which in Grace you'll see called promotion conflicts. And yes, of course, it's cloud native, thanks to Dapr.

Let's talk about basic usage real quick; I'm going to show you lots of pictures and a short demo. The first thing is "grace watch". You can think of it as the thing that runs in the system tray on Windows, in the lower right-hand corner; on a Mac, it runs with that little icon near the clock. It's a background process that watches your working directory for any changes, and every time you save a file, it uploads that file to the cloud, to the repo, and marks it as a save, so that you get a full new version of the repo, a new root directory version with a newly computed SHA, every single time you save a file. It's automatic; it just works in the background. I'll show you some detailed examples of that.
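To illustrate the idea, here's a toy polling loop (in Rust, like the other sketches in this document; Grace itself is written in F#, and presumably uses real filesystem events rather than polling):

```rust
// Toy sketch of a "watch" loop: poll a directory, and when a file's
// modification time changes, pretend to upload it and record a save.
// Illustration of the save-on-every-write idea only.
use std::collections::HashMap;
use std::time::{Duration, SystemTime};
use std::{fs, path::PathBuf, thread};

fn upload_and_mark_save(path: &PathBuf) {
    // In Grace this would upload the file to object storage and
    // create a new root directory version with a new SHA.
    println!("save: uploaded {:?}", path);
}

fn main() -> std::io::Result<()> {
    let mut seen: HashMap<PathBuf, SystemTime> = HashMap::new();
    loop {
        for entry in fs::read_dir(".")? {
            let path = entry?.path();
            if !path.is_file() {
                continue;
            }
            let modified = fs::metadata(&path)?.modified()?;
            // First sighting or changed mtime: record a save.
            if seen.insert(path.clone(), modified) != Some(modified) {
                upload_and_mark_save(&path);
            }
        }
        thread::sleep(Duration::from_millis(500));
    }
}
```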
Aside from that background process, the main commands you're going to see are these. "grace save" is the thing grace watch runs to upload after every save on disk. Then there are checkpoint and commit. In Git, commit is an overloaded concept, right? Commit means "I'm partially done": git commit -m "I'm done with part one of the PR", "I'm done with part two", and then you do a final git commit, "I'm ready for the PR", and then we have this debate about whether to squash. There's no squashing in Grace; you don't need it. Checkpoint is that intermediate step, and it's just for you: it's for you to mark your own time, it doesn't affect anyone else, and because of that we can do some cool things I'll talk about later. There is of course a "grace commit"; a commit is a candidate for merge, or what I call promotion. And eventually, when you're ready to promote, you do a "grace promote"; I'm going to show you how the branching works in a little bit. There's also, of course, "grace tag", which is a label, if you want to label a particular version: "this is version three of our production code".

Those five things, save, checkpoint, commit, promote and tag, are what I call references, and a reference in Grace is something that points to a specific root version of the repo; those root versions just come and go. Of course there are other commands. You need a status; "grace switch" to switch branches; "grace refs" to list the references you've been working on, all your saves and checkpoints and commits, or just your checkpoints, or just your commits, whatever. There's of course a diff, there's a rebase, and there's going to be a "grace share" command. Share is kind of interesting. I haven't written it yet; some of the stuff I'm talking about I haven't written yet. This is an early alpha, just to be clear; I should have said that. It kind of works. Share is this idea where I'm working on my code, I have a problem, and I might want to get in a chat with a team member and go, hey, could you look at this code for me? You say "grace share" and it spits out a command line that you can copy and paste and give to that person. Because with Grace everything is automatically uploaded, and the status of your repo is always saved, you don't ever need to stash; they can literally take that command, paste it, hit enter, and now their version of the repo is exactly the one you were working on, and when they're ready, they just do grace switch back to their own branch. That's the kind of workflow I'm trying to enable.

A quick general overview. Locally, there's grace watch, the background filesystem watcher that watches your directory. Switching branches is really fast and lightweight; I love that about Git, so I kept it. Of course I have a .grace directory, like a .git directory, with objects in it, and config, and stuff. That's the local side, and your working directory is still totally ephemeral. The Grace server is actually just a Web API. There's nothing fancy, no special tightly-packed binary protocols; it's just a Web API, and you can code against it too. I use Dapr, if you're familiar with Dapr, to enable it to run on any cloud or any infrastructure. And because everyone's running grace watch (you don't have to run grace watch, but you really want to), the server has a kind of real-time view of what everyone's up to, and that enables some really interesting features to be built. Also, because everything that happens in Grace is an event that gets saved, you immediately get an eventing model that you can automate from. And I just want to say: the system is written almost entirely in F#, and I'll talk about that, of course.
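The five reference kinds can be pictured as a single record type (field names here are my guesses, not Grace's actual schema):

```rust
// Sketch of Grace's "reference" idea as described in the talk: every
// gesture creates a record pointing at a root directory version.
#[derive(Debug, Clone, Copy)]
enum ReferenceKind {
    Save,       // uploaded automatically on every file save
    Checkpoint, // personal marker, kept on your branch
    Commit,     // candidate for promotion
    Promotion,  // a commit's root recorded on the parent branch
    Tag,        // a label such as "production v4.2"
}

#[derive(Debug)]
struct Reference {
    kind: ReferenceKind,
    branch: String,
    root_sha: String, // the root directory version it points to
    message: Option<String>,
}

fn main() {
    // A promotion is "just a database record": the same root SHA as
    // the commit it came from, recorded on the parent branch.
    let commit = Reference {
        kind: ReferenceKind::Commit,
        branch: "alice".into(),
        root_sha: "ed3f".into(),
        message: Some("ready for promotion".into()),
    };
    let promotion = Reference {
        kind: ReferenceKind::Promotion,
        branch: "main".into(),
        root_sha: commit.root_sha.clone(),
        message: Some("promoting".into()),
    };
    println!("{:?}\n{:?}", commit, promotion);
}
```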
Now, some of you might be thinking: wait just a damn minute, you're uploading every save? That seems like a lot. And the answer is: yeah, but we don't keep them for very long, unless you specifically checkpoint or commit or promote. By default I'm starting with seven days; I don't know if that's the right number, maybe it's three, maybe it's 30, whatever it is, we'll figure it out, and it's going to be settable per repo. But let's say seven days: after that, we just delete the saves. They're there for you to look back on when you want to, and then they just go away. Commits and promotions are kept forever.

And that gets me thinking about what features we can build. One thing I wanted to build: if you've used the Mac backup, Time Machine, that interface was sort of an inspiration I had in my head. I'm not going to design it quite like that, but it's this idea where you can just go back and leaf through the diffs of the changes you've made, and hopefully do some cool stuff like reload state in your mind faster. You know that often-quoted number, that it takes 18 minutes to reload state in your head after an interruption? What if we could use source control to get that down to two minutes, where you could just quickly leaf through your saves and go: oh, I see, I see, I see what I was doing.

You can also do cool stuff like this: if the server can see what everyone's doing, and it sees that Alice is working in a particular file and Bob is working in the same file right now, well, that's a possible conflict, so we can notify both of them in real time: hey, your teammate is working in the same file; here's a link to the version your teammate is changing. Maybe it's in a totally different part of the file and you don't care; maybe it's in the same part, and now you get on chat and talk about it before you get to a conflict, when you think you're done with your work. I hate merge conflicts. I hate them so much. I'm sure I'm alone in that. Another thing we can do: when there is a promotion to main, I can auto-rebase everyone in the repo on that new version immediately. You'll see that this is super important in the way Grace's branching is designed, and it happens within seconds.

So what is the branching model? I call it single-step branching, and the idea is that, like Git, there are these versions of the repo, and they have a particular SHA value, and all of the commits and labels in Git just point to a particular version. That's what I do with Grace: a promotion is just a new reference, a database record, in other words; it's a new database record that points to the same root directory that your commit pointed to. I do grace commit -m "I'm ready to promote", then I do grace promote to main, and all grace promote does is create a database record. That's all it does, because the files are already uploaded; you already did the commit. And a child branch cannot promote to the parent branch unless it's based on the latest promotion. That prevents a lot of conflicts.
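The rule can be stated in a few lines (my paraphrase of the talk, with invented names):

```rust
// Sketch of the single-step branching rule: a child branch may
// promote only when it is based on the parent's latest promotion.
struct Branch {
    name: String,
    based_on: String,         // promotion SHA this branch is rebased on
    latest_promotion: String, // only meaningful for the parent branch
}

fn can_promote(child: &Branch, parent: &Branch) -> Result<(), String> {
    if child.based_on == parent.latest_promotion {
        Ok(())
    } else {
        Err(format!(
            "{} is based on {}, but {} is now at {}; rebase first",
            child.name, child.based_on, parent.name, parent.latest_promotion
        ))
    }
}

fn main() {
    let main = Branch {
        name: "main".into(),
        based_on: "ed3f".into(),
        latest_promotion: "a1b2".into(), // Alice just promoted
    };
    let bob = Branch {
        name: "bob".into(),
        based_on: "ed3f".into(), // still on the previous promotion
        latest_promotion: String::new(),
    };
    // Bob can't promote until grace watch auto-rebases him onto a1b2.
    println!("{:?}", can_promote(&bob, &main));
}
```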
So let me show you a little bit of what that looks like. Here's a diagram that I have. Oops, not that. This. Here's my repo, by the way. I want stars. I want lots of stars; if you're watching this at home, I want all the stars. Star the repo: it's the only way I can let people know. So here's a little document I have on my branching strategy, and I'm going to show you four quick pictures to walk you through it really quickly.

Here are Alice and Bob. When they're not busy holding two entangled particles at improbably large distances (if you're into physics, you see Alice and Bob a lot), they're working on a repo together. They're starting out in good shape: main is at this particular SHA value, ed3f-something, and Alice and Bob are both based on that version. Everything's cool. Their SHA values are different from each other, because they're updating different files: Alice is updating her files, Bob is updating his, but they're both doing their thing.

Now Alice makes one more change, runs some tests, and goes: cool, I'm ready to promote. She types grace commit -m "I'm done with my PR", and then she types grace promote -m "I'm promoting". Her files are already uploaded, because she did the commit, so the promote just creates a new database record on the main branch that points to that particular version of the repo. After she does that, and let's assume it's successful, we're in a state where Alice is based on the version she just promoted; in fact, she's identical to the version she just promoted, because she just promoted it. But Bob is still based on the previous version of main, and that means that, for the moment, Bob can't promote. Poor Bob. However, heroically, grace watch comes to the rescue (that's so hacky, I'm sorry). Grace watch gets a notification that a promotion has happened and immediately auto-rebases Bob, within seconds. As long as there are no conflicts, that is, as long as the files that changed aren't the ones he's changed, those new versions get downloaded, Bob's branch is marked as now being rebased on the latest version of main, and a new save is created, because Bob's branch now has the files he was changing plus whatever just came down from the promotion. After that's done, Bob has a new SHA value, which reflects the files he changed plus the ones that were in the promotion, and Alice at this point has the same exact SHA value as main, because she literally just committed and promoted.

So that's branching in Grace. It's really simple, as simple as I could possibly make it, mostly because, like, I'm not that bright; I mean, why make it hard on myself? But really, it's simple to understand.

Let's go back to the deck. I'm going to show you some pictures of how this works from the server's point of view. Let's say I'm working on my branch, and these dots are directories in a structure, in a repo, and I save a file in this bottom directory over here. Grace watch sees it, uploads that version of the file to object storage, and has to compute new SHA values for those two directories, so it creates what I call directory versions. It uploads those to the server, the server double-checks them, and now we have a save reference pointing to this brand-new version of the repo that doesn't exist in anyone else's branch. Cool. At the same time, my teammate Mia is working, and she saves a file in that bottom directory; that file gets uploaded, and three new directory versions are computed, all the way up to the root, uploaded to the server, and double-checked. Now she runs some tests and goes: cool, I like this version, I'm going to checkpoint it. When she does the checkpoint, all it does is create a database record, because there's nothing to upload; it's already uploaded. So that takes about a second.
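A sketch of that recomputation (the hashing scheme here is invented for illustration; it uses the widely available sha2 crate):

```rust
// Sketch: recomputing directory versions from a changed file up to
// the root, which is what grace watch does on every save.
use sha2::{Digest, Sha256};

fn sha_hex(bytes: &[u8]) -> String {
    format!("{:x}", Sha256::digest(bytes))
}

fn main() {
    // A changed file at depth 3: say, src/core/store.rs.
    let file_sha = sha_hex(b"new file contents");

    // Each directory's version hashes its entries' SHAs; a change
    // deep in the tree therefore re-hashes only the path to the root
    // (3 directory versions here), not the whole repository.
    let core_sha = sha_hex(format!("store.rs:{file_sha}").as_bytes());
    let src_sha = sha_hex(format!("core:{core_sha}").as_bytes());
    let root_sha = sha_hex(format!("src:{src_sha}").as_bytes());

    println!("new root directory version: {root_sha}");
}
```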
I've really tried to make a lot of the gestures as fast as I possibly could, despite the fact that there's always a network hop involved in Grace, and I've aimed for that roughly one-second timing. When Git does local stuff, there are things Git is just going to be super fast at, things I can't compete with if there's a network hop; but there are things that involve the network where I can be faster than Git.

Anyway, here's me again: I save a file down there, and this time it's four directory versions that get recomputed and uploaded. Cool. Now teammate Lorenzo saves a file in that directory; the file gets uploaded, there are three directory versions, and he likes it, so he commits it, and he's thinking about promoting it. Meanwhile Mia is working: she saves a file in this bottom orange directory, she likes it, she runs tests, everything's cool, and she does a grace commit and then a grace promote. Now there's a reference pointing to that same exact root SHA. We have three references pointing at the same exact directory version: one of them is on main, two of them are on Mia's branch. In fact, we like it so much we're going to put a label on it and say this is our new production version, 4.2. Cool. So what do we have from that? We have five different versions of the repo and ten references pointing at those five versions.

Okay, seven days later. I said we keep saves for seven days or so and then we get rid of them; let's see how that works. Here's the save I did; these are the same five versions we just saw. Let's say it gets deleted. Now the root directory version says, "I don't have anything pointing at me anymore," and a recursive process starts: go down and check whether anything was unique to that version that doesn't appear anywhere else, and delete it. So that version is now gone. Now remember, Mia did a save and a checkpoint: we delete the save reference, but the checkpoint is still there, so that version of the repo stays. Here's me again: I did that save and didn't do anything else with it; the save gets deleted, and that version goes away. Here's Lorenzo again: the save goes away, but the commit is still there, so that version of the repo is still there. Now this last one: again, the save gets deleted, but we still have three references pointing at that directory version. And let's imagine Mia's branch gets deleted. No problem, we still have references pointing to that version from the main branch. So after seven days, we have three versions of the repo and five references pointing at them. And really, saves are going to outnumber everything else, something like 25 to 1 (I'm making up a number; it depends on your workflow), maybe 25 to 1 or 50 to 1, and they'll come and they'll go, and it's not a big deal.
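Here is a rough sketch of that recursive cleanup, under the same caveat that the names and shapes are invented for illustration rather than taken from Grace:

```fsharp
/// Delete a directory version when nothing references it, then consider its
/// children the same way. `referencesTo` counts remaining saves, checkpoints,
/// commits, promotions, and tags pointing at a version; `appearsElsewhere`
/// says whether another surviving repo version still shares this node.
let rec deleteIfUnreferenced (referencesTo: string -> int)
                             (appearsElsewhere: string -> bool)
                             (childrenOf: string -> string list)
                             (delete: string -> unit)
                             (sha: string) =
    if referencesTo sha = 0 && not (appearsElsewhere sha) then
        // Check the children first, then remove this node itself.
        for child in childrenOf sha do
            deleteIfUnreferenced referencesTo appearsElsewhere childrenOf delete child
        delete sha
```

The key point is the guard: a directory version only disappears when no reference points at it and no other surviving version of the repo shares it, which is why Mia's checkpointed version and Lorenzo's committed version stay put.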
So, congratulations: you now understand Grace. Really, that's as simple as it is. Obviously there's a tiny bit more, but fundamentally, in ten minutes I just explained Grace to you, and if you recall your experience learning Git, I'm going to guess it took more than ten minutes. One of the questions I like to ask people is: would you rather teach somebody Git, or would you rather teach them what a monad is? And you've got to think about that. I'd rather teach them what a monad is, to be honest.

Okay. Demo gods: I'm going to do a very short, light demo, just to show you a little of what it feels like, for one user. Here's my friend Alexi. (I'm a New York Rangers fan, so I pick players from the Rangers for names.) Here's Alexi with a file, and on the bottom here I'm running Grace watch; in fact, I'm just going to... oops... and here's Grace watch running. Right now Grace watch is a command-line thing with a lot of debug output; this is an early alpha. Now I'm going to run grace status, and here's the status of Alexi's... oh, it's funny, the font got changed; let me switch so we can see the full screen.

So here's the status of Alexi's branch right now. I was doing a little bit of testing earlier today: there was a save and a checkpoint. See, I checkpointed that same save, so they have the same SHA value. It's based on this particular promotion that Alexi did, and here's the message from that. And here's main, the parent branch that I'm based on. So you can see the status of your branch and the status of the branch you're based on.

Back to Grace watch. I'm going to update this file and add a comment somewhere... cool, new version of the file. I hit save, and there you go: file uploaded, new directory versions computed, server has all the information. That was strangely slow, I guess because I hadn't done it in a while; let's do another one just to see. In fact, let's update this comment. (Thank you GitHub, thank you Copilot.) I hit save again, and there you go: file uploaded, new directory versions. That's how fast it should be. This server I'm talking to, by the way, is my desktop computer in Seattle, 4,500 miles from here, 7,300 kilometers, and it's pretty quick. Not bad. That feeling of how quick it is in the background is what I'm trying to do with Grace. It's automatic: I'm showing you what Grace watch is doing, but the truth is you're just in here doing stuff, hitting save, and magic stuff happens in the background.

Let's say I like that version we just made and I want to checkpoint it. By the way, let's do grace status again: you'll notice that previously the save was the ca21 version; now I've done a couple of saves, and I have this new cd79 version. In fact, if I do grace refs, I can see all the references on my branch, and again, about 1.7 seconds, with everything in debug mode and the server a few thousand miles away. It's fast; I'm aiming for everything to be fast enough to keep you in flow. So that's all the things; you can see everything you've been doing. I'm going to do a grace commit, because I like this version and I did some tests... "NDC"... oh, slow... yay. Cool, there's my commit. Again, everything's already uploaded, so the commit is just creating a database record, and it's pretty quick. I like it, so I'm going to do a promote, a promotion from Alexi. Again, all I'm doing is creating a database record, and then I'm rebasing my own branch (because, well, I am). And now if I do grace status, what we'll see is: the save I did was that cd79 version, the commit I did was that same version, the checkpoint was something I did four hours ago, and main is now that same version, and I'm based on it. This flow is what I'm aiming for. It's just this quick; most of it happens in the background, most of it's automatic, and when you take an action, I'm aiming for that one-to-two-second window for all the commands.
That's the quick demo; that's really what it feels like, and like I said, you already understand it, I hope. Cool. Well, thank you. I wasn't really begging for applause, but thank you so much; after two years, I really do appreciate it. I'll totally take it. [Laughter] All right, let's keep going.

But wait, there's so much more I wish I had time to tell you about; I'm going to go through some of this really quickly. I'm using SignalR for two-way communication with Grace watch, so I have live two-way communication from the server to everyone. Right now I'm showing you the example of Alice and Bob, but what if it's Alice and 20 other contributors in the repo? What if it's 100 people in the repo, and they're all over the world? Well, now I can do cool stuff like auto-rebasing people all over the world. I don't even know all the features I want to build on this eventing model; I'm not the one who's going to think of them all. But I know I have the architecture to do it, and I know we can do some really cool things in real time. Especially now: I know people are going back to the office, but we're still going to be partially to mostly remote, and I feel like there are some interesting ways we can connect with our teammates at the source control level that have never been done before, and that can't be done in a distributed version control system. I don't know what's happening in your clone of Git until you do a push; but here, I always know what you're doing.
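The talk doesn't show this code, but as a sketch of what subscribing to promotion events over SignalR could look like from a watcher process (the hub route, event name, and payload below are invented for illustration):

```fsharp
open Microsoft.AspNetCore.SignalR.Client

/// Connect to the server's event hub and auto-rebase whenever a promotion
/// is broadcast. "PromotionHappened" and the "/events" route are made up here.
let startWatching (serverUrl: string) (autoRebase: string -> unit) =
    task {
        let connection =
            HubConnectionBuilder()
                .WithUrl(serverUrl + "/events")
                .WithAutomaticReconnect()
                .Build()
        // Register a handler for the server-to-client event.
        connection.On<string>("PromotionHappened", fun (promotionSha: string) ->
            autoRebase promotionSha)
        |> ignore
        do! connection.StartAsync()
        return connection
    }
```

The design point is that the server pushes to clients, rather than clients polling, which is what makes the within-seconds auto-rebase possible.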
With Grace in an open source capacity, you don't have to fork anything. You don't fork the entire repo; you just create a personal branch. Imagine, say, asp.net core: all you do is walk up to it and create a branch. You own the branch; they don't have any control over it; you do whatever you want. You can keep your branch private or public, visible or not, up to you. You can do work in it and do a PR off of it, which in Grace is not a pull request, it's a promotion request. There is no pull gesture in Grace, and there is no push gesture in Grace.

I have tested Grace so far on repositories as large as a hundred thousand files and fifteen thousand directories, and as long as you're running Grace watch, you still get that one-to-two-second timing. It just works. If you're not running Grace watch, there are a lot of other checks I have to do when you do a checkpoint or a commit or whatever: I have to check that all the things are uploaded, and if I'm walking through fifteen thousand directories to do it, it'll take a few seconds. With Grace watch, I know it's up to date. I've also tested it on 10-gigabyte files, because Grace is backed by object storage. (In my testing, I'm an Azure and .NET guy, so I've been running it on Azure storage.) It just works. To be clear, I don't think a 10-gig file should be in source control, but I've tested it; it totally works.

I have not built ACLs down to the file level yet, but I have an idea for how I'm going to do it, and I want to do it; I think it's super important, and Git can't do that. Custom branch permissions: let's say on main I want to enable promotions and tags but disable saves, checkpoints, and commits, because you should never do a checkpoint or commit on main. Okay, cool, that's already there. On your own branch, maybe you never enable promote, because why would you promote on your own branch?

You can delete versions. Obviously I'm deleting saves, so we have the mechanics for deleting versions. Unfortunately, we do all occasionally check in a secret, and I can speak for GitHub here: GitHub has worked extensively to create secret scanning, so when you do a push, we see that you're checking in a connection string or whatever. Still, people are going to do that, and deleting it out of Git is a nightmare. In Grace I'll have a gesture (I don't have it yet) that keeps the audit trail. It will say: okay, you promoted something with a secret, and we have to get rid of that version, but here's a record saying this version was this SHA value, and it was deleted by this person, at this time, for exactly this reason. So there's permanent auditability.

I've really had to think about enterprise features, of course; enterprises are going to be a big user of any version control system, and I'm aiming for that for sure. File locking is another one. One of the main reasons users still use Perforce is gaming, where there are big binary files. I had a conversation with somebody who does source control at Xbox, and he gave the example of somebody working on Forza who is editing a track file, the graphical definition of a track, and whose task is to add ten trees around turn three on some track. They have to lock that file, because there's no way to merge a binary file; if someone else edits it, they've just thrown away hours of work. So we have to have optional file locking. It won't be on by default, but I want to provide it in Grace.

As I said, there's a simple web API, and yes, it's super fast, and consistently fast. I'm a big believer in perceived performance, and perceived performance for me consists of two things: number one, is it fast, and number two, is it consistent. Your muscle memory becomes: if I type grace checkpoint and hit Enter, I know my command line is going to come back in a second, a second and a half, every single time. That's perceived performance to me; that's what I'm going for.

I very much want to build all the interfaces. Of course I'm building a CLI, and I will build a Blazor web UI. But I'm going to go on record and say: I hate Electron. I hate Electron so much. I hate WebView. We all have these first-class pieces of hardware, and we're running these second-rate apps on them; it drives me nuts. I believe in native apps. I'm going to try doing it with Avalonia; I haven't started that yet, but I hope and expect it will work well, and if not, I'll do .NET MAUI. But I will have native apps, including code browsing, some of these gestures, and that time-machine-like history view. If we have time at the end, I'll show you a little sketch of that I did. (What are we doing, 20 minutes? Oh my God, I have to hurry up.) Of course I'm borrowing stuff from Git; I respect Git enormously. I'm really trying.

Let's talk about the architecture really quickly. In case it's not obvious, Grace is a centralized version control system.
Decentralized version control, to me, is why we have a hard time understanding Git; being centralized simplifies everything. There have been lots of attempts at making a simple Git, or a simpler distributed system. By the way, there are a couple of other great projects going on right now. If you're interested in source control, you should really check out Pijul (p-i-j-u-l), a really interesting distributed version control system based on category-theoretic mathematics; it's patch-based. In fact, a couple of days ago the Stack Overflow podcast did an interview with its creator; it's a really nice project. Another one I really like is JJ, which comes from Google: martinvonz/jj on GitHub. I actually know that team and speak to them a little bit; they're great people, trying to do something really interesting and move Google's internal source control forward. I really like it; more friendly competition. I'm in this to win.

But I just want to point out: if you like distributed, you're not doing distributed today anyway. You're using GitHub or GitLab or Atlassian or whoever; you're doing hub-and-spoke; you're doing centralized version control. You never push from your machine to production. You always push to the center, the center runs some CI/CD stuff, and that's how things get promoted to production. You're doing centralized version control; you're just using a distributed version control system to do a dance around it. Why? And Git will be around for 15 or 20 years if you still want to use it.

As I mentioned, Grace is all about events; it's event-sourced, fundamentally. I really like CQRS; event sourcing goes really well with functional code and a reactive perspective on code, and I do a lot of projections off those events. I'll show you a little about that in a second. I use the actor pattern; actors are built into Dapr. I really like the actor pattern: it makes it super easy to reason about everything that's happening. Each individual actor is single-threaded, but you can have thousands of them running, and they scale up enormously. I use them for event processing and as a networked in-memory cache. They're really great, and you can have multiple state representations for them that last however long you want.

Grace is cloud-native thanks to Dapr. Dapr currently supports around 110 different platform-as-a-service products, and it's growing all the time; the community is growing, usage of Dapr is growing, and I'm really happy about the bet I made on Dapr back at version 0.9 or so. Azure, AWS, GCP, on-premises, open source and containers: whatever you want to use under it. Here's a little picture of it: it does service mesh, state management, pub/sub; it has the actors; it has monitoring and observability pieces. You can talk to it from any language; there are SDKs, and of course I'm using .NET. It communicates with any infrastructure.
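Here is the kind of portability that buys, as a tiny hypothetical sketch using the Dapr .NET SDK from F#. The code names a state-store component, and whether that component is backed by Redis, Cassandra, or Cosmos DB is decided purely by configuration; the type and key below are illustrative, not Grace's actual schema:

```fsharp
open Dapr.Client

type BranchDto = { BranchId: string; BasedOn: string }

/// Save a branch record through Dapr's state API. The "statestore" component
/// can be Redis locally and Cosmos DB in Azure without any code changes.
let saveBranch (dapr: DaprClient) (branch: BranchDto) =
    task {
        do! dapr.SaveStateAsync("statestore", branch.BranchId, branch)
    }

// Typical setup (hypothetical usage):
//   let dapr = DaprClientBuilder().Build()
//   saveBranch dapr { BranchId = "alexi-branch"; BasedOn = "ed3f..." }
```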
Here's a little example. If you were going to distribute Grace using open source products: here are my Grace containers, Grace Server with the Dapr sidecar, running in some Kubernetes cluster somewhere. For object storage I'm using MinIO or something compatible with it; Cassandra as a document database; maybe Redis, and Kafka for pub/sub; OpenTelemetry with Zipkin for monitoring; and HashiCorp Vault for secrets. Cool, that's my open source deployment. Now, if I'm deploying Grace to Azure: same exact Grace container, same code, same everything, but at runtime I configure Dapr differently. I put Cosmos DB under it, Service Bus and Event Hubs for pub/sub, Azure Monitor, and Azure Key Vault. And again, I did the Azure picture because I'm an Azure guy; fill out your AWS picture, your GCP picture, whatever you want. That's Dapr; it works very well, and I'm happy with it.

Projections. Imagine I have all these updates streaming in, saves and checkpoints and commits and whatever, and they hit the server. As soon as they hit the server, I'm going to trade some CPU and some disk space for your performance and your UX. So I'll do things like: when you do a commit, I might do a test promotion. Can I promote this? If I find a problem with the promotion, I might let you know. In fact, what I can do today, thanks to large language models, is detect a promotion conflict and actually send that code off to a language model for a suggestion to give back to you right away, so I don't just tell you there's a problem; I tell you and give you a suggestion for how you might deal with it. I might do temp branches; I might generate diffs right away, so that when you ask for a diff it just shows up in under a second. I can do all these projections off the data I have and just keep the events. And again: saves go away after seven days, but commits and promotions I keep until the end of time. Checkpoints, by the way, you don't need to keep forever; maybe you keep them for six months, or a year. I don't think you need to look back on your checkpoints from three years ago. Okay, that's Grace; that's the architecture of Grace.

I want to talk a little philosophy and a little programming language; I'm going to talk about why F# is not that radical. Programming languages sit on a spectrum from imperative to declarative. At one end there's the hardware: tell the hardware exactly what to do, step by step. (I like assembly language; it's really interesting.) At the other end there's the mathematical, category-theoretic, more declarative side. Here's a smattering of languages and roughly where they fit on this line; you might agree or disagree with where any particular one sits, but there's this idea that programming languages can be translated in either direction. In fact, we do that: when you move from the right side of this toward the left, it's called compiling, and when you move the other way, it's called decompiling. All I can say is that from age 11 to age 45 I spent my entire life on the left side of this, and I finally got curious about the mathematical side.

So why F#? Well, C# compiles to IL; so does F#. It's the same .NET runtime, the same everything that you love about C#; it just happens to be there, and then it gets compiled to assembly and runs just as fast. So I'm not an ivory-tower PhD; I'm really not, and that's not the statement I'm trying to make by using F#. I'm trying to say that it makes my code better. All I'm doing is moving a little toward the declarative side, and being over here gives me access to some cool stuff while I still have all the stuff over there. I'm not going over to Haskell (which is cool if you do, but that's not what I'm doing). So why F#? Don Syme, the creator of F#, describes it as succinct, robust, and performant, and .NET is
unbelievably fast. That's why. It's functional-first, but objects are part of the language, and especially if you're interfacing with NuGet, that matters: everything in NuGet is C#, all based on classes and methods and inheriting things, and F# has all that. It's also got a great community. Any F# people here? Yay.

And here's my philosophical statement: I think the industry has hit a ceiling on quality and comprehensibility with object-oriented-ish code, and I mean C#, Java, Ruby, TypeScript. For a small code base it's fine; a medium code base, fine; but when you get to a large code base with object orientation, if you don't make a serious investment in keeping that code clean, you're going to run into major problems. Functional code lasts longer. It really does; it stays clean longer, assuming you don't do stupid things.

I want to tell you my story very quickly: I wrote three versions of Grace, as my learning curve. The first version I wrote as I was getting deeper into F# and functional thinking, and I ended up accidentally writing C# in F#, because I hadn't yet made the mental leap to functional thinking and composability. That didn't smell good; it wasn't what I was hoping to get out of it, so I threw that version away. Then I said, all right, I'm going to relearn my category theory and I'm going to monad all the things. That didn't work either; it made some really awkward code. But I learned things from it, and I threw that version away too. The version you're seeing is the third version, which is more balanced and more practical: I use objects, I use classes, but I use them very functionally; I think about them functionally.

So why F#? This is why. My field report, as someone who spent his whole life on the object-oriented and assembly side: thinking functionally gives you better, safer, more beautiful code. It really, really does. It's worth the journey to learn; it's going to be painful, it's going to take a little while, but it's so worth doing, and I'm very happy with the code.

Are there challenges with F#? Sure. Serializing and deserializing records is painful: if you add a field to a record, now you can't deserialize anything you ever saved before. So I'm thinking I'll probably switch those to objects, in other words classes, but treat them functionally: I'm not going to object-orient them, I'm just going to use objects and treat them as immutable. Here's an interesting one I would never have imagined two years ago: Codex, the OpenAI model underneath GitHub Copilot, is trained on C#, not really on F#, and it turns out the suggestions I get out of Copilot aren't that great in F#. So I use few-shot prompting: I actually go to ChatGPT and use few-shot prompting to get better suggestions. And in F#, let's be honest, there are fewer samples; there's less example code. So what's the solution? Work harder, peasant. But it's not that bad; once you learn it, it doesn't take much.

I want to close with a little bit of code, so I'm going to pull up SharpLab really quickly and walk you through
a little bit of F#. I have to do this first... and then I can go to SharpLab. If you've never used SharpLab: it's a really cool site that lets you type code, compiles it in memory, and shows you the decompilation on the other side. So I have F# code on one side and C# code on the other.

Here's some very basic F# code. If you've never seen F# before, it's going to feel a little weird: wait a minute, where's the class? You have these let statements; what are they, just floating in space? What are they hanging off of? It feels weird. What I want to show you is what that compiles to. There's a namespace, there's a module, and there are these lets, which are definitions of values or functions or whatever. What does that compile to in C#? Well, here's my namespace, and that module is a static class. Static classes are awesome for composition; I'm going to talk about that in a bit. The let-bound value becomes just a property with a getter. Again, .NET is fundamentally object-oriented: the runtime knows classes and properties and methods; that's what it knows. So when F# compiles to IL, it has to compile to that; F# translates, again, from the right side of that diagram to the left side. So it translates the value into a property, and this function is just a static method. Cool.

Let's add something to that: a regular class, the kind you've seen in C#, with two properties, a time and a name. This is how you say "public class" in F#, and this is how you declare a property. If you look at what they compile to: here are your backing fields, here's your property with a getter and a setter. You've seen this in C#; it's no big deal.

Let's add a little more: a method in that class. What does it compile to? It compiles to a method returning void, the kind of method you might have on a class in your code base, where it takes in a value and changes something on the object. A very common shape in object-oriented code. Lovely, but not composable, and that's the problem; that's what I want to highlight.

Now let's add a couple of functions. We're out of the class, back in that module, which means we're in a static class. I'm going to add two functions, a change-name and a change-time, and you can see what they do: you pass in a name (or a time), you take in an instance of that class, and you return an instance of the class. That pattern, where I take in a type and output that same type, doing something to it along the way, is where you start to get composable code. (It's a monad. Don't tell anyone.) So what does that compile to in C# terms? Guess what: a static method on a static class. And that's how you get composable code. My number one tip for anyone who wants to write more composable, functional code in C# or Java or whatever: stop thinking in terms of methods on classes, and start thinking in terms of static classes with static methods that you use to manipulate your types. That will get you a long way down the road of being composable and testable.
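Here is a reconstruction along the lines of the SharpLab example; the names are approximate, not the exact code on screen. The module compiles to a static class, each let to a static member, and the two functions take and return the same type, which is what makes them chainable:

```fsharp
module MyModule =

    /// A plain class, as you'd write it in C#: two mutable auto-properties.
    type MyClass() =
        member val Name = "" with get, set
        member val Time = System.DateTime.MinValue with get, set

    // "Type in, same type out": update the instance, then hand it back,
    // so the next function in the chain can pick it up.
    let changeName (name: string) (c: MyClass) = c.Name <- name; c
    let changeTime (time: System.DateTime) (c: MyClass) = c.Time <- time; c

    /// Composition: thread one instance through both functions.
    let updatedMyClass (c: MyClass) =
        c |> changeName "Scott" |> changeTime System.DateTime.UtcNow
```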
These functions are super testable; they're pure functions. If you give me the same class and the same name every time, I will give you the same exact result, over and over and over. So now that I have these pure functions, what can I do? I can compose them. I have this new function called updatedMyClass: I take in an instance of the class, give it "Scott" and the current time, and look: composition. I know this might look weird if you've never seen F# or other functional languages before, but this style of coding gets you much more repeatable, testable, beautiful, easy-to-understand code. And I've really tried hard not just to make the UX of Grace beautiful, but to make the code beautiful; I want maintainers to like it too. I haven't entirely succeeded, but I've mostly succeeded; I'm mostly happy with it.

I have a few minutes left, and I want to show you a little bit of Grace's code itself. Let's do a little bit of validation; this is kind of interesting. This is the CLI part: I want to validate things on the CLI before I even send them to the server. You can't rely on that validation alone (the server has to check too), but I just don't want to let you send stupid stuff to the server. So here's a bunch of validations, like when you have an owner ID and an owner name. By the way, if you look at the status here, I have an owner, an org, and a repository: Grace is multi-tenant from the start. That's because I'm from GitHub and I've seen some of the pain of trying to retrofit multi-tenancy, so I built some light multi-tenancy in from the start.

Here's part of my code from when I was in stage two, monad-ing all the things. I have these validation functions, which take in a parse result, the parameters from the command line, broken into a structure that understands what they are. And what do they return? A Result type, a success-or-failure type. When a validation succeeds, it returns the same exact two parameters I passed in, which means, just like I showed, it's composable. When it fails, it returns this thing called GraceError, a special error type I have. All of these functions do the same thing: they take in a parse result and parameters, and they return a Result of a parse result and parameters, or an error. So what can I do with that? Oh my God, I can do bind. Monadic bind magic. I know this is new, and if you've never seen it before it looks a little weird, but isn't it beautiful? Just looking at that code is kind of nice: all these things flow, and I gave really nice, verbose names to my validations, just like you would for a test case. If any of them fails at any point, the whole thing kicks out the error of the one that failed and stops processing. That's monadic bind. It's really nice.
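Here is a simplified sketch of that chaining with plain Result and Result.bind; Grace's real validations use its own parse-result and error types, so everything here is illustrative:

```fsharp
type GraceError = { Message: string }

// Each validation takes the parameters and, on success, returns the same
// parameters, so the next validation can run on them.
let ownerIdIsValid (args: Map<string, string>) =
    if args.ContainsKey "ownerId" then Ok args
    else Error { Message = "Owner ID is required." }

let branchNameIsValid (args: Map<string, string>) =
    if args.ContainsKey "branchName" then Ok args
    else Error { Message = "Branch name is required." }

/// Chain the validations: the first Error short-circuits everything after it.
let validate args =
    Ok args
    |> Result.bind ownerIdIsValid
    |> Result.bind branchNameIsValid
```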
ownername you might pass in I want tovalidate the org ID the repository IDthe branch ID these are actuallyparameters that get passed in by thecommand line in the background that youdidn't see but like it does it for you Iwant to check uh or you've given mevalidines or does the organization andowner and repository in the branch dothey even existI want to check all these things so whatdo I have here I have a list ofvalidations and if you actually look atall of my endpoint these are webendpointsthis is like the equivalent ofcontrollers and actions this is a webendpointand I just have this pattern over andover again where I have thesevalidations and what do thesevalidations do they take inumwhat is validation validation is thething that takes in parameters andreturns a list of results really a listof task of results but and because ofthat I can start doing so so like Icouldn't do the inputs the same as theoutput here for reasons but the outputsthe same on every single one of thesevalidations which then lets me whichgets me back into functional land whereI can apply functional code to say justspin through these and if there's aproblem stopso like I just like I I've really I'mreally proud of this one I really likethis codeumanywayum there's so much more I could show youum I want to like here's here's somecool stuff with types like I know Csharp is finally getting this but thisidea of domain nouns or I can renameit's not just a good or a string it's abranch ID and a branch name and acontainer name and a reference ID andall that kind of stuff so like when I'mlooking at the code it's very clear tome what's going onumthere's so much more I'd love to showyou so but time to wrap upas I said number one tip for c-sharpprogrammers static classes and staticmethods like just do it go home pick anypick any uh a class from any code baseyou want just for funand try to rewrite it where all themethods that mod that modify State allthe methods that mutate State pull themout write them as data classes withstatic methods and you'll love what youseesothis is my like I believe Elegancematters like correctness concisenessmaintainability and code matters morethan getting everything else that kindof performanceit really does and I get most of themillsockets don't get I don't leave manyon the floor butum anyway thank you so much again thankyou so much for coming thank you so muchfor watching this videoum I'd love your feedback I want thestars star the repo if you have notstarted the repo I want to do itno excusesum I'm in this to win this is real thisis not an experiment like I'mgit will be replaced like I'm in thisa hundred percent like I want to winthisum and you know programming is art andcraft don't ever forget that it's allabout art and craft and thank you thankyou thank you[Applause][Music]
You're listening to Software Unscripted; I'm your host, Richard Feldman. On today's episode, I'm talking with Ryan Haskell-Glatz, one of my co-workers at Vendr and author of the open source Elm projects elm-spa and Elm Land. We get into things like new-user onboarding experiences, framework churn, and dynamics between authors and users in open source communities. And now: growing programming communities.

Ryan, thanks so much for joining me.

Thanks for having me.

So, a question I've always been interested to ask people who get involved in community building, and you and I have both been involved in different eras of Elm's community building, me kind of at the very beginning and you currently leading the charge: one of the tools you've built that I'd say is pretty important for community building is this Elm Land tool. I'm curious, how did you arrive at "this is the thing I want to build to solve a problem I see in the world"?

Totally, yeah. It's exciting that it's like, oh yeah, this is a cool, essential tool. So, I don't know, this might be a crazy tangent; we'll see how it goes. When I first heard of Elm, it was through "Let's Be Mainstream," Evan's talk at Curry On Prague, I want to say in 2015. I saw it around September of 2016. I was a web developer doing a lot of Vue.js; I had just sold my company on, hey, we've got to use Vue.js, it's this great thing, everyone should use it. I used to check out Medium. I lived in the suburbs and I'd commute, so I'd pull out the Medium app to learn new things or see what was going on, and I saw this article that said, "So you want to be a functional programmer?" And I was like, I don't know... do I? Maybe?

So I read it, and it was interesting, and at the bottom it said, oh, if you like this, check out Elm. And I'm like, yeah, yeah, whatever. I went on to part
two (there were four parts to it), and I kept saying, no, no, I don't care about Elm, just give me the content. And then I got to part four, and that was it. I was like, I need more. So I finally followed the Elm link. I went to, I think it was a Facebook page, some Elm Facebook group, which shows how old it was.

Yeah, I didn't even know there was one of those.

And there was a link to "Let's Be Mainstream." That talk really connected with me, because I had never done functional programming before; I just knew the pain points of getting it wrong in JavaScript every now and then, butting my head up against stuff at work. And Evan had that rainbow logo; the Elm Land rainbow was taken from that talk. It was the happy place.

Oh yeah, I remember that. On the graph.

Exactly. He had a whole setup where he's like: hey, how do we get Haskell people to a place where it's easier to do functional programming? That's not what I'm looking for. I'm looking for how we make things more reliable for JavaScript developers. And I'm like, hey, that's me. I think that really aligned me with the vision.

So Elm Land, I guess this was back last year: I was like, I really want to return to that feeling of, this is really designed to be easy and approachable for people doing front-end development. And how do we get there? When that project started, it was: okay, I need to take a good look at how React does stuff, how Vue does stuff, how Svelte does stuff, and can I make that learning curve a little more familiar for people? That's how Elm Land started; that was the inspiration for it.

So basically, wanting to take that feeling you had back in 2015 watching Evan's talk and turn it into a tool, something that not only inspires people to try Elm but helps them actually achieve that goal: you can get to this wonderful land of reliability and such, but get up and running faster and easier.

Exactly. Yeah, that's exactly it. Because I
feel like in that space there's a lot of, I don't know if "dependency" is too strong a word, but if you're doing front-end development, there are a lot of tools where you type a few letters in your terminal and, oops, I have an application. So before I did Elm, when I was using Vue, there was Nuxt.js, and I was like, oh, I can just build an app and it'll tell me how to organize things; it seemed to streamline the process of trying something out and getting up and running.

I think this is a really important point, because I've talked to Evan about this in the past, and one of the things he really likes from a teaching perspective is to reassure people that you only need to download the Elm executable, and then you can start with a Main.elm and just go from there; that's all you need to get up and running. However, there are a lot of people for whom that's just not what they're used to. That's not the workflow that's normal to them. They're used to: no, I want a CLI that's going to generate a bunch of stuff for me. I don't want to have to start from hello world in plain text, no style sheets, building everything up from scratch. Some people like that, but there's always been a bit of a mismatch between the only early-Elm experience that was available and what a lot of developers, especially JavaScript developers, are used to and looking for as an on-ramp. And what I love about your approach is that you still have the same destination: you still end up getting the Elm experience. It's not like you took Elm and made it into something else. It's more like: if this is the onboarding experience you're used to and looking for, here it is. It's the same experience, but you're going to get to a much nicer destination once you go through it.

Yeah, it's funny. It reminds me of the talk Evan gave, I think on storytelling, where he compared radio and television. At one point he was talking about how television takes away the work of having to visualize what you're hearing. And I feel like Elm Land just gives you an app and takes away the work of, oh, I could do this, that means I could build GitHub or something. It's like: no, just show them GitHub.
Now you don't have to do that extra step. But yeah, totally; I think that's a great summary of it: making it a familiar experience, what they might be used to.

What was the point at which you were like, "this is the missing piece"? Often I've found that when I decide to invest as much time into a tool as I'm sure you have into Elm Land, there's a moment, kind of a trigger, where I think: okay, I want this to exist badly enough that I'm going to put a bunch of hours into it. Did you have a moment like that? Or was it more an accumulation of hearing stories from people, like, where's X-but-for-Elm, where's Nuxt for Elm, something like that?

Yeah, I think this goes back to before Elm Land, when I was working on elm-spa, which was more focused on the routing and scaffolding side of things. And then I, maybe "shelled out" isn't the right word, but I kind of shelled out the rest of the app-building experience: go look up how to use elm-ui, go look up how to do that. It was out of scope.

Yeah, not my wheelhouse, man.

But I feel like, for a long time, and I guess to this day, the Elm community is a bunch of independent contributors who build really focused libraries, designed well enough that they can plug in together. But the thing I would see a lot in the Elm Slack, people asking, is: how do I build an app? And that means: how do I glue all this stuff together? Even in the elm-spa users channel in the Slack, they'd be like, hey, can I use elm-ui with this? The question was "can I use it," not even "how"; is this even viable at a base level? Does it work?

Yeah.

I feel like there was a moment where I thought: I just need to answer the high-level question of how do I do it. How do I make something real, how do I build something? Because there are not a lot of tools, at least not a lot in comparison to npm, but there are a lot of separate projects, and it's not clear to a newcomer how they're all supposed to work together.
Right. But it wasn't just an answer in the sense of an FAQ entry, where it's like, oh, the answer is yes, you could use elm-ui with that, go forth and do it. It's more like: here's an answer to the whole set of questions, and here's a starting point. It reminds me a little of, and I'm sure this was on your radar, Create React App. There seemed to be a similar motivation story there: a bunch of people saying, well, React is only designed to be responsible for rendering, but there's all this other stuff that goes into making an app; how do I glue all these pieces together? And Create React App was an answer to that. And then you could eject, and whatever else. I haven't really kept up with the React world; I don't know if that's still a thing people do, but I know that was the original motivation, or at least that's what I heard.

Totally, yeah. I saw a tweet recently, someone saying, like, junior developers asking "what's a CRA?" People these days don't know what it is anymore.

So has it fallen out of favor? I don't even know.

I don't know. I haven't really been in the React ecosystem too much, but I think people are aware of it and still reference it as a thing. I mean, the JavaScript ecosystem has blown up. There are all these new frameworks: there's Solid, there's Astro, there's Qwik. There are all these new things that kind of look like React but work a little differently.

It's good to know that some things never change. For a minute there, it felt like there was going to be a consensus in the JavaScript ecosystem. I guess that didn't last long.

No, yeah. It's actually really cool to see what's going on now, because the frameworks are getting more nuanced. I've heard you talk about this on the podcast before: things try to convince you they're good for everything, and you kind of find out later that's not really true. If you look at the projects coming out now, I've noticed two big changes since 2015 or 2016, when I was really in the space. The framework wars seem to have died down, or at least the strategy has changed,
where everyone's polite online. Maybe they're secretly at war. But everyone seems very reasonable and honest about trade-offs. Ryan Carniato (I think that's how you pronounce his last name), the author of SolidJS, is like: it's really fast, here are the benchmarks, it's for this. And the people making Astro are like: hey, this is great for making websites. I feel like there's more nuance now, which it feels like there always should have been. For example, Elm Land: don't build a website with it that needs SSR. It doesn't do SSR, it doesn't do SEO; that's not the focus. If you're trying to build an app behind a login, if you're working at, say, Vendr (if you've ever heard of that company), your app's behind a login. You don't need server-side rendering on every page; you just want a really nice, reliable experience for the customer, and you want to be able to add features quickly. That's what Elm Land supports. It's not for everything, but the web platform is so expansive that it can be a blurry line between those things sometimes, and I feel like there's a lot more nuance these days, which is just great to see.

Yeah, that framework-wars comment takes me back. There was a conference, more than five years ago now, I think, called Framework Summit, and the theme was: let's get all the JavaScript frameworks together, give people presentations about them, and let people understand which one is for them and which one isn't. They also had this creators day, the day before the presentations: me representing Elm, Tom Dale from Ember, Andrew from React, and some people whose names I'm forgetting from Vue.js and AngularJS. We all got together and talked about stuff that affected all of us, which was a pretty interesting discussion. There was this spirit of: okay, we're not really competing, we're just hanging out and having fun; we're all doing our own thing with our own trade-offs, with some commonalities and some differences. And so the next day, the organizers had said: okay,
when everybody presents, you get, I forget how long it was, something like a 15-to-20-minute pitch for the latest version of your thing. So I was talking about the latest in Elm. And the organizers were like: please don't hate on the other frameworks; if you have to make comparisons, be respectful. Everybody pretty much took this to heart, except that at the very beginning of his talk, Tom Dale from Ember stands up and says, "All right, I'd like to welcome everybody to the Comedy Central roast of JavaScript frameworks," and then proceeds to just roast all the other frameworks.

Oh my gosh. He started it off.

Yeah. I don't think his was the first presentation, but that was how he started off his comparison of Ember to them. Now, what's funny in retrospect is the dig he had on Elm. He said: Elm is here, really glad to see Elm represented; it's nice to see Elm here, because it makes Ember look popular by comparison. Which maybe at the time was true, but I actually don't think that's true anymore. I think it's probably the other way around: I think Elm has almost certainly, at this point, eclipsed Ember in terms of present-day use.

Interesting.

Could be wrong; I have no idea. Based on things like State of JS surveys, and I don't know how indicative those are. On the one hand, maybe there are a lot of people using Ember apps who have been using them so long that they don't bother responding to State of JS, because they're not interested in the latest, most cutting-edge stuff. But then again, I also know a lot of Elm people just don't care about JS anymore; they've moved on. So who knows what percentage of Elm developers respond to State of JS. There are a lot of factors there, but it's interesting.

I'm one of the crazy ones: every State of JS, I'm like, get in there, you've got to put Elm in there. It's funny, if you look at the last State of JS, it was like most of the write-ins were Elm people just saying, please include me on these lists. But it's fun.

I'm in the same boat, in that I used to look at State of JS before I got into Elm,
14:06.000] And then like, since I got into Elm, like, yeah, I, I still, I just like, kind[14:06.000 --> 14:09.160] of always want to make sure it's like, yeah, you know, like just so you know,[14:10.080 --> 14:12.640] I'm not using JS anymore, but, but FYI Elm.[14:12.960 --> 14:13.200] Yeah.[14:13.200 --> 14:14.880] And I'm sure some number of people do that.[14:14.880 --> 14:18.280] Like I always see on Elm Slack, somebody posts a link to State of JS every year.[14:18.280 --> 14:22.000] It's like, hey, you know, don't forget, the JavaScript people don't know we exist[14:22.000 --> 14:27.680] unless we tell them what we do, but it's weird because that's, it feels to me like State[14:27.680 --> 14:31.200] of JS, for a lot of Elm programmers who are, who are like using it professionally,[14:31.440 --> 14:35.040] that's their main interaction with JavaScript anymore, or the world of JavaScript.[14:35.040 --> 14:36.160] And maybe you do some like interop.[14:36.200 --> 14:39.640] And so that's how you like interact with JavaScript code, but it's like the JS[14:39.640 --> 14:42.920] community and all the different frameworks and the SolidJS and, you know, whatever the latest thing is, it's[00:44.560 --> 00:48.240] like, I hear about those things, you know, but it's, it's almost even like,[00:48.640 --> 00:52.480] as if I'm not plugged into the front end world at all, because so much of the[00:52.480 --> 00:56.240] front end world is just like JavaScript, you know, I don't want to say drama,[00:56.240 --> 01:02.120] but like, you know, JavaScript framework churn. There's, there's always so much[01:02.120 --> 01:06.160] like new stuff that seems like it's some tweak on the last thing.[01:06.160 --> 01:08.720] Whereas in the Elm community, I don't really get that sense.[01:08.720 --> 01:12.880] It seems like it's, it's much more common that you'll have an existing[01:12.880 --> 01:14.560] thing that continues to evolve.[01:14.840 --> 01:19.200] Like for example, Elm CSS, which like I started out and worked on for many years[01:19.200 --> 01:22.840] and kind of, I've not had time anymore because all of my, well, first of all,[01:23.200 --> 01:26.800] back when I had free time, before I had a kid, all of that time was going into[01:26.800 --> 01:30.560] Roc and so I've just like, all of my non-Roc things just
kind of slid to[01:30.560 --> 01:32.920] the back burner by default, it's funny.[01:32.920 --> 01:36.240] I was, I was catching up with Evan, this was last year, two years ago, whatever.[01:36.240 --> 01:38.680] I, at some point I was in Copenhagen for a conference.[01:38.680 --> 01:41.720] So I hung out with him and Teresa and we caught up about various things.[01:41.720 --> 01:45.560] And I was commenting on how like, I don't really like have time to maintain a lot[01:45.560 --> 01:48.960] of my Elm projects anymore, because I just, every weekend, I'm like this[01:48.960 --> 01:51.160] weekend, I'm going to like go through PRs on this thing.[01:51.520 --> 01:54.200] And then by the end of the weekend, I would have done a bunch of Roc stuff.[01:54.240 --> 01:56.480] And I was like, and I still had more Roc stuff to do, but I didn't even[01:56.480 --> 01:57.600] get to any of the Elm stuff.[01:57.960 --> 01:58.200] All right.[01:58.200 --> 01:59.120] Next weekend, next weekend.[01:59.360 --> 02:00.520] And then that would just keep happening.[02:01.280 --> 02:02.520] So, I was joking to Evan.[02:02.520 --> 02:04.920] I was like, yeah, it turns out like making a programming language,[02:05.160 --> 02:06.560] it's really, really time consuming.[02:06.880 --> 02:07.640] He thought that was funny.[02:07.640 --> 02:08.200] He just laughed.[02:08.200 --> 02:13.280] It's like, yeah, it turns out, I'm sure that's a universal thing.[02:13.680 --> 02:16.680] I mean, I guess like if you're making a toy language, that's like just for you[02:16.680 --> 02:18.640] and like just a hobby thing, then that's, that's one thing.[02:18.680 --> 02:21.080] But if you're like trying to make something that other people are actually[02:21.080 --> 02:24.520] going to use like professionally, it's like kind of a, yeah, there's a lot there.[02:24.920 --> 02:28.640] But I was thinking about this in the context of this sort of like framework[02:28.640 --> 02:32.000] churn in the JavaScript ecosystem, but you'd never use the word churn to describe[02:32.000 --> 02:33.920] what happens in the Elm package ecosystem.[02:33.960 --> 02:37.720] And like in the Elm CSS case, it's like, okay, I'm not working on that actively[02:37.720 --> 02:42.000] anymore, but there's, there's a longtime contributor who had been working on[02:42.000 --> 02:46.480] this sort of like big performance oriented under the hood rewrite that I'd gotten[02:46.520 --> 02:48.400] started and never got all the way through.[02:48.560 --> 02:52.240] He just was like, hey, is it cool if I like fork this and like continue that work?[02:52.240 --> 02:53.400] And I was like, yes, please do that.[02:53.440 --> 02:56.960] That's awesome because it's not like you're redoing the whole thing.[02:56.960 --> 02:58.080] Like fingers crossed.[02:58.080 --> 03:01.920] I would love to see him finish that, and publish it, because if he can[03:01.920 --> 03:05.680] actually make it across the finish line, it should feel like using the most[03:05.680 --> 03:09.840] recent release of the Elm CSS that I built up, but it should run way faster.[03:10.200 --> 03:12.000] Which in my mind is, is just like, awesome.[03:12.000 --> 03:15.240] If you can get something where it's like, this is already the experience that[03:15.240 --> 03:17.880] people want and are happy with, but it runs way faster.[03:17.880 --> 03:20.080] That's an amazing way to like evolve something.[03:20.640 --> 03:24.520] Whereas the like, well, we redid everything from scratch, but it was, you[03:24.520 -->
03:27.000] know, you use that description of like, it's kind of like React, but with a[03:27.000 --> 03:28.520] twist or like a little bit different.[03:28.920 --> 03:32.000] I'm really glad we don't, we don't see that happening in the Elm community.[03:32.120 --> 03:32.640] Yeah.[03:32.680 --> 03:35.800] I feel like every now and then on the Elm Slack there'll be, there'll[03:35.800 --> 03:39.160] be something new, something new will come out in the React space and I'll see someone like,[03:39.160 --> 03:43.400] oh, like how, how can we do like hooks in Elm, or like, how do we, Svelte[03:43.400 --> 03:46.880] doesn't do virtual DOM, like how do we do Elm without virtual DOM?[03:46.880 --> 03:51.000] And like, I see posts like that, but yeah, I don't think they get too much traction,[03:51.000 --> 03:54.840] but I feel like it kind of, I think there's a, just a general anxiety.[03:55.320 --> 04:00.520] Just like, if we're not doing the latest thing, like is, is, are we dead or something?[04:00.520 --> 04:02.680] You know, there's like that kind of energy to it.[04:02.840 --> 04:06.520] So I'm glad you brought that up because I, the way that I've seen those discussions[04:06.520 --> 04:10.280] typically go on Elm Slack is someone will post that and then two or three people[04:10.280 --> 04:12.080] will respond like, no, everything's cool.[04:12.360 --> 04:12.680] Yeah.[04:12.760 --> 04:14.120] Do we need to create a problem here?[04:15.040 --> 04:15.760] Like we're good.[04:15.800 --> 04:18.480] Like what's, what's the actual problem we're trying to solve here?[04:18.480 --> 04:19.400] Is it just FOMO?[04:19.480 --> 04:22.280] Like what's the user experience problem that we have here?[04:22.280 --> 04:23.840] And then like, let's figure out a solution to that.[04:24.040 --> 04:25.640] Is there a user experience problem here?[04:25.640 --> 04:28.280] Or is this just like, someone else is doing X.[04:28.280 --> 04:29.360] Shouldn't we be doing X?[04:29.360 --> 04:29.880] It's like, no.[04:29.880 --> 04:33.960] And I think that's, and maybe I'm being dismissive here, but it feels like a[04:33.960 --> 04:37.960] cultural carryover from JavaScript because that's totally a cultural norm in[04:37.960 --> 04:41.800] the JavaScript community, is just like, oh man, like X came out, like,[04:42.240 --> 04:43.400] shouldn't we be doing X?[04:43.480 --> 04:46.880] And, and there's just like, kind of this, like this constant magnetic pull[04:46.880 --> 04:48.280] towards the latest shiny thing.[04:48.640 --> 04:51.160] And there's almost like a, I mean, at least among people who've been around[04:51.160 --> 04:54.280] the community long enough, like in Elm, it seems like there's a, there's a, an[04:54.280 --> 04:58.440] instinctive resistance to that, where it's like, anytime, like the XY problem[04:58.440 --> 05:02.240] is the classic example of this, where it's like, and people are always citing that[05:02.240 --> 05:05.160] and linking, I think it's like xyproblem.info or something is the[05:05.160 --> 05:05.480] link.[05:05.920 --> 05:06.760] Yeah, something like that.[05:06.800 --> 05:07.040] Yeah.[05:07.280 --> 05:08.960] That's where I learned about XY problems.[05:09.000 --> 05:11.120] I think Elm Slack educated me.[05:11.800 --> 05:12.800] Yeah, it's Elm.[05:12.800 --> 05:13.440] Yeah, me too.[05:13.840 --> 05:17.080] I, I'd never heard of it before Elm, but yeah, it's like this, for those who[05:17.080 --> 05:20.840] aren't familiar, it's, it's this idea of like, you know, you
say like hooks,[05:20.840 --> 05:21.840] let's use that as an example.[05:21.840 --> 05:24.800] You come in saying like, hey, you know, how does Elm do hooks?[05:24.880 --> 05:28.200] And you say, well, hang on, let's, let's take a step back and ask like, what's[05:28.200 --> 05:29.160] the real problem here?[05:29.160 --> 05:30.480] Like, what's the direct problem?[05:30.480 --> 05:34.520] Like we're starting to work on a solution and we have a question about the solution,[05:34.560 --> 05:38.080] but let's step, let's step all the way back and see like, what's the immediate[05:38.080 --> 05:38.440] problem?[05:38.440 --> 05:40.360] What's the pain point that we're trying to solve?[05:40.640 --> 05:44.040] And then we can talk about solutions kind of from scratch and maybe we'll end up[05:44.040 --> 05:48.240] going down the same road that this solution is like presupposing, but maybe[05:48.240 --> 05:51.480] not, maybe it'll turn out that there's actually a better category of solution[05:51.480 --> 05:51.720] here.[05:53.440 --> 05:58.040] And hooks are an interesting example because I remember when, when Hooks and[05:58.040 --> 06:02.080] Suspense were announced, which I think might've been the same talk or it might[06:02.080 --> 06:02.840] have been different talks.[06:02.880 --> 06:06.520] I don't remember, but I remember hearing about them.[06:06.520 --> 06:10.840] And I was at that point like very into Elm and really had not been[06:10.840 --> 06:14.360] spending any time with React in a while, like months or years.[06:15.240 --> 06:19.440] And I remember hearing it and I was like, I don't understand what problem this is[06:19.440 --> 06:19.840] solving if you're not like Facebook.[06:19.840 --> 06:23.680] If you're like literally Facebook and you have like[06:23.680 --> 06:27.200] a gazillion different widgets on the screen and, and, you know, they're all like[06:27.200 --> 06:28.680] customizable in different ways.[06:28.680 --> 06:32.040] And some of them need to be like real time, like chat, but then others don't[06:32.040 --> 06:33.240] need to be, like the newsfeed.[06:33.240 --> 06:37.440] And I was like, okay, if you're literally Facebook, I can see how this might be[06:37.440 --> 06:42.200] solving a practical problem, but if you're not literally Facebook, and there's[06:42.200 --> 06:47.480] like 99.9% of, you know, that's a huge underestimate, basically everyone else,[06:47.760 --> 06:51.120] like, like what, why, why are people excited about this?[06:51.160 --> 06:55.720] And it felt to me like an XY problem example where it's like, you know, yeah,[06:56.000 --> 06:58.800] you can see getting excited about it, you know, for the sake of, oh, it's a new[06:58.800 --> 07:01.360] shiny thing, at least conceptually.
(upbeat music)

- Hi there, I'm Muir Manders, a software developer on the source control team at Meta.

(upbeat music)

First, I think it's useful to start with a definition of source control. Source control is the practice of tracking changes to source code. Fundamentally, source control helps software developers read, write, and maintain source code. Another way to think about it is source control helps developers collaborate by sending and receiving changes with other developers and by tracking different branches of work. Source control also provides critical metadata to developers so that they can understand when and why source code has changed.

Now, looking at the current landscape of source control, I think it's safe to say that it's dominated by Git. Git is popular for a reason, it does a lot of things right. But when the Sapling project first started, Git didn't quite meet our scalability needs.

(upbeat music)

As I mentioned before, initially scalability was the primary focus of the Sapling project. To keep up with the pace of code growth over the years, we've redesigned many aspects of our source control system. One of the key elements to our scalability is lazy fetching. By lazy, I mean that Sapling doesn't fetch data until it's needed. For example, file history and the commit graph are both lazy, and more than just the repo data behind the scenes, your working copy is lazy as well. We use a virtualized file system to defer fetching of file contents until it's accessed. Together, this means you can clone a repo with tens of millions of files in a matter of seconds. It also means that giant repo can fit on your laptop. There is a catch with the laziness: you must be online to perform many source control operations. This trade-off is worth it for us, but it may not be worth it for smaller repos.

Beyond scalability, we've focused a lot on the user experience. We aim to hide unnecessary complexity, while providing a rich set of tools right out of the box. A good example to start with is undo. Just like in any software, when you make a mistake, or you just change your mind, you wanna undo your changes. In Sapling, undoing most operations is as easy as "sl undo". Undo demonstrates how Sapling has developed first-class integrated concepts that improve the developer experience.

In the same vein as undo, but perhaps even more core to Sapling, is the concept of stacked commits. A commit stack is a sequence of local commits, similar on the surface to a Git branch. Commit stacks differ from Git branches in two main ways. First, a Git branch is essentially a name that points to a commit. With a Sapling stack, there is no indirection, the stack is the set of commits. What does that mean? For one, you don't even have to give your stack a name if you don't want, and if you check out a commit in the middle of the stack, you're still on your stack, and you can use normal commands to amend that commit. Another difference between Sapling stacks and Git branches is that stack commits don't have to be merged all or nothing. As early commits in your stack are being code reviewed and merged, you can continue pushing more commits to that same line of work. Similarly, if you push a large stack of commits, you can incrementally merge the early commits while you continue to iterate on the later commits.

(upbeat music)

In November 2022, we released the Sapling client, which is compatible with Git repos. To try it out, go to sapling-scm.com and follow the instructions to install the Sapling client and clone an existing Git repo. There's a couple other cool things I want to mention. On the website, under the add-ons section, you can see that Sapling comes with a fully featured GUI called the Interactive Smartlog, it's really a game changer. It's a high level UI that hides unnecessary details while still giving the user powerful tools like drag and drop rebase. Also, we've released a proof of concept code review website called ReviewStack that's designed for the stacked commit workflow.

Finally, I'd like to note that the Sapling client is just one piece of Meta's source control system. In the future, we hope to release our virtualized file system and our server implementation. Together, these three integrated components really take source control scalability and developer experience to the next level. If you wanna learn more about Sapling, please visit sapling-scm.com. If you're interested in getting involved directly, please check out our GitHub project page, we welcome contributions.

(upbeat music)
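As a minimal sketch of the undo and stack workflows described in this talk (the repository URL is just an example, the commit hash is a placeholder, and exact flags may vary between Sapling releases):

$ sl clone https://github.com/facebook/sapling   # the Sapling client can clone existing Git repos
$ cd sapling
$ sl                    # smartlog: your local commits at a glance
$ sl goto abc123def     # check out a commit in the middle of your stack (hypothetical hash)
$ sl amend              # amend it; the commits above it are restacked
$ sl undo               # changed your mind? restore the previous state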
# Beyond Git: The other version control systems developers use

Our developer survey found 93% of developers use Git. But what are the other 7% using?

At my first job out of college (pre-Y2K), I got my first taste of version control systems. We used Microsoft's Visual SourceSafe (VSS), which had a repository of all the files needed for a release, which was then burned onto a disk and sent to people through the mail. If you wanted to work on one of those files, you had to check it out from the repo—literally, like a library book. That file would be locked until you checked it back in; no one else could edit it. In essence, VSS was a shield on top of a shared file folder.

Microsoft discontinued VSS in 2005, coincidentally the same year as the first release of Git. While technology has shifted and improved quite a bit since then, Git has come out as the dominant choice for version control systems. This year, we asked what version control systems people used, and Git came out as the clear overall winner.

But it's not quite a blowout; there are two other systems on the list: SVN (Apache Subversion) and Mercurial. There was a time when both of these were prominent in the market, but not everyone remembers those days. Stack Overflow engineering has used both of these in the past, though we now use Git like almost everybody else.

This article will look at what those version control systems are and why they still have a hold on some engineering teams.

Apache Subversion

Subversion (SVN) is an open-source version control system that maintains source code in a central server; anyone looking to change code accesses these files from clients. This client-server model is an older style, compared to the distributed model Git uses, where changes can be stored locally and then distributed to the central history (and other branches) when pushed to an upstream repository. In fact, SVN builds on historical version control—it was initially intended to be a mostly compatible successor to CVS (Concurrent Versions System), which is itself a front end and expansion to Revision Control System (RCS), initially released way back in 1982.

This earlier generation of version control worked great for the way software was built ten to fifteen-plus years ago. A piece of software would be built as a central repository, with any and all feature additions merged into a trunk. Branches were rare and eventually absorbed into the mainline. Important files, particularly large binaries, could be "locked" to prevent other developers from changing them while you worked on them. And everything existed as directories—files, branches, tags, etc. This model worked great for a centrally located team that eventually shipped a release, whether as a disc or a download.

SVN is a free, open-source version of this model. One of the paid client-server version control systems, Perforce (more on this below), had some traction at enterprise-scale companies, notably Google, but for those unwilling to pay the price for it, SVN was a good option. Plenty of smaller companies (including us at the beginning) used centralized version control to manage their code, and I'm sure plenty of folks still do, whether out of habit or preference.

But the way that engineering organizations work has changed pretty drastically in the last dozen years. There is no longer a central dev team working on a single codebase; you have multiple independent teams each responsible for one or more services.
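To make the centralized model concrete, here is a minimal sketch of the classic SVN loop (the server URL and file paths are hypothetical):

$ svn checkout https://svn.example.com/repos/project/trunk project   # get the latest code from the central server
$ cd project
$ svn lock art/title-screen.png -m "editing the title screen"        # lock a binary so no one else can change it
$ svn commit -m "Update title screen"                                # goes straight to the central server; requires being online
$ svn update                                                         # pull down everyone else's changes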
Stack Overflow user VonC has made himself a bit of a version control expert and has guided plenty of companies away from SVN. He sees it as a technology built for a less agile way of working. "It does get in the way, in terms of management, repository creation, registration, and the general development workflow. As opposed to a distributed model, which is much more agile in those aspects. I suspect the recent developments with remote working will not help those closed environment systems."

The other reason that SVN grew less used was that Git showed how things could be better. Quentin Headen, Senior Software Engineer here at Stack Overflow, used SVN early in his career. "In my opinion, the two biggest drawbacks of SVN are that first, it is centralized, which requires the SVN server to be up for you to commit changes. If your internet is down, you can't commit at all. Second, the branching is very heavy. Once a branch is created, you can't delete it (if I remember correctly). I think there is a command to remove it, but it stays in history regardless. Git branches are cheap and can be deleted easily if need be."

Clearly, SVN lost prominence when the new generation of version control arrived. But Git wasn't the only member of that generation.

Mercurial

Mercurial first arrived the same year as Git—2005—and the two became the primary players of the distributed version control generation. Early on, many people wondered what differences, if any, the two systems had. When Stack Overflow moved away from SVN, Mercurial won out mostly because we had easy access to hosting through Fog Creek Software (now Glitch), another of our co-founder Joel Spolsky's companies. Eventually, we too gave in to Git.

Initially, Mercurial seemed to be the natural fit for developers coming from earlier VC systems. VonC notes, "It's the story of VHS versus Betamax."

I reached out to Raphaël Gomès and Pierre-Yves David, both Mercurial core developers, about where Mercurial fits into the VC landscape. They said that plenty of large companies still use Mercurial in one form or another, including Mozilla, Facebook (though they may have moved to a Mercurial fork ported to Rust called Eden), Google (though as part of a custom VC codebase called Piper), Nokia, and Jane Street. "One of the main advantages of Mercurial these days is its ability to scale on a very large project (millions of commits, millions of files). Over the years, companies have contributed performance improvements and dedicated features that make Mercurial a viable option for extreme scale monorepos."

Ry4an Brase, who works at Google and uses their VC, expanded on why: "git is wed to the file system. Even GitHub accesses repositories as files on disk. The concurrency requirements of very large user bases on a single repo scale past filesystem access, and both Google and Facebook found Mercurial could be adapted to a database-like datastore and git could not." However, with the recent release of Git v2.38 and Scalar, that advantage may be lessened.

But another reason that Mercurial may stay at these companies with massive monorepos is that it's portable and extensible. It's written in Python, which means it doesn't need to be compiled to native code, and therefore it can be a viable VC option on any OS with a Python interpreter. It also has a robust extension system.
“The extension system allows modifying any and all aspects of Mercurial and is usually greatly appreciated in corporate contexts to customize behavior or to connect to existing systems,” said Gomès and David.

Mercurial still has some big fans. Personally, I had never heard of it until some very enthusiastic Mercurialists commented on an article of ours, A look under the hood: how branches work in Git.

babaloomer: Branches in mercurial are so simple and efficient! You never struggle to find the origin of a branch. Each commit has the name of its branch embedded in it, you can't get lost! I don't know how many times I had to drill down git history just to find the origin of a branch.

Scott: Mercurial did this much more intuitively than Git. You can tell the system is flawed when the standard practice in many workflows is to use "push -f" to force things. As with any tool, if you have to force it something is wrong.

Of course, different developers have different takes on this. Brase doesn't think that Mercurial's branching is necessarily better. "Mercurial has four ways to do branches," he said, "and the one that was exactly like git's was called 'bookmarks', which the core developers were slow to support. What Mercurial called branches have no equivalent in git (every commit is on one and only one branch and it's part of the commit info and revision hash), but no one wanted that kind." Well, maybe not no one.

Mercurial is still an active project, as Gomès and David attest. They contribute to the code, manage the release cycles, and hold yearly conferences. While not the leading tool, it still has a place.

Other version control systems

In talking to people about version control, I found a few other interesting use cases, primarily around paid version control products.

Remember when I said I'd have more on Perforce? It turns out that several people mentioned it even though it didn't even register on our survey. Perforce has a strong presence in the video game industry—some even consider it the standard there. Rob Oates, an industry veteran who is currently the senior director of technology and partnerships at Exploding Kittens, said, "Perforce still sees use in the game industry because video game projects (by variety, count, and total volume of assets) are almost entirely not code."

He gave four requirements that any version control system would need to fulfill in order to work for video game development:

- Must be useable by laypersons - Artists and designers will be working in this system day-to-day.
- Must lock certain files/types on checkout - Many of our files cannot be conceptually or technically merged.
- Must be architected to handle many large files as the primary use case - Many of our files will be dozens or hundreds of megabytes.
- Must avoid degenerate case with delta compression schemes - Many of our large files change entirely between revisions.

Perforce, because of its centralized server and file locking mechanism, fits perfectly. So why not separate the presentation layer from the simulation logic and store the big binary assets in one place and the code in a distributed system that excels at merging changes? The code in video games often depends on the assets. "For example, it would not be unusual for a game's combat system to depend on the driving code, the animations, the models, and the tuning data," said Oates. "Or a pathfinding system may depend on a navigation mesh generated from the level art.
Keeping these concerns in one repo is faster and less confusing when a team of programmers, artists, and designers are working to rapidly iterate on the 'feel' of the combat system."

The engineers at these companies often prefer Git. When they have projects that don't have artists and designers, they can git what they want. "Game engines and middleware have an easier time living on distributed version control as their contributors are mostly, if not entirely, engineers," said Oates. Unfortunately for the devs on video games, most projects have a lot of people creating non-code assets.

Another one mentioned was Team Foundation Version Control (TFVC). This was a Microsoft product originally included in Team Foundation Server and still supported in Azure DevOps. It's considered the spiritual successor to VSS and is another central-server-style VC system. Art Gola, a solutions architect with Federated Hermes, told me about it. "It was great for its time. It had an API, was supported on Linux (Team Explorer Everywhere), and tons of people using it that no one ever heard from since they were enterprise devs."

But Gola's team is actively trying to move their code out of the TFVC systems they have, and he suspects that a lot of other enterprise shops are too. Compared to the agility Git provides, TFVC felt clunky. "It requires you to have a connection to the central server. Later versions allow you to work offline, but you only had the latest version of the code, unlike git. There is no built-in pull request type of process. Branching was a pain."

One could assume that now that the age of centralized version control is waning and distributed version control is ascendant, there is no innovation in the VC space. But you'd be mistaken. "There are a lot of cool experiments in the VCS space," said Patrick Thomson, a GitHub engineer who compared Git and Mercurial in 2008. "Pijul and the theory of patch algebra, especially—but Git, being the most performant DVCS, is the only one I use in industry. I work on very large codebases."

Why did Git win?

After seeing what the version control landscape looks like in 2022, it may be obvious why distributed version control won out as the VC of choice for software developers. But it may not be immediately obvious why Git has such a commanding share of the market over Mercurial. Both of them first came out around the same time and have similar features, though certainly not one to one. Certainly, many people prefer Mercurial. "For personal projects, I pick Mercurial. If I was starting another company, I'd use Git to avoid having to retrain and argue with new hires," said Brase.

In fact, Mercurial should have had an advantage because it was familiar to SVN users and the centralized way of doing things. "Mercurial was certainly the most easy to use and more familiar to use because it was a bit like using Subversion, but in a distributed fashion," said VonC. But that fealty to the old ways may have hurt it as well. "That is also one aspect which was ultimately against Mercurial, because just having the vision of using an old tool in a distributed fashion was not necessarily the best fit to develop in a decentralized way."

The short answer why it won comes down to a strong platform and built-in user base. "Mercurial lost the popularity battle in the early 2010s to Git.
It's something we attribute in large part to the soaring rise of GitHub at that time, and to the natural endorsement of Git by the Linux community," said Gomès and David.

Mercurial may have started out in a better position, but it may have lost ground over time. "Mercurial's original fit was a curated, coherent user experience with a built-in web UI," said Brase. "GitHub gave git the good web UI, and 'coherent' couldn't beat the feature avalanche from Git contributors and the star power of its founder."

That feature avalanche and focus on user needs may have been a hidden factor in pushing adoption. Thomson, in his comparison nearly fifteen years ago, likened Git to MacGyver and Mercurial to James Bond. Git let you scrape together a bespoke solution to nearly every problem if you were a command-line wizard, while Mercurial—if given the right job—could be fast and efficient. So where does Thomson stand now? "My main objection to Git—the UI—has improved over the years (I now use an Emacs-based Git frontend, which is terrific), whereas Mercurial's primary drawback, its slow speed on large repositories, is still, as far as I can tell, an extant problem."

Like MacGyver, Git has been improvising and adapting to fit whatever challenges come its way. Like James Bond, Mercurial has its way of doing things. It works great for some situations, but it has a distinct point of view. "My favorite example of a difference in how git and Mercurial approach new features is the `config` command," said Brase. "Both `git config` and `hg config` are commands to edit settings such as the user's email address. The `git config` command modifies `~/.gitconfig` for you and usually gets it right. The Mercurial author refused all contributions that edited a config file for you. Instead `hg config` launched your text editor on `~/.hgrc`, saying 'What is it with coders who are intimidated by text-based config files? Like doctors that can't stand blood.'"

Regardless, it seems that while Git feels like the only version control game in town, it isn't. Options for how to solve your problems are always a plus, so if you've been frustrated with the way it seems that everyone does things, know that there are other ways of working, and commit to learning more.
# Where are my Git UI features from the future?

Steno & PL, Jan 5, 2023

Git sucks

The Git version control system has been causing us misery for 15+ years. Since its inception, a thousand people have tried to make new clients for Git to improve usability.

But practically everyone has focused on providing a pretty facade to do more or less the same operations as Git on the command-line — as if Git's command-line interface were already the pinnacle of usability.

@SeanScherer — One thing I didn't get about Pijul (which seems to be a pretty promising approach to VCS - in case you've not stumbled over it before) - is that they seem to be aiming to more or less emulate the Git workflow (when the main developer himself argues that the backend offers far smoother ones... :/ ).

No one bothers to consider: what are the workflows that people actually want to do? What are the features that would make those workflows easier? So instead we get clients which treat git rebase -i as the best possible way to reword a commit message, or edit an old commit, or split a commit, or even as something worth exposing in the UI.

Rubric

I thought about some of the workflows I carry out frequently, and examined several Git clients (some of which are GUIs and some of which are TUIs) to see how well they supported these workflows.

Many of my readers won't care for these workflows, but it's not just about the workflows themselves; it's about the resolve to improve workflows by not using the faulty set of primitives offered by Git. I do not care to argue about which workflows are best or should be supported.

Workflows:

reword: It should be possible to update the commit message of a commit which isn't currently checked out.
- Rewording a commit is guaranteed to not cause a merge conflict, so requiring that the commit be checked out is unnecessary.
- It should also be possible to reword a commit which is the ancestor of multiple branches without abandoning some of those branches, but let's not get our hopes up…

@Azeirah — I found that Sublime Merge does support renaming commits. Just right click any commit, select "edit" and choose the "edit commit message" option

sync: It should be possible to sync all of my branches (or some subset) via merge or rebase, in a single operation.
- I do this all the time! Practically the first thing every morning when coming into work.

@yasyf — I've found Graphite (graphite.dev) makes this pretty easy!
If your whole team is using it, that is.

split: There should be a specific command to split a commit into two or more commits, including commits which aren't currently checked out.
- Splitting a commit is guaranteed to not cause a merge conflict, so requiring that the commit be checked out is unnecessary.
- Not accepting git rebase -i solutions, as it's very confusing to examine the state of the repository during a rebase.

preview: Before carrying out a merge or rebase, it should be possible to preview the result, including any conflicts that would arise.
- That way, I don't have to start the merge/rebase operation in order to see if it will succeed or whether it will be hard to resolve conflicts.
- Merge conflicts are perhaps the worst part about using Git, so it should be much easier to work with them (and avoid dealing with them!).
- The only people who seem to want this feature are people who come from other version control systems.

undo: I should be able to undo arbitrary operations, ideally including tracked but uncommitted changes.
- This is not the same as reverting a commit. Reverting a commit creates an altogether new commit with the inverse changes, whereas undoing an operation should restore the repository to the state it was in before the operation was carried out, so there would be no original commit to revert.

large-load: The UI should load large repositories quickly.
- The UI shouldn't hang at any point, and should show useful information as soon as it's loaded. You shouldn't have to wait for the entire repository to load before you can examine commits or branches.
- The program is allowed to be slow on the first invocation to build any necessary caches, but must be responsive on subsequent invocations.

large-ops: The UI should be responsive when carrying out various operations, such as examining commits and branches, or merging or rebasing.

Extra points:
- I will award honorary negative points for any client which dares to treat git rebase -i as if it were a fundamental primitive.
- I will award honorary bonus points for any client which seems to respect the empirical usability research for Git (or other VCSes). Examples: Gitless (https://gitless.com/) and IASGE (https://investigating-archiving-git.gitlab.io/).

Since I didn't actually note down any of this, these criteria are just so that any vendors of these clients can know whether I am impressed or disappointed by them.

Clients

I picked some clients arbitrarily from this list of clients. I am surely wrong about some of these points (or they've changed since I last looked), so leave a comment.

Update 2023-01-09: Added IntelliJ.
Update 2023-01-10: Added Tower.
Update 2023-05-28: Upgraded Magit's reword rating.

I included my own project git-branchless, so it doesn't really count as an example of innovation in the industry. I'm including it to demonstrate that many of these workflows are very much possible.

| Workflow | Git CLI | GitKraken | Fork | Sourcetree | Sublime Merge | SmartGit | Tower | GitUp | IntelliJ | Magit | Lazygit | Gitui | git-branchless | Jujutsu |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| reword | ❌ 1 | ❌ | ❌ | ❌ | ⚠️ 2 | ❌ | ⚠️ 2 | ✅ | ⚠️ 2 | ⚠️ 2 | ❌ | ❌ | ✅ | ✅ |
| sync | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ |
| split | ❌ 1 | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ |
| preview | ❌ | ❌ | ⚠️ 3 | ❌ | ⚠️ 3 | ❌ | ⚠️ 3 | ❌ | ❌ | ✅ 4 | ❌ | ❌ | ⚠️ 5 | ✅ 6 |
| undo | ❌ | ✅ | ❓ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ⚠️ 7 | ❌ | ✅ | ✅ |
| large-load | ✅ 8 | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ 9 | ✅ | ✅ | ✅ | ❌ |
| large-ops | ✅ 8 | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ 9 | ✅ | ✅ | ✅ | ❌ |

Notes:
1 It can be done via git rebase -i or equivalent, but it's not ergonomic, and it only works for commits reachable from HEAD instead of from other branches.
2 Rewording can be done without checking out the commit, but only for commits reachable from HEAD.
There may be additional limitations.
3 Partial support; it can show whether the merge is fast-forward or not, but no additional details.
4 Can be done via magit-merge-preview.
5 Partial support; if an operation would cause a merge conflict and --merge wasn't passed, then it instead aborts and shows the number of files that would conflict.
6 Jujutsu doesn't let you preview merge conflicts per se, but merges and rebases always succeed and the conflicts are stored in the commit, and then you can undo the operation if you don't want to deal with the merge conflicts. You can even restore the old version of the commit well after you carried out the merge/rebase, if desired. This avoids interrupting your workflow, which is the ultimate goal of this feature, so I'm scoring it as a pass for this category.
7 Undo support is experimental and based on the reflog, which can't undo all types of operations.
8 Git struggles with some operations on large repositories and can be improved upon, but we'll consider this to be the baseline performance for large repositories.
9 Presumably Magit has the same performance as Git, but I didn't check because I don't use Emacs.

Awards

Commendations:
- GitUp: the most innovative Git GUI of the above.
- GitKraken: innovating in some spaces, such as improved support for centralized workflows by warning about concurrently-edited files. These areas aren't reflected above; I just noticed them on other occasions.
- Sublime Merge: incredibly responsive, as to be expected from the folks responsible for Sublime Text.
- Tower: for having a pleasing undo implementation.

Demerits:
- Fork: for making it really hard to search for documentation ("git fork undo" mostly produces results for undoing forking in general, not for the Fork client).
- SmartGit: for being deficient in every category tested.

Related posts

The following are hand-curated posts which you might find interesting.

- 19 Jun 2021: git undo: We can do better
- 12 Oct 2021: Lightning-fast rebases with git-move
- 19 Oct 2022: Build-aware sparse checkouts
- 16 Nov 2022: Bringing revsets to Git
- 05 Jan 2023: (this post) Where are my Git UI features from the future?
- 11 Jan 2024: Patch terminology

Comments
- Discussion on Hacker News
- Discussion on Lobsters
# Bringing revsets to Git

Nov 16, 2022

Intended audience:
- Intermediate to advanced Git users.
- Developers of version control systems.

Origin:
- Experience with Mercurial at present-day Meta.
- My work on git-branchless.

Revsets are a declarative language from the Mercurial version control system. Most commands in Mercurial that accept a commit can instead accept a revset expression to specify one or more commits meeting certain criteria. The git-branchless suite of tools introduces its own revset language which can be used with Git.

Try it out

To try out revsets, install git-branchless, or see Prior work for alternatives.

Existing Git syntax

Git already supports its own revision specification language (see gitrevisions(7)). You may have already written e.g. HEAD~ to mean the immediate parent of HEAD.

However, Git's revision specification language doesn't integrate well with the rest of Git. You can write git log foo..bar to list the commits between foo and bar, but you can't write git rebase foo..bar to rebase that same range of commits.

It can also be difficult to express certain sets of commits:
- You can only express contiguous ranges of the commits, not arbitrary sets.
- You can't directly query for the children of a given commit.

git-branchless introduces a revset language which can be used directly via its git query or with its other commands, such as git smartlog, git move, and git test.

The rest of this article shows a few things you can do with revsets. You can also read the Revset recipes thread on the git-branchless discussion board.

Better scripting

Revsets can compose to form complex queries in ways that Git can't express natively.

In git log, you could write this to filter commits by a certain author:

$ git log --author="Foo"

But negating this pattern is quite difficult; see the Stack Overflow question "equivalence of: git log --exclude-author?".

With revsets, the same search can be straightforwardly negated with not:

$ git query 'not(author.name(Foo))'

It's easy to add more filters to refine your query. To additionally limit to files which match a certain pattern and commit messages which contain a certain string, you could write this:

$ git query 'not(author.name(Foo)) & paths.changed(path/to/file) & message(Ticket-123)'

You can express complicated ad-hoc queries in this way without having to write a custom script.

Better graph view

Git has a graph view available with git log --graph, which is a useful way to orient yourself in the commit graph. However, it's somewhat limited in what it can render. There's no way to filter commits to only those matching a certain condition.

git-branchless offers a "smartlog" command which attempts to show you only relevant commits. By default, it includes all of your local work up until the main branch, but not other people's commits. Mine looks like this right now:

[Image: the smartlog view with a few draft commits and branches.]

But you can also filter commits using revsets.
To show only my draft work which touches the git-branchless-lib/src/git directory, I can issue this command:

[Image: the smartlog view as before, but with only two draft commits visible (excluding those on the main branch).]

Another common use-case might be to render the relative topology of branches in just this stack:

[Image: a different smartlog view, showing branch-1 and branch-3 with an omitted commit between them.]

You can also render commits which have already been checked into the main branch, if so desired.

Better rebasing

Not only can you render the commit graph with revsets, but you can also modify it. Revsets are quite useful when used with "patch-stack" workflows, such as those used for the Git and Linux projects, or at certain tech companies practicing trunk-based development.

For example, suppose you have some refactoring changes to the file foo on your current branch, and you want to separate them into a new branch for review:

[Image: a feature branch with four commits. Each commit shows two touched files underneath it.]

You can use revsets to select just the commits touching foo in the current branch:

[Image: the same feature branch as before, but with the first and third commits outlined in red and the touched file 'foo' in red.]

Then use git move to pull them out:

$ git move --exact 'stack() & paths.changed(foo)' --dest 'main'

[Image: the same feature branch as before, but the first and third commits have been moved to a new feature branch, still preserving their relative order; dotted outlines indicate their former positions.]

If you want to reorder the commits so that they're at the base of the current branch, you can just add --insert:

$ git move --exact 'stack() & paths.changed(foo)' --dest 'main' --insert

[Image: the same feature branch as before, but the first and third commits have been moved to the beginning of that same feature branch, before the second and fourth commits, with dotted outlines indicating their former positions; both groups preserve their relative order.]

Of course, you can use a number of different predicates to specify the commits to move.
See the full revset reference.

Better testing

You can use revsets with git-branchless's git test command to help you run (or re-run) tests on various commits. For example, to run pytest on all of your branches in parallel and cache the results, you can run:

$ git test run --exec 'pytest' --jobs 4 'branches()'

You can also use revsets to aid the investigation of a bug with git test. If you know that a bug was introduced between commits A and B, and has to be in a commit touching file foo, then you can use git test like this to find the first commit which introduced the bug:

$ git test run --exec 'cargo test' 'A:B & paths.changed(foo)'

This can be an easy way to skip commits which you know aren't relevant to the change.

Prior work

This isn't the first introduction of revsets to version control. Prior work:

- Of course, Mercurial itself introduced revsets. See the documentation here: https://www.mercurial-scm.org/repo/hg/help/revsets
- https://github.com/quark-zju/gitrevset: the immediate predecessor of this work. git-branchless uses the same back-end "segmented changelog" library (from Sapling SCM, then called Eden SCM) to manage the commit graph. The advantage of using revsets with git-branchless is that it integrates with several other commands in the git-branchless suite of tools.
- https://sapling-scm.com/: also an immediate predecessor of this work, as it originally published the segmented changelog library which gitrevset and git-branchless use. git-branchless was inspired by Sapling's design, and has similar but non-overlapping functionality. See https://github.com/arxanas/git-branchless/discussions/654 for more details.
- https://github.com/martinvonz/jj: Jujutsu is a Git-compatible VCS which also offers revsets. git-branchless and jj have similar but non-overlapping functionality. It's worth checking out if you want to use a more principled version control system but still seamlessly interoperate with Git repositories. I expect git-branchless's unique features to make their way into Jujutsu over time.
# Sapling: Source control that's user-friendly and scalable

By Durham Goode, Michael Bolin

- Sapling is a new Git-compatible source control client.
- Sapling emphasizes usability while also scaling to the largest repositories in the world.
- ReviewStack is a demonstration code review UI for GitHub pull requests that integrates with Sapling to make reviewing stacks of commits easy.
- You can get started using Sapling today.

Source control is one of the most important tools for modern developers, and through tools such as Git and GitHub, it has become a foundation for the entire software industry. At Meta, source control is responsible for storing developers' in-progress code, storing the history of all code, and serving code to developer services such as build and test infrastructure. It is a critical part of our developer experience and our ability to move fast, and we've invested heavily to build a world-class source control experience.

We've spent the past 10 years building Sapling, a scalable, user-friendly source control system, and today we're open-sourcing the Sapling client. You can now try its various features using Sapling's built-in Git support to clone any of your existing repositories. This is the first step in a longer process of making the entire Sapling system available to the world.

What is Sapling?

Sapling is a source control system used at Meta that emphasizes usability and scalability. Git and Mercurial users will find that many of the basic concepts are familiar — and that workflows like understanding your repository, working with stacks of commits, and recovering from mistakes are substantially easier.

When used with our Sapling-compatible server and virtual file system (we hope to open-source these in the future), Sapling can serve Meta's internal repository with tens of millions of files, tens of millions of commits, and tens of millions of branches. At Meta, Sapling is primarily used for our large monolithic repository (or monorepo, for short), but the Sapling client also supports cloning and interacting with Git repositories and can be used by individual developers to work with GitHub and other Git hosting services.

Why build a new source control system?

Sapling began 10 years ago as an initiative to make our monorepo scale in the face of tremendous growth. Public source control systems were not, and still are not, capable of handling repositories of this size. Breaking up the repository was also out of the question, as it would mean losing monorepo's benefits, such as simplified dependency management and the ability to make broad changes quickly. Instead, we decided to go all in and make our source control system scale.

Starting as an extension to the Mercurial open source project, it rapidly grew into a system of its own with new storage formats, wire protocols, algorithms, and behaviors. Our ambitions grew along with it, and we began thinking about how we could improve not only the scale but also the actual experience of using source control.

Sapling's user experience

Historically, the usability of version control systems has left a lot to be desired; developers are expected to maintain a complex mental picture of the repository, and they are often forced to use esoteric commands to accomplish seemingly simple goals. We aimed to fix that with Sapling.

A Git user who sits down with Sapling will initially find the basic commands familiar. Users clone a repository, make commits, amend, rebase, and push the commits back to the server.
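As a rough sketch of that basic loop (command names are from the Sapling documentation; the repository URL is just an example, and push flags may vary by hosting setup):

$ sl clone https://github.com/git/git   # works directly against a Git repo
$ cd git
$ sl commit -m "Fix a typo"    # no staging area: changed files are committed directly
$ sl amend                     # revise the commit in place
$ sl rebase -d main            # rebase your work onto main
$ sl push                      # push the commits back to the server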
What will stand out, though, is how every command is designed for simplicity and ease of use. Each command does one thing. Local branch names are optional. There is no staging area. The list goes on.

It's impossible to cover the entire user experience in a single blog post, so check out our user experience documentation to learn more. Below, we'll explore three particular areas of the user experience that have been so successful within Meta that we've had requests for them outside of Meta as well.

Smartlog: Your repo at a glance

The smartlog is one of the most important Sapling commands and the centerpiece of the entire user experience. By simply running the Sapling client with no arguments, sl, you can see all your local commits, where you are, where important remote branches are, what files have changed, and which commits are old and have new versions. Equally important, the smartlog hides all the information you don't care about. Remote branches you don't care about are not shown. Thousands of irrelevant commits in main are hidden behind a dashed line. The result is a clear, concise picture of your repository that's tailored to what matters to you, no matter how large your repo.

Having this view at your fingertips changes how people approach source control. For new users, it gives them the right mental model from day one. It allows them to visually see the before-and-after effects of the commands they run. Overall, it makes people more confident in using source control.

We've even made an interactive smartlog web UI for people who are more comfortable with graphical interfaces. Simply run sl web to launch it in your browser. From there you can view your smartlog, commit, amend, checkout, and more.

Fixing mistakes with ease

The most frustrating aspect of many version control systems is trying to recover from mistakes. Understanding what you did is hard. Finding your old data is hard. Figuring out what command you should run to get the old data back is hard. The Sapling development team is small, and in order to support our tens of thousands of internal developers, we needed to make it as easy as possible to solve your own issues and get unblocked.

To this end, Sapling provides a wide array of tools for understanding what you did and undoing it. Commands like sl undo, sl redo, sl uncommit, and sl unamend allow you to easily undo many operations. Commands like sl hide and sl unhide allow you to trivially and safely hide commits and bring them back to life. There is even an sl undo -i command for Mac and Linux that allows you to interactively scroll through old smartlog views to revert back to a specific point in time or just find the commit hash of an old commit you lost. Never again should you have to delete your repository and clone again to get things working.

See our UX doc for a more extensive overview of our many recovery features.

First-class commit stacks

At Meta, working with stacks of commits is a common part of our workflow. First, an engineer building a feature will send out the small first step of that feature as a commit for code review. While it's being reviewed, they will start on the next step as a second commit that will later be sent for code review as well. A full feature will consist of many of these small, incremental, individually reviewed commits on top of one another.

Working with stacks of commits is particularly difficult in many source control systems. It requires complex stateful commands like git rebase -i to add a single line to a commit earlier in the stack.
## First-class commit stacks

At Meta, working with stacks of commits is a common part of our workflow. First, an engineer building a feature will send out the small first step of that feature as a commit for code review. While it’s being reviewed, they will start on the next step as a second commit that will later be sent for code review as well. A full feature will consist of many of these small, incremental, individually reviewed commits on top of one another.

Working with stacks of commits is particularly difficult in many source control systems. It requires complex stateful commands like git rebase -i just to add a single line to a commit earlier in the stack. Sapling makes this easy by providing explicit commands and workflows that let even the newest engineer edit, rearrange, and understand the commits in the stack.

At its most basic, when you want to edit a commit in a stack, you simply check out that commit via sl goto COMMIT, make your change, and amend it via sl amend. Sapling automatically moves, or rebases, the top of your stack onto the newly amended commit, allowing you to resolve any conflicts immediately. If you choose not to fix the conflicts now, you can continue working on that commit, and later run sl restack to bring your stack back together once again. Inspired by Mercurial’s Evolve extension, Sapling keeps track of the mutation history of each commit under the hood, allowing it to algorithmically rebuild the stack later, no matter how many times you edit the stack.

Beyond simply amending and restacking commits, Sapling offers a variety of commands for navigating your stack (sl next, sl prev, sl goto top/bottom), adjusting your stack (sl fold, sl split), and even for automatically pulling uncommitted changes from your working copy down into the appropriate commit in the middle of your stack (sl absorb, sl amend --to COMMIT).
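Putting those commands together, a mid-stack edit might look roughly like this; COMMIT is a placeholder hash.

```
$ sl goto COMMIT     # jump to a commit in the middle of the stack
$ vi server.py       # make the fix
$ sl amend           # descendants are rebased onto the amended commit automatically
$ sl restack         # if conflicts were deferred, reassemble the stack now
$ sl absorb          # or: push pending working-copy edits down into the right commits
```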
## ReviewStack: Stack-oriented code review

Making it easy to work with stacks has many benefits: commits become smaller, easier to reason about, and easier to review. But effectively reviewing stacks requires a code review tool that is tailored to them. Unfortunately, many external code review tools are optimized for reviewing the entire pull request at once instead of the individual commits within it. This makes it hard to have a conversation about individual commits and negates many of the benefits of having a stack of small, incremental, easy-to-understand commits.

Therefore, we put together a demonstration website that shows just how intuitive and powerful stacked-commit review flows could be. Check out our example stacked GitHub pull request, or try it on your own pull request by visiting ReviewStack. You’ll see how you can view the conversation and signal pertaining to a specific commit on a single page, and how you can easily move between different parts of the stack with the drop-down and navigation buttons at the top.

## Scaling Sapling

Note: Many of our scale features require using a Sapling-specific server and are therefore unavailable in our initial client release. We describe them here as a preview of things to come. When using Sapling with a Git repository, some of these optimizations will not apply.

Source control has numerous axes of growth, and making it scale requires addressing all of them: number of commits, files, branches, merges, length of file histories, size of files, and more. At its core, though, it breaks down into two parts: the history and the working copy.

## Scaling history: Segmented Changelog and the art of being lazy

For large repositories, the history can be much larger than the size of the working copy you actually use. For instance, three-quarters of the 5.5 GB Linux kernel repo is the history. In Sapling, cloning the repository downloads almost no history. Instead, as you use the repository we download just the commits, trees, and files you actually need, which allows you to work with a repository that may be terabytes in size without having to download all of it. Although this requires being online, efficient caching and indexes let us maintain a configurable ability to work offline in many common flows, like making a commit.

Beyond just lazily downloading data, we need to be able to efficiently query history. We cannot afford to download millions of commits just to find the common ancestor of two commits or to draw the smartlog graph. To solve this, we developed the Segmented Changelog, which allows downloading the high-level shape of the commit graph from the server, taking just a few megabytes, and lazily filling in individual commit data later as necessary. This enables querying the graph relationship between any two commits in O(number-of-merges) time, with nothing but the segments and the positions of the two commits within them. The result is that commands like smartlog take less than a second, regardless of how big the repository is.

Segmented Changelog speeds up other algorithms as well. When running log or blame on a file, we’re able to bisect the segment graph to find the history in O(log n) time instead of O(n), even in Git repositories. When used with our Sapling-specific server, we go even further, maintaining per-file history graphs that allow answering sl log FILE in less than a second, regardless of how old the file is.

## Scaling the working copy: Virtual or Sparse

To scale the working copy, we’ve developed a virtual file system (not yet publicly available) that makes it look and act as if you have the entire repository. Clones and checkouts become very fast, and while accessing a file for the first time requires a network request, subsequent accesses are fast and prefetching mechanisms can warm the cache for your project.

Even without the virtual file system, we speed up sl status by utilizing Meta’s Watchman file system monitor to query which files have changed without scanning the entire working copy, and we have special support for sparse checkouts to allow checking out only part of the repository.

Sparse checkouts are particularly designed for easy use within large organizations. Instead of each developer configuring and maintaining their own list of which files should be included, organizations can commit “sparse profiles” into the repository. When a developer clones the repository, they can choose to enable the sparse profile for their particular product. As the product’s dependencies change over time, the sparse profile can be updated by the person changing the dependencies, and every other engineer automatically receives the new sparse configuration when they check out or rebase forward. This allows thousands of engineers to work on a constantly shifting subset of the repository without ever having to think about it.

To handle large files, Sapling even supports using a Git LFS server.

## More to Come

The Sapling client is just the first chapter of this story. In the future, we aim to open-source the Sapling-compatible virtual file system, which enables working with arbitrarily large working copies and makes checkouts fast, no matter how many files have changed.

Beyond that, we hope to open-source the Sapling-compatible server: the scalable, distributed source control Rust service we use at Meta to serve Sapling and (soon) Git repositories. The server enables a multitude of new source control experiences. With the server, you can incrementally migrate repositories into (or out of) the monorepo, allowing you to experiment with monorepos before committing to them.
It also enables Commit Cloud, where all commits in your organization are uploaded as soon as they are made, and sharing code is as simple as sending your colleague a commit hash and having them run sl goto HASH.

The release of this post marks my 10th year of working on Sapling at Meta, almost to the day. It’s been a crazy journey, and a single blog post cannot cover all the amazing work the team has done over the last decade. I highly encourage you to check out our armchair walkthrough of Sapling’s cool features. I’d also like to thank the Mercurial open source community for all their collaboration and inspiration in the early days of Sapling, which started the journey to what it is today.

I hope you find Sapling as pleasant to use as we do, and that Sapling might start a conversation about the current state of source control and how we can all hold the bar higher for the source control of tomorrow. See the Getting Started page to try Sapling today.
# Build-aware sparse checkouts

Steno & PL, by Waleed Khan (@arxanas). Oct 19, 2022.

My team and I recently gave a talk at Git Merge 2022 on our tool Focus, which integrates Git sparse checkouts with the Bazel build system. A recording of the talk (~15 minutes) is available, along with some slides I published internally; the slides go into the technical details of content-hashing, and also have doodles.

Related posts:

- 19 Jun 2021: git undo: We can do better
- 12 Oct 2021: Lightning-fast rebases with git-move
- 19 Oct 2022: (this post) Build-aware sparse checkouts
- 16 Nov 2022: Bringing revsets to Git
- 05 Jan 2023: Where are my Git UI features from the future?
- 11 Jan 2024: Patch terminology

This is a personal blog. Unless otherwise stated, the opinions expressed here are my own, and not those of my past or present employers.
# Jujutsu (jj): a Git-compatible VCS (conference talk transcript)

My name is Martin von Zweigbergk. I expected my speaker notes to be here somewhere, but anyway: I work on source control at Google, and I'm going to talk about a project I've been working on for almost three years. It's a Git-compatible VCS called Jujutsu, and in case you're wondering, the name has nothing to do with Jujutsu Kaisen, the anime. It's called Jujutsu just because the binary is called jj, and the binary is called jj because it's easy to type.

Here's an overview of the presentation. First I'm going to give you some background about me and about the history of source control at Google. Then I'm going to go through the workflows and architecture of jj, and at the end I'll explain what's next for the open source project and for our integration at Google.

Some background about me: after graduating I worked for Ericsson for about seven years, and while there I think it's fair to say I drove the migration to Git from ClearCase. I also cleaned up some of Git's rebase scripts in my spare time. Then I joined Google and worked on a compensation app, and for the last eight years I've worked on Fig, which is a project to integrate Mercurial as a client for our in-house monorepo.

For context, let me tell you a bit about the history of version control at Google. A long time ago we supposedly started with CVS, and then we switched to Perforce. After a while Perforce wasn't able to handle the repository anymore because it got too large, so we wrote our own VCS called Piper. But the working copy was still too big, so we created a virtual file system called CitC on top of Piper, and that's what almost every user at Google uses now. People were still missing the DVCS workflows they were used to from outside Google, so we added Mercurial on top of that, and as I said, that's what I've been working on for the last eight years. In case you didn't know, our monorepo at Google is extremely large and has, like, all the source code at Google; you can watch the talk by Rachel Potvin from @Scale if you're curious.

Generally people really like Fig, but there are still some major problems. Probably the biggest one is performance, partly because Python is slow and partly because of eager data structures that don't scale to the size of the repo. Another problem is consistency: we're seeing write races, because Mercurial is not designed for distributed storage, so we get corruption when we store it on top of our distributed file system. Another pain has been integration, because we're calling the Mercurial CLI and parsing the output, which is not fun.

A few years ago we started talking about what we would want from a next-gen VCS, and one of the ideas that came up was to automatically make a commit from every save in your editor. I got really excited by that idea and started jj to experiment with it. I worked on it as my 20% project for about two years, and this spring we decided to invest more, so now it's my 100% project.

You may be wondering why I didn't decide to just add these features to Git. As I said, I wanted to experiment with a different UX, and I think that would end up being a completely separate set of commands inside of Git, which would be really ugly and probably wouldn't get accepted upstream. Also, I wanted to be able to integrate it into Google's ecosystem, and we had already decided with Fig against using Git there because of the problems with integrating it into our ecosystem at Google. One of those problems is that there are multiple implementations of Git that read from the file system, so we would have to add any new features in at least two or three places.

Okay, so let's take a look at the workflows. The first feature is anonymous branches, which is something I copied from Mercurial. Instead of Git's scary detached-HEAD workflow or state, jj keeps track of all your commits without you having to name them. That means they show up in log output and they will not be garbage-collected. It may seem like you would get a very cluttered log output very quickly, but whenever you rewrite a commit (if you amend it or rebase it, for example) the old version gets hidden, and you can also manually hide commits with jj abandon.

One of the first things you'll notice if you start using jj is that the working copy is an actual commit that gets automatically committed. It shows up in the log output with an @ symbol, and whenever you make changes in the working copy and then run any jj command, they get automatically amended into that commit. The jj checkout command actually creates a new commit on top of the commit you specify, to store your working-copy changes; if you instead want to resume editing an existing commit, you can use jj edit.

This has some very interesting consequences. An important one is that the working copy is never dirty: if you check out a different commit, or rebase, or anything else, it will never tell you that you have unstaged changes. You get automatic backups, because every command you run creates another snapshot for you. It makes stash unnecessary, because your working-copy commit is effectively a stash. It also makes commit unnecessary, because your working copy is already committed whenever you run a command; if you want to set the description (the commit message), you can run jj describe to do that at any time. We get a more consistent CLI, because the working-copy commit behaves just like any other commit, so there are no special flags to act on the working copy. For example, jj restore can restore files between any two commits; it defaults to restoring from the parent of the working copy into the working copy, just like git restore, I think, but you can pass any two commits. Also, uncommitted changes stay in place: they don't move around with you like they do with git checkout.
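A rough sketch of that working-copy-as-commit workflow, using only the commands named in the talk; hashes are placeholders, and jj's command set has evolved since, so exact names may differ.

```
$ jj log                       # the working copy appears as a commit marked with @
$ echo change >> file.txt      # edit; the @ commit absorbs this on the next jj command
$ jj describe -m "My change"   # set the commit message whenever you like
$ jj checkout xyz              # new working-copy commit on top of commit xyz
$ jj edit abc                  # or resume editing an existing commit
$ jj abandon def               # manually hide a commit you no longer want
```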
Another one of jj's distinguishing features is first-class conflicts. If you look at the screenshot, we merge the working copy with some other commit from another branch, and that succeeds even though there are conflicts; we can see in the log output afterwards that there are conflicts in the working copy. As you can see, the working-copy commit there, with the @, is a merge commit with conflicts. These conflicts are recorded in the commit in a structured way; they're not just conflict markers stored in a file.

This design leads to even more magic. The most obvious part is that you can delay resolving conflicts until you feel like it: in this case, when we had this conflict, we could just check out another commit and deal with the conflicts later if we wanted to. You can also collaborate on conflict resolutions: you can resolve some of the conflicts in these files and leave the rest for your co-worker, for example.

Maybe less obvious is that a rebase never fails. We saw here that merge doesn't fail, and the same is true for rebase and all the other similar commands, which makes the --continue and --abort variants of rebase, cherry-pick, and all those similar commands unnecessary. We also get a more consistent conflict-resolution flow: you don't need to remember which command created your conflicts. You just check out the commit with the conflict, resolve the conflict in the working copy, and then squash that conflict resolution into the parent that had the conflict. Or, as in the screenshot here, we were already editing that commit, so we just resolve the conflict in the working copy and it's gone from the log output. So in the second screenshot the conflict has been resolved in the working copy: the merge commit that is the working copy resolves the conflict. We'll come back to that in the next slide.

A very important feature, which took me a year to realize was possible, is that because rebase always succeeds, we can always rebase all descendants. If you check out a commit in the middle of a stack and amend in some changes, jj will always rebase all the descendants on top, so you don't have to do that yourself, and it moves branches over and so on.

We can actually even rebase conflicts and conflict resolutions; I don't have time to explain how that works, but it does. In the top screenshot here, continuing from the previous slide, we rebased one side of the merge, conflicting change 2 and its descendants, onto conflicting change 1, and of course we get the same conflict as we had in the merge commit before, and it stays in the descendant, the non-conflicting change. But in the working copy, because we had the conflict resolution in the working copy (which was a merge commit before), that resolution gets rebased over and stays in the working copy, so we don't have a conflict in the working copy here.

In the second screenshot, we ran jj move --to the first commit with the conflict. That command is a bit like the autosquash someone talked about earlier today, but much more generic: you can move a change from any commit into any other commit. In this case we're moving the changes from the working copy (that's the default when there's no --from) into the first commit with the conflict, and then the conflict is gone from the entire stack and the working copy in this case becomes empty.

We can also rebase merges correctly, by rebasing the diff compared to the auto-merged parents onto the new auto-merged parents. That's actually how jj treats all diffs or changes in commits, not just when rebasing: when you diff a merge commit, it's compared to the auto-merge of the parents, just like the three-way merge I think Elijah talked about before.
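As a sketch of that conflict-resolution flow, using command names from the talk's era of jj (hashes are placeholders):

```
$ jj rebase -d main          # never fails; conflicts are recorded inside the commits
$ jj log                     # conflicted commits are marked in the output
$ jj checkout <conflicted>   # check the conflicted commit out like any other
# ...fix up the conflicted files in the working copy...
$ jj squash                  # fold the resolution into the conflicted parent
```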
The last feature I was going to talk about is what I call the operation log. You can think of it as the entire repo being under source control, and by the entire repo I mean the refs, the anonymous heads, and the working copy's position in each worktree. It's kind of like Git's reflogs, but across all the refs at once, and with some extra metadata: in the top screenshot you can see the username, hostname, and time of each operation.

As you can imagine, having these snapshots of the repository at different points in time lets us go back and look at the repository from a previous snapshot. In the middle screenshot, we run jj status at a particular operation, just before we had set the description on the commit, so we see that the working-copy commit has no description but has a modified file, because that was the state at that point in time. You can run this with any command, not just status: log, or diff, or anything. Of course this is very useful when you're trying to figure out how a repository got into a certain state, especially if it's not your own repository and you don't remember what happened.

The last screenshot shows how we restored the entire repo to an earlier state, in this case before the file was even added: that's the second operation from the bottom, when we had just created the repository, so the working copy is empty and there's no description. Restoring back to an earlier state like that is very useful when you have made a mistake, but jj actually even lets you undo a single operation without undoing all the later operations. That works kind of like git revert does, but instead of acting on the files in a commit, it acts on the refs in an operation. The screenshots here show how we undo the second-to-last operation, which had abandoned a commit; we undo that operation and the commit becomes visible again.

The operation log also gives us safe concurrency, even if you run commands on a repository on different machines and sync them via Dropbox, for example. You can get conflicts, of course, but you will not lose data or get corruption, and in those cases, if you sync two operations via Dropbox, you'll see a merge in the operation log output.

We also get a simpler programming model, because transactions never fail. When you start a command, the repository is loaded at the latest operation; any changes that happen concurrently will not be seen by that command, and when the command is done it commits the transaction as a new operation, which gets recorded and can't fail, just like writing a Git commit can't fail. This is actually why I developed the operation log: for the simpler programming model. The undo and time-travel stuff is just an accident.
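A sketch of the operation-log commands implied above; operation IDs are placeholders, and I'm assuming the jj op subcommand layout, which may differ by version.

```
$ jj op log                   # who did what, when, and on which host
$ jj status --at-op <op-id>   # view the repo as of an earlier operation
$ jj undo <op-id>             # undo one operation without undoing later ones
$ jj op restore <op-id>       # restore the entire repo to an earlier state
```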
Okay, that's all about features and workflows; let's take a look at the architecture. jj is written in Rust, and it's written as a library, to be easy to reuse. The CLI is currently the only user of that library, but the goal is to be able to use it in a server, a CLI, a GUI, or an IDE, for example. I also hope that by making it a library we reduce the temptation to rewrite it and end up with the same problem as Git.

Because I wanted jj to be able to integrate with the ecosystem at Google, I tried to make it easy to replace different modules. For example, it comes with two different commit-storage backends by default: one is the Git backend, which stores commits as Git commits, and the other is just a very simple custom one; but it should be possible to write one that stores commits in the cloud, for example. Same thing with the working copy: it's normally stored on local disk, but the goal is to be able to replace that with an implementation that writes the working copy to a VFS, integrating with a smarter VFS by not actually writing all the files.

It also needs to be scalable to very large repositories, which means we can't have any operations that need all the ancestors of a commit, for example. We achieve that mostly through laziness, not fetching objects we don't need, and by trying to be careful when designing algorithms so they don't scale with the size of the repository or the size of a tree unless necessary.

Another important design decision was to perform operations in the repository first, not in the working copy. All operations write commits and update references to point to those commits, and only afterwards, after the transaction commits, is the working copy updated to match the new state, if the working copy was even changed by the operation. It helps here that the working copy is a commit, because even when you're running a command that updates the working copy, you just commit the transaction that modifies the working-copy commit and then update the working copy afterwards. We get a simpler programming model because we always just create whatever commits we need and don't have to worry about what's in the working copy; the same thing as Elijah was talking about, I think, with merge-ort. This also makes things much faster, by not touching the working copy unnecessarily, and it helps with the laziness, because you don't need to fetch objects from the backend (which might be a server) just to update the working copy and then update it back later.

As I said, jj is Git-compatible, and you can start using it on a Git project independently of others who work on the same project. This compatibility happens at a few different levels. At the lowest level there's commit storage: there's a commit-storage backend that stores commits in a backing Git repository, reading commits from the Git object store and converting them into the in-memory representation that jj expects. There's also functionality for importing Git refs into jj and exporting refs to Git, and for interacting with Git remotes. Of course, these commands only work when the backend is the Git backend, not with the custom backend.
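A sketch of what that Git interop looks like from the CLI, assuming the jj git subcommands of that era; the URL is a placeholder.

```
$ jj git clone https://example.com/project.git   # a jj repo backed by a Git repo
$ jj git fetch                                   # talk to ordinary Git remotes
$ jj git import                                  # pick up refs changed by plain git
$ jj git export                                  # make jj's refs visible to git
$ jj git push                                    # push branches back upstream
```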
Since I worked on Mercurial for many years, there are many good things from it I want to replicate, such as the simple, clean UX, with for example its revsets language for selecting revisions, and the simpler workflows without the staging area; I copied those. Mercurial's history-rewriting support is also pretty good, with hg split and hg fold for splitting and squashing commits, and hg rebase, which rebases a whole tree of commits, not just a linear branch; I copied those things as well. Mercurial is also very customizable, mostly thanks to being written in Python. Since jj is written in Rust we can't just copy that; you can't use monkey-patching in the same way. I hope the modular design I talked about earlier at least helps there, but it still has a long way to go to be as customizable as Mercurial.

Okay, let's take a look at our plans for the open source project and our integration at Google. Remember, this has been my 20% project for most of its lifetime, so there's a lot of features and functionality missing. For example, GC and repacking are not implemented, so we need to implement that for both commits and operations. For backends that lazily fetch objects over the network, we need to do some batched prefetching and caching to make that perform okay. There is no support for copies or renames, so we have to decide whether we want to track them, like Mercurial or BitKeeper do, or detect them, like Git does; I'm happy to hear suggestions. Lots of other features are missing as well; on this particular one I actually got a pull request a few days ago, but still, a blame command, for example, is not there. We need to make it easier to replace different components, for example adding a custom command or a custom backend through the API, without having to patch the source code. Language bindings: again, we want to avoid rewrites, so I hope people can use the API instead, and we want to make that as easy as possible from different languages. Security hardening. And the custom, non-Git backend has very little support; there is no support for push and pull, for example, so that will need to be done at some point, but the Git backend is what I would recommend to everyone for many years ahead anyway. Of course, contributions are very welcome; the URL for the project is there. We don't want this to be just Google running the project.

As I said, this is now my full-time project at Google, and the goal of that project is to improve the internal development ecosystem. There are two big parts to it. One is to replace Mercurial with jj internally. The other is to move the commit storage and repository storage out of the file system and into the cloud. We're hoping to create a single huge commit graph, with all the commits from all Google developers in just one graph, and by having the repositories available in the cloud, you should be able to access them from anywhere: a cloud IDE, for example, or a review tool should be able to access your repository and modify it.

This diagram shows how we're planning to accomplish that. Most of it is the same as on the previous slide; I've added the new systems and integrations in red. One of the first things we're going to have to do is replace the commit graph component (called the commit index in the diagram), because it currently assumes that all ancestors are available, which of course doesn't scale to the size of our monorepo. We're going to have to rewrite it to make it lazy, probably using something called Segmented Changelog, which our friends at Meta developed for their Mercurial-based VCS. Then we're going to add new backends for the internal cloud-based storage of commits and operation logs (operation logs are basically repositories), and we'll add a custom working-copy implementation for our VFS. We're hoping not to use sparse checkouts: instead, with our VFS, we can basically just tell the VFS which commit to check out, plus a few files on top. We'll also add a bunch of commands for integrating with our existing review tooling, for starting reviews, merging into mainline, and things like that. And then we're going to add a server on top of all this that will be used by the cloud IDE and the review tool.

That was all I had, so thanks for listening. If you have any questions, you can find a link to our Discord chat on the project's GitHub page, and feel free to email me at martinvonz@google.com. Thank you.
# Fossil and Pijul: in search of the next version control system (conference talk transcript)

Okay, good afternoon, great to see so many of you here; welcome to this talk. My name is Hanno, I'm from the Netherlands, and in the Netherlands I work as an IT consultant at Info Support. My Twitter handle is @hannotify. I thought it was rather clever, "get notified about things that Hanno is doing", but if you don't think so, that's not a problem; my wife also doesn't think it's very clever, but it's good enough for me. I tweet about stuff that I spend way too much time on: programming, Java, music (I'm a musician, so also whenever a pop song makes me think of software development), version control, and things I encounter in my projects. If you like that, feel free to give me a follow.

When it comes to version control, the topic of today's talk, I have come a long way. When I was in college I didn't use Git or Subversion; I did version control by email. Who recognizes this? Version control by email: you take the code repository, do a few changes, zip it up, and send it via mail to your classmates: "this is the final version." Then you get a reply ten minutes later: "no, THIS is the final version." You get it. When I started my first job in IT in 2007, I was really eager to learn something new in the world of version control, but at that place of work they used version control by USB stick. Also very modern, you know: you put your changeset on a USB stick, walk over to a central computer, merge it manually over there, and go back to your workstation with a fresh copy. A disaster.

Thank goodness it's 15 years later today, and I have managed to gain some experience with proper version control systems like Subversion and Git. I even teach a Git course at Info Support: a one-day course where I teach our junior colleagues how to use Git as a developer, so lots of command-line stuff, and also the pros and cons of distributed version control and how it compares to earlier systems like CVS and Subversion. This is actually one of the core slides: it displays which version control systems have emerged until now, with their publication dates, and it tries to put this into perspective by comparing each publication date to the most modern phone available at the time. For example, the Nokia 3310 (who had one of those? Nokia friends, hi!) correlates to Subversion: both this phone and Subversion are indestructible, and the battery was excellent, though Subversion doesn't have a battery, so that comparison doesn't make any sense at all. Whereas CVS, published in 1990 and obviously ancient, refers to this one here; it even has a power block, also ancient.

At the end of one particular course, a student came up to me and said: "I really like that you taught us Git, and I'm going to be happy using it. But it seems like kind of an old system, right? It was published in 2005, and it's now 2019. Why are we still using it 14 years later? Isn't that kind of old for a version control system? What's going to be the next big thing?"

To be honest, I couldn't answer her question. I wasn't sure; I had no idea. I thought, well, it's probably going to be around forever, but of course I didn't say that out loud, because nothing lasts forever in the IT world, right? Next week there will probably be a new JavaScript framework. So I couldn't tell her that, and I really didn't like that, so I decided to investigate for a bit. Things got kind of out of hand, because I didn't just tell her the answer; I had to investigate it, and the investigation turned into a conference talk, and it's the one you're attending right now. So welcome to you all, thank you for attending, and let's try to find the answer to this student's question together, shall we?

I think that to answer the question, we have to discover what factors influence the popularity of version control today. Let's take Git as an example: why do we think Git became so popular? If we look at this graph for a moment, we see one of the reasons. This is a graph of the popularity of version control systems until 2016, and we see that from the moment GitHub was launched, right here, Git started growing more rapidly. And when Bitbucket, a hosting service that initially supported Mercurial, started supporting Git around 2012, Git started to rise even more in popularity. So a very important factor is hosting platform support.

But I think there are more reasons, for example killer features. When Git was published it was meant as a replacement for BitKeeper, the only distributed version control system at the time, which had a commercial license; when that commercial license was going to apply to open source development as well, a lot of new projects emerged, Mercurial and also Git. So: free to use. It was also meant to be fast ("everyday operations should always take less than a second" is what Linus Torvalds said about this), and it supported easy branching, unlike CVS, for example, where a branch is just a carbon copy of a directory, which could take a very long time with a lot of nesting and small files.

Hosting platform support for Git is of course superb: there are 18 different websites that offer public Git repositories as of 2022. And open source community support is also an important reason, I think: Git has been the driving force behind global open source development, with contributors spread across the globe.

I'd like to use these strengths of Git as prediction variables: factors that will influence our prediction of the next big thing. And I would like to add another one, which I call the handicap of the head start, because it's always hardest for the top product to remain the top product. Only if you innovate regularly will you remain on top, and not every product has managed this. For example, Internet Explorer was just declared dead at the start of this year: it failed to innovate, failed to add tabbed browsing during the browser wars, and got overtaken by Firefox and later Google Chrome. The same thing happened to Subversion, or maybe to your favorite football team. I'm a Feyenoord supporter, for the Dutchies in the room, so I really know what it is to get on top and not stay there; we only become champions about once every 18 years.

When we put the graph data into a table, this is what we see: in 2021 Subversion is down to eight percent and Git is up to 74. Now, the graph didn't run until 2021, so I had to extrapolate a bit, but I think these numbers are about right. The question is: what will happen in 2032? Will Git still be the top one? Of course, we also have to keep in mind that new products will emerge, so let's add some new products to the mix. In the rest of the talk I'll add two newer version control systems to this table; they are called Fossil and Pijul. Let's talk about Fossil first.
Fossil has the following features. It is a distributed version control system, published a few years after Git, and it has integrated project management: bug tracking, a wiki, tech notes. Sometimes it's called "GitHub in a box", and all of these features you can pull up on your local development machine through the command fossil ui; I'll show you in a bit. There's really no need to use any third-party products here, because you get a fully featured developer website with the version control system.

So, like I said, there's this built-in web interface. There's also an autosync merge mode, which means, in Git terms, that after you commit you automatically push if there are no conflicts; but you can use a manual merge mode if you want. Fossil can also show the descendants of a check-in, whereas Git can only show ancestors; with Fossil you can also list descendants, so relationships between check-ins are effectively bi-directional. It supports operations on multiple repositories, so if you're doing microservices and every application is a separate repository, Fossil can perform operations on all the repositories it knows of. And it has a "preserve all history" philosophy, which means there is no rebase in Fossil: you can't change the history. "Fossil" would also be a very bad name for a version control system that could change the history, so the name is well chosen, I think.

It was written specifically to support the development of SQLite, and the SQLite project uses Fossil for its development. The two projects reap the benefits of one another, because Fossil uses an SQLite database to store the relationships between check-ins. There is free code hosting at chiselapp.com, so you can use that if you want to try out Fossil; but really, Fossil just needs an SQLite database file, so I guess any hosting provider could do the job. It's kind of flexible; like I said, you can just host it yourself. A repository in Fossil is a single SQLite database file, and it contains the relations between the check-ins, so it can produce both ancestors and descendants.

Let's see how Fossil works in a quick demo; I'll switch to a different screen for you. We are in an empty directory right now, and we're going to create a new Fossil repository: let's create a demo directory and create a new repository. With Fossil, the repository file is kept outside the working directory. This is different from Git, where there's the nested .git folder; with Fossil you keep the repository file outside the directory. Let's change into demo, and then we have to connect this directory to the repo file in the folder above us; now we've connected them. Let's create a Java file and add it to our repository. We can see the changes if we want, there they are, and we can commit, like so: fossil commit -m "initial commit". There we go, we've got a new commit. And if we want to see the developer website, we just run fossil ui, and there it is: it shows the check-ins and the history of our repository, and you can browse the files if you want.

I've called this file Random.java, so the idea is to provide some random implementation here. Let's create a class Random with a main method that prints the result of the random() call. Now, every time I want to generate a random number, I always try to roll a die, like in the XKCD comic; someone recognizes it. So I'm just rolling it: you two people over there, maybe you can help me out and see which number it lands on. The result is one, ladies and gentlemen, so let's return 1, generated by a die, guaranteed to be random. Right. Yes, thanks. I'll cover that later.

Okay, so let's return to our Fossil repository and close the UI down, just for the sake of sanity. Oh, I've forgotten something here; the demo effect, thank you folks very much. There we go. Let's add this new random implementation to Fossil and do a fossil commit again: "implement random". We have committed it, but the Fossil UI isn't running because I killed it; restart it, and there's our "implement random" check-in. If you created a branch in Fossil, you could also take a check-in and see its descendants. This is a short talk, so I'm not going to be able to show much of that, but it's a nice feature of Fossil, in my opinion.
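Reconstructing the first half of the demo as a shell session (file names as in the talk; output omitted):

```
$ fossil init ../demo.fossil    # the repository is a single SQLite database file,
                                # kept outside the working directory
$ mkdir demo && cd demo
$ fossil open ../demo.fossil    # connect this directory to the repository
$ touch Random.java
$ fossil add Random.java
$ fossil changes                # list uncommitted changes
$ fossil commit -m "initial commit"
$ fossil ui                     # launch the built-in developer website
```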
Returning to the slide deck: Git versus Fossil, actually. Fossil is really meant for small-scale teams. The Fossil team and the SQLite team are both small, four or five developers, and they all know each other personally; whereas Git serves a major global open source project, with some 5,000 contributors to the Linux kernel. So the purposes of these two version control systems are very different. The engagement is global with Git and personal with SQLite and Fossil, and the workflows are also very different. With Git, the "dictator and lieutenants" workflow was conceived by Linus Torvalds and his co-workers: because the Linux kernel is so vast, so many lines of code, Linus only trusts a few specific people with certain parts of the kernel, which is why he calls them his lieutenants. With Fossil, a very small team, the workflow is built on trust.

So here is something to think about: why are most projects using Git, a version control system that was designed to support a globally developed, major open source project like the Linux kernel? Are our own projects actually like that, or are a lot of projects actually like SQLite and Fossil: a small-scale team with a limited number of members? I'm not saying you should switch to Fossil; I'm just saying, maybe try it out for a change and see if you like it.

The second version control system is Pijul; time to dive into that, and the first thing we need to address is the name. It's a Spanish word for a bird that's native to Mexico, a bird known to do collaborative nest building, and I kind of like the analogy of collaboratively building something using a version control system. They tried different names, but this one is very uniquely googleable, and, well, they have a fair point, so Pijul it is. This version control system was first published in 2015, and it's a patch-based system (also distributed, by the way), and the patch-based design is kind of neat. It's based on a sound theory of patches: they have mathematical proofs that their patch theory works and that it doesn't lead to as many conflicts, with all kinds of examples. I'm not a mathematician, but if you like that stuff you should really check out their website. It's by no means the first patch-based version control system, because Darcs is also based on patches, but Darcs never got all that popular, because it was kind of slow, and one of the main goals of Pijul is to fix the slowness of Darcs.
Pijul also has a feature called interactive recording, which means that when you have added files to a patch (in Pijul, a changeset is called a patch), you can still choose which parts of each file you want to include in your change and which parts you want to leave out.

Some quick facts about Pijul: it's written in Rust, one of the faster languages around if you look at language speed comparisons, and it's been bootstrapped since 2017, which means the Pijul developers use Pijul as the version control system for Pijul's own source code, just like the people at IntelliJ use their own IDEs to develop their own IDEs. They eat their own dog food, basically. You can use free code hosting at nest.pijul.com; their hosting platform is called the Pijul Nest, like the birds that do the collaborative nest building, so it all makes sense now.

Like I said, Pijul is based on a patch-oriented design. A patch is an intuitive, atomic unit of work, and the system focuses on changes instead of differences between snapshots. With Git, you create a version of a file, then a next version, and Git saves the two states of the file; with changes, Pijul stores the things you added to or removed from the file, not the file states themselves. This also means that applying or unapplying a patch doesn't change its identity. With Git it does change identity: when you apply a commit to a different branch it gets a different ID, and it's not exactly the same commit internally; the contents will be similar, but not identical. The end result of applying several patches in Pijul is always the same, regardless of the order in which they were applied. Pijul achieves this by keeping track of dependent patches: if I create a file in patch 1 and modify it in patch 2, then when I want to apply patch 2 to a different file set, Pijul knows it has to apply patch 1 first, and after that it applies patch 2.

This also means that rebase and merge don't really exist: applying a patch is just applying a patch. If you really wanted to merge two Pijul "branches" (a branch in Pijul is called a channel), you would just apply the patches that are the difference between those two channels; there is no merge command, and there is certainly no rebase command. Pijul does model conflicts, so conflicts can happen, but to resolve one you just add a third patch on top of the two conflicting patches, and because patch identities never change, you can always reuse that third resolution patch in another spot, or on another channel, to fix the same conflict. So you don't have to fix conflicts over and over again, like you sometimes do with Git.

Let's make it more visual; I'll use the second slide, actually: if commits were bank transactions, this is the difference between snapshot-based and patch-based. If this is my bank account and my initial balance is 100, Git stores "100" and Pijul stores "+100": apparently you've gained 100 (euros or dollars, doesn't matter). When I get paid by my employer (it's an okay salary, not much, but, you know, 300 euros), Git stores "400", because that's the end result of the change, but Pijul stores "+300", because that's the actual change. Then I have to pay my heating bill, and of course the whole balance will be gone, because heating is very expensive nowadays; so Git stores "0" and Pijul stores "-400".
This is essentially the difference between snapshot-based and patch-based systems. Okay, let's demo this for a bit and see how it works. Here I've got an empty directory with Pijul installed, and the first thing we need to do is initialize the repository, so let's run pijul init demo: we get a new folder, just like with Git, so that seems easy enough. Now, I want to keep track of the movies I've seen; this is my use case. This week I'm watching one movie; next week I'll watch another one. So I'll create a channel for next week, and if I ask Pijul which channels it knows, there's a main channel and a next-week channel. Right now we're on the main channel.

Let's create a movies file. This week I'm at Devoxx, and in my spare time I can watch the movies I like, because I'm all alone right now, and I was planning to watch a movie by Quentin Tarantino. I saw it was on Netflix; it's been a long time coming, I should have seen it already, but didn't, so let's make sure it happens this week. Let's tell Pijul that this file exists, and let's record it: pijul record -m "watched a movie". Now we get an edit screen: it shows the message, the author, and my author key. That's very inclusive, by the way: I can change my name later to Hanna instead of Hanno and it will still know it was me; I really like that feature. And here I can say "yes, add this line", or I can skip it; if there were multiple lines, this is the interactive recording, so I could leave things out here if I wanted, but I'm not going to do that right now; I'm just going to save it as is.

I think I'm going to watch another movie, so let's edit the movies file again; but this time it's for next week, and I won't be here. I'll be home with my kids, and my kids want to watch The Little Mermaid, so let's add that movie to the list as well and record it: "watch another movie". (A syntax error; I'm not sure what I did... oh, that's what I did, sorry.)

Now, if we ask for the Pijul log, we see all the patches, and what I want to do is take this patch, so I'll copy its hash to my clipboard and change channels to the next-week one: pijul channel switch (I should have known that). There we go. Of course, those patches are not present in this channel, but I can apply the one I copied right now. And look at what happened: it has applied two patches, because it can't apply an addition to a file that wasn't there in the first place. It knows there's a dependent patch and that it needs to apply that one first. I won't be able to show you, but it's just as easy with unrecording: I can unrecord a patch, and Pijul will know which patches are dependents of the one I'm unrecording. That's really nice.

These are all the commands, actually. Another command I like is credit. I really hate git blame; that's so negative. Pijul just says "credit", you know, because developers are creating good stuff. A nice touch by the Pijul team; I like that.
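The demo, reconstructed as a shell session. Command names are from the Pijul 1.0-beta era and may have changed; the channel-creation command (pijul fork) and the patch hash are my assumptions, since the talk doesn't spell them out.

```
$ pijul init demo && cd demo
$ pijul fork next-week               # assumed: create a second channel
$ pijul channel                      # list channels: main, next-week
$ echo "Movie 1" > movies.txt
$ pijul add movies.txt
$ pijul record -m "watched a movie"  # interactive: pick the hunks to include
$ echo "The Little Mermaid" >> movies.txt
$ pijul record -m "watch another movie"
$ pijul log                          # copy the hash of the second patch
$ pijul channel switch next-week
$ pijul apply <HASH>                 # pulls in the dependent first patch too
$ pijul credit movies.txt            # Pijul's friendlier take on git blame
```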
Let's return to the slides to wrap the talk up. Can we use this in production? Well, it was a research project for quite some time, but it reached version 1.0 at the beginning of this year, and they are currently in a beta phase. I'm going to keep track of it and see if I can introduce it in my own projects. So far I've only used it for keeping track of slide decks, working on them with other people collaboratively, and I've really liked it. I think it's quite promising, because it's quite simple to work with patches, and you don't get that familiar feeling of "oh, I've done something wrong with Git and now my repository has gone to hell and I don't know how to fix it". These things happen a lot less often with Pijul.

But of course there are a few drawbacks, and I'll try to plug them into our prediction variables. These were the variables we defined just before the demos of Fossil and Pijul: killer features, hosting platform support, open source community support, and the handicap of the head start. Now that we have seen a bit of Fossil and Pijul, let's score them.

When we talk about features, I gave Pijul the highest score, because I like its fast, patch-based versioning; especially in teams with both technical and non-technical contributors, it could be very good for cooperation, because it's a version control system that doesn't require knowing so much about its internals. With Git, basic usage is fine, but when you do something you didn't intend, you need some bizarre magic command that you can't remember and have to google every time you need it; with Pijul this could be a lot easier. I also like the show-descendants feature of Fossil, and the fact that it focuses on small teams (I think there are more small teams in the world than large ones), but I scored it just a bit lower.

When it comes to hosting capabilities, of course Git wins this one, because, like I said, there are 18 different places on the internet where you can host public Git repositories. For Mercurial repositories there are all right, I guess, seven or eight. Team Foundation has only one (Azure DevOps), and Fossil and Pijul both have one. So they have the potential to grow, but it's not their strong suit right now, and we'll have to wait and see how popular they get.

I think Git and Mercurial have both proven that they are superb at supporting the open source community, and we have to wait and see how this goes with Fossil; although I can tell you already that because Fossil can be hosted by yourself on your own server, that alone will not help it in the open source community, whereas with the Pijul Nest you can share your projects, so there's a bit more potential there. And the dominant products suffer the most from the handicap of the head start, so Git and Mercurial suffer the most from that.

When we add it all together, I think Git and Mercurial come out okay. Team Foundation: I don't think we'll be using much of that anymore, especially now that Azure DevOps supports Git repositories. Fossil stays about the same, and Pijul, I think, will gain some. When we put this into percentages (of course this is a lot of guesswork, but it's the best I could do with the data that I have right now), I think we'll see something like this: Git already has a very high top, and compared to 2021 I think it will grow a bit further. Subversion won't grow; it will only decline, and Mercurial will also decline a bit. Team Foundation will decline too: in my Git course I get a lot of people who are used to Team Foundation and want to learn Git now. And I think Fossil and Pijul will grow a bit. They are good projects, and I think they'll gain some traction in their own communities, though of course we'll have to wait and see whether they gain global traction, so I've put them at six percent and three percent. These are all guesses, but, like I said, it depends very much on whether more hosting platforms decide to support these version control systems. I've sorted the table by popularity for your convenience, right here.

If you're really into the topic, I've compiled a list of articles that I read in preparation for this talk; I'll tweet the slides, so you can read them later if you want.

The basic question at the end of the talk is: now what? What should I do with this information? Well, two of the things I learned by doing this research. First, a lot of projects are nothing like the Linux kernel; I think most of the projects I worked on during my working hours were nothing like the Linux kernel, so why don't you try Fossil for a change and see if you like it? And secondly, Git's snapshotting, and the situations you can find yourself in after doing an operation you didn't intend, might be too technical for the average user, so why don't you try Pijul for a change, maybe for a side project you work on with a few people, and see if you like it? Like I said, you can, because it is production-ready and stable.

It's a short talk, so there's not really much time for public questions, but if you have some, come see me after the talk. I would like to thank you for your attention; have a great conference day.
# Is it time to look past Git?

#git #scm #devops

I clearly remember the rough days of CVS and how liberating the switch to Subversion felt all those years ago. At the time I was pretty sure that nothing would ever get as good as Subversion. As it turns out, I was wrong, and Git showed it to me! So, having learned distributed version control concepts and embraced Git, I was pretty zealous about my newfound super powers. Again, I felt sure that nothing would ever surpass it. Again, it turned out I was wrong.

At the time of this writing, Git's been with us for over a decade and a half. During that time the ecosystem has absolutely exploded. From the rise of GitHub and GitLab to the myriad of new subcommands (just look at all this cool stuff), clearly Git has seen widespread adoption and success.

## So what's wrong with Git?

Well, for starters, I'm not the only person who thinks that Git is simply too hard. While some apparently yearn for the simpler Subversion days, my frustration comes from working with Git itself, some of which others have been pointing out for the last decade.

My frustrations are:

- I find the git-* family of utilities entirely too low-level for comfort. I once read a quote that Git is a version control construction kit more than a tool to be used directly. I think about this quote a lot.
- Git's underlying data model is unintuitive, and I love me a good graph data structure. Conceiving of a project's history as a non-causal treeish arrangement has been really hard for me to explain to folks and often hard for me to think about. Don't get me started on git rebase and friends.
- I can't trust Git's 3-way merge to preserve the exact code I reviewed. More here and here.

Two interesting projects I've discovered recently aim to add a Git-compatible layer on top of Git's on-disk format in order to simplify things: if not to simplify the conceptual model, then at least to provide an alternative UX.

One such project is the Gitless initiative, which has a Python wrapper around Git proper providing far simpler workflows based on some solid research. Unfortunately it doesn't look like Gitless' Python codebase has had active development recently, which doesn't inspire much confidence.

An alternative project to both Git and Gitless is a little Rust project called Jujutsu. According to the project page, Jujutsu aims, at least in part, to preserve compatibility with Git's on-disk format. Jujutsu also boasts its own "native" format, letting you choose between using the jj binary as a front-end for Git data or as a net-new tool.

## What have we learned since Git?

Well, technically one alternative I am going to bring up predates Git by several years, and that's DARCS. Fans of DARCS have written plenty of material on Git's perceived weaknesses. While DARCS' Haskell codebase apparently had some issues, its underlying "change" semantics have remained influential. For example, Pijul is a Rust-based contender currently in beta. It embraces a huge number of the paradigms that made DARCS effective and ergonomic, including the concept of commutative changes. In my estimation, the very concept of change commutation greatly improves any graph-based model of repositories. This led me to have a really wonderful user experience when taking Pijul out for a spin on some smallish local projects.
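To illustrate what commutation buys you, here is a hypothetical Pijul session (the hash is a placeholder): the point is that patches touching independent parts of the tree can be applied or unapplied in any order without changing their identity.

```
$ pijul record -m "A: fix typo in README"      # touches only README
$ pijul record -m "B: add logging to app.rs"   # touches only app.rs
$ pijul unrecord <HASH-OF-A>                   # A commutes past B: removing it
                                               # leaves B's patch identity intact
```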
Fossil styles itself as the batteries-included SCM, and it has long been used as the primary VCS for development of the venerable SQLite project. In trying out Fossil, I would rate its usability and simplicity as being roughly on par with Pijul, plus you get all the extras like a built-in web interface complete with forums, wiki, and an issue tracker.

Ultimately I prefer the Pijul model more than Fossil's, but still can't realistically see either of them replacing Git.

Why I'm not ditching Git (yet)

While my frustrations have led me to look past Git, I've come to the conclusion that I can't ditch it quite yet. Git's dominance is largely cemented and preserved by its expansive ecosystem. It's so much more than the Git integration which ships standard in VSCode, and even more than the ubiquity of platforms like GitHub and GitLab. It comes down to hearts and minds, insofar as folks have spent so much time "getting good at Git" that they lack the motivation to switch right now. Coupled with the pervasiveness of Git in software development shops, it's hard to envision any alternatives achieving the requisite critical mass to become a proper success.

Granted, just as Git had a built-in feature to ease migration from Subversion, Fossil has a similar feature to help interface with Git. That certainly helps, but I honestly don't know if it'll be enough.

Top comments (2)

jessekphillips (Jul 10 '22): I wish tools would emphasize a better flow. You point to rebase as an issue, but it makes for much more clarity of changes. I have not looked at the tools you noted, but I tend to find that these things go back to basics rather than empower. People should utilize git as a way to craft a change, not just saving some arbitrary state of the source code.

sasham1 (Jan 24): Hi Jonathan, you might want to check out Diversion version control - we just launched on HN. Would love to hear your thoughts!
# Pijul for Git users

## Introduction

Pijul is an in-development distributed version control system that implements repositories using a different [model][model] than Git's. This model enables [cool features][why-pijul] and avoids [common problems][bad-merge] which we are going to explore in this tutorial. (The good stuff appears from the "Maybe we don't need branches" section.) It is assumed that the reader uses Git on a daily basis, i.e. knows not only how to commit, push and pull but also has had the need to cherry-pick commits and rebase branches at least once.

## Creating a repo

```
$ mkdir pijul-tutorial
$ cd pijul-tutorial
$ pijul init
```

Nothing new here.

## Adding files to the repository

Just like in Git, after we create a file it must be explicitly added to the repository:

```
$ pijul add <files>
```

There's a difference though: in Pijul a file is just a UNIX file, i.e. directories are also files, so we don't need to create `.keep` files to add *empty* directories to our repos. Try this:

```
$ mkdir a-dir
$ touch a-dir/a-file
$ pijul add a-dir/
$ pijul status
```

The output will be:

```
On branch master

Changes not yet recorded:
  (use "pijul record ..." to record a new patch)

        new file: a-dir

Untracked files:
  (use "pijul add <file>..." to track them)

        a-dir/a-file
```

To add files recursively we must use the `--recursive` flag.

## Signing keys

Pijul can sign patches automatically, so let's create a signing key before we record our first patch:

```
$ pijul key gen --signing
```

The key pair will be located in `~/.local/share/pijul/config`. At the moment the private key is created without a password, so treat it with care.

## Recording patches

From the user perspective this is the equivalent of Git's commit operation, but it is interactive by default:

```
$ pijul record
added file a-dir
Shall I record this change? (1/2) [ynkadi?] y
added file a-dir/a-file
Shall I record this change? (2/2) [ynkadi?] y
What is your name <and email address>? Someone's name
What is the name of this patch? Add a dir and a file
Recorded patch 6fHCAzzT5UYCsSJi7cpNEqvZypMw1maoLgscWgi7m5JFsDjKcDNk7A84Cj93ZrKcmqHyPxXZebmvFarDA5tuX1jL
```

Here `y` means yes, `n` means no, `k` means undo and remake the last decision, `a` means include this and all remaining patches, `d` means include neither this patch nor the remaining patches, and `i` means ignore this file locally (i.e. it is added to `.pijul/local/ignore`).

Let's change `a-file`:

```
$ echo Hello > a-dir/a-file
$ pijul record
In file "a-dir/a-file"

+ Hello

Shall I record this change? (1/1) [ynkad?] y
What is the name of this patch? Add a greeting
Recorded patch 9NrFXxyNATX5qgdq4tywLU1ZqTLMbjMCjrzS3obcV2kSdGKEHzC8j4i8VPBpCq8Qjs7WmCYt8eCTN6s1VSqjrBB4
```

## Ignoring files

We saw that when recording a patch we can choose to locally ignore a file, but we can also create a `.pijulignore` or `.ignore` file in the root of our repository and record it. All those files accept the same patterns as a `.gitignore` file.

Just like in Git, if we want to ignore a file that was recorded in a previous patch we must remove that file from the repository.

## Removing files from the repository

```
$ pijul remove <files>
```

The files will be shown as untracked again whether they were recorded with a previous patch or not, so this has the effect of `git reset <files>` or `git rm --cached` depending on the previous state of these files.

## Removing a patch

```
$ pijul unrecord
```

This command is interactive. Alternatively, one can use `pijul unrecord <patch>` to remove one or more patches, knowing their hash. Patch hashes can be obtained with `pijul log`.
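For instance, a minimal session might look like this, continuing the repository from the earlier steps (the hash is an illustrative placeholder for whatever `pijul log` prints, and only the commands documented above are used):

```
$ pijul log                  # list patches and their hashes
$ pijul unrecord 9NrFXxyN... # remove the "Add a greeting" patch
$ pijul status               # the edit shows up as not yet recorded again
$ pijul record               # recording it again restores the same state
```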
Unrecording and recording the same patch again will leave the repository in the same state.

There are cases where a patch depends on a previous one. For example, if a patch edits (and only edits) file A, it will depend on the patch that created that file. We can see these dependencies with `pijul dependencies`, and they are managed automatically. This is why `pijul unrecord <patch>` might sometimes refuse to work.

## Discarding changes

```
$ pijul revert
```

This is like `git checkout` applied to files (instead of branches).

## Branches

To create a new branch we use the `pijul fork <branch-name>` command, and to switch to another branch we use `pijul checkout <branch-name>`.

To apply a patch from another branch we use the `pijul apply <patch-hash>` command. Notice that this doesn't produce a different patch with a different hash like `git cherry-pick` does.

Finally, to delete a branch we have the `delete-branch` subcommand, but:

## Maybe we don't need branches

Because in Git each commit is related to a parent (except for the first one), branches are useful to avoid mixing up unrelated work. We don't want our history to look like this:

```
* More work for feature 3
|
* More work for feature 1
|
* Work for feature 3
|
* Work for feature 2
|
* Work for feature 1
```

And if we need to push a fix for a bug ASAP, we don't want to also push commits that are still a work in progress, so we create branches for every new feature and work on them in isolation.

But in Pijul patches usually commute: in the same way that 3 + 4 + 8 produces exactly the same result as 4 + 3 + 8, if we apply patch B to our repo before we apply patch A and then C, the result will be exactly the same as what our coworkers will get if they apply patch A before patch C and then patch B. Now if patch C has a dependency called D (as we saw in "Removing a patch") they cannot commute, but the entire graph commutes with other patches, i.e. if I apply patch A before patch B and then patches CD, I will get the same repository state as if I applied patch B before patches CD and then patch A. So Alice could have the same history as in the previous example while Bob could have

```
* More work for feature 1
|
* Work for feature 2
|
* More work for feature 3
|
* Work for feature 3
|
* Work for feature 1
```

And the repos would be equivalent; that is, the files would be the same. Why is that useful?

We can start working on a new feature without realizing that it is actually a new feature and that we need a new branch. We can create all the patches we need for that feature (e.g. the patches that implement it, the patches that fix the bugs introduced by it, and the patches that fix typos) in whatever order we want. Then we can unrecord these patches and record them again as just one patch, without a rebase. (There's actually no rebase operation in Pijul.)

But this model really shines when we start to work with:

## Remotes

At the moment, pushing works over SSH; the server only needs to have Pijul installed. [The Nest][nest] is a free service that hosts public repositories.
We can reuse our current SSH key pair or create a new pair with

```
$ pijul key gen --ssh
```

This new key pair will be stored in the same directory used for the signing keys, and we can add it to [The Nest][nest] like we do with SSH keys in GitHub.

Now that we have an account on [The Nest][nest] we can upload our signing key with `pijul key upload`.

Now let's push something:

```
$ pijul push <our-nest-user-name>@nest.pijul.com:<our-repo>
```

Unless we pass the `--all` flag, Pijul will ask us which patches we want to push. So we can keep a patch locally, unrecord it, record it again, decide that actually we don't need it and kill it forever, or push it a year later when we finally decide that the world needs it. All without branches.

If we don't want to specify the remote every time we push, we can set it as default with the `--set-default` flag.

Of course, to pull changes we have the `pijul pull` command.

Both commands have `--from-branch` (source branch), `--to-branch` (destination branch) and `--set-remote` (create a local name for the remote) options.

BTW, if we can keep patches for ourselves, can we pull only the patches we want? Yes, that's called "partial clone". [It was introduced in version 0.11][partial-clone] and works like this:

```
$ pijul pull --path <patch-hash> <remote>
```

Of course it will bring the patch and all its dependencies.

As we have seen, we need neither branches, cherry-picking nor rebasing, because of the patch theory behind Pijul.

## Contributing with a remote

With Pijul we don't need forking either. The steps to contribute to a repo are:

1. Clone it with `pijul clone <repo-url>`
2. Make some patches!
3. Go to the page of the repo in [The Nest][nest] and open a new discussion
4. [The Nest][nest] will create a branch with the number of the discussion as a name
5. Push the patches with `pijul push <our-user-name>@nest.pijul.com:<repo-owner-user-name>/<repo-name> --to-branch :<discussion-number>`

Then the repo owner can apply our patches to the master branch.

You can also attach patches from your repos to a discussion when you create or participate in one.

## Tags

A tag in Pijul is a patch that specifies that all the previous patches depend on each other to recreate the current state of the repo.

To create a tag we have the `pijul tag` command, which will ask for a tag name.

After new patches are added to the repo, we can recreate the state of any tag by creating a new branch:

```
$ pijul fork --patch <hash-of-the-tag> <name-of-the-new-branch>
```

Because tags are just patches, we can look for their hashes with `pijul log`.

## In summary

Forget about bad merges, feature branches, rebasing and conflicts produced by merges after cherry-picking.

## Learning more

[Pijul has an on-line manual][manual] but currently it is a little bit outdated. The best way to learn more is by executing `pijul help`. This will list all the subcommands, and we can read more about any of them by running `pijul help <subcommand>`.

The subcommands are interactive by default, but we can pass data to them directly from the command line to avoid being asked questions. All these options are explained in each subcommand's help.

For more information on the theory behind Pijul refer to Joe Neeman's [blog post on the matter][theory]. He also wrote a post [that explains how Pijul implements it][implementation].

## A work in progress

As we said, Pijul is an in-development tool: [the UI could change in the future][ui-changes] and there are some missing features. ([Something][bisect] like `bisect` would be super helpful.)
But that's also an opportunity: the developers seem quite open to [receiving feedback][discourse].

[bad-merge]: https://tahoe-lafs.org/~zooko/badmerge/simple.html
[bisect]: https://discourse.pijul.org/t/equivalent-of-checking-out-an-old-commit/176
[discourse]: https://discourse.pijul.org
[implementation]: https://jneem.github.io/pijul/
[nest]: https://nest.pijul.com
[manual]: https://pijul.org/manual/
[model]: https://pijul.org/model/
[partial-clone]: https://pijul.org/posts/2018-11-20-pijul-0.11/
[theory]: https://jneem.github.io/merging/
[ui-changes]: https://discourse.pijul.org/t/equivalent-of-checking-out-an-old-commit/176
[why-pijul]: https://pijul.org/manual/why_pijul.html
# Designing Data Structures for Collaborative Apps

Matthew Weidner | Feb 10th, 2022

Keywords: collaborative apps, CRDTs, composition

I've put many of the ideas from this post into practice in a library, Collabs. You can learn more about Collabs, and see how open-source collaborative apps might work in practice, in my Local-First Web talk: Video, Slides, Live demo.

Suppose you're building a collaborative app, along the lines of Google Docs/Sheets/Slides, Figma, Notion, etc. One challenge you'll face is the actual collaboration: when one user changes the shared state, their changes need to show up for every other user. For example, if multiple users type at the same time in a text field, the result should reflect all of their changes and be consistent (identical for all users).

Conflict-free Replicated Data Types (CRDTs) provide a solution to this challenge. They are data structures that look like ordinary data structures (maps, sets, text strings, etc.), except that they are collaborative: when one user updates their copy of a CRDT, their changes automatically show up for everyone else. Each user sees their own changes immediately, while under the hood, the CRDT broadcasts a message describing the change to everyone else. Other users see the change once they receive this message.

Figure: CRDTs broadcast messages to relay changes.

Note that multiple users might make changes at the same time, e.g., both typing at once. Since each user sees their own changes immediately, their views of the document will temporarily diverge. However, CRDTs guarantee that once the users receive each others' messages, they'll see identical document states again: this is the definition of CRDT correctness. Ideally, this state will also be "reasonable", i.e., it will incorporate both of their edits in the way that the users expect.

In distributed systems terms, CRDTs are Available, Partition tolerant, and have Strong Eventual Consistency.

CRDTs work even if messages might be arbitrarily delayed, or delivered to different users in different orders. This lets you make collaborative experiences that don't need a central server, work offline, and/or are end-to-end encrypted (local-first software).

Figure: Google Docs doesn't let you type while offline; CRDTs allow offline editing, unlike Google Docs.

I'm particularly excited by the potential for open-source collaborative apps that anyone can distribute or modify, without requiring app-specific hosting.

# The Challenge: Designing CRDTs

Having read all that, let's say you choose to use a CRDT for your collaborative app. All you need is a CRDT representing your app's state, a frontend UI, and a network of your choice (or a way for users to pick the network themselves). But where do you get a CRDT for your specific app?

If you're lucky, it's described in a paper, or even better, implemented in a library. But those tend to have simple or one-size-fits-all data structures: maps, text strings, JSON, etc. You can usually rearrange your app's state to make it fit in these CRDTs; and if users make changes at the same time, CRDT correctness guarantees that you'll get some consistent result. However, it might not be what you or your users expect.
Worse, you have little leeway to customize this behavior.

Figure: an anomaly in many map CRDTs. In a collaborative todo-list, representing items with "title" and "done" fields, concurrently deleting an item and marking it done results in a nonsense item {"done": true} with no "title" field. (Image credit: Figure 6 by Kleppmann and Beresford.)

Likewise, in a collaborative spreadsheet, concurrently moving a column and entering data might delete the entered data. If a user asks you to change some behavior that comes from a CRDT library, but you don't understand the library inside and out, then it will be a hard fix.

This blog post will instead teach you how to design CRDTs from the ground up. I'll present a few simple CRDTs that are obviously correct, plus ways to compose them together into complicated whole-app CRDTs that are still obviously correct. I'll also present principles of CRDT design to help guide you through the process. To cap it off, we'll design a CRDT for a collaborative spreadsheet.

Ultimately, I hope that you will gain not just an understanding of some existing CRDT designs, but also the confidence to tweak them and create your own!

# Basic Designs

I'll start by going over some basic CRDT designs.

# Unique Set CRDT

Our foundational CRDT is the Unique Set. It is a set in which each added element is considered unique.

Formally, the user-facing operations on the set, and their collaborative implementations, are as follows:

- add(x): Adds an element e = (t, x) to the set, where t is a unique new tag, used to ensure that (t, x) is unique. To implement this, the adding user generates t, e.g., as a pair (device id, device-specific counter), then serializes (t, x) and broadcasts it to the other users. The receivers deserialize (t, x) and add it to their local copy of the set.
- delete(t): Deletes the element e = (t, x) from the set. To implement this, the deleting user serializes t and broadcasts it to the other users. The receivers deserialize t and remove the element with tag t from their local copy, if it has not been deleted already.

Figure: the lifecycle of an add or delete operation. In response to user input, the operator calls "Output message"; the message is then delivered to every user's "Receive & Update display" function.

When displaying the set to the user, you ignore the tags and just list out the data values x, keeping in mind that (1) they are not ordered (at least not consistently across different users), and (2) there may be duplicates.

Example: In a collaborative flash card app, you could represent the deck of cards as a Unique Set, using x to hold the flash card's value (e.g., its front and back strings). Users can edit the deck by adding a new card or deleting an existing one, and duplicate cards are allowed.
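To make the add/delete mechanics concrete, here is a minimal TypeScript sketch of a Unique Set replica. The names (`UniqueSet`, `Tag`, `broadcast`) are hypothetical, not from any particular library, and the network layer is assumed to deliver messages reliably and in causal order, as discussed next.

```
// Minimal Unique Set sketch. Assumes a network layer that delivers
// messages reliably and in causal order (see below).
type Tag = string;
type Message<X> =
  | { kind: "add"; tag: Tag; value: X }
  | { kind: "delete"; tag: Tag };

class UniqueSet<X> {
  private elements = new Map<Tag, X>();
  private counter = 0;

  constructor(
    private readonly replicaID: string,
    private readonly broadcast: (msg: Message<X>) => void
  ) {}

  add(value: X): void {
    // Unique tag: (device id, device-specific counter).
    const tag = `${this.replicaID}:${this.counter++}`;
    const msg: Message<X> = { kind: "add", tag, value };
    this.receive(msg);   // apply locally...
    this.broadcast(msg); // ...and relay to everyone else
  }

  delete(tag: Tag): void {
    const msg: Message<X> = { kind: "delete", tag };
    this.receive(msg);
    this.broadcast(msg);
  }

  // Called for both local and remote messages.
  receive(msg: Message<X>): void {
    if (msg.kind === "add") this.elements.set(msg.tag, msg.value);
    else this.elements.delete(msg.tag); // no-op if already deleted
  }

  values(): X[] {
    return [...this.elements.values()]; // unordered, may contain duplicates
  }
}
```

Causal delivery guarantees that a delete message never arrives before its corresponding add, which is why `receive` needs no other bookkeeping.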
When broadcasting messages, we require that they are delivered reliably and in causal order, but it's okay if they are arbitrarily delayed. (These rules apply to all CRDTs, not just the Unique Set.) Delivery in causal order means that if a user sends a message m after receiving or sending a message m', then all users delay receiving m until after receiving m'. This is the strictest ordering we can implement without a central server and without extra round-trips between users, e.g., by using vector clocks.

Messages that aren't ordered by the causal order are concurrent, and different users might receive them in different orders. But for CRDT correctness, we must ensure that all users end up in the same state regardless, once they have received the same messages.

For the Unique Set, it is obvious that the state of the set, as seen by a specific user, is always the set of elements for which they have received an add message but no delete messages. This holds regardless of the order in which they received concurrent messages. Thus the Unique Set is correct.

Note that delivery in causal order is important: a delete operation only works if it is received after its corresponding add operation.

We now have our first principle of CRDT design:

Principle 1. Use the Unique Set CRDT for operations that "add" or "create" a unique new thing.

Although it is simple, the Unique Set forms the basis for the rest of our CRDTs.

Aside. Traditionally, one proves CRDT correctness by proving that concurrent messages commute, i.e., they have the same effect regardless of delivery order (Shapiro et al. 2011), or that the final state is a function of the causally-ordered message history (Baquero, Almeida, and Shoker 2014). However, as long as you stick to the techniques in this blog post, you won't need explicit proofs: everything builds on the Unique Set in ways that trivially preserve CRDT correctness. For example, a deterministic view of a Unique Set (or any CRDT) is obviously still a CRDT.

Aside. I have described the Unique Set in terms of operations and broadcast messages, i.e., as an operation-based CRDT. However, with some extra metadata, it is also possible to implement a merge function for the Unique Set, in the style of a state-based CRDT. Or, you can perform VCS-style 3-way merges without needing extra metadata.

# List CRDT

Our next CRDT is a List CRDT. It represents a list of elements, with insert and delete operations. For example, you can use a List CRDT of characters to store the text in a collaborative text editor, using insert to type a new character and delete for backspace.

Formally, the operations on a List CRDT are:

- insert(i, x): Inserts a new element with value x at index i, between the existing elements at indices i and i+1. All later elements (index >= i+1) are shifted one to the right.
- delete(i): Deletes the element at index i. All later elements (index >= i+1) are shifted one to the left.

We now need to decide on the semantics, i.e., what is the result of various insert and delete operations, possibly concurrent. The fact that insertions are unique suggests using a Unique Set (Principle 1). However, we also have to account for indices and the list order.

One approach would use indices directly: when a user calls insert(i, x), they send (i, x) to the other users, who use i to insert x at the appropriate location. The challenge is that your intended insertion index might move around as a result of users' inserting/deleting in front of i.

Figure: "The *gray* cat jumped on **the** table." Alice typed " the" at index 17, but concurrently, Bob typed " gray" in front of her. From Bob's perspective, Alice's insert should happen at index 22.

It's possible to work around this by "transforming" i to account for concurrent edits. That idea leads to Operational Transformation (OT), the earliest-invented approach to collaborative text editing, and the one used in Google Docs and most existing apps.
Unfortunately, OT algorithms are quite complicated, leading to numerous flawed algorithms. You can reduce complexity by using a central server to manage the document, like Google Docs does, but that precludes decentralized networks, end-to-end encryption, and server-optional open-source apps.

List CRDTs use a different perspective from OT. When you type a character in a text document, you probably don't think of its position as "index 17" or whatever; instead, its position is at a certain place within the existing text.

"A certain place within the existing text" is vague, but at a minimum, it should be between the characters left and right of your insertion point ("on" and " table" in the example above). Also, unlike an index, this intuitive position doesn't change if other users concurrently type earlier in the document; your new text should go between the same characters as before. That is, the position is immutable.

This leads to the following implementation. The list's state is a Unique Set whose values are pairs (p, x), where x is the actual value (e.g., a character), and p is a unique immutable position drawn from some abstract total order. The user-visible state of the list is the list of values x ordered by their positions p. Operations are implemented as:

- insert(i, x): The inserting user looks up the positions pL, pR of the values to the left and right (indices i and i+1), generates a unique new position p such that pL < p < pR, and calls add((p, x)) on the Unique Set.
- delete(i): The deleting user finds the Unique Set tag t of the value at index i, then calls delete(t) on the Unique Set.

Of course, we need a way to create the positions p. That's the hard part (in fact, the hardest part of any CRDT) and I don't have space to go into it here; you should use an existing algorithm (e.g., RGA) or implementation (e.g., Yjs's Y.Array). Update: I've since published two libraries for creating and using CRDT-style positions in TypeScript: list-positions (most efficient), position-strings (most flexible). Both use the Fugue algorithm.

The important lesson here is that we had to translate indices (the language of normal, non-CRDT lists) into unique immutable positions (what the user intuitively means when they say "insert here"). That leads to our second principle of CRDT design:

Principle 2. Express operations in terms of user intention, i.e., what the operation means to the user, intuitively. This might differ from the closest ordinary data type operation.

This principle works because users often have some idea what one operation should do in the face of concurrent operations. If you can capture that intuition, then the resulting operations won't conflict.

# Registers

Our last basic CRDT is the Register. This is a variable that holds an arbitrary value that can be set and get. If multiple users set the value at the same time, you pick one of them arbitrarily, or perhaps average them together.

Example uses for Registers:

- The font size of a character in a collaborative rich-text editor.
- The name of a document.
- The color of a specific pixel in a collaborative whiteboard.
- Basically, anything where you're fine with users overwriting each others' concurrent changes and you don't want to use a more complicated CRDT.

Registers are very useful and suffice for many tasks (e.g., Figma and Hex use them almost exclusively).

The only operation on a Register is set(x), which sets the value to x (in the absence of concurrent operations).
We can't perform these operations literally, since if two users receive concurrent set operations in different orders, they'll end up with different values.

However, we can add the value x to a Unique Set, following Principle 1. The state is now a set of values instead of a single value, but we'll address that soon. We can also delete old values each time set(x) is called, overwriting them.

Thus the implementation of set(x) becomes: for each element e in the Unique Set, call delete(e) on the Unique Set; then call add(x). The result is that at any time, the Register's state is the set of all the most recent concurrently-set values.

Loops of the form "for each element of a collection, do something" are common in programming. We just saw a way to extend them to CRDTs: "for each element of a Unique Set, do some CRDT operation". I call this a causal for-each operation because it only affects elements that are prior to the for-each operation in the causal order. It's useful enough that we make it our next principle of CRDT design:

Principle 3a. For operations that do something "for each" element of a collection, one option is to use a causal for-each operation on a Unique Set (or List CRDT).

(Later we will expand on this with Principle 3b, which also concerns for-each operations.)

Returning to Registers, we still need to handle the fact that our state is a set of values, instead of a specific value.

One option is to accept this as the state, and present all conflicting values to the user. That gives the Multi-Value Register (MVR).

Another option is to pick a value arbitrarily but deterministically. E.g., the Last-Writer Wins (LWW) Register tags each value with the wall-clock time when it is set, then picks the value with the latest timestamp.

Figure: In Pixelpusher, a collaborative pixel art editor, each pixel shows one color by default (LWW Register), but you can click to pop out all conflicting colors (MVR). (Image credit: Peter van Hardenberg.)

In general, you can define the value getter to be an arbitrary deterministic function of the set of values. Examples:

- If the values are colors, you can average their RGB coordinates. That seems like fine behavior for pixels in a collaborative whiteboard.
- If the values are booleans, you can choose to prefer true values, i.e., the Register's value is true if its set contains any true values. That gives the Enable-Wins Flag.

# Composing CRDTs

We now have enough basic CRDTs to start making more complicated data structures through composition. I'll describe three techniques: CRDT Objects, CRDT-Valued Maps, and collections of CRDTs.

# CRDT Objects

The simplest composition technique is to use multiple CRDTs side-by-side. By making them instance fields in a class, you obtain a CRDT Object, which is itself a CRDT (trivially correct). The power of CRDT Objects comes from using standard OOP techniques, e.g., implementation hiding.

Examples:

- In a collaborative flash card app, to make individual cards editable, you could represent each card as a CRDT Object with two text CRDT (List CRDT of characters) instance fields, one for the front and one for the back.
- You can represent the position and size of an image in a collaborative slide editor by using separate Registers for the left, top, width, and height (sketched below).
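As an illustration, here is a minimal sketch of an LWW Register plus a CRDT Object composing four of them for the image example. Class and field names are hypothetical, and the broadcast/tagging plumbing (described next) is elided; a real implementation would also discard entries that are causally overwritten by a remote set.

```
// LWW Register sketch: the state is a set of (timestamp, value) entries;
// set(x) overwrites all causally prior entries locally, and the getter
// deterministically picks the entry with the latest wall-clock timestamp.
class LWWRegister<X> {
  private entries: { time: number; value: X }[] = [];

  set(value: X): void {
    // Local effect only; the broadcast to other users is elided here.
    this.entries = [{ time: Date.now(), value }];
  }

  // Called when a concurrent set() arrives from another user.
  receiveConcurrentSet(time: number, value: X): void {
    this.entries.push({ time, value });
  }

  get(): X | undefined {
    let best: { time: number; value: X } | undefined;
    for (const e of this.entries) {
      if (best === undefined || e.time > best.time) best = e;
    }
    return best?.value;
  }
}

// CRDT Object for the image example: independent Registers for
// independent properties, following Principle 4 below.
class ImageCRDT {
  readonly left = new LWWRegister<number>();
  readonly top = new LWWRegister<number>();
  readonly width = new LWWRegister<number>();
  readonly height = new LWWRegister<number>();
}
```

Swapping the getter for one that returns all entries would turn this same state into a Multi-Value Register.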
To implement a CRDT Object, each time an instance field requests to broadcast a message, the CRDT Object broadcasts that message tagged with the field's name. Receivers then deliver the message to their own instance field with the same name.

# CRDT-Valued Map

A CRDT-Valued Map is like a CRDT Object but with potentially infinite instance fields, one for each allowed map key. Every key/value pair is implicitly always present in the map, but values are only explicitly constructed in memory as needed, using a predefined factory method (like Apache Commons' LazyMap).

Examples:

- Consider a shared notes app in which users can archive notes, then restore them later. To indicate which notes are normal (not archived), we want to store them in a set. A Unique Set won't work, since the same note can be added (restored) multiple times. Instead, you can use a CRDT-Valued Map whose keys are the documents and whose values are Enable-Wins Flags; the value of the flag for key doc indicates whether doc is in the set. This gives the Add-Wins Set.
- Quill lets you easily display and edit rich text in a browser app. In a Quill document, each character has an attributes map, which contains arbitrary key-value pairs describing formatting (e.g., "bold": true). You can model this using a CRDT-Valued Map with arbitrary keys and LWW Register values; the value of the Register for key attr indicates the current value for attr.

A CRDT-Valued Map is implemented like a CRDT Object: each message broadcast by a value CRDT is tagged with its serialized key. Internally, the map stores only the explicitly-constructed key-value pairs; each value is constructed using the factory method the first time it is accessed by the local user or receives a message. However, this is not visible externally: from the outside, the other values still appear present, just in their initial states. (If you want an explicit set of "present" keys, you can track them using an Add-Wins Set.)

# Collections of CRDTs

Our above definition of a Unique Set implicitly assumed that the data values x were immutable and serializable (capable of being sent over the network). However, we can also make a Unique Set of CRDTs, whose values are dynamically-created CRDTs.

To add a new value CRDT, a user sends a unique new tag and any arguments needed to construct the value. Each recipient passes those arguments to a predefined factory method, then stores the returned CRDT in their copy of the set. When a value CRDT is deleted, it is forgotten and can no longer be used.

Note that unlike in a CRDT-Valued Map, values are explicitly created (with dynamic constructor arguments) and deleted: the set effectively provides collaborative new and free operations.

We can likewise make a List of CRDTs.

Examples:

- In a shared folder containing multiple collaborative documents, you can define your document CRDT, then use a Unique Set of document CRDTs to model the whole folder. (You can also use a CRDT-Valued Map from names to documents, but then documents can't be renamed, and documents "created" concurrently with the same name will end up merged.)
- Continuing the Quill rich-text example from the previous section, you can model a rich-text document as a List of "rich character CRDTs", where each "rich character CRDT" consists of an immutable (non-CRDT) character plus the attributes map CRDT. This is sufficient to build a simple but inefficient Google Docs-style app with CRDTs.
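Before moving on, here is a minimal sketch of the lazy-construction idea behind the CRDT-Valued Map described above. The names are hypothetical and the per-key message routing is elided.

```
// Lazy CRDT-Valued Map sketch: values are constructed on first use by a
// factory, so every key implicitly "exists" in its initial state.
class CRDTValuedMap<K, C> {
  private explicit = new Map<string, C>();

  constructor(
    private readonly factory: () => C,
    private readonly serialize: (key: K) => string
  ) {}

  // get() never returns undefined: unseen keys get a fresh value CRDT in
  // its initial state, matching the "implicitly always present" rule.
  get(key: K): C {
    const k = this.serialize(key);
    let value = this.explicit.get(k);
    if (value === undefined) {
      value = this.factory();
      this.explicit.set(k, value);
    }
    return value;
  }

  // Incoming messages from other users are routed by serialized key;
  // receiving a message likewise constructs the value if needed.
}
```

For example, the Add-Wins Set above is just `new CRDTValuedMap(() => new EnableWinsFlag(), serializeDoc)`, assuming an `EnableWinsFlag` CRDT and a document serializer.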
# Using Composition

You can use the above composition techniques and basic CRDTs to design CRDTs for many collaborative apps. Choosing the exact structure, and how operations and user-visible state map onto that structure, is the main challenge.

A good starting point is to design an ordinary (non-CRDT) data model, using ordinary objects, collections, etc., then convert it to a CRDT version. So variables become Registers, objects become CRDT Objects, lists become List CRDTs, sets become Unique Sets or Add-Wins Sets, etc. You can then tweak the design as needed to accommodate extra operations or fix weird concurrent behaviors.

To accommodate as many operations as possible while preserving user intention, I recommend:

Principle 4. Independent operations (in the user's mind) should act on independent state.

Examples:

- As mentioned earlier, you can represent the position and size of an image in a collaborative slide editor by using separate Registers for the left, top, width, and height. If you wanted, you could instead use a single Register whose value is a tuple (left, top, width, height), but this would violate Principle 4. Indeed, then if one user moved the image while another resized it, one of their changes would overwrite the other, instead of both moving and resizing.
- Again in a collaborative slide editor, you might initially model the slide list as a List of slide CRDTs. However, this provides no way for users to move slides around in the list, e.g., swap the order of two slides. You could implement a move operation using cut-and-paste, but then slide edits concurrent to a move will be lost, even though they are intuitively independent operations. Following Principle 4, you should instead implement move operations by modifying some state independent of the slide itself. You can do this by replacing the List of slide CRDTs with a Unique Set of CRDT Objects { slide, positionReg }, where positionReg is an LWW Register indicating the position. To move a slide, you create a unique new position like in a List CRDT, then set the value of positionReg equal to that position. This construction gives the List-with-Move CRDT.

# New: Concurrent+Causal For-Each Operations

There's one more trick I want to show you. Sometimes, when performing a for-each operation on a Unique Set or List CRDT (Principle 3a), you don't just want to affect existing (causally prior) elements. You also want to affect elements that are added/inserted concurrently.

For example:

- In a rich text editor, if one user bolds a range of text, while concurrently, another user types in the middle of the range, the latter text should also be bolded. In other words, the first user's intended operation is "for each character in the range, including ones inserted concurrently, bold it".
- In a collaborative recipe editor, if one user clicks a "double the recipe" button, while concurrently, another user edits an amount, then their edit should also be doubled. Otherwise, the recipe will be out of proportion, and the meal will be ruined!

I call such an operation a concurrent+causal for-each operation. To accommodate the above examples, I propose the following addendum to Principle 3a:

Principle 3b. For operations that do something "for each" element of a collection, another option is to use a concurrent+causal for-each operation on a Unique Set (or List CRDT).

To implement this, the initiating user first does a causal (normal) for-each operation. They then send a message describing how to perform the operation on concurrently added elements.
The receivers apply the operation to any concurrently added elements they've received already (and haven't yet deleted), then store the message in a log. Later, each time they receive a new element, they check if it's concurrent to the stored message; if so, they apply the operation.

Aside. It would be more general to split Principle 3 into "causal for-each" and "concurrent for-each" operations, and indeed, this is how the previous paragraph describes it. However, I haven't yet found a good use case for a concurrent for-each operation that isn't part of a concurrent+causal for-each.

Concurrent+causal for-each operations are novel as far as I'm aware. They are based on a paper I, Heather Miller, and Christopher Meiklejohn wrote last year, about a composition technique we call the semidirect product, which can implement them (albeit in a confusing way). Unfortunately, the paper doesn't make clear what the semidirect product is doing intuitively, since we didn't understand this ourselves! My current opinion is that concurrent+causal for-each operations are what it's really trying to do; the semidirect product itself is (a special case of) an optimized implementation, improving memory usage at the cost of simplicity.

# Summary: CRDT Design Techniques

That's it for our CRDT design techniques. Before continuing to the spreadsheet case study, here is a summary cheat sheet.

Start with our basic CRDTs: Unique Set, List CRDT, and Registers. Compose these into steadily more complex pieces using CRDT Objects, CRDT-Valued Maps, and Collections of CRDTs. When choosing basic CRDTs or how to compose things, keep in mind these principles:

- Principle 1. Use the Unique Set CRDT for operations that "add" or "create" a unique new thing.
- Principle 2. Express operations in terms of user intention, i.e., what the operation means to the user, intuitively. This might differ from the closest ordinary data type operation.
- Principle 3(a, b). For operations that do something "for each" element of a collection, use a causal for-each operation or a concurrent+causal for-each operation on a Unique Set (or List CRDT).
- Principle 4. Independent operations (in the user's mind) should act on independent state.

# Case Study: A Collaborative Spreadsheet

Now let's get practical: we're going to design a CRDT for a collaborative spreadsheet editor (think Google Sheets).

As practice, try sketching a design yourself before reading any further. The rest of this section describes how I would do it, but don't worry if you come up with something different; there's no one right answer! The point of this blog post is to give you the confidence to design and tweak CRDTs like this yourself, not to dictate "the one true spreadsheet CRDT™".

# Design Walkthrough

To start off, consider an individual cell. Fundamentally, it consists of a text string. We could make this a Text (List) CRDT, but usually, you don't edit individual cells collaboratively; instead, you type the new value of the cell, hit enter, and then its value shows up for everyone else. This suggests instead using a Register, e.g., an LWW Register.

Besides the text content, a cell can have properties like its font size, whether word wrap is enabled, etc. Since changes to these properties are all independent operations, following Principle 4, they should have independent state. This suggests using a CRDT Object to represent the cell, with a different CRDT instance field for each property.
In pseudocode (using extends CRDTObject to indicate CRDT Objects):

```
class Cell extends CRDTObject {
  content: LWWRegister<string>;
  fontSize: LWWRegister<number>;
  wordWrap: EnableWinsFlag;
  // ...
}
```

The spreadsheet itself is a grid of cells. Each cell is indexed by its location (row, column), suggesting a map from locations to cells. (A 2D list could work too, but then we'd have to put rows and columns on an unequal footing, which might cause trouble later.) Thus let's use a Cell-CRDT-Valued Map.

What about the map keys? It's tempting to use conventional row-column indicators like "A1", "B3", etc. However, then we can't easily insert or delete rows/columns, since doing so renames other cells' indicators. (We could try making a "rename" operation, but that violates Principle 2, since it does not match the user's original intention: inserting/deleting a different row/column.)

Instead, let's identify cell locations using pairs (row, column), where "row" means "the line of cells horizontally adjacent to this cell", independent of that row's literal location (1, 2, etc.), and likewise for "column". That is, we create an opaque Row object to represent each row, and likewise for columns, then use pairs (Row, Column) for our map keys.

The word "create" suggests using Unique Sets (Principle 1), although since the rows and columns are ordered, we actually want List CRDTs. Hence our app state looks like:

```
rows: ListCRDT<Row>;
columns: ListCRDT<Column>;
cells: CRDTValuedMap<[row: Row, column: Column], Cell>;
```

Now you can insert or delete rows and columns by calling the appropriate operations on columns and rows, without affecting the cells map at all. (Due to the lazy nature of the map, we don't have to explicitly create cells to fill a new row or column; they implicitly already exist.)

Speaking of rows and columns, there's more we can do here. For example, rows have editable properties like their height, whether they are visible, etc. These properties are independent, so they should have independent states (Principle 4). This suggests making Row into a CRDT Object:

```
class Row extends CRDTObject {
  height: LWWRegister<number>;
  isVisible: EnableWinsFlag;
  // ...
}
```

Also, we want to be able to move rows and columns around. We already described how to do this using a List-with-Move CRDT:

```
class MovableListEntry<C> extends CRDTObject {
  value: C;
  positionReg: LWWRegister<UniqueImmutablePosition>;
}

class MovableListOfCRDTs<C> extends CRDTObject {
  state: UniqueSetOfCRDTs<MovableListEntry<C>>;
}

rows: MovableListOfCRDTs<Row>;
columns: MovableListOfCRDTs<Column>;
```

Next, we can also perform operations on every cell in a row, like changing the font size of every cell. For each such operation, we have three options:

- Use a causal for-each operation (Principle 3a). This will affect all current cells in the row, but not any cells that are created concurrently (when a new column is inserted). E.g., a "clear" operation that sets every cell's value to "".
- Use a concurrent+causal for-each operation (Principle 3b). This will affect all current cells in the row and any created concurrently. E.g., changing the font size of a whole row.
- Use an independent state that affects the row itself, not the cells (Principle 4). E.g., our usage of Row.height for the height of a row.
# Finished Design

In summary, the state of our spreadsheet is as follows.

```
// ---- CRDT Objects ----

class Row extends CRDTObject {
  height: LWWRegister<number>;
  isVisible: EnableWinsFlag;
  // ...
}

class Column extends CRDTObject {
  width: LWWRegister<number>;
  isVisible: EnableWinsFlag;
  // ...
}

class Cell extends CRDTObject {
  content: LWWRegister<string>;
  fontSize: LWWRegister<number>;
  wordWrap: EnableWinsFlag;
  // ...
}

class MovableListEntry<C> extends CRDTObject {
  value: C;
  positionReg: LWWRegister<UniqueImmutablePosition>;
}

class MovableListOfCRDTs<C> extends CRDTObject {
  state: UniqueSetOfCRDTs<MovableListEntry<C>>;
}

// ---- App state ----

rows: MovableListOfCRDTs<Row>;
columns: MovableListOfCRDTs<Column>;
cells: CRDTValuedMap<[row: Row, column: Column], Cell>;
```

Note that I never explicitly mentioned CRDT correctness, i.e., the claim that all users see the same document state after receiving the same messages. Because we assembled the design from existing CRDTs using composition techniques that preserve CRDT correctness, it is trivially correct. Plus, it should be straightforward to reason out what would happen in various concurrency scenarios.

As exercises, here are some further tweaks you can make to this design, phrased as user requests:

- "I'd like to have multiple sheets in the same document, accessible by tabs at the bottom of the screen, like in Excel." Hint (highlight to reveal): Use a List of CRDTs.
- "I've noticed that if I change the font size of a cell, while at the same time someone else changes the font size for the whole row, sometimes their change overwrites mine. I'd rather keep my change, since it's more specific." Hint: Use a Register with a custom getter.
- "I want to reference other cells in formulas, e.g., = A2 + B3. Later, if B3 moves to C3, its references should update too." Hint: Store the reference as something immutable.

# Conclusion

I hope you've gained an understanding of how CRDTs work, plus perhaps a desire to apply them in your own apps. We covered a lot:

- Traditional CRDTs: Unique Set, List/Text, LWW Register, Enable-Wins Flag, Add-Wins Set, CRDT-Valued Map, and List-with-Move.
- Novel operations: concurrent+causal for-each operations on a Unique Set or List CRDT.
- Whole apps: spreadsheet, rich text, and pieces of various other apps.

For more info, crdt.tech collects most CRDT resources in one place. For traditional CRDTs, the classic reference is Shapiro et al. 2011, while Preguiça 2018 gives a more modern overview.

I've also put many of these ideas into practice in a library, Collabs. You can learn more about Collabs, and see how open-source collaborative apps might work in practice, in my Local-First Web talk: Video, Slides, Live demo.

# Related Work

This blog post's approach to CRDT design, using simple CRDTs plus composition techniques, draws inspiration from a number of sources. Most similar is the way Figma and Hex describe their collaboration platforms; they likewise support complex apps by composing simple, easy-to-reason-about pieces. Relative to those platforms, I incorporate more academic CRDT designs, enabling more flexible behavior and server-free operation.

The specific CRDTs I describe are based on Shapiro et al. 2011 unless noted otherwise. Note that they abbreviate "Unique Set" to "U-Set".
For composition techniques, a concept analogous to CRDT Objects appears in BloomL; CRDT-Valued Maps appear in Riak and BloomL; and the Collections of CRDTs are inspired by how Yjs's Y.Array and Y.Map handle nested CRDTs.

Similar to how I build everything on top of a Unique Set CRDT, Mergeable Replicated Data Types are all built on top of a (non-unique) set with a 3-way merge function, but in a quite different way.

Other systems that allow arbitrary nesting of CRDTs include Riak, Automerge, Yjs, and OWebSync.

# Acknowledgments

I thank Heather Miller, Ria Pradeep, and Benito Geordie for numerous CRDT design discussions that led to these ideas. Jonathan Aldrich, Justine Sherry, and Pratik Fegade reviewed a version of this post that appears on the CMU CSD PhD Blog. I am funded by an NDSEG Fellowship sponsored by the US Office of Naval Research.

# Appendix: What's Missing?

The CRDT design techniques I describe above are not sufficient to reproduce all published CRDT designs, or more generally, all possible Strong Eventual Consistency data structures. This is deliberate: I want to restrict to (what I consider) reasonable semantics and simple implementations.

In this section, I briefly discuss some classes of CRDTs that aren't covered by the above design techniques.

# Optimized Implementations

Given a CRDT designed using the above techniques, you can often find an alternative implementation that has the same semantics (user-visible behavior) but is more efficient.

For example, suppose you are counting clicks by all users. According to the above techniques, you should use a Unique Set CRDT and add a new element to the set for each click (Principle 1), then use the size of the set as the number of clicks. But there is a much more efficient implementation: merely store the number of clicks, and each time a user clicks, send a message instructing everyone to increment their number. This is the classic Counter CRDT.
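A minimal sketch of that optimization (hypothetical names, network plumbing elided):

```
// Classic Counter CRDT sketch: increments commute, so replicas can
// apply them in any order and still converge on the same count.
class Counter {
  private count = 0;

  constructor(private readonly broadcast: (msg: "increment") => void) {}

  increment(): void {
    this.receive("increment");   // apply locally
    this.broadcast("increment"); // relay to everyone else
  }

  receive(_msg: "increment"): void {
    this.count++;
  }

  get value(): number {
    return this.count;
  }
}
```

This has the same user-visible behavior as the Unique Set version (the count of all clicks ever made), but stores one integer instead of one element per click.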
Peritext is a rich-text CRDT that allows formatting operations to also affect concurrently inserted characters, like our example above. Instead of using concurrent+causal for-each operations, it stores formatting info at the start and end of the range, then does some magic to make sure that everything works correctly. This is much more efficient than applying formatting to every affected character like in our example, especially for memory usage.

My advice is to start with an implementation using the techniques in this blog post. That way, you can pin down the semantics and get a proof-of-concept running quickly. Later, if needed, you can make an alternate, optimized implementation that has the exact same user-visible behavior as the original (enforced by mathematical proofs, unit tests, or both).

Doing things in this order (above techniques, then optimize) should help you avoid some of the difficulties of traditional, from-scratch CRDT design. It also ensures that your resulting CRDT is both correct and reasonable. In other words: beware premature optimization! (3 out of 4 Peritext authors mention that it was difficult to get working: here, here, and here.)

One way to optimize is with a complete rewrite at the CRDT level. For example, relative to the rich text CRDT that we sketched above (enhanced with concurrent+causal for-each operations), Peritext looks like a complete rewrite. (In reality, Peritext came first.)

Another option is to "compress" the CRDT's state/messages in some way that is easy to map back to the original CRDT. That is, in your mind (and in the code comments), you are still using the CRDT derived from this blog post, but the actual code operates on some optimized representation of the same state/messages.

For example, in the rich text CRDT sketched above, if storing separate formatting registers for each character uses too much memory, you could compress the state by deduplicating the identical formatting entries that result when a user formats a range of text. Then the next time you receive an operation, you decompress the state, apply that operation, and recompress. Likewise, if formatting a range of characters individually generates too much network traffic (since there is one CRDT message per character), you could instead send a single message that describes the whole formatting operation, then have recipients decompress it to yield the original messages.

# Optimized Semantics

Some existing CRDTs deliberately choose behavior that may look odd to users, but has efficiency benefits. The techniques in this blog post don't always allow you to construct those CRDTs.

The example that comes to my mind concerns what to do when one user deletes a CRDT from a CRDT collection, while concurrently, another user makes changes to that CRDT. E.g., one user deletes a presentation slide while someone else is editing the slide. There are a few possible behaviors:

- "Deleting" semantics: The deletion wins, and concurrent edits to the deleted CRDT are lost. This is the semantics we adopt for the Collections of CRDTs above; I attribute it to Yjs. It is memory-efficient (deleted CRDTs are not kept around forever), but can lose data.
- "Archiving" semantics: CRDTs are never actually deleted, only archived. Concurrent edits to an archived CRDT apply to that CRDT as usual, and if the new content is interesting, users can choose to un-archive. We described how to do this using the Add-Wins Set above. This is the nicest semantics for users, but it means that once created, a CRDT stays in memory forever.
- "Resetting" semantics: A delete operation "resets" its target CRDT, undoing the effect of all (causally) prior operations. Concurrent edits to the deleted CRDT are applied to this reset state. E.g., if a user increments a counter concurrent to its deletion, then the resulting counter value will be 1, regardless of what its state was before deletion. This semantics is adopted by the Riak Map and a JSON CRDT paper, but it is not possible using the techniques in this blog post. It is memory-efficient (deleted CRDTs are not kept around forever) and does not lose data, but has weird behavior. E.g., if you add some text to a slide concurrent to its deletion, then the result is a slide that has only the text you just entered. This is also the cause of the map anomaly shown earlier.

I am okay with not supporting odd semantics because user experience seems like a first priority. If your desired semantics results in poor performance (e.g. "archiving" semantics leading to unbounded memory usage), you can work around it once it becomes a bottleneck, e.g., by persisting some state to disk.

# Hard Semantics

Some CRDTs seem useful and not prematurely optimized, but I don't know how to implement them using the techniques in this blog post. Two examples:

- A Tree CRDT with a "move" operation, suitable for modeling a filesystem in which you can cut-and-paste files and folders. The tricky part is that two users might concurrently move folders inside of each other, creating a cycle, and you have to somehow resolve this (e.g., pick one move and reject the other).
  Papers: Kleppmann et al., Nair et al.

- A CRDT for managing group membership, in which users can add or remove other users. The tricky part is that one user might remove another, concurrent to that user taking some action, and then you have to decide whether to allow it or not. This is an area of active research, but there have been some proposed CRDTs, none of which appear to fit into this blog post.

Aside. There is a sense in which the Unique Set CRDT (hence this blog post) is "CRDT-complete", i.e., it can be used to implement any CRDT semantics: you use a Unique Set to store the complete operation history together with causal ordering info, then compute the state as a function of this history, like in pure op-based CRDTs. However, this violates the spirit of the blog post, which is to give you guidance on how to design your CRDT.

# Beyond CRDTs

CRDTs, and more broadly Strong Eventual Consistency, are not everything. Some systems, including some collaborative apps, really need Strong Consistency: the guarantee that events happen in a serial order, agreed upon by everyone. E.g., monetary transactions. So you may need to mix CRDTs with strongly consistent data structures; there are a number of papers about such "Mixed Consistency" systems, e.g., RedBlue consistency.
# Lightning-fast rebases with git-move

Oct 12, 2021

You can use git move as a drop-in 10x faster replacement for git rebase (see the demo). The basic syntax is

```
$ git move -b <branch> -d <dest>
```

How do I install it? The git move command is part of the git-branchless suite of tools. See the installation instructions.

What does "rebase" mean? In Git, to "rebase" a commit means to apply a commit's diff against its parent commit as a patch to another target commit. Essentially, it "moves" the commit from one place to another.

How much faster is it? See Timing. If the branch is currently checked out, then 10x is a reasonable estimate. If the branch is not checked out, then it's even faster.

Is performance the only added feature? git move also offers several other quality-of-life improvements over git rebase. For example, it can move entire subtrees, not just branches. See the git move documentation for more information.

Timing

I tested on the Git mirror of Mozilla's gecko-dev repository. This is a large repository with ~750k commits and ~250k working copy files, so it's good for stress tests.

It takes about 10 seconds to rebase 20 commits with git rebase, versus about 1 second with git move. These timings are not scientific, and there are optimizations that can be applied to both, but the order of magnitude is roughly correct in my experience.

Since git move can operate entirely in-memory, it can also rebase branches which aren't checked out. This is much faster than using git rebase, because it doesn't have to touch the working copy at all.

Why is it faster?

There are two main problems with the Git rebase process:

1. It touches disk.
2. It uses the index data structure to create tree objects.

With a stock Git rebase, you have to check out to the target commit, and then apply each of the commits' contents individually to disk. After each commit's application to disk, Git will implicitly check the status of files on disk again. This isn't strictly necessary for many rebases, and can be quite slow on sizable repos.

When Git is ready to apply one of the commits, it first populates the "index" data structure, which is essentially a sorted list of all of the files in the working copy. It can be expensive for Git to convert the index into a "tree" object, which is used to store commits internally, as it has to insert or re-insert many already-existing entries into the object database. (There are some optimizations that can improve this, such as the cache tree extension.)

Work is already well underway on upstream Git to support the features which would make in-memory rebases feasible, so hopefully we'll see mainstream Git enjoy similar performance gains in the future.

What about merge conflicts?

If an in-memory rebase produces a merge conflict, git move will cancel it and restart it as an on-disk rebase, so that the user can resolve merge conflicts. Since in-memory rebases are typically very fast, this doesn't usually impede the developer experience.

Of course, it's possible in principle to resolve merge conflicts in-memory as well.

Related work

In-memory rebases are not a new idea:

- GitUp (2015), a GUI client for Git with a focus on manipulating the commit graph. Unfortunately, in my experience, it doesn't perform too well on large repositories. To my knowledge, no other Git GUI client offers in-memory rebases.
What about merge conflicts?

If an in-memory rebase produces a merge conflict, git move will cancel it and restart it as an on-disk rebase, so that the user can resolve the merge conflicts. Since in-memory rebases are typically very fast, this doesn't usually impede the developer experience. Of course, it's possible in principle to resolve merge conflicts in-memory as well.

Related work

In-memory rebases are not a new idea:

GitUp (2015), a GUI client for Git with a focus on manipulating the commit graph. Unfortunately, in my experience, it doesn't perform too well on large repositories. To my knowledge, no other Git GUI client offers in-memory rebases; please let me know of others, so that I can update this comparison.

git-revise (2019), a command-line utility which allows various in-memory edits to commits. git-revise is a replacement for git rebase -i, not git rebase. It can reorder commits, but it isn't intended to move commits from one base to another. See Interactive rebase below.

Other source control systems have in-memory rebases, such as Mercurial and Jujutsu.

The goal of the git-branchless project is to improve developer velocity with various features that can be incrementally adopted by users, such as in-memory rebases. Performance is an explicit feature: it's designed to work with monorepo-scale codebases.

Interactive rebase

Interactive rebase (git rebase -i) is a feature which can be used to modify, reorder, combine, etc. several commits in sequence. git move does not do this at present, but this functionality is planned for a future git-branchless release. Watch the GitHub repository to be notified of new releases. In the meantime, you can use git-revise. Unfortunately, git-branchless and git-revise do not interoperate well, due to git-revise's lack of support for the post-rewrite hook (see this issue).

Related posts

The following are hand-curated posts which you might find interesting.

19 Jun 2021: git undo: We can do better
12 Oct 2021: (this post) Lightning-fast rebases with git-move
19 Oct 2022: Build-aware sparse checkouts
16 Nov 2022: Bringing revsets to Git
05 Jan 2023: Where are my Git UI features from the future?
11 Jan 2024: Patch terminology

Comments: Discussion on Lobsters
# Document Title
Steno & PL

git undo: We can do better

Jun 19, 2021

Update for future readers: Are you looking for a way to undo something with Git? The git undo command won't help with your current issue (it needs to be installed ahead of time), but it can make dealing with future issues a lot easier. Try installing git-branchless, and then see the documentation for git undo.

Motivation

Git is a version control system with robust underlying principles, and yet novice users are terrified of it. When they make a mistake, many would rather delete and re-clone the repository than try to fix it. Even proficient users can find wading through the reflog tedious.

Why? How is it so easy to "lose" your data in a system that's supposed to never lose your data?

Well, it's not that it's too easy to lose your data, but rather that it's too difficult to recover it. For each operation you want to recover from, there's a different "magic" incantation to undo it. All the data is still there in principle, but it's not accessible to many in practice.

Here's my theory: novice and intermediate users would significantly improve their understanding and efficacy with Git if they weren't afraid of making mistakes.

Solution

To address this problem, I offer git undo, part of the git-branchless suite of tools. To my knowledge, this is the most capable undo tool currently available for Git. For example, it can undo bad merges and rebases with ease, and there are even some rare operations that git undo can undo which can't be undone with git reflog.

Update 2021-06-21: user gldnspud on Hacker News points out that the GitUp client also supports undo/redo via snapshots, also by adding additional plumbing on top of Git.

I've presented demos below, and briefly discussed the implementation at the end of the article. My hope is that by making it easier to fix mistakes, novice Git users will be able to experiment more freely and learn more effectively.

Demos

Undoing an amended commit, and undoing a merge conflict that was resolved wrongly. (The original post embeds recordings of both.)

Implementation

git undo is made possible by a recent addition to Git: the reference-transaction hook. This hook triggers whenever a change is made to a reference, such as a branch. By recording all reference moves, we can rebuild the state of the commit graph at any previous point in time. Then we accomplish the undo operation by restoring all references to their previous positions in time (possibly creating or deleting references in the process).

I originally built git-branchless in order to replicate a certain Mercurial workflow, but the data structures turned out to be flexible enough to give us a git undo feature nearly for free. You can find more detail at the Architecture page for the project.
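The hook's interface is small. Here is a minimal sketch of a reference-transaction hook that just logs every committed reference move; it illustrates the mechanism described above, and is not git-branchless's actual implementation:

#!/bin/sh
# .git/hooks/reference-transaction
# Git invokes this hook with the transaction state ("prepared", "committed",
# or "aborted") as $1, and feeds one "<old-oid> <new-oid> <ref-name>" line
# per updated reference on stdin.
[ "$1" = "committed" ] || exit 0
while read -r old new ref; do
    printf '%s %s %s\n' "$old" "$new" "$ref" >> "${GIT_DIR:-.git}/ref-moves.log"
done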
Related posts

The following are hand-curated posts which you might find interesting.

19 Jun 2021: (this post) git undo: We can do better
12 Oct 2021: Lightning-fast rebases with git-move
19 Oct 2022: Build-aware sparse checkouts
16 Nov 2022: Bringing revsets to Git
05 Jan 2023: Where are my Git UI features from the future?
11 Jan 2024: Patch terminology

Comments: Discussion on Hacker News

# Document Title
Darcs Because Git Won (talk transcript)

Hi everyone. This is a re-recording of the talk I wanted to give for BOB Conf; we had a couple of technical issues, so I decided to do a re-recording so we can have the talk as it was intended. So yeah, this is "Darcs because Git won". Admittedly that's a little bit clickbaity, but it's meant to be a bit tongue-in-cheek, so bear with me.

First of all, who am I? I'm Raichoo. I've been in the Haskell community for around 10 years now, and I've been writing software for 30 years: lots of C++, lots of Java, but 10 years ago I found this programming language called Haskell, which I enjoy quite a lot, and I've been using it for lots of stuff since then. I'm the author of various open source projects; some of you might know haskell-vim, a plugin for Vim that does syntax highlighting and indentation and things like that, which for some reason has become quite popular and which people seem to enjoy. I speak regularly at conferences, so if you look me up on YouTube you will find a couple of talks I've done over the years. I'm also co-founder of a hackspace in Bielefeld called Acme Labs. Hackspaces are these loose clubhouses, if you want, where people meet and do projects: art, software, whatever comes to mind. And I'm also the co-founder and CTO of a software consultancy in Bielefeld; we write software for companies that want certain problems solved, and we deliver that. Well, at least that's the intention. Maybe you've heard of us: we also did the Advent of Haskell 2020, which I think was quite popular and which a lot of people enjoyed; it produced a couple of interesting blog posts that you can still read online.

So yeah, I said the talk is a little bit clickbaity, so I want to talk about what this talk is not. This talk is not "Git bad, Darcs good". I don't want to convince you to switch everything to Darcs and abandon Git; I just want to present an alternative to what is basically the mainstream at the moment, because I think Darcs is a very interesting version control system. It does things in a genuinely different way: it's not just a different UI for the same concept, it is conceptually very different. This talk is also not a transition guide. I can't just take a Git workflow and transform it into a Darcs workflow; workflows grow organically. This talk will only give you hints at what Darcs can do; maybe that fits your workflow, maybe it doesn't, and you will have to find ways to make it work for you if you're interested.

Okay, let's get started. Well, not quite: first we have to talk a little bit about Darcs' image. When you hear about Darcs, there are a couple of quite common misconceptions about the system.

Some people consider it to be dead. I can assure you it's not: just recently we had a major release, 2.16, and we're now at version 2.16.3. It's very active, it's still being developed by very clever people, and it's alive and kicking.

Another thing I hear quite a lot is that people complain that it's slow, and that statement doesn't necessarily make sense on its own: when you say something is slow, you have to compare it to something. When you compare it to Git, it's certainly slower; Git is pretty much built around the idea that it has to be optimized
in every nook and cranny; there are optimizations everywhere. To me personally, it really isn't that big of a deal if an operation takes 500 milliseconds instead of 20 milliseconds. That might be an issue for you, but with most of the projects I'm working on, the version control system is not the bottleneck. The bottleneck is how I use it and what kind of workflows it enables, and to me that's way more important than shaving three seconds off my push.

Another thing I hear quite a lot about is the exponential merge problem, and people talk about that a lot on forums and communication platforms. That has been an issue with earlier versions of Darcs, where it was quite prominent. The issue is that when you're merging things you can run into exponential runtime, which is certainly not something you would desire, but the Darcs people have put a lot of work into finding those edge cases, fixing them, and making the algorithm and its performance much better. Personally, I've been using Darcs for years and I never encountered the problem. There are certainly still places where these edge conditions can happen, but they are exceptionally rare. And that brings me back to the idea that Darcs is dead: the developers are working on a new patch theory whose goal is to eliminate all those edge cases and make merging more performant. That's very interesting ongoing work, and I'm going to link you to the paper that lays the foundation for it.

This one is my favorite: "everyone knows how to use Git". What does that even mean? Whenever I hear that, I think of this, and please don't get me wrong: you can do the same things with Darcs or any other version control system. To me, the notion that you "know how to use" something is pretty much like saying, of a programming language, "I know the syntax". What's more important is that you can produce something meaningful with that language, and to me it's the same with version control. Yes, you can use the version control system: you can create patches, you can push stuff. But the most important thing is how to build meaningful patches, especially in a working context. I know that's a hard thing if you're doing a prototype, and I admit I'm sometimes guilty of doing just that, especially in a prototype situation where we're not 100% sure what the end result is going to be. To me, a version control system has to really give you something back when you're crafting patches in a meaningful way; you have to be able to use them as well as possible, and Darcs pretty much delivers that. I hope I can illustrate it later in the demo.

So now we're really getting started: what's the difference anyway? From the outside looking in, it might look like Darcs is just another version control system with a slightly different UI, where underneath everything works completely the same. This is the point where you start to differentiate between a snapshot-based version control system, something like Git or Mercurial, and a patch-based version control system. This notion might seem a little bit weird because it centers on patches.
What does that mean? I will show you in the demo, but first let's look at a situation with four characters: we have Alice, Bob, Charlie, and Eve. Alice and Bob both have a patch A; Charlie has been developing a patch B, a new feature or documentation or whatever you want; and Eve has been working on a patch C.

As it turns out, Alice and Charlie are working closely together. They are on a team together, but at the moment they're not in the same office; they have to work remotely. There are different situations where that might occur, like a global pandemic. Those two communicate on a regular basis, so Alice knows about Charlie's patch, and she's eager to try it, so she pulls that patch into her repository. A couple of minutes later she hears about Eve's patch. Eve is not on her team, so they don't communicate frequently, but now Alice thinks: that's actually something I want to pull in as well, because the work I'm currently doing would benefit from it. So she pulls that in too. On the other side of the city, or the planet, Bob suddenly realizes that Eve, who he's working closely with, has written patch C, and he's like: great, I've been waiting for that patch for a long time, and pulls it in. A couple of minutes later he hears about Charlie's patch and thinks: I should pull that in as well. So now Alice's repository has the patches in the order she pulled them in, A, B, C, and Bob has A, C, B, and you can maybe sense that something interesting is going on here.

How do their repositories look now? With Git, we would have something like this: we start at a common starting point; Alice pulls in Charlie's feature, she pulls in Eve's feature, and then she merges. What you can see is that the head commits of the two repositories differ, because Charlie's and Eve's patches diverged from the common starting point and now have to be brought back together, and those merge commits are created by the people who pulled in the patches. It's not a fast-forward; something has to be merged. What's happening is that when you pull in patches in a different order, things turn out very different. It's like standing in front of a banana and an apple: if you take the apple first and then the banana, you end up in a different state than if you take the banana first and then the apple.

Now I want to show you how Darcs deals with the same situation. It's demo time; hopefully this is going to work out. I'm going to start with a clean slate: there's nothing there, and I'm going to make a repository "alice", go into it, and show you how Darcs works. I'm on my remote machine at the moment. I've initialized the repository, and you can see this _darcs folder, which is where Darcs does all its bookkeeping. Now we're going to write a little file (and I want to apologize for the color scheme); so this is the file.
the repoandthis file is now being tracked and wewant to recordum our initial patchinitial record yeah and now you can seethe docs workflow is like superinteractive yeswhat changes do we want to record yes wehave added a fileand we've added the content so darks isbasically prompting us of what we wantto do which islike super helpful if you want to domore complex operationsso if you want to take a look at therepository now we can see thatwe have recorded a patch so now let'sjustcreate all thoseall those repositories that we have hadbefore we've createdbob we create charlie and wecreate so all these repositories are nowin the same stateso let's go to um let's go to tocharlie first and uhbuild charlie's patch here and whichonly basically does issomething like super simple it's not akeyboard i know so it's a little bitweirdoh we just add a to the top of the fileand nowwe can ask darts what's new and it saysokay atline one we are going to add character aso let's record thatyes yes this is charlieand let's do the same thing with eveso what eve is doing she's essentiallyjust going to add bto the end of the fileyes we're going to add that and this isso let's go back to alice and reproducethe workflow that we just hadsoalice is going to pull from charliefirstso yeah and now we can look at the fileexcuse me you can say we can see thepatch has been appliedso do the same thing to eand we can see those two files have beenmergedso yeah if we look at the history we cannow seeum we've pulled from charlie first thenwe pull from eve and what's importanthereis those hashes those those hashes areessentially the identityof of what's going on hereso of the identity of the patch to bemore precise so let's go to boband pull we do that the other way in abrown the first thing is we do we pullfrom eve becausethat's who charlie's been working withand we pull thatpull that patch and let's take a look atthatso let's let's let's let's look at thosethings side by side let's go into umalice's repository and the darks loghereand here you can see so this is this iswhatcharlie is seeing uh excuse me aboutwhat bob is seeingand you can see that even thoughso this um just to be precise herewhat's happening here is a cherry pickdart is basically constantly doingcherry picks and mergingand even though i've cherry picked thispatchfrom um from charlie's repository fromexcuse me from if's repositoryum they actually have the samethe same identity the same hash eventhoughthey have been pulled in differentorders that have been cherry picked indifferent orders so what darksdoes is that it doesn't change a patchidentity when it does cherry pickingwhich is like immensely neat i thinkso let's do the same thing with charlieyes yes and as you can seeit's the same thing you can even dosomething like we can get to go fromthat repository andumdear test test repositoryand it just to demonstrate the cherrypicking property here if wepull from let's say alice she's got allthe patchesum yes i want to take the initial recorddo i want to take charlie'sno but i want to take uselet's look at that let's first look atthe fileoh come on yeah and now you can seeeven though i've cherry picked the bjust the bpatch and left the a patch behind umi can still end up with result andthis thing this patch this has still thesame identityso how can we see that torepositories have the same statethat's that's the important part we canlook at show repoand here is this thing which is calledthe weak hash and the weak hashhas is computed about all the set of theset of all 
Something else I find quite nice is that Darcs can produce graph output (I'm usually using different tools for this), and here you can see what's actually happening: Darcs can figure out that these two patches do not actually depend on each other, even in a situation where you're in the same repository and recording both patches yourself. The patch that Charlie wrote, which just added "a" to the file, only depends on the initial record being there, and it's the same with the change that added "b". So we can pick and choose, and this becomes even more interesting when there are many more patches and features in the repository: we can pick and choose all the different things we want and look at how they interact with each other. Of course, we can also state dependencies explicitly, saying "I want this patch to depend on that patch", but I'm not going to show that here. Darcs basically treats a repository as a set of patches, and I mean set in the mathematical sense: the order of the elements doesn't matter; if I have {B, A} and {A, B}, it's the same set.

So let's talk a little bit about how Darcs does that: let's talk about patch theory. We Haskellers all like our theory a lot, but let me get this straight: you don't need to know the theory to enjoy or be able to work with Darcs. It's the same as with Haskell: you don't have to know category theory to know Haskell. If you went to a category theorist and explained, from a Haskell perspective, what a monad is, they would say "that's not a monad, that's just a very special case". Even the theory we use in Haskell is not exactly the real deal, and there are very passionate discussions about that. So even though the theory is intellectually stimulating, it's not mandatory.

If we're talking about patch theory, let's talk about what a patch is. A patch A takes a repository from some state o to state a; that's what it does: adding a file, writing content to a file, that's pretty simple. If we have two patches, and patch A takes a repository from o to a, and patch B takes it from a to b, we can compose them. We're Haskellers, that's what we like, we compose things all the time, and so does patch theory: we can sequence those patches up.

We also want to be able to invert a patch, to take changes back: if we added a line, we want to be able to get rid of that line. So for every patch there should be an inverse: if patch A goes from o to a, A inverted goes from a to o, so we can go back.

And we should be able to commute patches. This is a little bit more tricky (and I apologize for the notation), but it's important for understanding how Darcs does what I just showed you. Take patches A and B that do not depend on each other. What does that mean? If patch A adds a file and patch B edits the content of that file, of course we can't swap those operations around: you can't edit the content of a file that hasn't been added yet. But in the situation we just had, patch A adds something to the top of the file, at line one, and patch B adds something further down, say at line six. Those we can swap around. The subscripts in the notation mean that the swapped patches are not literally the same: if we swap them, B's addition shifts up a line, so B1 now adds at line five, while A1 still adds at line one. The patches are equivalent (A to A1, B to B1): they represent the same change, even though one adds at line six and its counterpart at line five. What's important is that the context we end up in is identical: applying A then B ends in the same state as applying B1 then A1.

So how does a merge work, then? A merge takes two patches made in parallel and brings them together. Say patch B takes the repository from state a to b, and patch C takes it from a to c. We can't just concatenate C after B, because C starts from state a and we are now in a repository at state b. But with the mechanisms we have, we can do it: first we invert B, so we're back in state a; then we can apply C. The information of B is still in there, even though it has been reverted. Now we commute B inverted and C, ending up with C1 and B1 inverted, and then we simply throw away the B1 part. And voilà: we have a merge; we ended up with the state we wanted, with both patches pulled in. The nice thing is that we can now reason algebraically about patches, which I find really exciting.
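Written out in the talk's notation, the merge of two patches recorded against the same context goes like this; this is a sketch of the construction as described in the talk, and Darcs' actual patch theory is more careful about contexts and about when commutation is defined:

\begin{align*}
B &: a \to b, \qquad C : a \to c && \text{(two patches from a common context)}\\
B^{-1} &: b \to a && \text{(invert $B$ to step back to the common context)}\\
B^{-1}C &: b \to c && \text{(a valid sequence, but it undoes $B$)}\\
B^{-1}C &\longleftrightarrow C_1 B_1^{-1} && \text{(commute the pair; $C_1$ now applies on top of $B$)}\\
\mathrm{merge}(B, C) &= B\,C_1 && \text{(keep $C_1$, discard $B_1^{-1}$)}
\end{align*}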
Another thing that's super exciting to me is that merging is symmetric: it doesn't matter in which order I pull in those patches. The good thing about this: we first started using Darcs in our hackspace, and hackspaces are very loose groups of hackers working together. Not everyone works against a single central repository; everyone picks and pulls from everyone else. In this situation it's super useful that you don't have to maintain the same order. We can work together with a very loose workflow and still pull things together the way we want to, neglect some patches, and take the ones that seem fit for our current situation. That's pretty neat; I enjoy it quite a lot.

So what applications does this have? I have a Wayland compositor project, about 20,000 lines of C code, so a reasonably large project, and I keep three repositories for it. I use the model that Darcs enables quite heavily: I have a "current" repository where I pull in all the patches; all the work happens there. (On the slide I highlighted in red the part of the version number that each repository's releases increase.) Whenever there's a new feature that doesn't break functionality or introduce breaking changes, I pull it over to the "stable" repository, and whenever I have a bug fix that doesn't introduce any new features, I pull it over to "release". This is essentially the workflow we're using at my company as well. It has proven to be very solid, and it lets me do some very fluent and, to me, fun release engineering; I enjoy it way more than I did with Git, but that's a personal preference.
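As commands, that flow looks roughly like this; a sketch, with the repository names taken from the talk, and each pull being an interactive pick:

$ cd stable
$ darcs pull ../current     # take only the patches that don't break anything
$ cd ../release
$ darcs pull ../stable      # take only the bug fixes for the point release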
Another thing: PassVeil is an open source project that we are going to open-source within the next few days. It's essentially a password manager, something like password store. Password store uses Git underneath, and we thought the model of Darcs is super neat if you have a very distributed password manager: lots of people working on lots of different passwords, and you constantly have to merge all those states. Since Darcs is very sophisticated at merging, we thought it would be an interesting foundation for a password manager. We also model things like trust: if I know a password and want to give it to someone else, we can use PassVeil to transfer it, so we don't have to use insecure channels like post-its, chat, or email; we can use the password manager itself to transfer passwords to other people and work together with them. This has also proven quite beneficial, especially in the very distributed situation we are in at the moment.

Now, tooling. If you are using Git, the tooling is great, of course: a lot of people use it, and a lot of people write really great tools. We are in the situation where we have to build a lot of stuff ourselves, which is fun, but it's also work, and it's something we've been doing for the last few months. I built something called darcsgutter, a plugin for Neovim written in Lua that does what git-gutter does: it shows changes, lets us look at the different hunks in a file, jump to hunks, and essentially grep through the project to figure out where the changes are and what has been changed. This helps me a lot when I'm working on a rather large code base and want to craft my patches: this goes in that patch, and this goes in there. Another thing I have is a Darcs prompt for fish, the friendly interactive shell, which is very useful: you can see that a repository has changed, that there are untracked files, and so on, everything you know from a Git prompt. And something else we've been working on is called Darcslab, our hosting platform. We're using it internally at the moment and haven't open-sourced it yet, but we're planning to. It's what we use to manage our repositories, and at the moment it's completely command-line based, which some of you might actually enjoy; we enjoy it quite a
lot. Everything you need is just Darcs and SSH, and it works on all our machines. To us it's great: we don't need a web interface, and we can manage all our privileges and rights management on the command line. That's actually rather exciting, and we are looking forward to open-sourcing it for you.

That pretty much brings us to the end. I have a couple of resources you might find interesting. If you really want to learn Darcs, I have written a little book, a Darcs book, because I didn't like the documentation that was out there and thought writing a book would be a good idea. It's hosted at our hackspace, because I wrote it for the hackspace initially; we use Darcs there, and I thought that if we were going to use it, there had to be decent documentation. It's very simple, baby's first version control system, not super sophisticated stuff: you can start right off the bat even if you've never used a version control system before. And here is the paper I've been talking about, which lays the foundation for the new patch theory that people have been working on. There's an experimental implementation in the latest major release of Darcs: the version 3 patch format is an option, but it's still experimental, so please don't use it for critical code.

That's about it. If you want to get a hold of me, I'm on Twitter, and you can drop me an email. I hope you enjoyed that; if you have any questions, just drop me an email or hit me up on chat. Thanks, bye.
# Document Title
(brought to you by boringcactus)

Can We Please Move Past Git? (22 Feb 2021)

"Git is fundamentally a content-addressable filesystem with a VCS user interface written on top of it." (Pro Git §10.1)

Most software development is not like the Linux kernel's development; as such, Git is not designed for most software development. Like Samuel Hayden tapping the forces of Hell itself to generate electricity, the foundations on which Git is built are overkill on the largest scale, and when the interface concealing all that complexity cracks, the nightmares which emerge can only be dealt with by the Doom Slayer that is Oh Shit Git. Of course, the far more common error handling method is to start over from scratch.

Git is bad. But are version control systems like operating systems, in that they're all various kinds of bad, or is an actually good VCS possible? I don't know, but I can test some things and see what comes up.

Mercurial

Mercurial is a distributed VCS that's around the same age as Git, and I've seen it called the Betamax to Git's VHS, which my boomer friends tell me is an apt analogy, but I'm too young for that to carry meaning. So let me see what all the fuss is about.

Well, I have some bad news. From the download page, under "Requirements": "Mercurial uses Python (version 2.7). Most ready-to-run Mercurial distributions include Python or use the Python that comes with your operating system." Emphasis theirs, but I'd have added it myself otherwise. Python 2 has been dead for a very long time now, and saying you require Python 2 makes me stop caring faster than referring to "GNU/Linux". If you've updated it to Python 3, cool, don't say it uses Python 2. Saying it uses Python 2 makes me think you don't have your shit together, and in fairness, that makes two of us, but I'm not asking people to use my version control system (so far, at least).

You can't be better than Git if you're that outdated. (Although you can totally be better than Git by developing a reputation for having a better UI than Git; word of mouth helps a lot.)

Subversion

I am a fan of subverting things, and I have to respect wordplay. So let's take a look at Subversion (sorry, "Apache® Subversion®").

There are no official binaries at all, and the most-plausible-looking blessed unofficial binary for Windows is TortoiseSVN. I'm looking through the manual, and I must say, the fact that branches and tags aren't actually part of the VCS, but instead conventions on top of it, isn't good. When I want to make a new branch, it's usually "I want to try an experiment, and I want to make it easy to give up on this experiment." Also, I'm not married to the idea of distributed VCSes, but I do tend to start a project well before I've set up server-side infrastructure for it, and Subversion is not designed for that sort of thing at all. So I think I'll pass.

You can't be better than Git if the server setup precedes the client setup when you're starting a new project. (Although you can totally be better than Git by having monotonically-ish increasing revision numbers.)

Fossil

Fossil is kinda nifty: it handles not just code but also issue tracking, documentation authoring, and a bunch of the other things that services like GitHub staple on after the fact. Where Git was designed for the Linux kernel, which has a fuckton of contributors and needs to scale absurdly widely, Fossil was designed for SQLite, which has a very small number of contributors and does not solicit patches. My projects tend to only have one contributor, so this should in principle work fine for me.

However, a few things about Fossil fail to spark joy. The fact that repository metadata is stored as an independent file separate from the working directory, for example, is a design decision that doesn't mesh well with my existing setup. If I were to move my website into Fossil, I would need somewhere to put boringcactus.com.fossil outside of D:\Melody\Projects\boringcactus.com, where the working directory currently resides. The documentation suggests ~/Fossils as a folder in which repository metadata can be stored, but that makes my directory structure uglier. The rationale for doing it this way, instead of having .fossil in the working directory like .git etc., is that multiple checkouts of the same repository are simpler when repository metadata is outside each of them. Presumably the SQLite developers do that sort of thing a lot, but I don't, and I don't know anyone who does, and I've only ever done it once (back in the days when the only way to use GitHub Pages was to make a separate gh-pages branch). Cluttering up my filesystem just so you can support a weird edge case that I don't need isn't a great pitch.
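For readers who haven't used Fossil, the separation looks like this in practice (a sketch with made-up paths):

$ fossil clone https://example.org/project ~/Fossils/project.fossil   # repository: one file
$ mkdir project && cd project
$ fossil open ~/Fossils/project.fossil                                # check-out: populated here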
But sure, let's check this out. The docs have instructions for importing a Git repo into Fossil, so let's follow them:

PS D:\Melody\Projects\boringcactus.com> git fast-export --all | fossil import --git D:\Melody\Projects\misc\boringcactus.com.fossil
]ad fast-import line: [S IN THE

Well, then. You can't be better than Git if your instructions for importing from Git don't actually work. (Although you can totally be better than Git if you can keep track of issues etc. alongside the code.)

Darcs

Darcs is a distributed VCS that's a little different from Git etc. Git etc. have the commit as the fundamental unit on which all else is built, whereas Darcs has the patch as its fundamental unit. This means that a branch in Darcs refers to a set of patches, not a commit. As such, Darcs can be more flexible with its history than Git can: a Git commit depends on its temporal ancestor ("parent"), whereas a Darcs patch depends only on its logical ancestor (e.g. creating a file before adding text to it). This approach also improves the way that some types of merge are handled; I'm not sure how often this sort of thing actually comes up, but the fact that it could in Git is definitely suboptimal.

So that's pretty cool; let's take a look for ourselves. Oh. Well, then. The download page is only served over plain HTTP (there's just nothing listening on that server over HTTPS), and the downloaded binaries are also served over plain HTTP. That's not a good idea. I'll pass, thanks.

You can't be better than Git while serving binaries over plain HTTP. (Although you can totally be better than Git by having nonlinear history and doing interesting things with patches.)

Pijul

Pijul is (per the manual) "the first distributed version control system to be based on a sound mathematical theory of changes. It is inspired by Darcs, but aims at solving the soundness and performance issues of Darcs."

Inspired by Darcs but better, you say? You have my attention. Also of note is that the developers are building their own GitHub clone, which they use to host Pijul itself; that gives a really nice view of how a GitHub clone built on top of Pijul would work, and also offers free hosting.

The manual gives installation instructions for a couple Linuces and OS X, but not Windows, and not Alpine Linux, which is the only WSL distro I have installed.
However, someone involved in the project showed up in my mentions to say that it works on Windows, so we'll just follow the generic instructions and see what happens:

PS D:\Melody\Projects> cargo install pijul --version "~1.0.0-alpha"
Updating crates.io index
Installing pijul v1.0.0-alpha.38
Downloaded <a bunch of stuff>
Compiling <a bunch of stuff>
error: linking with `link.exe` failed: exit code: 1181
= note: "C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\BuildTools\\VC\\Tools\\MSVC\\14.27.29110\\bin\\HostX64\\x64\\link.exe" <lots of bullshit>
= note: LINK : fatal error LNK1181: cannot open input file 'zstd.lib'
error: aborting due to previous error

So it doesn't work for me on Windows. (There's a chance that instructions would help, but in the absence of those, I will simply give up.) Let's try it over on Linux:

UberPC-V3:~$ cargo install pijul --version "~1.0.0-alpha"
<lots of output>
error: linking with `cc` failed: exit code: 1
= note: /usr/lib/gcc/x86_64-alpine-linux-musl/9.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lzstd
/usr/lib/gcc/x86_64-alpine-linux-musl/9.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lxxhash
collect2: error: ld returned 1 exit status
error: aborting due to previous error
UberPC-V3:~$ sudo apk add zstd-dev xxhash-dev
UberPC-V3:~$ cargo install pijul --version "~1.0.0-alpha"
<lots of output again because cargo install forgets dependencies immediately smdh>
Installed package `pijul v1.0.0-alpha.38` (executable `pijul`)

Oh hey, would you look at that, it actually worked, and all I had to do was wait six months for each compile to finish (and make an educated guess about what packages to install). So for the sake of giving back, let's add those instructions to the manual, so nobody else has to bang their head against the wall like I'd done the past few times I tried to get Pijul working for myself.

First, clone the repository for the manual:

UberPC-V3:~$ pijul clone https://nest.pijul.com/pijul/manual
Segmentation fault

Oh my god. That's extremely funny. Oh fuck that's hilarious - I sent that to a friend and her reaction reminded me that Pijul is written in Rust. This VCS so profoundly doesn't work on my machine that it manages to segfault in a language that's supposed to make segfaults impossible. Presumably the segfault came from C code FFI'd with unsafe preconditions that weren't met, but still, that's just amazing.

Update 2021-02-24: One of the Pijul authors reached out to me to help debug things. Apparently mmap on WSL is just broken, which explains the segfault. They also pointed me towards the state of the art in getting Pijul to work on Windows, which I confirmed worked locally, and then set up automated Windows builds using GitHub Actions. So if we have a working Pijul install, let's see if we can add that CI setup to the manual:

PS D:\Melody\Projects\misc> pijul clone https://nest.pijul.com/pijul/manual pijul-manual
✓ Updating remote changelist
✓ Applying changes 47/47
✓ Downloading changes 47/47
✓ Outputting repository

Hey, that actually works! We can throw in some text to the installation page (and more text to the getting started page) and then use pijul record to commit our changes. That pulls up Notepad as the default text editor, which fails to spark joy, but that's a papercut that's entirely understandable for alpha software not primarily developed on this OS.
Instead of having "issues" and "pull requests" as two disjoint things, the Pijul Nest lets you add changes to any discussion, which I very much like. Once we've recorded our change and made a discussion on the repository, we can pijul push boringcactus@nest.pijul.com:pijul/manual --to-channel :34 and it'll attach the change we just made to discussion #34. (It appears to be having trouble finding my SSH keys or persisting known SSH hosts, which means I have to re-accept the fingerprint and re-enter my Nest password every time, but that's not the end of the world.)

So yeah, Pijul definitely still isn't production-ready, but it shows some real promise. That said, you can't be better than Git if you aren't production-ready. (Although you can totally be better than Git by having your own officially-blessed GitHub clone sorted out already.) (And maybe, with time, you can be eventually better than Git.)

what next?

None of the existing VCSes that I looked at were unreservedly better than Git, but they all had aspects that would help beat Git. A tool which is actually better than Git should start by being no worse than Git:

- allow importing existing Git repositories
- don't require Git users to relearn every single thing; we already had to learn Git, we've been through enough

Then, to pick and choose the best parts of other VCSes, it should

- have a UI that's better, or at least perceived as better, than Git's; ideally minimalism and intuitiveness will get you there, but user testing is gonna be the main thing
- avoid opaque hashes as the primary identifier for things (r62 carries more meaning than 7c7bb33), but not at the expense of features that are actually important
- go beyond just source code, and cover issues, documentation wikis, and similar items, so that (for at least the easy cases) the entire state of the project is contained within version control
- approach history as not just a linear sequence of facts but a story
- offer hosting to other developers who want to use your VCS, so they don't have to figure that out themselves to get started in a robust way

And just for kicks, a couple of extra features that nobody has but everybody should:

- the CLI takes a back seat to the GUI (or TUI, I guess): seeing the state gets easier that way, discovering features gets easier that way, teaching people who aren't CLI-literate gets easier that way
- contributor names & emails aren't immutable; trans people exist, and git filter-repo makes it about as difficult to change my name as the state of Colorado did
- if you build in issue/wiki/whatever tracking, also build in CI in some way
- avoid internal jargon: either say things in plain $LANG or develop a consistent and intuitive metaphor and use it literally everywhere

I probably don't have the skills, and I certainly don't have the free time, to build an Actually Good VCS myself. But if you want to, here's what you're aiming for. Good luck. If you can pull it off, you'll be a hero. And if you can't, you'll be in good company.
# Document Title
Some things a potential Git replacement probably needs to provide

Recently there has been renewed interest in revision control systems. This is great, as improvements to tools are always welcome. Git is, sadly, extremely entrenched, and trying to replace it will be an uphill battle. This is not due to technical but social issues. What this means is that approaches like "basically Git, but with a mathematically proven model for X" are not going to fly. While having this extra feature is great in theory, in practice it is not sufficient. The sheer amount of work needed to switch a revision control system, and the ongoing burden of using a niche, nonstandard system, is just too much. People will keep using their existing system.

What would it take, then, to create a system that is compelling enough to make the change? In cases like these you typically need a "big design thing" that makes the new system 10× better in some way and which the old system can not do. Alternatively, the new system needs to have many small things that are better, but then the total improvement needs to be something like 20×, because the human brain perceives things nonlinearly. I have no idea what this "major feature" would be, but below is a list of random things that a potential replacement system should probably handle.

Better server integration

One of Git's design principles was that everyone should have all the history all the time, so that every checkout is fully independent. This is a good feature to have, and one that should be supported by any replacement system. However, it is not how revision control systems are commonly used. 99% of the time developers are working against some sort of centralised server, be it GitLab, GitHub, or a corporation's internal revision control server. The user interface should be designed so that this common case is as smooth as possible.

As an example, let's look at keeping a feature branch up to date. In Git you have to rebase your branch and then force push it. If your branch had any changes you don't have in your current checkout (because they were done on a different OS, for example), they are now gone. In practice you can't have more than one person working on a feature branch because of this (unless you use merges, which you should not do). This should be more reliable. The system should store, somehow, that a rebase has happened and offer to fix out-of-date checkouts automatically. Once the feature branch gets to trunk, it is ok to throw this information away. But not before that.

Another thing one could do is let repository maintainers mandate things like "pull requests must not contain merges from trunk to the feature branch", and the system would then automatically prohibit these. Telling people to remove merges from their pull requests and to use rebase instead is something I have to do over and over again. It would be nice to be able to prohibit the creation of said merges rather than manually detecting and fixing things afterwards.

Keep rebasing as a first class feature

One of the reasons Git won was that it embraced rebasing. Competing systems like Bzr and Mercurial did not, and advocated merges instead. It turns out that people really want their linear history, and that rebasing is a great way to achieve it. It also helps code review, as fixes can be done in the original commits rather than in new commits afterwards. The counterargument is that rebasing loses history. This is true, but on the other hand it also means that your commit history gets littered with messages like "Some more typo fixes #3, lol." In practice people seem to strongly prefer the former to the latter.
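For reference, today's manual version of the feature-branch dance described under "Better server integration" looks something like this (a sketch; --force-with-lease at least refuses to overwrite commits you haven't fetched, which is the closest stock Git gets to the safety asked for above):

$ git fetch origin
$ git rebase origin/main my-feature
$ git push --force-with-lease origin my-feature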
Make it scalable

Git does not scale. The fact that Git-LFS exists is proof enough. Git only scales within its original, narrow design spec of "must be scalable for a process that only deals in plain text source files, where the main collaboration method is sending patches over email", and even then it does not do it particularly well. If you try to do anything else, Git just falls over. This is one of the main reasons why game developers and the like use other revision control systems. The final art assets for a single level of a modern game can be many, many times bigger than the entire development history of the Linux kernel.

A replacement system should handle huge repos like these effortlessly. By default a checkout should only download those files that are needed, not the entire development history. If you need to do something like bisection, then files missing from your local cache (and only those) should be downloaded transparently during checkout operations. There should be a command to download the entire history, of course, but it should not be done by default.

Further, it should be possible to do only partial checkouts. People working on low level code should be able to get just their bits and not have to download hundreds of gigs of textures and videos they don't need to do their work.
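For comparison, stock Git's opt-in approximations of this behaviour look like the following (a sketch with an assumed URL; neither mechanism is the default, and neither covers the asset workflow described next):

$ git clone --filter=blob:none https://example.org/big-repo   # partial clone: fetch file contents lazily
$ cd big-repo
$ git sparse-checkout set src/engine docs                     # partial checkout: materialize only these paths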
Support file locking

This is the one feature all coders hate: the ability to lock a file in trunk so that no-one else can edit it. It is disruptive, annoying, and just plain wrong. It is also necessary. Practice has shown that artists at large either can not or will not use revision control systems. There are many studios where the revision control system for artists is a shared network drive, with file names like character_model_v3_final_realfinal_approved.mdl. It "works for them", and trying to mandate a more process-heavy revision control system can easily lead to an open revolt.

Converting these people means providing them with a better workflow. Something like this:

1. They open their proprietary tool, be it Photoshop, Final Cut Pro or whatever.
2. They click on a GUI item to open a new resource.
3. A window pops up where they can browse the files directly from the server, as if they were local.
4. They open a file.
5. They edit it.
6. They save it. Changes go directly in trunk.
7. They close the file.

There might be a review step as well, but it should be automatic. Merge requests should be filed and kept up to date without the need to create a branch or to even know that such a thing exists. Anything else will not work. Specifically, doing any sort of conflict resolution does not work, even if it were the "right" thing to do. The only way around this (that we know of) is to provide file locking. Obviously this should be limitable to binary files only.

Provide all functionality via a C API

The above means that you need to be able to deeply integrate the revision control system with existing artist tools. This means plugins written in native code using a stable plain C API. The system can still be implemented in whatever SuperDuperLanguage you want, but its one true entry point must be a C API. It should be full-featured enough that the official command line client should be implementable using only functions in the public C API.

Provide transparent Git support

Even if a project wanted to move to something else, the sad truth is that for the time being the majority of contributors only know Git. They don't want to learn a whole new tool just to contribute to the project. Thus the server should serve its data in two different formats: once in its native format, and once as a regular Git endpoint. Anyone with a Git client should be able to check out the code and not even know that the actual backend is not Git. They should even be able to submit merge requests, though they might need to jump through some minor hoops for that. This allows you to do incremental upgrades, which is the only feasible way to get changes like these done.

Posted by Jussi at 12:57 PM

10 comments:

Unknown (December 28, 2020 at 4:43 PM): Thank you for this thoughtful assessment of the technical landscape. I think you have provided eloquent design points along this roadmap.

Thomas DA (December 28, 2020 at 10:07 PM): The example about a feature branch, isn't that what force-with-lease is about?

Jussi (December 29, 2020 at 12:23 AM): Possibly. I tried to read that documentation page and I could not for the life of me understand what it is supposed to do. Which gives us yet another thing the replacement should provide: have documentation that is actually readable and understandable by humans.

Rickyx (December 29, 2020 at 2:05 AM): As an architect (not of software but of buildings) and artist, I totally agree. Currently I have found no good solution to version binary CAD, 3D files, images... The scheme currently used in major studios? 2020-11-09_my house.file, 2020-11-10_my house.file, 2020-11-22_my house Erik fix.file, 2020-12-02_my house changed roof.file...

not empty (December 29, 2020 at 6:48 PM): Are you aware of Perforce/Helix? I don't want to advertise this (commercial) application, but some of the points you're making seem to match its features (server-based, file-locking). I haven't used it in production (especially not what seems to be a git-compatible interface) but have looked at it as a way to handle large binary files. Since a creative studio relies on a lot of different applications, using a plain file server still seems to be the most compatible and transparent way to handle files. You just have to make it easy for artists to use a versioning scheme so you don't end up with what you've described (my_file_v5_final_approved_...).

Jussi (December 30, 2020 at 12:05 AM): From what I remember, Perforce is server-only. And terrible. Only used it for a while, ages ago.

Unknown (December 29, 2020 at 10:12 PM): I think the best solution would be to have git for mergable/diffable files and perforce for binary assets. It would also be good if artists/designers/architects etc. used mergable file formats for their work, but that is not really possible today.

Jussi (December 30, 2020 at 12:06 AM): The most important thing a version control system gives you is atomic state. For that you need only one system; two separate ones don't really work.

Unknown (December 30, 2020 at 5:41 PM): I think that atomic state is way less important for binary/unmergable assets (compared to code).
So from a game dev perspective, a dual system should work way better than a single one.

Apichat (December 31, 2020 at 9:43 AM): You don't even mention the name of the project "basically Git, but with a mathematically proven model for X". I think you might refer to Pijul https://pijul.org/ : "Pijul is a free and open source (GPL2) distributed version control system. Its distinctive feature is to be based on a sound theory of patches, which makes it easy to learn and use, and really distributed." It has gotten recent news: https://pijul.org/posts/2020-11-07-towards-1.0/ and https://pijul.org/posts/2020-12-19-partials/ and "Pijul - The Mathematically Sound Version Control System Written in Rust" https://initialcommit.com/blog/pijul-version-control-system and "Q&A with the Creator of the Pijul Version Control System" https://initialcommit.com/blog/pijul-creator
# Document Title
Introduction

This article is a Q&A with Pierre-Étienne Meunier, the creator and lead developer of the Pijul VCS (version control system).

Q: What is your background in computer science and software development?

A: I've been an academic researcher for about 10 years, and I've recently left academia to found a science company working on issues related to energy savings and decentralised energy production. I'm the only computer scientist, and we work with sociologists and energy engineers.

While I was in academia, my main area of work was asynchronous and geometric computing, whose goal is to understand how systems with lots of simple components can interact in a geometric space to make computation happen. My current work is also somewhat related to this, although in a different way.

My favourite simple components differ from man-made "silicon cores" in that they occur in nature, or at least in the real world, and are not made for the specific purpose of computation. For many years I've worked on getting molecules to self-organise in a test tube to form meaningful shapes. And I've also worked on Pijul, where different authors edit a document in a disorderly way, without necessarily agreeing on "good practices" beforehand.

The idea of Pijul came while Florent Becker and I were writing a paper on self-assembly. At some point we started thinking about the shortcomings of Darcs (Florent was one of the core contributors of Darcs at the time). We decided that we had to do something about it, to keep the "mathematically designed" family of version control systems alive.

Q: What is your approach to learning, furthering your knowledge and understanding on a topic? What resources do you use?

A: The thing I love the most about computer science is that it can be used to think about a wide variety of subjects of human knowledge, and yet you can get a very concrete and very cheap experience of it by programming a computer. In all disciplines, the main way for virtually anyone to learn technical and scientific things is to play with them.

For example, you can read a book about economics and then immediately write little simulations of games from game theory, or a basic simulator of macroeconomics. You can read the Wikipedia page about wave functions and then start writing code to simulate a quantum computer. These simulations will not be efficient or realistic, but they will allow their authors to formalise their ideas in a concrete and precise way.

By following this route, you not only get a door into the subject you wanted to understand initially, you also get access to philosophical questions that seemed very abstract before, since computer science is a gateway between the entire world of "pure reason" (logic and mathematics), as Kant would say, and the physical world. What can we know for a fact? Is there a reality beyond language? And suddenly you get a glimpse of what Hume, Kant, Wittgenstein... were after.

Q: What was the first programming language you learned and how did you get into it?

A: I think I started with C when I was around 12. My uncle had left France to take an administrative position high up at the Vatican, and left most of his things with his brothers and sisters. My mother got his computer, a Victor V286, which was already pretty old when we got it (especially in the days when Moore's law was at its peak). Almost nothing was supplied with it: MS-DOS, a text processor, and a C IDE. So if I wanted to use it, I had little choice.
I don't remember playing much with the text processor.

Q: Why did you decide to start a Version Control System? What is it about VCS that interests you?

A: My first contact with version control systems was with SVN at university, and I remember being impressed when I first started using it for the toy project we were working on.

Then, when I did my PhD, my friends and colleagues convinced me to switch to Darcs. As the aspirant mathematician I was, the idea of a rigorous patch theory, where detecting conflicts was done entirely by patch commutation, was appealing. Like everybody else, I ran every now and then into Darcs' issues with conflicts, until Florent and I noticed that (1) it had become nearly impossible to convince our skeptical colleagues to install and use it and (2) this particular corner of version control systems was surprisingly close to the rest of our research interests, and that's how we got started.

Q: How do you decide what projects to work on?

A: This is one of my biggest problems in life. I'm generally interested in a large number of things, and I have little time to explore all of them in the depth they deserve. When I have to choose, I try to do what I think will teach me (and hopefully others) the most, or what will change things the most.

Q: Why did you choose to write Pijul in Rust?

A: We didn't really choose, Rust was the only language at the time that ticked all the boxes:

- Statically typed with automatic memory management, because we were writing mathematical code, and we were just two coders, so we needed as much help from compilers as we could get. That essentially meant one of OCaml, Haskell, Scala, Rust, Idris.

- Fast, because we knew we would be benchmarked against Git. That ruled out Scala (because of startup times) and Idris, which was experimental at the time. Also, Haskell can be super fast, but the performance is hard to guarantee deterministically.

- Could call C functions on Windows. That ruled out OCaml (this might have been fixed since then), and was a strong argument for Rust, since the absence of a GC makes this particularly easy.

An additional argument for Rust is that compiling stuff is super easy. Sure, there are system dependencies, but most of the time, things "just work". It works so well that people are encouraged to split their work into multiple crates rather than bundling it up into a large monolithic library. As a consequence, even very simple programs can easily have dozens of dependencies.

Q: When you encounter a tough development problem, what is your process to solve it? Especially if the problem is theoretical in nature?

A: It really depends on the problem, but I tend to start all my projects by playing with toy examples, until I understand something new. I take notes about the playing, and then try to harden the reasoning by making the intuitive parts rigorous. This is often where bugs (both in mathematical proofs and in software) are hidden. Often, this needs a new iteration of playing, and sometimes many more. I usually find that coding is a good way to play with a problem, even though it isn't always possible, especially in later iterations of the project.

Of course, for things like Pijul, code is needed all the way to the end, but it is a different sort of code, not the "prototype" kind used to understand a problem.

Apart from that, I find that taking a break to walk outside, without forcing myself to stay too focused on the problem, is very useful.
I also do other, more intensive sports, but they rarely help solve my problems.

Q: How big is the Pijul team? Who is it made up of?

A: For quite a while there was just Florent and myself, then a few enthusiasts joined us to work on the previous version: we had about 10 contributors, including Thomas Letan who, in addition to his technical contributions, organised the community, welcomed people, etc. And Tae Sandoval, who wrote a popular tutorial (Pijul for Git users), and has been convincing people on social media at an impressive rate for a few years.

Now that we have a candidate version that seems applicable to real-life projects, the team has started growing again, and the early alpha releases of version 1.0.0, even though they still have a few bugs, are attracting new contributors and testers.

Q: What are your goals for Pijul capabilities, growth, and adoption?

A: When Pijul becomes stable, it will be usable for very large projects, and bring more sanity to source code management, by making things more deterministic.

This has the potential to save a large number of engineering hours globally, and to use continuous integration tools more wisely: indeed, on large repositories, millions of CPU-hours are wasted each year just to check that Git and others didn't shuffle lines around during a merge.

Smaller projects, beginners and non-technical people could also benefit from version control, but they aren't using it now, or at least not enough, because the barrier is just too high. Moreover, in some fields of industry and administration, people are doing version control manually, which seems like a giant waste of human time to me, but I also know that no current tool can do that job.

About growth and adoption, I also want to mention that Pijul started as an open source project, and that we are strongly committed to keeping it open source. Since there is a large amount of tooling to be developed around it to encourage adoption (such as text editor plugins, CI/CD tooling…), we are currently trying to build a commercial side as well, in order to fund these developments. This could be in the form of support, hosting, or maybe specialised applications of Pijul, or all that at the same time.

Q: Do you think Pijul (or any other VCS) could ever overtake Git?

A: I think there could be a space for both. Git is meant as a content-addressable storage engine, and is extraordinarily efficient at that. One thing that is particularly cool with focusing on versions is that diffs can be computed after the fact, and I can see how this is desirable sometimes. For example, for cases where only trivial merges happen (for example adding a file), such as scientific data, this models the reality quite well, as there is no "intent" of an "author" to convey.

In Pijul, we can have different diff algorithms, for instance taking a specific file format into account, but they have to be chosen at record time. The advantages are that the merge is unambiguous, in the sense that there is only one way to merge things, and that way satisfies a number of important properties not satisfied by Git. For example, the fact that merging two changes at once has the same effect as merging the first one, and then the other (as bizarre as it sounds, Git doesn't always do that).

However, because Git does not operate on changes (or "diffs"), it fails to model how collaboration really works.
For example, when merging changes made on different branches, Git insists on ordering them, which is not actually what happened: in reality, the authors of the two branches worked in parallel, and merged each other's changes in different orders. This sounds like a minor difference, but in real life, it forces people to reorder their history all the time, letting Git guess how to reshuffle their precious source code in ways that are not rigorous at all.

Concretely, if Alice and Bob each produce a commit in parallel, then when they pull each other's change, they should get the exact same result, and see the same conflicts if there are conflicts. It shouldn't matter whether it is Alice or Bob who solves the conflicts, as long as the resolution works for both of them. If they later decide to push the result to another repository, there is no reason why the conflicts should reappear. They sometimes do in Git (which is the reason for the git rerere command), and the theory of Pijul guarantees that this is never the case.

Q: If "distributed version control" is the 3rd generation of VCS tools, do you anticipate a 4th generation? If so, what might that look like? What might be the distinguishing feature/aspect of the next generation VCS?

A: I believe "asynchronous" is the keyword here. Git (Mercurial, Fossil, etc.) are distributed in the sense that each instance is a server, but they are really just replicated instances of a central source of authority (usually hosted on GitHub or GitLab and named "master").

In contrast to this, asynchronous systems can work independently from each other, with no central authority. These systems are typically harder to design, since there is a rather large number of cases, and even figuring out how to make a list of cases isn't obvious.

Now of course, project leaders will always want to choose a particular version for release, but this should be dictated by human factors only, not by technical factors.

Q: What is your personal favorite VCS feature?

A: I believe commutativity is the thing every version control system is trying to simulate, with varying degrees of success. Merges and rebases in Git are trying to make things commute (well, when they work), and Darcs does it (except for conflicts). So this is really my favourite feature, and the fact that no one else was doing it rigorously got me into this field.

Q: If you could snap your fingers and have any VCS feature magically appear in Pijul, what would it be?

A: This is an excellent question. One thing we do not capture very well yet is keeping the identity of code blocks across refactorings. If you split or merge a file, or swap two functions in a file, then these changes will not commute with changes that happen in parallel.

We are not alone: Darcs doesn't model that at all, and Git and Mercurial use their merge heuristics to try and solve it. But as I said before, these heuristics are not rigorous, and can sometimes reshuffle files in unexpected ways without telling the user, which has non-trivial security implications.

I have ideas on how to do it in Pijul, but they are not fully formalised yet. I think the format is now extensible enough to support them. What was I saying about playing with toy examples and walking outside? ;-)
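As a deliberately naive toy model of the convergence property Meunier describes (this is not Pijul's patch theory, only a sketch of the algebraic idea): if the repository state is a set of patches and the working copy is a pure function of that set, then pulling in either order gives the same result.

```python
# Toy model of order-independent merging, for illustration only:
# the repository state is a *set* of patches, and applying patches
# is set union, which is commutative, associative and idempotent.

def merge(state: frozenset, *patches: str) -> frozenset:
    """Add patches to the state; the order of merging cannot matter."""
    return state.union(patches)

base = frozenset({"p0"})
alice = merge(base, "pA")   # Alice records patch pA in parallel...
bob = merge(base, "pB")     # ...while Bob records patch pB

# Each pulls the other's patch; both end up in the exact same state.
assert merge(alice, "pB") == merge(bob, "pA")
```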
Conclusion

In this article, we presented a Q&A session with the creator and lead developer of the Pijul project, Pierre-Étienne Meunier.

If you're interested in learning more about how version control systems work under the hood, check out our Baby Git Guidebook for Developers, which dives into Git's code in an accessible way. We wrote it for curious developers to learn how version control systems work at the code level. To do this we documented the first version of Git's code and discuss it in detail.

We hope you enjoyed this post! Feel free to shoot me an email at jacob@initialcommit.io with any questions or comments.
# Document Title

Conflict-Free Replicated Relations for Multi-Synchronous Database Management at Edge

Weihai Yu (UIT - The Arctic University of Norway, N-9037 Tromsø, Norway, weihai.yu@uit.no)
Claudia-Lavinia Ignat (Université de Lorraine, CNRS, Inria, LORIA, F-54000 Nancy, France, claudia.ignat@inria.fr)

HAL Id: hal-02983557, https://inria.hal.science/hal-02983557, submitted on 30 Oct 2020. IEEE International Conference on Smart Data Services, 2020 IEEE World Congress on Services, Oct 2020, Beijing, China.

Abstract—In a cloud-edge environment, edge devices may not always be connected to the network. Still, applications may need to access the data on edge devices even when they are not connected. With support for multi-synchronous access, data on an edge device are kept synchronous with the data in the cloud as long as the device is online. When the device is off-line, the application can still access the data on the device, asynchronously with concurrent data updates either in the cloud or on other edge devices. Conflict-free Replicated Data Types (CRDTs) emerged as a technology for multi-synchronous data access. CRDTs guarantee that when all sites have applied the same set of updates, the replicated data converge. However, CRDTs have not been successfully applied to relational databases (RDBs) for multi-synchronous access. In this paper, we present Conflict-free Replicated Relations (CRRs) that apply CRDTs to RDBs for support of multi-synchronous data access. With CRR, existing RDB applications, with very little modification, can be enhanced with multi-synchronous access. We also present a prototype implementation of CRR with some preliminary performance results.

Index Terms—CRDT; relational database; eventual consistency; integrity constraints

I. INTRODUCTION

The cloud technology makes data globally shareable and accessible, as long as the end-user's device is connected to the network. When the user's device is not connected, the data become inaccessible.

There are scenarios where the users may want to access their data even when their devices are off-line. The researchers may want to access their research data when they are at field work without any network facility. The project managers and engineers may want to access project data when they are under extreme conditions such as inside a tunnel or in the ocean. People may want to use their personal applications during flight, in a foreign country or during mountain hiking.

In a cloud-edge environment, the cloud consists of servers with powerful computation and reliable data storage capacities. The edge consists of the devices outside the operation domain of the cloud. The above-mentioned scenarios suggest multi-synchronous access of data on edge devices.
That is, there are two modes of accessing data on edge devices: asynchronous mode—the user can always access (read and write) the data on the device, even when the device is off-line, and synchronous mode—the data on the device is kept synchronous with the data stored in the cloud, as long as the device is online.

One of the main challenges of multi-synchronous data access is the limitation of a networked system stated in the CAP theorem [1,2]: it is impossible to simultaneously ensure all three desirable properties, namely (C) consistency equivalent to a single up-to-date copy of data, (A) availability of the data for update and (P) tolerance to network partition.

CRDTs, or Conflict-free Replicated Data Types, emerged to address the CAP challenges [3]. With CRDT, a site updates its local replica without coordination with other sites. The states of replicas converge when they have applied the same set of updates (referred to as strong eventual consistency in [3]). CRDTs have been adopted in the construction of distributed key-value stores [4], collaborative editors [5]–[8] and local-first software [9,10]. All these bear a similar goal to multi-synchronous data access. There has also been active research on CRDT-based transaction processing [11], mainly to achieve low latency for geo-replicated data stores.

Despite the above success stories, CRDTs have not been applied to multi-synchronous access of data stored in relational databases (RDBs). To support multi-synchronous data access, the application typically has to be implemented using specific key-value stores built with CRDTs. These key-value stores do not support some important RDB features, including advanced queries and integrity constraints. Consequently, the large number of existing applications that use RDBs have to be re-implemented for multi-synchronous access.

In this paper, we propose Conflict-Free Replicated Relations (CRRs) for multi-synchronous access to RDB data. Our contributions include:

- A set CRDT for RDBs. One of the hurdles for adopting CRDTs to RDBs is that there is no appropriate set CRDT for RDBs (Section III-A). We present the CLSet CRDT (causal-length set) that is particularly suitable for RDBs (Section III-C).
- An augmentation of existing RDB schema with CRR in a two-layer system (Section IV). With CRR, we are able to enhance existing RDB applications with very little modification.
- Automatic handling of integrity violations at merge (Section V).
- A prototype implementation (Section VI). The implementation is independent of the underlying database management system (DBMS). This allows different devices of the same application to deploy different DBMSs. This is particularly useful in a cloud-edge environment that is inherently heterogeneous. We also report some initial performance results of the prototype.

The paper is organized as follows. Section II gives an overview of CRR, together with the system model our work applies to. Section III reviews the background of CRDTs and presents the CLSet CRDT that is an underlying CRDT of CRR. Section IV describes CRR in detail. Section VI presents an implementation of CRR and some preliminary performance results. Section VII discusses related work. Section VIII concludes.

II. OVERVIEW OF CRR

The RDB supporting CRR consists of two layers: an Application Relation (AR) layer and a Conflict-free Replicated Relation (CRR) layer (see Figure 1). The AR layer presents the same RDB schema and API as a conventional RDB system. Application programs interact with the database at the AR layer. The CRR layer supports conflict-free replication of relations.

Fig. 1. A two-layer relational database system (figure omitted: queries and updates are issued at the AR layer; updates are translated to the CRR layer, which refreshes the AR cache and exchanges updates with remote sites through anti-entropy merges).
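As a first intuition for this two-layer split, here is a minimal sketch (names and data shapes are illustrative assumptions, not the paper's code) of an AR-layer cache being refreshed from CRR-layer rows whose liveness is tracked by a causal length, a notion developed in Section III:

```python
# Illustrative sketch of the two-layer design: the CRR layer stores
# augmented rows; the AR layer is a cache refreshed after every update.

crr = {}    # CRR layer: key -> (value, timestamp, causal_length)
ar = {}     # AR layer:  key -> value (what applications query)

def refresh(k):
    """Refresh the AR-layer cache for key k after a CRR-layer update."""
    row = crr.get(k)
    if row is not None and row[2] % 2 == 1:   # odd causal length: row lives
        ar[k] = row[0]
    else:                                      # even: row absent or deleted
        ar.pop(k, None)
```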
The AR-layer database schema $R$ has an augmented CRR schema $\tilde{R}$. A site $S_i$ maintains both an instance $r_i$ of $R$ and an instance $\tilde{r}_i$ of $\tilde{R}$. A query $q$ on $r_i$ is performed directly on $r_i$. A request for update $u$ on $r_i$ is translated into $\tilde{u}$ and performed on the augmented relation instance $\tilde{r}_i$. The update is later propagated to remote sites through an anti-entropy protocol. Every update in $\tilde{r}_i$, either as the execution of a local update $\tilde{u}(\tilde{r}_i)$ or as the merge with a remote update $\tilde{u}(\tilde{r}_j)$, refreshes $r_i$.

CRR has the property that when both sites $S_i$ and $S_j$ have applied the same set of updates, the relation instances at the two sites are equivalent, i.e. $r_i = r_j$ and $\tilde{r}_i = \tilde{r}_j$.

The two-layered system also maintains the integrity constraints defined at the AR layer. Any violation of an integrity constraint is caught at the AR layer. A local update of $\tilde{r}_i$ and refresh of $r_i$ are wrapped in an atomic transaction: a violation would cause the rollback of the transaction. A merge at $\tilde{r}_i$ and a refresh at $r_i$ are also wrapped in a transaction: a failed merge would cause some compensation updates.

An application developer does not need to care about how the CRR layer works. Little change is needed for an existing RDB application to function with CRR support.

We use different CRDTs for CRRs. Since a relation instance is a set of tuples or rows, we use a set CRDT for relation instances. We design a new set CRDT (Section III), as none of the existing set CRDTs are suitable for CRRs (Section III-A). A row consists of a number of attributes. Each of the attributes can be individually updated. We use the LWW (last-write-wins) register CRDT [12,13] for attributes in general cases. Since LWW registers may lose some concurrent updates, we use the counter CRDT [3] for numeric attributes with additive updates.

System Model

A distributed system consists of sites with globally unique identifiers. Sites do not share memory. They maintain durable states. Sites may crash, but will eventually recover to the durable state at the time of the last crash.

A site can send messages to any other site in the system through an asynchronous and unreliable network. There is no upper bound on message delay. The network may discard, reorder or duplicate messages, but it cannot corrupt messages. Through re-sending, messages will eventually be delivered. The implication is that there can be network partitions, but disconnected sites will eventually get connected.

III. A SET CRDT FOR CRRS

In this section, we first present a background on CRDTs. As a relation instance is a set of tuples, CRR needs an appropriate set CRDT as a building block. We present limitations of existing set CRDTs for RDBs. We then describe a new set CRDT that addresses the limitations.

A. CRDT Background

A CRDT is a data abstraction specifically designed for data replicated at different sites. A site queries and updates its local replica without coordination with other sites. The data is always available for update, but the data states at different sites may diverge. From time to time, the sites send their updates asynchronously to other sites with an anti-entropy protocol. To apply the updates made at the other sites, a site merges the received updates with its local replica.
A CRDT has the property that when all sites have applied the same set of updates, the replicas converge.

There are two families of CRDT approaches, namely operation-based and state-based [3]. Our work is based on state-based CRDTs, where a message for updates consists of the data state of a replica in its entirety. A site applies the updates by merging its local state with the state in the received message. The possible states of a state-based CRDT must form a join-semilattice [14], which implies convergence. Briefly, the states form a join-semilattice if they are partially ordered with $\sqsubseteq$ and a join $\sqcup$ of any two states (which gives the least upper bound of the two states) always exists. State updates must be inflationary. That is, the new state supersedes the old one in $\sqsubseteq$. The merge of two states is the result of a join.

Figure 2 (left) shows GSet, a state-based CRDT for grow-only sets (or add-only sets [3]), where $E$ is a set of possible elements, $\sqsubseteq \stackrel{def}{=} \subseteq$, $\sqcup \stackrel{def}{=} \cup$, $insert$ is a mutator (update operation) and $in$ is a query. Obviously, an update through $insert(s, e)$ is an inflation, because $s \subseteq \{e\} \cup s$. The definitions in the figure are:

$$\begin{aligned}
GSet(E) &\stackrel{def}{=} \mathcal{P}(E) \\
insert(s, e) &\stackrel{def}{=} \{e\} \cup s \\
insert^{\delta}(s, e) &\stackrel{def}{=} \begin{cases} \{e\} & \text{if } e \notin s \\ \{\} & \text{otherwise} \end{cases} \\
s \sqcup s' &\stackrel{def}{=} s \cup s' \\
in(s, e) &\stackrel{def}{=} e \in s
\end{aligned}$$

Fig. 2. GSet CRDT (left) and the Hasse diagram of its states (right).
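Transcribing these definitions into executable form is straightforward; the following sketch (Python chosen purely for illustration) mirrors the mutator, delta-mutator, join and query of Figure 2:

```python
# State-based grow-only set (GSet): states are sets ordered by inclusion,
# insert is union with a singleton, and the join (merge) is set union.

class GSet:
    def __init__(self, elems=()):
        self.elems = set(elems)

    def insert(self, e):
        """Mutator: inflationary, since s is a subset of {e} | s."""
        self.elems.add(e)

    def insert_delta(self, e):
        """Delta-mutator: the join-irreducible state {e}, or {} if known."""
        return set() if e in self.elems else {e}

    def merge(self, state):
        """Join with another state (or delta): least upper bound."""
        self.elems |= state

    def __contains__(self, e):   # the 'in' query
        return e in self.elems
```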
We achieve this with the abstraction ofcausal length, which is based on two observations.First, the insertions and deletions of a given element occurin turns, one causally dependent on the other. A deletion is anSite 퐴 Site 퐵 Site 퐶{} 푠0퐴 {} 푠0퐵 {} 푠0퐶푎1퐴 : insert(푠0퐴, 푎){푎} 푠1퐴푎1퐵 : insert(푠0퐵, 푎){푎} 푠1퐵{푎} 푠2퐴{푎} 푠1퐶푎2퐴 : delete(푠2퐴, 푎){} 푠3퐴푎2퐵 : delete(푠1퐵, 푎){} 푠2퐵{} 푠3퐵{} 푠4퐵푎2퐶 : delete(푠1퐶 , 푎){} 푠2퐶푎3퐵 : insert(푠4퐵, 푎){푎} 푠5퐵{푎} 푠6퐵{} 푠3퐶{푎} 푠4퐶푎4퐶 : delete(푠4퐶 , 푎){} 푠5퐶Fig. 3. A scenario of concurrent set updatesinverse of the last insertion it sees. Similarly, an insertion isan inversion of the last deletion it sees (or none, if the elementhas never been inserted).Second, two concurrent executions of the same mutation ofan anonymous CRDT (Section III-A) fulfill the same purposeand therefore are regarded as the same update. Seeing onemeans seeing both (such as the concurrent insertions of thesame element in GSet). Two concurrent inverses of the sameupdate are also regarded as the same one.Figure 3 shows a scenario where three sites 퐴, 퐵 and 퐶concurrently insert and delete element 푎. When sites 퐴 and퐵 concurrently insert 푎 for the first time, with updates 푎1퐴and 푎1퐵, they achieve the same effect. Seeing either one of theupdates is the same as seeing both. Consequently, states 푠1퐴,푠2퐴, 푠1퐵 and 푠1퐶 are equivalent as far as the insertion of 푎 isconcerned.Following the same logic, the concurrent deletions on theseequivalent states (with respect to the first insertion of 푎) arealso regarded as achieving the same effect. Seeing one of themis the same as seeing all. Therefore, states 푠3퐴, 푠2퐵, 푠3퐵, 푠4퐵, 푠2퐶and 푠3퐶 are equivalent with regard to the deletion of 푎.Now we present the states of element 푎 as the equivalenceclasses of the updates, as shown in the second column ofTable I. The concurrent updates that see equivalent states andachieve the same effect are in the same equivalence classes.The columns 푟 and ˜푟 in Table I correspond to the states of atuple (as an element of a set) in relation instances in the ARand CRR layers (Section II).When a site makes a new local update, it adds to its state anew equivalence class that contains only the new update. Forexample, when site 퐵 inserts element 푎 at state 푠0퐵 with update푎1퐵, it adds a new equivalence class {푎1퐵 } and the new state 푠1퐵becomes {{푎1퐵 }}. When site 퐵 then deletes the element withupdate 푎2퐵, it adds a new equivalence class {푎2퐵 } and the newTABLE ISTATES OF A SET ELEMENT푠 states in terms of equivalence groups ˜푟 푟푠0퐴 { } { } { }푠1퐴 { {푎1퐴 } } { 〈푎, 1〉 } {푎 }푠2퐴 { {푎1퐴, 푎1퐵 } } { 〈푎, 1〉 } {푎 }푠3퐴 { {푎1퐴, 푎1퐵 }, {푎2퐴 } } { 〈푎, 2〉 } { }푠0퐵 { } { } { }푠1퐵 { {푎1퐵 } } { 〈푎, 1〉 } {푎 }푠2퐵 { {푎1퐵 }, {푎2퐵 } } { 〈푎, 2〉 } { }푠3퐵 { {푎1퐴, 푎1퐵 }, {푎2퐵 } } { 〈푎, 2〉 } { }푠4퐵 { {푎1퐴, 푎1퐵 }, {푎2퐴, 푎2퐵 } } { 〈푎, 2〉 } { }푠5퐵 { {푎1퐴, 푎1퐵 }, {푎2퐴, 푎2퐵 }, {푎3퐵 } } { 〈푎, 3〉 } {푎 }푠6퐵 { {푎1퐴, 푎1퐵 }, {푎2퐴, 푎2퐵 , 푎2퐶 }, {푎3퐵 } } { 〈푎, 3〉 } {푎 }푠0퐶 { } { } { }푠1퐶 { {푎1퐵 } } { 〈푎, 1〉 } {푎 }푠2퐶 { {푎1퐵 }, {푎2퐶 } } { 〈푎, 2〉 } { }푠3퐶 { {푎1퐵 }, {푎2퐵 , 푎2퐶 } } { 〈푎, 2〉 } { }푠4퐶 { {푎1퐴, 푎1퐵 }, {푎2퐴, 푎2퐵 , 푎2퐶 }, {푎3퐵 } } { 〈푎, 3〉 } {푎 }푠5퐶 { {푎1퐴, 푎1퐵 }, {푎2퐴, 푎2퐵 , 푎2퐶 }, {푎3퐵 }, {푎4퐶 } } { 〈푎, 4〉 } { }state 푠2퐵 becomes {{푎1퐵 }, {푎2퐵 }}.The merge of two states is handled as the union of theequivalence classes. 
For example, when states $s^1_A$ and $s^2_B$ merge, the new state $s^3_B$ becomes $\{\{a^1_A\} \cup \{a^1_B\}, \{a^2_B\}\} = \{\{a^1_A, a^1_B\}, \{a^2_B\}\}$.

Now, observe that whether element $a$ is in the set is determined by the number of equivalence classes, rather than the specific updates contained in the equivalence classes. For example, as there is one equivalence class in state $s^1_B$, the element $a$ is in the set. As there are two equivalence classes in states $s^2_B$, $s^3_B$ and $s^4_B$, the element $a$ is not in the set.

Due to the last observation, we can represent the states of an element with a single number, the number of equivalence classes. We call that number the causal length of the element. An element is in the set when its causal length is an odd number. The element is not in the set when its causal length is an even number. As shown in Table I, a CRR-layer relation $\tilde{r}$ augments an AR-layer relation $r$ with causal lengths.

C. CLSet CRDT

Figure 4 shows the CLSet CRDT. The states are a partial function $s : E \hookrightarrow \mathbb{N}$, meaning that when $e$ is not in the domain of $s$, $s(e) = 0$ (0 is the bottom element of $\mathbb{N}$, i.e. $\bot_{\mathbb{N}} = 0$). Using a partial function conveniently simplifies the specification of $insert$, $\sqcup$ and $in$. Without explicit initialization, the causal length of any unknown element is 0. In the figure, $insert^{\delta}$ and $delete^{\delta}$ are the delta-counterparts of $insert$ and $delete$ respectively.

$$\begin{aligned}
CLSet(E) &\stackrel{def}{=} E \hookrightarrow \mathbb{N} \\
insert(s, e) &\stackrel{def}{=} \begin{cases} s\{e \mapsto s(e) + 1\} & \text{if } even(s(e)) \\ s & \text{if } odd(s(e)) \end{cases} \\
insert^{\delta}(s, e) &\stackrel{def}{=} \begin{cases} \{e \mapsto s(e) + 1\} & \text{if } even(s(e)) \\ \{\} & \text{if } odd(s(e)) \end{cases} \\
delete(s, e) &\stackrel{def}{=} \begin{cases} s & \text{if } even(s(e)) \\ s\{e \mapsto s(e) + 1\} & \text{if } odd(s(e)) \end{cases} \\
delete^{\delta}(s, e) &\stackrel{def}{=} \begin{cases} \{\} & \text{if } even(s(e)) \\ \{e \mapsto s(e) + 1\} & \text{if } odd(s(e)) \end{cases} \\
(s \sqcup s')(e) &\stackrel{def}{=} \max(s(e), s'(e)) \\
in(s, e) &\stackrel{def}{=} odd(s(e))
\end{aligned}$$

Fig. 4. CLSet CRDT.

An element $e$ is in the set when its causal length is an odd number. A local insertion has effect only when the element is not in the set. Similarly, a local deletion has effect only when the element is actually in the set. A local insertion or deletion simply increments the causal length of the element by one. For every element $e$ in $s$ and/or $s'$, the new causal length of $e$ after merging $s$ and $s'$ is the maximum of the causal lengths of $e$ in $s$ and $s'$.
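The following executable sketch of Figure 4 (again Python, for illustration only) represents a state as a dictionary from elements to causal lengths and replays the first concurrent insertion of the Figure 3 scenario:

```python
# CLSet: element -> causal length (missing key means 0). An element is
# in the set iff its causal length is odd; merge is a pointwise maximum.

class CLSet:
    def __init__(self):
        self.cl = {}

    def insert(self, e):
        """Effective only if e is absent (even causal length); returns delta."""
        if self.cl.get(e, 0) % 2 == 0:
            self.cl[e] = self.cl.get(e, 0) + 1
            return {e: self.cl[e]}
        return {}

    def delete(self, e):
        """Effective only if e is present (odd causal length); returns delta."""
        if self.cl.get(e, 0) % 2 == 1:
            self.cl[e] += 1
            return {e: self.cl[e]}
        return {}

    def merge(self, delta):
        for e, n in delta.items():
            self.cl[e] = max(self.cl.get(e, 0), n)

    def __contains__(self, e):
        return self.cl.get(e, 0) % 2 == 1

# Sites A and B concurrently insert the same element, then exchange deltas:
a, b = CLSet(), CLSet()
da, db = a.insert("x"), b.insert("x")
a.merge(db); b.merge(da)
assert ("x" in a) and a.cl == b.cl == {"x": 1}   # same update, seen once
```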
IV. CONFLICT-FREE REPLICATED RELATIONS

In this section we describe the design of the CRR layer. We focus particularly on the handling of updates.

Without loss of generality, we represent the schema of an application-layer relation with $R(K, A)$, where $R$ is the relation name, $K$ is the primary key and $A$ is a non-key attribute. For a relation instance $r$ of $R(K, A)$, we use $A(r)$ for the projection $\pi_A r$. We also use $r(k)$ for the row identified with key $k$, and $(k, a)$ for row $r(k)$ whose $A$ attribute has value $a$.

A. Augmentation and caching

In a two-layer system as highlighted in Section II, we augment an AR-layer relation schema $R(K, A)$ to a CRR-layer schema $\tilde{R}(K, A, T_A, L)$ where $T_A$ is the update timestamp of attribute $A$ and $L$ is the causal length of rows. Basically, $\tilde{R}(K, L)$ implements a CLSet CRDT (Section III-C) where the rows identified by keys are elements, and $\tilde{R}(K, A, T_A)$ implements the LWW-register CRDT [12,13] where an attribute of each row is a register.

We use hybrid logical clocks [20] for timestamps, which are UTC-time compatible, and for two events $e_1$ and $e_2$ with clocks $\tau_1$ and $\tau_2$, $\tau_1 < \tau_2$ if $e_1$ happens before $e_2$.

For a relation instance $\tilde{r}$ of an augmented schema $\tilde{R}$, the relation instance $r$ of the AR-layer schema $R$ is a cache of $\tilde{r}$.

For a relation operation $op$ on $r$, we use $\tilde{op}$ as the corresponding operation on $\tilde{r}$. For example, we use $\tilde{\in}$ on $\tilde{r}$ to detect whether a row exists in $r$. That is, $\tilde{r}(k) \,\tilde{\in}\, \tilde{r} \Leftrightarrow r(k) \in r$. According to CLSet (Figure 4), we define $\tilde{\in}$ and $\tilde{\notin}$ as

$$\begin{aligned}
\tilde{r}(k) \,\tilde{\in}\, \tilde{r} &\stackrel{def}{=} \tilde{r}(k) \in \tilde{r} \wedge odd(L(\tilde{r}(k))) \\
\tilde{r}(k) \,\tilde{\notin}\, \tilde{r} &\stackrel{def}{=} \tilde{r}(k) \notin \tilde{r} \vee even(L(\tilde{r}(k)))
\end{aligned}$$

For an update operation $u(r, k, a)$ on $r$, we first perform $\tilde{u}(\tilde{r}, k, a, \tau_A, l)$ on $\tilde{r}$. This results in the new instance $\tilde{r}'$ and row $\tilde{r}'(k) = (k, a, \tau'_A, l')$. We then refresh the cache $r$ as the following:

- Insert $(k, a)$ into $r$ if $\tilde{r}'(k) \,\tilde{\in}\, \tilde{r}'$ and $r(k) \notin r$.
- Delete $r(k)$ from $r$ if $\tilde{r}'(k) \,\tilde{\notin}\, \tilde{r}'$ and $r(k) \in r$.
- Update $r(k)$ with $(k, a)$ if $\tilde{r}'(k) \,\tilde{\in}\, \tilde{r}'$, $r(k) \in r$ and $A(r(k)) \neq a$.

All AR-layer queries are performed on the cached instance $r$ without any involvement of the CRR layer.

The following subsections describe how the CRR layer handles the different update operations.

B. Update operations

The CRR layer handles a local row insertion as below:

$$\widehat{u}_{insert}(\tilde{r}, (k, a)) \stackrel{def}{=} \begin{cases}
insert(\tilde{r}, (k, a, now, 1)) & \text{if } \tilde{r}(k) \notin \tilde{r} \\
update(\tilde{r}, k, (a, now, L(\tilde{r}(k)) + 1)) & \text{if } \tilde{r}(k) \in \tilde{r} \wedge \tilde{r}(k) \,\tilde{\notin}\, \tilde{r} \\
skip & \text{otherwise}
\end{cases}$$

A row insertion attempts to achieve two effects: to insert row $r(k)$ into $r$ and to assign value $a$ to attribute $A$ of $r(k)$. If $r(k)$ has never been inserted, we simply insert $(k, a, now, 1)$ into $\tilde{r}$. If $r(k)$ has been inserted but later deleted, we re-insert it with the causal length incremented by 1. Otherwise, there is already a row $r(k)$ in $r$, thus we do nothing.

There could be an alternative handling in the last case. Instead of doing nothing, we could update the value of attribute $A$ with $a$. We choose to do nothing, because this behavior is in line with the SQL convention.

The CRR layer handles a local row deletion as below:

$$\widehat{u}_{delete}(\tilde{r}, k) \stackrel{def}{=} \begin{cases}
update(\tilde{r}, k, (-, L(\tilde{r}(k)) + 1)) & \text{if } \tilde{r}(k) \,\tilde{\in}\, \tilde{r} \\
skip & \text{otherwise}
\end{cases}$$

If there is a row $r(k)$ in $r$, we increment the causal length by 1. We do nothing otherwise. In the expression, we use the "$-$" sign for the irrelevant attributes.

The CRR layer handles a local attribute update as below:

$$\widehat{u}_{update}(\tilde{r}, k, a) \stackrel{def}{=} \begin{cases}
update(\tilde{r}, k, (a, now, L(\tilde{r}(k)))) & \text{if } \tilde{r}(k) \,\tilde{\in}\, \tilde{r} \\
skip & \text{otherwise}
\end{cases}$$

If there is a row $r(k)$ in $r$, we update the attribute and set a new timestamp. We do nothing otherwise.

When the CRR layer handles one of the above update operations and results in a new row $\tilde{r}(k)$ in $\tilde{r}$ (either inserted or updated), the relation instance with a single row $\{\tilde{r}(k)\}$ is a join-irreducible state in the possible instances of schema $\tilde{R}$. The CRR layer later sends $\{\tilde{r}(k)\}$ to the remote sites in the anti-entropy protocol.

A site merges a received join-irreducible state $\{(k, a, \tau, l)\}$ with its current state $\tilde{r}$ using a join:

$$\tilde{r} \sqcup \{(k, a, \tau, l)\} \stackrel{def}{=} \begin{cases}
insert(\tilde{r}, (k, a, \tau, l)) & \text{if } \tilde{r}(k) \notin \tilde{r} \\
update(\tilde{r}, k, (a, \max(T(\tilde{r}), \tau), \max(L(\tilde{r}), l))) & \text{if } L(\tilde{r}) < l \vee T(\tilde{r}) < \tau \\
skip & \text{otherwise}
\end{cases}$$

If there is no $\tilde{r}(k)$ in $\tilde{r}$, we insert the received row into $\tilde{r}$. If the received row has either a longer causal length or a newer timestamp, we update $\tilde{r}$ with the received row. Otherwise, we keep $\tilde{r}$ unchanged. Notice that a newer update on a row that is concurrently deleted is still applied. The update is therefore not lost. The row will have the latest updated attribute value when the row deletion is later undone.

It is easy to verify that a new relation instance, resulting either from one of the local updates or from a merge with a received state, is always an inflation of the current relation instance regarding causal lengths and timestamps.
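For a single row, the merge rule above boils down to a few comparisons. The sketch below is an illustrative reading of it (simplified: timestamps are opaque comparable values, and a row state is a (value, timestamp, causal_length) triple):

```python
# Join of a local augmented row with an incoming join-irreducible state.
# Liveness is decided by the causal length; the value is LWW by timestamp.

def merge_row(local, incoming):
    if local is None:                       # row never seen: adopt it
        return incoming
    (a, t, l), (a2, t2, l2) = local, incoming
    if l2 > l or t2 > t:                    # longer causal length or newer
        return (a2, max(t, t2), max(l, l2))
    return local                            # incoming is already subsumed

def visible(row):
    """The AR-layer cache keeps the row iff its causal length is odd."""
    return row is not None and row[2] % 2 == 1
```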
C. Counters

The general handling of attribute updates described earlier in Section IV-B is based on the LWW-register CRDT. This is not ideal, because the effect of some concurrent updates may get lost. For some numeric attributes, we can use the counter CRDT [3] to handle updates as increments and decrements.

We make a special (meta) relation $Ct$ for counter CRDTs at the CRR layer: $Ct(Rel, Attr, Key, Sid, Icr, Dcr)$, where $(Rel, Attr, Key, Sid)$ is a candidate key of $Ct$.

The attribute $Rel$ is for relation names, $Attr$ for names of numeric attributes and $Key$ for primary key values. For an AR-layer relation $R(K, A)$, and respectively the CRR-layer relation $\tilde{R}(K, A, T_A, L)$, where $K$ is a primary key and $A$ is a numeric attribute, $(R, A, k)$ of $Ct$ refers to the numeric value $A(r(k))$ (and $A(\tilde{r}(k))$).

The attribute $Sid$ is for site identifiers. Our system model requires that sites are uniquely identified (Section II). The counter CRDT is named (as opposed to anonymous, described in Section III-A), meaning that a site can only update its specific part of the data structure.

The attributes $Icr$ and $Dcr$ are for the increment and decrement of numeric attributes from their initial values, set by given sites.

We set the initial value of a numeric attribute to: the default value, if the attribute is defined with a default value; the lower bound, if there is an integrity constraint that specifies a lower bound of the attribute; 0 otherwise.

If the initial value of the numeric attribute $A$ of an AR-layer relation $R(K, A)$ is $v_0$, we can calculate the current value of $A(r(k))$ as $v_0 + sum(Icr) - sum(Dcr)$, as an aggregation on the rows identified by $(R, A, k)$ in the current instance of $Ct$.

We can translate a local update of a numeric attribute into an increment or decrement operation. If the current value of the attribute is $v$ and the new updated value is $v'$, the update is an increment of $v' - v$ if $v' > v$, or a decrement of $v - v'$ if $v' < v$.

Site $s$ with the current $Ct$ instance $ct$ handles a local increment or decrement operation as below:

$$\widetilde{inc}_s(\tilde{r}, A, k, v) \stackrel{def}{=} \begin{cases}
update(ct, R, A, k, s, (i + v, d)) & \text{if } (R, A, k, s, i, d) \in ct \\
insert(ct, (R, A, k, s, (v, 0))) & \text{otherwise}
\end{cases}$$

$$\widetilde{dec}_s(\tilde{r}, A, k, v) \stackrel{def}{=} \begin{cases}
update(ct, R, A, k, s, (i, d + v)) & \text{if } (R, A, k, s, i, d) \in ct \\
insert(ct, (R, A, k, s, (0, v))) & \text{otherwise}
\end{cases}$$

If it is the first time the site updates the attribute, we insert a new row into $Ct$. Otherwise, we set the new increment or decrement value accordingly. Note that $v > 0$ and the updates in $Ct$ are always inflationary. The relation consisting of a single row of the update is a join-irreducible state in the possible instances of $Ct$ and is later sent to remote sites.

A site with the current $Ct$ instance $ct$ merges a received join-irreducible state with a join:

$$ct \sqcup \{(R, A, k, s, i, d)\} \stackrel{def}{=} \begin{cases}
insert(ct, (R, A, k, s, i, d)) & \text{if } (R, A, k, s) \notin ct \\
update(ct, R, A, k, s, (\max(i, i'), \max(d, d'))) & \text{if } (R, A, k, s, i', d') \in ct \wedge (i' < i \vee d' < d) \\
skip & \text{otherwise}
\end{cases}$$

If the site has not seen any update from site $s$, we insert the received row into $ct$. If it has applied some update from $s$ and the received update makes an inflation, we update the corresponding row for site $s$. Otherwise, it has already applied a later update from $s$ and we keep $ct$ unchanged.

Notice that turning an update into an increment or decrement might not always be an appropriate way to handle an update. For example, two sites may concurrently update the temperature measurement from 11°C to 15°C. Increasing the value twice leads to a wrong value of 19°C. In such situations, it is more appropriate to handle updates as a LWW-register.
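A minimal executable reading of this counter scheme (illustrative only; in the paper the per-site increments and decrements live in the $Ct$ relation) keeps one (Icr, Dcr) pair per site and merges by pointwise maxima:

```python
# Named counter CRDT: each site only inflates its own entry, so the merge
# of two states is a per-site maximum; the value is an aggregation.

class Counter:
    def __init__(self, v0=0):
        self.v0 = v0          # initial value (default / lower bound / 0)
        self.icr = {}         # site -> total increment
        self.dcr = {}         # site -> total decrement

    def value(self):
        return self.v0 + sum(self.icr.values()) - sum(self.dcr.values())

    def update(self, site, new_value):
        """Translate an attribute update into an increment or decrement."""
        delta = new_value - self.value()
        if delta > 0:
            self.icr[site] = self.icr.get(site, 0) + delta
        elif delta < 0:
            self.dcr[site] = self.dcr.get(site, 0) - delta

    def merge(self, other):
        for s, i in other.icr.items():
            self.icr[s] = max(self.icr.get(s, 0), i)
        for s, d in other.dcr.items():
            self.dcr[s] = max(self.dcr.get(s, 0), d)
```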
V. INTEGRITY CONSTRAINTS

Applications define integrity constraints at the AR layer. A DBMS detects violations of integrity constraints when we refresh the cached AR-layer relations.

We perform updates on a CRR-layer instance $\tilde{r}$ and the corresponding refresh of an AR-layer instance $r$ in a single atomic transaction. A violation of any integrity constraint causes the entire transaction to be rolled back. Therefore a local update that violates any integrity constraint has no effect on either $\tilde{r}$ or $r$.

When a site detects a violation at merge, it may perform an undo on an offending update. We first describe how to perform an undo. Then, we present certain rules to decide which updates to undo. The purpose of making such decisions is to avoid undesirable effects. In more general cases not described below, we simply undo the incoming updates that cause constraint violations, although this may result in unnecessary undo of too many updates.

A. Undoing an update

For delta-state CRDTs, a site sends join-irreducible states as remote updates. In our case, a remote update is a single row (or more strictly, a relation instance containing the row).

To undo a row insertion or deletion, we simply increment the causal length by one.

To undo an attribute update augmented with the LWW-register CRDT, we set the attribute to the old value, using the current clock value as the timestamp. In order to be able to generate the undo update, the message of an attribute update also includes the old value of the attribute.

We handle undo of counter updates with more care, because counter updates are not idempotent and multiple sites may perform the same inverse operation concurrently. As a result, the same increment (or decrement) might be mistakenly reversed multiple times.

To address this problem, we create a new (meta) relation for counter undo: $CtU(Rel, Attr, Key, Sid, T)$.

Similar to relation $Ct$ (Section IV-C), attributes $Rel$ and $Attr$ are for the names of the relations and the counter attributes, and attributes $Key$ and $Sid$ are for key values and sites that originated the update. $T$ is the timestamp of the original counter update. A row $(R, A, k, s, \tau)$ in $CtU$ uniquely identifies a counter update.

A message for a counter update contains the timestamp and the delta of the update, in addition to the join-irreducible state of the update. The timestamp helps us uniquely identify the update. The delta helps us generate the inverse update.

Relation $CtU$ works as a GSet (Figure 2). When a site undoes a counter update, it inserts a row identifying the update into $CtU$. When a site receives an undo message, it performs the undo only when the row of the update is not in $CtU$.

B. Uniqueness

An application may set a uniqueness constraint on an attribute (or a set of attributes). For instance, the email address of a registered user must be unique. Two concurrent updates may violate a uniqueness constraint, though neither of them violates the constraint locally.

When we detect the violation of a uniqueness constraint at the time of merge, we decide a winning update and undo the losing one. Following the common sequential cases, an earlier update wins. That is, the update with the smaller timestamp wins.

As violations of uniqueness constraints occur only for row insertions and attribute updates (as LWW-registers), the only possible undo operations are row deletions and attribute updates (as LWW-registers).
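The $CtU$ guard of Section V-A above is essentially a grow-only set of update identifiers consulted before reversing a counter delta. A hedged sketch (the identifier tuple and the counters mapping are illustrative assumptions):

```python
# Idempotent undo of counter updates: a grow-only set remembers which
# updates have already been reversed, so concurrent undoers of the same
# (relation, attr, key, site, timestamp) update reverse it only once.

undone = set()                     # plays the role of the CtU relation

def undo_counter_update(counters, update_id, key, delta):
    if update_id in undone:        # this update was already reversed
        return
    undone.add(update_id)
    counters[key] -= delta         # reverse the original increment/decrement
```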
C. Reference integrity

A reference-integrity constraint may be violated due to concurrent updates of a reference and the referenced row. Suppose relation $R_1(K_1, F)$ has a foreign key $F$ referencing $R_2(K_2, A)$. Assume two sites $S_1$ and $S_2$ have rows $r_{21}(k_2)$ and $r_{22}(k_2)$ respectively. Site $S_1$ inserts $r_{11}(k_1, k_2)$ and site $S_2$ concurrently deletes $r_{22}(k_2)$. The sites will fail to merge the concurrent remote updates, because they violate the reference-integrity constraint.

Obviously, both sites undoing the incoming remote update is undesirable. We choose to undo the updates that add references. The updates are likely row insertions. One reason for this choice is that handling uniqueness violations often leads to row deletions (Section V-B). If a row deletion as undo for a uniqueness violation conflicts with an update that adds a reference, undoing the row deletion will violate the uniqueness constraint again, resulting in an endless cycle of undo updates.

Notice that two seemingly concurrent updates may not be truly concurrent. One site might have already indirectly seen the effect of the remote update, reflected as a longer causal length. Two updates are truly concurrent only when the referenced rows at the two sites have the same causal length at the time of the concurrent updates. When the two updates are not truly concurrent, a merge of an incoming update at one of the sites would have no effect. In the above example, the merge of the deletion at site $S_2$ has no effect if $L(\tilde{r}_{21}(k_2)) > L(\tilde{r}_{22}(k_2))$.

D. Numeric constraints

An application may set numeric constraints, such as a lower bound on the balance of a bank account. A set of concurrent updates may together violate a numeric constraint, though the individual local updates do not.

When the merge of an incoming update fails due to a violation of a numeric constraint, we undo the latest update (or updates) of the numeric attribute. This, however, may undo more updates than necessary. We have not yet investigated how to avoid undoing too many updates that violate a numeric constraint.

VI. IMPLEMENTATION

We have implemented a CRR prototype, Crecto, on top of Ecto¹, a data mapping library for Elixir².

For every application-defined database table, we generate a CRR-augmented table. We also generate two additional tables in Crecto, $Ct$ for counters (Section IV-C) and $CtU$ for counter undo (Section V-A).

For every table, Ecto automatically generates a primary-key attribute which defaults to an auto-increment integer. Instead, we enforce the primary keys to be UUIDs to avoid collision of key values generated at different sites.

In the CRR-augmented tables, every attribute has an associated timestamp for the last update. We implemented the hybrid logical clock [20] as timestamps.

¹ https://github.com/elixir-ecto/ecto
² https://elixir-lang.org

In Ecto, applications interact with databases through a single Repo module. This allows us to localize our implementation efforts to a Crecto.Repo module on top of the Ecto.Repo module.

Since queries do not involve any CRR-layer relations, Crecto.Repo forwards all queries unchanged to Ecto.Repo.

A local update includes first update(s) of CRR-layer relation(s) and then a refresh of the cached relation at the AR layer. All these are wrapped in a single atomic transaction. Any constraint violation is caught at cache refreshment, which causes the transaction to be rolled back.

A site keeps outgoing messages in a queue. When the site is online, it sends the messages in the queue to remote sites. The queue is stored in a file so that it survives system crashes.

A merge of an incoming update also includes first update(s) of CRR-layer relation(s) and then a refresh of the cached relation at the AR layer. These are also wrapped in an atomic transaction.
Again, any constraint violation is caught at cache refreshment, which causes the transaction to be rolled back. When this happens, we undo an offending update (Section V). If we undo the incoming update, we generate an undo update, insert an entry in $CtU$ if the update is a counter increment or decrement, and send the generated undo update to remote sites. If we undo an update that has already been performed locally, we re-merge the incoming update after the undo of the already performed update.

For an incoming remote update, we can generate the undo update using the additional information attached in the message (Section V-A). For an already performed row insertion or deletion, we can generate the undo update with an increment of the causal length. For an already performed attribute update, we can generate the undo update in one of two ways. We can use the messages stored in the queue. Or we can simply wait. An undo update will eventually arrive, because this update will violate the same constraint at a remote site.

Performance

To study the performance of CRR, we implemented an advertisement-counter application [18,21] in Crecto. Advertisements are displayed on edge devices (such as mobile phones) according to some vendor contract, even when the devices are not online. The local counters of displayed advertisements are merged upstream in the cloud when the devices are online. An advertisement is disabled when it has reached a certain level of impression (total displayed number).

We are mainly interested in the latency of database updates, which is primarily dependent on disk IOs. Table II shows the latency measured at a Lenovo ThinkPad T540p laptop, configured with an Intel quad-core i7-4710MQ CPU, 16 GiB RAM, 512 GiB SSD, running Linux 5.4. The Crecto prototype is implemented on Ecto 3.2 in Elixir 1.9.4 OTP 22, deployed with PostgreSQL³ 12.1.

³ https://www.postgresql.org

To understand the extra overhead of multi-synchronous support, we measured the latency of the updates of the application implemented in both Ecto and Crecto. Table II only gives an approximate indication, since the latency depends on the load of the system as well as the sizes of the tables. For example, the measured latency of counter updates actually varied from 1.1 ms to 3.8 ms to get an average of 2.8 ms. Notice also that whether an update is wrapped in a transaction also makes a difference. We thus include the measured latency of Ecto updates that are either included (In-tx) or not included (No-tx) in transactions.

TABLE II. LATENCY OF DATABASE UPDATES (IN MS)

| Ecto | insertion | deletion | update |
|---|---|---|---|
| No-tx | 0.9 | 0.9 | 0.9 |
| In-tx | 1.5 | 1.5 | 1.5 |

| Crecto | insertion | deletion | LWW | counter |
|---|---|---|---|---|
| Local | 2.1 | 2.1 | 2.1 | 2.8 |
| Merge | 2.1 | 2.1 | 2.1 | 2.8 |

We expected that the latency of the updates in Crecto would be 2–3 times higher than that of the corresponding updates in Ecto, since handling an update request involves two or three updates in the two layers. The measured latency is less than that. One explanation is that several updates may share some common data-mapping overhead.
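Crecto's update timestamps are hybrid logical clocks. As a rough sketch of the algorithm from [20] (not Crecto's Elixir code; millisecond physical time is an assumption here), the clock keeps a logical part l and a counter c:

```python
import time

class HLC:
    """Hybrid logical clock sketch after Kulkarni et al. [20]."""
    def __init__(self):
        self.l = 0   # logical time: highest physical time seen so far
        self.c = 0   # counter ordering events that share the same l

    def _pt(self):
        return int(time.time() * 1000)

    def now(self):
        """Timestamp a local or send event."""
        pt = self._pt()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def recv(self, lm, cm):
        """Timestamp a receive event carrying remote clock (lm, cm)."""
        pt, l0 = self._pt(), self.l
        self.l = max(l0, lm, pt)
        if self.l == l0 and self.l == lm:
            self.c = max(self.c, cm) + 1
        elif self.l == l0:
            self.c += 1
        elif self.l == lm:
            self.c = cm + 1
        else:
            self.c = 0
        return (self.l, self.c)
```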
VII. RELATED WORK

CRDTs for multi-synchronous data access

One important feature of local-first software [10] is multi-synchronous data access. In particular, data should be first stored on user devices and be always immediately accessible. Current prototypes of local-first software reported in [10] are implemented using JSON CRDTs [9]. There is no support for RDB features such as SQL-like queries and integrity constraints.

The work presented in [22] aims at low latency of data access at the edge. It uses hybrid logical clocks [20] to detect concurrent operations and operational transformation [23] to resolve conflicts. In particular, it resolves conflicting attribute updates as LWW-registers. It requires a central server in the cloud to disseminate operations, and the data model is restricted to the MongoDB⁴ model.

⁴ https://www.mongodb.com

Lasp [18,21] is a programming model for large-scale distributed programming. One of its key features is to allow local copies of replicated state to change during periods without network connectivity. It is based on the ORSet CRDT [12] and provides functional programming primitives such as map, filter, product and union.

CRDTs for geo-replicated data stores

Researchers have applied CRDTs to achieve low latency for data access in geo-replicated databases. Unlike multi-synchronous access, such systems require all or a quorum of the replicas to be available.

RedBlue consistency [11] allows eventual consistency for blue operations and guarantees strong consistency for red operations. In particular, it applies CRDTs for blue operations and globally serializes red operations.

[24] presents an approach to support global boundary constraints for counter CRDTs. For a bounded counter, the "rights" to increment and decrement the counter value are allocated to the replicas. A replica can only update the counter using its allocated right. It can "borrow" rights from other replicas when its allocated right is insufficient. This approach requires that replicas with sufficient rights be available in order to successfully perform an update.

Our work does not support global transactions [11] or global boundary constraints [24]. It is known [25] that it is impossible to support certain global constraints without synchronization and coordination among different sites. We repair asynchronously temporary violations of global constraints through undo. Support of undo for row insertion and deletion in CRR is straightforward, as CLSet (Section III-C) is a specialization of a generic undo support for CRDTs [19].

AntidoteSQL [26] provides a SQL interface to a geo-replicated key-value store that adopts CRDT approaches including those reported in [11,24]. An application can declare certain rules for resolving conflicting updates. Similar to our work, an application can choose between LWW-register and counter CRDTs for conflicting attribute updates. In AntidoteSQL, a deleted record is associated with a "deleted" flag, hence excluding the possibility for the record to be inserted back again. As AntidoteSQL is built on a key-value store, it does not benefit from the maturity of the RDB industry.

Set CRDTs

The CLSet CRDT (Section III) is based on our earlier work on undo support for state-based CRDTs [19]. Obviously, deletion (insertion) of an element can be regarded as an undo of a previous insertion (deletion) of the element.

As discussed in Section III-A, there exist CRDTs for general-purpose sets [12,15,17,18]. The meta data that existing set CRDTs associate with each element are much larger than the single number (causal length) in our CLSet CRDT. In [27], we compare CLSet with existing set CRDTs in more detail.

VIII. CONCLUSION

We presented a two-layer system for multi-synchronous access to relational databases in a cloud-edge environment. The underlying CRR layer uses CRDTs to allow immediate data access at the edge and to guarantee data convergence when the edge devices are online. It also resolves violations of integrity constraints at merge by undoing offending updates.
A key enabling contribution is a new set CRDT based on causal lengths. Applications access databases through the AR layer with little or no concern about the underlying mechanism. We implemented a prototype on top of a data mapping library that is independent of any specific DBMS. Therefore servers in the cloud and edge devices can deploy different DBMSs.

IX. ACKNOWLEDGMENT

The authors would like to thank the members of the COAST team at Inria Nancy-Grand Est/Loria, in particular Victorien Elvinger, for inspiring discussions.

REFERENCES

[1] A. Fox and E. A. Brewer, "Harvest, yield and scalable tolerant systems," in The Seventh Workshop on Hot Topics in Operating Systems, 1999, pp. 174–178.
[2] S. Gilbert and N. Lynch, "Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services," SIGACT News, vol. 33, no. 2, pp. 51–59, 2002.
[3] M. Shapiro, N. M. Preguiça, C. Baquero, and M. Zawirski, "Conflict-free replicated data types," in 13th International Symposium on Stabilization, Safety, and Security of Distributed Systems (SSS 2011), 2011, pp. 386–400.
[4] R. Brown, S. Cribbs, C. Meiklejohn, and S. Elliott, "Riak DT map: a composable, convergent replicated dictionary," in The First Workshop on the Principles and Practice of Eventual Consistency, 2014, pp. 1–1.
[5] G. Oster, P. Urso, P. Molli, and A. Imine, "Data consistency for P2P collaborative editing," in CSCW. ACM, 2006, pp. 259–268.
[6] N. M. Preguiça, J. M. Marquès, M. Shapiro, and M. Letia, "A commutative replicated data type for cooperative editing," in ICDCS, 2009, pp. 395–403.
[7] W. Yu, L. André, and C.-L. Ignat, "A CRDT supporting selective undo for collaborative text editing," in DAIS, 2015, pp. 193–206.
[8] M. Nicolas, V. Elvinger, G. Oster, C.-L. Ignat, and F. Charoy, "MUTE: A peer-to-peer web-based real-time collaborative editor," in ECSCW Panels, Demos and Posters, 2017.
[9] M. Kleppmann and A. R. Beresford, "A conflict-free replicated JSON datatype," IEEE Trans. Parallel Distrib. Syst., vol. 28, no. 10, pp. 2733–2746, 2017.
[10] M. Kleppmann, A. Wiggins, P. van Hardenberg, and M. McGranaghan, "Local-first software: you own your data, in spite of the cloud," in Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software (Onward! 2019), 2019, pp. 154–178.
[11] C. Li, D. Porto, A. Clement, J. Gehrke, N. M. Preguiça, and R. Rodrigues, "Making geo-replicated systems fast as possible, consistent when necessary," in 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2012, pp. 265–278.
[12] M. Shapiro, N. M. Preguiça, C. Baquero, and M. Zawirski, "A comprehensive study of convergent and commutative replicated data types," Rapport de recherche, vol. 7506, January 2011.
[13] P. Johnson and R. Thomas, "The maintenance of duplicate databases," Internet Request for Comments RFC 677, January 1976.
[14] V. K. Garg, Introduction to Lattice Theory with Computer Science Applications. Wiley, 2015.
[15] P. S. Almeida, A. Shoker, and C. Baquero, "Delta state replicated data types," J. Parallel Distrib. Comput., vol. 111, pp. 162–173, 2018.
[16] V. Enes, P. S. Almeida, C. Baquero, and J. Leitão, "Efficient synchronization of state-based CRDTs," in IEEE 35th International Conference on Data Engineering (ICDE), April 2019.
[17] A. Bieniusa, M. Zawirski, N. M. Preguiça, M. Shapiro, C. Baquero, V. Balegas, and S. Duarte, "An optimized conflict-free replicated set," Rapport de recherche, vol. 8083, October 2012.
[18] C. Meiklejohn and P. van Roy, "Lasp: a language for distributed, coordination-free programming," in the 17th International Symposium on Principles and Practice of Declarative Programming, 2015, pp. 184–195.
[19] W. Yu, V. Elvinger, and C.-L. Ignat, "A generic undo support for state-based CRDTs," in 23rd International Conference on Principles of Distributed Systems (OPODIS 2019), ser. LIPIcs, vol. 153, 2020, pp. 14:1–14:17.
[20] S. S. Kulkarni, M. Demirbas, D. Madappa, B. Avva, and M. Leone, "Logical physical clocks," in Principles of Distributed Systems (OPODIS), ser. LNCS, vol. 8878. Springer, 2014, pp. 17–32.
[21] C. S. Meiklejohn, V. Enes, J. Yoo, C. Baquero, P. van Roy, and A. Bieniusa, "Practical evaluation of the Lasp programming model at large scale: an experience report," in the 19th International Symposium on Principles and Practice of Declarative Programming, 2017, pp. 109–114.
[22] D. Mealha, N. Preguiça, M. C. Gomes, and J. Leitão, "Data replication on the cloud/edge," in Proceedings of the 6th Workshop on the Principles and Practice of Consistency for Distributed Data (PaPoC), 2019, pp. 7:1–7:7.
[23] C. A. Ellis and S. J. Gibbs, "Concurrency control in groupware systems," in SIGMOD. ACM, 1989, pp. 399–407.
[24] V. Balegas, D. Serra, S. Duarte, C. Ferreira, M. Shapiro, R. Rodrigues, and N. M. Preguiça, "Extending eventually consistent cloud databases for enforcing numeric invariants," in 34th IEEE Symposium on Reliable Distributed Systems (SRDS), 2015, pp. 31–36.
[25] P. Bailis, A. Fekete, M. J. Franklin, A. Ghodsi, J. M. Hellerstein, and I. Stoica, "Coordination avoidance in database systems," Proc. VLDB Endow., vol. 8, no. 3, pp. 185–196, 2014.
[26] P. Lopes, J. Sousa, V. Balegas, C. Ferreira, S. Duarte, A. Bieniusa, R. Rodrigues, and N. M. Preguiça, "Antidote SQL: relaxed when possible, strict when necessary," CoRR, vol. abs/1902.03576, 2019.
[27] W. Yu and S. Rostad, "A low-cost set CRDT based on causal lengths," in Proceedings of the 7th Workshop on the Principles and Practice of Consistency for Distributed Data (PaPoC), 2020, pp. 5:1–5:6.
# Document Title
Pondering a Monorepo Version Control System

Monorepos have desirable features, but git is the wrong version control system (VCS) to realize one. Here I document what the right one would look like.

Why Monorepo?

The idea of a monorepo is to put "everything" into a single version control system. This is in contrast to multirepo, where developers regularly use multiple repositories. I don't have a more precise definition.

Google uses a monorepo. In 2016 Google had an 86 TB repository and eventually developed a custom version control system for that. They report 30,000 commits per day and up to 800,000 file read queries per second. Microsoft has a 300 GB git repository which requires significant changes and extensions to normal git. Facebook is doing something similar with Mercurial.

An implication of this monorepo approach is that it usually contains packages for multiple programming languages. So you cannot rely on language-specific tools like pip or maven. Bazel seems to be a fitting and advanced build system for this, which is unsurprising given that it stems from Google's internal "Blaze". For more tooling considerations, there is an Awesome Monorepo list.

Often, people try to store all dependencies in the monorepo as well. This might include tools like compilers and IDEs but probably not the operating system. The goal is reproducibility, and external dependencies should vanish.

If you want to read more discussions on monorepos, read Advantages of monorepos by Dan Luu and browse All You Always Wanted to Know About Monorepo But Were Afraid to Ask.

Most of the arguments for and against monorepos are strawman rants in my opinion. A monorepo does not guarantee a "single lint, build, test and release process", for example. You can have chaos in a monorepo and you can have order with multirepos. This is a question of process and not of repository structure.

There is only one advantage: in a monorepo you can do an atomic change across everything. This is what enables you to change the API and update all users in a single commit. With multirepo you have an inherent race condition, and eventually there will be special tooling around this fact. However, this "atomic change across everything" also requires special tooling eventually. Google invests heavily in the clang ecosystem for this reason. Nothing is for free.

That said, let's assume for now we want to go for a monorepo.

Why not git?

If you talk about version control systems these days, people usually think about git. Everybody wants to use it despite git's well-known flaws and its UI, which does not even try to hide implementation details. In the context of monorepos, the relevant flaw is that git scales poorly beyond some gigabytes of data.

Git needs LFS or Annex to deal with large binary files. Plenty of git operations (e.g. git status) check every file in the checkout; walking the whole checkout is not feasible for big monorepos. To lessen some pain, git can download only a limited part of the history (shallow clone) and show only parts of the repository (sparse checkout). Still, there is no way around the fact that you need a full copy of the current HEAD on your disk. This is not feasible for big monorepos either.

As an alternative, I would suggest Subversion. The Apache Foundation has a big Subversion repository which contains, among others, OpenOffice, Hadoop, Maven, httpd, CouchDB, Zookeeper, Tomcat, and Xerces. Of course, Subversion has flaws itself, and its development is very slow these days. For example, merging is painful.
However, Google does not have branches in its monorepo. Instead, they use feature toggles. Maybe branches are conceptually wrong for big monorepos and we should not even try?

I considered creating a new VCS because I see an open niche there which none of the Open Source VCSs can fill. One central insight is that a client will probably never look at every single file, so we must avoid any need to download a full copy. Working on the repository will be more like working on a network file system. Thus we lose the ability to work offline as a tradeoff.

Integrated build system

I already wrote about merging version control and build system. In the discussions on that, I learned about proprietary solutions like Rational ClearCase and Vesta SCM. Here is the summary of these thoughts:

We already committed to storing the repo on the network. The build system also wants to store artifacts on the network, so why not store them in the repo as well? It implies that the VCS knows which files are generated.

We also committed to putting all tooling in the repo for the sake of reproducibility. Thus, the build system can be simple. There is no need for a configure step because there is no relevant environment outside of the repo.

Now consider the fact that no single machine might be able to generate all the artifacts. Some might require a Windows or Linux machine. Some might require special hardware. However, the artifact generation is reproducible, so it does not matter which client does it. We might as well integrate continuous integration (CI) jobs into the system.

Imagine this: you edit the code and commit a change. You already ran unit tests locally, so these artifacts are already there. During the next minutes you see more artifacts pop up which depended on your changes. These artifacts might be a Windows and a Linux and an OS X build. All these artifacts are naturally part of the version you committed, so if you switch to a different branch, the artifacts change automatically. There is no explicit back and forth with a CI system. Instead, the version-control-and-build-system just fills in the missing generated files for a new committed version.

To implement this, we need a notification mechanism in our version control system. We still need special clients which continuously wait for new commits and generate artifacts. The VCS must manage these clients and propagate artifacts as needed. Certainly, this is very much beyond the usual job of a VCS.

More Design Aspects

Since we want to store "everything" in the repo, we also want non-technical people to use it. It already resembles a network file system, so it should provide an interface nearly as easy to use. We want to enable designers to store Photoshop files in there. Managers should store Excel and Powerpoint files in there. I say "nearly" because we need additional concepts like versions and a commit mechanism which cannot be hidden from users.

The wide range of users and security concerns require an access control mechanism with fine granularity (per file/directory). Since big monorepos exist in big organizations, it naturally must be integrated into the surrounding system (LDAP, Exchange, etc.).

Monorepo is not for Open Source

The VCS described above sounds great for big companies and many big projects. However, the Open Source world consists of small independent projects which are more loosely coupled. This loose coupling provides a certain resilience and diversity.
While a monorepo allows you to atomically change something everywhere, it also forces you to do it to some degree. Looser coupling means more flexibility on when you update dependencies, for example. The tradeoff is the inevitable chaos, and you wonder if we really need so many build systems and scripting languages.

Open Source projects usually start with a single developer and no long-term goals. For that use case git is perfect. Maybe certain big projects may benefit. For example, would Gnome adopt this? Even there, it seems the partition into multiple git repos works well enough.

Why not build a Monorepo VCS?

OSS projects have no need for it. Small companies are fine with git, because their repositories are small enough. Essentially, this is a product for enterprises.

It makes no sense to build it as Open Source software in my spare time. It would be a worthwhile startup idea, but I have a family and no need to take the risk right now, and my current job is interesting enough. It is not a small project that can be done on the side at work. This is why I will not build it in the foreseeable future, although it is a fascinating technical challenge. Maybe someone else can, so I documented my thoughts here and can hopefully focus on other stuff now. Maybe Google is already building it.

Some discussion on lobste.rs.

The alternative for big companies is package management, in my opinion. I looked around, and ZeroInstall seems to be the only usable cross-platform language-independent package manager. Of course, it cares only about distributing artifacts. Its distributed approach provides a lot of flexibility on how you generate the artifacts, which can be an advantage.

Also, Amazon's build system provides valuable insights for manyrepo environments.

© 2019-11-07
# Document Title
1.0 Don't Stress!

The feature sets of Fossil and Git overlap in many ways. Both are distributed version control systems which store a tree of check-in objects to a local repository clone. In both systems, the local clone starts out as a full copy of the remote parent. New content gets added to the local clone and then later optionally pushed up to the remote, and changes to the remote can be pulled down to the local clone at will. Both systems offer diffing, patching, branching, merging, cherrypicking, bisecting, private branches, a stash, etc.

Fossil has inbound and outbound Git conversion features, so if you start out using one DVCS and later decide you like the other better, you can easily move your version-controlled file content.¹

In this document, we set all of that similarity and interoperability aside and focus on the important differences between the two, especially those that impact the user experience.

Keep in mind that you are reading this on a Fossil website, and though we try to be fair, the information here might be biased in favor of Fossil, if only because we spend most of our time using Fossil, not Git. Ask around for second opinions from people who have used both Fossil and Git.

2.0 Differences Between Fossil And Git

Differences between Fossil and Git are summarized by the following table, with further description in the text that follows.

| GIT | FOSSIL |
| File versioning only | VCS, tickets, wiki, docs, notes, forum, UI, RBAC |
| Sprawling, incoherent, and inefficient | Self-contained and efficient |
| Ad-hoc pile-of-files key/value database | The most popular database in the world |
| Portable to POSIX systems only | Runs just about anywhere |
| Bazaar-style development | Cathedral-style development |
| Designed for Linux kernel development | Designed for SQLite development |
| Many contributors | Select contributors |
| Focus on individual branches | Focus on the entire tree of changes |
| One check-out per repository | Many check-outs per repository |
| Remembers what you should have done | Remembers what you actually did |
| SHA-2 | SHA-3 |

2.1 Featureful

Git provides file versioning services only, whereas Fossil adds an integrated wiki, ticketing & bug tracking, embedded documentation, technical notes, and a web forum, all within a single nicely-designed skinnable web UI, protected by a fine-grained role-based access control system. These additional capabilities are available for Git as 3rd-party add-ons, but with Fossil they are integrated into the design. One way to describe Fossil is that it is "GitHub-in-a-box."

For developers who choose to self-host projects (rather than using a 3rd-party service such as GitHub) Fossil is much easier to set up, since the stand-alone Fossil executable together with a 2-line CGI script suffice to instantiate a full-featured developer website. To accomplish the same using Git requires locating, installing, configuring, integrating, and managing a wide assortment of separate tools. Standing up a developer website using Fossil can be done in minutes, whereas doing the same using Git requires hours or days.

Fossil is small, complete, and self-contained. If you clone Git's self-hosting repository, you get just Git's source code.
If you clone Fossil's self-hosting repository, you get the entire Fossil website — source code, documentation, ticket history, and so forth.² That means you get a copy of this very article and all of its historical versions, plus the same for all of the other public content on this site.

2.2 Efficient

Git is actually a collection of many small tools, each doing one small part of the job, which can be recombined (by experts) to perform powerful operations. Git has a lot of complexity and many dependencies, so that most people end up installing it via some kind of package manager, simply because the creation of complicated binary packages is best delegated to people skilled in their creation. Normal Git users are not expected to build Git from source and install it themselves.

Fossil is a single self-contained stand-alone executable with hardly any dependencies. Fossil can be run inside a minimally configured chroot jail, from a Windows memory stick, off a Raspberry Pi with a tiny SD card, etc. To install Fossil, one merely puts the executable somewhere in the $PATH. Fossil is straightforward to build and install, so that many Fossil users do in fact build and install "trunk" versions to get new features between formal releases.

Some say that Git more closely adheres to the Unix philosophy, summarized as "many small tools, loosely joined," but we have many examples of other successful Unix software that violates that principle to good effect, from Apache to Python to ZFS. We can infer from that that this is not an absolute principle of good software design. Sometimes "many features, tightly-coupled" works better. What actually matters is effectiveness and efficiency. We believe Fossil achieves this.

Git fails on efficiency once you add to it all of the third-party software needed to give it a Fossil-equivalent feature set. Consider GitLab, a third-party extension to Git wrapping it in many features, making it roughly Fossil-equivalent, though much more resource hungry and hence more costly to run than the equivalent Fossil setup. GitLab's basic requirements are easy to accept when you're dedicating a local rack server or blade to it, since its minimum requirements are more or less a description of the smallest thing you could call a "server" these days, but when you go to host that in the cloud, you can expect to pay about 8⨉ as much to comfortably host GitLab as for Fossil.³ This difference is largely due to basic technology choices: Ruby and PostgreSQL vs C and SQLite.

The Fossil project itself is hosted on a very small VPS, and we've received many reports on the Fossil forum about people successfully hosting Fossil service on bare-bones $5/month VPS hosts, spare Raspberry Pi boards, and other small hosts.

2.3 Durable

The baseline data structures for Fossil and Git are the same, modulo formatting details. Both systems manage a directed acyclic graph (DAG) of Merkle tree / block chain structured check-in objects. Check-ins are identified by a cryptographic hash of the check-in contents, and each check-in refers to its parent via its hash.

The difference is that Git stores its objects as individual files in the .git folder or compressed into bespoke pack-files, whereas Fossil stores its objects in a SQLite database file using a hybrid NoSQL/relational data model of the check-in history. Git's data storage system is an ad-hoc pile-of-files key/value database, whereas Fossil uses a proven, heavily-tested, general-purpose, durable SQL database. This difference is more than an implementation detail.
It has important practical consequences.

With Git, one can easily locate the ancestors of a particular check-in by following the pointers embedded in the check-in object, but it is difficult to go the other direction and locate the descendants of a check-in. It is so difficult, in fact, that neither native Git nor GitHub provide this capability short of groveling the commit log. With Git, if you are looking at some historical check-in then you cannot ask "What came next?" or "What are the children of this check-in?"

Fossil, on the other hand, parses essential information about check-ins (parents, children, committers, comments, files changed, etc.) into a relational database that can be easily queried using concise SQL statements to find both ancestors and descendants of a check-in. This is the hybrid data model mentioned above: Fossil manages your check-in and other data in a NoSQL block chain structured data store, but that's backed by a set of relational lookup tables for quick indexing into that artifact store. (See "Thoughts On The Design Of The Fossil DVCS" for more details.)

Leaf check-ins in Git that lack a "ref" become "detached," making them difficult to locate and subject to garbage collection. This detached head state problem has caused untold grief for countless Git users. With Fossil, detached heads are simply impossible because we can always find our way back into the block chain using one or more of the relational indices it automatically manages for you.

This design difference shows up in several other places within each tool. It is why Fossil's timeline is generally more detailed yet more clear than those available in Git front-ends. (Contrast this Fossil timeline with its closest equivalent in GitHub.) It's why there is no inverse of the cryptic @~ notation in Git, meaning "the parent of HEAD," which Fossil simply calls "prev", but there is a "next" special check-in name in Fossil. It is why Fossil has so many built-in status reports to help maintain situational awareness, aid comprehension, and avoid errors.

These differences are due, in part, to Fossil starting a year later than Git: we were able to learn from its key design mistakes.

2.4 Portable

Fossil is largely written in ISO C, almost purely conforming to the original 1989 standard. We make very little use of C99, and we do not knowingly make any use of C11. Fossil does call POSIX and Windows APIs where necessary, but it's about as portable as you can ask given that ISO C doesn't define all of the facilities Fossil needs to do its thing. (Network sockets, file locking, etc.) There are certainly well-known platforms Fossil hasn't been ported to yet, but that's most likely due to lack of interest rather than inherent difficulties in doing the port. We believe the most stringent limit on its portability is that it assumes at least a 32-bit CPU and several megs of flat-addressed memory.⁴ Fossil isn't quite as portable as SQLite, but it's close.

Over half of the C code in Fossil is actually an embedded copy of the current version of SQLite. Much of what is Fossil-specific after you set SQLite itself aside is SQL code calling into SQLite. The number of lines of SQL code in Fossil isn't large by percentage, but since SQL is such an expressive, declarative language, it has an outsized contribution to Fossil's user-visible functionality.
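To make the flavor of those concise queries concrete, here is a minimal sketch, not code from Fossil itself. It assumes the rusqlite crate and Fossil's parent/child link table, referred to here as plink(pid, cid); treat those schema details as an assumption rather than a documented contract. Because the repository is a single SQLite file, "find all descendants of a check-in" reduces to one recursive query:

```rust
// Hedged sketch: querying a Fossil repository file directly with SQLite.
// Assumes the `rusqlite` crate and a `plink` table of (pid, cid)
// parent/child check-in links; the exact schema is an assumption here.
use rusqlite::{params, Connection, Result};

/// Return the record IDs of every descendant of check-in `rid` by
/// walking the parent->child link table with a recursive CTE.
fn descendants(repo: &Connection, rid: i64) -> Result<Vec<i64>> {
    let mut stmt = repo.prepare(
        "WITH RECURSIVE d(rid) AS (
             VALUES (?1)
             UNION
             SELECT plink.cid FROM plink JOIN d ON plink.pid = d.rid
         )
         SELECT rid FROM d WHERE rid <> ?1",
    )?;
    let rows = stmt.query_map(params![rid], |row| row.get(0))?;
    rows.collect()
}

fn main() -> Result<()> {
    // A Fossil repository is just a SQLite database file on disk.
    let repo = Connection::open("project.fossil")?;
    for rid in descendants(&repo, 1)? {
        println!("descendant check-in: {rid}");
    }
    Ok(())
}
```

Answering the same "what came next?" question in native Git means walking the whole commit graph from every ref, since child links are not stored anywhere.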
Fossil isn't entirely C and SQL code. Its web UI uses JavaScript where necessary.⁵ The server-side UI scripting uses a custom minimal Tcl dialect called TH1, which is embedded into Fossil itself. Fossil's build system and test suite are largely based on Tcl.⁶ All of this is quite portable.

About half of Git's code is POSIX C, and about a third is POSIX shell code. This is largely why the so-called "Git for Windows" distributions (both first-party and third-party) are actually an MSYS POSIX portability environment bundled with all of the Git stuff, because it would be too painful to port Git natively to Windows. Git is a foreign citizen on Windows, speaking to it only through a translator.⁷

While Fossil does lean toward POSIX norms when given a choice — LF-only line endings are treated as first-class citizens over CR+LF, for example — the Windows build of Fossil is truly native.

The third-party extensions to Git tend to follow this same pattern. GitLab isn't portable to Windows at all, for example. For that matter, GitLab isn't even officially supported on macOS, the BSDs, or uncommon Linuxes! We have many users who regularly build and run Fossil on all of these systems.

2.5 Linux vs. SQLite

Fossil and Git promote different development styles because each one was specifically designed to support the creator's main software development project: Linus Torvalds designed Git to support development of the Linux kernel, and D. Richard Hipp designed Fossil to support the development of SQLite. Both projects must rank high on any objective list of "most important FOSS projects," yet these two projects are almost entirely unlike one another, so it is natural that the DVCSes created to support these projects also differ in many ways.

In the following sections, we will explain how four key differences between the Linux and SQLite software development projects dictated the design of each DVCS's low-friction usage path.

When deciding between these two DVCSes, you should ask yourself, "Is my project more like Linux or more like SQLite?"

2.5.1 Development Organization

Eric S. Raymond's seminal essay-turned-book "The Cathedral and the Bazaar" details the two major development organization styles found in FOSS projects. As it happens, Linux and SQLite fall on opposite sides of this dichotomy. Differing development organization styles dictate a different design and low-friction usage path in the tools created to support each project.

Git promotes the Linux kernel's bazaar development style, in which a loosely-associated mass of developers contribute their work through a hierarchy of lieutenants who manage and clean up these contributions for consideration by Linus Torvalds, who has the power to cherrypick individual contributions into his version of the Linux kernel. Git allows an anonymous developer to rebase and push specific locally-named private branches, so that a Git repo clone often isn't really a clone at all: it may have an arbitrary number of differences relative to the repository it originally cloned from. Git encourages siloed development. Select work in a developer's local repository may remain private indefinitely.

All of this is exactly what one wants when doing bazaar-style development.

Fossil's normal mode of operation differs on every one of these points, with the specific designed-in goal of promoting SQLite's cathedral development model:

Personal engagement: SQLite's developers know each other by name and work together daily on the project.

Trust over hierarchy: SQLite's developers check changes into their local repository, and these are immediately and automatically sync'd up to the central repository; there is no "dictator and lieutenants" hierarchy as with Linux kernel contributions.
D. Richard Hipp rarely overrides decisions made by those he has trusted with commit access on his repositories. Fossil allows you to give some users more power over what they can do with the repository, but Fossil does not otherwise directly support the enforcement of a development organization's social and power hierarchies. Fossil is a great fit for flat organizations.

No easy drive-by contributions: Git pull requests offer a low-friction path to accepting drive-by contributions. Fossil's closest equivalent is its unique bundle feature, which requires higher engagement than firing off a PR.⁸ This difference comes directly from the initial designed purpose for each tool: the SQLite project doesn't accept outside contributions from previously-unknown developers, but the Linux kernel does.

No rebasing: When your local repo clone syncs changes up to its parent, those changes are sent exactly as they were committed locally. There is no rebasing mechanism in Fossil, on purpose.

Sync over push: Explicit pushes are uncommon in Fossil-based projects: the default is to rely on autosync mode instead, in which each commit syncs immediately to its parent repository. Autosync is a mode, so you can turn it off temporarily when needed, such as when working offline. Fossil is still a truly distributed version control system; it's just that its starting default is to assume you're rarely out of communication with the parent repo.

This is not merely a reflection of modern always-connected computing environments. It is a conscious decision in direct support of SQLite's cathedral development model: we don't want developers going dark, then showing up weeks later with a massive bolus of changes for us to integrate all at once. Jim McCarthy put it well in his book on software project management, Dynamics of Software Development: "Beware of a guy in a room."

Branch names sync: Unlike in Git, branch names in Fossil are not purely local labels. They sync along with everything else, so everyone sees the same set of branch names. Fossil's design choice here is a direct reflection of the Linux vs. SQLite project outlook: SQLite's developers collaborate closely on a single coherent project, whereas Linux's developers go off on tangents and occasionally sync changes up with each other.

Private branches are rare: Private branches exist in Fossil, but they're normally used to handle rare exception cases, whereas in many Git projects, they're part of the straight-line development process.

Identical clones: Fossil's autosync system tries to keep local clones identical to the repository it cloned from.

Where Git encourages siloed development, Fossil fights against it. Fossil places a lot of emphasis on synchronizing everyone's work and on reporting on the state of the project and the work of its developers, so that everyone — especially the project leader — can maintain a better mental picture of what is happening, leading to better situational awareness.

Each DVCS can be used in the opposite style, but doing so works against their low-friction paths.

2.5.2 Scale

The Linux kernel has a far bigger developer community than that of SQLite: there are thousands and thousands of contributors to Linux, most of whom do not know each other's names. These thousands are responsible for producing roughly 89⨉ more code than is in SQLite. (10.7 MLOC vs. 0.12 MLOC according to SLOCCount.)
The Linux kernel and its development process were already uncommonly large back in 2005 when Git was designed, specifically to support the consequences of having such a large set of developers working on such a large code base.

95% of the code in SQLite comes from just four programmers, and 64% of it is from the lead developer alone. The SQLite developers know each other well and interact daily. Fossil was designed for this development model.

We think you should ask yourself whether you have Linus Torvalds scale software configuration management problems or D. Richard Hipp scale problems when choosing your DVCS. An automotive air impact wrench running at 8000 RPM driving an M8 socket-cap bolt at 16 cm/s is not the best way to hang a picture on the living room wall.

2.5.3 Accepting Contributions

As of this writing, Git has received about 4.5⨉ as many commits as Fossil, resulting in about 2.5⨉ as many lines of source code. The line count excludes tests and in-tree third-party dependencies. It does not exclude the default GUI for each, since it's integral for Fossil, so we count the size of gitk in this.

It is obvious that Git is bigger in part because of its first-mover advantage, which resulted in a larger user community, which results in more contributions. But is that the only reason? We believe there are other relevant differences that also play into this which fall out of the "Linux vs. SQLite" framing: licensing, community structure, and how we react to drive-by contributions. In brief, it's harder to get a new feature into Fossil than into Git.

A larger feature set is not necessarily a good thing. Git's command line interface is famously arcane. Masters of the arcane are able to do wizardly things, but only by studying their art deeply for years. This strikes us as a good thing only in cases where use of the tool itself is the primary point of that user's work.

Almost no one uses a DVCS for its own sake; very few people get paid specifically in order to drive a DVCS. We use DVCSes as a tool to support some other effort, so we do not necessarily want the DVCS with the most features. We want a DVCS with easily internalized behavior so we can thoroughly master it despite spending only a small fraction of our working time thinking about the DVCS. We want to pick the tool up, use it quickly, and then set it aside in order to get back to our actual job as quickly as possible.

Professional software developers in particular are prone to focusing on feature set sizes when choosing tools because this is sometimes a highly important consideration. They spend all day, every day, in their favorite text editors, and time they spend learning all of the arcana of their favorite programming languages is well-spent. Skills with these tools are direct productivity drivers, which in turn directly drives how much money a developer can make. (Or how much idle time they can afford to take, which amounts to the same thing.) But if you are a professional software developer, we want you to ask yourself a question: "How do I get paid more by mastering arcane features of my DVCS?"
Unless you have a good answer to that, you probably do not want to be choosing a DVCS based on how many arcane features it has.

The argument is similar for other types of users: if you are a hobbyist, how much time do you want to spend mastering your DVCS instead of on the hobby supported by use of that DVCS?

There is some minimal set of features required to achieve the purposes that drive our selection of a DVCS, but there is a level beyond which more features only slow us down while we're learning the tool, since we must plow through documentation on features we're not likely to ever use. When the number of features grows to the point where people of normal motivation cannot spend the time to master them all, the tool becomes less productive to use.

The core developers of the Fossil project achieve a balance between feature set size and ease of use by carefully choosing which users to give commit bits to, then in being choosy about which of the contributed feature branches to merge down to trunk. We say "no" to a lot of feature proposals.

The end result is that Fossil more closely adheres to the principle of least astonishment than Git does.

2.5.4 Individual Branches vs. The Entire Change History

Both Fossil and Git store history as a directed acyclic graph (DAG) of changes, but Git tends to focus more on individual branches of the DAG, whereas Fossil puts more emphasis on the entire DAG.

For example, the default "sync" behavior in Git is to only sync a single branch, whereas with Fossil the only sync option is to sync the entire DAG. Git commands, GitHub, and GitLab tend to show only a single branch at a time, whereas Fossil usually shows all parallel branches at once. Git has commands like "rebase" that help keep all relevant changes on a single branch, whereas Fossil encourages a style of many concurrent branches constantly springing into existence, undergoing active development in parallel for a few days or weeks, then merging back into the main line and disappearing.

This difference in emphasis arises from the different purposes of the two systems. Git focuses on individual branches, because that is exactly what you want for a highly-distributed bazaar-style project such as Linux. Linus Torvalds does not want to see every check-in by every contributor to Linux, as such extreme visibility does not scale well. But Fossil was written for the cathedral-style SQLite project with just a handful of active committers. Seeing all changes on all branches all at once helps keep the whole team up-to-date with what everybody else is doing, resulting in a more tightly focused and cohesive implementation.

2.6 One vs. Many Check-outs per Repository

A "repository" in Git is a pile-of-files in the .git subdirectory of a single check-out. The working check-out directory and the .git repository subdirectory are normally in the same directory within the filesystem.

With Fossil, a "repository" is a single SQLite database file that can be stored anywhere. There can be multiple active check-outs from the same repository, perhaps open on different branches or on different snapshots of the same branch. It is common in Fossil to switch branches with a "cd" command between two check-out directories rather than switching to another branch in place within a single working directory. Long-running tests or builds can be running in one check-out while changes are being committed in another.

From the start, Git has allowed symlinks to this .git directory from multiple working directories.
The git init command offers the --separate-git-dir option to set this up automatically. Then in version 2.5, Git added the "git-worktree" feature to provide a higher-level management interface atop this basic mechanism. Use of this more closely emulates Fossil's decoupling of repository and working directory, but the fact remains that it is far more common in Git usage to simply switch a single working directory among branches in place.

The main downside of that working style is that it invalidates all build objects created from files that change in switching between branches. When you have multiple working directories for a single repository, you can have a completely independent state in each working directory which is untouched by the "cd" command you use to switch among them.

There are also practical consequences of the way .git links work that make multiple working directories in Git not quite interchangeable, as they are in Fossil.

2.7 What you should have done vs. What you actually did

Git puts a lot of emphasis on maintaining a "clean" check-in history. Extraneous and experimental branches by individual developers often never make it into the main repository. And branches are often rebased before being pushed, to make it appear as if development had been linear. Git strives to record what the development of a project should have looked like had there been no mistakes.

Fossil, in contrast, puts more emphasis on recording exactly what happened, including all of the messy errors, dead-ends, experimental branches, and so forth. One might argue that this makes the history of a Fossil project "messy." But another point of view is that this makes the history "accurate." In actual practice, the superior reporting tools available in Fossil mean that the added "mess" is not a factor.

Like Git, Fossil has an amend command for modifying prior commits, but unlike in Git, this works not by replacing data in the repository, but by adding a correction record to the repository that affects how later Fossil operations present the corrected data. The old information is still there in the repository, it is just overridden from the amendment point forward. For extreme situations, Fossil adds the shunning mechanism, but it has strict limitations that prevent global history rewrites.

One commentator characterized Git as recording history according to the victors, whereas Fossil records history as it actually happened.

2.8 Hash Algorithm: SHA-3 vs SHA-2 vs SHA-1

Fossil started out using 160-bit SHA-1 hashes to identify check-ins, just as in Git. That changed in early 2017 when news of the SHAttered attack broke, demonstrating that SHA-1 collisions were now practical to create. Two weeks later, the creator of Fossil delivered a new release allowing a clean migration to 256-bit SHA-3 with full backwards compatibility to old SHA-1 based repositories.
Here in mid-2019, that feature is now in every OS and package repository known to include Fossil, so that the next release as of this writing (Fossil 2.10) will enforce SHA-3 hashes by default. This not only solves the SHAttered problem, it should prevent a recurrence for the foreseeable future. Only repositories created before the transition to Fossil 2 are still using SHA-1, and then only if the repository's maintainer chose not to switch them into SHA-3 mode some time over the past 2 years.

Meanwhile, the Git community took until August 2018 to announce their plan for solving the same problem by moving to SHA-256 (a variant of the older SHA-2 algorithm) and until February 2019 to release a version containing the change. It's looking like this will take years more to percolate through the community.

The practical impact of SHAttered on structured data stores like the one in Git and Fossil isn't clear, but you want to have your repositories moved over to a stronger hash algorithm before someone figures out how to make use of the weaknesses in the old one. Fossil's developers moved on this problem quickly and had a widely-deployed solution to it years ago.

3.0 Missing Features

Although there is a large overlap in capability between Fossil and Git, there are many areas where one system has a feature that is simply missing in the other. We covered most of those above, but there are a few remaining feature differences we haven't gotten to yet.

3.1 Features found in Fossil but missing from Git

The fossil all command

Fossil keeps track of all repositories and check-outs and allows operations over all of them with a single command. For example, in Fossil it is possible to request a pull of all repositories on a laptop from their respective servers, prior to taking the laptop off network. Or it is possible to do "fossil all changes" to see if there are any uncommitted changes that were overlooked prior to the end of the workday.

The fossil undo command

Whenever Fossil is told to modify the local checkout in some destructive way (fossil rm, fossil update, fossil revert, etc.) Fossil remembers the prior state and is able to return the local check-out directory to its prior state with a simple "fossil undo" command. You cannot undo a commit, since writes to the actual repository — as opposed to the local check-out directory — are more or less permanent, on purpose, but as long as the change is simply staged locally, Fossil makes undo easier than in Git.

3.2 Features found in Git but missing from Fossil

Rebase

Because of its emphasis on recording history exactly as it happened, rather than as we would have liked it to happen, Fossil deliberately does not provide a "rebase" command. One can rebase manually in Fossil, with sufficient perseverance, but it is not something that can be done with a single command.

Push or pull a single branch

The fossil push, fossil pull, and fossil sync commands do not provide the capability to push or pull individual branches. Pushing and pulling in Fossil is all or nothing. This is in keeping with Fossil's emphasis on maintaining a complete record and on sharing everything between all developers.

Asides and Digressions

Many things are lost in making a Git mirror of a Fossil repo due to limitations of Git relative to Fossil. GitHub adds some of these missing features to stock Git, but because they're not part of Git proper, exporting a Fossil repository to GitHub will still not include them; Fossil tickets do not become GitHub issues, for example.

The fossil-scm.org web site is actually hosted in several parts, so that it is not strictly true that "everything" on it is in the self-hosting Fossil project repo.
The web forum is hosted as a separate Fossil repo from the main Fossil self-hosting repo for administration reasons, and the Download page content isn't normally sync'd with a "fossil clone" command unless you add the "-u" option. (See "How the Download Page Works" for details.) There may also be some purely static elements of the web site served via D. Richard Hipp's own lightweight web server, althttpd, which is configured as a front end to Fossil running in CGI mode on these sites.

That estimate is based on pricing at Digital Ocean in mid-2019: Fossil will run just fine on the smallest instance they offer, at US $5/month, but the closest match to GitLab's minimum requirements among Digital Ocean's offerings currently costs $40/month.

This means you can give up waiting for Fossil to be ported to the PDP-11, but we remain hopeful that someone may eventually port it to z/OS.

We try to keep use of Javascript to a minimum in the web UI, and we always try to provide sensible fallbacks for those that run their browsers with Javascript disabled. Some features of the web UI simply won't run without Javascript, but the UI behavior does degrade gracefully.

"Why is there all this Tcl in and around Fossil?" you may ask. It is because D. Richard Hipp is a long-time Tcl user and contributor. SQLite started out as an embedded database for Tcl specifically. ([Reference]) When he then created Fossil to manage the development of SQLite, it was natural for him to use Tcl-based tools for its scripting, build system, test system, etc. It came full circle in 2011 when the Tcl and Tk projects moved from CVS to Fossil.

A minority of the pieces of the Git core software suite are written in other languages, primarily Perl, Python, and Tcl. (e.g. git-send-mail, git-p4, and gitk, respectively.) Although these interpreters are quite portable, they aren't installed by default everywhere, and on some platforms you can't count on them at all. (Not just Windows, but also the BSDs and many other non-Linux platforms.) This expands the dependency footprint of Git considerably. It is why the current Git for Windows distribution is 44.7 MiB but the current fossil.exe zip file for Windows is 2.24 MiB. Fossil is much smaller despite using a roughly similar amount of high-level scripting code because its interpreters are compact and built into Fossil itself.

Both Fossil and Git support patch(1) files, a common way to allow drive-by contributions, but it's a lossy contribution path for both systems. Unlike Git PRs and Fossil bundles, patch files collapse multiple checkins together, they don't include check-in comments, and they cannot encode changes made above the individual file content layer: you lose branching decisions, tag changes, file renames, and more when using patch files.
# Document Title
Part 5: Pseudo-edges
May 07, 2019

This is the fifth (and final planned) post in a series on some new ideas in version control. To start at the beginning, go here.

The goal of this post is to describe pseudo-edges: what they are, how to compute them efficiently, and how to update them efficiently upon small changes. To recall the important points from the last post:

- We (pretend, for now, that we) represent the state of the repository as a graph in memory: one node for every line, with directed edges that enforce ordering constraints between two lines. Each line has a flag that says whether it is deleted or not.
- The current output of the repository consists of just those nodes that are not deleted, and there is an ordering constraint between two nodes if there is a path in the graph between them, but note that the path is allowed to go through deleted nodes.
- Applying a patch to the repository is very efficient: the complexity of applying a patch is proportional to the number of changes it makes.
- Rendering the current output to a file is potentially very expensive: it requires traversing the entire graph, including nodes that are marked as deleted. To the extent we can, we'd like to reduce this complexity to the number of live nodes in the graph.

The main idea for solving this is to add "pseudo-edges" to the graph: for every path that connects two live nodes through a sequence of deleted nodes, add a corresponding edge to the graph. Once this is done, we can render the current output without traversing the deleted parts of the graph, because every ordering constraint that used to depend on some deleted parts is now represented by some pseudo-edge. Here's an example: the deleted nodes are in gray, and the pseudo-edge that they induce is the dashed arrow.

We haven't really solved anything yet, though: once we have the pseudo-edges, we can efficiently render the output, but how do we compute the pseudo-edges? The naive algorithm (look at every pair of live nodes, and check if they're connected by a path of deleted nodes) still depends on the number of deleted nodes. Clearly, what we need is some sort of incremental way to update the pseudo-edges.

Deferring pseudo-edges

The easiest way that we can reduce the amount of time required for computing pseudo-edges is simply to do it rarely. Specifically, remember that applying a patch can be very fast, and that pseudo-edges only need to be computed when outputting a file. So, obviously, we should only update the pseudo-edges when it's time to actually output the file. This sounds trivial, but it can actually be significant. Imagine, for example, that you're cloning a repository that has a long history; let's say it has n patches, each of which has a constant size, and let's assume that computing pseudo-edges takes time O(m), where m is the size of the history. Cloning a repository involves downloading all of those patches, and then applying them one-by-one. If we recompute the pseudo-edges after every patch application, the total amount of time required to clone the repository is O(n^2); if we apply all the patches first and only compute the pseudo-edges at the end, the total time is O(n).

You can see how ojo implements this deferred pseudo-edge computation here: first, it applies all of the patches; then it recomputes the pseudo-edges.
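As a minimal sketch of where the cost goes (with stand-in types and method names, not ojo's actual API), the clone loop defers all pseudo-edge work to a single final pass:

```rust
// Stand-in types and methods, not ojo's API: just enough to show
// the cost structure of cloning a repository of n patches.
struct Patch;
struct Graggle;

impl Graggle {
    fn apply(&mut self, _p: &Patch) {
        // O(|patch|): add nodes/edges, flip deleted flags; no pseudo-edge work.
    }
    fn recompute_pseudo_edges(&mut self) {
        // O(m), where m is the size of the history.
    }
}

fn clone_repo(graggle: &mut Graggle, patches: &[Patch]) {
    for p in patches {
        graggle.apply(p); // recomputing here instead would cost O(n * m) total
    }
    graggle.recompute_pseudo_edges(); // paid once at the end
}

fn main() {
    clone_repo(&mut Graggle, &[Patch, Patch, Patch]);
}
```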
Connected deleted components

Deferring the pseudo-edge computation certainly helps, but we'd also like to speed up the computation itself. The main idea is to avoid unnecessary recomputation by only examining parts of the graph that might have actually changed. At this point, I need to admit that I don't know whether what I'm about to propose is the best way of updating the pseudo-edges. In particular, its efficiency rests on a bunch of assumptions about what sort of graphs we're likely to encounter. I haven't made any attempt to test these assumptions on actual large repositories (although that's something I'd like to try in the future).

The main assumption is that while there may be many deleted nodes, they tend to be collected into a large number of connected components, each of which tends to be small. What's more, each patch (I'll assume) tends to only affect a small number of these connected components. In other words, the plan will be:

- keep track (incrementally) of connected components made up of deleted nodes,
- when applying or reverting a patch, figure out which connected components were touched, and only recompute paths among the live nodes that are on the boundary of one of the dirty connected components.

Before talking about algorithms, here are some pictures that should help unpack what it is that I actually mean. Here is a graph containing three connected components of deleted nodes (represented by the rounded rectangles):

When I delete node h, it gets added to one of the connected components, and I can update relevant pseudo-edges without looking at the other two connected components:

If I delete node d then it will cause all of the connected components to merge:

This isn't hard to handle, it just means that we should run our pseudo-edge-checking algorithm on the merged component.

Maintaining the components

To maintain the partition of deleted nodes into connected components, we use a disjoint-set data structure. This is very fast (pretty close to constant time) when applying patches, because applying patches can only enlarge deleted components. It's slower when reverting patches, because the disjoint-set algorithm doesn't allow splitting: when reverting patches, connected components could split into smaller ones. Our approach is to defer the splitting: we just mark the original connected component as dirty. When it comes time to compute the pseudo-edges, we explore the original component, and figure out what the new connected pieces are.

The disjoint-set data structure is implemented in the ojo_partition subcrate. It appears in the Graggle struct; note also the dirty_reps member: that's for keeping track of which parts in the partition have been modified by a patch and require recomputing pseudo-edges.

We recompute the components here. Specifically, we consider the subgraph consisting only of nodes that belong to one of the dirty connected components. We run Tarjan's algorithm on that subgraph to find out what the new connected components are. On each of those components, we recompute the pseudo-edges.

Recomputing the pseudo-edges

The algorithm for this is: after deleting the node, look at the deleted connected component that it belongs to, including the "boundary" consisting of live nodes:

Using depth-first search, check which of the live boundary nodes (in this case, just a and i) are connected by a path within that component (in this case, they are). If so, add a pseudo-edge. The complexity of this algorithm is O(nm), where n is the number of boundary nodes, and m is the total number of nodes in the component, including the boundary (because we need to run n DFSes, and each one takes O(m) time). The hope here is that m and n are small, even for large histories. For example, I hope that n is almost always 2; at least, this is the case if the final live graph is totally ordered.

This algorithm is implemented here.
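Here is a self-contained sketch of that boundary-to-boundary search, using plain adjacency lists and hash sets rather than ojo's actual representation:

```rust
use std::collections::{HashMap, HashSet};

type Node = usize;

/// One deleted connected component, plus its "boundary" of adjacent
/// live nodes (as produced by the disjoint-set bookkeeping above).
struct Component {
    deleted: HashSet<Node>,
    boundary: HashSet<Node>,
}

/// Out-edges of the full graph, deleted nodes included.
type Adj = HashMap<Node, Vec<Node>>;

/// For each boundary node, run one DFS confined to this component's
/// deleted nodes and emit a pseudo-edge to every other boundary node
/// it can reach: n DFSes of O(m) each, hence O(nm).
fn pseudo_edges(adj: &Adj, comp: &Component) -> Vec<(Node, Node)> {
    let mut out = Vec::new();
    for &start in &comp.boundary {
        // A pseudo-edge must cross at least one deleted node; direct
        // live-to-live edges are already in the graph.
        let mut stack: Vec<Node> = adj
            .get(&start)
            .into_iter()
            .flatten()
            .copied()
            .filter(|n| comp.deleted.contains(n))
            .collect();
        let mut seen: HashSet<Node> = HashSet::new();
        let mut reachable: HashSet<Node> = HashSet::new();
        while let Some(v) = stack.pop() {
            if !seen.insert(v) {
                continue; // already visited this deleted node
            }
            for &w in adj.get(&v).into_iter().flatten() {
                if comp.deleted.contains(&w) {
                    stack.push(w); // stay inside the component
                } else if w != start && comp.boundary.contains(&w) {
                    reachable.insert(w); // live node reached via deleted path
                }
            }
        }
        out.extend(reachable.into_iter().map(|end| (start, end)));
    }
    out
}

fn main() {
    // Tiny smoke test: a -> x -> b with x deleted yields pseudo-edge a -> b.
    let adj: Adj = HashMap::from([(0, vec![1]), (1, vec![2]), (2, vec![])]);
    let comp = Component {
        deleted: HashSet::from([1]),
        boundary: HashSet::from([0, 2]),
    };
    assert_eq!(pseudo_edges(&adj, &comp), vec![(0, 2)]);
}
```

Each DFS stays inside a single component: it only walks deleted nodes, recording any live boundary node it touches, which is exactly where the O(nm) bound comes from.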
Unapplying, and pseudo-edge reasons

There's one more wrinkle in the pseudo-edge computation, and it has to do with reverting patches: if applying a patch created a pseudo-edge, removing a patch might cause that pseudo-edge to get deleted. But we have to be very careful when doing so, because a pseudo-edge might have multiple reasons for existing. You can see why in this example from before:

The pseudo-edge from a to d is caused independently by both the b -> c component and the cy -> cl -> e component. If by unapplying some patch we destroy the b -> c component but leave the cy -> cl -> e component untouched, we have to be sure not to delete the pseudo-edge from a to d.

The solution to this is to track the "reasons" for pseudo-edges, where each "reason" is a deleted connected component. This is a many-to-many mapping between connected deleted components and pseudo-edges, and it's stored in the pseudo_edge_reasons and reason_pseudo_edges members of the GraggleData struct. Once we store pseudo-edge reasons, it's easy to figure out when a pseudo-edge needs deleting: whenever its last reason becomes obsolete.

Pseudo-edge spamming: an optimization

We've finished describing ojo's algorithm for keeping pseudo-edges up to date, but there's still room for improvement. Here, I'll describe a potential optimization that I haven't implemented yet. It's based on a simple, but not-quite-correct, algorithm for adding pseudo-edges incrementally: every time you mark a node as deleted, add a pseudo-edge from each of its in-neighbors to each of its out-neighbors. I call this "pseudo-edge spamming" because it just eagerly throws in as many pseudo-edges as needed. In pictures, if we have this graph

and we delete the "deleted" line, then we'll add a pseudo-edge from the in-neighbor of "deleted" (namely, "first") to the out-neighbor of "deleted" (namely, "last").

This algorithm has two problems. The first is that it isn't complete: you might also need to add pseudo-edges when adding an edge where at least one end is deleted. Consider this example, where our graph consists of two disconnected parts.

If we add an edge from "deleted 1" to "deleted 2", clearly we also need to add a pseudo-edge between each of the "first" nodes and each of the "last" nodes. In order to handle this case, we really do need to explore the deleted connected component (which could be slow).

The second problem with our pseudo-edge spamming algorithm is that it doesn't handle reverting patches: it only describes how to add pseudo-edges, not delete them.

The nice thing about pseudo-edge spamming is that even if it isn't completely correct, it can be used as a fast-path in the correct algorithm: when applying a patch, if it modifies the boundary of a deleted connected component that isn't already dirty, use pseudo-edge spamming to update the pseudo-edges (and don't mark the component as dirty). In every other case, fall back to the previous algorithm.
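For illustration, here is the not-quite-correct spamming rule rendered literally, with an invented in-memory graph type (not ojo's): deleting a node connects each of its in-neighbors directly to each of its out-neighbors.

```rust
use std::collections::{HashMap, HashSet};

type Node = usize;

/// An invented toy graph, not ojo's: out/in adjacency plus a deleted
/// set and an accumulated list of pseudo-edges.
struct Graph {
    out_edges: HashMap<Node, Vec<Node>>,
    in_edges: HashMap<Node, Vec<Node>>,
    deleted: HashSet<Node>,
    pseudo: Vec<(Node, Node)>,
}

impl Graph {
    /// Pseudo-edge spamming: on deleting `v`, connect every in-neighbor
    /// of `v` to every out-neighbor of `v`. As noted above, this is
    /// incomplete (it misses edges later added between deleted nodes)
    /// and it cannot undo anything, so it is only safe as a fast path
    /// on components that are not already dirty.
    fn delete_node(&mut self, v: Node) {
        self.deleted.insert(v);
        let ins = self.in_edges.get(&v).cloned().unwrap_or_default();
        let outs = self.out_edges.get(&v).cloned().unwrap_or_default();
        for &a in &ins {
            for &b in &outs {
                self.pseudo.push((a, b));
            }
        }
    }
}

fn main() {
    // first -> deleted -> last: deleting the middle node spams first -> last.
    let mut g = Graph {
        out_edges: HashMap::from([(0, vec![1]), (1, vec![2])]),
        in_edges: HashMap::from([(1, vec![0]), (2, vec![1])]),
        deleted: HashSet::new(),
        pseudo: Vec::new(),
    };
    g.delete_node(1);
    assert_eq!(g.pseudo, vec![(0, 2)]);
}
```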
# Document Title
Local-first software
You own your data, in spite of the cloud

Martin Kleppmann
Adam Wiggins
Peter van Hardenberg
Mark McGranaghan

April 2019

Cloud apps like Google Docs and Trello are popular because they enable real-time collaboration with colleagues, and they make it easy for us to access our work from all of our devices. However, by centralizing data storage on servers, cloud apps also take away ownership and agency from users. If a service shuts down, the software stops functioning, and data created with that software is lost.

In this article we propose “local-first software”: a set of principles for software that enables both collaboration and ownership for users. Local-first ideals include the ability to work offline and collaborate across multiple devices, while also improving the security, privacy, long-term preservation, and user control of data.

We survey existing approaches to data storage and sharing, ranging from email attachments to web apps to Firebase-backed mobile apps, and we examine the trade-offs of each. We look at Conflict-free Replicated Data Types (CRDTs): data structures that are multi-user from the ground up while also being fundamentally local and private. CRDTs have the potential to be a foundational technology for realizing local-first software.

We share some of our findings from developing local-first software prototypes at Ink & Switch over the course of several years. These experiments test the viability of CRDTs in practice, and explore the user interface challenges for this new data model. Lastly, we suggest some next steps for moving towards local-first software: for researchers, for app developers, and a startup opportunity for entrepreneurs.

This article has also been published in PDF format in the proceedings of the Onward! 2019 conference. Please cite it as:

Martin Kleppmann, Adam Wiggins, Peter van Hardenberg, and Mark McGranaghan. Local-first software: you own your data, in spite of the cloud. 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software (Onward!), October 2019, pages 154–178. doi:10.1145/3359591.3359737

We welcome your feedback: @inkandswitch or hello@inkandswitch.com.

Contents

Motivation: collaboration and ownership
Seven ideals for local-first software
1. No spinners: your work at your fingertips
2. Your work is not trapped on one device
3. The network is optional
4. Seamless collaboration with your colleagues
5. The Long Now
6. Security and privacy by default
7. You retain ultimate ownership and control
Existing data storage and sharing models
How application architecture affects user experience
Developer infrastructure for building apps
Towards a better future
CRDTs as a foundational technology
Ink & Switch prototypes
How you can help
Conclusions
Acknowledgments

Motivation: collaboration and ownership

It’s amazing how easily we can collaborate online nowadays. We use Google Docs to collaborate on documents, spreadsheets and presentations; in Figma we work together on user interface designs; we communicate with colleagues using Slack; we track tasks in Trello; and so on. We depend on these and many other online services, e.g. for taking notes, planning projects or events, remembering contacts, and a whole raft of business uses.

We will call these services “cloud apps,” but you could just as well call them “SaaS” or “web-based apps.” What they have in common is that we typically access them through a web browser or through mobile apps, and that they store their data on a server.

Today’s cloud apps offer big benefits compared to earlier generations of software: seamless collaboration, and being able to access data from any device. As we run more and more of our lives and work through these cloud apps, they become more and more critical to us. The more time we invest in using one of these apps, the more valuable the data in it becomes to us.

However, in our research we have spoken to a lot of creative professionals, and in that process we have also learned about the downsides of cloud apps.

When you have put a lot of creative energy and effort into making something, you tend to have a deep emotional attachment to it. If you do creative work, this probably seems familiar. (When we say “creative work,” we mean not just visual art, or music, or poetry — many other activities, such as explaining a technical topic, implementing an intricate algorithm, designing a user interface, or figuring out how to lead a team towards some goal are also creative efforts.)

Our research on software that supports the creative process is discussed further in our articles Capstone, a tablet for thinking and The iPad as a fast, precise tool for creativity.

In the process of performing that creative work, you typically produce files and data: documents, presentations, spreadsheets, code, notes, drawings, and so on. And you will want to keep that data: for reference and inspiration in the future, to include it in a portfolio, or simply to archive because you feel proud of it. It is important to feel ownership of that data, because the creative expression is something so personal.

Unfortunately, cloud apps are problematic in this regard. Although they let you access your data anywhere, all data access must go via the server, and you can only do the things that the server will let you do. In a sense, you don’t have full ownership of that data — the cloud provider does. In the words of a bumper sticker: “There is no cloud, it’s just someone else’s computer.”

We use the term “ownership” not in the sense of intellectual property law and copyright, but rather as the creator’s perceived relationship to their data. We discuss this notion in a later section.

When data is stored on “someone else’s computer”, that third party assumes a degree of control over that data. Cloud apps are provided as a service; if the service is unavailable, you cannot use the software, and you can no longer access your data created with that software. If the service shuts down, even though you might be able to export your data, without the servers there is normally no way for you to continue running your own copy of that software. Thus, you are at the mercy of the company providing the service.

Before web apps came along, we had what we might call “old-fashioned” apps: programs running on your local computer, reading and writing files on the local disk.
We still use a lot of applications of this type today: text editors and IDEs, Git and other version control systems, and many specialized software packages such as graphics applications or CAD software fall in this category.

The software we are talking about in this article consists of apps for creating documents or files (such as text, graphics, spreadsheets, CAD drawings, or music), or personal data repositories (such as notes, calendars, to-do lists, or password managers). We are not talking about implementing things like banking services, e-commerce, social networking, ride-sharing, or similar services, which are well served by centralized systems.

In old-fashioned apps, the data lives in files on your local disk, so you have full agency and ownership of that data: you can do anything you like, including long-term archiving, making backups, manipulating the files using other programs, or deleting the files if you no longer want them. You don’t need anybody’s permission to access your files, since they are yours. You don’t have to depend on servers operated by another company.

To sum up: the cloud gives us collaboration, but old-fashioned apps give us ownership. Can’t we have the best of both worlds?

We would like both the convenient cross-device access and real-time collaboration provided by cloud apps, and also the personal ownership of your own data embodied by “old-fashioned” software.

## Seven ideals for local-first software

We believe that data ownership and real-time collaboration are not at odds with each other. It is possible to create software that has all the advantages of cloud apps, while also allowing you to retain full ownership of the data, documents and files you create.

We call this type of software local-first software, since it prioritizes the use of local storage (the disk built into your computer) and local networks (such as your home WiFi) over servers in remote datacenters.

In cloud apps, the data on the server is treated as the primary, authoritative copy of the data; if a client has a copy of the data, it is merely a cache that is subordinate to the server. Any data modification must be sent to the server, otherwise it “didn’t happen.” In local-first applications we swap these roles: we treat the copy of the data on your local device — your laptop, tablet, or phone — as the primary copy. Servers still exist, but they hold secondary copies of your data in order to assist with access from multiple devices. As we shall see, this change in perspective has profound implications.

Here are seven ideals we would like to strive for in local-first software.

### 1. No spinners: your work at your fingertips

Much of today’s software feels slower than previous generations of software. Even though CPUs have become ever faster, there is often a perceptible delay between some user input (e.g. clicking a button, or hitting a key) and the corresponding result appearing on the display. In previous work we measured the performance of modern software and analyzed why these delays occur.

[Figure: Server-to-server round-trip times between AWS datacenters in various locations worldwide. Data from: Peter Bailis, Aaron Davidson, Alan Fekete, et al.: “Highly Available Transactions: Virtues and Limitations,” VLDB 2014.]

With cloud apps, since the primary copy of the data is on a server, all data modifications, and many data lookups, require a round-trip to a server.
Depending on where you live, the server may well be located on another continent, so the speed of light places a limit on how fast the software can be.

The user interface may try to hide that latency by showing the operation as if it were complete, even though the request is still in progress — a pattern known as Optimistic UI — but until the request is complete, there is always the possibility that it may fail (for example, due to an unstable Internet connection). Thus, an optimistic UI still sometimes exposes the latency of the network round-trip when an error occurs.

Local-first software is different: because it keeps the primary copy of the data on the local device, there is never a need for the user to wait for a request to a server to complete. All operations can be handled by reading and writing files on the local disk, and data synchronization with other devices happens quietly in the background.

While this by itself does not guarantee that the software will be fast, we expect that local-first software has the potential to respond near-instantaneously to user input, never needing to show you a spinner while you wait, and allowing you to operate with your data at your fingertips.

### 2. Your work is not trapped on one device

Users today rely on several computing devices to do their work, and modern applications must support such workflows. For example, users may capture ideas on the go using their smartphone, organize and think through those ideas on a tablet, and then type up the outcome as a document on their laptop.

This means that while local-first apps keep their data in local storage on each device, it is also necessary for that data to be synchronized across all of the devices on which a user does their work. Various data synchronization technologies exist, and we discuss them in detail in a later section.

Most cross-device sync services also store a copy of the data on a server, which provides a convenient off-site backup for the data. These solutions work quite well as long as each file is only edited by one person at a time. If several people edit the same file at the same time, conflicts may arise, which we discuss in the section on collaboration.

### 3. The network is optional

Personal mobile devices move through areas of varying network availability: unreliable coffee shop WiFi, while on a plane or on a train going through a tunnel, in an elevator or a parking garage. In developing countries or rural areas, infrastructure for Internet access is sometimes patchy. While traveling internationally, many mobile users disable cellular data due to the cost of roaming. Overall, there is plenty of need for offline-capable apps, such as for researchers or journalists who need to write while in the field.

“Old-fashioned” apps work fine without an Internet connection, but cloud apps typically don’t work while offline. For several years the Offline First movement has been encouraging developers of web and mobile apps to improve offline support, but in practice it has been difficult to retrofit offline support to cloud apps, because tools and libraries designed for a server-centric model do not easily adapt to situations in which users make edits while offline.

Although it is possible to make web apps work offline, it can be difficult for a user to know whether all the necessary code and data for an application have been downloaded.

Since local-first applications store the primary copy of their data in each device’s local filesystem, the user can read and write this data anytime, even while offline.
It is then synchronized with other devices sometime later, when a network connection is available. The data synchronization need not necessarily go via the Internet: local-first apps could also use Bluetooth or local WiFi to sync data to nearby devices.

Moreover, for good offline support it is desirable for the software to run as a locally installed executable on your device, rather than a tab in a web browser. For mobile apps it is already standard that the whole app is downloaded and installed before it is used.

### 4. Seamless collaboration with your colleagues

Collaboration typically requires that several people contribute material to a document or file. However, in old-fashioned software it is problematic for several people to work on the same file at the same time: the result is often a conflict. In text files such as source code, resolving conflicts is tedious and annoying, and the task quickly becomes very difficult or impossible for complex file formats such as spreadsheets or graphics documents. Hence, collaborators may have to agree up front who is going to edit a file, and only have one person at a time who may make changes.

[Figure: A “conflicted copy” on Dropbox. The user must merge the changes manually.]

[Figure: In Evernote, if a note is changed concurrently, it is moved to a “conflicting changes” notebook, and there is nothing to support the user in resolving the situation — not even a facility to compare the different versions of a note.]

[Figure: In Git and other version control systems, several people may modify the same file in different commits. Combining those changes often results in merge conflicts, which can be resolved using specialized tools (such as DiffMerge, shown here). These tools are primarily designed for line-oriented text files such as source code; for other file formats, tool support is much weaker.]

On the other hand, cloud apps such as Google Docs have vastly simplified collaboration by allowing multiple users to edit a document simultaneously, without having to send files back and forth by email and without worrying about conflicts. Users have come to expect this kind of seamless real-time collaboration in a wide range of applications.

In local-first apps, our ideal is to support real-time collaboration that is on par with the best cloud apps today, or better. Achieving this goal is one of the biggest challenges in realizing local-first software, but we believe it is possible: in a later section we discuss technologies that enable real-time collaboration in a local-first setting.

Moreover, we expect that local-first apps can support various workflows for collaboration. Besides having several people edit the same document in real-time, it is sometimes useful for one person to tentatively propose changes that can be reviewed and selectively applied by someone else. Google Docs supports this workflow with its suggesting mode, and pull requests serve this purpose in GitHub.

[Figure: In Google Docs, collaborators can either edit the document directly, or they can suggest changes, which can then be accepted or rejected by the document owner.]

[Figure: The collaboration workflow on GitHub is based on pull requests. A user may change multiple source files in multiple commits, and submit them as a proposed change to a project. Other users may review and amend the pull request before it is finally merged or rejected.]
### 5. The Long Now

An important aspect of data ownership is that you can continue accessing the data for a long time in the future. When you do some work with local-first software, your work should continue to be accessible indefinitely, even after the company that produced the software is gone.

[Figure: Cuneiform script on clay tablet, ca. 3000 BCE. Image from Wikimedia Commons.]

“Old-fashioned” apps continue to work forever, as long as you have a copy of the data and some way of running the software. Even if the software author goes bust, you can continue running the last released version of the software. Even if the operating system and the computer it runs on become obsolete, you can still run the software in a virtual machine or emulator. As storage media evolve over the decades, you can copy your files to new storage media and continue to access them.

The Internet Archive maintains a collection of historical software that can be run using an emulator in a modern web browser; enthusiasts at the English Amiga Board share tips on running historical software.

On the other hand, cloud apps depend on the service continuing to be available: if the service is unavailable, you cannot use the software, and you can no longer access your data created with that software. This means you are betting that the creators of the software will continue supporting it for a long time — at least as long as you care about the data.

Our incredible journey is a blog that documents startup products getting shut down after an acquisition.

Although there does not seem to be a great danger of Google shutting down Google Docs anytime soon, popular products do sometimes get shut down or lose data, so we know to be careful. And even with long-lived software there is the risk that the pricing or features change in a way you don’t like, and with a cloud app, continuing to use the old version is not an option — you will be upgraded whether you like it or not.

Local-first software enables greater longevity because your data, and the software that is needed to read and modify your data, are all stored locally on your computer. We believe this is important not just for your own sake, but also for future historians who will want to read the documents we create today. Without longevity of our data, we risk creating what Vint Cerf calls a “digital Dark Age.”

We have previously written about long-term archiving of web pages. For an interesting discussion of long-term data preservation, see “The Cuneiform Tablets of 2015”, a paper by Long Tien Nguyen and Alan Kay at Onward! 2015.

Some file formats (such as plain text, JPEG, and PDF) are so ubiquitous that they will probably be readable for centuries to come. The US Library of Congress also recommends XML, JSON, or SQLite as archival formats for datasets. However, in order to read less common file formats and to preserve interactivity, you need to be able to run the original software (if necessary, in a virtual machine or emulator). Local-first software enables this.

### 6. Security and privacy by default

One problem with the architecture of cloud apps is that they store all the data from all of their users in a centralized database. This large collection of data is an attractive target for attackers: a rogue employee, or a hacker who gains access to the company’s servers, can read and tamper with all of your data.
Such security breaches are sadly all too common, and with cloud apps we are unfortunately at the mercy of the provider.

While Google has a world-class security team, the sad reality is that most companies do not. And while Google is good at defending your data against external attackers, the company internally is free to use your data in myriad ways, such as feeding your data into its machine learning systems.

Quoting from the Google Drive terms of service: “Our automated systems analyze your content to provide you personally relevant product features, such as customized search results, and spam and malware detection.”

Maybe you feel that your data would not be of interest to any attacker. However, for many professions, dealing with sensitive data is an important part of their work. For example, medical professionals handle sensitive patient data, investigative journalists handle confidential information from sources, governments and diplomatic representatives conduct sensitive negotiations, and so on. Many of these professionals cannot use cloud apps due to regulatory compliance and confidentiality obligations.

Local-first apps, on the other hand, have better privacy and security built in at the core. Your local devices store only your own data, avoiding the centralized cloud database holding everybody’s data. Local-first apps can use end-to-end encryption so that any servers that store a copy of your files only hold encrypted data that they cannot read.

Modern messaging apps like iMessage, WhatsApp and Signal already use end-to-end encryption, Keybase provides encrypted file sharing and messaging, and Tarsnap takes this approach for backups. We hope to see this trend expand to other kinds of software as well.

### 7. You retain ultimate ownership and control

With cloud apps, the service provider has the power to restrict user access: for example, in October 2017, several Google Docs users were locked out of their documents because an automated system incorrectly flagged these documents as abusive. In local-first apps, the ownership of data is vested in the user.

To disambiguate “ownership” in this context: we don’t mean it in the legal sense of intellectual property. A word processor, for example, should be oblivious to the question of who owns the copyright in the text being edited. Instead we mean ownership in the sense of user agency, autonomy, and control over data. You should be able to copy and modify data in any way, write down any thought, and no company should restrict what you are allowed to do.

Under the European Convention on Human Rights, your freedom of thought and opinion is unconditional — the state may never interfere with it, since it is yours alone — whereas freedom of expression (including freedom of speech) can be restricted in certain ways, since it affects other people. Communication services like social networks convey expression, but the raw notes and unpublished work of a creative person are a way of developing thoughts and opinions, and thus warrant unconditional protection.

In cloud apps, the ways in which you can access and modify your data are limited by the APIs, user interfaces, and terms of service of the service provider.
With local-first software, all of the bytes that comprise your data are stored on your own device, so you have the freedom to process this data in arbitrary ways.

With data ownership comes responsibility: maintaining backups or other preventative measures against data loss, protecting against ransomware, and general organizing and managing of file archives. For many professional and creative users, as discussed in the introduction, we believe that the trade-off of more responsibility in exchange for more ownership is desirable. Consider a significant personal creation, such as a PhD thesis or the raw footage of a film. For these you might be willing to take responsibility for storage and backups in order to be certain that your data is safe and fully under your control.

In our opinion, maintaining control and ownership of data does not mean that the software must necessarily be open source. Although the freedom to modify software enhances user agency, it is possible for commercial and closed-source software to satisfy the local-first ideals, as long as it does not artificially restrict what users can do with their files. Examples of such artificial restrictions are PDF files that disable operations like printing, eBook readers that interfere with copy-paste, and DRM on media files.

## Existing data storage and sharing models

We believe professional and creative users deserve software that realizes the local-first goals, helping them collaborate seamlessly while also allowing them to retain full ownership of their work. If we can give users these qualities in the software they use to do their most important work, we can help them be better at what they do, and potentially make a significant difference to many people’s professional lives.

However, while the ideals of local-first software may resonate with you, you may still be wondering how achievable they are in practice. Are they just utopian thinking?

In the remainder of this article we discuss what it means to realize local-first software in practice. We look at a wide range of existing technologies and break down how well they satisfy the local-first ideals. In the following tables, ✓ means the technology meets the ideal, — means it partially meets the ideal, and ✗ means it does not meet the ideal.

As we shall see, many technologies satisfy some of the goals, but none are able to satisfy them all. Finally, we examine a technique from the cutting edge of computer science research that might be a foundational piece in realizing local-first software in the future.

### How application architecture affects user experience

Let’s start by examining software from the end user’s perspective, and break down how well different software architectures meet the seven goals of local-first software. In the next section we compare storage technologies and APIs that are used by software engineers to build applications.

#### Files and email attachments

| | 1. Fast | 2. Multi-device | 3. Offline | 4. Collaboration | 5. Longevity | 6. Privacy | 7. User control |
|---|---|---|---|---|---|---|---|
| Files + email attachments | ✓ | — | ✓ | ✗ | ✓ | — | ✓ |

Viewed through the lens of our seven goals, traditional files have many desirable properties: they can be viewed and edited offline, they give full control to users, and they can readily be backed up and preserved for the long term. Software relying on local files also has the potential to be very fast.

However, accessing files from multiple devices is trickier.
It is possible to transfer a file across devices using various technologies:

- Sending it back and forth by email;
- Passing a USB drive back and forth;
- Via a distributed file system such as a NAS server, NFS, FTP, or rsync;
- Using a cloud file storage service like Dropbox, Google Drive, or OneDrive (see later section);
- Using a version control system such as Git (see later section).

Of these, email attachments are probably the most common sharing mechanism, especially among users who are not technical experts. Attachments are easy to understand and trustworthy. Once you have a copy of a document, it does not spontaneously change: if you view an email six months later, the attachments are still there in their original form. Unlike a web app, an attachment can be opened without any additional login process.

The weakest point of email attachments is collaboration. Generally, only one person at a time can make changes to a file, otherwise a difficult manual merge is required. File versioning quickly becomes messy: a back-and-forth email thread with attachments often leads to filenames such as Budget draft 2 (Jane's version) final final 3.xls.

Nevertheless, for apps that want to incorporate local-first ideas, a good starting point is to offer an export feature that produces a widely-supported file format (e.g. plain text, PDF, PNG, or JPEG) and allows it to be shared e.g. via email attachment, Slack, or WhatsApp.

#### Web apps: Google Docs, Trello, Figma, Pinterest, etc.

| | 1. Fast | 2. Multi-device | 3. Offline | 4. Collaboration | 5. Longevity | 6. Privacy | 7. User control |
|---|---|---|---|---|---|---|---|
| Google Docs | — | ✓ | — | ✓ | — | ✗ | — |
| Trello | — | ✓ | — | ✓ | — | ✗ | ✗ |
| Pinterest | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ |

At the opposite end of the spectrum are pure web apps, where the user’s local software (web browser or mobile app) is a thin client and the data storage resides on a server. The server typically uses a large-scale database in which the data of millions of users are all mixed together in one giant collection.

Web apps have set the standard for real-time collaboration. As a user you can trust that when you open a document on any device, you are seeing the most current and up-to-date version. This is so overwhelmingly useful for team work that these applications have become dominant. Even traditionally local-only software like Microsoft Office is making the transition to cloud services, with Office 365 eclipsing locally-installed Office as of 2017.

With the rise of remote work and distributed teams, real-time collaborative productivity tools are becoming even more important. Ten users on a team video call can bring up the same Trello board and each make edits on their own computer while simultaneously seeing what other users are doing.

The flip side to this is a total loss of ownership and control: the data on the server is what counts, and any data on your client device is unimportant — it is merely a cache. Most web apps have little or no support for offline working: if your network hiccups for even a moment, you are locked out of your work mid-sentence.

[Figure: If Google Docs detects that it is offline, it blocks editing of the document.]

A few of the best web apps hide the latency of server communication using JavaScript, and try to provide limited offline support (for example, the Google Docs offline plugin). However, these efforts appear retrofitted to an application architecture that is fundamentally centered on synchronous interaction with a server.
Users report mixed results when trying to work offline.

[Figure: A negative user review of the Google Docs offline extension.]

Some web apps, for example Milanote and Figma, offer installable desktop clients that are essentially repackaged web browsers. If you try to use these clients to access your work while your network is intermittent, while the vendor’s servers are experiencing an outage, or after the vendor has been acquired and shut down, it becomes clear that your work was never truly yours.

[Figure: The Figma desktop client showing an offline error message.]

#### Dropbox, Google Drive, Box, OneDrive, etc.

| | 1. Fast | 2. Multi-device | 3. Offline | 4. Collaboration | 5. Longevity | 6. Privacy | 7. User control |
|---|---|---|---|---|---|---|---|
| Dropbox | ✓ | — | — | ✗ | ✓ | — | ✓ |

Cloud-based file sync products like Dropbox, Google Drive, Box, or OneDrive make files available on multiple devices. On desktop operating systems (Windows, Linux, Mac OS) these tools work by watching a designated folder on the local file system. Any software on your computer can read and write files in this folder, and whenever a file is changed on one computer, it is automatically copied to all of your other computers.

As these tools use the local filesystem, they have many attractive properties: access to local files is fast, and working offline is no problem (files edited offline are synced the next time an Internet connection is available). If the sync service were shut down, your files would still remain unharmed on your local disk, and it would be easy to switch to a different syncing service. If your computer’s hard drive fails, you can restore your work simply by installing the app and waiting for it to sync. This provides good longevity and control over your data.

However, on mobile platforms (iOS and Android), Dropbox and its cousins use a completely different model. The mobile apps do not synchronize an entire folder — instead, they are thin clients that fetch your data from a server one file at a time, and by default they do not work offline. There is a “Make available offline” option, but you need to remember to invoke it ahead of going offline; it is clumsy, and it only works while the app is open. The Dropbox API is also very server-centric.

[Figure: Users of the Dropbox mobile app spend a lot of time looking at spinners, a stark contrast to the at-your-fingertips feeling of the Dropbox desktop product.]

The weakest point of file sync products is the lack of real-time collaboration: if the same file is edited on two different devices, the result is a conflict that needs to be merged manually, as discussed previously. The fact that these tools synchronize files in any format is both a strength (compatibility with any application) and a weakness (inability to perform format-specific merges).

#### Git and GitHub

| | 1. Fast | 2. Multi-device | 3. Offline | 4. Collaboration | 5. Longevity | 6. Privacy | 7. User control |
|---|---|---|---|---|---|---|---|
| Git+GitHub | ✓ | — | ✓ | — | ✓ | — | ✓ |

Git and GitHub are primarily used by software engineers to collaborate on source code. They are perhaps the closest thing we have to a true local-first software package: compared to server-centric version control systems such as Subversion, Git works fully offline, it is fast, it gives full control to users, and it is suitable for long-term preservation of data.
This is the case because a Git repository on your local filesystem is a primary copy of the data, and is not subordinate to any server.

We focus on Git/GitHub here as the most successful examples, but these lessons also apply to other distributed revision control tools like Mercurial or Darcs, and other repository hosting services such as GitLab or Bitbucket. In principle it is possible to collaborate without a repository service, e.g. by sending patch files by email, but the majority of Git users rely on GitHub.

A repository hosting service like GitHub enables collaboration around Git repositories, accessing data from multiple devices, as well as providing a backup and archival location. Support for mobile devices is currently weak, although Working Copy is a promising Git client for iOS. GitHub stores repositories unencrypted; if stronger privacy is required, it is possible for you to run your own repository server.

We think the Git model points the way toward a future for local-first software. However, as it currently stands, Git has two major weaknesses:

1. Git is excellent for asynchronous collaboration, especially using pull requests, which take a coarse-grained set of changes and allow them to be discussed and amended before merging them into the shared master branch. But Git has no capability for real-time, fine-grained collaboration, such as the automatic, instantaneous merging that occurs in tools like Google Docs, Trello, and Figma.
2. Git is highly optimized for code and similar line-based text files; other file formats are treated as binary blobs that cannot meaningfully be edited or merged. Despite GitHub’s efforts to display and compare images, prose, and CAD files, non-textual file formats remain second-class in Git.

It’s interesting to note that most software engineers have been reluctant to embrace cloud software for their editors, IDEs, runtime environments, and build tools. In theory, we might expect this demographic of sophisticated users to embrace newer technologies sooner than other types of users. But if you ask an engineer why they don’t use a cloud-based editor like Cloud9 or Repl.it, or a runtime environment like Colaboratory, the answers will usually include “it’s too slow” or “I don’t trust it” or “I want my code on my local system.” These sentiments seem to reflect some of the same motivations as local-first software. If we as developers want these things for ourselves and our work, perhaps we might imagine that other types of creative professionals would want these same qualities for their own work.

### Developer infrastructure for building apps

Now that we have examined the user experience of a range of applications through the lens of the local-first ideals, let’s switch mindsets to that of an application developer. If you are creating an app and want to offer users some or all of the local-first experience, what are your options for data storage and synchronization infrastructure?

#### Web app (thin client)

| | 1. Fast | 2. Multi-device | 3. Offline | 4. Collaboration | 5. Longevity | 6. Privacy | 7. User control |
|---|---|---|---|---|---|---|---|
| Web apps | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ |

A web app in its purest form is usually a Rails, Django, PHP, or Node.js program running on a server, storing its data in a SQL or NoSQL database, and serving web pages over HTTPS.
All of the data is on the server, and the user’s web browser is only a thin client.

This architecture offers many benefits: zero installation (just visit a URL), and nothing for the user to manage, as all data is stored and managed in one place by the engineering and DevOps professionals who deploy the application. Users can access the application from all of their devices, and colleagues can easily collaborate by logging in to the same application.

JavaScript frameworks such as Meteor and ShareDB, and services such as Pusher and Ably, make it easier to add real-time collaboration features to web applications, building on top of lower-level protocols such as WebSocket.

On the other hand, a web app that needs to perform a request to a server for every user action is going to be slow. It is possible to hide the round-trip times in some cases by using client-side JavaScript, but these approaches quickly break down if the user’s internet connection is unstable.

Despite many efforts to make web browsers more offline-friendly (manifests, localStorage, service workers, and Progressive Web Apps, among others), the architecture of web apps remains fundamentally server-centric. Offline support is an afterthought in most web apps, and the result is accordingly fragile. In many web browsers, if the user clears their cookies, all data in local storage is also deleted; while this is not a problem for a cache, it makes the browser’s local storage unsuitable for storing data of any long-term importance.

News website The Guardian documents how they used service workers to build an offline experience for their users.

Relying on third-party web apps also scores poorly in terms of longevity, privacy, and user control. It is possible to improve these properties if the web app is open source and users are willing to self-host their own instances of the server. However, we believe that self-hosting is not a viable option for the vast majority of users who do not want to become system administrators; moreover, most web apps are closed source, ruling out this option entirely.

All in all, we speculate that web apps will never be able to provide all the local-first properties we are looking for, due to the fundamental thin-client nature of the platform. By choosing to build a web app, you are choosing the path of data belonging to you and your company, not to your users.

#### Mobile app with local storage (thick client)

| | 1. Fast | 2. Multi-device | 3. Offline | 4. Collaboration | 5. Longevity | 6. Privacy | 7. User control |
|---|---|---|---|---|---|---|---|
| Thick client | ✓ | — | ✓ | ✗ | — | ✗ | ✗ |

iOS and Android apps are locally installed software, with the entire app binary downloaded and installed before the app is run. Many apps are nevertheless thin clients, similarly to web apps, which require a server in order to function (for example, Twitter, Yelp, or Facebook). Without a reliable Internet connection, these apps give you spinners, error messages, and unexpected behavior.

However, there is another category of mobile apps that are more in line with the local-first ideals. These apps store data on the local device in the first instance, using a persistence layer like SQLite, Core Data, or just plain files. Some of these (such as Clue or Things) started life as a single-user app without any server, and then added a cloud backend later, as a way to sync between devices or share data with other users.

These thick-client apps have the advantage of being fast and working offline, because the server sync happens in the background. They generally continue working if the server is shut down.
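To make the thick-client pattern concrete, here is a minimal sketch in JavaScript (our illustration, not code from any of the apps named above; the file name and sync endpoint are placeholders). The local write is the primary operation, and server sync is a best-effort background task:

```js
// Sketch of the thick-client pattern: persist locally first, sync in
// the background. Assumes Node 18+ (for the global fetch).
const fs = require('fs')

const NOTES_FILE = 'notes.json'  // placeholder for the app's local store

function loadNotes() {
  try {
    return JSON.parse(fs.readFileSync(NOTES_FILE, 'utf8'))
  } catch {
    return []  // first run: no local file yet
  }
}

function saveNote(note) {
  const notes = loadNotes()
  notes.push(note)
  // The local write is what makes the app usable, with or without a
  // network connection.
  fs.writeFileSync(NOTES_FILE, JSON.stringify(notes, null, 2))
  syncInBackground(notes)
}

function syncInBackground(notes) {
  // Best-effort upload; failure does not affect the local copy.
  fetch('https://example.com/api/sync', {   // placeholder endpoint
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(notes),
  }).catch(() => { /* offline: retry later */ })
}

saveNote({ text: 'Written on a plane', createdAt: Date.now() })
```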
The degree to which they offer privacy and user control over data varies depending on the app in question.

Things get more difficult if the data may be modified on multiple devices or by multiple collaborating users. The developers of mobile apps are generally experts in end-user app development, not in distributed systems. We have seen multiple app development teams writing their own ad-hoc diffing, merging, and conflict resolution algorithms, and the resulting data sync solutions are often unreliable and brittle. A more specialized storage backend, as discussed in the next section, can help.

#### Backend-as-a-Service: Firebase, CloudKit, Realm

| | 1. Fast | 2. Multi-device | 3. Offline | 4. Collaboration | 5. Longevity | 6. Privacy | 7. User control |
|---|---|---|---|---|---|---|---|
| Firebase, CloudKit, Realm | — | ✓ | ✓ | — | ✗ | ✗ | ✗ |

Firebase is the most successful of the mobile backend-as-a-service options. It is essentially a local on-device database combined with a cloud database service and data synchronization between the two. Firebase allows sharing of data across multiple devices, and it supports offline use. However, as a proprietary hosted service, we give it a low score for privacy and longevity.

Another popular backend-as-a-service was Parse, but it was acquired by Facebook and shut down in 2017. Apps relying on it were forced to move to other backend services, underlining the importance of longevity.

Firebase offers a great experience for you, the developer: you can view, edit, and delete data in a free-form way in the Firebase console. But the user does not have a comparable way of accessing, manipulating and managing their data, leaving the user with little ownership and control.

[Figure: The Firebase console: great for developers, off-limits for the end user.]

Apple’s CloudKit offers a Firebase-like experience for apps willing to limit themselves to the iOS and Mac platforms. It is a key-value store with syncing, good offline capabilities, and it has the added benefit of being built into the platform (thereby sidestepping the clumsiness of users having to create an account and log in). It’s a great choice for indie iOS developers and is used to good effect by tools like Ulysses, Bear, Overcast, and many more.

[Figure: With one checkbox, Ulysses syncs work across all of the user’s connected devices, thanks to its use of CloudKit.]

Another project in this vein is Realm. This persistence library for iOS gained popularity compared to Core Data due to its cleaner API. The client-side library for local persistence is called Realm Database, while the associated Firebase-like backend service is called Realm Object Server. Notably, the object server is open source and self-hostable, which reduces the risk of being locked in to a service that might one day disappear.

Mobile apps that treat the on-device data as the primary copy (or at least more than a disposable cache), and use sync services like Firebase or iCloud, get us a good bit of the way toward local-first software.

#### CouchDB

| | 1. Fast | 2. Multi-device | 3. Offline | 4. Collaboration | 5. Longevity | 6. Privacy | 7. User control |
|---|---|---|---|---|---|---|---|
| CouchDB | — | — | ✓ | ✗ | — | — | — |

CouchDB is a database that is notable for pioneering a multi-master replication approach: several machines each have a fully-fledged copy of the database, each replica can independently make changes to the data, and any pair of replicas can synchronize with each other to exchange the latest changes.
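As an illustration of this replication model, here is a minimal sketch using PouchDB, a client-side sibling project of CouchDB that speaks the same sync protocol (introduced in the next paragraph). The database name and remote URL are placeholders:

```js
// Sketch of CouchDB-style multi-master replication using PouchDB.
const PouchDB = require('pouchdb')

const local = new PouchDB('notes')                          // replica on this device
const remote = new PouchDB('http://localhost:5984/notes')   // any other replica

// Each replica accepts writes independently, even while offline...
local.put({ _id: 'note:2019-04-01', text: 'Written offline' })
  .catch(err => console.error(err))

// ...and any pair of replicas can sync with each other to exchange
// the latest changes. Conflicts are surfaced to application code.
local.sync(remote, { live: true, retry: true })
  .on('change', info => console.log('replicated:', info.direction))
  .on('error', err => console.error(err))
```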
CouchDB is designed for use on servers; Cloudant provides a hosted version; PouchDB and Hoodie are sibling projects that use the same sync protocol but are designed to run on end-user devices.

Philosophically, CouchDB is closely aligned to the local-first principles, as evidenced in particular by the CouchDB book, which provides an excellent introduction to relevant topics such as distributed consistency, replication, change notifications, and multiversion concurrency control.

While CouchDB/PouchDB allow multiple devices to concurrently make changes to a database, these changes lead to conflicts that need to be explicitly resolved by application code. This conflict resolution code is difficult to write correctly, making CouchDB impractical for applications with very fine-grained collaboration, like in Google Docs, where every keystroke is potentially an individual change.

In practice, the CouchDB model has not been widely adopted. Various reasons have been cited for this: scalability problems when a separate database per user is required; difficulty embedding the JavaScript client in native apps on iOS and Android; the problem of conflict resolution; the unfamiliar MapReduce model for performing queries; and more. All in all, while we agree with much of the philosophy behind CouchDB, we feel that the implementation has not been able to realize the local-first vision in practice.

## Towards a better future

As we have shown, none of the existing data layers for application development fully satisfy the local-first ideals. Thus, three years ago, our lab set out to search for a solution that gives seven green checkmarks.

| | 1. Fast | 2. Multi-device | 3. Offline | 4. Collaboration | 5. Longevity | 6. Privacy | 7. User control |
|---|---|---|---|---|---|---|---|
| ??? | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |

We have found some technologies that appear to be promising foundations for local-first ideals. Most notable is the family of distributed systems algorithms called Conflict-free Replicated Data Types (CRDTs).

### CRDTs as a foundational technology

CRDTs emerged from academic computer science research in 2011. They are general-purpose data structures, like hash maps and lists, but the special thing about them is that they are multi-user from the ground up.

Every application needs some data structures to store its document state. For example, if your application is a text editor, the core data structure is the array of characters that make up the document. If your application is a spreadsheet, the data structure is a matrix of cells containing text, numbers, or formulas referencing other cells. If it is a vector graphics application, the data structure is a tree of graphical objects such as text objects, rectangles, lines, and other shapes.

If you are building a single-user application, you would maintain those data structures in memory using model objects, hash maps, lists, records/structs and the like. If you are building a collaborative multi-user application, you can swap out those data structures for CRDTs.

[Figure: Two devices initially have the same to-do list. On device 1, a new item is added to the list using the .push() method, which appends a new item to the end of a list. Concurrently, the first item is marked as done on device 2. After the two devices communicate, the CRDT automatically merges the states so that both changes take effect.]

The diagram above shows an example of a to-do list application backed by a CRDT with a JSON data model. Users can view and modify the application state on their local device, even while offline.
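To make this scenario concrete, here is a minimal sketch using the classic JavaScript API of Automerge, the CRDT library we introduce below (the document shape is our own illustration):

```js
// Sketch of the to-do scenario in the figure above, using Automerge.
const Automerge = require('automerge')

// Both devices start from the same document.
let device1 = Automerge.from({ todos: [{ title: 'Buy milk', done: false }] })
let device2 = Automerge.merge(Automerge.init(), device1)

// Concurrently: device 1 appends a new item to the list...
device1 = Automerge.change(device1, doc => {
  doc.todos.push({ title: 'Water plants', done: false })
})

// ...while device 2 marks the first item as done.
device2 = Automerge.change(device2, doc => {
  doc.todos[0].done = true
})

// After the devices communicate, both changes take effect:
// [ { title: 'Buy milk', done: true },
//   { title: 'Water plants', done: false } ]
const merged = Automerge.merge(device1, device2)
console.log(merged.todos)
```

Because the two concurrent changes touch different parts of the document, the merge needs no manual conflict resolution.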
The CRDT keeps track of any changes that are made, and syncs the changes with other devices in the background when a network connection is available.

If the state was concurrently modified on different devices, the CRDT merges those changes. For example, if users concurrently add new items to the to-do list on different devices, the merged state contains all of the added items in a consistent order. Concurrent changes to different objects can also be merged easily. The only type of change that a CRDT cannot automatically resolve is when multiple users concurrently update the same property of the same object; in this case, the CRDT keeps track of the conflicting values, and leaves it to be resolved by the application or the user.

Thus, CRDTs have some similarity to version control systems like Git, except that they operate on richer data types than text files. CRDTs can sync their state via any communication channel (e.g. via a server, over a peer-to-peer connection, by Bluetooth between local devices, or even on a USB stick). The changes tracked by a CRDT can be as small as a single keystroke, enabling Google Docs-style real-time collaboration. But you could also collect a larger set of changes and send them to collaborators as a batch, more like a pull request in Git. Because the data structures are general-purpose, we can develop general-purpose tools for storage, communication, and management of CRDTs, saving us from having to re-implement those things in every single app.

For a more technical introduction to CRDTs we suggest:

- Alexei Baboulevitch’s Data Laced with History
- Martin Kleppmann’s Convergence vs Consensus (slides)
- Shapiro et al.’s comprehensive survey
- Attiya et al.’s formal specification of collaborative text editing
- Gomes et al.’s formal verification of CRDTs

Ink & Switch has developed an open-source, JavaScript CRDT implementation called Automerge. It is based on our earlier research on JSON CRDTs. We then combined Automerge with the Dat networking stack to form Hypermerge. We do not claim that these libraries fully realize local-first ideals — more work is still required.

However, based on our experience with them, we believe that CRDTs have the potential to be a foundation for a new generation of software. Just as packet switching was an enabling technology for the Internet and the web, or as capacitive touchscreens were an enabling technology for smartphones, so we think CRDTs may be the foundation for collaborative software that gives users full ownership of their data.

### Ink & Switch prototypes

While academic research has made good progress designing the algorithms for CRDTs and verifying their theoretical correctness, there is so far relatively little industrial use of these technologies. Moreover, most industrial CRDT use has been in server-centric computing, but we believe this technology has significant potential in client-side applications for creative work.

Server-centric systems using CRDTs include Azure Cosmos DB, Redis, Riak, Weave Mesh, SoundCloud’s Roshi, and Facebook’s OpenR. However, we are most interested in the use of CRDTs on end-user devices.

This was the motivation for our lab to embark on a series of experimental prototypes with collaborative, local-first applications built on CRDTs. Each prototype offered an end-user experience modeled after an existing app for creative work such as Trello, Figma, or Milanote.

These experiments explored questions in three areas:

**Technology viability.** How close are CRDTs to being usable for working software?
What do we need for network communication, or installation of the software to begin with?

**User experience.** How does local-first software feel to use? Can we get a seamless Google Docs-like real-time collaboration experience without an authoritative centralized server? How about a Git-like, offline-friendly, asynchronous collaboration experience for data types other than source code? And generally, how are user interfaces different without a centralized server?

**Developer experience.** For an app developer, how does the use of a CRDT-based data layer compare to existing storage layers like a SQL database, a filesystem, or Core Data? Is a distributed system harder to write software for? Do we need schemas and type checking? What will developers use for debugging and introspection of their application’s data layer?

We built three prototypes using Electron, JavaScript, and React. This gave us the rapid development capability of web technologies while also giving our users a piece of software they can download and install, which we discovered is an important part of the local-first feeling of ownership.

#### Kanban board

Trellis is a Kanban board modeled after the popular Trello project management software.

[Figure: Trellis offers a Trello-like experience with local-first software. The change history on the right reflects changes made by all users active in the document.]

On this project we experimented with WebRTC for the network communication layer.

On the user experience side, we designed a rudimentary “change history” inspired by Git and Google Docs’ “See New Changes” that allows users to see the operations on their Kanban board. This includes stepping back in time to view earlier states of the document.

Watch Trellis in action with the demo video or download a release and try it yourself.

#### Collaborative drawing

PixelPusher is a collaborative drawing program, bringing a Figma-like real-time experience to Javier Valencia’s Pixel Art to CSS.

[Figure: Drawing together in real-time. A URL at the top offers a quick way to share this document with other users. The “Versions” panel on the right shows all branches of the current document. The arrow buttons offer instant merging between branches.]

On this project we experimented with network communication via peer-to-peer libraries from the Dat project.

User experience experiments include URLs for document sharing, a visual branch/merge facility inspired by Git, a conflict-resolution mechanism that highlights conflicted pixels in red, and basic user identity via user-drawn avatars.

Read the full project report or download a release to try it yourself.

#### Media canvas

PushPin is a mixed media canvas workspace similar to Miro or Milanote. As our third project built on Automerge, it’s the most fully-realized of these three. Real use by our team and external test users put more strain on the underlying data layer.

[Figure: PushPin’s canvas mixes text, images, discussion threads, and web links. Users see each other via presence avatars in the toolbar, and navigate between their own documents using the URL bar.]
PushPin explored nested and connected shared documents, varied renderers for CRDT documents, a more advanced identity system that included an “outbox” model for sharing, and support for sharing ephemeral data such as selection highlights.

Watch the PushPin demo video or download a release and try it yourself.

#### Findings

Our goal in developing the three prototypes Trellis, PixelPusher and PushPin was to evaluate the technology viability, user experience, and developer experience of local-first software and CRDTs. We tested the prototypes by regularly using them within the development team (consisting of five members), reflecting critically on our experiences developing the software, and by conducting individual usability tests with approximately ten external users. The external users included professional designers, product managers, and software engineers. We did not follow a formal evaluation methodology, but rather took an exploratory approach to discovering the strengths and weaknesses of our prototypes.

In this section we outline the lessons we learned from building and using these prototypes. While these findings are somewhat subjective, we believe they nevertheless contain valuable insights, because we have gone further than other projects down the path towards production-ready local-first applications based on CRDTs.

**CRDT technology works.** From the beginning we were pleasantly surprised by the reliability of Automerge. App developers on our team were able to integrate the library with relative ease, and the automatic merging of data was almost always straightforward and seamless.

**The user experience with offline work is splendid.** The process of going offline, continuing to work for as long as you want, and then reconnecting to merge changes with colleagues worked well. While other applications on the system threw up errors (“offline! warning!”) and blocked the user from working, the local-first prototypes function normally regardless of network status. Unlike browser-based systems, there is never any anxiety about whether the application will work or the data will be there when the user needs it. This gives the user a feeling of ownership over their tools and their work, just as we had hoped.

**Developer experience is viable when combined with Functional Reactive Programming (FRP).** The FRP model of React fits well with CRDTs. A data layer based on CRDTs means the user’s document is simultaneously getting updates from the local user (e.g. as they type into a text document) but also from the network (as other users and other devices make changes to the document).

Because the FRP model reliably synchronizes the visible state of the application with the underlying state of the shared document, the developer is freed from the tedious work of tracking changes arriving from other users and reconciling them with the current view. Also, by ensuring all changes to the underlying state are made through a single function (a “reducer”), it’s easy to ensure that all relevant local changes are sent to other users.

The result of this model was that all of our prototypes realized real-time collaboration and full offline capability with little effort from the application developer.
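As a sketch of this update loop (our illustration, not code from the prototypes): all local edits flow through one change function, all remote updates flow through one merge function, and the UI re-renders from the single resulting document. Here `render` stands in for a React subscription, and the network layer is omitted:

```js
// Sketch of the single-source-of-truth update loop described above.
const Automerge = require('automerge')

let doc = Automerge.from({ cards: [] })

function render(state) {
  // Placeholder for a React re-render triggered by the new state.
  console.log('re-render with', state.cards.length, 'cards')
}

// All local edits go through one function, reducer-style...
function changeDoc(changeFn) {
  doc = Automerge.change(doc, changeFn)
  render(doc)
}

// ...and all remote updates go through another.
function receiveRemote(remoteDoc) {
  doc = Automerge.merge(doc, remoteDoc)
  render(doc)
}

changeDoc(d => d.cards.push({ text: 'hello' }))  // e.g. a local keystroke
```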
This is a significant benefit as it allows app developers to focus on their application rather than the challenges of data distribution.Conflicts are not as significant a problem as we feared.We are often asked about the effectiveness of automatic merging, and many people assume that application-specific conflict resolution mechanisms are required. However, we found that users surprisingly rarely encounter conflicts in their work when collaborating with others, and that generic resolution mechanisms work well. The reasons for this are:Automerge tracks changes at a fine-grained level, and takes datatype semantics into account. For example, if two users concurrently insert items at the same position into an array, Automerge combines these changes by positioning the two new items in a deterministic order. In contrast, a textual version control system like Git would treat this situation as a conflict requiring manual resolution.Users have an intuitive sense of human collaboration and avoid creating conflicts with their collaborators. For example, when users are collaboratively editing an article, they may agree in advance who will be working on which section for a period of time, and avoid concurrently modifying the same section.When different users concurrently modify different parts of the document state, Automerge will merge these changes cleanly without difficulty. With the Kanban app, for example, one user could post a comment on a card and another could move it to another column, and the merged result will reflect both of these changes. Conflicts arise only if users concurrently modify the same property of the same object: for example, if two users concurrently change the position of the same image object on a canvas. In such cases, it is often arbitrary how they are resolved and satisfactory either way.Automerge’s data structures come with a small set of default resolution policies for concurrent changes. In principle, one might expect different applications to require different merge semantics. However, in all the prototypes we developed, we found that the default merge semantics to be sufficient, and we have so far not identified any case requiring customised semantics. We hypothesise that this is the case generally, and we hope that future research will be able to further test this hypothesis.Visualizing document history is important.In a distributed collaborative system another user can deliver any number of changes to you at any moment. Unlike centralized systems, where servers mediate change, local-first applications need to find their own solutions to these problems. Without the right tools, it can be difficult to understand how a document came to look the way it does, what versions of the document exist, or where contributions came from.In the Trellis project we experimented with a “time travel” interface, allowing a user to move back in time to see earlier states of a merged document, and automatically highlighting recently changed elements as changes are received from other users. The ability to traverse a potentially complex merged document history in a linear fashion helps to provide context and could become a universal tool for understanding collaboration.URLs are a good mechanism for sharing.We experimented with a number of mechanisms for sharing documents with other users, and found that a URL model, inspired by the web, makes the most sense to users and developers. URLs can be copied and pasted, and shared via communication channels such as email or chat. 
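As a sketch of what we mean (the URL scheme below is hypothetical, not PushPin's actual format): a document URL is simply an encoding of a document identifier, which the receiving app hands to its sync layer to fetch and subscribe to that document.

```js
// Hypothetical sketch of URL-based document sharing. The scheme name
// and structure are our illustration only.
function shareUrl(docId) {
  return `localfirst://document/${encodeURIComponent(docId)}`
}

function openSharedUrl(url) {
  const m = url.match(/^localfirst:\/\/document\/(.+)$/)
  if (!m) throw new Error('not a document URL')
  const docId = decodeURIComponent(m[1])
  // Hand the ID to the sync layer, which fetches the document from any
  // reachable peer and subscribes to future changes (omitted here).
  return docId
}

console.log(openSharedUrl(shareUrl('a1b2c3')))  // -> 'a1b2c3'
```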
Access permissions for documents beyond secret URLs remain an open research question.

**Peer-to-peer systems are never fully “online” or “offline,” and it can be hard to reason about how data moves in them.** A traditional centralized system is generally “up” or “down,” states defined, for each client, by its ability to maintain a steady network connection to the server. The server determines the truth of a given piece of data.

In a decentralized system, we can have a kaleidoscopic complexity to our data. Any user may have a different perspective on what data they either have, choose to share, or accept. For example, one user’s edits to a document might be on their laptop on an airplane; when the plane lands and the computer reconnects, those changes are distributed to other users. Other users might choose to accept all, some, or none of those changes to their version of the document.

Different versions of a document can lead to confusion. As with a Git repository, what a particular user sees in the “master” branch is a function of the last time they communicated with other users. Newly arriving changes might unexpectedly modify parts of the document you are working on, but manually merging every change from every user is tedious. Decentralized documents enable users to be in control over their own data, but further study is needed to understand what this means in practical user-interface terms.

**CRDTs accumulate a large change history, which creates performance problems.** Our team used PushPin for “real” documents such as sprint planning. Performance and memory/disk usage quickly became a problem because CRDTs store all history, including character-by-character text edits. These pile up, but can’t easily be truncated because it’s impossible to know when someone might reconnect to your shared document after six months away and need to merge changes from that point forward.

We continue to optimize Automerge, but this is a major area of ongoing work.

**Network communication remains an unsolved problem.** CRDT algorithms provide only for the merging of data, but say nothing about how different users’ edits arrive on the same physical computer.

In these experiments we tried network communication via WebRTC; a “sneakernet” implementation of copying files around with Dropbox and USB keys; possible use of the IPFS protocols; and eventually settled on the Hypercore peer-to-peer libraries from Dat.

CRDTs do not require a peer-to-peer networking layer; using a server for communication is fine for CRDTs. However, to fully realize the longevity goal of local-first software, we want applications to outlive any backend services managed by their vendors, so a decentralized solution is the logical end goal.

The use of P2P technologies in our prototypes yielded mixed results. On one hand, these technologies are nowhere near production-ready: NAT traversal, in particular, is unreliable depending on the particular router or network topology where the user is currently connected. But the promise suggested by P2P protocols and the Decentralized Web community is substantial. Live collaboration between computers without Internet access feels like magic in a world that has come to depend on centralized APIs.

**Cloud servers still have their place for discovery, backup, and burst compute.** A real-time collaborative prototype like PushPin lets users share their documents with other users without an intermediating server.
Hashbase is an example of a cloud peer and bridge for Dat and Beaker Browser.

Similarly, cloud peers could be:

- an archival/backup location (especially for phones or other devices with limited storage);
- a bridge to traditional server APIs (such as weather forecasts or stock tickers);
- a provider of burst computing resources (like rendering a video using a powerful GPU).

The key difference between traditional systems and local-first systems is not an absence of servers, but a change in their responsibilities: they are in a supporting role, not the source of truth.

How you can help

These experiments suggest that local-first software is possible. Collaboration and ownership are not at odds with each other — we can get the best of both worlds, and users can benefit.

However, the underlying technologies are still a work in progress. They are good for developing prototypes, and we hope that they will evolve and stabilize in the coming years, but realistically, it is not yet advisable to replace a proven product like Firebase with an experimental project like Automerge in a production setting today.

If you believe in a local-first future, as we do, what can you (and all of us in the technology field) do to move us toward it? Here are some suggestions.

For distributed systems and programming languages researchers

Local-first software has benefited tremendously from recent research into distributed systems, including CRDTs and peer-to-peer technologies. The research community is making excellent progress in improving the performance and power of CRDTs, and we eagerly await further results from that work. Still, there are interesting opportunities for further work.

Most CRDT research operates in a model where all collaborators immediately apply their edits to a single version of a document. However, practical local-first applications require more flexibility: users must have the freedom to reject edits made by another collaborator, or to make private changes to a version of the document that is not shared with others. A user might want to apply changes speculatively or reformat their change history. These concepts are well understood in the distributed source control world as "branches," "forks," "rebasing," and so on. There is little work to date on understanding the algorithms and programming models for collaboration in situations where multiple document versions and branches exist side by side.

We see further interesting problems around types, schema migrations, and compatibility. Different collaborators may be using different versions of an application, potentially with different features. As there is no central database server, there is no authoritative "current" schema for the data. How can we write software so that varying application versions can safely interoperate, even as data formats evolve?
This question has analogues in cloud-based API design, but a local-first setting brings additional challenges.

For Human-Computer Interaction (HCI) researchers

For centralized systems, there are ample examples in the field today of applications that indicate their "sync" state with a server. Decentralized systems bring a whole host of interesting new user interface challenges to explore.

We hope researchers will consider how to communicate online and offline states — or available and unavailable states — in systems where any other user may hold a different copy of the data. How should we think about connectivity when everyone is a peer? What does it mean to be "online" when we can collaborate directly with other nodes without access to the wider Internet?

[Figure: an example Git commit history as visualized by GitX — the "railroad track" model used for visualizing the structure of source code history in a Git repository.]

When every document can develop a complex version history, simply through daily operation, an acute problem arises: how do we communicate this version history to users? How should users think about versioning, share and accept changes, and understand how their documents came to be a certain way when there is no central source of truth? Today there are two mainstream models for change management: a source-code model of diffs and patches, and a Google Docs model of suggestions and comments. Are these the best we can do? How do we generalize these ideas to data formats that are not text? We are eager to see what can be discovered.

While centralized systems rely heavily on access control and permissions, the same concepts do not directly apply in a local-first context. For example, any user who has a copy of some data cannot be prevented from locally modifying it; however, other users may choose whether or not to subscribe to those changes. How should users think about sharing, permissions, and feedback? If we can't remove documents from others' computers, what does it mean to "stop sharing" with someone?

We believe that the assumption of centralization is deeply ingrained in our user experiences today, and we are only beginning to discover the consequences of changing that assumption. We hope these open questions will inspire researchers to explore what we believe is an untapped area.

For practitioners

If you're a software engineer, designer, product manager, or independent app developer working on production-ready software today, how can you help? We suggest taking incremental steps toward a local-first future. Start by scoring your app against the seven ideals:

1. Fast
2. Multi-device
3. Offline
4. Collaboration
5. Longevity
6. Privacy
7. User control

Then consider some strategies for improving each area:

Fast. Aggressive caching and downloading resources ahead of time can be a way to prevent the user from seeing spinners when they open your app or a document they previously had open. Trust the local cache by default instead of making the user wait for a network fetch.

Multi-device. Syncing infrastructure like Firebase and iCloud makes multi-device support relatively painless, although it does introduce longevity and privacy concerns. Self-hosted infrastructure like Realm Object Server provides an alternative trade-off.

Offline. In the web world, Progressive Web Apps offer features like Service Workers and app manifests that can help. In the mobile world, be aware of WebKit frames and other network-dependent components.
Test your app by turning off your WiFi, or by using traffic shapers such as the Chrome DevTools network condition simulator or the iOS network link conditioner.

Collaboration. Besides CRDTs, the more established technology for real-time collaboration is Operational Transformation (OT), as implemented e.g. in ShareDB.

Longevity. Make sure your software can easily export to flattened, standard formats like JSON or PDF. For example: mass export such as Google Takeout; continuous backup into stable file formats, as in GoodNotes; and JSON download of documents, as in Trello.

Privacy. Cloud apps are fundamentally non-private, with employees of the company and governments able to peek at user data at any time. But for mobile or desktop applications, try to make clear to users when the data is stored only on their device versus being transmitted to a backend.

User control. Can users easily back up, duplicate, or delete some or all of their documents within your application? Often this involves re-implementing all the basic filesystem operations, as Google Docs has done with Google Drive.

Call for startups

If you are an entrepreneur interested in building developer infrastructure, all of the above suggests an interesting market opportunity: "Firebase for CRDTs."

Such a startup would need to offer a great developer experience and a local persistence library (something like SQLite or Realm). It would need to be available for mobile platforms (iOS, Android), native desktop (Windows, Mac, Linux), and web technologies (Electron, Progressive Web Apps).

User control, privacy, multi-device support, and collaboration would all be baked in. Application developers could focus on building their app, knowing that the easiest implementation path would also give them top marks on the local-first scorecard. As a litmus test to see if you have succeeded, we suggest: do all your customers' apps continue working in perpetuity, even if all servers are shut down?

We believe the "Firebase for CRDTs" opportunity will be huge as CRDTs come of age. We'd like to hear from you if you're working on this.

Conclusions

Computers are one of the most important creative tools mankind has ever produced. Software has become the conduit through which our work is done and the repository in which that work resides.

In the pursuit of better tools we moved many applications to the cloud. Cloud software is in many regards superior to "old-fashioned" software: it offers collaborative, always-up-to-date applications, accessible from anywhere in the world. We no longer worry about what software version we are running, or what machine a file lives on.

However, in the cloud, ownership of data is vested in the servers, not the users, and so we became borrowers of our own data. The documents created in cloud apps are destined to disappear when the creators of those services cease to maintain them. Cloud services defy long-term preservation. No Wayback Machine can restore a sunsetted web application. The Internet Archive cannot preserve your Google Docs.

In this article we explored a new way forward for software of the future. We have shown that it is possible for users to retain ownership and control of their data, while also benefiting from the features we associate with the cloud: seamless collaboration and access from anywhere. It is possible to get the best of both worlds.

But more work is needed to realize the local-first approach in practice.
Application developers can take incremental steps, such as improving offline support and making better use of on-device storage. Researchers can continue improving the algorithms, programming models, and user interfaces for local-first software. Entrepreneurs can develop foundational technologies such as CRDTs and peer-to-peer networking into mature products able to power the next generation of applications.

Today it is easy to create a web application in which the server takes ownership of all the data. But it is too hard to build collaborative software that respects users' ownership and agency. In order to shift the balance, we need to improve the tools for developing local-first software. We hope that you will join us.

We welcome your thoughts, questions, or critique: @inkandswitch or hello@inkandswitch.com.

Acknowledgments

Martin Kleppmann is supported by a grant from The Boeing Company. Thank you to our collaborators at Ink & Switch who worked on the prototypes discussed above: Julia Roggatz, Orion Henry, Roshan Choxi, Jeff Peterson, Jim Pick, and Ignatius Gilfedder. Thank you also to Heidi Howard, Roly Perera, and to the anonymous reviewers from Onward! for feedback on a draft of this article.
# Part 4: Line IDs

February 25, 2019

I've written quite a bit about the theory of patches and merging, but nothing yet about how to actually implement anything efficiently. That will be the subject of this post, and probably some future posts too. Algorithms and efficiency are not really discussed in the original paper, so most of this material I learned from reading the pijul source code. Having said that, my main focus here is on broader ideas and algorithms, so you shouldn't assume that anything written here is an accurate reflection of pijul (plus, my pijul knowledge is about 2 years out of date by now).

Three sizes

Before getting to the juicy details, we have to decide what it means for things to be fast. In a VCS, there are three different size scales that we need to think about. From smallest to biggest, we have:

1. The size of the change. Like, if we're changing one line in a giant file, then the size of the change is just the length of the one line that we're changing.
2. The size of the current output. In the case of ojo (which just tracks a single file), this is just the size of the file.
3. The size of the history. This includes everything that has ever been in the repository, so if the repository has been active for years then the size of the history could be much bigger than the current size.

The first obvious requirement is that the size of a patch should be proportional to the size of the change. This sounds almost too obvious to mention, but remember the definition of a patch from here: a patch consists of a source file, a target file, and a function from one to the other that has certain additional properties. If we were to naively translate this definition into code, the size of a patch would be proportional to the size of the entire file.

Of course, this is a solved problem in the world of UNIX-style diffs (which I mentioned all the way back in the first post). The problem is to adapt the diff approach to our mathematical patch framework; for example, the fact that our files need not even be ordered means that it doesn't make sense to talk about inserting a line "after line 62."

The key to solving this turns out to be unique IDs: give every line in the entire history of the repository a unique ID. This isn't even very difficult: we can give every patch in the history of the repository a unique ID by hashing its contents. For each patch, we can enumerate the lines that it adds, and then for the rest of time we can uniquely refer to those lines with names like "the third line added by patch Ar8f."

Representing patches

Once we've added unique IDs to every line, it becomes pretty easy to encode patches compactly. For example, suppose we want to describe this patch:

[Figure: the example patch described below.]

Here, dA is the unique ID of the patch that introduced the to-do, shoes, and garbage lines, and x5 is the unique ID of the patch that we want to describe. Anyway, the patch is now easy to describe by using the unique IDs to specify what we want to do: delete the line with ID dA/1, add the line with ID x5/0 and contents "work", and add an edge from the line dA/2 to the line x5/0.

Let's have a quick look at how this is implemented in ojo, by taking a peek at the API docs. Patches, funnily enough, are represented by the Patch struct, which basically consists of metadata (author, commit message, timestamp) and a list of Changes.
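Before looking at the changes themselves, here is a sketch of what those IDs might amount to (field names are illustrative, not necessarily ojo's exact definitions):

```rust
// A sketch of the ID scheme described above: a patch ID is a hash of
// the patch contents, and a line ID is that patch ID plus the index of
// the line among the lines the patch added.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct PatchId([u8; 32]); // e.g., a hash of the patch contents

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct NodeId {
    patch: PatchId, // which patch introduced the line
    line: u64,      // "the third line added by patch Ar8f" => line == 2
}
```

With this scheme, a line's identity never changes, no matter how the file around it is edited.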
The Changes are the most interesting part, and they look like this:

```rust
pub enum Change {
    NewNode { id: NodeId, contents: Vec<u8> },
    DeleteNode { id: NodeId },
    NewEdge { src: NodeId, dest: NodeId },
}
```

In other words, the example that we saw above is basically all there is to it, as far as patches go.

If you want to see what actual patches look like in actual usage, you can do that too, because ojo keeps all of its data in human-readable text. After installing ojo (with cargo install ojo), you can create a new repository (with ojo init), edit the file ojo_file.txt with your favorite editor, and then:

```
$ ojo patch create -m "Initial commit" -a Me
Created patch PMyANESmvMQ8WR8ccSKpnH8pLc-uyt0jzGkauJBWeqx4=
$ ojo patch export -o out.txt PSc97nCk9oRrRl-2IW3H8TYVtA0hArdVtj5F0f4YSqqs=
Successfully wrote the file 'out.txt'
```

Now look in out.txt to see your NewNodes and NewEdges in all their glory.

Antiquing

I introduced unique IDs as a way to achieve compact representations of patches, but it turns out that they also solve a problem that I promised to explain two years ago: how do I compute the "most antique" version of a patch? Or equivalently, if I have some patch but I want to apply it to a slightly different repository, how do I know whether I can do that? With our description of patches above, this is completely trivial: a patch can only add lines, delete lines, or add edges. Adding lines is always valid, no matter what the repository contains. Deleting lines and adding edges can be done if and only if the lines to delete, or the lines to connect, exist in the repository. Since lines have unique IDs, checking this is unambiguous. Actually, it's really easy, because the line IDs are tied to the patch that introduced them: a patch can be applied if and only if all the patch IDs that it refers to have already been applied. For obvious reasons, we refer to these as "dependencies": the dependencies of a patch are all the other patches that it refers to in DeleteNode and NewEdge commands. You can see this in action here.

By the way, this method will always give a minimal set of dependencies (in other words, the most antique version of a patch), but that isn't necessarily the right thing to do. For example, if a patch deletes a line, then it seems reasonable for it to also depend on the lines adjacent to the deleted line. Ojo might do this in the future, but for now it sticks to the minimal dependencies.
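As a sketch of that dependency computation (my own illustration, not ojo's actual code, written in terms of the Change enum above and the PatchId/NodeId types sketched earlier):

```rust
use std::collections::HashSet;

// Collect the patch IDs referenced by DeleteNode and NewEdge changes.
// NewNode adds no dependencies, because adding lines is always valid.
fn dependencies(changes: &[Change], this_patch: PatchId) -> HashSet<PatchId> {
    let mut deps = HashSet::new();
    for ch in changes {
        match ch {
            Change::NewNode { .. } => {} // always applicable
            Change::DeleteNode { id } => {
                deps.insert(id.patch);
            }
            Change::NewEdge { src, dest } => {
                deps.insert(src.patch);
                deps.insert(dest.patch);
            }
        }
    }
    // A patch may refer to lines it created itself; that is not a
    // dependency on another patch.
    deps.remove(&this_patch);
    deps
}
```

A patch is then applicable to a repository exactly when every patch ID in this set has already been applied.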
Applying patches

Now that we know how to compactly represent patches, how quickly can we apply them? To get really into detail here, we'd need to talk about how the state of the repository is represented on disk (which is an interesting topic on its own, but a bit out of scope for this post). Let's just pretend for now that the current state of the repository is stored as a graph in memory, using some general-purpose crate (like, say, petgraph). Each node in the graph needs to store the contents of the corresponding line, as well as a "tombstone" saying whether it has been deleted (see the first post). Assuming we can add nodes and edges in constant time (like, say, in petgraph), applying a single change is a constant-time operation. That means the time it takes to apply the whole patch is proportional to the number of changes. That's the best we could hope for, so we're done, right? What was even the point of the part about three size scales?

Revenge of the ghosts

Imagine you have a file that contains three lines:

```
first line
second line
last line
```

but behind the scenes, there are a bunch of lines that used to be there. So ojo's representation of your file might look like:

[Figure: the three live lines, interleaved with many deleted lines.]

Now let's imagine that we delete "second line." The patch to do this consists of a single DeleteNode command, and it takes almost no time to apply. Now that we have this internal representation, ojo needs to create a file on disk showing the new state. That is, we want to somehow go from the internal representation above to the file

```
first line
last line
```

Do you see the problem? Even though the output file is only two lines long, in order to produce it we need to visit all of the lines that used to be there but have since been deleted. In other words, we can apply patches quickly (in timescale 1), but rendering the output file is slow (in timescale 3). For a real VCS that tracks decade-old repositories, that clearly isn't going to fly.

Pseudo-edges

There are several ingredients that go into supporting fast rendering of output files (fast here means "timescale 2, most of the time," which is the best that we can hope for). Those are going to be the subject of the next post. So that you have something to think about until then, let me get you started: the key idea is to introduce "pseudo-edges," which are edges that we insert on our own in order to allow us to "skip" large regions of deleted lines. In the example above, the goal is to actually generate this graph:

[Figure: the same graph, plus a pseudo-edge jumping from "first line" directly to "last line."]

These extra edges will allow us to quickly render output files, but they open up a new can of worms (and were the source of several subtle bugs in pijul as of two years ago, when I last used it): how do we know when to add (or remove, or update) pseudo-edges? Keep in mind that we aren't willing to traverse the entire graph to compute the pseudo-edges, because that would defeat the purpose.
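To make the payoff concrete, here is a minimal sketch (my own illustration — the field names are invented, and it is simplified to a fully ordered graph with at most one successor per line) of how pseudo-edges let rendering skip tombstones:

```rust
// Assumption: every maximal run of deleted lines is bridged by a
// pseudo-edge pointing at the next live line.
struct Node {
    contents: String,
    deleted: bool,        // the "tombstone" flag
    next: Option<usize>,  // ordinary successor (may be a tombstone)
    skip: Option<usize>,  // pseudo-edge to the next live line, if any
}

// Rendering touches only live lines, so the work done is proportional
// to the size of the *output* (timescale 2), not the history.
fn render(nodes: &[Node], first: usize) -> String {
    let mut out = String::new();
    let mut cur = Some(first);
    while let Some(i) = cur {
        let n = &nodes[i];
        if !n.deleted {
            out.push_str(&n.contents);
            out.push('\n');
        }
        // Prefer the pseudo-edge: it jumps over runs of deleted lines.
        cur = n.skip.or(n.next);
    }
    out
}
```

The hard part, as the post says, is keeping those skip edges up to date as patches add and delete lines — without ever traversing the whole graph to recompute them.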
# Is there a VCS for a person who doesn't want to spend every day thinking about how they are supposed to use their VCS?

Mercurial. It is quite good. I used it for quite a while for my personal stuff, but migrated to Git because it is a useful thing to know if I ever decide to take a dev job.

CVS is easy to set up for local, personal use (set CVS_HOME to a local directory). I used it for a few years, initially out of curiosity, before using Mercurial.

I still use RCS for single files (mostly plain text documents).

Darcs and Pijul are interesting projects that have patches as a primitive. Darcs is known to be slow on bigger projects, and AFAIK Pijul is an attempt at a similar but faster system using Rust (Darcs is written in Haskell).

8 votes

hereticalgorithm — February 22, 2019 (edited February 22, 2019)

It might be a problem of git's model clashing with your intuition of how a VCS should work.

Darcs (more stable, but has performance issues with merging) & Pijul (experimental, faster algorithms) both operate on a patch-based model. Instead of tracking a series of snapshots in time (and branches horizontally), they track sets of changes (and possible combinations).

Users report that this model eliminates the ~~eldritch horror~~ unexpected behavior that ~~lurks dreaming beneath R'lyeh~~ is sometimes obscured by the porcelain interface but still lurks within the plumbing.

That being said, those two projects are a step closer towards "academic elegance/purity" (reflecting their theoretical origins) and away from "fast and dirty hacks" (Git was an emergency replacement for a proprietary VCS that suddenly cut ties w/ the Linux Foundation). This may make them a bad cultural/philosophical fit for an EE (or exactly what you've been missing!). YMMV

6 votes

teaearlgraycold — February 22, 2019

Fossil is also an option.

Personally I like git and would use it even for small projects.

4 votes

babypuncher — February 22, 2019

Git is as simple or complicated as you want it to be. Find a good GUI for it and forget about all the features you don't want to use. I don't think any "simpler" VCS will actually be easier in your day-to-day use than Git, if you're using the right client that offers a workflow you can learn easily.

3 votes

InherentlyGloomy — February 22, 2019

Mercurial is a well-established and widely used VCS. Subversion is another one, although I rarely hear people speak well of it.

CVS is an older one that's usually found in legacy systems. It does handle large files and binaries pretty well, to its credit.

I've heard good things about Bazaar, but I haven't used it personally.

1 vote

Silbern — February 22, 2019

Subversion was a very popular system back in its day, and tons of open source projects used it. The thing about Subversion is that it's a hierarchical VCS, vs. Git, which is distributed. Some of the greatest proponents of Git, and the greatest opponents of Subversion, were open source projects, for whom Git worked vastly better with their workflow of merging patches freeform and experimenting with different branches. If you want a hierarchical system, with its pros and cons, Subversion is actually a pretty good choice.

4 votes

Amarok — February 22, 2019 (edited February 22, 2019)

That's what we were using where I worked... once I managed to pull the 16 GB-ish of data the company had accumulated over decades inside Visual SourceSafe out and process it into something that Subversion could import without losing all of the versioning history. That was a fun science project.
The Perl scripts took almost three days to finish processing it all.

We got pretty drunk to celebrate the day we retired VSS. Good riddance.

We picked Subversion partly because it was an optimal target, built similar to VSS... but more than that, the Windows SVN server tied properly into Active Directory, so I could manage access to the source code using AD groups just like everything else. At the time there weren't many alternatives that could do that, and Git was a young pup, freshly minted, without that capability. TortoiseSVN integrated perfectly with Windows Explorer, and we had modules that made Subversion into a Visual Studio native. Everything just worked: no hassles, good security, lots of convenience.

1 vote

meghan — February 22, 2019

Git is too complicated on purpose because the core program is a CLI app. If you don't want to have to deal with git, then get a GUI app for it and you'll never have to touch the commands again if you don't want to. Some good options are https://desktop.github.com/ and https://www.sourcetreeapp.com/

mftrhu — February 22, 2019

If you just want to track changes to a single file, RCS is ancient but it works well enough.

I also use it for another reason (AKA laziness): the comma-vee (,v) files it creates are pretty distinctive, and I check in my system configuration files when I fiddle with them — both to keep a history, and to quickly find the ones I modified via locate.
# Part 3: Graggles can have cycles

February 19, 2019

Almost two years ago, I promised a series of three posts about version control. The first two (here and here) introduced a new (at the time) framework for version control. The third post, which I never finished, was going to talk about the data structures and algorithms used in pijul, a version control system built around that new framework. The problem is that pijul is a complex piece of software, and so I had lots of trouble wrapping my head around it.

Two years later, I'm finally ready to continue with this series of posts (but, having learned from my earlier mistakes, I'm not going to predict the total number of posts ahead of time). In the meantime, I've written my own toy version control system (VCS) to help me understand what's going on. It's called ojo, and it's extremely primitive: to start with, it can only track a single file. However, it is (just barely) sophisticated enough to demonstrate the important ideas. I'm also doing my best to make the code clear and well-documented.

Graggles can have cycles

As I try and ease back into this whole blogging business, let me just start with a short answer to something that several people have asked me (and which also confused me at some point). Graggles (which, as described in the earlier posts, are a kind of generalized file in which the lines are not necessarily ordered, but instead form a directed graph) are not DAGs; that is, they can have cycles. To see why, suppose we start out with this graggle:

[Figure: a graggle in which the "shoes" and "garbage" lines have no prescribed order.]

The reason this thing isn't a file is because there's no prescribed order between the "shoes" line and the "garbage" line. Now suppose that my wife and I independently flatten this graggle, but in different ways (because apparently she doesn't care if I get my feet wet).

[Figure: the two flattenings — one with "shoes" before "garbage", the other with "garbage" before "shoes".]

Merging these two flattenings will produce the following graggle:

[Figure: the merged graggle.]

Notice the cycle between "shoes" and "garbage"!

Although I was surprised when I first noticed that graggles could have cycles, if you think about it a bit more then it makes a lot of sense: one graggle put a "garbage" dependency on "shoes" and the other put a "shoes" dependency on "garbage," and so when you merge them a cycle naturally pops out.
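A tiny sketch of that example (the line indices and edges are invented to match the description above): each flattening contributes its ordering edges, and the merged edge set contains both orderings — that is, a cycle.

```rust
fn main() {
    let lines = ["to-do", "shoes", "garbage", "work"];
    // Each flattening is a total order, written as (from, to) edges.
    let mine = [(0, 1), (1, 2), (2, 3)]; // to-do, shoes, garbage, work
    let hers = [(0, 2), (2, 1), (1, 3)]; // to-do, garbage, shoes, work
    // The merge keeps the ordering edges from both flattenings.
    let merged: Vec<(usize, usize)> =
        mine.iter().chain(hers.iter()).copied().collect();
    // Both "shoes" -> "garbage" and "garbage" -> "shoes" survive:
    assert!(merged.contains(&(1, 2)) && merged.contains(&(2, 1)));
    println!("cycle between {} and {}", lines[1], lines[2]);
}
```

Neither edge can be dropped without losing one author's stated ordering, which is exactly why the cycle "naturally pops out" of the merge.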
A while ago I did a video on Git and why I thought Git needed to get good, as I put it, and wound up talking about Fossil, which is normally my go-to source control manager — or revision control system, version control system, or whatever you use to refer to these things. I got a request to talk about Fossil, and this was something I actually had planned to do. I have been busy and had some other things I wanted to do, but I'm more or less prepared to do this now.

I'm not going to get hardcore into details, mostly because Richard Hipp already has an absolutely great presentation on some issues with Git and why Fossil has a better approach, in his opinion. He's the author of Fossil, so, you know, obvious biases — but inevitably, when you are the author of a product, you designed it a certain way because you had needs for it to be that certain way.

We'll start off with installing it. If you're on Linux or BSD or anything like that, the installation procedure is fairly straightforward: use your package manager, and it's going to be called fossil, end of story. On Windows, if you are running a recent enough version of Windows, you will be able to use Find-Package and fossil, and it'll say there's a fossil package available from the Chocolatey source. If you are not on a recent enough version of Windows and Find-Package and the other package-management cmdlets are not present, you can install Chocolatey directly, although you can also just go to the Fossil web page and download the binary there. The only thing you have to be aware of with Windows is that by default the search path is a little not what command-line junkies typically expect, but it's not particularly hard to get it set up. In my case I have it set up so that anything Chocolatey and others wind up installing gets automatically put in there, so what I can do is Install-Package and fossil — because there's only one provider, I don't need to specify the provider — and then I can just go ahead and get this installed, and now it should be good to go. Granted, with a very odd-looking command name — you can tell it gets a bit confused — but, you know, it still runs exactly the same; on, say, Linux that would look a lot cleaner.

Fossil does something a little bit different from many other version control systems, especially distributed version control systems, that sort of confuses some people: Fossil is almost like a hybrid of centralized version control and decentralized version control, and it follows this in a lot of ways. One thing you'll need to keep in mind, much like with centralized version control, is that the repository is separate from your working directory. So we'll create the actual repository first. Now I'm going to move over to a different drive, which apparently... [unintelligible] ... oh yeah, I have an issue to resolve involving this connection not being made initially, so it was trying to connect when I was trying to move to the drive, before the connection had been made. It used to auto-mount it, and then after an update that stopped, and I just haven't tracked down the error because it's not that big of a deal, but I obviously need to get around to doing that.

So I actually want to create the repository itself over here, and this is done through fossil — I think just new will work. I mean, using Git so much recently — and much as I hate Git, using it for all the stuff that I hope people will contribute to — has caused me to forget some of the Fossil commands, but I think it's just init. So I have the actual repository name, and what this actually is — it should be unsurprising if you
recognize who Richard Hipp is — is actually a SQLite database. It's just a single file that is the repository, and it's obviously set up a special way, but it's just a SQLite database. You can name this anything; the extension doesn't matter. In my case, I'm just going to create a repository for the spin objects. I haven't really done anything with them yet, but I have a test file for the VS Code extension I was developing, so we can check that in there and it should work fine. ...Should not take this long... okay, looks like it went through. Okay.

So one of the first things you're going to want to do, of course, is configure it, and this is one of the actually really big selling points behind Fossil: I don't have to do anything — there's already a website, with branches and tags and even a ticketing system and a wiki system — and configuration is done through here. Assuming that it... okay, we're good, we're good.

One of the things I'm going to do — that's an interesting rendering error — I would not recommend doing this in the majority of cases, but since I am the only developer and I don't want to mess around with other things, is just giving myself full permissions. In a team setting — like, if anybody else winds up contributing to this, which is unlikely because the spin objects thing is not actually going to be public — then I would scale back my own permissions and, you know, assign them just the permissions they need, and similar. But it's a very convenient way of managing this kind of stuff. And of course contact info — I'm not going to bother with that — and a password; I will deal with this later.

One of the other admin tasks I strongly recommend doing at the beginning — I believe it's here — yes, the actual project name, which of course you do not want left as an unnamed Fossil project. Now, we'll just leave it as that; it's not really that important. It's an internal project, so it's not like this would be something that search engines would really scan, because it's internal. And that should go through — I'm not entirely sure why it doesn't go through the very first time when you hit Apply, but as you can see it did actually go through.

One setting I strongly recommend doing if you are intermixing operating systems
they can all havetheir working directories and then justcommit to the same repository then pushout from the business on to say get wellthat's not going to be github but thatthat same idea they can push out tochisel app or wherever else or in thecase of say multiple teams within anorganization they can each team can havetheir own repository on their system andthen you know push and pull throughoutthe entire organization as opposed towith say get where the repository andthe working directory are the same thingyou're left with say III I've seensomebody complain once that they had 32different instances of the samerepository on the machine just becausethey needed four but the entire team 32different checkoutsit's a little crazyespecially trying to synchronize thechanges between all of themso what we can do for this I'm justgoing to move back over to the desktopand we can go into the existingdirectory and then it should just befossil open and spin opticsand then we're openthis directory just became the workingdirectory for this repository and I canshow that through I think its fossilstatusyes so we have a National empty check-innow one of the things I commented onreally liking about fossil and justfinding absolutely obnoxious about gitis the way changes to existing files aredone in fossil when we want to add afile it can be done through a simple adand then I can just go ahead and add thethe test object now when I want tocommit that I can go ahead and commitand give it a message by default if youjust fossil commit it'll bring up thesystem default text editor I don't wantto configure this on Windows so I'm justpassing it the the messages through thezoo the command line which is finebecause both the command prompt andPowerShell have what's the term for themthe command line editors anyways sowe're not talking about the really olddays where you just did in once and youcould you know move through and edit themessageand of course because that's what it isnow even though we're in the workingdirectory and not on the repositorybecause the working directory knows whatrepository it came from we can still dothisthe only difference is because we're inthe working directory it opens from thetime line because that's usually themost interesting thing when you'reworking on things sothere was nothing for me to configurehere I didn't eat need to set up awebserver I didn't need to set up somethird-party component to display thisstuff I have avisual representation of the timelineyoubut this doesn't really show off what Iwas talking about just yet because thefile there already a you know addingthis the first time is basically thesame way as it's done in and get justcalled code so Bo we open up vs code andedit the filenothing in here at all yet so let's justgo through andso we've got the changes saved with getwhat you would need to do is add it addwhat's called staging you staged thechanges and then when you go to committhe changes it actually commits themwith fossil it just immediatelyrecognized that has changed because thefile already exists in the repository itgoes through and checks like is this onethe same do I need to do anything withit and just since it's changedautomatically knows that those changesshould be committed now this is thedefault behavior it is still possible todo staging like with git with fossilit's just that the default behavior iswhat at least hip believes is the morecommon case and at least with the way Idevelop this approach is definitely mycommon case very rarely do I want toonly commit part of 
what I've actually changed. We can go through and — now remember, I didn't actually add this back or anything, so if I were to just commit that, like with Git, it would not have actually committed the changes made. If we go view the file for this specific check-in, you can see that it actually did automatically commit this.

This is absolutely the single biggest reason why I prefer Fossil over Git and over most version control systems. It might just be the way I do development, but overwhelmingly, the changes I make, I want to check them in all at once; I don't work on things unrelated to what I am checking in, if that makes sense. I don't ever wind up having a need — shouldn't say that — I rarely ever wind up having a need to do partial check-ins. I've done them twice, out of a lot of commits — easily, you know, easily 600-plus commits — I've done them twice. So to have something like this, where the only time you are ever adding a file is when it's truly a new file, is just a major help, because I never wind up with situations in Fossil where I forgot to stage part of my commit and then have to make a commit immediately after, going "oh yeah, I forgot to add these things, here they are."

Now, the UI — the automatic web interface — and the fact that you can very easily couple this into an existing website using CGI are just absolutely huge reasons why I prefer Fossil, along with, like I had mentioned, the fact that the repository and the working directory are separate. So in team scenarios you can have the team share one repository and have several working directories from it, and then, you know, all the decentralized interactions that you need can be done based on the different repositories for the different teams, instead of a repository for every single developer. You can get this kind of setup with Git, although Fossil has another concept that makes it quite a bit easier to work with, called auto-sync — gotta remember where it was... auto-sync, yep, auto-sync is on. If I were to clone the repository I had set up — so, like, on another machine, create a clone based on this repository we just set up here — if auto-sync is on, what happens when you commit to it is that it also automatically syncs with the repository that it was cloned from. I mean cloned, not checked out — check-outs, sorry, are for working directories — I mean cloned; it'll auto-sync between the clones, which is extremely useful behavior in the typical way that, at least as I'm seeing it, things are done.

I would absolutely love something like this to exist for Git, because I've seen it all too often: when I migrated my stuff over to Git, I would commit changes and totally forget to push them to GitHub, and so I can rack up change sets — two, three, I've seen six before — wondering why they're not showing up on GitHub, until I realize that, oh yeah, I didn't sync those as well. Now, a lot of editors will have a commit-and-sync or commit-and-push button that you can press, which helps if you're using that specific editor, but doesn't help if you're using the command line, and it's just sort of a hack around a feature that Git would benefit from but just doesn't have.

There are more technical matters as well for why I prefer Fossil over Git. Like I had said way earlier, Richard Hipp has a great presentation on this that I will link to down in
the video description — definitely check that out. It goes into considerably more detail; I'm just talking here about my biggest reasons for preferring Fossil, and roughly how things are done in Fossil, so that you can compare and see kind of where I'm coming from. Yeah, the link for Richard Hipp's talk will be down in the video description — definitely check that out. There are a lot of technical reasons as well — like auditing purposes and stuff — why Fossil, I would say, is the superior version control. Now go to
# Beyond Git

by Paweł Świątkowski
27 May 2017

If you are a young developer, chances are that Git is the only version control system you know and use. There is nothing wrong with it, as it is probably the best one. However, it's worth knowing that other systems exist too, and some of them offer interesting features as well.

In this post I present a list of VCSs I came across during my programming education and career.

Concurrent Versions System (CVS)

I admit I never used this one. The only contact I had with it was, often, when I criticized SVN. More experienced developers used to say:

> Dude, you haven't used CVS, so shut up!

This is probably the oldest VCS, at least of the ones that were widely adopted. It was first released in 1986, and the last update is from 2009. Initially it was a set of shell scripts, but later it was rewritten in C. The "features" that led it to die out included:

- Non-atomic operations – when you pulled (whatever it was called) from the server and, for example, your connection broke, it was possible that only some of the files were updated.
- No diffs for renamed files – a rename was stored as a deletion and an addition anew.
- No file deletion.

Subversion (SVN)

Subversion is a more or less direct successor to CVS. It was designed to solve CVS's most dire problems, and it became a de facto standard for many years. I used SVN in my second job for a long time (after that we managed to migrate to Git).

SVN offered atomic operations and the ability to rename or delete files. However, compared to Git it still suffers from a poor branching system. Branches are in fact separate directories in the repository, and to apply some changes to many branches you have to prepare a patch on one branch and apply it to the others. There is no branch merging, and creating a new branch is not simple and cost-free.

One more important difference: SVN (and CVS) are not distributed, which means they require a central server where the main repository copy is stored. It also means that they can't really be used offline.

Still, Subversion has (or had) some upsides over Git:

- Because of its directory-based structure, it is possible to check out any part of the tree. For example, you could check out only the assets if you are a graphic designer. This led to the custom of keeping many projects in one huge repository.
- Subversion was slightly better at handling binary files. First of all, Git was reported to crash with too many binaries in the repo, while SVN worked fine. Secondly, those files tended to take up a bit less space. However, Git LFS was later released, which tipped the balance in Git's favor.

Because of its features, SVN is still used today, not only as a legacy VCS but sometimes as a valid choice. For example, Arch Linux's ABS system was ported from rsync to SVN not too long ago.

Mercurial (HG)

Mercurial and Git appeared at more or less the same time. The reason was that a previously popular VCS I was not aware of (BitKeeper) stopped offering its free version (now it's back again). The problem was that the Linux kernel was developed using BitKeeper. Both Mercurial and Git were started as successors to BK, but finally Git won. Mercurial is written mostly in Python, with parts in C. It was released in 2005 and is still actively developed.

I haven't actually ever found any crucial differences between HG and Git. For a long period of time I preferred the former, as I felt it was better at conflict resolution (Git was failing to resolve basically every time back then).
Also, Bitbucket was a hosting service for Mercurial repositories, and it offered private repos, which GitHub did not. It's worth noting that back then Bitbucket did not support Git (it started in 2008, and Git support was added in late 2011).

I know a couple of teams that choose Mercurial over Git for their projects today.

Darcs

Darcs always felt a bit eerie to me. It's written in Haskell, and the language's ecosystem is probably the only place where it gained some popularity. It was advertised as a truly distributed VCS, but Git has probably caught up since then, because the differences between the two seem not that important. Plus, Darcs still does not have local branches.

The latest release of Darcs is from September 2016.

Bazaar

With its name probably derived from Eric Raymond's famous essay, Bazaar is another child of the year 2005 (like Mercurial or Git). It's written in Python and is part of the GNU Project. Because of that, it gained some popularity in the open source world. It is also sponsored and maintained by Canonical, the creators of Ubuntu.

Bazaar offers a choice between centralized (like SVN) and distributed (like Git) workflows. Its repositories may be hosted on GNU Savannah, Launchpad, and SourceForge. The last release was in February 2016.

Fossil

Written in C, born in 2006, Fossil is an interesting, though not really popular, alternative to other version control systems. It includes not only files and revisions, but also a bug tracker and a wiki for the project. It also has a built-in web server. It's under active development. The most prominent project using Fossil is probably Tcl/Tk.

Pijul

Pijul is the new kid on the block. It's written in Rust and attempts to solve the performance issues of Darcs and the security issues (!) of Git. It is advertised as one of the fastest VCSs, and as having one of the best conflict-resolution mechanisms. It is also based on a "mathematically sound theory of patches."

It does not have a version 1.0 yet, though, but it's definitely worth watching, as it might be a thing some day. You can host your Pijul repository at nest.pijul.com.

Summary

| Name | Language | First release | Last release (as of May 2017) |
| ---- | -------- | ------------- | ----------------------------- |
| CVS | C | 1986 | May 2008 |
| Subversion | C | 2000 | November 2016 |
| Mercurial | Python, C | 2005 | May 2017 |
| Darcs | Haskell | 2003 | September 2016 |
| Bazaar | Python | 2005 | February 2016 |
| Fossil | C | 2006 | May 2017 |
| Pijul | Rust | 2015 (?) | May 2017 |
| Git | C | 2005 | May 2017 |
# Part 2: Merging, patches, and pijul

May 13, 2017

In the last post, I talked about a mathematical framework for a version control system (VCS) without merge conflicts. In this post I'll explore pijul, which is a VCS based on a similar system. Note that pijul is under heavy development; this post is based on a development snapshot (I almost called it a "git" snapshot by mistake), and might be out of date by the time you read it.

The main goal of this post is to describe how pijul handles what other VCSes call conflicts. We'll see some examples where pijul's approach works better than git's, and I'll discuss why.

Some basics

I don't want to write a full pijul tutorial here, but I do need to mention the basic commands if you're to have any hope of understanding the rest of the post. Fortunately, pijul commands have pretty close analogues in other VCSes.

- pijul init creates a pijul repository, much like git init or hg init.
- pijul add tells pijul that it should start tracking a file, much like git add or hg add.
- pijul record looks for changes in the working directory and records a patch with those changes, so it's similar to git commit or hg commit. Unlike those two (and much like darcs record), pijul record asks a million questions before doing anything; you probably want to use the -a option to stop it.
- pijul fork creates a new branch, like git branch. Unlike git branch, which creates a copy of the current branch, pijul fork defaults to creating a copy of the master branch. (This is a bug, apparently.)
- pijul apply adds a patch to the current branch, like git cherry-pick.
- pijul pull fetches and merges another branch into your current branch. The other branch could be a remote branch, but it could also just be a branch in the local repository.

Dealing with conflicts

As I explained in the last post, pijul differs from other VCSes by not having merge conflicts. Instead, it has (what I call) graggles, which are different from files in that their lines form a directed acyclic graph instead of a totally ordered list. The thing about graggles is that you can't really work with them (for example, by opening them in an editor), so pijul doesn't let you actually see the graggles: it stores them as graggles internally, but renders them as files for you to edit. As an example, we'll create a graggle by asking pijul to perform the following merge:

[Figure: merging one patch that adds a "* shoes" line with another that adds a "* garbage" line.]

Here are the pijul commands to do this:

```
$ pijul init
# Create the initial file and record it.
$ cat > todo.txt << EOF
> to-do
> * work
> EOF
$ pijul add todo.txt
$ pijul record -a -m todo
# Switch to a new branch and add the shoes line.
$ pijul fork --branch=master shoes
$ sed -i '2i* shoes' todo.txt
$ pijul record -a -m shoes
# Switch to a third branch and add the garbage line.
$ pijul fork --branch=master garbage
$ sed -i '2i* garbage' todo.txt
$ pijul record -a -m garbage
# Now merge in the "shoes" change to the "garbage" branch.
$ pijul pull . --from-branch shoes
```

The first thing to notice after running those commands is that pijul doesn't complain about any conflicts (this is not intentional; it's a known issue). Anyway, if you run the above commands then the final, merged version of todo.txt will look like this:

[Figure: the rendered todo.txt, with the two new lines separated by conflict-marker lines.]

That's… a little disappointing, maybe, especially since pijul was supposed to free us from merge conflicts, and this looks a lot like a merge conflict. The point, though, is that pijul has to somehow produce a file – one that the operating system and your editor can understand – from the graggle that it maintains internally. The output format just happens to look a bit like what other VCSes output when they need you to resolve a merge conflict.

As it stands, pijul doesn't have a very user-friendly way to actually see its internal graggles. But with a little effort, you can figure it out. The secret is the command

```
RUST_LOG="libpijul::backend=debug" pijul info --debug
```

For every branch, this will create a file named debug_<branchname> which describes, in graphviz's dot format, the graggles contained in that branch. That file's a bit hard to read, since it doesn't directly tell you the actual contents of any line; in place of, for example, "to-do", it just has a giant hex string corresponding to pijul's internal identifiers for that line. To decode everything, you'll need to look at the terminal output of the pijul command above. Part of it should look like this:

```
DEBUG:libpijul::backend::dump: ============= dumping Contents
DEBUG:libpijul::backend::dump: > Key { patch: PatchId 0x0414005c0c2122ca, line: LineId(0x0200000000000000) } Value (0) { value: [Ok("")] }
DEBUG:libpijul::backend::dump: > Key { patch: PatchId 0x0414005c0c2122ca, line: LineId(0x0300000000000000) } Value (12) { value: [Ok("to-do\n")] }
```

By cross-referencing that output with the contents of debug_<branchname>, you can reconstruct pijul's internal graggles. Just this once, I've done it for you, and the result is exactly as it should be:

[Figure: the merged graggle — "to-do" at the top, the "* garbage" and "* shoes" lines unordered in the middle, "* work" at the bottom.]
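To give a feel for what such a dump amounts to, here is a sketch (my own illustration, not pijul's actual dump code) of emitting a graggle as graphviz dot, with line contents substituted for pijul's internal identifiers:

```rust
// Emit a graggle (lines plus ordering edges) in graphviz dot format.
fn to_dot(lines: &[&str], edges: &[(usize, usize)]) -> String {
    let mut dot = String::from("digraph graggle {\n");
    for (i, line) in lines.iter().enumerate() {
        // {:?} quotes the string, which is valid dot label syntax.
        dot.push_str(&format!("  n{} [label={:?}];\n", i, line));
    }
    for (src, dest) in edges {
        dot.push_str(&format!("  n{} -> n{};\n", src, dest));
    }
    dot.push_str("}\n");
    dot
}
```

For the merged example above, something like to_dot(&["to-do", "* garbage", "* shoes", "* work"], &[(0, 1), (0, 2), (1, 3), (2, 3)]) would draw the two unordered lines side by side between "to-do" and "* work".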
What should I do with a conflict?

Since pijul will happily work with graggles internally, you could in principle ignore a conflict and work on other things. That's probably a bad idea for several reasons (for starters, there are no good tools for working with graggles, and their presence will probably break your build). So here's my unsolicited opinion: when you have a conflict, you should resolve it ASAP. In the example above, all we need to do is remove the >>> and <<< lines and then record the changes:

```
$ sed -i '3D;5D' todo.txt
$ pijul record -a -m resolve
```

To back up my recommendation for immediate flattening, I'll give an example where pijul's graggle-to-file rendering is lossy. Here are two different graggles:

[Figure: two different graggles involving "shop" and "home" lines.]

But pijul renders both in the same way:

[Figure: the single file that pijul renders for both graggles.]

This is a perfectly good representation of the graggle on the right, but it loses information from the one on the left (such as the fact that both "home" lines are the same, and the fact that "shop" and "home" don't have a prescribed order). The good news here is that as long as your graggle came from merging two files, pijul's rendering is lossless. That means you can avoid the problem by flattening your graggles to files after every merge (i.e., by resolving your merge conflicts immediately). Like cockroaches, graggles are important for the ecosystem as a whole, but you should still flatten them as soon as they appear.

Case study 1: reverting an old commit

It's (unfortunately) common to discover that an old commit introduced a show-stopper bug. On the bright side, every VCS worth its salt has some way of undoing the problematic commit without throwing away everything else you've written since then. But if the problematic commit predates a merge conflict, undoing it can be painful.

As an illustration of what pijul brings to the table, we'll look at a situation where pijul's conflict-avoidance saves the day (at least, compared to git; darcs also does ok here). We'll start with the example merge from before, including our manual graggle resolution. Then we'll ask pijul to revert the "shoes" patch:

```
$ pijul unrecord --patch=<hash-of-shoes-patch>
$ pijul revert
```

The result?
We didn't have any conflicts while reverting the old patch, and the final file is exactly what we expected:

[Figure: todo.txt containing just the to-do, garbage, and work lines.]

Let's try the same thing with git:

```
$ git init
# Create the initial file and record it.
$ cat > todo.txt << EOF
> to-do
> * work
> EOF
$ git add todo.txt
$ git commit -a -m todo
# Switch to a new branch and add the shoes line.
$ git checkout -b shoes
$ sed -i '2i* shoes' todo.txt
$ git commit -a -m shoes
# Switch to a third branch and add the garbage line.
$ git checkout -b garbage master
$ sed -i '2i* garbage' todo.txt
$ git commit -a -m garbage
# Now merge in the "shoes" change to the "garbage" branch.
$ git merge shoes
Auto-merging todo.txt
CONFLICT (content): Merge conflict in todo.txt
Automatic merge failed; fix conflicts and then commit the result.
```

That was expected: there's a conflict, so we have to resolve it. So I edited todo.txt and manually resolved the conflict. Then,

```
# Commit the manual resolution.
$ git commit -a -m merge
# Try to revert the shoes patch.
$ git revert <hash-of-shoes-patch>
error: could not revert 4dcf1ae... shoes
hint: after resolving the conflicts, mark the corrected paths
hint: with 'git add <paths>' or 'git rm <paths>'
hint: and commit the result with 'git commit'
```

Since git can't "see through" my manual merge resolution, it can't handle reverting the patch by itself. I have to manually resolve the conflicting patches both when applying and reverting.

I won't bore you with long command listings for other VCSes, but you can test them out yourself! I've tried mercurial (which does about the same as git in this example) and darcs (which does about the same as pijul in this example).

A little warning about pijul unrecord

I'm doing my best to present roughly equivalent command sequences for pijul and git, but there's something important you should know about the difference between pijul unrecord and git revert: pijul unrecord modifies the history of the repository, as though the unrecorded patch never existed. In this way, pijul unrecord is a bit like a selective version of git reset. This is probably not the functionality that you want, especially if you're working on a public repository. Pijul actually does have the internal capability to do something closer to git revert (i.e., undo a patch while keeping it in the history), but it isn't yet user-accessible.

Sets of patches

The time has come again to throw around some fancy math words. First, associativity. As you might remember, a binary operator (call it +) is associative if (x + y) + z = x + (y + z) for any x, y, and z. The great thing about associative operators is that you never need parentheses: you can just write x + y + z and there's no ambiguity. Associativity automatically extends to more than three things: there's also no ambiguity with w + x + y + z.

The previous paragraph is relevant to patches because perfect merging is associative, in the following sense: if I have multiple patches (let's say three, to keep the diagrams manageable), then there's a unique way to perfectly merge them all together. That three-way merge can be written as combinations of two-way merges in multiple different ways, but every way that I write it gives the same result.
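In symbols (the notation here is mine, not the post's): writing $p \vee q$ for the result of perfectly merging patches $p$ and $q$ out of a common file — that is, the combined patch from the original file to the merged file — the claim is that

$$ (p \vee q) \vee r \;=\; p \vee (q \vee r), $$

so the perfect merge of any finite collection of patches is well defined, no matter how the two-way merges are parenthesized.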
Let's have some pictures. Here are my three patches:

And here's one way I could merge them all together: first, merge patches p and q:

Then, merge patches pm (remember, that's the patch I get from applying p and then m, which in the diagram above is the same as qn) and r:

Another way would be to first merge q and r, and then merge p into the result:

Yet a third way would be to merge p and q, then merge q and r, and finally merge the results of those merges. This one gives a nice, symmetric picture:

The great thing about our mathematical foundation from the previous post is that all these merges produce the same result. And I don't just mean that they give the same final file: they also result in the same patches, meaning that everyone will always agree on which lines in the final file came from where. There isn't even anything special about the initial configuration (three patches coming out of a single file). I could start with an arbitrarily complex history, and there would be an unambiguous way to merge together all of the patches that it contains. In this sense, we can say that the current state of a pijul branch is determined by a set of patches; this is in contrast to most existing VCSes, where the order in which patches are merged also matters.

Reordering and antiquing patches

One of the things you might have heard about pijul is that it can reorder patches (i.e., that they are commutative). This is not 100% accurate, and it might also be a bit confusing if you paid attention in my last post. That's because a patch, according to the definition I gave before, includes its input file. So if you have a patch p that turns file A into file B and a patch q that turns file B into file C, then it makes sense to apply p and then q, but not the other way around. It turns out that pijul has a nice trick up its sleeve, which allows you to reorder patches as long as they don't "depend" on each other (and I'll explain precisely what that means).

The key idea behind reordering patches is something I call "antiquing." Consider the following sequenced patches:

According to how we defined patches, the second patch (let's call it the garbage patch) has to be applied after the first one (the shoes patch). On the other hand, it's pretty obvious just by staring at them that the garbage patch doesn't depend on the shoes patch. In particular, the following parallel patches convey exactly the same information, without the dependencies:

How do I know for sure that they convey the same information? Because if we take the perfect merge of the diagram above, then we get back the original sequenced diagram by following the top path in the merge!

This example motivates the following definition: given a pair of patches p and q in sequence:

we say that q can be antiqued if there exists some patch a(q) starting at O such that the perfect merge between p and a(q) involves q:

In a case like this, we can just forget about q entirely, since a(q) carries the same information. I call it antiquing because it's like making q look older than it really is.

One great thing about the "sets of patches" idea above is that it lets us easily generalize antiquing from pairs of patches to arbitrarily complicated histories. I'll skip the details, but the idea is that you keep antiquing a patch – moving it back and back in the history – until you can't any more. The fact that perfect merges are associative implies, as it turns out, that every patch has a unique "most antique" version. The set of patches leading into the most antique version of q are called q's dependencies.
For example, here is a pair of patches where the second one cannot be antiqued (as an exercise, try to explain why not):

Since the second patch can't be made any more antique, the first patch above is a dependency of the second one. In my next post, I'll come back to antiquing (and specifically, the question of how to efficiently find the most antique version of a patch).

I promised to talk about reordering patches, so why did I spend paragraphs going on about antiques? The point is that (again, because of the associative property of perfect merges) patches in "parallel" can be applied in any order. The point of antiquing is to make patches as parallel as possible, so that we can be maximally flexible about ordering them.

That last bit is important, so it's worth saying again (and with a picture): patches in sequence

cannot be re-ordered; the same information represented in parallel using an antique of q

is much more flexible.

Case study 2: parallel development

Since I've gone on for so long about reordering patches, let's have an example showing what it's good for. Let me start with some good news: you don't need to know about antiquing to use pijul, because pijul does it all for you: whenever pijul records a patch, it automatically records the most antique version of that patch. All you'll notice is the extra flexibility it brings.

We'll simulate (a toy example of) a common scenario: you're maintaining a long-running branch of a project that's under active development (maybe you're working on a large experimental feature). Occasionally, you need to exchange some changes with the master branch. Finally (maybe your experimental feature was a huge success) you want to merge everything back into master.

Specifically, we're going to do the following experiment in both pijul and git. The master branch will evolve in the following sequence:

On our private branch, we'll begin from the same initial file. We'll start by applying the urgent fix from the master branch (it fixed a critical bug, so we can't wait):

Then we'll get to implementing our fancy experimental features:

I'll leave out the (long) command listings needed to implement the steps above in pijul and git, but let me mention the one step that we didn't cover before: in order to apply the urgent fix from master, we say

$ pijul changes --branch master # Look for the patch you want
$ pijul apply <hash-of-the-patch>

In git, of course, we'll use cherry-pick.

Now for the results. In pijul, merging our branch with the master branch gives no surprises:

In git, we get a conflict:

There's something else a bit funny with git's behavior here: if we resolve the conflict and look at the history, there are two copies of the urgent fix, with two different hashes. Since git doesn't understand patch reordering like pijul does, git cherry-pick and pijul apply work in slightly different ways: pijul apply just adds another patch into your set of patches, while git cherry-pick actually creates a new patch that looks a bit like the original. From then on, git sees the original patch and its cherry-picked copy as two different patches, which (as we've seen) creates problems for merging down the line. And it gets worse: reverting one of the copies of the urgent fix (try it!) gives pretty strange results.

By playing around with this example, you can get git to do some slightly surprising things. (For example, by inserting an extra merge in the right place, you can get the conflict to go away. That's because git has a heuristic where if it sees two different patches doing the same thing, it suppresses the conflict.)
Pijul, on the other hand, understood that the urgent fix could be incorporated into my private branch with no lossy modifications. That's because pijul silently antiqued the urgent fix, so that the divergence between the master branch and my own branch became irrelevant.

Conclusion

So hopefully you have some idea now of what pijul can and can't do for you. It's an actively developed implementation of an exciting (for me, at least) new way of looking at patches and merges, and it has a simple, fast, and totally lossless merge algorithm with nice properties.

Will it dethrone git? Certainly not yet. For a start, it's still alpha-quality and under heavy development; not only should you be worried about your data, it has several UI warts as well. Looking toward the future, I can see reasonable arguments in both directions.

Arguing against pijul's future world domination, you could question the relevance of the examples I've shown. How often do you really end up tripping on git's little corner cases? Would the time saved by pijul's improvements actually justify the cost of switching? Those are totally reasonable questions, and I don't know the answer.

But here's a more optimistic point of view: pijul's effortless merging and reordering might really lead to new and productive workflows. Are you old enough to remember when git was new and most people were still on SVN (or even CVS)? Lots of people were (quite reasonably) skeptical. "Who cares about easy branching? It's better to merge changes immediately anyway." Or, "who cares about distributed repositories? We have a central server, so we may as well use it." Those arguments sound silly now that we're all used to DVCSes and the workflow improvements that they bring, but it took time and experimentation to develop those workflows, and the gains weren't always obvious beforehand. Could the same progression happen with pijul?

In the next post, I'll take a look at pijul's innards, focussing particularly on how it represents your precious data.

Acknowledgement

I'd like to thank Pierre-Étienne Meunier for his comments and corrections on a draft of this post. Of course, any errors that remain are my own responsibility.
# Document Title
A new version

Part 1: Merging and patches
May 08, 2017

A recent paper suggested a new mathematical point of view on version control. I first found out about it from pijul, a new version control system (VCS) that is loosely inspired by that paper. But if you poke around the pijul home page, you won't find many details about what makes it different from existing VCSes. So I did a bit of digging, and this series of blog posts is the result.

In the first part (i.e. this one), I'll go over some of the theory developed in the paper. In particular, I'll describe a way to think about patches and merging that is guaranteed to never, ever have a merge conflict. In the second part, I'll show how pijul puts that theory into action, and in the third part I'll dig into pijul's implementation.

Before getting into some patch theory, a quick caveat: any real VCS needs to deal with a lot of tedious details (directories, binary files, file renaming, etc.). In order to get straight to the interesting new ideas, I'll be skipping all that. For the purposes of these posts, a VCS only needs to keep track of a single file, which you should think of as a list of lines.

Patches

A patch is the difference between two files. Later in this series we'll be looking at some wild new ideas, so let's start with something familiar and comforting. The kind of patches we'll discuss here go back to the early days of Unix:

a patch works line-by-line (as opposed to, for example, word-by-word); and
a patch can add new lines, but not modify existing lines.

In order to actually have a useful VCS, you need to be able to delete lines also. But deleting lines turns out to add some complications, so we'll deal with them later.

For an example, let's start with a simple file: my to-do list for this morning.

Looking back at the list, I realize that I forgot something important. Here's the new one:

To go from the original to-do list to the new one, I added the line with the socks. In the format of the original Unix "diff" utility, the patch would look like this:

The "1a2" line is a code saying that we're going to add something after line 1 of the input file, and the next bit is obviously telling us what to insert.

Since this blog isn't a command line tool, we'll represent patches with pretty diagrams instead of flat files. Here's how we'll draw the patch above:

Hopefully it's self-explanatory, but just in case: an arrow goes from left to right to indicate that the line on the right is the same as the one on the left. Lines on the right with no arrow coming in are the ones that got added. Since patches aren't allowed to re-order the lines, the lines are guaranteed not to cross.

There's something implicit in our notation that really needs to be said out loud: for us, a patch is tied to a specific input file. This is the first point where we diverge from the classic Unix ways: the classic Unix patch that we produced using "diff" could in principle be applied to any input file, and it would still insert "* put on socks" after the first line. In many cases that wouldn't be what you want, but sometimes it is.

Merging

The best thing about patches is that they can enable multiple people to edit the same file and then merge their changes afterwards. Let's suppose that my wife also decides to put things on my to-do list: she takes the original file and adds a line:

Now there are two new versions of my to-do list: mine with the socks, and my wife's with the garbage.
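To make the "a patch is tied to a specific input file" idea concrete, here's a minimal sketch in Rust (my own toy encoding, not pijul's actual data structures): every output line either points back at an input line, in order, or is newly inserted.

/// A toy, add-only patch in the spirit of the post: it is tied to a
/// specific input file, and every output line is either copied from
/// the input (preserving order) or newly inserted. A real version
/// would also insist that the FromInput indices appear in increasing
/// order, since lines aren't allowed to cross.
#[derive(Clone, Debug, PartialEq)]
enum OutLine {
    FromInput(usize), // index into the input file's lines
    Inserted(String), // a brand-new line
}

#[derive(Clone, Debug)]
struct Patch {
    input: Vec<String>,   // the file this patch applies to
    output: Vec<OutLine>, // the file it produces
}

impl Patch {
    /// Apply the patch, producing the new file. Panics if used on a
    /// file other than the one it was made for.
    fn apply(&self, file: &[String]) -> Vec<String> {
        assert_eq!(file, &self.input[..], "patch is tied to its input file");
        self.output
            .iter()
            .map(|l| match l {
                OutLine::FromInput(i) => self.input[*i].clone(),
                OutLine::Inserted(s) => s.clone(),
            })
            .collect()
    }
}

fn main() {
    let todo = vec!["to-do".to_string(), "* work".to_string()];
    // The "socks" patch: insert one line after line 1, like "1a2".
    let socks = Patch {
        input: todo.clone(),
        output: vec![
            OutLine::FromInput(0),
            OutLine::Inserted("* put on socks".to_string()),
            OutLine::FromInput(1),
        ],
    };
    assert_eq!(socks.apply(&todo)[1], "* put on socks");
}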
Let's draw them all together:

This brings us to merging: since I'd prefer to have my to-do list as a single file, I want to merge my wife's changes and my own. In this example, it's pretty obvious what the result should be, but let's look at the general problem of merging. We'll do this slowly and carefully, and our endpoint might be different from what you're used to.

Patch composition

First, I need to introduce some notation for an obvious concept: the composition of two patches is the patch that you would get by applying one patch and then applying the other. Since a "patch" for us also includes the original file, you can't just compose any two old patches. If p is a patch taking the file O to the file A and r is a patch taking A to B, then you can compose the two (but only in one order!) to obtain a patch from O to B. I'll write this composition as pr: first apply p, then r.

It's pretty easy to visualize patch composition using our diagrams: to compute the composition of two patches, just "follow the arrows" to get the (dotted red) patch going from O to B.

Merging as composition

I'm going to define carefully what a merge is in terms of patch composition. I'll do this in a very math-professor kind of way: I'll give a precise definition, followed by some examples, and only afterwards will I explain why the definition makes sense. So here's the definition: if p and q are two different patches taking the file O to the files A and B respectively, a merge of p and q is a pair of patches r and s such that

r and s take A and B respectively to a common output file M, and
pr = qs.

We can illustrate this definition with a simple diagram, where the capital letters denote files, and the lower-case letters are patches going between them:

Instead of saying that pr = qs, a mathematician (or anyone who wants to sound fancy) would say that the diagram above commutes.

Here is an example of a merge:

And here is an example of something that is not a merge:

This is not a merge because it fails the condition pr = qs: composing the patches along the top path gives

but composing them along the bottom path gives

Specifically, the two patches disagree on which of the shoes in the final list came from the original file. This is the real meaning underlying the condition pr = qs: it means that there will never be any ambiguity about which lines came from where. If you're used to using blame or annotate commands with your favorite VCS, you can probably imagine why this sort of ambiguity would be bad.

A historical note

Merging patches is an old idea, of course, and so I just want to briefly explain how the presentation above differs from "traditional" merging: traditionally, merging was defined by algorithms (of which there are many). These algorithms would try to automatically find a good merge; if they couldn't, you would be asked to supply one instead.

We'll take a different approach: instead of starting with an algorithm, we'll start with a list of properties that we want a good merge to satisfy. At the end, we'll find that there's a unique merge that satisfies all these properties (and fortunately for us, there will also be an efficient algorithm to find it).

Merges aren't unique

The main problem with merges is that they aren't unique. This isn't a huge problem by itself: lots of great things aren't unique. The problem is that we usually want to merge automatically, and an automatic system needs an unambiguous answer. Eventually, we'll deal with this by defining a special class of merges (called perfect merges) which will be unique.
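Continuing the toy Rust encoding from the sketch above (again, my own illustration, not pijul's API; the Patch and OutLine types are assumed in scope), both composition and the "is this a merge?" condition are mechanical to check:

impl Patch {
    /// Compose `self` (O -> A) with `other` (A -> B), giving O -> B.
    /// Only defined in this order: `other`'s input must be `self`'s output.
    fn compose(&self, other: &Patch) -> Patch {
        let mid: Vec<String> = self.apply(&self.input);
        assert_eq!(mid, other.input, "composition only works in one order");
        let output = other
            .output
            .iter()
            .map(|l| match l {
                // A line copied from the middle file traces back through
                // `self` to either an original line or an inserted one.
                OutLine::FromInput(i) => self.output[*i].clone(),
                OutLine::Inserted(s) => OutLine::Inserted(s.clone()),
            })
            .collect();
        Patch { input: self.input.clone(), output }
    }
}

/// The merge condition from the post: (r, s) merges (p, q) iff
/// pr = qs, i.e. the diagram commutes, so both sides agree about
/// which output lines came from where.
fn is_merge(p: &Patch, q: &Patch, r: &Patch, s: &Patch) -> bool {
    let pr = p.compose(r);
    let qs = q.compose(s);
    pr.input == qs.input && pr.output == qs.output
}

Note that is_merge compares the provenance of every output line, not just the rendered text: that is exactly what rules out the "not a merge" example, where the two sides disagree about which of the shoes came from the original file.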
Before that, we'll explore the problem with some examples.

A silly example

Let's start with a silly example, in which our merge tool decides to add some extra nonsense:

No sane merge tool would ever do that, of course, but it's still a valid merge according to our rule in the last section. Clearly, we'll have to tighten up the rules to exclude this case.

A serious example

Here is a more difficult situation with two merges that are actually reasonable:

Both of these merges are valid according to our rules above, but you need to actually know what the lines mean in order to decide that the first merge is better (especially if it's raining outside). Any reasonable automatic merging tool would refuse to choose, instead requiring its user to do the merge manually.

The examples above are pretty simple, but how would you decide in general whether a merge is unambiguous and can be performed automatically? In existing tools, the details depend on the merging algorithm. Since we started off with a non-algorithmic approach, let's see where that leads: instead of specifying explicitly which merges we can do, we'll describe the properties that an ideal merge should have.

Perfect merges

The main idea behind the definition I'm about to give is that it will never cause any regrets. That is, no matter what happens in the future, we can always represent the history just as well through the merge as we could using the original branches. Obviously, that's a nice property to have; personally, I think it's non-obvious why it's a good choice as the defining property of the ideal merge, but we'll get to that later.

Ok, here it comes. Consider a merge:

And now suppose that the original creators of patches p and q continued working on their own personal branches, which merged sometime in the future at the file F:

We say that the merge (r, s) is a perfect merge if for every possible choice of the merge (u, v), there is a unique patch w so that u = rw and v = sw. (In math terms, the diagram commutes.) We're going to call w a continuation, since it tells us how to continue working from the merged file. To repeat: a merge is perfect if for every possible future, there is a unique continuation.

A perfect merge

Let's do a few examples to explore the various corners of our definition. First, an example of a perfect merge:

It takes a bit of effort to actually prove that this is a perfect merge; I'll leave that as an exercise. It's more interesting to see some examples that fail to be perfect.

A silly example

Let's start with the silly example of a merge that introduced an unnecessary line:

This turns out (surprise, surprise) not to be a perfect merge. To understand how our definition of merge perfection excludes merges like this, here is an example of a possible future without a continuation:

Since our patches can't delete lines, there's no way to get from merged to future.

A serious example

Here's another example, the case where there is an ambiguity in the order of two lines in the merged file:

This one fails to be a perfect merge because there is a future with no valid continuation: imagine that my wife and I manually created the desired merge.

Now what patch (call it w) could be put between merged and future to make everything commute? The only possibility is

which isn't a legal patch because patches aren't allowed to swap lines.

Terminological remarks

If you've been casually reading about pijul, you might have encountered the word "pushout." It turns out that the pattern we used for defining a perfect merge is very common in math. Specifically, in category theory, suppose you have the following diagram (in which capital letters are objects and lowercase letters are morphisms):

If for every u and v there is a unique w such that the diagram commutes, then (r, s) is said to be the pushout of (p, q). In other words, what we called a "perfect merge" above could also be called a "pushout in the category with files as objects and patches as morphisms."
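Spelled out in the usual notation (this is the standard category-theory definition, written with the post's left-to-right convention for composition, so $pr$ means "$p$ then $r$"):

$$(r, s) \text{ is a pushout of } (p, q) \iff pr = qs \ \text{ and } \ \forall (u, v) \text{ with } pu = qv,\ \exists!\, w \text{ such that } u = rw \text{ and } v = sw.$$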
For most of this article, we'll ignore the general math terminology in favor of language that's more intuitive and specific to files and patches.

Conflicts and graggles

The main problem with perfect merges is that they don't always exist. In fact, we already saw an example:

The pair of patches above has no perfect merge. We haven't actually proved it, but intuitively it's pretty clear, and we also discussed earlier why one potential merge fails to be perfect. Ok, so not every pair of patches can be merged perfectly. You probably knew that already, since that's where merge conflicts come from: the VCS doesn't know how to merge patches on its own, so you need to manually resolve some conflicts.

Now we come to the coolest part of the paper: a totally different idea for dealing with merge conflicts. The critical part is that instead of making do with an imperfect merge, we enlarge the set of objects that the merge can produce. That is, not every pair of patches can be perfectly merged to a file, but maybe they can be merged to something else. This idea is extremely common in math, and there's even some general abstract nonsense showing that it can always be done: there's an abstract way to generalize files so that every pair of patches of generalized files can be perfectly merged. The miraculous part here is that in this particular case, the abstract nonsense condenses into something completely explicit and manageable.

Graggles

A file is an ordered list of lines. A graggle¹ (a mixture of "graph" and "file") is a directed graph of lines. (Yes, I know it's a terrible name, but it's better than "object in the free finite cocompletion of the category of files and patches," which is what the paper calls it.) In other words, whereas a file insists on having its lines in a strict linear order, a graggle allows them to be any directed graph. It's pretty easy to see how relaxing the strict ordering of lines solves our earlier merging issues. For example, here's a perfect merge of the sort that caused us problems before:

In retrospect, this is a pretty obvious solution: if we don't know what order shoes and garbage should go in, we should just produce an output that doesn't specify the order. What's a bit less obvious (but is proved in the paper) is that when we work in the world of graggles instead of the world of files, every pair of patches has a unique perfect merge. What's even cooler is that the perfect merge is easy to compute. I'll describe it in a second, but first I have to say how patches generalize to graggles.

A patch between two graggles (say, A and B) is a function (call it p) from the lines of A to the lines of B that respects the partial order, in the sense that if there is a path from x to y in A then there is a path from p(x) to p(y) in B. (This condition is an extension of the fact that a patch between two files isn't allowed to change the order.) Here's an example:

The perfect merge

And now for the merge algorithm: let's say we have a patch p going from the graggle A to the graggle B and another patch q going from A to C. To compute the perfect merge of p and q,

1. write down the graggles B and C next to each other, and then
2. whenever a line in B and a line in C share a "parent" in A, collapse them into a single line.

That's it: two steps. Here's the algorithm at work on our previous example: we want to merge these two patches:

So first, we write down the two to-be-merged files next to each other:

For the second step, we see that both of the "to-do" lines came from the same line in the original file, so we combine those two into one. After doing the same to the "work" lines, we get the desired output:
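Here is that two-step algorithm as a small Rust sketch (my own toy encoding of graggles; pijul's real representation is different). Each line records which line of the common ancestor A it came from, which is exactly the information the collapse step needs:

use std::collections::HashMap;

/// A toy graggle: a directed graph of lines. `parent_in_a[i]` records
/// which line of the common ancestor A the i-th line came from
/// (None for lines added by the patch). `edges` are the ordering arrows.
#[derive(Clone, Debug)]
struct Graggle {
    lines: Vec<String>,
    parent_in_a: Vec<Option<usize>>,
    edges: Vec<(usize, usize)>, // (from, to) indices into `lines`
}

/// The two-step perfect merge from the post: write B and C side by
/// side, then collapse any pair of lines that share a parent in A.
fn perfect_merge(b: &Graggle, c: &Graggle) -> Graggle {
    let mut lines = Vec::new();
    let mut parent_in_a = Vec::new();
    // Step 1: write down B, remembering where each of its lines went.
    let b_index: Vec<usize> = (0..b.lines.len())
        .map(|i| {
            lines.push(b.lines[i].clone());
            parent_in_a.push(b.parent_in_a[i]);
            lines.len() - 1
        })
        .collect();
    // Map from "parent line in A" to its already-placed merged line.
    let by_parent: HashMap<usize, usize> = b_index
        .iter()
        .enumerate()
        .filter_map(|(i, &m)| b.parent_in_a[i].map(|p| (p, m)))
        .collect();
    // Step 2: write down C, collapsing lines that share a parent in A.
    let c_index: Vec<usize> = (0..c.lines.len())
        .map(|i| match c.parent_in_a[i].and_then(|p| by_parent.get(&p)) {
            Some(&m) => m, // collapse onto B's copy
            None => {
                lines.push(c.lines[i].clone());
                parent_in_a.push(c.parent_in_a[i]);
                lines.len() - 1
            }
        })
        .collect();
    // Carry over both edge sets, re-indexed into the merged graggle.
    let mut edges: Vec<(usize, usize)> =
        b.edges.iter().map(|&(x, y)| (b_index[x], b_index[y])).collect();
    edges.extend(c.edges.iter().map(|&(x, y)| (c_index[x], c_index[y])));
    edges.sort();
    edges.dedup(); // collapsed lines can produce duplicate arrows
    Graggle { lines, parent_in_a, edges }
}

fn main() {
    // B: the list with "shoes" added; C: the same list with "garbage" added.
    let b = Graggle {
        lines: vec!["to-do".into(), "* shoes".into(), "* work".into()],
        parent_in_a: vec![Some(0), None, Some(1)],
        edges: vec![(0, 1), (1, 2)],
    };
    let c = Graggle {
        lines: vec!["to-do".into(), "* garbage".into(), "* work".into()],
        parent_in_a: vec![Some(0), None, Some(1)],
        edges: vec![(0, 1), (1, 2)],
    };
    let m = perfect_merge(&b, &c);
    assert_eq!(m.lines.len(), 4); // "to-do" and "* work" were collapsed
}

The ghost-line extension described below fits into this sketch naturally: give each line a ghost flag, and when two lines collapse, make the merged line a ghost if either side's line was one.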
Working with graggles

By generalizing files to graggles, we got a very nice benefit: every pair of patches has a (unique) perfect merge, and we can compute it easily. But there's an obvious flaw: all the tools that we use (editors, compilers, etc.) work on files, not graggles. This is where the paper stops providing guidance, but there is an easy solution: whenever a merge results in something that isn't a file, just make a new patch that turns it into a file. We'll call this flattening, and here's an example:

That looks like a merge conflict!

If your eyes haven't glazed over by now (sorry, it's been a long post), you might be feeling a bit cheated: I promised you a new framework that avoids the pitfalls of manual merge resolution, but flattening looks an awful lot like manual merge resolution. I'll answer this criticism in more detail in the next post, where I demonstrate the pijul tool and how it differs from git. But here's a little teaser: the difference between flattening and manual merge resolution is that flattening is completely transparent to the VCS: it's just a patch like any other. That means we can do fun things, like re-ordering or reverting patches, even in the presence of conflicting merges. More on that in the next post.

Deleting lines

It's time to finally address something I put off way back at the beginning of the post: the system I described was based on patches that can't delete lines, and we obviously need to allow deletions in any practical system. Unfortunately, the paper doesn't help here: it claims that you can incorporate deletion into the system I described without really changing anything, but there's a bug in the paper. Specifically, if you tweak the definitions to allow deletion then the category of graggles turns out not to be closed under pushouts any more. Here's an example where the merge algorithm in the paper turns out not to be perfect:

(Since this post has dragged on long enough, I'll leave it as an exercise to figure out what the problem is.)

Ghost lines

Fortunately, there's a trick to emulate line deletion in our original patch system. I got this idea from pijul, but I'll present it in a slightly different way. The idea is to allow "ghost" lines instead of actually deleting them. That is, we mark every line in our graggle as either "live" or "ghost." Then we add one extra rule to our patches: a live line can turn into a ghost line, but not the other way around. We'll draw ghost lines in gray, and arrows pointing to ghost lines will be dashed. Here's a patch that deletes the "shoes" line.

The last remaining piece is to extend the perfect merge algorithm to cover our new graggles with ghost lines.
This turns out to be easy; here's the new algorithm:

1. Write down side-by-side the two graggles to be merged.
2. For every pair of lines with a common parent, "collapse" them into a single line, *and if one of them was a ghost, make the collapsed line a ghost*.

The bit in italics is the only new part, and it barely adds any extra complexity.

Conclusion

I showed you (in great detail) a mathy way of thinking about patches in a VCS, although I haven't shown a whole lot of motivation for it yet. At the very least, though, next time someone starts droning on about "patch theory," you'll have some idea what they're talking about.

In the next post, I'll talk about pijul, a VCS that is loosely based around the algorithms I described in this post. There you'll get to see some (toy) examples where pijul's solid mathematical underpinnings help it to avoid corner cases that trip up some more established VCSes.

Acknowledgement

I'd like to thank Pierre-Étienne Meunier for his comments and corrections on a draft of this post. Of course, any errors that remain are my own responsibility.

1: An earlier version of this post called them "digles" (for directed graph file), but a couple years later I decided that "graggles" sounds a bit better. Plus, if you mispronounce it a little, it fits in with pijul's whole bird theme.
# Document Title

All right, thanks Dean, thanks everyone. This is my first software conference ever, so I'm a little impressed. I'm going to talk about the project I've been working on for about a year now. It's called Pijul, from the name of a Spanish bird. We're trying to do sane version control, and this talk is mostly about what "sane" means and what version control should be like.

It all started because I was working with Florent Becker, my co-author on this project, and we were trying to convince our co-authors on an academic paper to use version control instead of just emails and reserving files for a week at a time. Florent happens to be one of the core Darcs developers, which is a rather oldish version control system, so we couldn't really convince him to use anything else. So we tried to convince our co-authors to start installing Darcs on Windows, and then setting up SSH keys and pushing patches and pulling stuff, and that didn't really work out so well. Also, it turns out Darcs has performance problems and Git has simplicity problems, so it wasn't really easy to do anything about this.

And that's pretty much the situation with version control today: most people, most non-hackers, don't even use it. Even among hackers it's not used as widely as you'd think; most people use extremely basic version control. Even world experts in computer science still use file locking and emails, and that makes us lose a significant amount of data and/or time. The situation with programmers isn't much better, because as soon as distributed version control systems were invented, they were so complicated and hard to use that businesses even started to resent them, and used them in a centralized way in order to master the beast.

But functional programming is finally convincing people that they can really tackle all the fields of computer science that were previously regarded as elite fields, just like Ashley showed us for operating systems. We're trying to replace C++ with Rust; recently JavaScript has been getting replaced by Elm; package managers and Linux distributions are getting replaced by Nix (not widely accepted so far). And this talk is about adding another functional tool to that collection. It's called Pijul. We set out to replace Git; maybe that goal is as ambitious as trying to replace C++, or maybe more, but anyway.

Before starting, I just want to give you some basic concepts about what I think functional programming is. Most people have their own definition; feel free to agree or not with this one. What I like in functional languages like Rust is that there are static types, so we can reason about code easily. There's fine control of immutability: we're not mutating some state all the time, we're thinking about transforms and functions. Most operations are atomic, which is what most people assume when they use any software at all, and it's something that's not really happening in most tools. And we also like memory safety: we don't want to corrupt our backend or our storage or our files; we don't want to lose data just because we're using types in the wrong way.

So how does that apply to version control? Why would it be beneficial? Well, because the way we use version control today is mostly about states.
We're talking about commits. If we use Git, we're talking about commits, and commits are stateful: a commit is basically a hash of the whole state of the repository. There may be something like a patch underneath, to make it easier to store, but that's just an optimization. When you commit something, you advance the head of the repository by one small notch, and the hash of the whole new state of the repository is the name of your new commit.

That's quite different from patches. Patches are all about transforms: when you apply a patch, you basically apply it to a file. That's what most people think Git is about when they first discover it, but then they realize that's not the case.

And this can have major benefits. For instance, in any system that uses three-way merge, such as Git, Mercurial, Subversion, CVS and most others, you don't have that really cool property that we call associativity in algebra. That property is the following. Say you have two people, Alice and Bob, who don't have really good communication, right, they're a couple. Alice writes the red patch here, and Bob first writes a blue patch and then a green patch. Depending on the order in which you merge things, depending on your merge strategy, you might get different results with a three-way merge. If the patches are exactly the same, and I merge the blue patch first and then the green patch, I might sometimes get a different result in Git than if I do it the other way. Like in the image at the top: if I merge the two branches I may get one result, and if I merge the exact same patches, and they are not conflicting, they're dealing with really different parts of the file, sometimes Bob's patches might get merged into parts of the file he has never seen. When you consider applying tools like that to highly sensitive code, such as cryptography libraries, it's kind of scary. But it's what everyone does.

What's even worse is that you can't actually tell when this hits you. If you're a Git user and this happened to you, there's no way to tell: Git says "yeah, I've merged, it's fine, it works," and you don't notice it. Maybe it even compiles. There are real-world examples of this; it's not just an implementation bug that can be worked around, it's a fundamental problem in the three-way merge algorithm.

All right, so we're not the first ones to believe this is wrong. There was another version control system I mentioned in my intro, called Darcs (it's still called Darcs, by the way). Their main principle is that you say two patches are fine with each other, don't conflict, if they commute, which simply means that you can apply them in either order. Well, they almost commute; it's not exactly commuting in the algebraic sense, because sometimes you have to change the line numbers. For instance, if Bob adds a line containing just "fn main" at line 1, and Alice adds a line at line 10 saying "println hello world", when you merge them, you might have to apply Alice's line after Bob's at line 11 instead of 10. But that's kind of commuting anyway.
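A rough sketch of that line-number dance in Rust (my own illustration, not Darcs's or pijul's actual code): two single-line insertions commute as long as we shift the later offset past the earlier one.

/// A toy patch that inserts one line at a (0-based) position.
#[derive(Clone, Debug, PartialEq)]
struct Insert {
    at: usize,
    line: String,
}

/// Commute p-then-q into q'-then-p', preserving the combined effect.
/// Offsets shift by one when the other patch lands above them.
fn commute(p: &Insert, q: &Insert) -> Option<(Insert, Insert)> {
    if q.at > p.at {
        // q was positioned after p's insertion; undo that shift.
        Some((Insert { at: q.at - 1, line: q.line.clone() },
              Insert { at: p.at, line: p.line.clone() }))
    } else if q.at < p.at {
        // q lands strictly above p, so p must shift down by one.
        Some((q.clone(), Insert { at: p.at + 1, line: p.line.clone() }))
    } else {
        None // same position: the order is genuinely ambiguous
    }
}

fn apply(file: &mut Vec<String>, p: &Insert) {
    file.insert(p.at, p.line.clone());
}

fn main() {
    // Bob inserts at line 1, Alice at line 10 (0-based: 0 and 9).
    let bob = Insert { at: 0, line: "fn main".into() };
    let alice = Insert { at: 9, line: "println hello world".into() };
    let mut a = vec![String::from("x"); 12];
    let mut b = a.clone();
    apply(&mut a, &bob);
    apply(&mut a, &alice); // Alice applied second: her offset is 9 here...
    let (alice2, bob2) = commute(&bob, &alice).unwrap();
    apply(&mut b, &alice2); // ...but only 8 when she goes first.
    apply(&mut b, &bob2);
    assert_eq!(a, b); // both orders give the same file
}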
whatthat's kind of commuting anyway sothat's our that's the situation withdarks when it commutes it's finewhat if it doesn't commute which happenssometimes well it's an under commandedpart the algorithm so no one reallyknows what it does flow home myco-author on people is one of the coredarks developers and he told me like Iwas pretty confident in darks beforethey told me at some point yeahwe have no absolutely no clue what itdoes it seems to work most of the timeit's like it's highlights conflicts butwe don't really know other than that sothat praises obviously issues aboutcorrectness so the situation with darksnot much much better than we were kidswell at least highlights complex andwarns the user about complexbut then what it does no one knowsanother problem is slightly biggerthat's the main reason why darks was inbut I've been done even though it camebefore before get is that it sometimesexponentially slow so exponentially slowin the size of history number of patchesin their position and so that somepeople to device the like merge overthat we can work through where you liketry to synchronize all the patches youmade during the week and instead ofsynchronizing like while you're workingon it you're just like waiting forFriday night to arrive start to mergelike booster patches and well hopefullyby Monday morning when you come back toofficewell the patches are merged butsometimes not so this is this is no thismakes it kind of like not reallyacceptable and this is also one of thereasons why we could not convince ourcolleagues to out to use it because theytried to use itnaively thought yeah well patches wellwe cannot understand what they are solet's try to push about bunch of patchesrepository and well they were surprisedto have to wait for like half an hourmerge so it's not usually like that likewhen we write we're developing peoplewere using darks to develop it and it'slike we've never had we've never everruns you that's brought that problem butthat's also because we're like we knowhow it works we know what the drawbacksare we know when the exponentially isslow merges happen so but that's notacceptable like that it's it's not likeI wouldn't recommend it to anyone andthat's where people comes in so we tryto after our day trying to convince ourcolleagues to use this we just like grabgrab a beer and started discussing aboutlike the here if patchy is what whichcould be like what what good way wouldbe like you have a cool patch algebra wecould use and that would be simple andeasy to use and that's where we'restarting to learning about categorytheory so we don't exactly startedlearning learning it back then becauseit's a kind of it's now it's not theeasiest theory on earth on earth butwell so whether it is what it isbasically it's a general theory oftransformationsof things so clearly theorists like tosee it as like the general theory ofeverything the there are numerousattempts to rewrite all mathematics in acategorical language and I think it'spretty successful except that no one canunderstand this so it's already wellwritten we have all mathematics in therebut no one knows what it does all rightso but it's pretty good for us thoughbecause it's a so we're trying to talkabout changes and files and this is atheory of changes of on things so oneparticularly cool concept so this youcan you see all like I I should likethese things in darks these arecommutative diagrams and categorytheories try like drawing them basicallyall day long if several colleaguesworking mates they have their whiteboardthrough 
that these diagrams sometimes in3d sometimes it like we are interleavedarrows and no one knows but what theyare doing because they're yeah they'retalking about transformations on thingsall right so in particular when verycool concepts of category theory is theconcept of push out so push have is adepreciated of two patches is like forin our particular case the push out oftwo patches would be a file such that nomatter what you do after these twopatches that yield the common Stateslike a common file so if at least in Bobwrite some stuff and then later they canfind some patches to agree in a commonfilewell the via star here is a we call itthe push out if you can reach the caryou can reach any common state state inthe future from that star that clearalright so anything you can do to reachcommon stage you can also reach it fromthe Prashad so that means the push outis kind of a minimal common state thatcan be like that you that you reallyneed to reach leg and it's really watchwhat we want in a version control systemwe really want push out we want minimumcommon states that any furtherdevelopments any further work in therepository can also be reached from thatcommon States that's what a mergereally is and it's it's it's it's so therun problem is that it's not the case atall categories have push outs have all -shouts like we see that like thetranslation of that in English is it'snot clear than any editing any - of anycouple of editing operations on a filewill always be merge about likesometimes we have conflicts we all knowlike most programmers know bits are inconflicts so it's not the case thattries have piles and patches have old -shouts but what's cool but categorytheory is that there's a solution youjust need to dig into them like prettybig books of like abstract diagrams thatI don't fit in but then you canultimately find something called thefree conservative code completion of acategory and that's a way toartificially add all push outs into thecategory so that means when you when youhave a category of files and passesbetween files the free conservative codecompletion is a construction that willautomatically give you a category like ageneralized generalization of files sothat's like that generalization willhave all push outs so in people if wetranslate it into like files and patchesfor instance if Bob adds the linecountry like saying just print a lineBob and at least at the lines in justprint a line at least when they mergethat in people what they get is just agraph of lines where the two lines areadded they're not comparable if not yetsaid what like how they should compareand how they they should relate to eachother in the file you may be at least iswrite anybody's right maybe you're bothright maybe they're both right but in adifferent orderyou don't mean oh that's a conflict andso what that theory or the categorytheory gives you here as ageneralization of files its which iswell in most cases a little more complexthat gets more the the benefits for athree lines file or maybe not obviousbut when files get really large it'sit's really great to have that soundtheory that backs you up and right sothat's that's what people is aboutthat's that's how it works so I want tosay before moving on that this is quitedifferent from see our LEDs sociology'sare conflict-free replicated data typesand people is more like conflicttolerant replicated data types so we'renot resolving all the conflicts all thetime we're just rather than that we'rejust like accepting conflicts as part ofthe system like the data 
structure canhandle like it can it can representconflicts doesn't have to resolve themall the time and so that's what makes itfast that's what makes people reallyfast because you can apply many patchesthey are conflicting but that's alrightand then once you are done applying abunch of patches you're you can justlike detect conflicts and I'll put thefiles and that's it so what what wouldthe situation be with theologies if wewere trying to apply your leans to thisproblem of like merging patches well intheory oddities you need to always orderso the way it gets real complex is byordering all operation determinedeterministically so it finds whateversolution like whatever deterministicmerge algorithm they can find to likejust order order things all theiroperations in an arbitrary butdeterministic order like for instance wecan have a rule same lease Elisa'spatches always can always come first andwhen there are two conflicting patchesfrom at least I'll just take like takethem in alphabetical order for instanceand so you know in our case that wouldlike lead to the following file like twolines one saying print a line at leastone same printer and Bob and the userwould barely see anything they would belike yeah that's that's it the merge hassucceeded but that doesn't that isn'treally right because that like it theuser should that be we at least bewarned that there's a conflict and theyshould do something about it all rightso the end result of that it's well thefury is a slightly more complicated inwhat I've explained but not much morethe result of that is that we developeda sound theory of fashionssound the giraffe patches it has thefollowingvery cool properties so bear with me fora moment what I'm right like sayingthese like bad words so they're the thefury is like the or algebra iscommutativeit means you can like if pet cheesedon't depend on each other you can applythem in any order so that's that's whatyou would expect from that cheese rightit's associative so we resolved theinitial the initial problem and startedthis talk with so you can you canbasically like no matter how you mergepatches if the merge is right likethere's only one solution to merge andyou're not you're like you're notgetting into trouble like by you'renever like do you're never doing whatget does which is like merging thingsfrom Bob in parts of the fight is neverseen that's what associativity is aboutand also one very cool property is thatall past is about semantic inverse whichmeans when you're when you've read in apatch you can you can derive in otherpatch permits that has the like theopposite effects and then like push itto other predators you cancel previouspatches the thing is you can since youcan it's it's all commits allcommutative so you can basically computean inverse for pets you boost like 1,000patches ago and it all just works and itit does even better than just workingit's also pretty fast so merging andapplying patches it's basically the sameoperation in our systemwell it's possibly the last complicatedslide I've written like there I'll moveon to like easier stuff after that soour complexity for people we knowcomplexity here our complexity is likelinear inside of the patch and the guywith make in the size of history so it'slike it doesn't like you can have anarbitrarily big history it's not itdoesn't really matter like you can youcan apply the new patch your new patch Pafter an arbitrary alerts arbitrarilylarge number of patches it doesn'tmatter it always works the same likealmost and this is this is actually 
thatwas surprising to us but it's actuallybetter than three-way merge which alsoadds in a like square factor of the sizeof the filethat's actually really also observedthat in real world cases where we'retrying to benchmark like in early stagesof our development we were trying toburnish mark it against like really fastcompetitors such as yet and as file sizeincrease we not we actually noticed thedifference in performance to the pointthat people who was actually faster andget when merging really large patches onreally large price which I'm not reallysure is a real-world case but anyway andso that brings me to the last part ofthis talk so why what made Russ a cooltool to work with what did we like andrust and how it helped us build thisthis cool new system so one thing iswe're working on our algorithmsmathematical objects and that means weneed types we need to be able to reasonabout our code and that's very importantto us like we couldn't really like wewant to develop a sound theory ofpatches and we couldn't like theimplement it and then rely on like ourintuition to build correct CC Curcio C++code that would be like yeah maybe thetheory is correct then what about theimplementation so because we have typesand rust that makes it easy to do so butwe also want to be fast like as I saidlike the complexity theory stuff tellsus that we have the we have thepotential to become faster than gatesand we really want to exploit thatpotential and we really want to to havelike to be as fast as we can and rustthat I was also asked to do that becausewe can add like we can use a like fastback hands we can add like roll pointersto like the parameter memory like I'mMaps and what nowthat's really cool and that wasn't thecase in the early prototypes that werereally not in other languages but alsomore maybe more importantly so westarted doing that because we couldn'tconvince Windows users to install darkcinder machines and and that's and andour goal was to be as inclusive aspossible was to like green versioncontrol to everyone but we cannot reallydo that ifcan only tell you a small portion ofcomputer users which are like expertLinux users and so what we really lovedin rust is that we can write clients andservers for real word protocols so I hadto write some of that myself but it wasactually quite pleasant right but alsoyeah there was a there's a Lord unit anHTTP stack working out pretty well and Iwrote the message library alright and sowe finally got Windows support and soI'd like to thank the rust developersfor that that's that's really awesomethanks ok so as part of like what neededto be Bradon because rust pretty younglanguage there aren't that manylibraries so there are I just wanted toconclude with two really hard thingslike two really two things that I reallydid didn't put didn't think I would haveto write when I when I started itsprojects so when as a as a project thatI call sonically I'll just finish wordfor dictionary there was infinite timewell security has a transactional onthis b3 with like many others like LM DBif you know an MV like many otherdatabase backends but it's particularlyis that it has a fork operation thattrends in logarithmic time like a fastfor corporationso for cooperation means you can clonethe database in like log log in time andthen have two copies of the databasebehave as two different databases so Ineeded that to implement branches andstill work in progress because it'sreally hard to do I'm coming back to itin a minuteand then there's a another bacterialagent just like crossed 
fingers like Icrossed it's called fresh it's nestedcurrent and server library they wroteit's it's been made like it's reallyentirely in rust and it's been it'sgotten rid of all unsafe blocks twoweeks ago thanks to our brian smithwho's notokay so the trickiest parts thetrickiest thing that I had to do inwhere this is sanic area so why was ittricky well because rust always wants tofree everything before the programcloses and when you're writing adatabase back-end it doesn't reallysound right you won't like something toremain on disk after the program closesso that was really hard so you have todo like manual memory management you'relike--you're yourself and that's notreally when you're used to a functionalprogramming high scale or cameras likecoming back to manual memory managementisn't free Pleasants so how does it workso I'm going to explain anyway how rusthelped us do that so how does it workit's just not going to be very technicalhere but like most database and Giantsare like sorry some cool database isenjoying that I liked are based on betrees so B trees are basically liketrees made of blocks in each block yoursthere are like ordered elements there'sno word list of elements there's aconstant number of elements and betweenthese elements you have pointers toother other blocks to children blocksand it's so insertion happens itbelieves and then while the blocks platewhen they're gonna get too big butthat's well that's how B trees workthere's a good B tree library in thestandard brush library so now that'scool but we can actually use it to likeit doesn't allocate inside the files soyeah it's allocates the program's memoryso we have to write a like a differentlike a new library for B trees thatwould work in files and so the mainthing I've wish I knew when I startedwriting that as about iterators so waitI'm coming back to it so when whensometimes in B trees you have to mergeblocks when they're when they get tolike well when they get to like underfool we have to merge them and somethingI wasn't doing in the beginning becauselike since since I was doing manualmemory management I figured I could likelike the roothabits came back and I was like okaylet's do manual like manual stuff rollpointers and manual things and that'snot really the right solution Rusbecause you can use like cool rightthings anyway even if you have to rollpointers so main thing I've learnedwhile doing this is iterators and how touse them so merging pages like mergingblocks like this can be done in thefollowing wayso if right you're basically right ineach other like two iterators one foreach page there will like give you yieldlike all elements all successiveelements in the two pages and you canthen chain them and add other elementsin the middle that's exactly what youwant to do with your merchant pages andwell deletions and B trees are usuallypretty tricky and that allows you to ourget tested really easily and so the mainreason why I've why I really like thisis that when you're prototyping usuallyyour your head is kind of in a messystate and you really don't know whatyou're going to so most of the linesyou're writing will be deleted in theend and you really don't know how toproceed what to do and so this kind ofhurry concise and and short sorts ofstatements allow you to uh to get thatphase and get like production ready codemuch much faster because the prototypestart working much much faster anotherthing is that I've learned is that wellthen I was like oh yeahso we can have saved I've tried like wecan have cool attractions 
anyway so thenthere has Carol reflexes came back and Iwas like oh yeah let's do recursionlet's put in record recursion everywhereand one thing I've learned is that it'snot always so in Russ you can't you canwrite very concise code but therecursive way is not as surely the mostconcise you can do so in sanik area forinstance sometimes we need so it's onthese cried so there's a every time youload something from this from this kitlike cost you a lot so you want to avoidlike do you avoid doing that at all costlike and any time you can avoid loadingyour page it's a map right so anytimeyou can avoid loading your page to theprogram's memory you won't really wantto avoid itand so in order to do that the solutionsthat will do things lazily so we'reinstead of deleting something or addinglike I think new element to the sabitriwhat we're doing is just like we'redoing it lazily that means we're we'rejust saying well next time you want todo something you have to copy too youhave to copy the page and updateeverything but just at the time of doingit not right now because we we don'treally know what we're going throughthis page maybe we will merge it maybewe'll drop it and it would be a hugewaste of time to copy it and then editit or or copy it and then merge it thatwould be like we would have to copy thepage twice instead of just once so inorder to do that we need to have likewrite actually something that looks likea recursive function well we just haveto the that function which we haveaccess to several consecutive elementsin the programs call stack and that'snot really easy when you're writing alike writing it's really recursively sothe way it's done actually at the momentit's there's a fake call stack it's justbasically an array and the fact that Btrees are well balanced means that thereare not actually they're not actuallygoing to eat up the world memory of thecomputer so you know that the the thedeath is not going to be larger than 64so you can allocate an array on thestack and and write like how does falloff a stack pointer so you're basicallysimulating the program stack and so thatthese are the two main things I wish Iknew when I started writing somethingyeah so just why do thank youespecially you're a spell trustorganizers this conference is awesomewe're really enjoying us and if you havequestions I'm really happy you'reanswering Thanks[Applause]
# Document Title

Yes, can you hear me okay? We are on now. Do I have a... Jeremy, do I have a clicker? Do I have to go up here? I have my own clicker, can I plug it in? Yes, this is an SQLite clicker module; it is not just your average clicker. Yeah, no, I can't talk without moving my arms, right, I can't do that. I'm already live? Unfortunately, I think we only have a single USB port on this machine. Oh, it's just slow to respond.

Thanks, everybody, for coming. My name is Richard Hipp. This is a talk on Git and why you shouldn't be using it in its current form. Really, this is going to be a talk about what we can do to make Git better. Here's a complete copy of the slides in the original OpenOffice format, there, if you want to download them. (I'm getting a little bit of feedback; do I need to be somewhere else, do I need to stand on the stage? Maybe it's buzzing. Don't worry about the buzzing. Okay.)

So this talk is about Git. Now, just to be up front with you, I am the creator and premier developer of a competing version control system, but I'm not here to push my system. They're both about ten years old, and Git has clearly won mind share. Everybody who is using Git, raise your hand. Raise your hand if you want to be using it. Raise your hand if you are using it but wish you were not. Yes, okay. So it is the software that everybody seems to love to hate, and I'm going to talk a little bit about what some of its problems are and what we can do to fix them. As I said, I wrote a competing system, and I'm not pushing that today, but my decade of experience writing and maintaining that system informs my criticism of Git.

Before we get going, I have a collection of quotes about Git. I love to collect these; if you see any, let me know. I brought along a few of my favorites.

This is a recent one, from Benjamin: "It is so amazingly simple to use that Apress, a single publisher, needs three different books on how to use it. It is so simple that Atlassian and GitHub both felt the need to write their own online tutorials to try to clarify the main Git tutorial on the actual Git website. It's so transparent that developers routinely tell me that the easiest way to learn Git is to start with the file formats and work up to the commands."

I love this one; this was Jonathan Hartley: "It's simplest to think of the state of your Git repository as a point in high-dimensional code space, in which branches are represented as n-dimensional membranes, mapping the spatial loci of successive commits onto the projected manifold of each cloned repository." And if you understand what that means, you should probably be using Git.

This is from Nick Farina, co-founder of Meridian (this is a different Meridian from the people who are right outside that door; a different company), and he wrote: "Git is not a Prius. Git is a Model T. Its plumbing and wiring stick out all over the place. You have to be a mechanic to operate it successfully or you'll be stuck on the side of the road when it breaks down. And it will break down." Emphasis as in the original. This was in an article really pushing Git, saying it's the greatest thing in the world, but he's really up front about its limitations.

I've got a ton of these, but my favorite is the next one. This is from a guy named T. Stain on reddit: "Klingon code warriors embrace Git. We enjoy arbitrary conflicts. Git is not for the weak and feeble. Today is a good day to code." So, you know, you're all Git users, you're laughing at this, and the reason you're laughing is because you know it's true.

So here are my top 10 needed enhancements for Git. I've got a longer list; I tried to limit it to 10, and I tried to order them in what I think is the order of importance. I'm going to start with the first one up here: show the descendants of a check-in. This is a showstopper for me. Because Git does not do this, I can't use Git; I have to use a different system. What do I mean by that?

Think back to how Git is implemented; apparently, in order to use Git successfully, you kind of have to know the low-level data structures. You've got this linked list of commit objects. There are four different types of objects in the file format (commit objects, tree objects, blob objects and tag objects); we're only dealing with commit objects here. Each one has a complete SHA-1 hash label; I have shortened the labels to a single hex digit just for readability. So the first check-in was commit number 9, and that blob goes in there, that's great. Then there were some additional check-ins after that, E and F; there was a fork there, and the F branch went to D and the E branch went up to C. Then there was a merge operation at B, and then A is the latest; A is HEAD.

Each one of these commit objects has a pointer to its parents, so it forms a graph just like this. And the Git documentation does show the graphs going this way, with the arrows pointing backwards in time, which is kind of counterintuitive. I mean, yes, this is the way you need to implement it under the covers, definitely, but users ought to see the graphs with the arrows going forward in time. The thing is, if you're sitting at E and you wonder what comes next, there's nothing here to tell you. You can follow the arrow and find its parents, but you can't find its children. And there's no way to fix that: after somebody commits C, you can't modify E to add a list of children, because if you were to modify E, that would change its hash; it would become a different check-in, a different commit. So there's no way to do that, and this is a big deal: for that reason, if you look at any of the user interfaces, it's very, very difficult to find the descendants of a check-in.

How do we solve this? I propose solving it by having a shadow table off to the side. You keep the same internal data structure, because that's the way you need to do this, but you could make a table in a relational database that keeps track of the parent-child relationship. I've got a little simple table here: it's got the parent hash, the child hash, and then "rank" is a field that says whether this is the merge parent or the primary parent. So let's look at how this goes.
So you're all Git users, and you're laughing at this, and the reason you're laughing is because you know it's true. So here are my top ten things, the top ten enhancements needed for Git. I've got a longer list; I tried to limit it to ten, and I tried to order them in order of importance.

I'm going to start with the first one up here: show the descendants of a check-in. This is a showstopper for me. Because Git does not do this well, I can't use Git; I have to use a different system. What do I mean by that? If you think back to how Git is implemented (and apparently, in order to use Git successfully, you kind of have to know the low-level data structures), you've got this linked list of commit objects. Now, there are four different types of objects in the file format: commit objects, tree objects, blob objects, and tag objects. We're only dealing with commit objects here, and each one has a complete SHA-1 hash label. I have shortened the labels on these to a single hex digit just for readability. So the first check-in was commit number 9, and that blob goes in there, that's great. Then there were some additional check-ins after that, which were E and F; there was a fork, a branch, there, and the F branch went to D and the E branch went up to C, and then there was a merge operation at B, and then A is the latest; A is HEAD. Each one of these commit objects has a pointer to its parents, so it forms a graph, just like this. And in the Git documentation it does show the graphs going this way, so the arrows are pointing backwards in time, which is kind of counterintuitive. I mean, yes, this is the way you need to implement it under the covers, definitely, but users ought to see the graphs going forward, with the arrows going forward in time.

But the thing is, if you're sitting at E and you wonder what comes next, there's nothing here to tell you. You can follow the arrow and find its parents, but you can't find its children. And there's no way, after somebody commits C, to go in and modify E to add a list of children, because if you were to modify E, that would change its hash: it would become a different check-in, a different commit. So there's no way to do that, and for that reason, if you look at any of the user interfaces, it's very, very difficult to find the descendants of a check-in.

How do we solve this? I propose solving it by having a shadow table off to the side. You keep the same internal data structure, because that's the way you need to do this, but you make a table in a relational database that keeps track of the parent-child relationship. So I've got a little simple table here, and it's got the parent hash, the child hash, and then rank, which is a field that says whether this is the merge parent or the primary parent. Let's look at how this goes. We see that A has a parent which is B on the first entry here: B is the parent, A is the child, and the rank is zero because B is the primary parent; it's not a merge. B is a child of two different check-ins, both C and D: C is its primary parent and D is its merge parent. And you can have a three-way merge too, and you number it the same way. You can see how this table very succinctly represents the graph. In fact, this is not a primary data structure: you could build this table very quickly just by looking at the git log.
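Here is a minimal sketch, in SQLite, of what that shadow table and the child lookup might look like. The table and column names are hypothetical, chosen to match the description in the talk:

    -- Hypothetical "lineage" shadow table mirroring the commit graph.
    -- It holds no new information and can be rebuilt from the git log.
    CREATE TABLE lineage(
      parent TEXT NOT NULL,    -- SHA-1 hash of the parent commit
      child  TEXT NOT NULL,    -- SHA-1 hash of the child commit
      rank   INTEGER NOT NULL  -- 0 = primary parent, 1+ = merge parents
    );
    CREATE INDEX lineage_parent ON lineage(parent);

    -- The query Git's own structures cannot answer directly:
    -- the children of a given check-in.
    SELECT child FROM lineage WHERE parent = :hash;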
Once you have this table, it becomes very simple to do a query to get the children or the parents of a particular check-in, and if you have appropriate indices on the table, that query becomes very, very fast. This lets you find the descendants, and of course you can do more complicated things with it. For example, usually what you want to do is: you've got a check-in and you want to know what came afterward; you want to find the 50 most recent descendants, in time, of a particular check-in, to see what was happening. Somebody checks in a bug, you find out about it two years later, okay, what was the follow-on to this? For this example I've added another column to the table, called mtime, which is just a timestamp, and then, using a simple recursive common table expression, I can immediately get the 50 most recent descendants of that check-in. Now, this is just a single SQL query. It's not a common thing; it uses a common table expression, and if you don't know what that is, there's actually a talk by me right after lunch where I will explain how this works. But the point is, a simple query gives you this. You could do the same thing by looking through the git log and doing lots of complete table scans; it would be a lot of code, and it would be slow. And I note that after ten years of intensive use, with lots of user interfaces, nobody does it, so this information is just not available to the people who want it.

You could do all sorts of other things if you had this table. For example, you could find all of the check-ins that occurred during some interval of time, and that's just a select on the table with mtime between two values. I do this kind of thing all the time, because I keep separate components of a project in separate repositories: the SQLite source code is in one repository, the documentation is in another, some of the test cases are in the original source repository, but I have several other repositories that contain additional test cases. So I get an email or a phone call from a client that says, we've got a problem with this version of SQLite that is three years old, and we go back and bisect to a particular check-in, and we wonder: why did we make this change? We want to see what was happening in the other related repositories at the same time, so we can remember what we were thinking three years ago. I don't know about you, but I cannot remember what I was thinking two years ago this week. Do you even know where you were two years ago this week? I don't. (The answer is that we were probably here two years ago this week.) But by doing a query like this, I can get a complete listing of what was happening in all of the relevant repositories: oh yes, that's when we were working on such-and-such a feature, and now I see why we made this change. This happens on a daily basis for me, and it's not easy to do with the current Git structure.

What are the thirty closest check-ins to a particular point in time? Same kind of thing: we've got a change that we're investigating, we know this change introduced a bug, and we want to know what was happening around that point in time, not necessarily on that same branch; maybe on parallel branches. What were we doing at the same time? This is a very important thing when you're tracking bugs, and a system like this allows you to do it.
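Sketches of the three queries just described, again using the hypothetical lineage table from above plus its mtime timestamp column; these are my reconstructions of what the slides showed, not the originals:

    -- 1. The 50 most recent descendants of check-in :hash,
    --    walked with a recursive common table expression.
    WITH RECURSIVE descendants(hash, mtime) AS (
      SELECT child, mtime FROM lineage WHERE parent = :hash
      UNION
      SELECT l.child, l.mtime
        FROM lineage AS l, descendants AS d
       WHERE l.parent = d.hash
    )
    SELECT hash FROM descendants ORDER BY mtime DESC LIMIT 50;

    -- 2. All check-ins that occurred during an interval of time.
    SELECT child FROM lineage WHERE mtime BETWEEN :t1 AND :t2;

    -- 3. The 30 check-ins closest to time :t: a couple of selects
    --    with a union, ordered by time difference, limited to 30.
    SELECT * FROM (SELECT child, mtime - :t AS delta FROM lineage
                    WHERE mtime >= :t ORDER BY mtime LIMIT 30)
    UNION ALL
    SELECT * FROM (SELECT child, :t - mtime AS delta FROM lineage
                    WHERE mtime < :t ORDER BY mtime DESC LIMIT 30)
    ORDER BY delta LIMIT 30;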
I forgot to mention, when I was showing you the original chart, that this lineage table is not really a primary data structure, in the sense that it's not holding any new information. All the information in this table is in the original Git objects, and if the table were out of date for some reason, some software bug or something, you could just delete the whole thing, rescan the git log, and rebuild the table any time you want. My version control system has the same kind of situation, and there's a rebuild command: you just type rebuild and it rebuilds this table from primary sources.

So here's the example of finding the 30 nearest check-ins in time: it's just a couple of selects with a union, ordered by time difference and limited to the first 30, and that's very fast. And you can do things like this: I imported the complete Git repository for Git itself into a different version control system and ran this kind of query on it, just to find out what the oldest commits in Git itself were. These were the first five commits to Git, and I was amused by the very first one, where Linus checked in the very first code to Git and described it, in his own words, as: the initial revision of "git", the information manager from hell. You can actually see this in the Git repo; it's right in here at the bottom, that's the very first check-in. Notice that in this particular rendition the arrows point forward in time rather than backward, which I personally find more intuitive. You can get the same information by doing git log and piping it through tail, to see just the last few entries. (Yes, there's also git log --reverse, which prints them in reverse order; I did not know about that one. I did time it: this took three milliseconds, and git log piped to tail took the better part of a second, so this is faster.) Okay, so that's the big complaint I have: I can't go back and explore the history. It's just this linked list with the arrows going the opposite direction from the way I normally want to go.

The next big problem I have with Git is that it has an overly complex mental model. In particular, when you're working in Git you need to keep in your mind five different snapshots, five different commits. You need to remember what's in your working directory, the files that you're editing right now. You need to remember what is in the index, or staging area. You need to be mindful of your local head, the branch that you're working on. You need to be aware of the local copy of the remote head, that is, your copy of what's going to be on the server. And then you also need to be aware of what is actually the remote head. And there are commands in Git to move information between all five of these things. Really, if you're a developer, you only need to be concerned with two of these, the first one and the last: your working directory, and what's actually on the server, what everybody else sees. All this other stuff in the middle, B, C, and D, is just complication. It forces you to keep in your mind two and a half times more information than you really need. Every one of us has a finite number of brain cycles; you can only think about so much at a time, and my view is that the version control system should get out of your way and use as few brain cycles as possible, so that you can devote as many brain cycles as possible to whatever project you're working on.
Having to keep in mind B, C, and D just seems to be stealing cycles from your normal thinking activity. So one of the first things that I think really ought to go (and of course these things should still be available in the rare cases where they're actually needed) is the staging area. I talk to a lot of people about this, and if you have any views I'd really like you to share them. Some people are fanatical that the Git index is a great thing, and when I ask them why, the usual answer I get is: well, it allows you to do a partial commit. Every other version control system in the world allows you to do a partial commit too, and they don't have a staging area, so I'm not sure why that's the advantage.

Then there's the fact that your commits, the local head, and the remote head don't automatically stay in sync. There may be cases where that would be a desirable thing, but those are the exceptions, not the rule. Usually, when you do a commit, you keep it on your machine, but you'd also like it to immediately go out to the server so that everybody else can see it too. Now, sometimes you're off network and that doesn't work, and so you have to be aware of these things, but that's the exception, not the rule. The usual case is that you want it to go immediately and automatically. Yes, some people say that's not their usual case, but here's the experience I have: when I was originally doing my distributed version control system, it worked the same way, where you had to explicitly push as a separate step. We developed some experience with that, and we eventually found out that it works a whole lot better to push automatically: every time you commit, it automatically pushes, and this really solves a lot of problems. In fact, there were some users on the mailing list of my system recently; sometimes you can get into a race condition where two people commit at the same time, or nearly the same time, and there's a race to see who pushes first, and it automatically creates a branch. They were upset that they weren't getting feedback that two people were committing at the same time. So it goes beyond just automatic: people want not only automatic push, but automatic notification that other people have committed as well. That is what people really want, right?

Number three: Git doesn't really store your branch history. In the Git world, a branch is just a symbolic name for the most recent commit on the end of one of these commit chains, and as you commit new things on there, that pointer moves. So Git doesn't really remember the name of the branch where a commit was originally made. You can kind of use some inference and figure it out, and some of the tools will show you this, but it's not really first-class branch history. When you're doing analysis, and people come back to you and you need to look at what was happening two or three years ago, you often want to know: what branch was this check-in originally committed on? What was the original name of the branch? What are all the historical branches that we've had in this project, what were their starting and ending dates, and how were they finally resolved: were they merged, are they still active, were they abandoned? List all the historical branches. I want to do a bisect only over this particular branch. Bisect is very important to
us. So what happens is: somebody wants to implement a new feature, and they do a sequence of check-ins on a branch, and then they merge that branch onto the trunk. But later on, when we're doing a bisect, we don't want to bisect into all those little incremental changes where they were adding the feature; we want just the one case where they added it. But because there's no branch history in Git, there's no way to keep up with this. You can make inferences based on the commit logs, but there's no permanent record of the name of the branch where things were committed. I didn't think this was very important, but I asked for feedback from the user community, and a lot of people said this was their number one complaint with Git: it forgets the names of my branches.

Number four: multiple check-outs from the same repository. Right now with Git, the working area is part of the repository; you can only have one working area per repository. If you're working on something, you've got your project all taken apart, and an email or phone call comes in that requires you to go back and look at something historical, well, you can stash your work, but that's kind of bad, because even with the stash you're going to lose context. You can clone your repository to a new repository and look at the historical version in the clone, but that's just a workaround, and it's unsatisfying. It would be so much nicer if you had one repository and could have multiple working directories sitting on different checkouts from that one repository. People who have worked in both systems tell me repeatedly that this is very important to them: multiple checkouts from the same repository.

Next: sliced and cloned checkouts; sliced checkouts and clones, excuse me. A slice is kind of a feature that you had with Subversion and CVS, for when you've got a massively wide project, like NetBSD. If you ever look at their repository, they have the entire user space, with 60,000 different files, all in one repository. It's massive, and most people don't want all 60,000 files; they're only working on one subdirectory. So a slice checkout means: I want to check out or clone something, but I don't want the entire repository, just this one subdirectory. Wouldn't it be great if you could do that? There's really no technical reason why you can't; it's just that it isn't supported.

Yes: what is a shallow clone? A shallow clone is where you clone a repository but you don't get all of its history. Yes, that's another thing that's nice to have, and that is a new feature; it's just been in the last year or two. The other thing a lot of people request, and Git does support this now, is: you've got a project that goes back ten years; if I want to access this project, why do I have to get ten years of history? Can I get by with just loading two months of history and save bandwidth? That's a shallow clone, and Git has that. But we're also asking for a slice clone: a shallow clone is slicing the history this way, and a slice is cutting it the other way. It would be nice to be able to do both a slice and a shallow clone.

Yes, question? The comment is that on some projects the directories move around and things change, and so slicing doesn't work as well there. So: don't slice on that project. But on some projects, like NetBSD, the directory structure stays the same for like 25 years, and they want to do this sort of
thing. Yes, question? The point was raised that the Git solution, and I'm going to go beyond Git and just say the distributed version control solution, because they all have this problem, including mine, is to have separate repositories for each one of the little components that you might want to load separately. But this is kind of a workaround, isn't it? It means you have to predict in advance which directories are going to be of interest as a separate piece, and historically we've not been really good at predicting that. It would be much better, and the software could in theory do it, to be able to clone or check out a slice.

Next: checkouts and commits against a remote repository. Right now, in order to work in Git, you have to make a clone, and it has to be on your local machine; everything has to be local. And I'm asking: why is that? Well, I know the technical reasons why, but from a user's perspective it seems like an unnecessary burden. Now, if you're an active developer, yes, you do want your local copy, and if you're going to be working off network you definitely want a local copy, and that should probably be the default. But if I'm just browsing GitHub somewhere and I see some interesting project and I want to look at the source code, why do I have to download the entire 15-year history just to look at the latest version? Why can't I just check out directly from GitHub without cloning? That seems like it would be really easy to do. For that matter, if I make a change, why can't I commit it back over the network? There are advantages to having it local if you're doing a lot of work on a project, but if it's just an occasional thing, why can't I commit over the network?

Okay, the next one is a busybox version of Git. Who knows what busybox is? Busybox is, of course, that single program that has all of the standard UNIX command-line utilities built in. Now, this is not a perfect analogy, because busybox also has limitations; it doesn't do the full thing. But right now, when you install Git, it installs, what is it, 134 different programs, because each command is implemented by a different executable, and all of these little programs get put in a special directory somewhere, and then the one git program looks at the arguments and decides which one to run. It's got a huge number of dependencies; it's this big pile of stuff. And a lot of people tell me that they really want a version control system that's just one executable: you download git.exe, or just git if you're on Linux, you put it on your PATH, and it works. Nothing's installed; you don't need apt-get. You can put it in a chroot jail. If you want to upgrade, you just overwrite the old binary with the new one. If you want to uninstall it, you just delete the binary. Very simple. Whereas with Git, you really need something like apt-get just to manage it. A big pile of programs like that is great for development work and rapid prototyping, but for a mature product that's ten years old, that everybody's using, you'd think there would be some better packaging.

And then you can just download it really quickly. I hear from a lot of people who work in companies where, when they're going on a trip, they have to check out a laptop; they don't have their own laptops. You check out a
laptop, and it comes pre-configured, and it doesn't have the version control system you want, so they have to go and install Git. Wouldn't it be better to have a single binary they could just drop on the machine?

Next: everything comes over HTTP or HTTPS. My wife is a faculty member at UNCC, the local university, and I go over on campus a lot, and over there they have guest Wi-Fi, Niner Guest. This man right here is probably in charge of it, and yes, I am grateful for the free Wi-Fi access. But, like so much of the world, they confuse the Internet and the World Wide Web; they think they are the same thing. That means Niner Guest only allows you to use TCP ports 80 and 443; those are the only two options. So you cannot secure-shell back into your server, and furthermore you can't run any other protocol that doesn't run over port 80 and doesn't look like HTTP. And it's not just UNCC that does this; a lot of places do. I hear from a lot of people that they use my alternative version control system precisely because it uses plain HTTP: we use it because it's the only one that will penetrate our corporate firewall. There's nothing about Git that couldn't be finagled to work over HTTP; it's just that they don't do it. So I think that is something that would greatly improve its usability.

Next, I think there needs to be a git all command. This is a thing we did, where all means it works on all of your repositories; it keeps track of all of your repositories. As I said, I have dozens of repositories open on my desktop at any particular point in time, and I lose track of them; I can't remember them all. So I'm working all day on all these different projects in different repositories, I get to the end of the day, and I'd like to be able to say git all status, and have it go around, find all my repositories, and do a status on each one, to show me what I forgot to commit.

The way it would do this, of course: there's already a file in your home directory that keeps track of things. Isn't it called .git? Well, no: the file in your home directory that keeps track of your username and all that stuff is .gitconfig, okay. So there's already that file, and every time you run a git command, it consults that file; it reads it. And people complain: you can't keep track of all the repositories, because you can freely move them around; you can just do a mv and move one to a different place. That's fine: .gitconfig keeps track of the last known position of each one. So every time you run a git command, it says: okay, this repository I'm working in, I just read .gitconfig, is it listed? Usually it will be; if it's not, add it. And then, when you run a git all command, it goes down the list of possible git repositories and checks each one to see if it still is one, because you might have moved it away. So this is easy to implement, and it's not 100%, but in practice it works well enough. So if I'm working on my desktop (of course, I don't use Git, I use a different system, but if I were using Git) and I'm getting ready to go on the road, I can do git all push, and it pushes everything out to the server. Then I go over to my laptop and do git all pull to make sure that everything is synced, and then I can go off network on my laptop, and I don't have to worry that I forgot about one of
my critical projects, one of my critical repos. It's a very important thing.

And finally... no, that's not finally, there's a bonus item: git serve. Does anybody here use Mercurial? You know about the serve command? Do you use hg serve? Okay, and the comment from the audience was: this is exactly why we use Mercurial rather than Git, because it has a serve command. What this does is it kicks up a web server. Although, you know, Mercurial doesn't really go far enough, in my view; let me tell you. If you do hg serve, it starts up a little web server, and then you can point your web browser at it, and you get lots of really useful information about your repository. But the one I implement for my system goes one step further: when you type fossil ui, it starts up the server and it also causes your favorite web browser to pop up on that page. So where Mercurial is a two-step process, you start the server and then type the URL into the web browser, mine does both in one step. But the point is, there's this very rich environment there. I know there are lots of tools out there for giving you a web interface to your Git repository, but they're a separate install; they usually require that you also have Apache; there are lots of requirements and a big setup. So people only do it on a server somewhere, where they've taken the time to set it up. But wouldn't it be really cool if every time you had a repository, you automatically had a server? You could just say git serve, and immediately your web browser pops up with all this graphical historical information that you can just click around in. If you had that, and you used it for a couple of weeks, I promise you would never believe that you ever got along without it. It is a very amazing thing.

I'm probably blazing through these slides way faster than... how much more time do I have? Ninety minutes? Plenty of time for questions and answers. But I do have one bonus feature. This is a thing that I never personally needed, but I hear from a lot of people that they would really like to have it: advisory locks. What do I mean by this? This is coming from the game development community. When you're developing with ASCII text files... let me take you back to some of the older version control systems; if you're younger than me, you may not remember some of these. With things like SCCS and RCS, the way they worked is that when you did a check-out, all your files came out read-only; you couldn't edit them. If you wanted to change a file, you had to do a special check-out for editing, which locked the file, so that only one person could have a check-out for editing at a time. That way you would never get a conflict of any kind, because only one person could edit a file at a time. Of course, the big downfall of that approach was that somebody would check something out for editing and then immediately leave for a two-week vacation, and we'd have to go running around finding an administrator to unlock it, and so forth. So CVS came along, and it gave you the ability to just edit without a check-out for editing, and that was the coolest feature in the world; it was just amazing. I know it's very popular these days for people to bad-mouth CVS, and I recognize that CVS has limitations and is an older technology, but those of us who had to use what came before CVS will never speak ill of CVS.
So now we have all this really cool merging stuff, so that when two people make simultaneous changes, they get merged together. That works great for text files. It does not work for JPEGs; it does not work for MPEGs; it does not work for the binary resources that are a big part of game development, for example, but also of other things. And so a lot of people would love the ability to put an advisory lock in the central repository: I'm editing this JPEG. Then, if somebody else wants to edit that JPEG, they get a warning. Now, it's an advisory lock, so if they start editing and then go on vacation, you don't have to go run up an administrator to fix it, but it still helps you coordinate the editing of binary resources.

So we've had a progression of open-source version control systems. In the old days there were SCCS and RCS, and then there were CVS and Subversion, which were huge, huge innovations. And then Git came along. It was really based on this thing called Monotone, which I think really pioneered the idea of distributed version control systems, but Git was the one that was successful. But that's been ten years, and other than adding a few features around the edges, such as shallow clones, Git really hasn't advanced in ten years; it hasn't done much that's new. And so my question is: what is going to come next? I've outlined some ideas here about the direction I think version control needs to go. I'm hopeful that some of you might be interested in going out and hacking on it and implementing some of these ideas; maybe somebody who's watching the video will see this. If you have other ideas, criticisms, or complaints, if you think that I'm completely off base, I really do want to hear from you. Again, I don't use Git on a daily basis; I use a different system that I wrote myself, and so I can be somewhat out of touch. If you think I'm completely off base, I really do want to understand your point of view, so please give me feedback. That is the extent of my talk, and I will be happy to take questions, comments, and criticisms at this point.

So, you've got a question in the back? No? No, I was going to say: we've already established that everybody here is using Git, is that correct? Who's also using Subversion? Raise your hand if you are a current or former Subversion user. Okay. CVS: current CVS users? Nobody's still using CVS? Former CVS users? Mercurial: who's using Mercurial? Something different: call out what you've got. Darcs, and RCS. Okay, fine. Yes? Okay, well, you know, even when I have just one file, I'll set up a repository for that one file, and then I'll also set up a clone on a server somewhere, and my system, every time I do a check-in, pushes it, and that's my backup. A different one: Perforce. How do you like Perforce? It gets the job done, okay. Another one, a very obscure one called Fossil that I might not have heard of? Yeah, okay, great.

So: are you happy with Git? I see a lot of heads going this way; you're handy with Git, you like it. Okay, so the comment was: his problem with Git is not Git itself, it's other people. He invested a lot of time to learn all of the obscure commands, to learn the tree structure, so now he can get around in Git pretty well, and then other people come along after him and make a mess of the repository. Right. Okay. Do you
think that's fair, though? A lot of people use version control other than programmers; data scientists, for example. A lot of people who are not programmers need to use version control. I wish that more scientists would use version control; I wish that climate scientists would use version control. Okay, there you go. But Git is hard to use: you have to spend a lot of time learning all these obscure commands to move between the five different things that you have to keep in mind. One of the funny quotes that I didn't have a slide for was: in order to effectively use Git, you have to have the man page tattooed on your arm. And, for the video, the audience is saying: yes, you do. Oh, and: we wish that Congress would use version control; we have applause from the audience.

Yes, question in the back? All right, the comment from the audience was that he looked at the online documentation for, say, git push, which is a very common command, not something obscure, and the documentation makes no sense. Yep: "push refs to remote repository." What does that mean, really? This goes back to my first point, that the best way to learn Git is to start with the data structures and then work up to the commands. And you're right: if you're not a programmer, if you're not hardcore, if you're not a Klingon warrior, you shouldn't have to learn this stuff. That's the point.

Yes? All right, I'm going to try to summarize those remarks: he teaches robotics to high school students who are new to programming, and Git is so complex that it's just beyond the ability of a newbie to learn; it's a barrier to entry. You need to be a kernel hacker to really understand it. But on the other hand, we need to be teaching new programmers version control as a core skill, and they take one look at it, turn around, and change their major to history.

Yes, in the back? Okay, I'll just go ahead and say it: I develop Fossil. So, Git and Fossil, right. The comment was that Git and Fossil, and also Mercurial and Monotone, are commit-oriented, whereas Darcs is patch-oriented, and can I comment on that? What we should do is get together over lunch, because I have never really understood Darcs; I tried, but it just wasn't making sense to me. If you look at a graph, the patches are just the arcs between the nodes, right? So Darcs is really focused on the arcs, whereas Git is focused on the nodes. Is there a difference here that I'm missing? All right, so the point was made that Darcs can answer questions that Git cannot answer, and to summarize the remarks: the idea of keeping track of patches works better for some people's way of thinking than keeping track of commits. That may well be the case, and I'm not opposed to it. See, I'm not here to tear down Git; I'm trying to make it better. And really, if it gets good enough, I'll convert all of my Fossil repositories over to Git and just start using that, but right now it's not anything close to where I need it to be. My point is to improve it for everybody.

Any other comments or remarks or questions? Yes, oh yes, there's the famous Git man page generator. I should have made a link to that; see me afterward, because I want to put that in every version of this talk. Also: those of you using Git, who's ever had a detached head? Yeah. I just want to get back to my first point: if you had this relational database keeping track of things, a detached head becomes an impossibility. There are no more detached heads; it completely solves the detached-head problem. I meant to mention that earlier and I just forgot. (In addition to fixing Git, we need to come up with a new aid for presenters, so that we can have reminders up here on the screen.) And the Git master here tells me that he does detached heads on purpose. So one person says he likes detached heads because he does them on purpose, and then somebody else says detached heads are all fun and games until you do one without meaning to, and Git will eventually garbage-collect your detached commits, so your detached heads will go away. But with an approach like this, detached heads just appear: in the little graphs that you get with your graphical interface, there's a nice graph of your history, and detached heads just appear there, and you can click on them and see what they're about. It's not some mystery that you have to go dig up out of a log; you don't have to remember that they're there; it just shows you, and you never lose them.
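With a lineage table like the one sketched earlier, every head, detached or not, falls out of a single query; this is my illustration, not something from the talk:

    -- Heads are the leaves of the graph: commits that never
    -- appear as anyone's parent.
    SELECT child FROM lineage
     WHERE child NOT IN (SELECT parent FROM lineage);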
Any other comments or questions? If you want to know more about the alternative, you can meet me in the hall; we'll be happy to give you demonstrations and a sales talk. Thank you for coming, and enjoy the conference. If the video is still going: I'm getting reports from the audience that the man page generator for Git is very funny.
Thank you... you haven't even heard what I have to say yet. Okay. So, since we're a nice cozy audience: the material I've prepared is introductory, mostly. The idea is to give a flavor of what Darcs is about, why it exists, and how it works, at a fairly high level. But I do know that at least a third of the audience is quite familiar with Darcs, and I don't know about the rest of the audience, so I'm happy to dive into more detail in questions and so forth if people would like that. When I get to the demo, for example, if everybody feels that they know exactly what Darcs does, I can skip it, and instead you can ask difficult questions about some of the other slides. So I guess: does anybody not know what Darcs is? Has anybody here not used Darcs? Okay, so at least some introductory material is worthwhile.

So what is Darcs? I guess from the point of view of a Haskell programmer, these days you probably only come across Darcs occasionally. Five years ago most Haskell projects were hosted in Darcs, and now, roughly speaking, Git has won, and there are only a few Haskell projects that are still using Darcs. So what is it, exactly? Well, like Git and Mercurial and so forth, it is a distributed version control system, and historically it was one of the very early ones to be around, after tla (what's now Arch) and some ancient stuff in that mold. In other words, it's distributed; it's not like Subversion and CVS, it doesn't have one central server, or at least you're not forced to have one central server. But I think the concept of distributed version control is now pretty familiar to people. Unlike other distributed systems, in fact unlike any other version control system, the first-class objects in Darcs are patches. In most version control systems, what the version control system really cares about is storing trees: you have your source tree, you change it a bit, and it stores another copy of your source tree, probably with a lot of sharing and optimization so that it's not storing an entire copy each time, but as far as the version control system is concerned, its job is just to give you back a source tree at a particular revision and keep track of all that for you. Darcs tries, in its fundamental model, to use patches, the changes between trees, rather than the actual trees themselves, and I'll go into more about that later.

So why bother? And by the way, I'm very happy to have heckling from Git users who think that Git is the be-all and end-all; we can have a nice argument. Why bother keeping Darcs going? We could just tell everybody who's using Darcs: you might as well stop, you might as well switch to GitHub right now. Those of us who are still working on it think that this idea of first-class patches actually is important. It gives you a kind of power, and a different kind of model for working with version control, that really does give you something that Git and Mercurial and so forth don't. The other aspect of this is that, in practice, most projects are hosted in Git nowadays, and we're working on a bridge to Git, so that
people will be able to host their projects in Git while some people use Darcs, or have their central host be in Darcs and allow people to submit Git patches. That's kind of in progress.

[Question: you say Git gives the impression of being based on snapshots; in what way?] Well, what you see in the Git user interface is commits, right? Those are specific states of a repository. If you bring up the Git user interface, you see a tree, or really a graph, of revisions, and each of those revisions corresponds to a specific source tree. Whereas in Darcs, when you look at the changes in a repository, each of those changes really does correspond to the changes from the previous state; that's how Darcs stores them, and that's how Darcs lets you manipulate them. I'm sure you know what I really mean here, but that would be my spiel for that question.

Okay, so what does it really mean to have first-class patches? Well, part of this is really just a philosophical difference. I mean, Haskell has first-class functions; so does C, right? But it's fairly clear which you'd prefer to use if you really want to write functional programs. It's the same kind of thing with Darcs: if you look at Git or Mercurial or whatever through the right glasses, you could think of the diff to the previous revision as being the first-class object in that system, but really, it doesn't make it easy to work with it that way. Darcs does its very best to make it as easy as possible to treat patches as the real unit of work in a repository.

So what does that actually mean in practice? One of the really key benefits that you get from using Darcs is cherry-picking changes: it's very easy, it's encouraged by the user interface, and it's very cheap. And once you've cherry-picked a change, and you later decide you want the changes that you skipped over when you did the cherry-picking, you pull those back into your repository and you've got exactly the same thing as you would have had if you had pulled them in the normal order. I'll show you that a bit more in my demo. Whereas if you did that in Git, you'd end up with two different repositories, and you'd have to merge the two to get back to the same state. The other thing about Darcs is that merges are deterministic: any time you do a merge in Darcs, it gives you exactly the same result, wherever you are in the world, whatever version of Darcs you're using. The behavior of the merge depends only on what Darcs chose to record at the time you actually saved your patches.
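The property underneath both of those claims, cherry-pick order-independence and deterministic merging, is patch commutation from Darcs's patch theory. Stated loosely, in my notation rather than the speaker's: two sequential patches $p$ and $q$ can be commuted,

    $$ p\,q \;\longleftrightarrow\; q'\,p' $$

where $q'$ and $p'$ represent the same changes adjusted to apply in the opposite order, and both orders produce the identical tree. Cherry-picking $q$ without $p$ means applying the commuted $q'$; pulling $p$ later reproduces exactly the state that applying $p$ then $q$ would have given.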
From the point of view of a user of Darcs, we like to think that we have a simple, intuitive user model: if you just start using Darcs without knowing how, we hope you'll quickly get to grips with what's going on behind the scenes enough to understand how to use it, without really needing to understand the guts too much. I think that's a criticism a lot of people make of Git: that it's very hard to use it well without expert-level knowledge of a large portion of how Git works and how you're expected to interact with it. So, by contrast again, Darcs tries to have a simple command set. Another philosophical point about Darcs is that we try to make commands interactive by default: when there are options about what a command could do, rather than making you pass flags to get exactly the behavior you want, Darcs tries to ask you questions. That's not entirely the case, but it will ask you, for example, how much of this change you would like to record.

So what doesn't Darcs have? Well, a whole load, really. The features people tend to find missing most are these. Multiple-head repositories: in Darcs you have to have one directory per checkout, basically per branch, so if you want to branch and have two different things in flight and switch between them, you need two different directories in your file system. Hopefully we'll fix that at some point, but it's not trivial, and some people argue that we shouldn't do it, though I think most of the Darcs developers have concluded that we should. Darcs makes a mess of conflict handling if you have complicated conflicts: simple conflicts aren't particularly problematic, but when you start having conflicts with conflicts and that kind of thing, both the user interface and the internals of Darcs tend to get a little bit upset. In practice people don't run into this very often; Darcs works fine with small to medium-sized repositories, but it doesn't work brilliantly with really huge ones. Related to that, it's not all that fast. It doesn't have unique revision identifiers for a particular state of the repository, which really annoys some people, because with things like Git, or even Subversion, you can just say: this number represents an exact state of the repository; you give it to somebody else, and if they've got enough state, they can reproduce exactly what you had. And really, these days, what Darcs is missing is GitHub, and of course it doesn't have the whole set of tooling, or in fact a large user community, around it. Well, it never did have a large user community, but it's shrunk a bit.

[Question: what about merge complexity?] Okay, if there are no conflicts involved: a merge in Darcs is asymptotically more complex than a merge in Git if you've got a very large history leading up to the merge. If you've branched, and you've made 100 changes on each side, and you've got, say, a one-file repository, then in Darcs the merge scales with the product of the number of changes on each side, whereas in Git it scales with the size of the tree, roughly. So Darcs might in principle be faster with just a single change; well, probably it wouldn't be. [Do you mean with respect to conflicts, or just merging in general?] Right, so yes, it is quadratic: it's the product. If you've got two branches, with 100 changes unique to each side, then 100 times 100 is the amount of work Darcs has to do to merge those changes.
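In rough asymptotic terms (my notation, not the speaker's): a clean Darcs merge of branches with $p$ and $q$ unique patches costs on the order of $O(p \cdot q)$ commutation steps, while Git's three-way merge cost is driven by tree size, roughly $O(|T|)$. Hence the 100-by-100 example costs about 10,000 steps, regardless of how big the tree is.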
I don't think that's such a big problem in practice, though. Yes, if you get conflicts, then that is a real problem, but if it's a clean merge... I mean, it is a significant point that Darcs does these merges one step at a time, but personally I don't think you run into it that often with at least medium-sized repositories.

Okay, so I'll dive into a demo now. There are some people who haven't seen Darcs, so I guess I'll go into it properly. I'm going to start by making a directory to hold my repository. Can everyone read this okay? I think I checked from the back earlier, so it should be all right. So I'm going to keep my repository in this directory, and the first thing I do is darcs init, just to get the basic metadata in place and make this an empty repository. Then I'm going to copy in a simple source file that I prepared earlier for the purpose. Just to show you what's in that file for now: it's a very basic Haskell program that I shall make a few changes to.

Okay, and the next thing I need to do is tell Darcs that this file exists in my repository and is something it should track: darcs add. And in Darcs, the command that's the equivalent of what might be called commit in many other version control systems is called record, as in: record a patch. So I'm going to run that command... and it scrolled confusingly, because I've zoomed in too much; now I can't find my mouse pointer, sorry. Is it going to be okay if I make the font a bit smaller, so I can actually see the full width? Is that still readable for everybody? Okay.

All right, so Darcs is just going to ask me about what I've done; this is what I was saying about it being interactive before. And what I've done is: I've added the file, and I've put some contents in the file. So I'm just going to record this initial patch; this is just the initial version. Then Darcs asks me some stuff: it confirms that I want to record exactly what I said I wanted to record, asks me what the name of the patch is going to be, and, if I wanted, it would bring up an editor and let me write a longer comment than a single line. Yes, question? [The first change was adding the file?] Yes, and the second was adding the content, that's right. [So you only need to do the add once per file?] That's right, yes; it's exactly the same as tracking a file in any other system. I could also have done this by saying darcs record -l, which says look for adds; it would have gone off and looked at every untracked file in the repository and let me add those. It's just a trade-off between whether you want to have untracked files to worry about or not.

Okay, so I'm not going to add a long comment. Now I'm going to edit that file. I'm going to make a couple of changes: firstly, I'm going to add one more thing that this program is going to do, and secondly, I'm going to change one of the messages that it's already printing out. Then I'm going to come back and record those changes. Darcs accepts unique prefixes of any command,
so I can use rec as a synonym for record, which I personally find quite useful.

Blast. Okay, sorry, my demo's just gone a bit wrong... actually, no, this is fine, I can still do this. What I was intending at this point, when I ran record, was that Darcs would offer me those two changes as separate changes to the repository, but it's decided that it all lives in one part of the diff, so it's showing me the two changes I made at once. And since I actually want to show you how it treats them as separate changes, that's a bit inconvenient. [There is still going to be a second change?] Yes, but it's not going to make semantic sense when I start making other changes to the patch. So what I'm going to do, actually, is bring up something I wasn't intending to show you, which is the interactive editor for the changes in a patch that you're recording. What I wanted Darcs to do was treat the addition of the exclamation mark as one change, and the "say goodbye" as a different change. So what I'm going to do here is edit this intermediate state that Darcs has presented me with, to say: okay, the first change I want you to show me is the addition of the exclamation mark, and the second change you can show me is the addition of "goodbye". Oh, actually, sorry; I've used K to go back, because what I really wanted was to record the addition of the text first and then the addition of the message, for reasons I'll explain in a minute. So I'll say no to this one, and... blast.

All right, fine. I'm going to start again so this actually works out properly. So now I'm jumping around in what I was intending to do. Usually this kind of problem, where Darcs wants to show you changes that are stuck together in ways you didn't quite expect, is less likely when you're working with bigger repositories, because the changes you make are more likely to be widely spread, either inside a single file or between different files. What's happened here is that I've tried to make a very small demo, and as a result the diffs it's asking me about are not quite the ones I wanted. So what I'm going to do is go back and put some lines in so that it definitely gives me the diff I want; just to cheat a bit, really, to make this demo work at least reasonably quickly. I'm using a command called darcs revert now (that's what rev is short for), which offers me the changes that I've made locally in my working copy and asks: do you want to throw these changes away? Again, it's offering them to me one at a time, but in this case I want to get rid of all of them.

Okay, so now I'm going to make those same changes again (sorry, this is supposed to be Haskell... sorry), and hopefully this time it will offer me those changes independently, if I can type, which apparently I can't. Excuse me. Oh yes, thanks; I wasn't actually going to run the program, so it
Oh yes, thanks; I wasn't actually going to run the program, so it doesn't really matter, but I like people to think that I write valid Haskell.

Okay, so that's one of the changes I could record, the addition of the exclamation mark, and the addition of the goodbye message is the other change I can record. Clearly I could record any subset of these changes that I chose to.

Q: Why is it different this time? Now the line between the changes has some content; in the previous case it was an empty line.

The only reason it's different is a heuristic in diff. This is jumping ahead a little, but when you do a three-way merge in any version control system, the system runs diff between the base of your merge and the two options, and then it tries to merge those changes; and at the point where it runs the diff, there are usually multiple valid choices of diff. What happened here is that there were two blank lines in play: in the old file there was one blank line between "say hello" and main, and in the new file there is a blank line between "say hello" and "say goodbye" and another blank line between "say goodbye" and main. How diff chose to line up the blank line in the old version with one in the new version is what affected what was presented to me. So what I did was change one of those blank lines to be something a bit more than just a blank line, forcing diff not to line those two up. The reason this works is that when I added the first change I didn't copy that content; I put a blank line in, so the blank line is now distinguishable from the content that was there before. It's really just diff getting a little bit confused, a heuristic that diff always applies. All Darcs is doing at this point is making a diff and presenting me with the individual pieces of that diff.

Q: Doesn't that make the number of change sets too large? If you're editing every other line, that's nine separate changes.

Darcs will offer you all nine changes individually if you've made changes that are separated in the file. But this interactive prompt has a bunch of keys you can use so that you don't have to say yes or no to every single change. The standard one is "f": if you know you want everything you've done in that file, you press "f" to record all further changes to this file. There's also "a" to record everything it's going to offer; so if, for example, you've made one change that you didn't want to record, you can say no at the beginning and then "a" for everything else.
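For readers following along, the alignment ambiguity described in that answer is easy to reproduce with any LCS-based diff. The sketch below is illustrative only and assumes the Diff package from Hackage; the file contents are invented stand-ins for the demo's demo.hs.

```haskell
-- Reproducing the blank-line alignment ambiguity: the old file has one
-- blank line, the new file has two, and the diff algorithm must choose
-- which blank line in the new file "is" the old one. Either choice is
-- a valid minimal diff; a heuristic picks one, and that choice is what
-- determines how `darcs record` splits the hunks it offers you.
import Data.Algorithm.Diff (getDiff)

oldFile, newFile :: [String]
oldFile = ["sayHello", "", "main"]
newFile = ["sayHello", "", "sayGoodbye", "", "main"]

main :: IO ()
main = mapM_ print (getDiff oldFile newFile)
```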
Okay, so right now I'm going to record the addition of "say goodbye" and the call to "say goodbye" together, so that those two live together as an atomic change set, an atomic patch.

Q: Once you've recorded those two together, are they stuck together forever?

Yes, pretty much.

Q: What happens to the original file when you do a record?

The file in your working directory doesn't get touched at all. All recording does is change the current version-controlled state. There's always an implicit diff between what you've done in your working directory and the state of the very last revision in the repository; that diff has now got smaller, but the actual file hasn't changed. If, on the other hand, I had used revert, which is about changing the state of your working copy, then I could have reverted some of these changes and not others, and the file would have been put back into a state a little bit closer to the recorded copy.

Okay, so I've recorded a change called "say goodbye" which has that change in it, and now I'm going to record another patch that incorporates the last change I made and nothing else. So now, apart from the initial setup of the repository, I've recorded two different changes: firstly I've added "say goodbye", and secondly I've changed the text of hello.

Now, I've just remembered that I forgot to clone this repository at the time I intended to, so I'm going to do it now anyway. I'm going to clone the repository I made into another one; "darcs get" is the standard command for doing that, and you can use it with remote repositories as well. Because I forgot to do this before I recorded those changes, I'm going to delete a couple of changes from the new repository using the obliterate command.

Now I'll get back to what I actually intended to do at this point, which is to show the pull command, which is about moving patches between repositories. This is where I'm going to show you how cherry-picking works in Darcs, in a simple, intuitive way. If I pull from repository one, Darcs offers me the changes that are in that repository but not in my present one, in the order they exist in that repository.

Q: You only got the comment there. Can you see the code?

Yes: "x" will give me a summary of a patch, and "v" will show me the code, though with a little bit of metadata that would probably be more user-friendly to have elided.

Okay, at this point I don't want to say goodbye yet; I just want the hello change, but that was the second change I made. So I'm going to say no to the first patch, and it's still going to offer me the hello change, and I'll pull that. There we are.
If I have a look at the contents of demo.hs now, it's got the change to hello, but it hasn't got the change to goodbye, even though that combination was never one of the recorded states in the previous repository.

Q: If I disconnect now, is the change I skipped available locally?

No, it's not available locally. If you wanted that, what you would need to do is clone the entire repository into a different directory and then start pulling from that. In fact, that's the way I typically use Darcs: I work on the train a lot, so I pull from the upstream repository to somewhere local and then work from that.

Okay, so I've got this change, which was one of the two changes I made, but not in the order that I made it. Now I'm going to go back to the other repository and use a command called darcs replace. Darcs replace is a search-and-replace operation on your repository. Looking at demo.hs, I've decided I want to rename the printMessage function to something different, so I'm going to run "darcs replace printMessage outputMessage demo.hs".

Firstly, I'd like to emphasize that this is a pretty dumb search and replace. It doesn't take any account of language semantics or anything like that, so if I had a variable called printMessage somewhere else in scope, it would happily rename that as well.

Okay, so now I've done this replace. It has changed my file for me, but it hasn't actually recorded a patch saying "this is a change I made", so I'll just do that. Note that the change is actually expressed as this replacement operation, and yes, I want to record it. Looking at demo.hs again, there were a couple of occurrences of printMessage that are now outputMessage, changed by this replace operation.

Now, if I go back to repository two, it doesn't actually have all those call sites for printMessage, because I didn't pull the change where I added the call to "say goodbye". But if I pull from repository one, I can still ignore the "say goodbye" change and pull just the printMessage change. And what has it done to my file? Well, it hasn't pulled "say goodbye", so there's no change there, but it has renamed printMessage to outputMessage correctly.

Finally, if I pull again from repository one and bring back the "say goodbye" change (I could also have shown you that patch), it has pulled back the "say goodbye" change, which was originally recorded as an addition of a call to printMessage; but now, in the context it has just been pulled into, it's an addition of outputMessage, because that's what makes your program make sense. Now, that's only true because the replace was a safe thing to do on my repository, but nonetheless it has done the right thing.

The other point about this is that this repository is now pretty much identical to the first one. There are no other changes I can pull from the other one, and if I ask Darcs whether there are any differences between these two repositories, there won't really be any.

Darcs changes is just a list of the patches you have in your repository, in reverse order. As you can see, in this repository they're in this order, and in the other repository you've got them in the order that I originally recorded them.
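To make the semantics of that demo concrete, here is a minimal sketch of a token-replace patch. The types and names are invented for illustration; Darcs's real implementation differs, but the two essential properties shown here, whole-token matching and trivial invertibility, are the ones the talk relies on.

```haskell
-- A dumb, whole-token substitution, invertible by swapping the tokens.
import Data.Char (isAlphaNum)

data TokenReplace = TokenReplace { fromTok :: String, toTok :: String }

-- Split a line into word and non-word runs so that only complete
-- tokens are replaced (so a token like "printMessage2" is left alone).
tokens :: String -> [String]
tokens [] = []
tokens s@(c:_) = run : tokens rest
  where
    (run, rest) = span (\x -> isTok x == isTok c) s
    isTok x = isAlphaNum x || x == '_'

applyReplace :: TokenReplace -> String -> String
applyReplace (TokenReplace o n) = concatMap swap . tokens
  where swap t | t == o    = n
               | otherwise = t

invertReplace :: TokenReplace -> TokenReplace
invertReplace (TokenReplace o n) = TokenReplace n o

main :: IO ()
main = do
  let r = TokenReplace "printMessage" "outputMessage"
      ln = "main = printMessage msg"
  putStrLn (applyReplace r ln)                                -- renames the token
  putStrLn (applyReplace (invertReplace r) (applyReplace r ln)) -- round-trips
```

Whole-token matching is why an unrelated identifier would survive the rename untouched, and invertibility is what lets a replace patch participate in the commuting machinery discussed later in the talk.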
Q: Suppose I have a change which has printMessage inside, and I want printMessage to stay, and then I put that change on top of the already-replaced repository...

So, if I now start editing after this rename operation and put a new printMessage in, that will be safe, because it comes after the replace. But if you have a patch from a repository that hasn't seen the replace, a patch that was originally recorded with printMessage in it, and you put it on top of your already-replaced repository, then yes, it's going to say outputMessage. I think that's what was expected to happen, because from the other repository's point of view the replace was operating on both printMessages.

Q: So the patch, which explicitly had printMessage inside, now has outputMessage as a result?

Yes, and that's the point of what Darcs has done here: it has changed the contents of that patch because of the order in which it has put them. The patch itself has been changed by the fact that we've reordered it; it now says "add outputMessage" and doesn't have exactly the same text in it any more. That's exactly the point of having first-class patches: Darcs will change the representation of the patch in order to preserve the meaning of the patch, and the meaning of the repository that we had.

That's the point of using replace: you're kind of asking Darcs to do this. Looking at it from another point of view: say I've been developing a program with printMessage in it, and then I decide I'm going to rename printMessage to outputMessage, and in parallel to that, somebody else decides they're going to add a call to printMessage.

Q: I just don't like it...

Fair enough. But if pulling the "say goodbye" patch had resulted in a printMessage call at the end of repository two, that would have been totally incorrect: it wouldn't have been a functioning program. If you want this property about cherry-picking and so forth, where you can take patches out of order and then pull the other ones back later, it has to do that. I guess that's another way of justifying what I did, if you don't like the first one.

Q: Suppose I then add a call to outputMessage myself, and later remove the replace...

Right: if I now added a call to outputMessage in my repository, and then I unpulled the replace, removed the replace patch, it would change that call to printMessage, and that would be correct. But the name-capture point, which I thought maybe you were asking about, is that if I had, say, a local variable called printMessage or outputMessage, depending on the state in which I added it, then the replace patch is going to do the wrong thing with it.
For example, if somebody else on another branch has decided to write a function that internally uses a printMessage variable of its own, Darcs won't let you pull the two together silently: part of the Darcs user interface is that it detects things that don't make sense like that.

Q: And is replace global?

It's file by file. And there is no patch type for moving code between places in files; it's something we would like to add, but we haven't done it yet.

Okay, so I've already shown you revert; actually, let me show you a bit more of it. I'll edit demo.hs a little bit more and add some garbage in two places. So, I've just added a couple of lines, and now I want to get rid of those lines, and as I showed you before, Darcs offers those independently. So I can revert just the "wibble" bit, and then I'll be left with "wobble" in the file, and then I can revert that as well, and it's gone. Darcs also has an inverse of revert: Darcs tries to make most operations undoable (there are cases where that's not really feasible), and for example it has this "unrevert", which restores the last revert but not the one before that, so I can't get "wibble" back any more.

Okay, I'm going to change the file this time and keep the changes. I'll change this text, and record another change with that in it. Oops, I forgot to get rid of that other edit; never mind, I'll do that in a minute.

Uh, sorry, my laptop has frozen. I think it's frozen solid; this is what you get for running Windows. I'm going to have to reboot it... I think it may actually have hardware problems of some kind. All I can say is that my laptop does lock up a lot, and I do run Darcs on it a lot, but I think it's a hardware problem; I'm not entirely certain. I'm sorry about this. What I was about to do was record that patch, and then I was going to show you a command called amend-record, which is for editing a patch you've already recorded to add some extra changes to it. I'll do that next.

Q (while we wait): Is there a long-term plan to extend this to syntax trees?

Yes, but that's particularly difficult. The problem with extending it to know about syntax trees is that you have to fix your parser: you have to decide that this one parser that exists right now is the one we're actually going to use to parse the code. So if the language changes a bit, you end up having to version things. That's probably a reasonable way down the line. And you still have to keep the old parser around so that you can read old patches; you can record new patches with a new parser, but then you have to figure out how you merge between patches that were recorded with different parsers.

Q: You need the parser at recording time, when you're creating your patches...

Yes, you need the parser at the recording stage, but you also need the meaning of that tree, the pretty-printer, everywhere: you need to be able to invert what you recorded, to actually show the user what the real source would be.

Q: And if a patch just treats the code literally, as text?
Then you won't get that nice behavior I showed you: Darcs will simply refuse to let you cherry-pick around such a patch.

Okay, let me record this change, as I was trying to do; not that one... exactly. And I'm going to revert that other change that I forgot to revert before. Now I'm going to make another change to this file. Okay, I won't bother naming it carefully; that'll do. So now, this time, instead of recording a separate change for the edit I've just made, I want to actually incorporate it into this other patch, because I think the two belong together for some reason (in this case they don't really, but let's pretend). The command I can use for that is darcs amend-record. That goes back and says: okay, here's the patch you just recorded; would you like to change it? I ran it with a flag saying "edit", allowing me to edit the message as well. So: shall I amend this patch? Yes. Here's the change I've just made to my repository; shall I add this change to the patch that's already recorded? Yes. And it gives me a chance to edit the message, so I'll note the addition.

That's useful if you're working with draft patches as you go, and you keep wanting to add changes to them. Eventually you're done with a patch, and then you say, okay, I'm going to send this off to other people, and at that point you should stop amending it.

Q: What if you make a change, pull it into the other repository, then amend the change and pull it again?

It will look like a different change, and it will be a conflict. You then have to go around unpulling the other one.

Okay, let me just go back to my slides for a second. Does that work? Yes. So, I've shown you a bunch of the basic command set of Darcs and roughly what the model is. A new command that we're adding to Darcs in the next release, which will come out in probably the next few months, is a rebase command. Rebase is a concept that exists in systems like git, and it hasn't existed in Darcs before. In part that's because many of the things you'd want to use rebase for in git aren't really necessary in Darcs: you don't have all these merge commits with Darcs, because whenever you do a merge you always get the same result, so Darcs doesn't have to present you with a merge commit; and you don't need rebase to do cherry-picking, which is another thing it's used for in git. But nonetheless there are cases where you do want to do a rebase: you want to change a whole sequence of patches in your history, so that they're more up to date or so forth. So we're adding a rebase command.

The way the rebase command works is that it puts your repository into a temporary state where you're doing some work on some patches while some other patches have been suspended, and then you can bring those patches back into your repository with an unsuspend command. I'll just show you that now. I'm going to throw away that change I just recorded and do the same thing again, but with a slight twist to it.
Never mind the typo. I'm going to make a second change on top of that: I also want this output to go to standard error. So here's my second change, "use standard error". Now I want to do the amend that I did before: I want to put that extra call back in, and I'd like to change the patch I recorded before, the one where I changed the message. I don't want to edit the standard-error patch, so I say no to that. But then Darcs isn't going to let me do an amend-record on the "update message" patch, which is the one I actually want to change, because I've made a patch on top of it that depends on it. There's no way those two patches can be logically untangled from each other: Darcs has made a line-by-line diff, and the diff for the standard-error change lives directly on top of the diff where I changed "says" to "wants to say". So I'm out of luck if I want to amend it, and that's where rebase comes in.

Rebase will say: okay, well, this standard-error patch is in the way; let's push it out of the way for a bit. So that has disappeared from our repository and is now suspended. If I look at the changes in the repository, you can't see it any more, but Darcs will keep reminding you of the fact that it's still around and that you can get it back. Okay, so now I can amend this patch. Okay, great, so that's done; now I want my suspended patch back.

Now, how is Darcs going to deal with that? Because we haven't magically got around the fact that the second change I made, inserting the standard error, really does depend on that first change. Okay, so it's going to let me unsuspend it; it's going to offer it to me; there we are. Hmm, now, I think there's a bug in the development version I just used, because it should have told me that it actually had a conflict... oh, right, sorry, no: I got this demo a bit wrong; that was actually supposed to happen, and it was just not quite what I intended to do with the demo. What happened here is that when I changed that patch, I didn't change anything about the line that was actually the problem. If I had tried to do anything to the output-message text, then Darcs really would not have been able to figure out how to unsuspend that patch back into my repository. But because all I did was add the extra call to that change, that was fine, and it was able to do it. However, because of some technicalities in the way the internals of Darcs work, it has still brought this patch back with a different identity, so it will actually clash with other patches in other repositories, even though it looks like exactly the same patch.

Now let me show you what would have happened if I'd done what I intended to do. I'll suspend that patch again, and this time I'm going to edit demo.hs to also say something different on that line (I can't make up my mind which editor I want to use). Okay, I'd like it to say that instead; and now I can change this "update message" patch so that it's a patch that does this. And now, if I unsuspend the suspended patch, it tells me that things have gone a bit wrong: it can't quite unsuspend this patch and have it be the same patch locally, and it has
also inserted some conflict markers into my repository, Darcs's own unique style of doing conflicts, to show me the local changes that have gone wrong.

Q: So it marks up the repository?

Yes; I mean, it marks up your working copy like most systems would. It lets you do the operation, but says there's a conflict, and this is what the conflict is.

Q: The original change that was the cause of the conflict is still in your repository?

Yes, the original one that was the cause of the conflict is still there, and it's fine; and this one is in a slightly conflicted state that I'm now going to fix. Now, I haven't shown you conflicts in the normal case of merging; in that case the resolution really is another patch on top. When you're using rebase, the normal thing to do is to fix your conflict and then amend the fix into the patch you just unsuspended. I won't bother doing it here, because I've already spent quite a while on this demo that I didn't intend to. But what this conflict is telling us is: the first part, before the equals signs, is the base of the conflict, the original state; and then there are two different changes that are conflicting with each other, one of which inserts the standard-error call, and the other of which changes "wants to say" to "would like to say". Our job is to see that, make the change ourselves, and then edit the patch we just unsuspended to bring all of that into line.

Okay, let me go back to my talk; I've done that bit of the demo. This is just a quick slide to show you the main set of commands... oh, sorry, wrong way around; thank you, that would be good.

(In response to a suggestion:) Yes, in fact, another thing we could do that would mean you wouldn't get a conflict there is patches that apply to the characters of a line, rather than line-by-line patches, and that's probably more within reach as well.

Okay, so, just to give you a very brief overview of the sets of commands you can use with Darcs and how they all fit together. One interesting thing is the specific kinds of patch types up here: I already showed you replace, and add, and move. Moving a file around is also a kind of operation that merges nicely with other patches, as does adding files and so forth. Then there are the standard remote operations: get and its mirror put, pull and push. Obliterate removes a patch; it used to be called unpull, and it still has that alias.

There is a staging area, called pending, which we try our absolute best to hide from you, unlike git. Pending is used for things like replace: something that can't be represented directly in the files of the repository. If you've made a semantic statement about a change to your repository, one that can't just be inferred from the files, then we put that into pending; otherwise, you just record what's in your working directory.

Q: Doesn't an index make sense? The way you assemble changes into a story is exactly what you're doing when you record changes.

Yeah; I think the mixture of pending being hidden, plus the fact that amend-record is designed to be fairly user-friendly, are two things that make it unnecessary to
have an index, an equivalent of git's staging area.

Okay, and there are also some commands for querying a repository; but for basic operation with a Darcs repository you don't need to know all that many commands, I hope I'm getting that across.

Okay, so let's go into a bit more technical detail about what's inside a Darcs repository. Darcs patches are semantic descriptions of what should happen to the previous tree to give you the new tree. An example of a Darcs patch is: remove the text X that starts at line three in a particular file, and put text Y there instead. Other kinds of Darcs patch would be: add file A, or rename file C, or replace token X with token Y in a particular file.

An important property of Darcs patches, from the point of view of the internals of Darcs, is that the patches are invertible: anything that you can record, Darcs has to know how to undo as well, so that it can get back to the old tree. So, for example, it actually is important that the patch says "remove X at line three"; you can't just say "remove one line", because the internals of Darcs record the fact that the old version was X.

Q: So it won't be able to use a wildcard?

That's correct, yes. Though that's not totally out of reach: if you had a replace based on a regular expression that was invertible in some way, it would be technically possible to implement. We just haven't done it, and it would probably be quite confusing.

I'm not going to go into the details of why invertibility is important, but getting this reversible cherry-picking operation requires that you're able to manipulate patches in that kind of way, and inverting them is one of the necessary properties.

Another point about how Darcs patches work in general: in most version control systems, you're committed to the series of changes you made, in the particular order you made them, whereas Darcs lets you pull out arbitrary changes from the middle if there's no textual conflict. That means, of course, that you can pull out changes that really don't make sense to pull out. For example, if in one patch you add a function, and in another patch you add a call site for that function, and then you try to cherry-pick the call site on its own, Darcs will probably let you do it, provided the two weren't right next to each other in the file, but you're not going to have a meaningful repository. That's power it gives you, and you have to decide how to use it.

I've already alluded to some of this: merges in Darcs are deterministic. Every time you merge two particular Darcs patches, anywhere, you get exactly the same result, and that means Darcs doesn't create merge commits when you do a merge. If you get a conflict, then you have to record a new patch to resolve the conflict, and then you've got a rough equivalent of a merge commit; but in general there is no explicit marker in Darcs that says "a merge happened here".
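A minimal sketch of the invertibility requirement just described, with invented constructors rather than Darcs's real data types: every patch carries enough information to be undone, which is why a hunk must store the old lines as well as the new ones.

```haskell
-- Hypothetical patch type illustrating invertibility. "Remove one line
-- at 3" would not be invertible; "replace these old lines with these
-- new lines at 3" is, because the old text travels with the patch.
data Patch
  = AddFile FilePath
  | RmFile FilePath
  | Hunk FilePath Int [String] [String]  -- file, start line, old lines, new lines
  | Replace FilePath String String       -- file, old token, new token
  deriving Show

invert :: Patch -> Patch
invert (AddFile f)      = RmFile f
invert (RmFile f)       = AddFile f
invert (Hunk f l os ns) = Hunk f l ns os
invert (Replace f o n)  = Replace f n o
```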
And merging is associative: say you've got three repositories, and you decide to merge two of them together and then merge the result with the third one. You will always get the same result as if you had merged the second two as your first operation and then merged that with the first repository. That's part of the intuitive user model we're trying to get with Darcs: you don't really have to think about the order in which things happen. The Darcs user interface tries to make that implicit, so that you can expect it to happen without having to think about it, and it's important that it actually is true, so that you don't get nasty surprises that way.

Okay, so, going a bit more into the internals of how this all works. On the left I've got the kind of diagram that represents what happens when you have a merge operation. We've got two different repositories, with patch A and patch B in them respectively; that's the difference between them. The base of the merge is the bottom-left corner, and in the two different repositories we end up with two different states of the repository. The point of a merge operation, in any version control system, is to figure out what the question mark, the merged state, should be, and then do the appropriate thing with it.

In Darcs, supposing we were pulling B into a repository that already had A, we would compute an alternative version of B, call it B': the same effect as B, but in a different context, because it now lives after A. Depending on what A and B were, we might actually have to change B into something slightly different for it to still make sense.

Internally, Darcs has a concept called commuting, which is changing the order of two patches, and what I'm trying to get across with this diagram is that commuting really is just the inverse of the merge operation. With a merge, you have two patches that start at the same place, and you want to figure out how to combine them. The alternative way of looking at it is that you have two patches that are already in the same repository, in sequence, and you want to unpick that: you want to get back to two patches that are totally separate from each other. If you already know A and B', then what's B?

And that's the fundamental part of cherry-picking. Once you can swap the order of patches, then if we've got a repository with A and B', and we don't want A, we just want B', what we have to do is commute A and B' to turn them into B and A', and then we can throw away A', and we've got a repository that just has the change B in it. And when we do the merge again, because of all the stuff that Darcs does behind the scenes, it's going to give you exactly the same result, B', again. So that's a flavor of what's behind what I showed you.

Q: Could there be many ways of factoring? Given A', there could be multiple...

In the merge diagram, and in any of these diagrams, the answer has to be unique; that's one of the things Darcs ensures. If you were adding a new patch type to Darcs, then when you define what its commute is, you'd have to do it in such a way that it goes back to B' when you merge it again, otherwise you'd break Darcs.
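The merge/commute relationship the speaker describes can be written down schematically. This is a sketch under stated assumptions, not Darcs's code: patches are left abstract, `commute` is assumed as a primitive that may fail, and the two derived operations follow the construction in the talk (merge computes B' by commuting the inverse of A with B; cherry-picking commutes the pair back apart and discards the adapted A).

```haskell
-- `commute` reorders an adjacent pair of patches, adapting both, and
-- fails (Nothing) if they can't be separated.
type Commute p = (p, p) -> Maybe (p, p)

class Invertible p where
  invert :: p -> p

-- Merge: A and B both start from the same base, so the sequence
-- (invert A ; B) is well-formed. Commuting it yields (B' ; adapted
-- invert A); B' is B re-expressed to apply after A.
merge :: Invertible p => Commute p -> p -> p -> Maybe p
merge commute a b = do
  (b', _) <- commute (invert a, b)
  pure b'

-- Cherry-picking B' back out of the sequence (A ; B'): commute to
-- (B ; A'), then drop A'. Re-merging B with A must reproduce B'
-- exactly; that is the uniqueness requirement on any new patch type.
cherryPick :: Commute p -> p -> p -> Maybe p
cherryPick commute a b' = do
  (b, _) <- commute (a, b')
  pure b
```

Note how `cherryPick` is literally `merge` run backwards through the same primitive, which is the point of the diagram: one `commute` definition per pair of patch types yields both operations.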
Q: Will a repository store B, or B'?

It will store them in the particular order that's useful for that repository; it's per repository, exactly. So, conceptually, you can think of a Darcs repository as being a set of patches, in most cases: no matter what order you brought the patches in, the end state of your repository will be the same. But they are actually stored in a particular order that reflects how you got them, and when you run things like darcs changes, as we saw, that does show up.

There is user demand around that kind of thing: you quite often want to see which changes are in your repository but not in a remote repository, and it would be useful to reorder the patches in your repository so that just your local patches are at the end. We don't have a great command set for doing that; we do have one command that does some of it for you. Using rebase for it would be overkill, because rebase would change the identity of your patches, so you really wouldn't want to use it for that; and there's a command called darcs optimize reorder, which will reorder up to the last tag in the repository, but it's not general.

So there's this distinction in Darcs, in terms of both the mental model and the internals: patches have an identity, which is the thing they retain as you pull them around between repositories; but, as we saw, the representation of a patch can change as you pull it around, due to either merging or cherry-picking, depending on which way around you're pulling it.

So there is this kind of theory hiding inside Darcs, and hiding very well, because none of the people who really work on Darcs have managed to make a really formal theory of what's happening with these patches. That's a real shame, because it ought to be possible; just nobody's managed it. But I've been alluding to various things that you have to get right, otherwise you'll screw up the behavior of Darcs, and we intuitively know that these things have to behave this way. So there are some laws that we believe are true of all the Darcs patch types we have, and that would have to be true of any new ones we added. An important one, for example: if you commute two patches, and then you swap them back again, you must get back to exactly the same result. And also, if you invert patches and start commuting the inverses, you get the same result as if you had left them uninverted to begin with. I'll just put that up; I don't want to explain it in detail.
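Those two laws are easy to state as QuickCheck-style properties. The patch type `P` below is a placeholder with stub definitions, loudly an assumption: a real instantiation would supply actual patches, `commute`, `invert`, and an Arbitrary instance.

```haskell
import Test.QuickCheck (Property, property, (===))

-- Placeholder patch type and operations; the stubs stand in for a
-- real patch implementation with Eq/Show/Arbitrary instances.
data P = P deriving (Eq, Show)

commute :: (P, P) -> Maybe (P, P)
commute = undefined

invert :: P -> P
invert = undefined

-- Law 1: commuting is an involution. If (A, B) commutes to (B', A'),
-- then commuting (B', A') must give back exactly (A, B).
prop_commuteInvolution :: (P, P) -> Property
prop_commuteInvolution ab = case commute ab of
  Nothing  -> property True   -- pairs that don't commute are exempt
  Just ba' -> commute ba' === Just ab

-- Law 2: commuting inverses mirrors commuting the originals:
-- (A, B) ~> (B', A')  implies  (inv B, inv A) ~> (inv A', inv B').
prop_invertCommute :: (P, P) -> Property
prop_invertCommute (a, b) = case commute (a, b) of
  Nothing       -> property True
  Just (b', a') -> commute (invert b, invert a)
                     === Just (invert a', invert b')
```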
One more law (I've used quite a bit of time, so I'll go over this very quickly): I said that merges have to be associative, and the thing that underlies that, in terms of what your patches have to do, is this. If you start commuting patches around, say you have your patches in the order A, B, C to begin with, and you start swapping pairs of those patches around, and you keep doing that, then you're following a path around the cube in this picture. There's no absolute guarantee from any external source that you'll actually get back to exactly the same result when you've got to the end of that path, because the commutes you've done in each case are independent of each other. But that's another law that we expect to be true: you will get back to the same result. And if you think of that cube as a kind of merge, it's about guaranteeing that the top right-hand corner of the cube will always be the same, no matter what merges you did.

Okay, so I'll finish up fairly briefly by talking a little bit about the internals of Darcs, and one particular way we've used Haskell to good effect in those internals.

One thing that I hope is apparent from what I've already said about Darcs is that when the code of Darcs is working with patches, it's going to spend a lot of time moving patches around between different contexts: deciding that, okay, this patch used to have these patches before it, and now I need to cherry-pick it, or commute it, or merge it, so that it's got some other patches leading up to it. A lot of the time when you do that, the representation of the patch doesn't change: for example, if you've got two patches to two different files, the order they're in makes no difference to their representation. But if you've got two patches to the same file, then depending on the order they're in, you'll have to change the line offsets of the patch that's further down the file. So it's quite easy in the Darcs source to start using patches in the wrong context, and if you do that, you might not notice quickly, because a lot of the things you're testing with won't actually trigger the problem.

So here's an example of some code you might want to write inside Darcs. Suppose we want to commute two patches, where one side is the sequential composition of two others: we've got patch C, and we've got A and B composed in sequence (that's the intention of the bracketing), and I want to get to C' followed by the adapted composition of A' and B' as the result. Take an imaginary patch type which is either a null patch or two patches stuck together; obviously there would be some more constructors for the real patch kinds. You might write a function "commute" with a case for commuting this composite (A B) as one group, with C as the other. How are you going to do that? Typical Haskell: you do this kind of operation by decomposing it into its sub-pieces: commute one of the two components with the next thing, and then commute the results to get everything into the right order. That's one reasonably plausible way of writing it, and it happens to be totally wrong. If you just look at the order of things, it makes sense to commute B and C first, and then commute the result with A. But what I've written here commutes A and C first, and A and C aren't next to each other originally, so it really doesn't make sense to do that.
But there's nothing in this code that's going to stop me from doing it, because GHC doesn't know anything about the contexts of these patches. So one thing we introduced in Darcs quite some time ago is a way of actually telling GHC, at least to some extent: these are the contexts of a patch; don't let me screw up.

So, instead of defining patches plainly, I've introduced a couple of type variables here, and these type variables are phantom types: they have no reflection in reality, in terms of actual values. It wouldn't even make sense for me to say "Patch Int Char" or something like that; that's not what they're about. They're an abstract representation of the context that the patch lives in. And it means that if I write certain low-level operations carefully, and don't screw up the contexts in those low-level operations, then in the high-level operations that I build on top of them, the type checker will actually stop me from getting it wrong.

The key point of this is really that the sequential composition operator now says the contexts must match up. A "Patch a b" is a patch that notionally goes from an initial context a to a final context b. The sequential composition of two patches goes from an initial context a, via an intermediate context b, to a final context c: so the composition is a patch from a to c, and it's made up of two individual patches, one that starts at a and finishes at b, and one that starts at b and finishes at c.

To actually work with this kind of thing, you find yourself having to redefine a whole load of basic types. What used to be just a tuple now has to be a witness tuple, with these extra witness types in it, that says: okay, this is a pair of patches that I'm going to want to commute. That's needed to be able to write the type of commute: commute now talks about the fact that we've got a pair of patches that goes from a to c in total, and I want another pair of patches that goes from a to c in total but is in the opposite order internally.

So I can translate that incorrect code from the previous slide into code using these phantom types, and GHC will complain. Now, the downside of using these witnesses is that the GHC complaint takes a little bit of getting used to. With some practice, you get used to understanding what it's trying to tell you here, which is that there was some other variable b1 that would have had to exist for this to make sense, and it doesn't really work. In particular, you can see from the "expected versus actual" mismatch, b1 being different from b, that you've got the intermediate context of this commute wrong. It certainly makes the Darcs code harder to get into for newbies; it also stops newbies screwing up the Darcs code; so it's a two-edged sword.

These contexts can't do everything: you do have to get the low-level operations right for the stuff around them to make sense. And in some cases you actually have to insert assertions that say: I've thought about this; I can't convince the type checker that this is correct, but I've thought about it, and it is correct, so trust me, please.

Okay, so that was just a brief overview of all of that.
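A reconstruction of the idea in miniature; the names and constructors here are illustrative, and Darcs's real witness machinery in its source tree is more elaborate. The phantom variables wX and wY never hold values: they only name the contexts a patch starts and ends in, and the composition type hides the shared intermediate context existentially.

```haskell
{-# LANGUAGE GADTs, TypeOperators #-}

-- Contexts are phantom: wX and wY are never inhabited by values.
data Patch wX wY where
  Identity :: Patch wX wX
  -- real constructors (hunks, file adds, replaces) would go here

-- Sequential composition: the end context of the first patch must be
-- the start context of the second; the shared context wZ is hidden.
data (p :> q) wX wY where
  (:>) :: p wX wZ -> q wZ wY -> (p :> q) wX wY

-- Commute preserves the overall endpoints while swapping the order
-- internally. The buggy grouping from the slide (commuting A with C
-- when they aren't adjacent) no longer type-checks, because the
-- intermediate contexts fail to line up and GHC reports the mismatch.
commute :: (Patch :> Patch) wX wY -> Maybe ((Patch :> Patch) wX wY)
commute (Identity :> q) = Just (q :> Identity)
commute _               = Nothing  -- falls through for non-commuting cases
```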
So what are we going to do with Darcs next? Well, 2.10, which is the upcoming release and will come out in probably the next few months, will have rebase in it, and it will also have a fast annotate. If you've used Darcs before, you'll know that the darcs annotate command is one of the things that really is very slow. What we've done to make that a lot better is introduce what's called a patch index: a lookup from the lines in a file to the patches that touch each line, so Darcs can very quickly find out which patches affected a specific file.

Q: So annotate depends on the order of patches in your repository?

Yes, exactly: annotate will give you a different output depending on what order those patches are in your repository. There are plenty of commands that depend on that: changes, just the list of changes; annotate; the way patches are presented to you interactively; all those things will be different depending on the order of your patches.

We're also developing a bridge to git, but it's not really ready for prime time yet, so we want to get that ready. We don't have a great hosting story: we do have a standard host for Darcs repositories that you can put online and so forth, but it's nowhere near as fully featured as GitHub, so we'd like to improve that. Most of interacting with Darcs is very text-based; in fact, all of it is. The Darcs command line is just a text tool, and there are no standard graphical tools for it. What we'd like to do is make something web-based, the same code base as the hosting code that we have, that brings up a web browser locally and allows you to interact with your repository like that; but that's still in progress. We'd love to have multi-head repositories; that's a bit of implementation effort. Conflict handling is a mess for a couple of different reasons, one of which is that we don't actually have a good algorithm for figuring out deeply nested conflicts and making them behave properly, and we hope to improve that in the future.

And more patch types: that's kind of the point of Darcs, really. What Darcs lets you do at the moment is nice and useful, and I still personally find it a lot more usable than git, but we could make Darcs really nice to use by having more patch types that let you do things like moving a chunk of text from one place to another, or indenting a whole set of lines by a certain amount, and have those things merge nicely with other kinds of patches, where that makes sense.
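Returning briefly to the patch index mentioned at the top of this section: here is a sketch of the idea with invented types. Darcs's actual on-disk index is more involved, but the essential shape is a two-level lookup, from file and then line, to the patches that touched it.

```haskell
import qualified Data.Map.Strict as M

newtype PatchId = PatchId String deriving (Eq, Show)

-- file -> (line number -> patches that touched that line)
type PatchIndex = M.Map FilePath (M.Map Int [PatchId])

-- Annotate one line with a pair of map lookups, instead of replaying
-- every patch in the repository to see which ones touched the file.
annotateLine :: PatchIndex -> FilePath -> Int -> [PatchId]
annotateLine idx file ln =
  maybe [] (M.findWithDefault [] ln) (M.lookup file idx)

-- All patches that touched a file at all, for file-scoped queries.
patchesForFile :: PatchIndex -> FilePath -> [PatchId]
patchesForFile idx file =
  maybe [] (concat . M.elems) (M.lookup file idx)
```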
Q: Doesn't the cost of adding a new patch type grow with the number of patch types you already have?

Yes, you're right: basically, that quadratic explosion will happen, but there are some mitigating factors, so it's not quite that bad. There are some pairs of patch types that would never make sense to merge with each other. For example, if you had a semantic patch type that treats a source file as a parse tree and does things with that, then a patch that inserts lines in the middle of that file is never going to merge with it sensibly; I think we'd just say that's a conflict. The other thing is that patches that affect the internal contents of a file are independent of patches that move things around in the file system, and so on. But that is part of why it's difficult to add new patch types.

Okay, so that was all I had to say, really. If you want to know more about Darcs, it's got a homepage and a hosting site, and if any of you are students and interested in doing some work on Darcs, we are participating in Google Summer of Code via haskell.org, so please apply. [Applause]

Q: On moving to syntax trees: other than the complexity of the compiler, conceptually how hard is that?

I don't know what the merge operations on the basic patch types for trees would be. Let's say we've parsed a source file and we've got a tree, and we need to define changes to that tree: it's not completely obvious to me what those changes should be, what the actual patch types on trees would be. And given that it's not obvious, you're also going to need a good degree of certainty that it's actually correct in all cases and so forth, so you'd probably want to bring in a theorem prover or something like that to check it out. That's quite hard; some work.

Q: Your demo was on a single file...

So, there are patch types that affect an entire directory at a time: you can rename a directory, and all the files within it will move, and that will merge and commute nicely with, say, adding a new file to that directory on an independent branch.

Q: And two files? If you edited two files, is that two separate patches?

Well, you can stick them together into the same named patch; you wouldn't have to record them as separate patches. If you make a change across your source tree, the interactive record will ask you about all those changes to all the files at once, and you can select some of them to be stuck together into one patch. And yes, that will be an atomic change set, which brings us to the next question.

Q: How small should a patch be?

I think it varies quite a bit: anything from the quick typo fix, which is worth recording as a separate patch, to maybe a few hours' work. The Darcs repository itself has about 10,000 changes in it. Here's a very simple one that I made that was just fixing a test; this one looks quite a bit bigger, though I'm not sure exactly what it was about. You want to group together work that belongs together logically, because otherwise there's not much point in having atomic change sets. And you want to stop people cherry-picking too much as well: you want to encourage them to cherry-pick things that make sense, and not things that don't.

Q: How complex is the code that does the transformation when patches commute?

Not very; it's basically a set of rules. The most interesting cases come in when you're, say, merging or commuting two hunk patches, individual line changes, to the same file: then you have to make sure it's safe to do, and you have to adjust the offsets appropriately. Similarly, if you're doing that with file renames and so forth, you need to make sure that you change the patches that touch that file. But the code for doing that is itself only a few pages, that sort of scale.
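A minimal version of the hunk/hunk rule described in that answer, with invented types (Darcs's real code handles more cases): commuting two hunks to the same file means shifting one hunk's line offset by the other's net line change, and refusing when the hunks overlap.

```haskell
data Hunk = Hunk { hLine :: Int, hOld :: [String], hNew :: [String] }
  deriving (Eq, Show)

-- `a` is applied first, then `b` (whose line number is therefore in
-- post-`a` coordinates). Returns (b', a') with b' applying first, or
-- Nothing if the hunks can't be separated.
commuteHunks :: Hunk -> Hunk -> Maybe (Hunk, Hunk)
commuteHunks a b
  | bEnd < hLine a =                           -- b lies entirely above a:
      Just (b, a { hLine = hLine a + netB })   -- a shifts by b's net change
  | aEnd < hLine b =                           -- b lies entirely below a:
      Just (b { hLine = hLine b - netA }, a)   -- undo a's shift on b
  | otherwise = Nothing                        -- overlapping or adjacent
  where
    netA = length (hNew a) - length (hOld a)
    netB = length (hNew b) - length (hOld b)
    aEnd = hLine a + length (hNew a)
    bEnd = hLine b + length (hOld b)
```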
Thank you. [Applause]
# Document Title
Bzr Init: A Bazaar Tutorial
Duck Rowing, by Fred McCann
Posted On December 26, 2013

Around seven years ago I made the leap from Subversion to distributed version control. The problem I was trying to solve was simple: I used to commute, which meant 2-3 hours of my days were spent on a train. To pass the time, I would learn new programming languages or frameworks. Eventually I became very productive in spite of the fact that a train is a hard place to work and battery technology was very poor.

I'd been a Subversion user, but there was no way I could work offline on the train. While this wasn't the only reason I considered switching to distributed version control, it was the straw that broke the camel's back.

Why Bazaar?

Bazaar is perhaps the least popular option of the big three open source DVCS: Git, Mercurial, and Bazaar. At this point, I think you can make the case that Git is by far the most commonly used option. In fact, when I was experimenting on the train, Git was the first tool I tried. After a few weeks with Git, it was clear things weren't working out.

I still use Git, though not as often as other systems. Because of Github and Git's popularity, it's hard not to use Git. I also use Mercurial almost every day to collaborate on some projects that have chosen Mercurial. And obviously I'm a Bazaar user: the majority of the projects I work on are tracked by Bazaar.

I'm advocating for Bazaar, or that you should at least learn more about it. Unlike the infamous Why Git Is Better Than X site, I'm going to do my best to stay away from focusing on minor differences that don't matter, or blatantly inflammatory misinformation. A lot of DVCS advocacy focuses on the benefits of DVCS in general and attributes these benefits to one tool. In reality, Git, Mercurial, and Bazaar are all completely distributed systems, and all share the same basic benefits of DVCS.

In that spirit, I'd like to get a few things out of the way:

- Developers successfully use Git, Mercurial, and Bazaar on projects both large and small.
- Git, Mercurial, and Bazaar are basically feature equivalent.
- Git, Mercurial, and Bazaar are all fast.

What I will focus on is how much friction a tool introduces, that is, how easy it is to get work done, and the philosophies and design decisions which have an impact on how the tools are used. What makes Bazaar the best choice is that it's by far the easiest tool to use, but more important, it gets the DVCS model right.

An Ideal Model of Distributed Version Control

I assume anyone reading this likely already uses a DVCS, or is at least aware of the benefits. In a nutshell, when you have a number of people both collaborating and working independently on the same codebase, you essentially have a number of concurrent branches, or lines of development. A DVCS tool makes creating and merging these branches easy, something that was a problem with Subversion or CVS. This is the core aspect of DVCS from which all the benefits flow.

The most visible consequence of this is that sharing code is divorced from saving snapshots. This is what solved my working-offline problem, but in general it gives people a lot more control over how they can model their workflows with their tools. The most important aspect of a DVCS is how well it models and tracks branches, and the ease with which we can fork and merge them.

Before examining why I think Bazaar gets the DVCS model correct, I'm going to present an ideal view of a project history.
What I'm presenting here should be uncontroversial, and I'm going to present it in tool-neutral language. Assume the following scenario:

1. (trunk a) Steve creates an initial state as the beginning of a "trunk" branch
2. (trunk b) Steve implements feature 0 in a mirror of trunk and pushes upstream
3. (branch1 a) Duane branches into branch1 and commits work
4. (branch1 b) Duane commits work in branch1
5. (trunk c) Duane lands his work into trunk by merging branch1 and pushing upstream
6. (branch3 a) Pete branches into branch3 and commits
7. (branch2 a) Jerry branches into branch2 and commits
8. (branch3 b) Pete commits work in branch3
9. (branch2 b) Jerry commits work in branch2
10. (branch2 c) Jerry commits work in branch2
11. (branch3 c) Pete commits work in branch3
12. (trunk d) Jerry lands feature 2 into trunk by merging branch2 and pushing upstream
13. (trunk e) Pete lands feature 3 into trunk by merging branch3 and pushing upstream

We have four branches: Trunk, Branch 1, Branch 2, and Branch 3. A branch is a line of development. We could also think of it as a thread of history.

[Figure: "Ideal" project history diagram]

The dots on the diagram represent snapshots. The snapshots represent the state of our project as recorded in history. The ordered set of all snapshots is:

Project = {A, B, 1A, 1B, C, 2A, 2B, 2C, D, 3A, 3B, 3C, E}

This is the complete set of snapshots. You could call this ordering the project history across all branches. Note that this is not ordered by the time the work was done, but rather by the time work arrived in the trunk. Ordering either by merge time or actual time could be valid, but this ordering is more beneficial for understanding the project.

Also, snapshots have parents. The parent of a snapshot is the set of immediately preceding snapshots in the graph. For example, the parent of snapshot 3B is snapshot 3A. The parent of snapshot D is {C, 2C}. A snapshot that has two parents represents the point at which two branches were merged.

We can further define the snapshots that compose the branches:

Trunk = {A, B, C, D, E}
Branch 1 = {1A, 1B}
Branch 2 = {2A, 2B, 2C}
Branch 3 = {3A, 3B, 3C}

Furthermore, we can define the concept of a head. A head, generally, is the most recent snapshot in a branch, after which new snapshots are appended.

Trunk head = E
Branch 1 head = 1B
Branch 2 head = 2C
Branch 3 head = 3C

With this ideal view of history, there are a number of questions we can answer, such as:

What is the history of a branch? The history of branch X is the set of snapshots that compose the branch.

What is the common ancestor of two branches (i.e. when did two branches diverge)? This can be answered by traversing the graphs of the branches, starting at the head, then working backwards through the graph to find a common ancestor.

When did two branches merge? To answer this, we find a snapshot that has two parents, one from each branch.

In which branch was a snapshot added? This is a question we'd ask when we want to determine which branch, or line of development, introduced a change. We can determine this because a snapshot is a member of only one branch's set of snapshots.

Like I said, nothing here is controversial. The qualities I'm describing are what make it possible to make sense of our project's history.
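The model is concrete enough to execute. Below is a small sketch of it (the type names are mine, not from any of the tools discussed): snapshots with parent sets form a directed acyclic graph, and the divergence question reduces to intersecting ancestor sets.

```haskell
import qualified Data.Map.Strict as M
import qualified Data.Set as S

newtype SnapId = SnapId String deriving (Eq, Ord, Show)

-- Each snapshot maps to its parents: one parent normally,
-- two for a merge snapshot (e.g. D has parents C and 2C).
type History = M.Map SnapId [SnapId]

-- All ancestors of a snapshot (including itself), walking parent links.
ancestors :: History -> SnapId -> S.Set SnapId
ancestors h s0 = go (S.singleton s0) [s0]
  where
    go seen []       = seen
    go seen (x : xs) =
      let ps = filter (`S.notMember` seen) (M.findWithDefault [] x h)
      in  go (foldr S.insert seen ps) (ps ++ xs)

-- "When did two branches diverge?": intersect the ancestor sets of
-- the two branch heads.
commonAncestors :: History -> SnapId -> SnapId -> S.Set SnapId
commonAncestors h a b = ancestors h a `S.intersection` ancestors h b
```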
If we made the assumption that branches were feature branches, a view of the trunk would look like this:

```
E [Merge] Implement Feature 3
D [Merge] Implement Feature 2
C [Merge] Implement Feature 1
B Implement Feature 0
A Initial State
```

The view of the trunk is the "big picture" of the project. The merge badges are an indicator that there is more history available in the associated feature branch's history. For example, Branch 2 might look like:

```
2C Completion of Feature 2
2B Partial Completion of Feature 2
2A Start work on Feature 2
```

The branches don't have to be features; they could represent individual developers, teams, etc. The distinction is that branches are lines of development with histories, and we want to see different levels of detail in the history depending on what branch we're examining.

## Git

Git is by far the most popular of the big three open source DVCS. It is also an order of magnitude harder to learn and to use than either Mercurial or Bazaar. I've expounded on this in my post on the Jenkins incident and one on Cocoapods.

At this point, I don't think it's inflammatory to point out that Git is difficult to use. A simple Google search will turn up enough detail on this point that I won't address it here. Git's interface is a very leaky abstraction, so anyone who's going to be very successful using Git will eventually have to learn about Git's implementation details to make sense of it.

This introduces a lot of cognitive overhead. More plainly, a user of Git has to expend a significant amount of focus and attention on using Git that would otherwise be spent on their actual work.

This overhead is inherent in a lot of Git commands, as the metaphors are fuzzy and a lot of commands have bad defaults.

While Git's interface is more complicated than is necessary, it certainly isn't a large enough concern to stop people from successfully using it, or other UIs built on top of it.

The core problem with Git is that it gets the branching model wrong.

The first problem with Git is that it uses colocated branches by default. A Git repository has a single working directory for all branches, and you have to switch between them using the git checkout command.

This is just one source of cognitive overhead. You only have a single working tree at a time, so you lack cues as to which branch you're operating in. It also makes it difficult to work in more than one branch at a time, as you have to manage intermediate changes in your working directory manually with git stash before you can perform the switch.

A better default would be to have a directory per branch. In this simpler scenario, there's never a concern about collisions between changes in the working directory, and you can take advantage of all the cues from working in separate directories, as well as of every tool that's been developed over the years that understands files and directories.

Colocated branches essentially make directories modal, which requires more of your attention.

Of course, you could make the case that colocated branches take up less space, but considering how cheap storage is, this doesn't impact most projects, which is why it's an odd default behavior.

Another case you could make is that you can share intermediate products, such as compiled object files. While this seems like a good rationale, a better, simpler solution would be to use a shared directory for build products.
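To make that friction concrete, here is a minimal sketch of the colocated-branch dance; the branch and file names are hypothetical:

```sh
# Working on branch2 with uncommitted changes...
git checkout branch2
echo "work in progress" >> feature2.txt

# ...an urgent fix is needed on master. Park the changes first:
git stash
git checkout master
# ... make and commit the fix ...

# Come back and restore the parked changes:
git checkout branch2
git stash pop
```

With a directory per branch, none of this bookkeeping exists: you just cd to the other branch's directory.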
While colocated branches aren't a deal breaker, there is a more fundamental flaw in Git's branch model. As I described in the ideal DVCS model, a branch is a line of development or a thread of history. Git in fact does not track this history. In Git, a branch is merely the head pointer, not the history.

In Git, branch 3 = 3C, not {3A, 3B, 3C}.

What we would consider the history only exists in the reflog, which will eventually be garbage collected.

By default, when we send our work up to a remote repository, not only is the reflog information not sent, the head pointer is also discarded, so any notion of a branch is lost.

This means that Git does not match the ideal model I've put forth, and is not able to answer certain questions, such as: in which branch was a snapshot added?

Let's take the example in the ideal model. We'll assume each branch is a feature, and that a separate developer is working on each in their local repository. We'll also assume that our "trunk" branch is the master branch on some remote repository.

If we use the default behavior in Git, and make all the snapshots as defined in the ideal model in the same chronological order, this is what our graph looks like:

[Figure: the history graph Git records for this scenario]

```
$ git log
commit 314397941825fb0df325601519694102f3e4a25b
Merge: 56faf98 a12bd9e
Author: Pete

    E Implement Feature 3

commit 56faf98989a059a6c13315695c17704668b98bda
Author: Jerry

    2C Complete Feature 2

commit a12bd9ea51e781cdc37cd6bce8a3966f2b5ee952
Author: Pete

    3C Complete Feature 3

commit 4e36f12f1a7dd01aa5887944bc984c316167f4a9
Author: Jerry

    2B Partial Completion of Feature 2

commit 2a3543e74c32a7cdca7e9805545f8e7cef5ca717
Author: Pete

    3B Partial Completion of Feature 3

commit 5b72cf35d4648ac37face270ee2de944ac2d5710
Author: Jerry

    2A Start work on Feature 2

commit 842dbf12c8c69df9b4386c5f862e0d338bde3e01
Author: Pete

    3A Start work on Feature 3

commit b6ca5293b79e3c37341392faac031af281d25205
Author: Duane

    1B Complete Feature 1

commit 684d006bfb4b9579e7ad81efdbe9145efba0e4eb
Author: Duane

    1A Start work on Feature 1

commit ad79f582156dafacbfc9e2ffe1e1408c21766578
Author: Steve

    B Implement Feature 0

commit 268de6f5563dd3d9683d307e9ab0be382eafc278
Author: Steve

    A Initial State
```

Git has completely mangled our branch history.

This is not a bug; this is a fundamental design decision in Git. For lack of a better term, I'd differentiate Git from other DVCS systems by saying it's patch oriented. The goal of Git is to arrive at some patch that describes the next state in a project; tracking history is viewed as less important than tracking the patch or the content.

This means we don't have a good way of figuring out what's going on with our conceptual trunk branch. Only one branch was recorded (incorrectly), and everything is shoved into master in chronological order. We also don't have a good way to see the history of any particular branch, nor which branch a branch was forked from, nor when it was merged back in.

If you've ever had to figure out how some change entered the system or whether some branch is cleared to be released, you'll know that this matters.

Git's broken branching model means the burden of correctly recording or interpreting history is thrust upon the user.

The safest workaround is to do all branch merging with the --no-ff option. This step will correct the object graph, since it will memorialize all merges. The default Git graph loses snapshots {C, D}, but preventing fast-forward merges will fix that.
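A minimal sketch of that workaround, with branch and message names following the running example:

```sh
# On the trunk, refuse a fast-forward so the merge itself is recorded:
git checkout master
git merge --no-ff branch2 -m "D [Merge] Implement Feature 2"
```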
The second problem is that Git doesn't capture enough information in each snapshot to remember the branch. To get around this, we have to manually insert that information, in-band, in the commit message.

So, when a developer commits work, they'll manually tag the commit with branch metadata. Then we can at least use git log's grep option to see work clearly:

```
$ git log --grep="\[Trunk\]"
commit 5d155c1d81b9c2803f2f2de890298ca442597535
Merge: 6a4ba95 0ed512b
Author: Pete

    E [Trunk][Merge] Implement Feature 3

commit 6a4ba950c807d9cb8fe55236e8d787b4fd4a663a
Merge: f355800 36ef31b
Author: Jerry

    D [Trunk][Merge] Implement Feature 2

commit f3558008a174e56735d95432b5d27cf0a26db030
Merge: df9f0df 3bdf920
Author: Duane

    C [Trunk][Merge] Implement Feature 1

commit df9f0df24faf0de5e8653626b9eb8c780086fc28
Author: Steve

    B [Trunk] Implement Feature 0

commit 67ba4b110a8cb45ba4f5a4fc72d97ddffafd7db0
Author: Steve

    A [Trunk] Initial State

$ git log --grep="\[Branch1\]"
commit 4b327621647987c6e8c34d844068e48dab82a6ab
Author: Duane

    1B [Branch1] Complete Feature 1

commit 469a850c458179914f4a79c804b778e2d3f1bfbe
Author: Duane

    1A [Branch1] Start work on Feature 1
```

This is sort of a clunky and error-prone approach, but tagging commit messages is a common way to compensate for Git's lack of branch support.

There is a more drastic and dangerous approach, which is to use a Git rebase workflow, potentially with squashing commits.

In this scenario, rather than adding more information to commits, you actively rewrite history to store even less data than Git captures by default. In the rebase workflow, rather than getting the mangled Git graph produced by the normal Git process, we get a straight line:

[Figure: the rebased history, a single straight line of commits]

It's ironic that the rebase workflow, a reaction to Git not storing enough information to model or make sense of project history, is to erase even more information. While the straight-line graph is easier to read, it's completely unable to answer the questions laid out in the ideal model.

This is what a Git user might refer to as a "clean" history. Your average Mercurial or Bazaar user is confused by this, because both of the other tools correctly model branches, and neither one would consider the butchered object graph "clean". The rebase workflow conflates readable logs with "clean" histories.

Rebasing is significantly worse than the --no-ff workaround, as you can squash commits that might actually matter. Another problem with rebasing, beyond that you might accidentally select the wrong commits, is that it prevents collaboration.

In order to use a rebase workflow, you have to adopt byzantine notions of "private" and "public" history. If you rebase away (i.e. discard) commits that have been shared with others, it leads to complicated cascading rebases. So, if I'm working on a feature and I intend to use the rebase workflow, I'm not going to be able to make ad-hoc merges with colleagues, because no one can count on my snapshots sticking around. Basically, sharing my work becomes difficult. One of the benefits of DVCS is that collaborating should be easy.

It's very hard to recommend Git. It's the most difficult tool to use, and its broken branch model gives rise to Git-specific solutions to Git-inflicted problems that Git users may think are best practices rather than kludges.

## Mercurial

Mercurial is much easier to use than Git. This is not to say that there aren't some UI concerns. The biggest gripe I have with Mercurial is that it separates pulling changes into a repository from merging changes, which can lead to a pending merge situation where you have multiple heads on the same branch because they have not yet been merged.
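Concretely, the two-step dance looks something like this (a minimal sketch; the repository path is hypothetical):

```sh
hg pull ../colleague-repo   # changesets arrive, but nothing is merged yet
hg heads                    # shows two heads on the same branch: a pending merge
hg merge                    # reconcile the two heads
hg commit -m "merge"
```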
Normally this isn't a problem, but if you're careless, you can end up with three or four heads in a repository before catching it.

I've never understood whether the manual multiple-heads pull/merge pattern was considered a feature in Mercurial, or whether it was an implementation detail leaking through. It certainly feels like the latter, just as Git's "feature" of exposing the index to users does.

That said, the gulf between Mercurial usability and Git usability is akin to the distance between New York City and Los Angeles. While Mercurial's UI is more complicated than Bazaar's, that difference is more like the distance between New York City and Boston, so I won't belabor the point.

Mercurial has two popular models for branching (though I believe there are technically four). The first and most common model is to branch by cloning. You can think of this as a euphemism for not branching at all.

In Mercurial, all work happens in a branch, but by default, all work happens in the same branch, called the "default" branch. The primary advantage branching by cloning has is that it's easier than the alternatives. There are fewer commands to know and call, and there's a working directory per branch. However, if we apply all of the commits in the same chronological order as I described in the ideal model example, here's what our changeset graph looks like:

[Figure: the history graph Mercurial records with the clone-as-branch model]

```
$ hg log
changeset:   10:d21a419a293e
tag:         tip
parent:      9:53d61d1605f9
parent:      6:7129c383aa3b
user:        Pete
summary:     E Implement Feature 3

changeset:   9:53d61d1605f9
user:        Pete
summary:     3C Complete Feature 3

changeset:   8:fb340c4eb013
user:        Pete
summary:     3B Partial Completion of Feature 3

changeset:   7:2a35bc51a28d
parent:      3:e9292c14c2b2
user:        Pete
summary:     3A Start work on Feature 3

changeset:   6:7129c383aa3b
user:        Jerry
summary:     2C Complete Feature 2

changeset:   5:b13281f66a25
user:        Jerry
summary:     2B Partial Completion of Feature 2

changeset:   4:187abbd1b3c4
user:        Jerry
summary:     2A Start work on Feature 2

changeset:   3:e9292c14c2b2
user:        Duane
summary:     1B Complete Feature 1

changeset:   2:a5e8ccb38d38
user:        Duane
summary:     1A Start work on Feature 1

changeset:   1:b60a08bf46c7
user:        Steve
summary:     B Implement Feature 0

changeset:   0:747376f7cfb9
user:        Steve
summary:     A Initial State
```

Mercurial has completely mangled our branch history. Well, to be fair, this is what happens when we don't actually use branches.

Mercurial fared a bit better than Git on the logging front: at least the order of changesets reflects the order in which they were merged, though I've seen projects where, either because of the complexity of the merges or perhaps because of an older version of Mercurial, the log looked nearly identical to Git's.

Unlike Git, Mercurial has no equivalent to the --no-ff option for merging. This means that with this clone-as-branch model, we can't get snapshots {C, D}.

While Git discards branch histories, Mercurial remembers them.
So, if we use the named branches workflow, we have a different story:

[Figure: the history graph Mercurial records with named branches]

```
$ hg log
changeset:   15:5495a79dbe2b
tag:         tip
parent:      10:ab972541c15e
parent:      14:f524ef255a5d
user:        Pete
summary:     E Implement Feature 3

changeset:   14:f524ef255a5d
branch:      branch3
user:        Pete
summary:     3D Close Branch3

changeset:   13:90ea24b0f0e1
branch:      branch3
user:        Pete
summary:     3C Complete Feature 3

changeset:   12:82bef28d0849
branch:      branch3
user:        Pete
summary:     3B Partial Completion of Feature 3

changeset:   11:d26ae37169a3
branch:      branch3
parent:      5:7f7c35f66937
user:        Pete
summary:     3A Start work on Feature 3

changeset:   10:ab972541c15e
parent:      5:7f7c35f66937
parent:      9:4282cb9c4a23
user:        Jerry
summary:     D Implement Feature 2

changeset:   9:4282cb9c4a23
branch:      branch2
user:        Jerry
summary:     2D Close Branch2

changeset:   8:0d7edbb59c8d
branch:      branch2
user:        Jerry
summary:     2C Complete Feature 2

changeset:   7:30bec7ee5bd2
branch:      branch2
user:        Jerry
summary:     2B Partial Completion of Feature 2

changeset:   6:bd7eb7ed40a4
branch:      branch2
user:        Jerry
summary:     2A Start work on Feature 2

changeset:   5:7f7c35f66937
parent:      1:635e85109055
parent:      4:52a27ea04f94
user:        Duane
summary:     C Implement Feature 1

changeset:   4:52a27ea04f94
branch:      branch1
user:        Duane
summary:     1C Close Branch1

changeset:   3:ceb303533965
branch:      branch1
user:        Duane
summary:     1B Complete Feature 1

changeset:   2:a6f29e9917eb
branch:      branch1
user:        Duane
summary:     1A Start work on Feature 1

changeset:   1:635e85109055
user:        Steve
summary:     B Implement Feature 0

changeset:   0:918345ee8664
user:        Steve
summary:     A Initial State
```

The hg log command shows us something that looks very similar to the previous log, with a few changes. First off, all of the merge changesets are present. Second, every changeset is associated with a branch (default is assumed if not listed). The last thing to notice is that we have an extra changeset per branch to "close" the branch.

Also, we can now easily show the history of a branch:

```
$ hg log -b default
changeset:   15:5495a79dbe2b
tag:         tip
parent:      10:ab972541c15e
parent:      14:f524ef255a5d
user:        Pete
summary:     E Implement Feature 3

changeset:   10:ab972541c15e
parent:      5:7f7c35f66937
parent:      9:4282cb9c4a23
user:        Jerry
summary:     D Implement Feature 2

changeset:   5:7f7c35f66937
parent:      1:635e85109055
parent:      4:52a27ea04f94
user:        Duane
summary:     C Implement Feature 1

changeset:   1:635e85109055
user:        Steve
summary:     B Implement Feature 0

changeset:   0:918345ee8664
user:        Steve
summary:     A Initial State

$ hg log -b branch1
changeset:   4:52a27ea04f94
branch:      branch1
user:        Duane
summary:     1C Close Branch1

changeset:   3:ceb303533965
branch:      branch1
user:        Duane
summary:     1B Complete Feature 1

changeset:   2:a6f29e9917eb
branch:      branch1
user:        Duane
summary:     1A Start work on Feature 1
```

When we take advantage of Mercurial's branching support, you can see how the Git notion of "cleaning" history is ridiculous.

There are some costs to using the named branches workflow. The first one is that there's more manual work to set up and tear down branches. We ended up with additional commits to "close" branches, which is a bit of bookkeeping Mercurial pushes off on the user. It's possible, with planning, to make the final commit in a branch and close it at the same time, but in practice you may not actually land a branch in the trunk until after a review or a QA process, so you may not know it's time to close the branch until that happens.
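For reference, the bookkeeping for one feature branch looks roughly like this (a minimal sketch using the running example's names):

```sh
hg branch branch2                               # start a named branch
hg commit -m "2A Start work on Feature 2"
# ... more commits ...
hg commit --close-branch -m "2D Close Branch2"  # the extra bookkeeping changeset
hg update default                               # back to the trunk
hg merge branch2
hg commit -m "D Implement Feature 2"
```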
Another problem with Mercurial's implementation is that it uses colocated branches with a single working directory by default. The additional cognitive overhead imposed by this choice is, I suspect, the reason more Mercurial users use the clone-as-a-branch approach.

Also, using named branches means your repository will have multiple heads most of the time (one per branch). For intermediate and advanced Mercurial users who have a good mental model of what's going on, this is no big deal. But beginning Mercurial users are often trained to see multiple heads as a sign of a pending merge, and a situation to be rectified.

In general, the named branch approach, while fully supported and completely functional, feels like an afterthought. Calling the hg branch command to create a new branch issues a warning, as though it were advising you against using branches:

```
$ hg branch branch1
marked working directory as branch branch1
(branches are permanent and global, did you want a bookmark?)
```

Just for clarification: you do not want a bookmark.

Similarly, when pushing work to an upstream repository, you have to tell Mercurial that yes, you intended to push a new branch:

```
hg push --new-branch
```

If I had to sum up Mercurial, I'd call it repository oriented. Unlike Git, it fully tracks history, but it treats the repository as the most prominent metaphor in the system, at the expense of the branch. Key Mercurial commands such as clone, pull, push, and log operate on repositories, while it has a separate branch command to operate on branches.

If you're willing to put up with a bit of extra work and some odd warnings from Mercurial, the named branches workflow meets the ideal model I've presented. Even if you choose the clone-as-a-branch model, you'd still have a lot less cognitive overhead than using Git, so while I find it hard to recommend Git, if you're already a Mercurial user, I'd urge you to consider using named branches.

## Bazaar

Bazaar is the simplest of the three systems to use. Common use of Bazaar is much like the clone-as-a-branch approach of Mercurial, without exposing the user directly to head pointers or multiple heads. As I said in the Mercurial section, Git is an order of magnitude harder to use than either Mercurial or Bazaar. Bazaar is unquestionably simpler from a UI perspective than Mercurial, especially compared to the named branches workflow, but that difference is nothing compared to the gulf between Git and the others in usability.

Also, Bazaar by default uses a directory per branch, so each branch gets its own working directory, meaning you don't have to use kludges like stashing working directory changes when switching branches. Switching branches is as easy as changing directories. Bazaar does support a method for colocated branches, but it is not the default and it's rarely needed. As I pointed out in the Git section, there are easy ways to tackle the problems that colocated branches are supposed to solve.
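A minimal sketch of the directory-per-branch workflow (the paths are hypothetical):

```sh
bzr init-repo project            # a shared repository to hold related branches
cd project
bzr init trunk                   # the trunk is itself just a branch
bzr branch trunk branch1         # fork a feature branch into its own directory
cd branch1
# ... edit files, then ...
bzr commit -m "1A Start work on Feature 1"
cd ../trunk
bzr merge ../branch1             # bring the feature into the trunk
bzr commit -m "C Implement Feature 1"
```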
Let's repeat the experiment with the normal Bazaar workflow and see how it compares to the ideal model:

[Figure: the history graph Bazaar records, matching the ideal model]

Bazaar, by default, exactly matches the ideal model. Now, I know my more cynical readers will assume that this is because I picked Bazaar's model as the "ideal" model, but they would be incorrect. Bazaar is not the first DVCS I used, nor did my ideal model derive from Bazaar. The ideal model is what I think should happen when branching and merging. As I said earlier, I don't think the model I laid out is controversial. I use Bazaar because it meets the model, not the other way around.

In fact, I think a lot of users get confused by Git and Mercurial's branch-by-clone workflow precisely because the history graph that they record does not resemble the user's mental model.

Let's take a look at the log:

```
$ bzr log
------------------------------------------------------------
revno: 5 [merge]
committer: Pete
branch nick: trunk
message:
  E Implement Feature 3
------------------------------------------------------------
revno: 4 [merge]
committer: Jerry
branch nick: trunk
message:
  D Implement Feature 2
------------------------------------------------------------
revno: 3 [merge]
committer: Duane
branch nick: trunk
message:
  C Implement Feature 1
------------------------------------------------------------
revno: 2
committer: Steve
branch nick: trunk
message:
  B Implement Feature 0
------------------------------------------------------------
revno: 1
committer: Steve
branch nick: trunk
message:
  A Initial State
------------------------------------------------------------
Use --include-merged or -n0 to see merged revisions.
```

This is very different from the logs presented by either Mercurial or Git. Each of the logs I've shown was taken from the upstream "authoritative" repository. Unlike the others, bzr log does not operate on a repository; it operates on a branch. So by default, we see the history of the project from the perspective of the trunk branch.

Another distinction of Bazaar's logs is that they are nested. To show the complete log, we can set the nesting level to 0, which will show an infinite level of nesting:

```
$ bzr log -n0
------------------------------------------------------------
revno: 5 [merge]
committer: Pete
branch nick: trunk
message:
  E Implement Feature 3
    ------------------------------------------------------------
    revno: 3.2.3
    committer: Pete
    branch nick: branch3
    message:
      3C Complete Feature 3
    ------------------------------------------------------------
    revno: 3.2.2
    committer: Pete
    branch nick: branch3
    message:
      3B Partial Completion of Feature 3
    ------------------------------------------------------------
    revno: 3.2.1
    committer: Pete
    branch nick: branch3
    message:
      3A Start work on Feature 3
------------------------------------------------------------
revno: 4 [merge]
committer: Jerry
branch nick: trunk
message:
  D Implement Feature 2
    ------------------------------------------------------------
    revno: 3.1.3
    committer: Jerry
    branch nick: branch2
    message:
      2C Complete Feature 2
    ------------------------------------------------------------
    revno: 3.1.2
    committer: Jerry
    branch nick: branch2
    message:
      2B Partial Completion of Feature 2
    ------------------------------------------------------------
    revno: 3.1.1
    committer: Jerry
    branch nick: branch2
    message:
      2A Start work on Feature 2
------------------------------------------------------------
revno: 3 [merge]
committer: Duane
branch nick: trunk
message:
  C Implement Feature 1
    ------------------------------------------------------------
    revno: 2.1.2
    committer: Duane
    branch nick: branch1
    message:
      1B Complete Feature 1
    ------------------------------------------------------------
    revno: 2.1.1
    committer: Duane
    branch nick: branch1
    message:
      1A Start work on Feature 1
------------------------------------------------------------
revno: 2
committer: Steve
branch nick: trunk
message:
  B Implement Feature 0
------------------------------------------------------------
revno: 1
committer: Steve
branch nick: trunk
message:
  A Initial State
```

Nested logs are a simple but incredibly useful innovation. We don't really need a graphical tool to visualize history.
Also, it's clear that the Git notion of "cleaning" history is equally as ludicrous as it was when looking at Mercurial's logs, perhaps even more so.

There is a real advantage here compared to Mercurial's named branches workflow. For lack of a better term, Bazaar is branch oriented. Both git init and hg init create new repositories. However, bzr init creates a new branch. Both git clone and hg clone create copies of repositories. The equivalent command in Bazaar, bzr branch, forks a new branch. Both git log and hg log examine the history of the repository. The bzr log command shows branch history.

It's a very subtle change in point of view. Bazaar elevates branches to the primary metaphor, relegating the concept of repository to more of a background player role.

The result is that the basic Bazaar workflow is to always branch and to always accurately track branch history. It accomplishes this with a cognitive overhead that is comparable to working with a centralized system like Subversion, or with Mercurial's clone-as-a-branch method.

Bazaar does have some downsides. The most significant one is that it has the smallest user base of the big three open source DVCS. This means it will be a little harder to find answers to your questions in blog postings and Stack Exchange answers. Also, commercial hosting companies like Bitbucket aren't going to offer Bazaar support. Nor is it as actively developed as Mercurial and Git.

However, Bazaar's positives so strongly outweigh the negatives that you'd be crazy not to consider Bazaar.

## Branching is the Secret Sauce in DVCS

What makes DVCS work is the ability to handle branches from creation to merging, and to do it in the least intrusive way possible so we can get our work done.

Bazaar gets branching right, by default. It's also the easiest system to learn and use. You don't have to expend nearly as much focus and attention on Bazaar as on Mercurial and Git, which means you have more attention to devote to your actual work.

I did not go into detail showing the commands used to create the graphs and logs presented in this post because, rather than getting bogged down in the syntax and the details, I want to make the big picture clear. I leave the commands as an exercise for motivated readers, so they can prove to themselves that their favorite tool just might not be recording history as they visualize it.

## Bzr Init: A Bazaar Tutorial

At this point, you might be wondering where the actual tutorial is. First I wanted to convince you why Bazaar is a good choice, if not the best choice.

The actual tutorial is much longer than this rant, so I gave it its own site:

Bzr Init: A Bazaar Tutorial

This tutorial is not meant to be exhaustive; it exists to show you how easy Bazaar is and how it can be easily adapted to your workflow. It's inspired by (let's say ripped off from) Joel Spolsky's Mercurial tutorial, HgInit. HgInit is not only one of the best Mercurial tutorials, it's one of the best DVCS tutorials. I'd recommend you read it even if you're not a Mercurial user.

Hopefully you'll get the same experience from the Bzr Init tutorial. In any case, if I've piqued your interest, read on!
# A Categorical Theory of Patches

arXiv:1311.3903v1 [cs.LO] 13 Nov 2013

Samuel Mimram, Cinzia Di Giusto
CEA, LIST (this work was partially supported by the French project ANR-11-INSE-0007 REVER)

## Abstract

When working with distant collaborators on the same documents, one often uses a version control system, which is a program tracking the history of files and helping to import modifications brought by others as patches. The implementation of such a system requires handling lots of situations depending on the operations performed by users on files, and it is thus difficult to ensure that all the corner cases have been correctly addressed. Here, instead of verifying the implementation of such a system, we adopt a complementary approach: we introduce a theoretical model, which is defined abstractly by the universal property that it should satisfy, and work out a concrete description of it. We begin by defining a category of files and patches, where the operation of merging the effect of two coinitial patches is defined by pushout. Since two patches can be incompatible, such a pushout does not necessarily exist in the category, which raises the question of which is the correct category to represent and manipulate files in a conflicting state. We provide an answer by investigating the free completion of the category of files under finite colimits, and give an explicit description of this category: its objects are finite sets labeled by lines equipped with a transitive relation, and its morphisms are partial functions respecting labeling and relations.

## 1 Introduction

It is common nowadays, when working with distant collaborators on the same files (multiple authors writing an article together, for instance), to use a program which will track the history of files and handle the operation of importing modifications of other participants. These programs, called version control systems (vcs for short), like git or Darcs, implement two main operations. When a user is happy with the changes it has brought to the files, it can record those changes in a patch (a file coding the differences between the current version and the last recorded version) and commit them to a server, called a repository. The user can also update its current version of the file by importing new patches added by other users to the repository and applying the corresponding modifications to the files. One of the main difficulties to address here is that there is no global notion of "time": patches are only partially ordered. For instance, consider a repository with one file A and two users u1 and u2. Suppose that u1 modifies file A into B by committing a patch f, which is then imported by u2, and then u1 and u2 concurrently modify the file B into C (resp. D) by committing a patch g (resp. h). The evolution of the file is depicted on the left and the partial ordering of patches in the middle:

[Figure: on the left, the evolution of the file, with f : A → B, g : B → C and h : B → D; in the middle, the partial order on patches, with f below both g and h; on the right, the completed square, with the residuals h/g : C → E and g/h : D → E.]

Now, suppose that u2 imports the patch g or that u1 imports the patch h. Clearly, the file resulting from the merging of the two patches should be the same in both cases, call it E. One way to compute this file is to say that there should be a patch h/g, the residual of h after g, which transforms C into E and has the "same effect" as h once g has been applied, and similarly there should be a patch g/h transforming D into E. Thus, after each user has imported changes from the other, the evolution of the file is as pictured on the right above.
In this article, we introduce a category L whose objects are files and whose morphisms are patches. Since residuals should be computed in the most general way, we formally define them as the arrows of pushout cocones, i.e. the square in the figure on the right should be a pushout.

However, as expected, not every pair of coinitial morphisms has a pushout in the category L: this reflects the fact that two patches can be conflicting (for instance if two users modify the same line of a file). Representing and handling such conflicts in a coherent way is one of the most difficult parts of implementing a vcs (as witnessed for instance by the various proposals for Darcs: mergers, conflictors, graphictors, etc. [10]). In order to have a representation for all conflicting files, we investigate the free completion of the category L under all pushouts, this category being denoted P, which corresponds to adding all conflicting files to the category, in the most general way possible. This category can easily be shown to exist for general abstract reasons, and one of the main contributions of this work is to provide an explicit description by applying the theory of presheaves. This approach paves the way towards the implementation of a vcs whose correctness is deduced from universal categorical properties.

Related work. The Darcs community has investigated a formalization of patches based on commutation properties [10]. Operational transformations tackle essentially the same issues by axiomatizing the notion of residual patches [9]. In both cases, the fact that residuals should form a pushout cocone is never explicitly stated, except in informal sentences saying that "g/f should have the same effect as g once f has been applied". We should also mention another interesting approach to the problem using inverse semigroups in [4]. Finally, Houston has proposed a category with pushouts, similar to ours, in order to model conflicting files [3], see Section 6.

Plan of the paper. We begin by defining a category L of files and patches in Section 2. Then, in Section 3, we abstractly define the category P of conflicting files obtained by free finite cocompletion. Section 4 provides a concrete description of the construction in the simpler case where patches can only insert lines. We give some concrete examples in Section 5 and adapt the framework to the general case in Section 6. We conclude in Section 7.

## 2 Categories of files and patches

In this section, we investigate a model for a simplified vcs: it handles only one file, and the only allowed operations are insertion and deletion of lines (modification of a line can be encoded by a deletion followed by an insertion). We suppose fixed a set L = {a, b, ...} of lines (typically words over an alphabet of characters). A file A is a finite sequence of lines, which will be seen as a function A : [n] → L for some number of lines n ∈ N, where the set [n] = {0, 1, ..., n − 1} indexes the lines of the file. For instance, a file A with three lines such that A(0) = a, A(1) = b and A(2) = c models the file abc. Given a ∈ L, we sometimes simply write a for the file A : [1] → L such that A(0) = a. A morphism between two files A : [m] → L and B : [n] → L is an injective increasing partial function f : [m] → [n] such that ∀i ∈ [m], B ∘ f(i) = A(i) whenever f(i) is defined. Such a morphism is called a patch.

Definition 1.
The category L has files as objects and patches as morphisms.

Notice that the category L is strictly monoidal with [m] ⊗ [n] = [m + n] and, for every file A : [m] → L and B : [n] → L, (A ⊗ B)(i) = A(i) if i < m and (A ⊗ B)(i) = B(i − m) otherwise, the unit being the empty file I : [0] → L, and the tensor being defined on morphisms in the obvious way. The following proposition shows that patches are generated by the operations of inserting and deleting a line:

Proposition 2. The category L is the free monoidal category containing L as objects and containing, for every line a ∈ L, morphisms η_a : I → a (insertion of a line a) and ε_a : a → I (deletion of a line a) such that ε_a ∘ η_a = id_I (deleting an inserted line amounts to doing nothing).

Example 3. The patch transforming the file abc into dadeb, by deleting the line c and inserting the lines labeled by d and e, is modeled by the partial function f : [3] → [5] such that f(0) = 1 and f(1) = 4, with f(2) undefined.

[Figure: the lines of abc mapped into dadeb; a is sent to the second line, b to the fifth, and c has no image.]

The deleted line is the one on which f is not defined, and the inserted lines are those which are not in the image of f. In other words, f keeps track of the unchanged lines.

In order to increase readability, we shall consider the particular case where L is reduced to a single element. In this unlabeled case, the objects of L can be identified with integers (the labeling function is trivial), and Proposition 2 can be adapted to achieve the following description of the category, see also [6].

Proposition 4. If L is reduced to a singleton, the category L is the free category whose objects are integers and whose morphisms are generated by s^n_i : n → n + 1 and d^n_i : n + 1 → n for every n ∈ N and i ∈ [n + 1] (respectively corresponding to insertion and deletion of a line at the i-th position), subject to the relations

    s^{n+1}_i s^n_j = s^{n+1}_{j+1} s^n_i        d^n_i s^n_i = id_n        d^n_i d^{n+1}_j = d^n_j d^{n+1}_{i+1}        (1)

whenever 0 ≤ i ≤ j < n.

We will also consider the subcategory L+ of L, with the same objects, and total injective increasing functions as morphisms. This category models patches where the only possible operation is the insertion of lines: Proposition 2 can be adapted to show that L+ is the free monoidal category containing morphisms η_a : I → a and, in the unlabeled case, Proposition 4 can be similarly adapted to show that it is the free category generated by morphisms s^n_i : n → n + 1 satisfying s^{n+1}_i s^n_j = s^{n+1}_{j+1} s^n_i with 0 ≤ i ≤ j < n.
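As a quick sanity check on the first of these relations (a worked instance of ours, not an example from the paper), take n = 2, i = 0, j = 1 and the file ab:

```latex
% Insert a line x at position 1, then a line y at position 0:
ab \xrightarrow{\;s^2_1\;} axb \xrightarrow{\;s^3_0\;} yaxb
% Insert y at position 0 first, then x at position 2:
ab \xrightarrow{\;s^2_0\;} yab \xrightarrow{\;s^3_2\;} yaxb
% Both composites agree: s^3_0 \, s^2_1 = s^3_2 \, s^2_0, an instance of
% s^{n+1}_i s^n_j = s^{n+1}_{j+1} s^n_i with i = 0 \le j = 1 < n = 2.
```

The relation just says that two insertions commute once the later index is shifted to account for the earlier insertion.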
## 3 Towards a category of conflicting files

Suppose that A is a file which is edited by two users, respectively applying patches f1 : A → A1 and f2 : A → A2 to the file. For instance,

    accb  ←f1−  ab  −f2→  abcd        (2)

Now, each of the two users imports the modification from the other one. The resulting file, after the import, should be the smallest file containing both modifications on the original file: accbcd. It is thus natural to state that it should be a pushout of the diagram (2). Now, it can be noticed that not every diagram in L has a pushout. For instance, the diagram

    acb  ←f1−  ab  −f2→  adb        (3)

does not admit a pushout in L. In this case, the two patches f1 and f2 are said to be conflicting.

In order to represent the state of files after applying two conflicting patches, we investigate the definition of a category P which is obtained by completing the category L under all pushouts. Since this completion should also contain an initial object (i.e. the empty file), we are actually defining the category P as the free completion of L under finite colimits: recall that a category is finitely cocomplete (has all finite colimits) if and only if it has an initial object and is closed under pushouts [6]. Intuitively, this category is obtained by adding files whose lines are not linearly ordered, but only partially ordered, such as the file on the left below, shown next to the corresponding textual notation in git:

      a              a
     / \             <<<<<<< HEAD
    c   d            c
     \ /             =======
      b      (4)     d
                     >>>>>>> 5c55...
                     b

This file would intuitively model the pushout of the diagram (3) if it existed, indicating that the user has to choose between c and d for the second line. Notice the similarities with the textual notation in git on the right. The name of the category L reflects the fact that its objects are files whose lines are linearly ordered, whereas the objects of P can be thought of as files whose lines are only partially ordered. More formally, the category is defined as follows.

Definition 5. The category P is the free finite conservative cocompletion of L: it is (up to equivalence of categories) the unique finitely cocomplete category together with an embedding functor y : L → P preserving finite colimits, such that for every finitely cocomplete category C and functor F : L → C preserving finite colimits, there exists, up to unique isomorphism, a unique functor F̃ : P → C preserving finite colimits and satisfying F̃ ∘ y = F (the evident triangle of functors commutes).

Above, the term conservative refers to the fact that we preserve colimits which already exist in L (we will only consider such completions here). The "standard" way to characterize the category P, which always exists, is to use the following folklore theorem, often attributed to Kelly [5, 1]:

Theorem 6. The conservative cocompletion of the category L is equivalent to the full subcategory of L̂ whose objects are presheaves which preserve finite limits, i.e. the image of a limit in L^op (or equivalently a colimit in L) is a limit in Set (and limiting cones are transported to limiting cones). The finite conservative cocompletion P can be obtained by further restricting to presheaves which are finite colimits of representables.

Example 7. The category FinSet of finite sets and functions is the conservative cocompletion of the terminal category 1.

We recall that the category L̂ of presheaves over L is the category of functors L^op → Set and natural transformations between them. The Yoneda functor y : L → L̂, defined on objects n ∈ L by yn = L(−, n) and on morphisms by postcomposition, provides a full and faithful embedding of L into the corresponding presheaf category, and can be shown to corestrict to a functor y : L → P [1]. A presheaf of the form yn for some n ∈ L is called representable.

Extracting a concrete description of the category P from the above proposition is a challenging task, because we a priori need to characterize firstly all diagrams admitting a colimit in L, and secondly all presheaves in L̂ which preserve those diagrams. This paper introduces a general methodology to build such a category.
In particular, perhaps a bit surprisingly, it turns out that we have to "allow cycles" in the objects of the category P, which will be described as the category whose objects are finite sets labeled by lines together with a transitive relation, and whose morphisms are partial functions respecting labels and relations.

## 4 A cocompletion of files and insertions of lines

In order to make our presentation clearer, we shall begin our investigation of the category P in a simpler case, which will be generalized in Section 6: we compute the free finite cocompletion of the category L+ (patches can only insert lines) in the case where the set of labels is a singleton. To further lighten notation, in this section we simply write L for this category.

We sometimes characterize the objects in L as finite colimits of objects in a subcategory G of L. This category G is the full subcategory of L whose objects are 1 and 2: it is the free category on the graph with two vertices and two parallel edges from 1 to 2, the two arrows being s^1_0 and s^1_1. The category Ĝ of presheaves over G is the category of graphs: a presheaf P ∈ Ĝ is a graph with P(1) as vertices and P(2) as edges, the functions P(s^1_1) and P(s^1_0) associating to an edge its source and target respectively, and morphisms correspond to usual morphisms of graphs. We denote by x ↠ y a path going from a vertex x to a vertex y in such a graph. The inclusion functor I : G → L induces, by precomposition, a functor I* : L̂ → Ĝ. The image of a presheaf in L̂ under this functor is called its underlying graph. By well-known results about presheaf categories, this functor admits a right adjoint I_* : Ĝ → L̂: given a graph G ∈ Ĝ, its image under the right adjoint is the presheaf G_* ∈ L̂ such that for every n ∈ N, G_*(n + 1) is the set of paths of length n in the graph G, with the expected source maps, and G_*(0) is reduced to one element.

Recall that every functor F : C → D induces a nerve functor N_F : D → Ĉ defined on an object A ∈ D by N_F(A) = D(F−, A) [7]. Here, we will consider the nerve N_I : L → Ĝ associated to the inclusion functor I : G → L. An easy computation shows that the image N_I(n) of n ∈ L is a graph with n vertices, so that its objects are isomorphic to [n], and there is an arrow i → j for every i, j ∈ [n] such that i < j. For instance, N_I(3) is the graph on vertices 0, 1, 2 with edges 0 → 1, 1 → 2 and 0 → 2, and N_I(4) is the graph on vertices 0, 1, 2, 3 with an edge i → j for every i < j.

It is, therefore, easy to check that this embedding is full and faithful, i.e. morphisms in L correspond to natural transformations in Ĝ. Moreover, since N_I(1) is the graph reduced to a vertex and N_I(2) is the graph reduced to two vertices and one arrow between them, every graph can be obtained as a finite colimit of the graphs N_I(1) and N_I(2) by "gluing arrows along vertices". For instance, the initial graph N_I(0) is the colimit of the empty diagram, and the graph N_I(3) is the colimit of a diagram made of three copies of N_I(2) (one for each edge) glued along three copies of N_I(1) (one for each vertex) via the morphisms N_I(s^1_0) and N_I(s^1_1). Notice that the object 3 is the colimit of the corresponding diagram in L, and this is generally true for all objects of L; moreover, this diagram is described by the functor El(N_I(3)) −π→ L.
The notation El(P) refers to the category of elements of a presheaf P ∈ Ĉ, whose objects are pairs (A, p) with A ∈ C and p ∈ P(A), and whose morphisms f : (A, p) → (B, q) are morphisms f : A → B in C such that P(f)(q) = p; π is the first projection functor. The functor I : G → L is thus a dense functor in the sense of Definition 9 below, see [7] for details.

Proposition 8. Given a functor F : C → D, with D cocomplete, the associated nerve N_F : D → Ĉ admits a left adjoint R_F : Ĉ → D called the realization along F. This functor is defined on objects P ∈ Ĉ by

    R_F(P) = colim( El(P) −π→ C −F→ D )

Proof. Given a presheaf P ∈ Ĉ and an object D ∈ D, it can be checked directly that morphisms P → N_F D in Ĉ are in bijection with cocones from the diagram El(P) −π→ C −F→ D to D, which in turn are in bijection with morphisms R_F(P) → D in D, see [7].

Definition 9. A functor F : C → D is dense if it satisfies one of the two equivalent conditions:

(i) the associated nerve functor N_F : D → Ĉ is full and faithful,

(ii) every object of D is canonically a colimit of objects in C: for every D ∈ D,

    D ≅ colim( El(N_F D) −π→ C −F→ D )        (5)

Since the functor I is dense, every object of L is a finite colimit of objects in G, and G does not have any non-trivial colimit. One could expect the free conservative finite cocompletion of L to be the free finite cocompletion P of G. We will see that this is not the case, because the image in L of a non-trivial diagram in G might still have a colimit. By Theorem 6, the category P is the full subcategory of L̂ of presheaves preserving limits, which we now describe explicitly. This category will turn out to be equivalent to a full subcategory of Ĝ (Theorem 15). We should first remark that those presheaves satisfy the following properties:

Proposition 10. Given a presheaf P ∈ L̂ which is an object of P,

1. the underlying graph of P is finite,

2. for each non-empty path x ↠ y there exists exactly one edge x → y (in particular there is at most one edge between two vertices),

3. P(n + 1) is the set of paths of length n in the underlying graph of P, and P(0) is reduced to one element.

Proof. We suppose given a presheaf P ∈ P; it preserves limits by Theorem 6. The square with vertex 1 at the bottom, two copies of 2 in the middle (reached by s^1_0 and s^1_1) and 3 at the top (reached by s^2_2 and s^2_0), i.e. the square s^2_2 ∘ s^1_0 = s^2_0 ∘ s^1_1, is a pushout in L, or equivalently the dual diagram is a pullback in L^op. Therefore, writing D for the diagram 2 ←s^1_0− 1 −s^1_1→ 2 in L, a presheaf P ∈ P should satisfy P((colim D)^op) ≅ lim P(D^op), i.e. the above pushout diagram in L should be transported by P into a pullback diagram in Set. This condition can be summarized by saying that P should satisfy the isomorphism P(3) ≅ P(2) ×_{P(1)} P(2) (and this isomorphism should respect the obvious source and target maps, given by the fact that the functor P should send a limiting cone to a limiting cone). From this fact, one can deduce that the elements α of P(3) are in bijection with the paths x → y → z of length 2 in the underlying graph of P, going from x = P(s^2_2 s^1_1)(α) to z = P(s^2_0 s^1_0)(α). In particular, this implies that for any path α = x → y → z of length 2 in the underlying graph of P, there exists an edge x → z, which is P(s^2_1)(α). More generally, given any integer n > 1, the object n + 1 is the colimit in L of the zigzag diagram

    1 −s^1_1→ 2 ←s^1_0− 1 −s^1_1→ 2 ←s^1_0− ... −s^1_1→ 2 ←s^1_0− 1        (6)
with n + 1 occurrences of the object 1 and n occurrences of the object 2. Therefore, for every n ∈ N, P(n + 1) is isomorphic to the set of paths of length n in the underlying graph. Moreover, since the diagram (7), obtained from (6) by adding an extra object 2 receiving an arrow s^1_1 from the first copy of 1 and an arrow s^1_0 from the last copy of 1, also admits the object n + 1 as colimit, P(n + 1) should also be isomorphic to the set of pairs consisting of a path of length n together with an edge between its endpoints x and y, i.e. for every non-empty path x ↠ y there exists exactly one edge x → y. Also, since the object 0 is initial in L, it is the colimit of the empty diagram. The set P(0) should thus be the terminal set, i.e. reduced to one element. Finally, since I is dense, P should be a finite colimit of the representables N_I(1) and N_I(2); the set P(1) is necessarily finite, as well as the set P(2), since there is at most one edge between two vertices.

Conversely, we wish to show that the conditions mentioned in the above proposition exactly characterize the presheaves in P among those in L̂. In order to prove so, by Theorem 6, we have to show that presheaves P satisfying these conditions preserve finite limits in L, i.e. that for every finite diagram D : J → L admitting a colimit we have P(colim D) ≅ lim(P ∘ D^op). It seems quite difficult to characterize the diagrams admitting a colimit in L; however, the following lemma shows that it is enough to check diagrams "generated" by a graph which admits a colimit.

Lemma 11. A presheaf P ∈ L̂ preserves finite limits if and only if it sends the colimits of diagrams of the form

    El(G) −π_G→ G −I→ L        (8)

to limits in Set, where G ∈ Ĝ is a finite graph such that the above diagram admits a colimit. Such a diagram in L is said to be generated by the graph G.

Proof. In order to check that a presheaf P ∈ L̂ preserves finite limits, we have to check that it sends colimits of finite diagrams in L which admit a colimit to limits in Set, and therefore we have to characterize diagrams which admit colimits in L. Suppose given a diagram K : J → L. Since I is dense, every object of L is a colimit of a diagram involving only the objects 1 and 2 (see Definition 9). We can therefore suppose that this is the case in the diagram K. Finally, it can be shown that the diagram K admits the same colimits as a diagram containing only s^1_0 and s^1_1 as arrows (these are the only non-trivial arrows in L whose source and target are 1 or 2), in which every object 2 is the target of exactly one arrow s^1_0 and one arrow s^1_1. For instance, a diagram involving the objects 1, 2 and 3 can be replaced in this way by a zigzag of 1's and 2's; the simplified diagram is generated by a graph such as 0 → 1 → 2 → 3 with an additional edge 1 → 3. Any such diagram K is obtained by gluing a finite number of diagrams of the form 1 −s^1_1→ 2 ←s^1_0− 1 along objects 1, and is therefore of the form El(G) −π→ G −I→ L for some finite graph G ∈ Ĝ: the objects of G are the objects 1 in K, the edges of G are the objects 2 in K, and the source and target of an edge 2 are respectively given by the sources of the corresponding arrows s^1_1 and s^1_0 admitting it as target. For instance, the simplified diagram just mentioned is generated by the graph with the extra edge.
The fact that every diagram is generated by a presheaf (is a discrete fibration) also follows more abstractly and generally from the construction of the comprehensive factorization system on Cat [8, 11].

Among diagrams generated by graphs, those admitting a colimit can be characterized using the following proposition:

Lemma 12. Given a graph G ∈ Ĝ, the associated diagram (8) admits a colimit in L if and only if there exists n ∈ L and a morphism f : G → N_I n in L̂ such that every morphism g : G → N_I m in L̂, with m ∈ L, factorizes uniquely through N_I n, i.e. g = h ∘ f for a unique morphism h : N_I n → N_I m.

Proof. Follows from the existence of a partially defined left adjoint to N_I, in the sense of [8], given by the fact that I is dense (see Definition 9).

We finally arrive at the following concrete characterization of diagrams admitting colimits:

Lemma 13. A finite graph G ∈ Ĝ induces a diagram (8) in L which admits a colimit if and only if it is "tree-shaped", i.e. it is

1. acyclic: for any vertex x, the only path x ↠ x is the empty path,

2. connected: for any pair of vertices x and y there exists a path x ↠ y or a path y ↠ x.

Proof. Given an object n ∈ L, recall that N_I n is the graph whose objects are elements of [n], with an arrow i → j if and only if i < j. Given a finite graph G, morphisms f : G → N_I n are therefore in bijection with functions f : V_G → [n], where V_G denotes the set of vertices of G, such that f(x) < f(y) whenever there exists an edge x → y (or equivalently, there exists a non-empty path x ↠ y).

Consider a finite graph G ∈ Ĝ: by Lemma 12, it induces a diagram (8) admitting a colimit if there is a universal arrow f : G → N_I n with n ∈ L. From this it follows that the graph is acyclic: otherwise, we would have a non-empty path x ↠ x for some vertex x, which would imply f(x) < f(x). Similarly, suppose that G is a graph with vertices x and y such that there is no path x ↠ y or y ↠ x, and that there is a universal morphism f : G → N_I n for some n ∈ L. Suppose that f(x) ≤ f(y) (the case where f(y) ≤ f(x) is similar). We can define a morphism g : G → N_I(n + 1) by g(z) = f(z) + 1 if there is a path x ↠ z, g(y) = f(x), and g(z) = f(z) otherwise. This morphism is easily checked to be well-defined. Since we always have f(x) ≤ f(y) and g(x) > g(y), there is no morphism h : N_I n → N_I(n + 1) such that h ∘ f = g.

Conversely, given a finite acyclic connected graph G, the relation ≤ defined on vertices by x ≤ y whenever there exists a path x ↠ y is a total order. Writing n for the number of vertices in G, the function f : G → N_I n, which to a vertex associates the number of vertices strictly below it wrt ≤, is universal in the sense of Lemma 12.

Proposition 14. The free conservative finite cocompletion P of L is equivalent to the full subcategory of L̂ whose objects are presheaves P satisfying the conditions of Proposition 10.

Proof. By Lemma 11, the category P is equivalent to the full subcategory of L̂ whose objects are presheaves preserving limits of diagrams of the form (8) generated by some graph G ∈ Ĝ which admits a colimit, i.e. by Lemma 13 the finite graphs which are acyclic and connected. We write G_n for the graph with [n] as vertices and edges i → (i + 1) for 0 ≤ i < n − 1. It can be shown that any acyclic and connected finite graph can be obtained from the graph G_n, for some n ∈ N, by iteratively adding an edge x → y for some vertices x and y such that there exists a non-empty path x ↠ y. Namely, suppose given an acyclic and connected finite graph G.
The relation ≤ on its vertices, defined by x ≤ y whenever there exists a path x ↠ y, is a total order, and therefore the graph G contains G_n, where n is the number of vertices of G. An edge in G which is not in G_n is necessarily of the form x → y with x ≤ y, otherwise G would not be acyclic. Since, by Proposition 10 (see (7)), the diagram generated by a graph of the form of (7) (a path together with an added edge from the source to the target of a non-empty path) is preserved by presheaves in P, it is enough to show that presheaves in P preserve diagrams generated by the graphs G_n. This follows again by Proposition 10, see (6).

One can notice that a presheaf P ∈ P is characterized by its underlying graph, since P(0) is reduced to one element and P(n + 1) is the set of paths of length n in this underlying graph: P ≅ I_*(I*P). We can therefore simplify the description of the cocompletion of L as follows:

Theorem 15. The free conservative finite cocompletion P of L is equivalent to the full subcategory of the category Ĝ of graphs whose objects are finite graphs such that for every non-empty path x ↠ y there exists exactly one edge x → y. Equivalently, it can be described as the category whose objects are finite sets equipped with a transitive relation <, and whose morphisms are functions respecting the relations.

In this category, pushouts can be explicitly described as follows:

Proposition 16. With the last description above, the pushout of a diagram (B, <_B) ←f− (A, <_A) −g→ (C, <_C) in P is B ⊎ C / ∼, with B ∋ b ∼ c ∈ C whenever there exists a ∈ A with f(a) = b and g(a) = c, equipped with the transitive closure of the relation inherited from <_B and <_C.

Lines with labels. The construction can be extended to the labeled case (i.e. L is not necessarily a singleton). The forgetful functor L̂ → Set sending a presheaf P to the set P(1) admits a right adjoint ! : Set → L̂. Given n ∈ N*, the elements of !L(n) are words u of length n over L, with !L(s^{n−1}_i)(u) being the word obtained from u by removing the i-th letter. The free conservative finite cocompletion P of L is the slice category L/!L, whose objects are pairs (P, ℓ) consisting of a finite presheaf P ∈ L̂ together with a labeling morphism ℓ : P → !L of presheaves. Alternatively, the description of Theorem 15 can be straightforwardly adapted by labeling the elements of the objects by elements of L (labels should be preserved by morphisms), thus justifying the use of labels for the vertices in the following examples.

## 5 Examples

In this section, we give some examples of merging (i.e. pushout) of patches.

Example 17. Suppose that starting from a file ab, one user inserts a line a′ at the beginning and c in the middle, while another one inserts a line d in the middle. After merging the two patches, the resulting file is the pushout of

    a′acb  ←f1−  ab  −f2→  adb

which is the partially ordered file with lines a′, then a, then c and d in conflict (with no order between them), then b.

Example 18. Write G1 for the graph with one vertex and no edges, and G2 for the graph with two vertices and one edge between them. We write s, t : G1 → G2 for the two morphisms in P. Since P is finitely cocomplete, there is a coproduct G1 + G1, which gives, by universal property applied to the cocone formed by s and t, an arrow seq : G1 + G1 → G2 that we call the sequentialization morphism. This morphism corresponds to the following patch: given two possibilities for a line, a user can decide to turn them into two consecutive lines. We also write seq′ : G1 + G1 → G2 for the morphism obtained similarly by exchanging s and t in the above cocone.
Example 18. Write G_1 for the graph with one vertex and no edges, and G_2 for the graph with two vertices and one edge between them. We write s, t : G_1 → G_2 for the two morphisms in P. Since P is finitely cocomplete, there is a coproduct G_1 + G_1 which gives, by the universal property, an arrow seq : G_1 + G_1 → G_2:

[diagram omitted: the cocone s, t : G_1 → G_2 factoring through G_1 + G_1 via seq]

that we call the sequentialization morphism. This morphism corresponds to the following patch: given two possibilities for a line, a user can decide to turn them into two consecutive lines. We also write seq′ : G_1 + G_1 → G_2 for the morphism obtained similarly by exchanging s and t in the above cocone. Now, the pushout of G_2 ←seq− G_1 + G_1 −seq′→ G_2 is the graph with two vertices and one edge in each direction between them, which illustrates how cyclic graphs appear in P during the cocompletion of L.

Example 19. With the notations of the previous example, by taking the coproduct of two copies of id_{G_1} : G_1 → G_1, there is a universal morphism merge : G_1 + G_1 → G_1, which illustrates how two independent lines can be merged by a patch (in order to resolve conflicts).

[diagram omitted: the cocone of two copies of the identity of G_1 factoring through G_1 + G_1 via merge]

6 Handling deletions of lines

All the steps performed in the previous sections in order to compute the free conservative finite cocompletion of the category L+ can be adapted in order to compute the cocompletion P of the category L as introduced in Definition 1, thus adding support for deletion of lines in patches. In particular, the generalization of the description given by Theorem 15 turns out to be as follows.

Theorem 20. The free conservative finite cocompletion P of the category L is the category whose objects are triples (A, <, ℓ), where A is a finite set of lines, < is a transitive relation on A and ℓ : A → L associates a label to each line, and whose morphisms f : (A, <_A, ℓ_A) → (B, <_B, ℓ_B) are partial functions f : A → B such that for all a, a′ ∈ A both admitting an image under f, we have ℓ_B(f(a)) = ℓ_A(a), and a <_A a′ implies f(a) <_B f(a′).

Similarly, pushouts in this category can be computed as described in Proposition 16, generalized in the obvious way to partial functions.

Example 21. Suppose that, starting from a file abc, one user inserts a line d after a and the other one deletes the line b. The merging of the two patches (in P′) is the pushout of

    adbc ←f1− abc −f2→ ac

which is adc, i.e. the file adc. Notice that the morphism f2 is partial: b has no image.
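The previous sketch extends to partial patches under one natural reading of "generalized in the obvious way" (this reading, the representation, and the names are our assumptions, not the authors' construction): a line deleted on one side is deleted in the result. This reproduces Example 21.

    # Illustrative sketch of the pushout with partial patches: a patch is
    # now a dict that simply omits the deleted lines.

    def pushout_partial(A, B, C, f, g):
        elems_A, _ = A
        elems_B, rel_B = B
        elems_C, rel_C = C
        rep = {("B", b): ("B", b) for b in elems_B}
        rep.update({("C", c): ("C", c) for c in elems_C})
        deleted = set()
        for a in elems_A:
            if a in f and a in g:
                rep[("C", g[a])] = ("B", f[a])   # glue f(a) ~ g(a)
            elif a in f:
                deleted.add(("B", f[a]))         # deleted on the C side
            elif a in g:
                deleted.add(("C", g[a]))         # deleted on the B side
        rel = {(rep[("B", x)], rep[("B", y)]) for x, y in rel_B}
        rel |= {(rep[("C", x)], rep[("C", y)]) for x, y in rel_C}
        changed = True                            # transitive closure, as before
        while changed:
            new = {(x, w) for x, y in rel for z, w in rel if y == z} - rel
            changed = bool(new)
            rel |= new
        elems = {e for e in rep.values() if e not in deleted}
        rel = {(x, y) for x, y in rel if x in elems and y in elems}
        return elems, rel

    # Example 21: from abc, one user inserts d after a, the other deletes b.
    A = ({"a", "b", "c"}, {("a", "b"), ("b", "c"), ("a", "c")})
    B = ({"a", "d", "b", "c"}, {("a", "d"), ("a", "b"), ("a", "c"),
                                ("d", "b"), ("d", "c"), ("b", "c")})
    C = ({"a", "c"}, {("a", "c")})
    f1 = {"a": "a", "b": "b", "c": "c"}          # total
    f2 = {"a": "a", "c": "c"}                    # partial: b has no image
    elems, rel = pushout_partial(A, B, C, f1, f2)
    # Three lines a, d, c with a < d < c, i.e. the file adc.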
Interestingly, a category very similar to the one we have described in Theorem 20 was independently proposed by Houston [3], based on a construction performed in [2] for modeling asynchronous processes. This category is not equivalent to ours because its morphisms are reversed partial functions: it is thus not the most general model (in the sense of being the free finite cocompletion). As a simplified explanation for this, consider the category FinSet, which is the finite cocompletion of 1. This category is finitely complete (in addition to cocomplete), thus FinSet^op is finitely cocomplete and 1 embeds fully and faithfully in it. However, FinSet^op is not the finite cocompletion of 1. Another way to see this is that this category does not contain the "merging" morphism of Example 19, but it contains a dual morphism "duplicating" lines.

7 Concluding remarks and future works

In this paper, we have detailed how we could derive from universal constructions a category which suitably models files resulting from conflicting modifications. It is finitely cocomplete, thus the merging of any modifications of a file is well-defined.

We believe that the interest of our methodology lies in the fact that it adapts easily to base categories L more complicated than the two investigated here: in future works, we should explain how to extend the model in order to cope with multiple files (which can be moved, deleted, etc.) and different file types (containing text, or more structured data such as XML trees). Also, the structure of repositories (partially ordered sets of patches) is naturally modeled by event structures labeled by morphisms in P, which will be detailed in future works, as well as how to model usual operations on repositories: cherry-picking (importing only one patch from another repository), using branches, removing a patch, etc. It would also be interesting to explore axiomatically the addition of inverses for patches, following other works hinted at in the introduction.

Once the theoretical setting is clearly established, we plan to investigate algorithmic issues (in particular, how to efficiently represent and manipulate the conflicting files, which are objects in P). This should eventually serve as a basis for the implementation of a theoretically sound and complete distributed version control system (with no unhandled corner cases, as in most current implementations of VCSs).

Acknowledgments. The authors would like to thank P.-A. Melliès, E. Haucourt, T. Heindel, T. Hirschowitz and the anonymous reviewers for their enlightening comments and suggestions.

References

[1] J. Adámek and J. Rosický. Locally Presentable and Accessible Categories, volume 189. Cambridge Univ. Press, 1994.
[2] R. Cockett and D. Spooner. Categories for synchrony and asynchrony. Electronic Notes in Theoretical Computer Science, 1:66–90, 1995.
[3] R. Houston. On editing text. http://bosker.wordpress.com/2012/05/10/on-editing-text.
[4] J. Jacobson. A formalization of darcs patch theory using inverse semigroups. Technical report, CAM report 09-83, UCLA, 2009.
[5] M. Kelly. Basic Concepts of Enriched Category Theory, volume 64. Cambridge Univ. Press, 1982.
[6] S. Mac Lane. Categories for the Working Mathematician, volume 5 of Graduate Texts in Mathematics. Springer Verlag, 1971.
[7] S. Mac Lane and I. Moerdijk. Sheaves in Geometry and Logic: A First Introduction to Topos Theory. Springer, 1992.
[8] R. Paré. Connected components and colimits. Journal of Pure and Applied Algebra, 3(1):21–42, 1973.
[9] M. Ressel, D. Nitsche-Ruhland, and R. Gunzenhäuser. An integrating, transformation-oriented approach to concurrency control and undo in group editors. In Proceedings of the 1996 ACM Conference on Computer Supported Cooperative Work, pages 288–297. ACM, 1996.
[10] D. Roundy et al. The Darcs Theory. http://darcs.net/Theory.
[11] R. Street and R. Walters. The comprehensive factorization of a functor. Bull. Amer. Math. Soc., 79(2):936–941, 1973.

A A geometric interpretation of presheaves on L+

Since presheaf categories are sometimes a bit difficult to grasp, we recall here the geometric interpretation that can be given for presheaves in L̂+. We forget about the labels of lines and, for simplicity, suppose that the empty file is not allowed (the objects are strictly positive integers); in this section, we denote this category by L. The same reasoning can be performed on the usual category L+, and even L, but the geometric explanation is a bit more involved to describe. In this case, the presheaves in L̂ can easily be described in geometric terms: the objects P of L̂ are presimplicial sets. Recall from Proposition 4 that the category L is the free category whose objects are the strictly positive integers, containing for all integers n ∈ N* and i ∈ [n+1] morphisms s^n_i : n → n+1, subject to the relations

    s^{n+1}_i s^n_j = s^{n+1}_{j+1} s^n_i    whenever 0 ≤ i ≤ j < n.
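These relations can be checked concretely on the word presheaf !L described earlier, where !L(s^{n−1}_i) deletes the i-th letter of a word: since presheaves act contravariantly, the relation above becomes an equation between double deletions. A small sketch of ours (the function names are assumptions):

    # Checking the presimplicial relation on the word presheaf !L, whose
    # face maps delete one letter of a word.

    def d(u, i):
        """The action of s_i on words: delete the i-th letter of u."""
        return u[:i] + u[i + 1:]

    # s^{n+1}_i s^n_j = s^{n+1}_{j+1} s^n_i (for i <= j) acts contravariantly,
    # giving d(d(u, i), j) == d(d(u, j + 1), i) on a word u.
    u = "abcde"
    assert all(d(d(u, i), j) == d(d(u, j + 1), i)
               for i in range(len(u) - 1)
               for j in range(i, len(u) - 1))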
Writing y : L → L̂ for the Yoneda embedding, a representable presheaf y(n+1) ∈ L̂ can be pictured geometrically as an n-simplex: a 0-simplex is a point, a 1-simplex is a segment, a 2-simplex is a (filled) triangle, a 3-simplex is a (filled) tetrahedron, etc.:

[figure omitted: y(1), y(2), y(3) and y(4) drawn as a point, a segment, a triangle and a tetrahedron]

Notice that the n-simplex has n+1 faces, which are (n−1)-dimensional simplices; these are given by the images under the action of the s^n_i, with i ∈ [n+1], of the unique element of y(n+1)(n+1): the i-th face of an n-simplex is the (n−1)-simplex obtained by removing the i-th vertex from the simplex. More generally, a finite presheaf P ∈ L̂ (a finite presimplicial set) is a finite colimit of representables: every such presheaf can be pictured as a gluing of simplices. For instance, the half-filled square below corresponds to the presimplicial set P with P(1) = {a, b, c, d}, P(2) = {f, g, h, i, j}, P(3) = {α}, with faces P(s^1_1)(f) = a, P(s^1_0)(f) = b, etc.

[figure omitted: a square with vertices a, b, c, d, edges f, g, h, i and a diagonal j, with exactly one of the two triangles filled by α]

Similarly, in the labeled case, a labeled presheaf (P, ℓ) ∈ L̂/!L can be pictured as a presimplicial set whose vertices (0-simplices) are labeled by elements of L. The word labeling a higher-dimensional simplex can then be deduced by concatenating the labels of the vertices it has as iterated faces. For instance, an edge (a 1-simplex) whose source is labeled by a and whose target is labeled by b is necessarily labeled by the word ab, etc.

[figure omitted: a triangle whose vertices are labeled a, b, c, whose edges are labeled ab, bc, ac, and whose 2-cell is labeled abc]

More generally, presheaves in L̂+ can be pictured as augmented presimplicial sets, and presheaves in L̂ as augmented simplicial sets; a description of those can for instance be found in Hatcher's book Algebraic Topology.

B Proofs of classical propositions

In this section, we briefly recall proofs of well-known propositions, as our proofs rely on a fine understanding of those. We refer the reader to [7] for further details.

Proposition 8. Given a functor F : C → D, with D cocomplete, the associated nerve N_F : D → Ĉ admits a left adjoint R_F : Ĉ → D, called the realization along F. This functor is defined on objects P ∈ Ĉ by

    R_F(P) = colim( El(P) −π→ C −F→ D )

Proof. In order to show the adjunction, we have to construct a natural family of isomorphisms D(R_F(P), D) ≅ Ĉ(P, N_F D), indexed by a presheaf P ∈ Ĉ and an object D ∈ D. A natural transformation θ ∈ Ĉ(P, N_F D) is a family of functions (θ_C : P(C) → D(F C, D))_{C∈C} such that, for every morphism f : C′ → C in C, the naturality square commutes, i.e. θ_{C′} ∘ P(f) = D(F f, D) ∘ θ_C. It can also be seen as a family (θ_C(p) : F C → D)_{(C,p)∈El(P)} of morphisms in D such that θ_C(p) ∘ F f = θ_{C′}(P(f)(p)) for every morphism f : C′ → C in C. This thus defines a cocone from F π_P : El(P) → D to D, and such cocones are in bijection with morphisms R_F(P) → D by the definition of R_F(P) as a colimit: we have shown D(R_F(P), D) ≅ Ĉ(P, N_F(D)), from which we conclude.

The equivalence between the two conditions of Definition 9 can be shown as follows.

Proposition 22. Given a functor F : C → D, the two following conditions are equivalent:

(i) the associated nerve functor N_F : D → Ĉ is full and faithful;
(ii) every object of D is canonically a colimit of objects in C: for every D ∈ D,

    D ≅ colim( El(N_F D) −π→ C −F→ D )

Proof.
In the case where D is cocomplete, the nerve functor N_F : D → Ĉ admits R_F : Ĉ → D as left adjoint, and the equivalence amounts to showing that the right adjoint N_F is full and faithful if and only if the counit of the adjunction is an isomorphism, which is a classical theorem [6, Theorem IV.3.1]. The construction can be adapted to the general case, where D is not necessarily cocomplete, by considering colim(El(−) −π→ C −F→ D) : Ĉ → D as a partially defined left adjoint (see [8]) and generalizing the theorem.

C Proofs of the construction of the finite cocompletion

Lemma 11. A presheaf P ∈ L̂ preserves finite limits if and only if it sends the colimits of diagrams of the form

    El(G) −π_G→ G −I→ L

to limits in Set, where G ∈ Ĝ is a finite graph such that the above diagram admits a colimit. Such a diagram in L is said to be generated by the graph G.

Proof. In order to check that a presheaf P ∈ L̂ preserves finite limits, we have to check that it sends colimits of finite diagrams in L which admit a colimit to limits in Set, and therefore we have to characterize the diagrams which admit colimits in L. The number of diagrams to check can be reduced by using the fact that limits commute with limits [6]. For instance, the inclusion functor I : G → L is dense, which implies that every object n ∈ L is canonically a colimit of the objects 1 and 2, by the formula n ≅ colim(El(N_I n) −π→ G −I→ L), see Definition 9. Thus, given a finite diagram K : J → L, we can replace any object n different from 1 and 2 occurring in the diagram by the corresponding diagram El(N_I n) −π→ G −I→ L, thus obtaining a new diagram K′ : J′ → L which admits the same colimit as K. This shows that P will preserve finite limits if and only if it preserves limits of finite diagrams in L in which the only occurring objects are 1 and 2. Since the only non-trivial arrows in L between the objects 1 and 2 are s^1_0, s^1_1 : 1 → 2, and removing an identity arrow in a diagram does not change its colimit, the diagram K can thus be assimilated to a bipartite graph with vertices labeled by 1 or 2 and edges labeled by s^1_0 or s^1_1, all edges going from vertices 1 to vertices 2.

We can also reduce the number of diagrams to check by remarking that some pairs of diagrams are "equivalent", in the sense that their images under P have the same limit, independently of P. For instance, consider a diagram in which an object 2 is the target of two arrows labeled by s^1_0. The diagram obtained by identifying the two arrows, along with the objects 1 at their sources, can easily be checked to be equivalent, by constructing a bijection between the cocones of the first and the cocones of the second.

[diagrams omitted: the two equivalent bipartite diagrams]

More precisely, if we write K′ : J′ → L and K : J → L for the two diagrams and J : J′ → J for the obvious functor (so that K′ = K ∘ J), the canonical arrow colim(K ∘ J) → colim(K) is an isomorphism, i.e. the functor J is final. The same reasoning of course also holds with s^1_1 instead of s^1_0. We can therefore restrict ourselves to considering diagrams in which each object 2 is the target of at most one arrow s^1_0 and of at most one arrow s^1_1. Conversely, if an object 2 is the target of no arrow s^1_0, then we can add a new object 1 and a new arrow s^1_0 from this object to the object 2 and obtain an equivalent diagram:

[diagrams omitted: the two equivalent bipartite diagrams]
The same reasoning holds with s^1_1 instead of s^1_0, and we can therefore restrict ourselves to diagrams in which every object 2 is the target of exactly one arrow s^1_0 and one arrow s^1_1.

Any such diagram K is obtained by gluing a finite number of diagrams of the form

    1 −s^1_1→ 2 ←s^1_0− 1

along objects 1, and is therefore of the form El(G) −π→ G −I→ L for some finite graph G ∈ Ĝ: the objects of G are the objects 1 of K, the edges of G are the objects 2 of K, and the source and target of an edge 2 are respectively given by the sources of the arrows s^1_1 and s^1_0 admitting it as target.

[figure omitted: an example of such a bipartite diagram on the left, and the finite graph generating it on the right]

Lemma 12. Given a graph G ∈ Ĝ, the associated diagram (8) admits a colimit in L if and only if there exist n ∈ L and a morphism f : G → N_I n in L̂ such that every morphism g : G → N_I m in L̂, with m ∈ L, factorizes uniquely through N_I n as g : G −f→ N_I n −→ N_I m.

Proof. We have seen in the proof of Proposition 8 that morphisms in L̂(G, N_I n) are in bijection with cocones in L from El(G) −π_G→ G −I→ L to n; moreover, given a morphism h : n → m in L, the function L̂(G, N_I n) → L̂(G, N_I m) induced by post-composition with N_I h is easily checked to correspond to the usual notion of morphism between n-cocones and m-cocones induced by h (every morphism N_I n → N_I m is of this form since N_I is full and faithful). We can finally conclude using the universal property defining colimiting cocones.

D Proofs for deletions of lines

In this section, we detail proofs of properties mentioned in Section 6.

D.1 Sets and partial functions

Before considering the conservative finite cocompletion of the category L, as introduced in Definition 1, it is enlightening to study the category PSet of sets and partial functions. A partial function f : A → B can always be seen

1. as a total function f : A → B ⊎ {⊥_B}, where ⊥_B is a fresh element wrt B, with f(a) = ⊥_B meaning that the partial function is undefined on a,
2. alternatively, as a total function f : A ⊎ {⊥_A} → B ⊎ {⊥_B} such that f(⊥_A) = ⊥_B.

This thus suggests considering the following category:

Definition 23. The category pSet of pointed sets has pairs (A, a), where A is a set and a ∈ A, as objects, and morphisms f : (A, a) → (B, b) are (total) functions f : A → B such that f(a) = b.

Point (2) of the preceding discussion can be summarized by saying that a partial function can be seen as a pointed function, and conversely:

Proposition 24. The category PSet of sets and partial functions is equivalent to the category pSet of pointed sets.

It is easily shown that the forgetful functor U : pSet → Set, sending a pointed set (A, a) to the underlying set A, admits a left adjoint F : Set → pSet, defined on objects by F A = (A ⊎ {⊥_A}, ⊥_A). This adjunction induces a monad T = U F on Set, from which point (1) can be formalized:

Proposition 25. The category PSet is equivalent to the Kleisli category Set_T associated to the monad T : Set → Set.
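In programming terms, T is essentially the Maybe/exception monad, and Propositions 24 and 25 say that partial functions are its Kleisli arrows. A small illustrative Python sketch of ours:

    # Partial functions A -> B as total functions A -> B + {bottom},
    # composed as Kleisli arrows for the Maybe/exception monad.
    BOTTOM = None  # the added base point

    def kleisli(g, f):
        """Compose partial functions represented with an explicit BOTTOM."""
        return lambda a: BOTTOM if f(a) is BOTTOM else g(f(a))

    half = lambda n: n // 2 if n % 2 == 0 else BOTTOM   # undefined on odd numbers
    pred = lambda n: n - 1 if n > 0 else BOTTOM         # undefined at 0

    h = kleisli(pred, half)
    print([h(n) for n in (4, 2, 3)])   # [1, 0, None]: undefinedness propagates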
Finally, it turns out that the category pSet of pointed sets might have been discovered from PSet using "presheaf thinking", as follows. We write G for the full subcategory of PSet containing two objects, the empty set 0 = ∅ and a set 1 = {∗} with only one element, and two non-trivial arrows ⋆ : 0 → 1 and ⊥ : 1 → 0 (the nowhere-defined function), such that ⊥ ∘ ⋆ = id_0. We write I : G → PSet for the inclusion functor, and consider the associated nerve functor N_I : PSet → Ĝ. Given a set A, the presheaf N_I A ∈ Ĝ is such that:

• N_I A(0) = PSet(I0, A) ≅ {⋆}: the only morphism 0 → A in PSet is noted ⋆,
• N_I A(1) = PSet(I1, A) ≅ A ⊎ {⊥_A}: a morphism 1 → A is characterized by the image of ∗ ∈ 1, which is either an element of A or undefined,
• N_I A(⋆) : N_I A(1) → N_I A(0) is the constant function whose image is ⋆,
• N_I A(⊥) : N_I A(0) → N_I A(1) is the function sending ⋆ to ⊥_A.

Moreover, given A, B ∈ PSet, a natural transformation from N_I A to N_I B is a pair of functions f : A ⊎ {⊥_A} → B ⊎ {⊥_B} and g : {⋆} → {⋆} making the two naturality squares (for ⋆ and ⊥) commute. Since {⋆} is the terminal set, such a natural transformation is characterized by a function f : A ⊎ {⊥_A} → B ⊎ {⊥_B} such that f(⊥_A) = ⊥_B. The functor N_I : PSet → Ĝ is thus full and faithful (i.e. I is dense), and its image is equivalent to pSet.

D.2 A cocompletion of L

The situation with regard to the category L is very similar. We follow the plan of Section 4 and first investigate the unlabeled case: L is the category with integers as objects and partial injective increasing functions f : [m] → [n] as morphisms f : m → n.

We write G for the full subcategory of L whose objects are 0, 1 and 2. This is the free category on the graph with objects 0, 1, 2 and morphisms s^0_0 : 0 → 1, d^0_0 : 1 → 0, s^1_0, s^1_1 : 1 → 2 and d^1_0, d^1_1 : 2 → 1, subject to the relations

    s^1_0 s^0_0 = s^1_1 s^0_0    d^0_0 s^0_0 = id_0    d^1_0 s^1_0 = id_1    d^1_1 s^1_1 = id_1    d^0_0 d^1_0 = d^0_0 d^1_1    (9)

(see Proposition 4). We write I : G → L for the embedding and consider the associated nerve functor N_I : L → Ĝ. Suppose given an object n ∈ L; the associated presheaf N_I n can be described as follows. Its sets are

• N_I n(0) = L(I0, n) ≅ {⋆},
• N_I n(1) = L(I1, n) ≅ [n] ⊎ {⊥},
• N_I n(2) = L(I2, n) ≅ {(i, j) ∈ [n] × [n] | i < j} ⊎ {(⊥, i) | i ∈ [n]} ⊎ {(i, ⊥) | i ∈ [n]} ⊎ {(⊥, ⊥)}: a partial function f : 2 → n is characterized by the pair of images (f(0), f(1)) of 0, 1 ∈ [2], where ⊥ means undefined,

and its morphisms are

• N_I n(s^0_0) : N_I n(1) → N_I n(0), the constant function whose image is ⋆,
• N_I n(d^0_0) : N_I n(0) → N_I n(1), the function whose image is ⊥,
• N_I n(s^1_0) : N_I n(2) → N_I n(1), the second projection,
• N_I n(s^1_1) : N_I n(2) → N_I n(1), the first projection,
• N_I n(d^1_0) : N_I n(1) → N_I n(2), sending i ∈ [n] ⊎ {⊥} to (⊥, i),
• N_I n(d^1_1) : N_I n(1) → N_I n(2), sending i ∈ [n] ⊎ {⊥} to (i, ⊥).

Such a presheaf can be pictured as a graph with N_I n(1) as set of vertices and N_I n(2) as set of edges, the source and target being respectively given by the functions N_I n(s^1_1) and N_I n(s^1_0):

[figure omitted: the graph on the vertices ⊥, 0, 1, 2, …, n−1]

Its vertices are the elements of [n] ⊎ {⊥} and its edges are of the form

• i → j with i, j ∈ [n] such that i < j,
• i → ⊥ for i ∈ [n],
• ⊥ → i for i ∈ [n],
• ⊥ → ⊥.

Morphisms between such presheaves are usual graph morphisms which preserve the vertex ⊥. We are thus naturally led to define the following categories of pointed graphs and of graphs with partial functions. We recall that a graph G = (V, s, t, E) consists of a set V of vertices, a set E of edges and two functions s, t : E → V associating to each edge its source and its target respectively.

Definition 26. We define the category pGraph of pointed graphs as the category whose objects are pairs (G, x), with G = (V, E) a graph and x ∈ V a vertex such that for every vertex there is exactly one edge from and to the distinguished vertex x, and whose morphisms f : G → G′ are usual graph morphisms, i.e. pairs (f_V, f_E) of functions f_V : V_G → V_{G′} and f_E : E_G → E_{G′} such that for every edge e ∈ E_G, f_V(s(e)) = s(f_E(e)) and f_V(t(e)) = t(f_E(e)), which moreover preserve the distinguished vertex.

Definition 27.
We define the category PGraph of graphs and partial morphisms as the category whose objects are graphs and whose morphisms f : G → G′ are pairs (f_V, f_E) of partial functions f_V : V_G → V_{G′} and f_E : E_G → E_{G′} such that

• for every edge e ∈ E_G such that f_E(e) is defined, f_V(s(e)) and f_V(t(e)) are both defined and satisfy f_V(s(e)) = s(f_E(e)) and f_V(t(e)) = t(f_E(e)),
• for every edge e ∈ E_G such that f_V(s(e)) and f_V(t(e)) are both defined, f_E(e) is also defined.

More briefly: a morphism is defined on an edge if and only if it is defined on its source and on its target.

Similarly to the previous section, a partial morphism of graphs can be seen as a pointed morphism of graphs, and conversely:

Proposition 28. The categories pGraph and PGraph are equivalent.

Now, notice that the category L is isomorphic to the full subcategory of PGraph whose objects are the graphs whose set of vertices is [n] for some n ∈ N, with an edge i → j precisely when i < j. Also notice that the full subcategory of pGraph whose objects are the graphs N_I n (with ⊥ as distinguished vertex), with n ∈ N, is isomorphic to the full subcategory of Ĝ whose objects are the N_I n with n ∈ N. And finally, the two categories are equivalent via the equivalence of Proposition 28. From this, we immediately deduce that the functor N_I : L → Ĝ is full and faithful, i.e.

Proposition 29. The functor I : G → L is dense.

We can now follow Section 4 step by step, adapting each proposition as necessary. The conditions satisfied by presheaves in P introduced in Proposition 10 are still valid in our new case:

Proposition 30. Given a presheaf P ∈ L̂ which is an object of P,

1. the underlying graph of P is finite,
2. for each non-empty path x ↠ y there exists exactly one edge x → y,
3. P(n+1) is the set of paths of length n in the underlying graph of P, and P(0) is reduced to one element.

Proof. The diagrams of the form (6) and (7) used in the proof of Proposition 10 still admit the same colimit n+1 with the new definition of L, and 0 is still initial. It can be checked that the limit of the image under a presheaf P ∈ L̂ of a diagram (6) is still the set of paths of length n in the underlying graph of P.

Lemma 11 is also still valid:

Lemma 31. A presheaf P ∈ L̂ preserves finite limits if and only if it sends the colimits of diagrams of the form

    El(G) −π_G→ G −I→ L

to limits in Set, where G ∈ Ĝ is a finite pointed graph such that the above diagram admits a colimit. Such a diagram in L is said to be generated by the pointed graph G.

Proof. The proof of Lemma 11 was done "by hand", but we mentioned a more abstract alternative proof. In the present case, a similar direct proof could be given but would be really tedious, so we provide the abstract one. In order to illustrate why we have to do so, one can consider the categories of elements associated to the presheaves representable by 0 and 1, which are clearly much bigger than in the case of Section 4:

[diagrams omitted: El(N_I 0) and El(N_I 1), subject to relations which follow directly from (9)]

Before going on with the proof, we need to introduce a few notions. A functor F : C → D is called final if for every category E and diagram G : D → E the canonical morphism colim(G ∘ F) → colim(G) is an isomorphism [6]: restricting a diagram along F does not change its colimit.
Alternatively, these functors can be characterized as the functors F such that, for every object D ∈ D, the comma category D/F is non-empty and connected (there is a zig-zag of morphisms between any two objects). A functor F : C → D is called a discrete fibration if for any object C ∈ C and morphism g : D → F C in D there exists a unique morphism f : C′ → C in C such that F f = g, called the lifting of g. To any such discrete fibration one can associate a presheaf P ∈ D̂, defined on any D ∈ D by P D = F⁻¹(D) = {C ∈ C | F C = D} and on morphisms g : D′ → D as the function P g which to C ∈ P D associates the source of the lifting of g with codomain C. Conversely, any presheaf P ∈ D̂ induces a discrete fibration El(P) −π→ D, and these two operations induce an equivalence of categories between the category D̂ and the category of discrete fibrations over D. It was shown by Paré, Street and Walters [8, 11] that any functor F : C → D factorizes as a final functor J : C → E followed by a discrete fibration K : E → D, and this factorization is essentially unique: this is called the comprehensive factorization of a functor. More explicitly, the functor K can be defined as follows. The inclusion functor Set → Cat, which sends a set to the corresponding discrete category, admits a left adjoint Π₀ : Cat → Set, sending a category to its set of connected components (its set of objects quotiented by the relation identifying two objects linked by a zig-zag of morphisms). The discrete fibration part K above can be defined as El(P) −π→ D, where P ∈ D̂ is the presheaf defined by P = Π₀(−/F). In this precise sense, every diagram F in D is "equivalent" to one which is "generated" by a presheaf P on D (we adopted this informal terminology in the article in order to avoid having to introduce too many categorical notions).

In our case, we can thus restrict to diagrams in L generated by presheaves on L. Finally, since I : G → L is dense, we can further restrict to diagrams generated by presheaves on G, by interchange of colimits.

Lemma 13 applies almost as in Section 4: since the morphism f : G → N_I n (seen as a partial function between graphs) has to satisfy the universal property of Lemma 12, by choosing for every vertex x of G a partial function g_x : G → N_I m which is defined on x (such a partial function always exists), it can be shown that the function f has to be total. The rest of the proof can be kept unchanged. Similarly, Proposition 14 applies with its proof unchanged.

Finally, we have:

Theorem 32. The free conservative finite cocompletion P of L is equivalent to the full subcategory of L̂ whose objects are presheaves P satisfying the conditions of Proposition 30. Since its objects P satisfy I_* I^*(P) ≅ P, it can equivalently be characterized as the full subcategory of Ĝ whose objects P are

1. finite,
2. transitive: for each non-empty path x ↠ y there exists exactly one edge x → y,
3. pointed: P(0) is reduced to one element.

From this characterization (which can easily be extended to the labeled case), along with the correspondence between pointed graphs and graphs with partial functions (Proposition 28), the category is shown to be equivalent to the category described in Theorem 20: the relation < is defined on the vertices x, y of a graph G by x < y whenever there exists a path x ↠ y.

As in the case of the previous section, the forgetful functor pGraph → Graph admits a left adjoint, thus inducing a monad on Graph.
The category pGraph is equivalent to the Kleisli category associated to this monad, which is closely related to the exception monad as discussed in [3].

E Modeling repositories

We briefly detail here the modeling of repositories evoked in Section 7. As explained in the introduction, repositories can be modeled as partially ordered sets of patches, i.e. of morphisms in L. Since some of them can be incompatible, it is natural to model repositories as particular labeled event structures.

Definition 33. An event structure (E, ≤, #) consists of a set E of events, a partial order relation ≤ on E and an incompatibility (conflict) relation # on events. We require that

1. for any event e, the downward closure of {e} is finite, and
2. given e₁, e′₁ and e₂ such that e₁ ≤ e′₁ and e₁ # e₂, we have e′₁ # e₂.

Two events e₁ and e₂ are compatible when they are not incompatible, and independent when they are compatible and neither e₁ ≤ e₂ nor e₂ ≤ e₁. A configuration x is a finite downward-closed set of pairwise compatible events. An event e₂ is a successor of an event e₁ when e₁ ≤ e₂ and there is no event in between. Given an event e, we write ↓e for the configuration, called the cause of e, obtained as the downward closure of {e} with e removed. A morphism of event structures f : (E, ≤, #) → (E′, ≤′, #′) is an injective function f : E → E′ such that the image of a configuration is a configuration. We write ES for the category of event structures.

To every event structure E, we can associate a trace graph T(E) whose vertices are the configurations and whose edges are of the form x −e→ x ⊎ {e}, where x is a configuration such that e ∉ x and x ⊎ {e} is a configuration. A trace is a path x ↠ y in this graph. Notice that any two paths x ↠ y (with the same source and target) have the same length. Moreover, given two configurations x and y such that x ⊆ y, there necessarily exists a path x ↠ y. It can be shown that this operation provides a faithful embedding T : ES → Graph from the category of event structures to the category of graphs, which admits a right adjoint.

Example 34. An event structure with five events is pictured on the left (arrows represent causal dependencies and ∼ incompatibilities); the associated trace graph is pictured on the right.

[figures omitted: the event structure with events a, b, c, c′, d, where b, c and c′ depend on a, d depends on b and c, and c ∼ c′; its trace graph, whose vertices are the eight configurations ∅, {a}, {a,b}, {a,c}, {a,c′}, {a,b,c}, {a,b,c′} and {a,b,c,d}]

Definition 35. A categorical event structure (E, λ) in a category C with an initial object consists of an event structure E equipped with a labeling functor λ : (T E)* → C, where (T E)* is the free category generated by the graph T E, such that λ∅ is the initial object of C and the image under λ of every square

[diagram omitted: the square formed by x −e₁→ y₁ −e₂→ z and x −e₂→ y₂ −e₁→ z]

in T E is a pushout in C.

The following proposition shows that a categorical event structure is characterized by a suitable labeling of the events of E by morphisms of C.

Proposition 36. The functor λ is characterized, up to isomorphism, by the images of the transitions ↓e −e→ ↓e ⊎ {e}.

We can now define a repository to be simply a finite categorical event structure (E, ≤, #, λ : T(E) → L). Such a repository extends to a categorical event structure (E, ≤, #₀, I ∘ λ : T(E) → P), where #₀ is the empty conflict relation. The state S of such an event structure is the file obtained as the image S = I ∘ λ(E) of the maximal configuration: this is the file that the user is currently editing, given his repository.
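As a sanity check on Definition 33 and Example 34, here is a small illustrative Python sketch of ours (the encoding is an assumption: causes are given explicitly and the inherited conflict pair is listed by hand) enumerating the configurations and trace-graph edges of the five-event structure above.

    # Computing the trace graph T(E) of the event structure of Example 34.
    from itertools import combinations

    events = {"a", "b", "c", "c'", "d"}
    below = {"a": set(), "b": {"a"}, "c": {"a"}, "c'": {"a"},
             "d": {"a", "b", "c"}}          # strict causes of each event
    conflict = {frozenset({"c", "c'"}), frozenset({"c'", "d"})}  # inherited too

    def is_configuration(x):
        downward = all(below[e] <= x for e in x)
        compatible = all(frozenset(p) not in conflict for p in combinations(x, 2))
        return downward and compatible

    configs = [set(x) for r in range(len(events) + 1)
               for x in combinations(sorted(events), r) if is_configuration(set(x))]
    edges = [(x, e, x | {e}) for x in configs for e in events - x
             if is_configuration(x | {e})]
    print(len(configs), len(edges))   # 8 configurations, as in Example 34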
Usual operations on repositories can be modeled in this context; for instance, importing the patches of another repository is obtained by a pushout construction (the category of repositories is finitely cocomplete).
# The most confusing git terminology

n.b. This blog post dates from 2012, so some of it may be out of date now.

To add my usual disclaimer to the start of these blog posts, I should say that I love git; I think it's a beautiful and elegant system, and it saves me huge amounts of time in my daily work. However, I think it's a fair criticism of the system that its terminology is very confusing for newcomers, and in particular those who have come from using CVS or Subversion.

This is a personal list of some of my "favourite" points of confusion, which I've seen arise time and time again, both in real life and when answering questions on Stack Overflow. To be fair to all the excellent people who have contributed to git's development, in most cases it's clear that they are well aware that these terms can be problematic, and are trying to improve the situation subject to compatibility constraints. The problems that seem most bizarre are those that reuse CVS and Subversion terms for completely different concepts – I speculate a bit about that at the bottom.

"update"

If you've used Subversion or CVS, you're probably used to "update" being a command that goes to the remote repository and incorporates changes from the remote version into your local copy – this is (very broadly) analogous to "git pull". So, when you see the following error message when using git:

    foo.c: needs update

… you might imagine that this means you need to run "git pull". However, that's wrong. In fact, what "needs update" means is approximately: "there are local modifications to this file, which you should probably commit or stash".

"track" and "tracking"

The word "track" is used in git in three senses that I'm aware of. This ambiguity is particularly nasty, because the latter two collide at a point in learning the system where newcomers to git are likely to be baffled anyway. Fortunately, this seems to have been recognized by git's developers (see below).

1. "track" as in "untracked files"

To say that a file is tracked in the repository appears to mean that it is either present in the index or exists in the commit pointed to by HEAD. You see this usage most often in the output of "git status", where it will list "untracked files":

    # On branch master
    # Untracked files:
    #   (use "git add <file>..." to include in what will be committed)
    #
    #       .classpath

This sense is relatively intuitive, I think – it was only after complaining for a while about the next two senses of "track" that I even remembered that there was also this one :)

2. "track" as in "remote-tracking branch"

As a bit of background, you can think of a remote-tracking branch as a local cache of the state of a branch in a remote repository. The most commonly seen example is origin/master, or, to name that ref in full, refs/remotes/origin/master. Such branches are usually updated by git fetch (and thus also potentially by git pull). They are also updated by a successful push to the branch in the remote repository that they correspond to. You can merge from them, examine their history, etc., but you can't work directly on them.

The sense of "track" in the phrase "remote-tracking branch" is indicating that the remote-tracking branch is tracking the state of the branch in the remote repository the last time that remote-tracking branch was updated. So, you might say that refs/remotes/origin/master is tracking the state of the branch master in origin.

The "tracking" here is defined by the refspec in the config variable remote.<remote-name>.fetch and the URL in the config variable remote.<remote-name>.url.
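For example, you can inspect that configuration directly (the remote URL below is a made-up placeholder; the refspec shown is git's default):

    $ git config remote.origin.url
    git@example.com:project.git
    $ git config remote.origin.fetch
    +refs/heads/*:refs/remotes/origin/*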
3. "track" as in "git branch --track foo origin/bar" and "Branch foo set up to track remote branch bar from origin"

Again, if you want to do some work on a branch from a remote repository, but want to keep your work separate from everything else in your repository, you'll typically use a command like the following (or one of its many "Do What I Mean" equivalents):

    git checkout --track -b foo origin/bar

… which will result in the following messages:

    Branch foo set up to track remote branch bar from origin
    Switched to a new branch 'foo'

The sense of "track" both in the command and the output is distinct from the previous sense – it means that config options have been set that associate your new local branch with another branch in the remote repository. The documentation sometimes refers to this relationship as making bar in origin "upstream" of foo. This "upstream" association is very useful, in fact: it enables nice features like being able to just type git pull while you're on branch foo in order to fetch from origin and then merge from origin/bar. It's also how you get helpful messages about the state of your branch relative to the remote-tracking branch, like "Your branch foo is 24 commits ahead of origin/bar and can be fast-forwarded".

The tracking here is defined by the config variables branch.<branch-name>.remote and branch.<branch-name>.merge.

"tracking" summary

Fortunately, the third sense of "tracking" seems to be being carefully deprecated – for example, one of the possible options for push.default used to be tracking, but this is now deprecated in favour of the option name upstream. The commit message for 53c403116 says:

    push.default: Rename 'tracking' to 'upstream'

    Users are sometimes confused with two different types of "tracking"
    behavior in Git: "remote-tracking" branches (e.g. refs/remotes/*/*)
    versus the merge/rebase relationship between a local branch and its
    @{upstream} (controlled by branch.foo.remote and branch.foo.merge
    config settings).

    When the push.default is set to 'tracking', it specifies that a branch
    should be pushed to its @{upstream} branch. In other words, setting
    push.default to 'tracking' applies only to the latter of the above two
    types of "tracking" behavior.

    In order to make this more understandable to the user, we rename the
    push.default == 'tracking' option to push.default == 'upstream'.

    push.default == 'tracking' is left as a deprecated synonym for
    'upstream'.

"commit"

In CVS and Subversion, "commit" means to send your changes to the remote repository. In git, the action of committing (with "git commit") is entirely local; the closest equivalent of "cvs commit" is "git push". In addition, the word "commit" in git is used as both a verb and a noun (although frankly I've never found this confusing myself – when you commit, you create a commit).

"checkout"

In CVS and Subversion, "checkout" creates a new local copy of the source code that is linked to that repository. The closest command in git is "git clone". However, in git, "git checkout" is used for something completely distinct. In fact, it has two largely distinct modes of operation:

1. To switch HEAD to point to a new branch or commit, in the usage git checkout <branch>. If <branch> is genuinely a local branch, this will switch to that branch (i.e. HEAD will point to the ref name) or, if it otherwise resolves to a commit, will detach HEAD and point it directly to the commit's object name.

2. To replace a file or multiple files in the working copy and the index with their content from a particular commit or the index. This is seen in the usages git checkout -- <paths> (update from the index) and git checkout <tree-ish> -- <paths> (where <tree-ish> is typically a commit).

(git checkout is also frequently used with -b, to create a new branch, but that's really a sub-case of usage 1.)

In my ideal world, these two modes of operation would have different verbs, and neither of them would be "checkout".

Update in 2023: This has now happened – you can now use "git switch" for the former cases and "git restore" for the latter. This is a very welcome change :)
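For instance (reusing the hypothetical branch names from above), the two newer commands split the two modes of operation cleanly:

    $ git switch foo                             # switch to the branch 'foo'
    $ git switch -c new-branch                   # create a branch and switch to it
    $ git restore -- foo.c                       # restore foo.c from the index
    $ git restore --source=origin/bar -- foo.c   # restore foo.c from a commit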
"HEAD" and "head"

There are usually many "heads" (lower-case) in a git repository – the tip of each branch is a head. However, there is only one HEAD (upper-case), which is a symbolic ref that points to the current branch or commit.

"fetch" and "pull"

I wasn't aware of this until Roy Badami pointed it out, but it seems that git and Mercurial have opposite meanings for "fetch" and "pull" – see the top two lines in this table of git / hg equivalences. I think it's understandable that, since git's and Mercurial's development were more or less concurrent, such unfortunate clashes in terminology might occur.

"push" and "pull"

"git pull" is not the opposite of "git push"; the closest there is to an opposite of "git push" is "git fetch".

"hash", "SHA1", "SHA1sum", "object name" and "object identifier"

These terms are often used synonymously to mean the 40-character hexadecimal strings that uniquely identify objects in git. "Object name" seems to be the most official, but the least used in general. Referring to an object name as a SHA1sum is potentially confusing, since the object name for a blob is not the same as the SHA1sum of the file.

"remote branch"

This term is only used occasionally in the git documentation, but it's one that I would always try to avoid, because it tends to be unclear whether you mean "a branch in a remote repository" or "a remote-tracking branch". Whenever a git beginner uses this phrase, I think it's worth clarifying this, since it can avoid later confusion.

"index", "staging area" and "cache"

As nouns, these are all synonyms, which all exist for historical reasons. Personally, I like "staging area" the best, since it seems to be the easiest concept to understand for git beginners, but the other two are used more commonly in the documentation.

When used as command options, --index and --cached have distinct and consistent meanings, as explained by Junio C. Hamano in this useful blog post.
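A quick way to see the noun sense in action: `git diff` compares the working tree with the index, while `git diff --cached` (or its synonym `--staged`) compares the index with HEAD. For example:

    $ git add foo.c
    $ git diff            # empty: the change now lives in the index
    $ git diff --cached   # shows the staged change relative to HEAD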
Why are there so many of these points of confusion?

I would speculate that the most significant effect that contributed to these terminology confusions is that git was being actively used by an enthusiastic community from very early in its development, which means that early names for concepts have tended to persist for the sake of compatibility and consistency. That doesn't necessarily account for the many conflicts with CVS / Subversion usage, however.

To be fair to git, thinking up verbs for particular commands in any software is tough, and there have been enough version control systems written that to completely avoid clashes would lead to some convoluted choices. However, it's hard to see git's use of CVS / Subversion terminology for completely different concepts as anything but perverse. Linus has made it very clear many times that he hated CVS, and joked that a design principle for git was WWCVSND (What Would CVS Never Do); I'm sympathetic to that, as I'm sure most are, especially after having switched to the DVCS mindset. However, could that attitude have extended to deliberately disregarding concerns about terminology that might make it actively harder for people to migrate to git from CVS / Subversion? I don't know nearly enough about the early development of git to know. However, it wouldn't have been tough to find better choices for commit, checkout and update in each of their various senses.

Posted 2012-05-07 in git by mark

21 responses to "The most confusing git terminology"

Dmitriy Matrosov, 2012-06-09:

Hi, Mark. Thanks for your excellent explanation of "tracking" branches! This is exactly what I searched for. Also, I think there is a typo: you wrote "The documentation sometimes refers to this relationship as making foo in origin "upstream" of bar.", but it should be "bar in origin upstream of foo".

mark, 2012-06-11:

Hi Dmitriy: thanks for your kind comment and the correction. I've edited the post to fix that now.

harold, 2012-10-05:

Really good article – a lot of these terms are thrown about in git documentation/tutorials without much explanation and can certainly trip you up.

ahmet, 2012-11-26:

"Why are there so many of these points of confusion?" Probably because Linus is a dictator and he probably didn't give a shit about others' opinions and past works for the terminology.

Bri, 2014-08-14:

Right! He forgot to include the input of someone who knows an inkling about user-friendliness. Imagine how much easier GIT SWITCH would be to remember than GIT CHECKOUT. The whole system was made unnecessarily complicated just by having awfully named functions.

Alberto Fonseca, 2018-03-22:

Good point. Another reason is probably the agile paradigm that tells us that only "working code" is important, so developers happily cast aside anything even remotely resembling logical, structured, conceptual work, which works only based on a precisely defined set of domain vocabulary (which is exactly what we're missing here).

Jennifer, 2013-02-05:

I find the term upstream confusing. You mention one usage in this article. However, GitHub help recommends "To keep track of the original repo [you forked from], you need to add another remote named upstream." So when someone says to push upstream, it's ambiguous. I'm not sure if this usage is typical of Git in general or just GitHub, but it sure left me confused for a while.

Ian, 2013-11-21:

""remote branch" … it tends to be unclear whether you mean "a branch in a remote repository" or "a remote-tracking branch". Whenever a git beginner uses this phrase, I think it's worth clarifying this, since it can avoid later confusion."

Doh! That's exactly what I'm trying to find out, but then you don't actually give us the answer!

admin, 2013-11-21:

Hi Ian, that's the thing – you often can't tell without context. Where did you see "remote branch" used, or in what context? ("A branch in a remote repository" is the more accurate interpretation, I think, but people who don't understand remote-tracking branches might use the term differently.)

Oleh, 2014-05-28:

Thanks for the helpful git articles.
I'm a git noob, and the more different sources I read, the closer my understanding converges on what's really going on. It's just a shame that numerous git commands have ended up being labels where one must simply memorize what they really do – might as well have called them things like "abc" and "xyz" instead of terms that initially lead one astray.

If you're motivated to write another one, here's a suggestion. Maybe you could review all the places that git stores stuff – working repository, index (aka cache or staging area), remote repository, local copy of a remote repository (remote-tracking branch?), stash (and stash versions), etc. – and list the commands that transfer data in various ways among these places. I've had to gradually formulate and revise these ideas in my own mind as I've been reading about git, and if I had an up-front explanation of this at the outset, it would've saved me lots of time.

mark, 2014-06-03:

Thanks for the kind words and your suggestion, Oleh. I'll certainly consider it – there are quite a few good diagrammatic representations of the commands that change what's stored in the index, working tree, etc., but maybe there's something more to do. (You can find some good examples with an image search for git index diagram.)

bryan chance, 2017-12-06:

Oleh, that would be a great write-up. I think those are the rest of the most confusing git terminology. :p I'm new to git as well, and I thought I was just stupid because I couldn't get a handle on it. This is 2017 and yes, git terminologies still baffle users.

Savanna, 2018-11-13:

I've been using git for years and I'm only just now diving into how it actually works and what the commands actually mean, since I'm working at a place that has a rather complicated branching structure… in the past I simply worked on master, since I was the only or one of few developers. git pull and git push were usually enough. Now I find myself often totally confused and realized I don't actually know what the heck I'm doing with git, so I'm reading all these articles and watching videos. It's been too long coming! Haha

Alistair, 2014-11-12:

"The sense of "track" in the phrase "remote-tracking branch" is indicating that the remote-tracking branch is tracking the state of the branch in the remote repository the last time that remote-tracking branch was updated."

In the light of your paragraph about "update", maybe "fetched" would be a better word than "updated"?

"3. "track" as in "git branch --track foo origin/bar""

I don't understand how senses 2 and 3 are different. Specifically, sense 2 seems to be the special case of 3 where "foo" and "bar" coincide. I.e. the "upstream" of a remote-tracking branch is the branch of the same name on "origin". Is that right?

"commit … as a noun"

I agree. This terminology is fine, and not really confusing at all. The closest English word would be "commitment", I think, but in the context it is good to have a slightly different word.

""fetch" and "pull" […] "push" and "pull""

This juxtaposition is hilarious.
In the former you are very diplomatic in suggesting that git's version is an arbitrary choice, made by historical accident, but then in the latter you immediately make it look like exactly the opposite choice would be better.

admin, 2014-12-30:

"In the light of your paragraph about "update", maybe "fetched" would be a better word than "updated"?"

I see what you mean, but I think it would probably be more confusing to use "fetched"; the colloquial sense of update is what I mean, and the "needs update" sense in git is surprising and wouldn't make sense in this context anyway.

"I don't understand how senses 2 and 3 are different."

In sense 2 ("remote-tracking branch") the tracking can only be about this special type of branch tracking the branch in a remote repository; very often there's no corresponding "local" branch (as opposed to a remote-tracking branch). In sense 3 ("--track") there must be a local branch, and the tracking is describing an association between the local branch and one in the remote repository (and usually to a corresponding remote-tracking branch).

Philip A, 2015-04-01:

Thanks so much for this really useful and informative blog post, which I have only recently discovered. I think focusing in on the confusing aspects of Git is a great way to improve one's general understanding. Particularly helpful are your explanations of 'remote-tracking branches' compared with 'branches which track branches on a remote'. One very minor thing which confused me was: "Your branch foo is 24 commits ahead of origin/bar and can be fast-forwarded". Shouldn't it be 'behind' instead of 'ahead'? Apologies if I have misunderstood. Anyway, once again … really good post :-)

Matthew Astley, 2016-02-17:

Thanks – good list, and several of us here misuse the jargon.

When describing the state of the work tree: clean, up-to-date and unmodified are easily mixed up.

Going by the command name, clean should mean no untracked files (aka cruft) in the work tree.

Unmodified should mean no changes to tracked files, but the command to do this is reset – another naming clash?

Up-to-date is trickier to define. There may be copies elsewhere containing commits you can't yet see. Is the HEAD (local branch) up-to-date with one remote at a point in time when the fetch ran? Should all local branches be up-to-date?

Mike Weilgart, 2016-04-16:

This is an excellent article; thank you!

Particularly helpful was the explanation of the three senses in which the word "track" may be used. The light went on when I realized that there are actually *three* things going on when you deal with a remote, not just two:

1. The branch on the remote repository;
2. The remote-tracking branch in your local repository;
3. The local branch associated with the remote-tracking branch.

This clarified for me how I should go about instructing students in this aspect of Git; thank you!

Simon Bagley, 2016-06-02:

Thank you for the very informative article. I am a complete novice with Git, and your article will help me understand the many very confusing names of commands, options, etc. It would be useful if you defined the terms 'refs' and 'refspec' before you used them.

Doug Kimzey, 2019-09-06:

I could not agree more. I have been using git for about 3 months now.
Git is not a full-blown language but a source control tool. The ambiguity of the git syntax contributes greatly to the confusion experienced by developers facing git for the first time. The concepts are not difficult, but the ambiguity of the git syntax constantly forces you to reach for a cheat sheet: the commands do not clearly describe their corresponding actions. Source control is a crucial part of any software project, and it is not an area that should be managed by confusing and ambiguous terminology and syntax. Confusion puts work at risk and adds time to projects. Commands that concisely describe the actions taken on source control are very important.

Thousands of developers use git. Millions of Americans use IRS tax forms. This does not mean that millions of Americans find IRS tax forms clear and easy to understand. Git terminology unnecessarily costs time and effort in translation.

Doug Kimzey, 2023-04-11:

This is very well said. If the git syntax were concise:

- There would be far fewer books on Amazon.
- The number of upvotes on git questions on Stack Overflow would be much smaller (some of these are literally in the thousands).
- There would be fewer cheat sheets and diagrams.
- The investment in time and effort would be much less.

The git documentation is poor because one vague command is described in terms of several ambiguous and confusing commands. Grammar is not the only measure of the quality of documentation.

I am adopting some of the rules in the Simplified Technical English standard (ASD-STE100) for the naming of commands used in console applications and utilities. The goals of this standard are to:

- Reduce ambiguity.
- Improve the clarity of technical text.
- Make user manuals more comprehensible for non-native speakers of English.
- Create better conditions for both human and machine translation.

These goals should be carried over to command syntax.
Thanks for having me up here. I am Matthew McCullough, a frequent speaker on Git, and when flying around I'm even willing to help people through the aisles, fasten their seat belts, and know what they need to do in case of an emergency. I have an octocat named after my daughter. But the trouble that we're talking about here today takes two forms, and it has a name. That trouble is named Git.

I apologize for having proliferated it so much at this point, because when thinking about it in two forms, from the positive and the negative side, the happy people say it has distributed capabilities: I can work when the network is offline; it's robust; it saves to disk; I can independently version things; I don't need any connectivity at all to make this function. At the surface that sounds awesome, but I remember back to the time when I had to drive in, go to a cubicle, and sit down at that specific terminal to check in code, and there's something nice about those fluorescent lights and that wonderful elevator music that played while I was there.

And then the Git proponents say, well, it's fault-tolerant, and it's network-enabled in the sense that you can easily work when it's online or off. It's like a mobile device: it caches your results. And this simply means that we can work whether or not we have the network. But I find this, in this hard economic time that we have today, to be a reason to stop buying hardware, which is sick, when this economy needs your help the most.

But when you think about the positive side, they then come back and say it's low-maintenance: you don't really have to do much with it; every five thousand objects it automatically garbage-collects, so in that sense it's beautiful, it's hands-off, it seems like it just basically takes care of itself. But what that really means is that it's taking away the opportunity of my friend Terence, who hopes someday to graduate with an SVN admin job. What's he going to do when he graduates and there's no more SVN to maintain? I don't know. Think about Terence.

They also claim it's blazing fast: you can commit three thousand objects, watch this, over my tethered device; it goes up into the cloud, and seven seconds later it's beautifully committed and visible on GitHub in the user interface. Can you do that with your version control system? Isn't that fantastic? But you know what? I remember the days when we could talk for 30 minutes with ClearCase about who was going to win Survivor, and I enjoyed that bonding with guys like Tad and Ryan. I don't have it anymore.

And in fact then they come back and say it's still fine, because it's open source and free, and it's all this liberty that we have at a conference like Open Source, embodied in a tool that cross-cuts all the possible languages that we might talk about at a conference like this. But you know what? If you don't fund software development, if you don't give them your heart in dollars, you might not have ClearCase in a few years. And can you imagine what it would be like to live in a world without ClearCase at your disposal? Do you want to live in that world?

They say it's disk-conservative: it only takes a small amount of hard disk to store every version ever, of every branch that you've ever committed, for anyone on the team, with any tag. But you know what? In my box is an amazing hard disk, a solid-state flash hard disk, and another one in another box with perpendicular bits, and that R&D didn't come from not using disks like wild. So think about what your next disk will be, and whether you're ruining that chance with Git.

Then, I used to commit a lot of bugs, and here's one of the things from the proponents of Git:
they'd say you could search for bugs and find where the regression happened, exactly what commit brought that into existence. Isn't that wonderful? You can isolate and find what you did wrong. But I remember the days when my friends Tad and Ryan would help support me with 150-comment commits to help drown that bug, so when that engineer wanted to go find it, the hope of that was zero, just like I wanted it to be.

But then they still keep coming back and saying it's like cats and dogs living together, the compatibility is wonderful: we can talk to ClearCase and to Subversion and Perforce with these back-and-forth conversion and round-tripping utilities, Subversion being the special mark there. But you know what? This is a bridge for them to be using tools covertly that are not approved by the IT organization, and that leads to all kinds of other havoc in every frame. Soon we'll be using libraries you don't even understand what they are.

They claim it's free as in beer, as their last point: it helps in hard times with keeping a minimal budget, really allowing us to have a great tool at a small cost that doesn't need maintenance, and that helps us all. But I bet you, if you keep on this track, one day somebody's even going to take it to the edge, saying they're going to have a free operating system, and my word, I can't even think what chaos that'll bring to the table. Do you want that? Do you want a free operating system? I don't think so.

So I ask you, even though I've taught Git for this long: reconsider. Salvage it for the sake of the people in college today. Stop using Git. You're ruining their lives, their hopes, their dreams, their future, and it needs to end now.

Thank you.
# Document Title

badmerge -- abstract version

See also the concrete example with real C code where darcs merge does the right thing and the others do the wrong thing. See also an attempted counter-example where darcs merge allegedly does the wrong thing, along with my argument that this is not actually a counter-example.

```
  a
 / \
b1  c1
|
b2
```

Now the task is to merge b2 and c1. As far as I know, all revision control tools except for darcs, Codeville, SCCS, and perhaps BitKeeper treat this as a 3-merge of (a, b2, c1).

The git or svn 3-merge algorithm and the GNU diff3 algorithm apply the changed line from c1 to the wrong place in b2, resulting in an apparently clean but actually wrong result.

merged by SVN v1.2.0, GNU diff3 v3.8.1, or git v1.5.6.3

Whereas darcs applies the changed line from c1 to the right place in b2, resulting in a clean merge with the correct result.

merged by darcs v1

The difference between what svn does and what darcs does here is, contrary to popular belief, not that darcs makes a better or luckier guess as to where the line from c1 should go, but that darcs uses information that svn does not use -- namely the information contained in b1 -- to learn that the location has moved and precisely to where it has moved.

Any algorithm which merges c1 with b2 by examining (a, b2, c1) but without reference to the information in b1 is doomed to do worse than darcs does in this sort of case. (Where a silent but incorrect merge, such as SVN currently produces, is worse than raising it to the user as a conflict, which is worse than merging it correctly.)

It is important to understand that darcs merge performs better in this example not because it is "luckier" or "more complicated" than 3-merge (indeed, darcs merge is actually a much simpler algorithm than 3-merge for this example) but because darcs merge takes advantage of more information -- information which is already present in other systems such as SVN but which is not used in the current SVN merge operation.

code listings

listing a, abstract

```
A
B
C
D
E
```

listing diff from a to b1, abstract

```
@@ -1,3 +1,6 @@
+G
+G
+G
 A
 B
 C
```

listing b1, abstract

```
G
G
G
A
B
C
D
E
```

listing diff from b1 to b2, abstract

```
@@ -1,3 +1,8 @@
+A
+B
+C
+D
+E
 G
 G
 G
```

listing b2, abstract

```
A
B
C
D
E
G
G
G
A
B
C
D
E
```

listing diff from a to c1, abstract

```
@@ -1,5 +1,5 @@
 A
 B
-C
+X
 D
 E
```

listing c1, abstract

```
A
B
X
D
E
```

listing 3-way merged, abstract

```
A
B
X
D
E
G
G
G
A
B
C
D
E
```

listing darcs merged, abstract

```
A
B
C
D
E
G
G
G
A
B
X
D
E
```

Thanks to Nathaniel J. Smith for help with this example.

Thanks to Ken Schalk for finding bugs in this web page.

zooko
Last modified: Sat Jan 10 15:04:11 MST 2009
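The two outcomes above are easy to reproduce. The Python sketch below is not darcs' actual patch algebra; the hunk representation and the `apply_hunk` and `commute_past_insert` helpers are invented here purely for illustration. It shows a position-blind application of c1's hunk to b2 silently succeeding at the wrong line, while commuting the hunk through the a→b1 and b1→b2 insertions first lands it on the right one.

```python
# A minimal sketch of why the intermediate patches matter.  The hunk
# format and both helper functions are invented for this illustration;
# darcs' real patch theory is considerably more general.

a  = list("ABCDE")              # common ancestor
b1 = list("GGGABCDE")           # a  -> b1: insert G,G,G at line 0
b2 = list("ABCDEGGGABCDE")      # b1 -> b2: insert A..E at line 0
c1 = list("ABXDE")              # a  -> c1: replace line 2, C -> X

hunk = (2, "C", "X")            # c1's change, recorded against `a`

def apply_hunk(text, hunk):
    """Replace one line, checking the old text is really there."""
    i, old, new = hunk
    assert text[i] == old, f"context mismatch at line {i}"
    return text[:i] + [new] + text[i + 1:]

assert apply_hunk(a, hunk) == c1   # sanity check against the ancestor

# A position-blind merge applies the hunk to b2 at the position it had
# in `a`.  Line 2 of b2 happens to hold a "C" as well (the freshly
# inserted copy), so the merge is silently wrong:
print("".join(apply_hunk(b2, hunk)))          # ABXDEGGGABCDE

def commute_past_insert(hunk, at, count):
    """Shift a hunk's position past an insertion of `count` lines."""
    i, old, new = hunk
    return (i + count, old, new) if at <= i else hunk

# Commuting the hunk through a->b1 and then b1->b2 updates its position
# to wherever the changed line has actually moved:
hunk = commute_past_insert(hunk, at=0, count=3)   # now line 5 in b1
assert b1[hunk[0]] == "C"
hunk = commute_past_insert(hunk, at=0, count=5)   # now line 10 in b2
print("".join(apply_hunk(b2, hunk)))          # ABCDEGGGABXDE
```

The sketch makes the article's point concrete: the corrected position comes entirely from the intermediate patches, i.e. from the information in b1 that a 3-merge over (a, b2, c1) never consults.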
- https://www.boringcactus.com/2021/02/22/can-we-please-move-past-git.html- https://jneem.github.io/merging/- https://jneem.github.io/pijul/- https://jneem.github.io/cycles/- https://jneem.github.io/ids/- https://jneem.github.io/pseudo/- https://pod.link/1602572955/episode/8070d95c8c25fd131709f03f6495589a- https://nibblestew.blogspot.com/2020/12/some-things-potential-git-replacement.html- https://v5.chriskrycho.com/essays/jj-init/- https://tildes.net/~comp/aqk/what_other_version_control_systems_do_people_use_other_than_git- https://arxiv.org/abs/1311.3903- https://stackoverflow.blog/2023/05/23/for-those-who-just-dont-git-it-ep-573/- https://foojay.io/today/foojay-podcast-26/- https://www.youtube.com/watch?v=7MpdZkGj5AI- https://www.youtube.com/watch?v=M4KktA_jbOE- https://www.youtube.com/watch?v=o0ooKVikV3c- https://www.youtube.com/watch?v=lW0gxMbyLEM- https://www.youtube.com/watch?v=bx_LGilOuE4- https://jvns.ca/blog/2023/12/31/2023--year-in-review/#a-lot-of-blog-posts-about-git- https://stackoverflow.blog/2023/01/09/beyond-git-the-other-version-control-systems-developers-use/- https://katafrakt.me/2017/05/27/beyond-git/- https://www.fossil-scm.org/home/doc/trunk/www/fossil-v-git.wiki- https://www.youtube.com/watch?v=WKVX7xq58kA- https://www.youtube.com/watch?v=ghtpJnrdgbo- https://www.youtube.com/watch?v=jPSOxVjK8A0- https://www.youtube.com/watch?v=584sAUbHU1o- https://www.youtube.com/watch?v=o4PFDKIc2fs- https://www.youtube.com/watch?v=2XQz-x6wAWk- https://www.youtube.com/watch?v=7d3Qnr5F9zs- https://jesseduffield.com/Lazygit-5-Years-On/- https://www.forrestthewoods.com/blog/dependencies-belong-in-version-control/- https://dev.to/yonkeltron/is-it-time-to-look-past-git-ah4- https://longair.net/blog/2012/05/07/the-most-confusing-git-terminology- https://mitchellh.com/writing/github-changesets- https://www.youtube.com/watch?v=RPmeZH8sOs8- https://matklad.github.io/2023/10/23/unified-vs-split-diff.html- https://blog.waleedkhan.name/bringing-revsets-to-git/- https://blog.waleedkhan.name/git-ui-features/- https://blog.waleedkhan.name/in-memory-rebases/- https://engineering.fb.com/2022/11/15/open-source/sapling-source-control-scalable/- http://darcs.net/RosettaStone- https://en.wikipedia.org/wiki/Comparison_of_version-control_software- http://www.cheat-sheets.org/#Bazaar- http://www.cheat-sheets.org/#Git- http://www.cheat-sheets.org/#CVS- https://meldmerge.org/- https://duckrowing.com/2013/12/26/bzr-init-a-bazaar-tutorial/- https://beza1e1.tuxen.de/monorepo_vcs.html- https://www.inkandswitch.com/local-first/- https://mattweidner.com/2023/09/26/crdt-survey-1.html- https://inria.hal.science/hal-02983557/document- https://mattweidner.com/2022/02/10/collaborative-data-design.html- https://crdt.tech/papers.html- https://inria.hal.science/hal-00738680/PDF/RR-8083.pdf- https://lewiscampbell.tech/sync.html- https://initialcommit.com/blog/pijul-creator- https://tahoe-lafs.org/~zooko/badmerge/simple.html- https://ohshitgit.com/- https://nest.pijul.com/tae/pijul-for-git-users- https://www.youtube.com/watch?v=ahyF8e9qKBc- https://github.com/git-game/git-game