# Fossil Versus Git

## 1.0 Don't Stress!

The feature sets of Fossil and Git overlap in many ways. Both are distributed version control systems which store a tree of check-in objects to a local repository clone. In both systems, the local clone starts out as a full copy of the remote parent. New content gets added to the local clone and then later optionally pushed up to the remote, and changes to the remote can be pulled down to the local clone at will. Both systems offer diffing, patching, branching, merging, cherry-picking, bisecting, private branches, a stash, etc.

Fossil has inbound and outbound Git conversion features, so if you start out using one DVCS and later decide you like the other better, you can easily move your version-controlled file content.¹

In this document, we set all of that similarity and interoperability aside and focus on the important differences between the two, especially those that impact the user experience.

Keep in mind that you are reading this on a Fossil website, and though we try to be fair, the information here might be biased in favor of Fossil, if only because we spend most of our time using Fossil, not Git. Ask around for second opinions from people who have used both Fossil and Git.

If you want a more practical, less philosophical guide to moving from Git to Fossil, see our Git to Fossil Translation Guide.

## 2.0 Differences Between Fossil And Git

Differences between Fossil and Git are summarized by the following table, with further description in the text that follows.

| GIT | FOSSIL | more |
|-----|--------|------|
| File versioning only | VCS, tickets, wiki, docs, notes, forum, chat, UI, RBAC | 2.1 ↓ |
| A federation of many small programs | One self-contained, stand-alone executable | 2.2 ↓ |
| Custom key/value data store | The most used SQL database in the world | 2.3 ↓ |
| Runs natively on POSIX systems | Runs natively on both POSIX and Windows | 2.4 ↓ |
| Bazaar-style development | Cathedral-style development | 2.5.1 ↓ |
| Designed for Linux kernel development | Designed for SQLite development | 2.5 ↓ |
| Many contributors | Select contributors | 2.5.2 ↓ |
| Focus on individual branches | Focus on the entire tree of changes | 2.5.3 ↓ |
| One check-out per repository | Many check-outs per repository | 2.6 ↓ |
| Remembers what you should have done | Remembers what you actually did | 2.7 ↓ |
| Commit first | Test first | 2.8 ↓ |
| SHA-1 or SHA-2 | SHA-1 and/or SHA-3, in the same repository | 2.9 ↓ |

## 2.1 Featureful

Git provides file versioning services only, whereas Fossil adds an integrated wiki, ticketing and bug tracking, embedded documentation, technical notes, a web forum, and a chat service, all within a single nicely-designed, skinnable web UI, protected by a fine-grained role-based access control system. These additional capabilities are available for Git as 3rd-party add-ons, but with Fossil they are integrated into the design, to the point that it approximates "GitHub-in-a-box."

Even if you only want straight version control, Fossil has affordances not available in Git.

For instance, Fossil can do operations over all local repo clones and check-out directories with a single command. You can say "fossil all sync" on a laptop prior to taking it off the network hosting those repos, such as before going on a trip. It doesn't matter if those repos are private and restricted to your company network or public Internet-hosted repos: you get synced up with everything you need while off-network. You get the same capability with several other Fossil sub-commands as well, such as "fossil all changes" to get a list of files that you forgot to commit prior to the end of your working day, across all repos.
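A minimal sketch of that pre-trip routine at a shell prompt; both sub-commands exist as shown, while the scenario in the comments is hypothetical:

```
# Before going offline: sync every repository this machine knows about.
fossil all sync

# At the end of the working day: list uncommitted edits across all check-outs.
fossil all changes
```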
Whenever Fossil is told to modify the local checkout in some destructive way (fossil rm, fossil update, fossil revert, etc.), Fossil remembers the prior state and is able to return the check-out directory to that state with a fossil undo command. While you cannot undo a commit in Fossil — on purpose! — as long as the change remains confined to the local check-out directory only, Fossil makes undo easier than Git does.

For developers who choose to self-host projects rather than rely on a 3rd-party service such as GitHub, Fossil is much easier to set up: the stand-alone Fossil executable together with a 2-line CGI script suffice to instantiate a full-featured developer website. To accomplish the same using Git requires locating, installing, configuring, integrating, and managing a wide assortment of separate tools. Standing up a developer website using Fossil can be done in minutes, whereas doing the same using Git requires hours or days.
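For a sense of what "2-line CGI script" means, here is a sketch of the entire server-side glue, following the form in Fossil's server documentation; the paths are hypothetical, and the script simply needs to be executable and placed where a CGI-capable web server can run it:

```
#!/usr/bin/fossil
repository: /home/www/myproject.fossil
```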
Fossil is small, complete, and self-contained. If you clone Git's self-hosting repository, you get just Git's source code. If you clone Fossil's self-hosting repository, you get the entire Fossil website — source code, documentation, ticket history, and so forth.² That means you get a copy of this very article and all of its historical versions, plus the same for all of the other public content on this site.

## 2.2 Self Contained

Git is actually a collection of many small tools, each doing one small part of the job, which can be recombined (by experts) to perform powerful operations. Git has a lot of complexity and many dependencies, so most people end up installing it via some kind of package manager, simply because the creation of complicated binary packages is best delegated to people skilled in their creation. Normal Git users are not expected to build Git from source and install it themselves.

Fossil is a single self-contained stand-alone executable which depends only on common platform libraries in its default configuration. To install one of our precompiled binaries, unpack the executable from the archive and put it somewhere in your PATH. To uninstall it, delete the executable.

This policy is particularly useful when running Fossil inside a restrictive container, anything from classic chroot jails to modern OS-level virtualization mechanisms such as Docker. Our stock container image is under 8 MB when uncompressed and running. It contains nothing but a single statically-linked binary.

If you build a dynamically linked binary instead, Fossil's on-disk size drops to around 6 MB, and it's dependent only on widespread platform libraries with stable ABIs such as glibc, zlib, and openssl.

Full static linking is easier on Windows, so our precompiled Windows binaries are just a ZIP archive containing only "fossil.exe". There is no "setup.exe" to run.

Fossil is easy to build from sources. Just run "./configure && make" on POSIX systems and "nmake /f Makefile.msc" on Windows.

Contrast a basic installation of Git, which takes up about 15 MiB on Debian 10 across 230 files, not counting the contents of /usr/share/doc or /usr/share/locale. If you need to deploy to any platform where you cannot count on facilities like the POSIX shell, Perl interpreter, and Tcl/Tk platform needed to fully use Git being part of the base system, the full footprint of a Git installation extends to more like 45 MiB and thousands of files. This complicates several common scenarios: Git for Windows, chrooted Git servers, Docker images...

Some say that Git more closely adheres to the Unix philosophy, summarized as "many small tools, loosely joined," but we have many examples of other successful Unix software that violates that principle to good effect, from Apache to Python to ZFS. We can infer from this that it is not an absolute principle of good software design. Sometimes "many features, tightly-coupled" works better. What actually matters is effectiveness and efficiency. We believe Fossil achieves this.

The above size comparisons aren't apples-to-apples anyway. We've compared the size of Fossil with all of its many built-in features to a fairly minimal Git installation. You must add a lot of third-party software to Git to give it a Fossil-equivalent feature set. Consider GitLab, a third-party extension to Git wrapping it in many features, making it roughly Fossil-equivalent, though much more resource hungry and hence more costly to run than the equivalent Fossil setup. The official GitLab Community Edition container currently clocks in at 2.66 GiB!

GitLab's requirements are easy to accept when you're dedicating a local rack server or blade to it, since its minimum requirements are more or less a description of the smallest thing you could call a "server" these days, but when you go to host that in the cloud, you can expect to pay about 8 times as much to comfortably host GitLab as for Fossil.³ This difference is largely due to basic technology choices: Ruby and PostgreSQL vs. C and SQLite.

The Fossil project itself is hosted on a small and inexpensive VPS. A bare-bones $5/month VPS or a spare Raspberry Pi is sufficient to run a full-up project site, complete with tickets, wiki, chat, and forum, in addition to being a code repository.

## 2.3 Query Language

The baseline data structures for Fossil and Git are the same, modulo formatting details. Both systems manage a directed acyclic graph (DAG) of Merkle tree structured check-in objects. Check-ins are identified by a cryptographic hash of the check-in contents, and each check-in refers to its parent via the parent's hash.

The difference is that Git stores its objects as individual files in the .git folder or compressed into bespoke key/value pack-files, whereas Fossil stores its objects in a SQLite database file which provides ACID transactions and a high-level query language. This difference is more than an implementation detail. It has important practical consequences.

One notable consequence is that it is difficult to find the descendants of check-ins in Git. One can easily locate the ancestors of a particular Git check-in by following the pointers embedded in the check-in object, but it is difficult to go the other direction and locate the descendants of a check-in. It is so difficult, in fact, that neither native Git nor GitHub provide this capability short of crawling the commit log. With Fossil, on the other hand, finding descendants is a simple SQL query. It is common in Fossil to ask to see all check-ins since the last release. Git lets you see "what came before". Fossil makes it just as easy to also see "what came after".
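As a sketch of what such a query can look like: Fossil's repository schema includes a plink table recording each parent-to-child check-in link, and the fossil sql command opens the repository database directly. Assuming a hypothetical hash prefix, the immediate children of a check-in can be listed like this:

```
fossil sql "SELECT b.uuid
              FROM plink AS p JOIN blob AS b ON b.rid = p.cid
             WHERE p.pid = (SELECT rid FROM blob
                             WHERE uuid LIKE 'abcd1234%');"
```

Walking the full set of descendants is the same idea applied repeatedly, e.g. with a recursive common table expression.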
Leaf check-ins in Git that lack a "ref" become "detached," making them difficult to locate and subject to garbage collection. This detached head state problem has caused grief for many Git users. With Fossil, detached heads are simply impossible because we can always find our way back into the Merkle tree using one or more of the relations in the SQL database.

The SQL query capabilities of Fossil make it easier to track the changes for one particular file within a project. For example, you can easily find the complete edit history of this one document, or even the same history color-coded by committer. Both are simple SQL queries in Fossil, with procedural code only being used to format the result for display. The same result could be obtained from Git, but because the data is in a key/value store, much more procedural code has to be written to walk the data and compute the result. And since that is a lot more work, the question is seldom asked.

The ease of querying Fossil data using SQL means that status or history information about the project under management is easier to obtain. Being easier means that it is more likely to happen. Fossil reports tend to be more detailed and useful. Compare this Fossil timeline to its closest equivalent in GitHub. Judge for yourself: which of those reports is more useful to a developer trying to understand what happened?

The bottom line is that even though Fossil and Git are built around the same low-level data structure, the use of SQL to query this data makes the data more accessible in Fossil, resulting in more detailed information being available to the user. This improves situational awareness and makes working on the project easier.
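End users don't have to write SQL to benefit from this: the same queries surface through ordinary commands and web pages. Two command-line sketches, where the tag and file names are hypothetical (check your version's online help for the exact timeline filter syntax):

```
# Every check-in that came after the check-in tagged "release"
fossil timeline descendants release

# The complete change history of a single file
fossil finfo www/fossil-v-git.wiki
```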
## 2.4 Portable

Fossil is largely written in ISO C, almost purely conforming to the original 1989 standard. We make very little use of C99, and we do not knowingly make any use of C11. Fossil does call POSIX and Windows APIs where necessary, but it's about as portable as you can ask given that ISO C doesn't define all of the facilities Fossil needs to do its thing. (Network sockets, file locking, etc.) There are certainly well-known platforms Fossil hasn't been ported to yet, but that's most likely due to lack of interest rather than inherent difficulties in doing the port. We believe the most stringent limit on its portability is that it assumes at least a 32-bit CPU and several megs of flat-addressed memory.⁴ Fossil isn't quite as portable as SQLite, but it's close.

Over half of the C code in Fossil is actually an embedded copy of the current version of SQLite. Much of what is Fossil-specific, after you set SQLite itself aside, is SQL code calling into SQLite. The number of lines of SQL code in Fossil isn't large by percentage, but since SQL is such an expressive, declarative language, it makes an outsized contribution to Fossil's user-visible functionality.

Fossil isn't entirely C and SQL code. Its web UI uses JavaScript where necessary. The server-side UI scripting uses a custom minimal Tcl dialect called TH1, which is embedded into Fossil itself. Fossil's build system and test suite are largely based on Tcl.⁵ All of this is quite portable.

About half of Git's code is POSIX C, and about a third is POSIX shell code. This is largely why the so-called "Git for Windows" distributions (both first-party and third-party) are actually an MSYS POSIX portability environment bundled with all of the Git stuff, because it would be too painful to port Git natively to Windows. Git is a foreign citizen on Windows, speaking to it only through a translator.⁶

While Fossil does lean toward POSIX norms when given a choice — LF-only line endings are treated as first-class citizens over CR+LF, for example — the Windows build of Fossil is truly native.

The third-party extensions to Git tend to follow this same pattern. GitLab isn't portable to Windows at all, for example. For that matter, GitLab isn't even officially supported on macOS, the BSDs, or uncommon Linuxes! We have many users who regularly build and run Fossil on all of these systems.

## 2.5 Linux vs. SQLite

Fossil and Git promote different development styles because each one was specifically designed to support the creator's main software development project: Linus Torvalds designed Git to support development of the Linux kernel, and D. Richard Hipp designed Fossil to support the development of SQLite. Both projects must rank high on any objective list of "most important FOSS projects," yet these two projects are almost entirely unlike one another, so it is natural that the DVCSes created to support these projects also differ in many ways.

In the following sections, we will explain how four key differences between the Linux and SQLite software development projects dictated the design of each DVCS's low-friction usage path.

When deciding between these two DVCSes, you should ask yourself, "Is my project more like Linux or more like SQLite?"

### 2.5.1 Development Organization

Eric S. Raymond's seminal essay-turned-book "The Cathedral and the Bazaar" details the two major development organization styles found in FOSS projects. As it happens, Linux and SQLite fall on opposite sides of this dichotomy. Differing development organization styles dictate a different design and low-friction usage path in the tools created to support each project.

Git promotes the Linux kernel's bazaar development style, in which a loosely-associated mass of developers contribute their work through a hierarchy of lieutenants who manage and clean up these contributions for consideration by Linus Torvalds, who has the power to cherry-pick individual contributions into his version of the Linux kernel. Git allows an anonymous developer to rebase and push specific locally-named private branches, so that a Git repo clone often isn't really a clone at all: it may have an arbitrary number of differences relative to the repository it originally cloned from. Git encourages siloed development. Select work in a developer's local repository may remain private indefinitely.

All of this is exactly what one wants when doing bazaar-style development.

Fossil's normal mode of operation differs on every one of these points, with the specific designed-in goal of promoting SQLite's cathedral development model:

- **Personal engagement:** SQLite's developers know each other by name and work together daily on the project.

- **Trust over hierarchy:** SQLite's developers check changes into their local repository, and these are immediately and automatically synchronized up to the central repository; there is no "dictator and lieutenants" hierarchy as with Linux kernel contributions. D. Richard Hipp rarely overrides decisions made by those he has trusted with commit access on his repositories. Fossil allows you to give some users more power over what they can do with the repository, but Fossil only loosely supports the enforcement of a development organization's social and power hierarchies. Fossil is a great fit for flat organizations.
- **No easy drive-by contributions:** Git pull requests offer a low-friction path to accepting drive-by contributions. Fossil's closest equivalents are its unique bundle and patch features, which require higher engagement than firing off a PR.⁷ This difference comes directly from the initial designed purpose for each tool: the SQLite project doesn't accept outside contributions from previously-unknown developers, but the Linux kernel does.

- **No rebasing:** When your local repo clone syncs changes up to its parent, those changes are sent exactly as they were committed locally. There is no rebasing mechanism in Fossil, on purpose.

- **Sync over push:** Explicit pushes are uncommon in Fossil-based projects: the default is to rely on autosync mode instead, in which each commit syncs immediately to its parent repository. Autosync is a mode, so you can turn it off temporarily when needed, such as when working offline; see the sketch after this list. Fossil is still a truly distributed version control system; it's just that its starting default is to assume you're rarely out of communication with the parent repo.

  This is not merely a reflection of modern always-connected computing environments. It is a conscious decision in direct support of SQLite's cathedral development model: we don't want developers going dark, then showing up weeks later with a massive bolus of changes for us to integrate all at once. Jim McCarthy put it well in his book on software project management, *Dynamics of Software Development*: "Beware of a guy in a room."

- **Branch names sync:** Unlike in Git, branch names in Fossil are not purely local labels. They sync along with everything else, so everyone sees the same set of branch names. Fossil's design choice here is a direct reflection of the Linux vs. SQLite project outlook: SQLite's developers collaborate closely on a single coherent project, whereas Linux's developers go off on tangents and occasionally send selected change sets to each other.

- **Private branches are rare:** Private branches exist in Fossil, but they're normally used to handle rare exception cases, whereas in many Git projects, they're part of the straight-line development process.

- **Identical clones:** Fossil's autosync system tries to keep each local clone identical to the repository it cloned from.
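As promised in the "Sync over push" item above, a sketch of toggling autosync around a stretch of offline work; the settings sub-command is real, while the scenario in the comments is hypothetical:

```
fossil settings autosync off   # e.g. before boarding a flight
# ...work and commit locally; nothing is pushed...
fossil settings autosync on    # back online: new commits sync again
fossil sync                    # push the accumulated backlog up now
```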
Where Git encourages siloed development, Fossil fights against it. Fossil places a lot of emphasis on synchronizing everyone's work and on reporting on the state of the project and the work of its developers, so that everyone — especially the project leader — can maintain a better mental picture of what is happening, leading to better situational awareness.

By contrast, "…forking is at the core of social coding at GitHub". As of January 2022, GitHub hosts 47 million distinct software projects, most of which were created by forking a previously-existing project. Since this is roughly twice the number of developers in the world, it beggars belief that most of these forks are still under active development. The vast bulk of these must be abandoned one-off efforts. This is part of the nature of bazaar-style development.

You can think about this difference in terms of feedback loop size, which we know from the mathematics of control theory to directly affect the speed at which any system can safely make changes. The larger the feedback loop, the slower the whole system must run in order to avoid loss of control. The same concept shows up in other contexts, such as in the OODA loop concept. Committing your changes to private branches in order to delay a public push to the parent repo increases the size of your collaborators' control loops, either causing them to slow their work in order to safely react to your work, or to over-correct in response to each change.

Each DVCS can be used in the opposite style, but doing so works against their low-friction paths.

### 2.5.2 Scale

The Linux kernel has a far bigger developer community than that of SQLite: there are thousands and thousands of contributors to Linux, most of whom do not know each other's names. These thousands are responsible for producing roughly 89× more code than is in SQLite. (10.7 MLOC vs. 0.12 MLOC according to SLOCCount.) The Linux kernel and its development process were already uncommonly large back in 2005 when Git was designed, specifically to support the consequences of having such a large set of developers working on such a large code base.

95% of the code in SQLite comes from just four programmers, and 64% of it is from the lead developer alone. The SQLite developers know each other well and interact daily. Fossil was designed for this development model.

When choosing your DVCS, we think you should ask yourself whether the scale of your software configuration management problems is closer to those Linus Torvalds designed Git to cope with or whether your work's scale is closer to that of SQLite, for which D. Richard Hipp designed Fossil. An automotive air impact wrench running at 8000 RPM driving an M8 socket-cap bolt at 16 cm/s is not the best way to hang a picture on the living room wall.

Fossil works well for projects several times the size of SQLite, such as Tcl, with a repository over twice the size and with many more core committers.

### 2.5.3 Individual Branches vs. The Entire Change History

Both Fossil and Git store history as a directed acyclic graph (DAG) of changes, but Git tends to focus more on individual branches of the DAG, whereas Fossil puts more emphasis on the entire DAG.

For example, the default behavior in Git is to only synchronize a single branch, whereas with Fossil the only sync option is to sync the entire DAG; see the sketch at the end of this section. Git commands, GitHub, and GitLab tend to show only a single branch at a time, whereas Fossil usually shows all parallel branches at once. Git has commands like "rebase" that help keep all relevant changes on a single branch, whereas Fossil encourages a style of many concurrent branches constantly springing into existence, undergoing active development in parallel for a few days or weeks, then merging back into the main line and disappearing.

This difference in emphasis arises from the different purposes of the two systems. Git focuses on individual branches, because that is exactly what you want for a highly-distributed bazaar-style project such as Linux. Linus Torvalds does not want to see every check-in by every contributor to Linux: such extreme visibility does not scale well. Contrast Fossil, which was written for the cathedral-style SQLite project and its handful of active committers. Seeing all changes on all branches all at once helps keep the whole team up-to-date with what everybody else is doing, resulting in a more tightly focused and cohesive implementation.
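The difference in sync scope shows up right at the command line. A sketch, with hypothetical remote and branch names:

```
# Git: fetches only the named branch (or whatever refspec is configured)
git fetch origin master

# Fossil: there is no narrower granularity; sync always moves the whole DAG
fossil sync
```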
## 2.6 One vs. Many Check-outs per Repository

Because Git commingles the repository data with the initial checkout of that repository, the default mode of operation in Git is to stick to that single work/repo tree, even when that's a shortsighted way of working.

Fossil doesn't work that way. A Fossil repository is a SQLite database file which is normally stored outside the working checkout directory. You can open a Fossil repository any number of times into any number of working directories. A common usage pattern is to have one working directory per active working branch, so that switching branches is done with a cd command rather than by checking out the branches successively in a single working directory.

Fossil does allow you to switch branches within a working checkout directory, and this is also often done. It is simply that there is no inherent penalty to either choice in Fossil as there is in Git. The standard advice is to use a switch-in-place workflow in Fossil when the disturbance from switching branches is small, and to use multiple checkouts when you have long-lived working branches that are different enough that switching in place is disruptive.

While you can use Git in the Fossil style, Git's default tie between working directory and repository means the standard method for working with a Git repo is to have one working directory only. Most Git tutorials teach this style, so it is how most people learn to use Git. Because relatively few people use Git with multiple working directories per repository, there are several known problems with that way of working, problems which don't happen in Fossil because of the clear separation between a Fossil repository and each working directory.

This distinction matters because switching branches inside a single working directory loses local context on each switch.

For instance, in any software project where the runnable program must be built from source files, you invalidate build objects on each switch, artificially increasing the time required to switch versions. Most obviously, this affects software written in statically-compiled programming languages such as C, Java, and Haskell, but it can even affect programs written in dynamic languages like JavaScript. A typical SPA build process involves several passes: Browserify to convert Node packages so they'll run in a web browser, SASS to CSS translation, transpilation of TypeScript to JavaScript, uglification, etc. Once all that processing work is done for a given input file in a given working directory, why re-do that work just to switch versions? If most of the files that differ between versions don't change very often, you can save substantial time by switching branches with cd rather than swapping versions in-place within a working checkout directory.

For another example, you might have an active long-running test grinding away in a working directory, then get a call from a customer requiring that you switch to a stable branch to answer questions in terms of the version that customer is running. You don't want to stop the test in order to switch your lone working directory to the stable branch.

Disk space is cheap. Having several working directories — each with its own local state — makes switching versions cheap and fast.

Plus, cd is faster to type than git checkout or fossil update.
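A sketch of the one-directory-per-branch pattern; the URL, paths, and branch names are hypothetical:

```
fossil clone https://example.org/myproject myproject.fossil

mkdir trunk release
(cd trunk   && fossil open ../myproject.fossil trunk)
(cd release && fossil open ../myproject.fossil release-1.0)

# From here on, "switching branches" is just changing directories:
cd trunk
```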
## 2.7 What you should have done vs. What you actually did

Git puts a lot of emphasis on maintaining a "clean" check-in history. Extraneous and experimental branches by individual developers often never make it into the main repository. Branches may be rebased before being pushed to make it appear as if development had been linear, or "squashed" to make it appear that multiple commits were made as a single commit. There are other history rewriting mechanisms in Git as well. Git strives to record what the development of a project should have looked like had there been no mistakes.

Fossil, in contrast, puts more emphasis on recording exactly what happened, including all of the messy errors, dead-ends, experimental branches, and so forth. One might argue that this makes the history of a Fossil project "messy," but another point of view is that this makes the history "accurate." In actual practice, the superior reporting tools available in Fossil mean that this incidental mess is not a factor.

Like Git, Fossil has an amend command for modifying prior commits, but unlike in Git, this works not by replacing data in the repository, but by adding a correction record to the repository that affects how later Fossil operations present the corrected data. The old information is still there in the repository; it is just overridden from the amendment point forward.
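A sketch of that correction-record mechanism in action, with a hypothetical hash and comment:

```
# Rewrites nothing: this adds a control artifact that supersedes
# the old check-in comment from this point forward.
fossil amend abcd1234 --comment "Fix off-by-one in the pager"
```

The superseded comment remains in the repository and can still be retrieved.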
Fossil lacks almost every other history rewriting mechanism listed on the Git documentation page linked above. There is no rebase in Fossil, on purpose, thus no way to reorder or copy commits around in the commit hash tree. There is no commit squashing, dropping, or interactive patch-based cherry-picking of commit elements in Fossil. There is nothing like Git's filter-branch in Fossil.

The lone exception is deleting commits. Fossil has two methods for doing that, both of which have stringent limitations, on purpose.

The first is shunning. See that document for details, but briefly, you only get mandatory compliance for shun requests within a single repository. Shun requests do not propagate automatically between repository clones. A Fossil repository administrator can cooperatively pull another repo's shun requests across a sync boundary, so that two admins can get together and agree to shun certain committed artifacts, but a person cannot force their local shun requests into another repo without having admin-level control over the receiving repo as well. Fossil's shun feature isn't for fixing up everyday bad commits; it's for dealing with extreme situations: public commits of secret material, ticket/wiki/forum spam, law enforcement takedown demands, etc.

There is also the experimental purge command, which differs from shunning in ways that aren't especially important in the context of this document. At a 30,000-foot level, you can think of purging as useful only when you've turned off Fossil's autosync feature and want to pluck artifacts out of its hash tree before they get pushed. In that sense, it's approximately the same as the "drop" action in git rebase -i. However, given that Fossil defaults to having autosync enabled for good reason, the purge command isn't very useful in practice: once a commit has been pushed into another repo, shunning is more useful if you need to delete it from history.

If these accommodations strike you as incoherent with respect to Fossil's philosophy of durable, unchanging commits, realize that if shunning and purging were removed from Fossil, you could still remove artifacts from the repository with SQL DELETE statements; the repository database file is, after all, directly modifiable, being writable by your user. Where the Fossil philosophy really takes hold is in making it difficult to violate the integrity of the hash tree. It's somewhat tangential, but the document "Is Fossil a Blockchain?" touches on this and related topics.

One commentator characterized Git as recording history according to the victors, whereas Fossil records history as it actually happened.

## 2.8 Test Before Commit

One of the things that falls out of Git's default separation of commit from push is that there are several Git sub-commands that jump straight to the commit step before a change could possibly be tested. Fossil, by contrast, makes the equivalent change to the local working check-out only, requiring a separate check-in step to commit the change. This design difference falls naturally out of Fossil's default-enabled autosync feature and its philosophy of not offering history rewriting features.

The prime example in Git is rebasing: the change happens to the local repository immediately if successful, even though you haven't tested the change yet. It's possible to argue for such a design in a tool like Git, since it lacks an autosync feature: you can still test the change before pushing local changes to the parent repo, but in the meantime you've made a durable change to your local Git repository. You must do something drastic like git reset --hard to revert that rebase or rewrite history before pushing it if the rebase causes a problem. If you push your rebased local repo up to the parent without testing first, you cannot fix it without violating the golden rule of rebasing.

Lesser examples are the Git merge, cherry-pick, and revert commands, all of which apply work from one branch onto another, and all of which commit their change to the local repository immediately without giving you an opportunity to test the change first, unless you give the --no-commit option. Otherwise, you're back in the same boat: reset the local repository or rewrite history to fix things, then maybe retry.

Fossil cannot sensibly work that way because of its default-enabled autosync feature and its purposeful paucity of commands for modifying commits, as discussed in the prior section.

Instead of jumping straight to the commit step, Fossil applies the proposed merge to the local working directory only, requiring a separate check-in step before the change is committed to the repository. This gives you a chance to test the change first, either manually or by running your software's automatic tests. (Ideally, both!) Thus, Fossil doesn't need rebase, squashing, reset --hard, or other Git commit mutating mechanisms.

Because Fossil requires an explicit commit for a merge, it has the nice side benefit that it makes you give an explicit commit message for each merge, whereas Git writes that commit message itself by default unless you give the optional --edit flag to override it.

We don't look at this difference as a workaround in Fossil for autosync, but instead as a test-first philosophical difference: fossil commit is a commitment. When every commit is pushed to the parent repo by default, it encourages a working style in which every commit is tested first. It encourages thinking before acting. We believe this is an inherently good thing.

Incidentally, this is a good example of Git's messy command design. These three commands:

```
$ git merge HASH
$ git cherry-pick HASH
$ git revert HASH
```

...are all the same command in Fossil:

```
$ fossil merge HASH
$ fossil merge --cherrypick HASH
$ fossil merge --backout HASH
```

If you think about it, they're all the same function: apply work done on one branch to another. All that changes between these commands is how much work gets applied — just one check-in or a whole branch — and the merge direction.
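Putting the test-first loop together as a sketch, where the hash is hypothetical and "make test" stands in for whatever your project's test entry point is:

```
fossil merge --cherrypick abcd1234   # lands in the check-out only
make test                            # exercise the merged result first
fossil commit -m "Cherry-pick abcd1234: fix pager off-by-one"
```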
This is the sort of thing we mean when we point out that Fossil's command interface is simpler than Git's: there are fewer concepts to keep track of in your mental model of Fossil's internal operation.

Fossil's implementation of the feature is also simpler to describe. The brief online help for fossil merge is currently 41 lines long, to which you can add the 600 lines of the branching document. The equivalent documentation in Git is the aggregation of the man pages for the above three commands, which is over 1000 lines, much of it mutually redundant. (e.g. Git's --edit and --no-commit options get described three times, each time differently.) Fossil's documentation is not only more concise, it gives a nice split of brief online help and full online documentation.

## 2.9 Hash Algorithm: SHA-3 vs. SHA-2 vs. SHA-1

Fossil started out using 160-bit SHA-1 hashes to identify check-ins, just as in Git. That changed in early 2017 when news of the SHAttered attack broke, demonstrating that SHA-1 collisions were now practical to create. Two weeks later, the creator of Fossil delivered a new release allowing a clean migration to 256-bit SHA-3 with full backwards compatibility to old SHA-1 based repositories.

In October 2019, after the last of the major binary package repos offering Fossil upgraded to Fossil 2.x, we switched the default hash mode so that the conversion to SHA-3 is fully automatic. This not only solves the SHAttered problem, it should prevent a recurrence of similar problems for the foreseeable future.

Meanwhile, the Git community took until August 2018 to publish their first plan for solving the same problem by moving to SHA-256, a variant of the older SHA-2 algorithm. As of this writing in February 2020, that plan hasn't been implemented, as far as this author is aware, but there is now a competing SHA-256 based plan which requires complete repository conversion from SHA-1 to SHA-256, breaking all public hashes in the repo. One way to characterize such a massive upheaval in Git terms is a whole-project rebase, which violates Git's own golden rule of rebasing.

Regardless of the eventual implementation details, we fully expect Git to move off SHA-1 eventually, and for the changes to take years more to percolate through the community.

Almost three years after Fossil solved this problem, the SHAmbles attack was published, further weakening the case for continuing to use SHA-1.

The practical impact of attacks like SHAttered and SHAmbles on the Git and Fossil Merkle trees isn't clear, but you want to have your repositories moved over to a stronger hash algorithm before someone figures out how to make use of the weaknesses in the old one. Fossil has had this covered for years now, so that the solution is now almost universally deployed.
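In Fossil, the migration mechanics are exposed through a single setting. A sketch, assuming a repository that still contains legacy SHA-1 artifacts; the hash-policy command exists, though the exact policy names should be checked against your Fossil version's online help:

```
fossil hash-policy          # report the current artifact naming policy
fossil hash-policy sha3     # name all new artifacts with SHA-3
```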
(See "How the Download Page Works" for details.) Chat history is deliberately not synced as chat messages are intended to be ephemeral. There may also be some purely static elements of the web site served via D. Richard Hipp's own lightweight web server, althttpd, which is configured as a front end to Fossil running in CGI mode on these sites.That estimate is based on pricing at Digital Ocean in mid-2019: Fossil will run just fine on the smallest instance they offer, at US $5/month, but the closest match to GitLab's minimum requirements among Digital Ocean's offerings currently costs $40/month.This means you can give up waiting for Fossil to be ported to the PDP-11, but we remain hopeful that someone may eventually port it to z/OS."Why is there all this Tcl in and around Fossil?" you may ask. It is because D. Richard Hipp is a long-time Tcl user and contributor. SQLite started out as an embedded database for Tcl specifically. ([Reference]) When he then created Fossil to manage the development of SQLite, it was natural for him to use Tcl-based tools for its scripting, build system, test system, etc. It came full circle in 2011 when the Tcl and Tk projects moved from CVS to Fossil.A minority of the pieces of the Git core software suite are written in other languages, primarily Perl, Python, and Tcl. (e.g. git-send-mail, git-p4, and gitk, respectively.) Although these interpreters are quite portable, they aren't installed by default everywhere, and on some platforms you can't count on them at all. (Not just Windows, but also the BSDs and many other non-Linux platforms.) This expands the dependency footprint of Git considerably. It is why the current Git for Windows distribution is 44.7 MiB but the current fossil.exe zip file for Windows is 2.24 MiB. Fossil is much smaller despite using a roughly similar amount of high-level scripting code because its interpreters are compact and built into Fossil itself.Both Fossil and Git support patch(1) files — unified diff formatted output — for accepting drive-by contributions, but it's a lossy contribution path for both systems. Unlike Git PRs and Fossil bundles, patch files collapse multiple checkins together, they don't include check-in comments, and they cannot encode changes made above the individual file content layer: you lose branching decisions, tag changes, file renames, and more when using patch files. The fossil patch command also solves these problems, but it is because it works like a Fossil bundle, only for uncommitted changes; it doesn't use Larry Wall's patch tool to apply unified diff output to the receiving Fossil checkout.This page was generated in about 0.008s by Fossil 2.25 [390e00134e] 2024-06-25 06:30:55
---

Welcome to another episode. My name is Helen Scott, and today I'm going to be interviewing Anna. Anna is a creative: she uses her communication and storytelling skills, and she's used them to explain Git in a simple way. So, Anna, welcome.

Thanks for having me, Helen. I'm excited.

I'm excited too. I'm just gonna see how many stories of my Git frustration I can weave into this interview. So, Anna has published her first book, "Learning Git." It's on the O'Reilly platform. You published it June last year, Anna?

June last year.

June last year. So, I've got a copy of the book. I've been very efficient and put it back on my bookshelf over here. For the purposes of this interview, let's start right at the beginning. Tell us a little bit more about you and what you enjoy doing, before we get started on the book.

I think what I enjoy doing is taking a complicated topic, or a topic that confuses people, and making it simpler and more approachable, and presenting the information in a way that it's easier to learn for beginners. So that's what I did with Git. And that's what I also do in my day job, because I work as a technical writer. So in my day job, I also take various topics and explain them in a simpler way, and present information in a simple way, so that people can consume it better.

Fantastic. And that's so important. Having a technical writing background myself as well, it's just super important that when a user comes... looks for the documentation, that it is written in such a way that you can actually help them to move forward and get past the problem that's made them go and look at the documentation in the first place, right?

Yes.

Okay. So, I'm going to try not to derail this interview completely with how annoyed I can get at Git sometimes, but I think maybe some of our audience might share some of these frustrations. What were your primary motivators behind writing this book?

Okay. So, the reason I wrote this book is because I needed this book. I'm just going to backtrack a little and say that my entryway into the world of tech was through the world of UX design, so user experience design. At some point in my life, I did a UX design bootcamp, and I worked as a UX designer. And then when I was working as a UX designer, I realized that I knew how to design apps and websites, sort of like an architect. But I didn't know how to actually build those apps and websites. I wasn't like the construction company that gets the blueprints and actually builds them. I got really curious about how these things are built. So I ended up doing a coding bootcamp, which is kind of a three-to-four-month intensive program, learned the basics of web development, and then worked as a front-end developer.

The first time that I got introduced to Git was in that coding bootcamp. But it was one hour where they just kind of told us: Git add, Git commit, Git push, go. Off you go. And obviously, that may have been sort of enough when you were just working on your own projects in the coding bootcamp. But once I got my first job as a junior front-end developer, and I had to work with a team of developers and senior developers, I was terrified of Git. I mean, every time I had to do something that I deemed was complicated, which was almost everything, I would call on the senior developers and ask them to help me. And I was always worried I was about to destroy the repository, and take down the entire platform and the entire website. And this was like a massive ecommerce platform, so that would not have been good. Little did I know that that was not the case, and that that was never gonna happen.
But anyways, at some point during that job, I realized: I want to conquer this fear. I want to learn how to use Git, and I want to understand how it works, so that I can be an independent developer and not have to ask for help all the time. So I started looking for online resources to learn Git. And what I realized was that there weren't really any online resources that were designed for people like me, that were really new to tech, that had transitioned into tech from non-technical backgrounds, and that explained things in a simple way.

Then at some point, this creative idea came to me of how I could teach Git using colors, and storytelling, and visuals. I mean, this was after I'd kind of understood some of the basics. So the first thing that I did was actually make an online course teaching Git. And that's still available online. At the moment, it's on Udemy. Who knows where it will be in the future. But that journey... When I was making the online course, I still wanted to write a book. But I felt that the barrier to entry to write the book was higher than to make an online course. Because with online courses, you just kind of record your voice, make some slides, record them. So I could do that a lot easier and publish it online a lot easier. But once I released that online course, I started getting reviews, I started getting feedback. I realized my way of teaching Git really resonated with a lot of people. There were a lot of people that, just like me, had not been served by the Git teaching resources out there up until now. My approach to organizing information and presenting concepts worked for them. And then I was like, all right, since it works, let's write this book. I am also a writer in my personal time, and I love to journal. So, writing is my medium of choice. That's kind of how this book came about. This is the book that I wish I had had in my first week of that coding bootcamp, or especially in that first week of my new job as a front-end developer.

A hundred percent. 100%. And just in hearing that story, there's so much that resonated with me. People have talked about, regardless of how you ended up in the profession, whether you're coming through a degree, you're coming through a bootcamp, you're self-taught, whatever it is, how much you learn about version control, and, you know, Git is part of that, is really variable. In my experience, sometimes it's purely conceptual. It's like, there is this thing that you can do; you will learn about it on your job. And then you turn up at your job and you're like, "Oh, I'm terrified of destroying the whole repository." So I think we've all been there. And I think we've all had the experience of knowing who, or even being at times, that expert in Git on the team, that people go to when they've gone beyond the Git pull or, you know, Git update, Git push. They've gone beyond the basics, and they're like, "Oh, I'm in trouble now. Something's not working." So I think certainly I identified with a lot of what you said there, and I expect our audience did as well. So much frustration.

The other thing that actually has been very surprising is that it's not just developers that use Git; there are so many other people that work with developers, or that do other jobs, that use Git. And this I've discovered since publishing the book. Game artists. Mechanical engineers also use Git. Sometimes UX designers have to collaborate with devs, and share assets or whatever. Even product managers. And actually one of the biggest audiences my book got was technical writers, because they often have this thing that we call the Docs as Code approach. And they use Git to manage the repositories for the documentation websites.
So some technical writers come from a technical background, but some don't. And so, technical concepts don't come naturally to them. My book has really served various different audiences, including junior developers and experienced developers, but also just so many other professions. Which, yeah, has been very eye-opening.

And again, I identified with that, because it was back in my technical writing career that I started using Git. And I needed that Git expert on the team, that developer, and I was like, "I've got it in reverse. Please help me." So let's move swiftly on to talking about the book itself, which is why the majority of the audience will be here. Can you give us an overview of the book and talk about its structure a little bit more, please?

Definitely. So, the book is for absolute, complete beginners. If you have a bit of experience, you can maybe try to skip a chapter. Normally, you should do it from the beginning to the end, though, because it's a hands-on learning experience, where you're working on a specific repository throughout the entire book, which is actually called the rainbow repository, because you're listing the colors of the rainbow. I'll explain that a little bit more later. But the first chapter actually just starts off with installing Git and an introduction to the command line, because some people actually haven't worked in the command line and aren't familiar with it. So, it really is a book for absolute beginners.

Then I build this thing that I call the Git diagram, which is my way of creating a mental model of the different areas of Git and how they fit together. So, you have your project directory, the working directory, the local repository. Then inside the local repository, you have the staging area and the commit history. When I was learning Git, I came across diagrams that tried to kind of depict these different areas and how they interact. They didn't really make sense to me. So I've actually created my own representation of these areas. This is really key to my teaching methodology, because the main part of my teaching methodology is creating a mental model of how things work, and making things tangible.

So, we build the Git diagram. Once we have that, we go over the process of making commits. We introduce branches and what they are. And we create visualizations for all of these things. So I visualize what branches look like. They're just pointers to commits. Then what else? I have a list here. I go over merging, and I introduce the two types of merges: fast-forward merges and three-way merges. We're at chapter five now. And there we just go over the experience of doing a fast-forward merge, not yet a three-way merge. That's a bit more spicy, and it comes later on in the learning journey.

And then chapter six is when we actually go from working just in the local repository, just on your computer, on your own. We introduce hosting services, so GitHub, GitLab, Bitbucket, and remote repositories, basically. One thing I should mention right now, just a quick parenthesis, is that the book is not prescriptive. You can use whichever hosting service you want. You can use whichever text editor you want. I didn't want to exclude anyone that uses maybe a different technology, and I wanted to make the book accessible to, yeah, anything. So yeah, that's closing parenthesis now.

Moving on: chapter seven, we jump into creating and pushing to a remote repository. So we're really into, you know, local repository, remote repository. Chapter eight, we go over cloning and fetching data.
So, chapter eight is where, in the learning experience, we simulate that you're working with another person. In the book, we say that a friend of yours wants to join you on your rainbow project. So they clone your remote repository, and they create a local repository called FriendRainbow. And I mean, if you have a friend that you can actually do all the exercises with, then that's ideal. But the most realistic thing is that you just create a second local repository on your computer, and you just kind of pretend it's on someone else's computer. But this is really important, because at the end of the day, Git is a collaboration tool, right? It's version control and collaboration. So, if you don't have any representation of how the collaboration happens, then that leaves out basically more than half of Git. So in chapter eight, you're learning how to clone, how to fetch data from a remote repository.

And chapter nine, finally, we get into the spicy topic of... Well, not that spicy. Three-way merges are pretty simple. But then chapter 10, we get into merge conflicts, which is the spicier topic, and the thing that a lot of people are afraid of. Chapter 11: rebasing. Rebasing, I'd say, is the most advanced thing that I cover in my book. Like I said, this book is a real basics and beginners book. So, rebasing is, yeah, at the end. And finally, the last chapter is pull requests, or merge requests, whatever terminology you want to use. And obviously, pull requests, merge requests, they're not actually a feature of Git itself. They're a feature of the hosting services: GitHub, GitLab, Bitbucket, and others. There are others; I'm just not going to start naming 20 different hosting services. But I thought that they were so important because they really... Yeah, they're essential, almost, for everyone's workflow. So I thought, okay, I'll make an exception and include them, even though this is a book about Git.

That's kind of an overview. And like I mentioned, you're working on this rainbow project, and it's hands-on. You are with your computer, doing the exercises. So you are supposed to do the book from chapter 1 to chapter 12, because if you don't, then you'll miss out; you won't be able to follow along. But I have created an appendix, where I've offered instructions on how someone can create the minimum setup for their project to start off on any chapter. Because, yeah, maybe you read the book once, and then you just want to review chapter eight, or you just want to review chapter nine. And you don't have to go from chapter one all the way to that chapter just to be able to review it. But yeah, that's the kind of overview of the book.

Brilliant. So for my next question, I'm going to use the book as my demo, so the audience can see what I'm talking about when I ask this next question. For example, you've made extensive use of color in this book, and you've mentioned the rainbow project repeatedly. What made you choose that theme? And why?

When my creative idea of how I could teach Git in a simple way came to me, it was all about using color. Because I thought to myself, one of the really confusing or difficult things with Git, when you're teaching it, is that there are commit hashes. So every commit... A commit is a version of your project. Every commit has a commit hash: 40 characters, letters and numbers, that's unique. And it's like a name for the commit. But if you're teaching Git, and you're having to refer to, well, "remember commit six-nine-Z-blah-blah-blah," that is so confusing. Who wants to learn like that? So I thought to myself, how can I use color instead?
And so, let me give an example. In the rainbow project, the very first thing you add to your project... you create a rainbow.txt file. So a txt file, very simple. I keep everything really simple. And I'll make a comment about that in a second. And the first thing you add is just... "Red is the first color of the rainbow." You just add that sentence, first line of your file, and you add that to the staging area, and you make a commit. And then I represent that commit in the diagrams as a circle that is red. And so from then on, I can just say "the red commit". And that just simplifies the learning experience. It makes it a lot more memorable, and also very much more visual, because I'm not having to include, like, a little commit hash in my diagram to try to refer to the commit. That's why I use color in my teaching methodology.

And the rainbow was just a really nice way to structure it. You know, we all... well, many of us, or most of us, know the order of the colors of the rainbow. So it's a very familiar thing. It was easy to then, yeah, structure the whole book that way. Although at the end, I ran out of colors. I just had to add some random colors. At the end, you add another file to your project called othercolors.txt, and you start adding, like, "Pink is not a color in the rainbow." And "Gray is not a color in the rainbow." Because I literally ran out of colors. But also because I wanted to show, you know, how you add new files to your project.

But the other thing I wanted to say about keeping things simple is that one of the decisions I made with this book is that it would have no code in it. So, the files in your project are just .txt files, which are just really plain text files. They're not even Markdown files. Like, it is so simple. Because I thought, if I make the project that you work on in the book a web development project, or a data science project, a Java project, a Python project, anything, it will exclude some people for whom that is not familiar. And let's say, yeah, fine, well, those people can go and look it up. It just complicates things. It's not necessary. I wanted someone to just be able to focus on learning the concepts of Git, rather than having to also learn other tech concepts which are not relevant to just learning Git. So, yeah, that was my way of approaching how to teach this stuff.

That's great. And I think what's really helpful and insightful is the very deliberate decisions that you made along the way. You know, this didn't just happen by accident: you made a very deliberate choice that "I'm going to represent hashes with blobs of color, and therefore I'm going to refer to those." And you made a very deliberate choice to use txt files, a concept that is going to be familiar to your entire audience. So I really liked that you made those conscious decisions upfront to create the learning journey that, you know, you said yourself you wanted when you first started working with Git.

One more I can add is screenshots. I decided: not a single screenshot in my book. Because I thought to myself, the minute I add a screenshot, the book is out of date. And since I was able to do without...

So true. Helen knows this because she wrote a book with lots of screenshots. But you have to have screenshots in yours. Sorry, off topic.

I think that was another really, really conscious decision of mine: since it's not necessary, don't include screenshots. Because again, they're not relevant to everyone. Everyone has a different operating system, and a different version, and UI changes. And the minute that the book goes to print, it would be out of date.
So, that was another really conscious decision I made for this book.

Good call out. Good call out. And yes, the pain is real. Just ask my co-author, Trisha Gee, who updated them all. Okay, so the next question is kind of a double question. You can answer it in any order you like. But it's: who should read this book? And equally as importantly, who is this book not for?

So who should read this book? I think anyone that wants to learn Git. So they've never used Git, and someone's told them they have to, or they realize they have to for their work or for their personal project. So anyone that wants to learn Git. Anyone that's confused by Git. I have talked to developers with 10 years' experience that still are afraid of Git and don't have a mental model that they can reliably, like, use and feel confident with. And they even tell me, this book helped me to put things together. So, yeah, junior developers, anyone that doesn't yet really understand how the pieces come together. Because what happens is, when you don't have a good mental model of Git, once you start... Like, maybe you're okay with doing the git add, git commit, git push. But once you start going to the more advanced topics, you don't really understand how they work. And that's where you start getting really confused and scared. And it all becomes a bit challenging. So that's who I think could benefit from this book. Also, I would say anyone that's a visual learner, and anyone that's kind of more like a tactile, like, wants to see... kind of make things tangible type of learner. I do have to say, you know, this book isn't for everyone. Not everyone's a visual learner. And I totally appreciate that. And that means that this book will appeal to a certain type of learner, but not to another.

Now, let's get to the topic of who this book is not for. It's not for anyone that uses Git, and really has their own mental model of how it works, and it works for them. And they never really struggle with understanding things. I mean, they don't need it. It's not for anyone that's looking for an advanced guide to Git, you know, that goes over advanced features of Git, more niche commands. It's not for them. That stuff is not in the book, so they'll not get anything from it. It's also not for anyone that's looking for a resource that will teach them what their Git workflow should be, or what best practices should be. In the book, I don't teach you, like, oh, well, the best branching workflow is this, and this, and this. Or, yeah, I don't know, this is how you should set things up. Like I said, the book is not at all prescriptive. And actually, to be honest, the rainbow project is not a really realistic project of a software project. I mean, you're listing the colors of the rainbow. That's not really what you're usually doing when you're building software. So for a lot of developers, for example, the example in the book is not so realistic. It's more about building that mental model. Although I do have a second example in the book, which is called the example book project, which is a little bit more realistic, because it uses storytelling to provide a bit more of a realistic use of Git. But again, it's not prescriptive. And the other thing is, anyone looking for other best practices, like, how should I be naming my branches? What should I be including in a commit? Let me think, what else? So those kinds of things, I don't provide any guidance on that. Because, like I said, I focus on teaching a mental model. And those things are really up to... they're really kind of dependent on your opinion, on which company you work in, which sector you work in, what your background is, what you personally like in branch names, and in, yeah... or commit messages.
So, it was not something that I wanted to confuse people with and clutter up the book with. I think there's plenty of other resources in the world that provide guidance on that. And the final thing is that this book is not a reference. So it's really kind of a learning journey. It's a hands-on learning journey. But it's not the kind of book that you would be like, oh, you know, each chapter... I don't know, it's not a reference guide. So, to be honest, the Git website is the best reference. I mean, it is a reference. They have a reference. So, git-scm.com, you've got a reference there. And other people have built a reference. So yeah, I think that's kind of who the book is for and who the book isn't for.

Okay. So, we've mentioned previously that the book is designed to be a sequential learning experience. Start at the beginning, progress, especially if you're new to Git. But there's going to be people out there that will definitely have the question of: how much Git experience do I need if I'm going to buy this book? And what's the answer to that one?

Zero. That's an easy answer. I could just, like... zero. I have nothing else to say. No, zero, really. Like I mentioned, in the very first chapter we go over the process of downloading Git, and kind of setting it up, setting up like the basics. And I even introduce the command line. Like, I tell you, this is the app that is the command line, open the command line. This is the command prompt. This is where you enter commands. You write a command and you press enter. And I introduce some, like, very basic commands like cd, change directory, or ls, like, list the files, the visible files. I introduce the concept of visible files and hidden files. I introduce, like, I don't know, mkdir, or make directory. Just a couple of super basic commands. So, yeah, and we go from the very start. Like, chapter two is, you know, introducing that Git diagram. And so, zero, zero, zero. I user tested. We'll get into that later. But I literally gave this book to my brother, my dad, people that are... well, at least my brother, not at all in the tech space. Or at least, you know, not developers of any kind, at the moment, at least. So, yeah, it's zero.

Brilliant. We're gonna get to that now. So people who get value from this book: no Git experience is necessary, none whatsoever. Road tested with your dad and brother, amongst other people. And it tells a sequential journey. Anybody who is looking to understand the mental model of Git, anybody who perhaps has been using Git, but is a little bit less confident around some of the operations. Whether they're the more advanced ones like rebase, or the spicy three-way merge, which is absolutely what I'm always going to call the three-way merge from this point forward in my career, always going to prefix it with spicy. And anybody who just needs to brush up on some of those underlying concepts. Because Git is very... what are the words? You need to build on top of the basics. If you don't understand the basics in Git, then the more advanced stuff, the more advanced commands, tend to be more challenging than perhaps they would be if you had a good grasp of the underlying mental model.

It's true.

Awesome. Okay. So let's stick with advanced topics. Is there anything that you really, really, really wanted to get into the book, but just, you know, you've got to draw the line somewhere? Is there anything that you're like, "Oh, I really wanted to put that in, but I just..." You know, you had a cut off for it.
That's a really good question. And I've been asked this before. But to be honest, I think I am in quite a unique position, that unlike many other authors, who had a lot more in their books and needed to cut down, or just, yeah, the books got way too big, I had a very clear idea of what the basics were, maybe because I'd already made the online course, which already had the basics. And so, I didn't really have that situation. I didn't have anything that I was like, "Oh, but I really wish I could fit this in." The only thing I would say maybe that I was considering squeezing into one of the chapters was stashing. But I mean, it's not like a huge, massive thing that I really, really, you know, was like, "Oh, I can't believe this won't fit in." Because to be honest, you know, the book is still pretty lean. It's very minimalist. So, if I had ultimately decided that this is essential, I would have included it. Actually, in my case, I think the pull request, merge request chapter was actually not even part of my original plan. And I think later on, I realized, no, this is really important, I need to add this on. So actually, my book was, like, too lean. And I was, like...

I think you've been rebasing, I was kind of...

No, rebasing, I think I had from the beginning. But yeah, I was, like, super lean. And I was, like, okay, but maybe I should add a little bit extra. So, I think, I felt that the most generous thing that I could do for my readers, and for my learners, was to be as simple and minimalist as possible, and just give the bare basics. And in that way, just make it less overwhelming. Because tech is so overwhelming. There's just so much information, and so much going on. So if I could just create one learning experience which would not be overwhelming, I was like, yes, I shall do this.

Fantastic. Thank you. So I'm wondering if in our audience today, we might have some budding authors, other people out there who perhaps want to author a book. I know I've co-authored a book. So I absolutely appreciate the effort that goes into the process and then some. Do you have any advice for anybody who might be listening, who's thinking, "Oh, yeah, I know a thing. I'm gonna write a book about that thing."

So anything I'm about to say is just gonna come from my own experience. So you never know, it might not apply to others. And anything I'm gonna say is probably just going to refer to technical book authors, because I have not yet written another kind of book. So I don't want to say that I can speak on that. But for technical book potential authors, I would say one of the things that I really appreciated in my journey was that I sort of wrote this book as if it was an app. Like, I made a prototype and then I user tested that prototype. And then I took all the feedback and I iterated. And then I user tested again. And then I took all the feedback, and I iterated. And the first prototype was ugly. It was very, very ugly. And it was very rough. And it was not at all the finished product. I actually user tested when I just had four chapters ready, because I thought, well, let me check if this is making any sense to anyone. Because if I've done four chapters, and it doesn't make any sense, there's no point in writing the other eight chapters. Or, you know, it's better for me to figure out what they are, what works for my audience, before I write those other eight chapters. So I had up to, like, 30 user testers throughout the two years that I was working on this book. And I think that was invaluable. The experience of having a diverse user testing audience. I mean, it went from the ages of 18 to 60. And from various professions, from all over the world.
It was all remote user testing. Wait, actually no, I also did some in person. But most of it, almost all of it, was remote. So I got people from all over the world. And it really did make my book a lot better. There were a lot of things I needed to change. I think sometimes, yeah, we just want to think that we're, like, this genius that can create the best thing from the outset. And, like, oh, my God, my ideas are so good. But sometimes, our creations need to have that chemical reaction with the audience in order to really become what they need.

Just one funny story, since we're at it. At the beginning, I thought that my book would actually... people would have, like, a pencil case with them and paper, and they would actually draw out the commits. So in the beginning of the preface, I was like, "Oh, you have to buy colored pencils, and you should, like, follow along, and you should draw the commits." And when I did that user testing, nobody did that. And I had the colored pencils with me, I brought them, and nobody did it. So then I was like, okay, well, I guess this isn't something... You know, if somebody does want to do it, they can. But it wasn't something that made sense. I mean, that's just a non-technical example. But there were plenty of other things where, yeah, I got feedback of, this doesn't make sense to me, or you forgot about this. I'm a Mac user. So there were a lot of Windows users that told me, like, "Hey, you need to mention this, you need to mention that, because this doesn't make sense for me, or this doesn't apply to me." Or, you know, you need to simplify this. You know, I'm not aware of this concept. So, yeah, my monologue about user testing and how good it is, is over now. But yeah, I really recommend it.

The importance of user testing. And yeah, colored pencils, or not, in your case. Fantastic. So we've spoken a little bit about your Udemy course, we've spoken a little... well, a lot, about your book. What's next for you? Is there anything else in the pipeline?

That's a good question, Helen. I don't know what's next. I do know that I've started working on a book. I cannot share anything about it. It shall remain a mystery. People can... You know, at the end, we'll talk about links where people can find me. But it may not end up being a technical book. So I'm not sure yet. I'm still exploring... Well, I've started working on something. But until I commit to a creative idea or a creative project, I flirt with a lot of different creative ideas and creative projects. So, I definitely want to keep creating. I love the process of creating something that helps people and that explains things in a simple way. But what that next thing is going to be is still under wraps. So, we'll see. For now, I'm just adjusting to my new technical writer position, and enjoying kind of sharing the journey I had with this book.

Wonderful. A mystery.

Yes, a mystery.

I do like a mystery.

Maybe that's how I get people to follow me. It's kind of self-serving. If I make it mysterious, people have to come and follow me to find out what it is when I'm ready to share it.

I think that is the perfect segue. So, if people want to find out about this mystery in time, or learn more about your book, or your courses, or whatever is coming next for you, Anna, where should they go?

So the first thing I'll say is that I'm the only Anna Skoulikari on this planet. So, just look me up: Anna Skoulikari, two Ns, A-N-N-A. Last name S-K-O-U-L-I-K-A-R-I. Just google me. But other than that, I have a website, annaskoulikari.com. So that is one good place. The platform that I'm currently most active on is LinkedIn.
So you can connect with me or follow me there. And then other than that, you can find my book all over the place: Amazon, and plenty of other places. You can find my online course on Udemy, at the moment. It's difficult to tell people where to find you, because you always think, well, the social media landscape changes all the time. So this might not be relevant in a year or two years from now. And maybe I'll start using something else. So I do have a Twitter. I don't use it much. Oh, I'm sorry, X. Anyways, so I'd say my website and LinkedIn are currently the best places to find out more about me, and what I'm doing. And then other than that, I just want to share with the audience that I do have a link that for the rest of 2024 is still going to be active, which we can leave in the notes, which gives people 30 day free access to the O'Reilly online learning platform. You can read my book, basically, in 30 days. We can leave that in the notes. And you actually have access to all the resources on there. If you want to take a peek at anything else, feel free. And it doesn't automatically subscribe you; you just get those 30 days, and then you can choose whether you want to continue. I think those are the best places.

Perfect. And we will, of course, put all of that information in the show notes, including that link for 30 day access to the learning platform. So I think that just about brings us to the end of this interview. So, Anna, do you have any final words that you'd like to share today?

Oh, that's a good question. Well, since the audience, there's going to be a mix of junior developers, senior developers, and various other people in the tech profession, I'd say: if you yourself maybe understand Git, but you have a friend or anyone you know that is struggling with it, feel free to recommend them to take a peek at the book and see whether it is the right learning journey and learning resource for them. It might be, it might not. But if you do have anyone that is struggling with Git, which many of us do, feel free to kind of just share it with them, in case you think it can serve them.

Fantastic. Fantastic. And I'd second that, especially with the 30 day free access. I mean, that's a win-win. You can check out the book. I am fortunate enough to have this lovely physical copy of it. But, Anna, thank you. Thank you for coming to this interview, for sharing your knowledge and writing this book. I've got a lot of value from it. And I know that our audience will as well. Thanks to the GOTO platform as well. And yeah, thank you to you, the audience, for tuning in and coming on this journey. All the show notes will be available to you. And that's it from us. Thank you very much. Bye.

Thanks, Helen.
# CRDT Survey, Part 3: Algorithmic Techniques

Matthew Weidner | Feb 13th, 2024

Keywords: CRDTs, optimization, state-based CRDTs

This blog post is Part 3 of a series.

- Part 1: Introduction
- Part 2: Semantic Techniques
- Part 3: Algorithmic Techniques
- Part 4: Further Topics

# Algorithmic Techniques

Let's say you have chosen your collaborative app's semantics in the style of Part 2. That is, you've chosen a pure function that inputs an operation history and outputs the intended state of a user who is aware of those operations.

Here is a simple protocol to turn those semantics into an actual collaborative app:

- Each user's state is a literal operation history, i.e., a set of operations, with each operation labeled by a Unique ID (UID).
- When a user performs an operation, they generate a new UID, then add the pair (id, op) to their local operation history.
- To synchronize their states, users share the pairs (id, op) however they like. For example, users could broadcast pairs as soon as they are created, periodically share entire histories peer-to-peer, or run a clever protocol to send a peer only the pairs that it is missing. Recipients always ignore redundant pairs (duplicate UIDs).
- Whenever a user's local operation history is updated - either by a local operation or a remote message - they apply the semantics (pure function) to their new history, to yield the current app-visible state.

Technicalities:

- It is the translated operations that get stored in operation histories and sent over the network. E.g., convert list indices to list CRDT positions before storing & sending.
- The history should also include causal ordering metadata - the arrows in Part 2's operation histories. When sharing an operation, also share its incoming arrows, i.e., the UIDs of its immediate causal predecessors.
- Optionally enforce causal order delivery, by waiting to add a received operation to your local operation history until after you have added all of its immediate causal predecessors.

This post describes CRDT algorithmic techniques that help you implement more efficient versions of the simple protocol above. We start with some Prerequisites that are also useful in general distributed systems. Then Sync Strategies describes the traditional "types" of CRDTs - op-based, state-based, and others - and how they relate to the simple protocol.

The remaining sections describe specific algorithms. These algorithms are largely independent of each other, so you can skip to whatever interests you. Misc Techniques fills in some gaps from Part 2, e.g., how to generate logical timestamps. Optimized CRDTs describes nontrivial optimized algorithms, including classic state-based CRDTs.

# Table of Contents

- Prerequisites: Replicas and Replica IDs • Unique IDs: Dots • Tracking Operations: Vector Clocks 1
- Sync Strategies: Op-Based CRDTs • State-Based CRDTs • Other Sync Strategies
- Misc Techniques: LWW: Lamport Timestamps • LWW: Hybrid Logical Clocks • Querying the Causal Order: Vector Clocks 2
- Optimized CRDTs: List CRDTs • Formatting Marks (Rich Text) • State-Based Counter • Delta-State Based Counter • State-Based Unique Set • Delta-State Based Unique Set

# Prerequisites

# Replicas and Replica IDs

A replica is a single copy of a collaborative app's state, in a single thread on a single device. For web-based apps, there is usually one replica per browser tab; when the user (re)loads a tab, a new replica is created.

You can also call a replica a client, session, actor, etc. However, a replica is not synonymous with a device or a user. Indeed, a user can have multiple devices, and a device can have multiple independently-updating replicas - for example, a user may open the same collaborative document in multiple browser tabs.
In previous posts, I often said "user" out of laziness - e.g., "two users concurrently do X and Y". But technically, I always meant "replica" in the above sense. Indeed, a single user might perform concurrent operations across different devices.

The importance of a replica is that everything inside a replica happens in a sequential order, without any concurrency between its own operations. This is the fundamental principle behind the next two techniques.

It is usually convenient to assign each replica a unique replica ID (client ID, session ID, actor ID), by generating a random string when the replica is created. The replica ID must be unique among all replicas of the same collaborative state, including replicas created concurrently, which is why they are usually random instead of "the highest replica ID so far plus 1". Random UUIDs (v4) are a safe choice. You can potentially use fewer random bits (shorter replica IDs) if you are willing to tolerate a higher chance of accidental non-uniqueness (cf. the birthday problem).

For reference, a UUID v4 is 122 random bits, a Collabs replicaID is 60 random bits (10 base64 chars), and a Yjs clientID is 32 random bits (a uint32).

Avoid the temptation to reuse a replica ID across replicas on the same device, e.g., by storing it in window.localStorage. That can cause problems if the user opens multiple tabs, or if there is a crash failure and the old replica did not record all of its actions to disk.

In Collabs: ReplicaIDs

# Unique IDs: Dots

Recall from Part 2 that to refer to a piece of content, you should assign it an immutable Unique ID (UID). UUIDs work, but they are long (32 chars) and don't compress well.

Instead, you can use dot IDs: pairs of the form (replicaID, counter), where counter is a local variable that is incremented each time. So a replica with ID "n48BHnsi" uses the dot IDs ("n48BHnsi", 1), ("n48BHnsi", 2), ("n48BHnsi", 3), …

In pseudocode:

```
// Local replica ID.
const replicaID = <sufficiently long random string>;
// Local counter value (specific to this replica). Integer.
let counter = 0;

function newUID() {
  counter++;
  return (replicaID, counter);
}
```

The advantage of dot IDs is that they compress well together, either using plain GZIP or a dot-aware encoding. For example, a vector clock (below) represents a sequence of dots ("n48BHnsi", 1), ("n48BHnsi", 2), ..., ("n48BHnsi", 17) as the single map entry { "n48BHnsi": 17 }.

You have some flexibility for how you assign the counter values. For example, you can use a logical clock value instead of a counter, so that your UIDs are also logical timestamps for LWW. Or, you can use a separate counter for each component CRDT in a composed construction, instead of one for the whole replica. The important thing is that you never reuse a UID in the same context, where the two uses could be confused.

Example: An append-only log could choose to use its own counter when assigning events' dot IDs. That way, it can store its state as a Map<ReplicaID, T[]>, mapping each replica ID to an array of that replica's events indexed by (counter - 1). If you instead used a counter shared by all component CRDTs, then the arrays could have gaps, making a Map<ReplicaID, Map<number, T>> preferable.
In previous blog posts, I called these "causal dots", but I cannot find that name used elsewhere; instead, CRDT papers just use "dot".

Collabs: Each transaction has an implicit dot ID (senderID, senderCounter).

Refs: Preguiça et al. 2010

# Tracking Operations: Vector Clocks 1

Vector clocks are a theoretical technique with multiple uses. In this section, I'll focus on the simplest one: tracking a set of operations. (Vector Clocks 2 is later.)

Suppose a replica is aware of the following operations - i.e., this is its current local view of the operation history:

Figure: An operation history with labels ("A84nxi", 1), ("A84nxi", 2), ("A84nxi", 3), ("A84nxi", 4), ("bu2nVP", 1), ("bu2nVP", 2).

I've labeled each operation with a dot ID, like ("A84nxi", 2).

In the future, the replica might want to know which operations it is already aware of. For example:

- When the replica receives a new operation in a network message, it will look up whether it has already received that operation, and if so, ignore it.
- When a replica syncs with another collaborator (or a storage server), it might first send a description of the operations it's already aware of, so that the collaborator can skip sending redundant operations.

One way to track the operation history is to store the operations' unique IDs as a set: { ("A84nxi", 1), ("A84nxi", 2), ("A84nxi", 3), ("A84nxi", 4), ("bu2nVP", 1), ("bu2nVP", 2) }. But it is cheaper to store the "compressed" representation

```
{
  "A84nxi": 4,
  "bu2nVP": 2
}
```

This representation is called a vector clock. Formally, a vector clock is a Map<ReplicaID, number> that sends a replica ID to the maximum counter value received from that replica, where each replica assigns counters to its own operations in order starting at 1 (like a dot ID). Missing replica IDs implicitly map to 0: we haven't received any operations from those replicas. The above example shows that a vector clock efficiently summarizes a set of operation IDs.

The previous paragraph implicitly assumes that you process operations from each other replica in order: first ("A84nxi", 1), then ("A84nxi", 2), etc. That always holds when you enforce causal-order delivery. If you don't, then a replica's counter values might have gaps (1, 2, 4, 7, ...); you can still encode those efficiently, using a run-length encoding, or a vector clock plus a set of "extra" dot IDs (known as a dotted vector clock).

Like dot IDs, vector clocks are flexible. For example, instead of a per-replica counter, you could store the most recent logical timestamp received from each replica. That is a reasonable choice if each operation already contains a logical timestamp for LWW.

Collabs: CausalMessageBuffer

Refs: Baquero and Preguiça 2016; Wikipedia
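As a concrete illustration, here is a minimal TypeScript sketch of this bookkeeping. It is illustrative only: the class and method names are my own, not from any particular library, and it assumes causal-order delivery, so that counters from each replica arrive in order.

```ts
type ReplicaID = string;
type Dot = { replicaID: ReplicaID; counter: number };

// Tracks which operations we have already received.
class OperationTracker {
  private readonly entries = new Map<ReplicaID, number>();

  // Max counter received from r (0 if none).
  get(r: ReplicaID): number {
    return this.entries.get(r) ?? 0;
  }

  // Has the operation with this dot ID already been received?
  has(dot: Dot): boolean {
    return dot.counter <= this.get(dot.replicaID);
  }

  // Record a received operation. With causal-order delivery,
  // dot.counter is always exactly get(dot.replicaID) + 1.
  add(dot: Dot): void {
    this.entries.set(dot.replicaID, Math.max(this.get(dot.replicaID), dot.counter));
  }
}
```

A replica answers "have I seen this operation?" with has(dot) before processing a message, then records it with add(dot).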
# Sync Strategies

We now turn to sync strategies: ways to keep collaborators in sync with each other, so that they eventually see the same states.

# Op-Based CRDTs

An operation-based (op-based) CRDT keeps collaborators in sync by broadcasting operations as they happen. This sync strategy is especially useful for live collaboration, where users would like to see each others' operations quickly.

Current state (op history) + single operation -> new state (op history including operation).

In the simple protocol at the top of this post, a user processes an individual operation by adding it to their operation history, then re-running the semantic function to update their app-visible state. An op-based CRDT instead stores a state that is (usually) smaller than the complete operation history, but it still contains enough information to render the app-visible state, and it can be updated incrementally in response to a received operation.

Example: The op-based counter CRDT from Part 1 has an internal state that is merely the current count. When a user receives (or performs) an inc() operation, they increment the count.

Formally, an op-based CRDT consists of:

- A set of allowed CRDT states that a replica can be in.
- A query that returns the app-visible state. (The CRDT state often includes extra metadata that is not visible to the rest of the app, such as LWW timestamps.)
- For each operation (insert, set, etc.):
  - A prepare function that inputs the operation's parameters and outputs a message describing that operation. An external protocol promises to broadcast this message to all collaborators. (Usually this message is just the translated form of the operation. prepare is allowed to read the current state but not mutate it.)
  - An effect function that processes a message, updating the local state. An external protocol promises to call effect:
    - Immediately for each locally-prepared message, so that local operations update the state immediately.
    - Eventually for each remotely-prepared message, exactly once and (optionally) in causal order.

In addition to updating their internal state, CRDT libraries' effect functions usually also emit events that describe how the state changed. Cf. Views in Part 2.

To claim that an op-based CRDT implements a given CRDT semantics, you must prove that the app-visible state always equals the semantics applied to the set of operations effected so far.

As an example, let's repeat the op-based unique set CRDT from Part 2.

- Per-user CRDT state: A set of pairs (id, x).
- Query: Return the CRDT state directly, since in this case, it coincides with the app-visible state.
- Operation add:
  - prepare(x): Generate a new UID id, then return the message ("add", (id, x)).
  - effect("add", (id, x)): Add (id, x) to your local state.
- Operation delete:
  - prepare(id): Return the message ("delete", id).
  - effect("delete", id): Delete the pair with the given id from your local state, if it is still present.

It is easy to check that this op-based CRDT has the desired semantics: at any time, the query returns the set of pairs (id, x) such that you have effected an add(id, x) operation but no delete(id) operations. Observe that the CRDT state is a lossy representation of the operation history: we don't store any info about delete operations or deleted add operations.
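To make this concrete, here is a TypeScript sketch of that op-based unique set. This is a sketch under stated assumptions: newUID() stands in for the dot-ID generator from earlier (serialized to a string), and the broadcast and delivery guarantees come from the external protocol, which is not shown.

```ts
type UID = string; // E.g., a serialized dot ID.
type SetMessage<T> =
  | { type: "add"; id: UID; value: T }
  | { type: "delete"; id: UID };

// Assumed: the dot-ID generator from the "Unique IDs: Dots" section.
declare function newUID(): UID;

class OpBasedUniqueSet<T> {
  // CRDT state: the set of pairs (id, x), keyed by id.
  private readonly state = new Map<UID, T>();

  // Query: here the CRDT state coincides with the app-visible state.
  get elements(): ReadonlyMap<UID, T> {
    return this.state;
  }

  // prepare functions: create a message for the external protocol to
  // broadcast. They read but do not mutate the state.
  prepareAdd(value: T): SetMessage<T> {
    return { type: "add", id: newUID(), value };
  }

  prepareDelete(id: UID): SetMessage<T> {
    return { type: "delete", id };
  }

  // effect: called exactly once per message, immediately for local
  // messages, eventually (in causal order) for remote ones.
  effect(message: SetMessage<T>): void {
    if (message.type === "add") this.state.set(message.id, message.value);
    else this.state.delete(message.id); // No-op if already deleted.
  }
}
```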
How can the "external protocol" (i.e., the rest of the app) guarantee that messages are effected at-most-once and in causal order? Using a history-tracking vector clock:

- On each replica, store a vector clock tracking the operations that have been effected so far.
- When a replica asks to send a prepared message, attach a new dot ID to that message before broadcasting it. Also attach the dot IDs of its immediate causal predecessors.
- When a replica receives a message:
  1. Check if its dot ID is redundant, according to the local vector clock. If so, stop.
  2. Check if the immediate causal predecessors' dot IDs have been effected, according to the local vector clock. If not, block until they are.
  3. Deliver the message to effect and update the local vector clock. Do the same for any newly-unblocked messages.

To ensure that messages are eventually delivered at-least-once to each replica (the other half of exactly-once), you generally need some help from the network. E.g., have a server store all messages and retry delivery until every client confirms receipt.

As a final note, suppose that two users concurrently perform operations o and p. You are allowed to deliver their op-based messages to effect in either order without violating the causal-order delivery guarantee. Semantically, the two delivery orders must result in equivalent internal states: both results correspond to the same operation history, containing both o and p. Thus for an op-based CRDT, concurrent messages commute.

Conversely, you can prove that if an algorithm has the API of an op-based CRDT and concurrent messages commute, then its behavior corresponds to some CRDT semantics (i.e., some pure function of the operation history). This leads to the traditional definition of an op-based CRDT in terms of commuting concurrent operations. Of course, if you only prove commutativity, there is no guarantee that the corresponding semantics are reasonable in the eyes of your users.

Collabs: sendCRDT and receiveCRDT in PrimitiveCRDT

Refs: Shapiro et al. 2011a

# State-Based CRDTs

A state-based CRDT keeps users in sync by occasionally exchanging entire states, "merging" their operation histories. This sync strategy is useful in peer-to-peer networks (peers occasionally exchange states in order to bring each other up-to-date) and for the initial sync between a client and a server (the client merges its local state with the server's latest state, and vice-versa).

Current state (op history) + other state (overlapping op history) -> merged state (union of op histories).

In the simple protocol at the top of this post, the "entire state" is the literal operation history, and merging is just the set-union of operations (using the UIDs to filter duplicates). A state-based CRDT instead stores a state that is (usually) smaller than the complete operation history, but it still contains enough information to render the app-visible state, and it can be "merged" with another state.

Formally, a state-based CRDT consists of:

- A set of allowed CRDT states that a replica can be in.
- A query that returns the app-visible state. (The CRDT state often includes extra metadata that is not visible to the rest of the app, such as LWW timestamps.)
- For each operation, a state mutator that updates the local state.
- A merge function that inputs two states and outputs a "merged" state. The app using the CRDT may set local state = merge(local state, other state) at any time, where the other state usually comes from a remote collaborator or storage.

To claim that a state-based CRDT implements a given CRDT semantics, you must prove that the app-visible state always equals the semantics applied to the set of operations that contribute to the current state. Here an operation "contributes" to the output of its state mutator, plus future states resulting from that state (e.g., the merge of that state with another).

As an example, let's repeat just the state-based part of the LWW Register from Part 2.

- Per-user state: state = { value, time }, where time is a logical timestamp.
- Query: Return state.value.
- Operation set(newValue): Set state = { value: newValue, time: newTime }, where newTime is the current logical time.
- Merge in other state: Pick the state with the greatest logical timestamp. That is, if other.time > state.time, set state = other.

It is easy to check that this state-based CRDT has the desired semantics: at any time, the query returns the value corresponding to the set operation with the greatest logical timestamp that contributes to the current state.
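In TypeScript, the whole register might look like this (a sketch only; I use Lamport timestamps, described below, as the logical timestamps, with ties broken by replica ID):

```ts
type Timestamp = { time: number; replicaID: string };

// Total order on timestamps: greater time wins; ties broken by replicaID.
function timestampLess(a: Timestamp, b: Timestamp): boolean {
  return a.time < b.time || (a.time === b.time && a.replicaID < b.replicaID);
}

class StateBasedLWWRegister<T> {
  constructor(private state: { value: T; time: Timestamp }) {}

  // Query: the app-visible state.
  get value(): T {
    return this.state.value;
  }

  // State mutator: overwrite with the current logical time.
  set(newValue: T, newTime: Timestamp): void {
    this.state = { value: newValue, time: newTime };
  }

  // Merge: keep whichever state has the greater timestamp.
  merge(other: { value: T; time: Timestamp }): void {
    if (timestampLess(this.state.time, other.time)) this.state = other;
  }
}
```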
As a final note, observe that for any CRDT states s, t, u, the following algebraic rules hold, because the same set of operations contributes to both sides of each equation:

- (Idempotence) merge(s, s) = s.
- (Commutativity) merge(s, t) = merge(t, s).
- (Associativity) merge(s, merge(t, u)) = merge(merge(s, t), u).

Thus for a state-based CRDT, the merge function is Associative, Commutative, and Idempotent (ACI).

Conversely, you can prove that if an algorithm has the API of a state-based CRDT and it satisfies ACI, then its behavior corresponds to some CRDT semantics (i.e., some pure function of the operation history). This leads to the traditional definition of a state-based CRDT in terms of an ACI merge function. Of course, if you only prove these algebraic rules, there is no guarantee that the corresponding semantics are reasonable in the eyes of your users.

Collabs: saveCRDT and loadCRDT in PrimitiveCRDT

Refs: Shapiro et al. 2011a

# Other Sync Strategies

In a real collaborative app, it is inconvenient to choose op-based or state-based synchronization. Instead, it's nice to use both, potentially within the same session.

Example: When the user launches your app, first do a state-based sync with a storage server to become in sync. Then use op-based messages over TCP to stay in sync until the connection drops.

Thus hybrid op-based/state-based CRDTs that support both sync strategies are popular in practice. Typically, these look like either state-based CRDTs with op-based messages tacked on (Yjs, Collabs), or they use an op-based CRDT alongside a complete operation history (Automerge). To perform a state-based merge in the latter approach, you look through the received state's history for operations that are not already in your history, and deliver those to the op-based CRDT. This approach is simple, and it comes with a built-in version history, but it requires more effort to make efficient (as Automerge has been pursuing).

Other sync strategies use optimized peer-to-peer synchronization. Traditionally, peer-to-peer synchronization uses state-based CRDTs: each peer sends a copy of its own state to the other peer, then merges in the received state. This is inefficient if the states overlap a lot - e.g., the two peers just synced one minute ago and have only updated their states slightly since then. Optimized protocols like Yjs's sync protocol or Byzantine causal broadcast instead use back-and-forth messages to determine what info the other peer is missing and send just that.

For academic work on hybrid or optimized sync strategies, look up delta-state based CRDTs (also called delta CRDTs). These are like hybrid CRDTs, with the added technical requirement that op-based messages are themselves states (in particular, they are input to the state-based merge function instead of a separate effect function). Note that some papers focus on novel sync strategies, while others focus on the orthogonal problem of how to tolerate non-causally-ordered messages.

Collabs: Updates and Sync - Patterns

Refs: Yjs document updates; Almeida, Shoker, and Baquero 2016; Enes et al. 2019

# Misc Techniques

The rest of this post describes specific algorithms. We start with miscellaneous techniques that are needed to implement some of the semantics from Part 2: two kinds of logical timestamps for LWW, and ways to query the causal order. These are all traditional distributed systems techniques that are not specific to CRDTs.
# LWW: Lamport Timestamps

Recall that you should use a logical timestamp instead of wall-clock time for Last-Writer Wins (LWW) values. A Lamport timestamp is a simple and common logical timestamp, defined by:

- Each replica stores a clock value time, an integer that is initially 0.
- Whenever you perform an LWW set operation, increment time and attach its new value to the operation, as part of a pair (time, replicaID). This pair is called a Lamport timestamp.
- Whenever you receive a Lamport timestamp from another replica as part of an operation, set time = max(time, received time).

The total order on Lamport timestamps is given by: (t1, replica1) < (t2, replica2) if t1 < t2, or t1 = t2 and replica1 < replica2. That is, the operation with a greater time "wins", with ties broken using an arbitrary order on replica IDs.

Figure 1. An operation history with each operation labeled by a Lamport timestamp: operations A-F with arrows A to B, A to E, B to C, C to D, C to F, labeled (1, "A84nxi"); (2, "A84nxi"); (3, "A84nxi"); (5, "A84nxi"); (2, "bu2nVP"); (4, "bu2nVP").

Lamport timestamps have two important properties (mentioned in Part 2):

1. If o < p in the causal order, then (o's Lamport timestamp) < (p's Lamport timestamp). Thus a new LWW set always wins over all causally-prior sets. Note that the converse does not hold: it's possible that (q's Lamport timestamp) < (r's Lamport timestamp) but q and r are concurrent.
2. Lamport timestamps created by different users are always distinct (because of the replicaID tiebreaker). Thus the winner is never ambiguous.

Collabs: lamportTimestamp

Refs: Lamport 1978; Wikipedia
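A small TypeScript sketch of the clock update rules (names are illustrative; the comparison function is the timestampLess shown earlier):

```ts
type LamportTimestamp = { time: number; replicaID: string };

class LamportClock {
  private time = 0;

  constructor(private readonly replicaID: string) {}

  // On each LWW set: increment time and attach the pair (time, replicaID).
  next(): LamportTimestamp {
    this.time++;
    return { time: this.time, replicaID: this.replicaID };
  }

  // On receiving a timestamp as part of another replica's operation.
  receive(ts: LamportTimestamp): void {
    this.time = Math.max(this.time, ts.time);
  }
}
```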
# LWW: Hybrid Logical Clocks

Hybrid logical clocks are another kind of logical timestamp that combine features of Lamport timestamps and wall-clock time. I am not qualified to write about these, but Jared Forsyth gives a readable description here: https://jaredforsyth.com/posts/hybrid-logical-clocks/.

# Querying the Causal Order: Vector Clocks 2

One of Part 2's "Other Techniques" was Querying the Causal Order. For example, an access-control CRDT could include a rule like "If Alice performs an operation but an admin banned her concurrently, then treat Alice's operation as if it had not happened".

I mentioned in Part 2 that I find this technique too complicated for practical use, except in some special cases. Nevertheless, here are some ways to implement causal-order queries.

Formally, our goal is: given two operations o and p, answer the query "Is o < p in the causal order?". More narrowly, a CRDT might query whether o and p are concurrent, i.e., neither o < p nor p < o.

Recall from above that a vector clock is a map that sends a replica ID to the maximum counter value received from that replica, where each replica assigns counters to its own operations in order starting at 1 (like a dot ID). Besides storing a vector clock on each replica, we can also attach a vector clock to each operation: namely, the sender's vector clock at the time of sending. (The sender's own entry is incremented to account for the operation itself.)

Figure 2. An operation history with each operation labeled by its dot ID (blue italics) and vector clock (red normal text): operations A-F with arrows A to B, A to E, B to C, C to D, C to F, labeled ("A84nxi", 1) / { A84nxi: 1 }; ("A84nxi", 2) / { A84nxi: 2 }; ("A84nxi", 3) / { A84nxi: 3 }; ("A84nxi", 4) / { A84nxi: 4, bu2nVP: 2 }; ("bu2nVP", 1) / { A84nxi: 1, bu2nVP: 1 }; ("bu2nVP", 2) / { A84nxi: 3, bu2nVP: 2 }.

Define a partial order on vector clocks by: v < w if for every replica ID r, v[r] <= w[r], and for at least one replica ID, v[r] < w[r]. (If r is not present in a map, treat its value as 0.) Then it is a classic result that the causal order on operations matches the partial order on their vector clocks. Thus storing each operation's vector clock lets you query the causal order later.

Example: In the above diagram, { A84nxi: 1, bu2nVP: 1 } < { A84nxi: 4, bu2nVP: 2 }, matching the causal order on their operations. Meanwhile, { A84nxi: 1, bu2nVP: 1 } and { A84nxi: 2 } are incomparable, matching the fact that their operations are concurrent.
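In code, the partial-order check is a direct translation (a TypeScript sketch; missing entries are treated as 0):

```ts
type VectorClock = { [replicaID: string]: number };

// Is v < w in the partial order (= the causal order on their operations)?
function vcLess(v: VectorClock, w: VectorClock): boolean {
  let strictlyLessSomewhere = false;
  const replicaIDs = new Set([...Object.keys(v), ...Object.keys(w)]);
  for (const r of replicaIDs) {
    const vr = v[r] ?? 0;
    const wr = w[r] ?? 0;
    if (vr > wr) return false;
    if (vr < wr) strictlyLessSomewhere = true;
  }
  return strictlyLessSomewhere;
}

// Operations are concurrent iff their vector clocks are incomparable.
function vcConcurrent(v: VectorClock, w: VectorClock): boolean {
  return !vcLess(v, w) && !vcLess(w, v);
}
```

Running this on the example above: vcLess({ A84nxi: 1, bu2nVP: 1 }, { A84nxi: 4, bu2nVP: 2 }) returns true, while vcConcurrent({ A84nxi: 1, bu2nVP: 1 }, { A84nxi: 2 }) also returns true.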
Often you only need to query the causal order on new operations. That is, you just received an operation p, and you want to compare it to an existing operation o. For this, it suffices to know o's dot ID (replicaID, counter): if counter <= p.vc[replicaID], then o < p, else they are concurrent. Thus in this case, you don't need to store each operation's vector clock, just their dot IDs (though you must still send vector clocks over the network).

The above discussion changes slightly if you do not assume causal order delivery. See Wikipedia's update rules.

We have a performance problem: the size of a vector clock is proportional to the number of past replicas. In a collaborative app, this number tends to grow without bound: each browser tab creates a new replica, including refreshes. Thus if you attach a vector clock to each op-based message, your network usage also grows without bound.

Some workarounds:

1. Instead of attaching the whole vector clock to each op-based message, just attach the UIDs of the operation's immediate causal predecessors. (You are probably attaching these anyway for causal-order delivery.) Then on the receiver side, look up the predecessors' vector clocks in your stored state, take their entry-wise max, add one to the sender's entry, and store that as the operation's vector clock.
2. Same as 1, but instead of storing the whole vector clock for each operation, just store its immediate causal predecessors - the arrows in the operation history. This uses less space, and it still contains enough information to answer causal order queries: o < p if and only if there is a path of arrows from o to p. However, I don't know of a way to perform those queries quickly.
3. Instead of referencing the causal order directly, the sender can list just the "relevant" o < p as part of p's op-based message. For example, when you set the value of a multi-value register, instead of using a vector clock to indicate which set(x) operations are causally prior, just list the UIDs of the current multi-values (cf. the multi-value register on top of a unique set).

Collabs: vectorClock

Refs: Baquero and Preguiça 2016; Wikipedia; Automerge issue discussing workarounds

# Optimized CRDTs

We now turn to optimizations. I focus on algorithmic optimizations that change what state you store and how you access it, as opposed to low-level code tricks. Usually, the optimizations reduce the amount of metadata that you need to store in memory and on disk, at least in the common case.

These optimizations are the most technical part of the blog series. You may wish to skip them for now and come back only when you are implementing one, or trust someone else to implement them in a library.

# List CRDTs

I assume you've understood Lists and Text Editing from Part 2.

There are too many list CRDT algorithms and optimizations to survey here, but I want to briefly introduce one key problem and solution.

When you use a text CRDT to represent a collaborative text document, the easy way to represent the state is as an ordered map (list CRDT position) -> (text character). Concretely, this map could be a tree with one node per list CRDT position, like in Fugue: A Basic List CRDT.

Figure 3. Example of a Fugue tree with corresponding text "abcde". Each node's UID is a dot.

In such a tree, each tree node contains at minimum (1) a UID and (2) a pointer to its parent node. That is a lot of metadata for a single text character! Plus, you often need to store this metadata even for deleted characters (tombstones).

Here is an optimization that dramatically reduces the metadata overhead in practice:

- When a replica inserts a sequence of characters from left to right (the common case), instead of creating a new UID for each character, only create a UID for the leftmost character.
- Store the whole sequence as a single object (id, parentId etc, [char0, char1, ..., charN]). So instead of one tree node per character, your state has one tree node per sequence, storing an array of characters.
- To address individual characters, use list CRDT positions of the form (id, 0), (id, 1), …, (id, N).

It's possible to later insert characters in the middle of a sequence, e.g., between char1 and char2. That's fine; the new characters just need to indicate the corresponding list CRDT positions (e.g. "I am a left child of (id, 2)").

Applying this optimization to Fugue gives you trees like so, where only the filled nodes are stored explicitly (together with their children's characters):

Figure 4. Example of an optimized Fugue tree with corresponding text "abcdefg".

Collabs: Waypoints in CTotalOrder

Refs: Yu 2012; Jahns 2020 (Yjs blog post)

# Formatting Marks (Rich Text)

Recall the inline formatting CRDT from Part 2. Its internal CRDT state is an append-only log of formatting marks:

```ts
type Mark = {
  key: string;
  value: any;
  timestamp: LogicalTimestamp;
  start: { pos: Position, type: "before" | "after" }; // type Anchor
  end: { pos: Position, type: "before" | "after" }; // type Anchor
};
```

Its app-visible state is the view of this log given by: for each character c, for each format key key, find the mark with the largest timestamp satisfying

- mark.key = key, and
- the interval (mark.start, mark.end) contains c's position.

Then c's format value at key is mark.value.

For practical use, we would like a view that represents the same state but uses less memory. In particular, instead of storing per-character formatting info, it should look more like a Quill delta. E.g., the Quill delta representing "Quick brown fox" is

```ts
{
  ops: [
    { insert: "Quick " },
    { insert: "brow", attributes: { bold: true, italic: true } },
    { insert: "n fox", attributes: { italic: true } }
  ]
}
```

Here is such a view. Its state is a map: Map<Anchor, Mark[]>, given by: for each anchor that appears as the start or end of any mark in the log, the value at anchor contains pointers to all marks that start at or strictly contain anchor. That is, map.get(anchor) = { mark in log | mark.start <= anchor < mark.end }.

Given this view, it is easy to look up the format of any particular character. You just need to go left until you reach an anchor that is in map, then interpret map.get(anchor) in the usual way: for each key, find the LWW winner at key and use its value. (If you reach the beginning of the list, the character has no formatting.)
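For that last step, resolving LWW among the marks stored at one anchor might look like this (a TypeScript sketch; Mark is the type above, and timestampLess is an assumed total-order comparison on logical timestamps, not part of the original):

```ts
// Assumed: a total-order comparison on logical timestamps.
declare function timestampLess(a: LogicalTimestamp, b: LogicalTimestamp): boolean;

// Interpret the marks at an anchor: for each key, the mark with the
// largest timestamp wins; a null value means "unformatted".
function resolveFormat(marks: Mark[]): { [key: string]: any } {
  const winners = new Map<string, Mark>();
  for (const mark of marks) {
    const current = winners.get(mark.key);
    if (current === undefined || timestampLess(current.timestamp, mark.timestamp)) {
      winners.set(mark.key, mark);
    }
  }
  const format: { [key: string]: any } = {};
  for (const [key, mark] of winners) {
    if (mark.value !== null) format[key] = mark.value;
  }
  return format;
}
```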
I claim that with sufficient coding effort, you can also do the following tasks efficiently:

1. Convert the whole view into a Quill delta or similar rich-text-editor state.
2. Update the view to reflect a new mark added to the log.
3. After updating the view, emit events describing what changed, in a form like Collabs's RichTextFormatEvent:

```ts
type RichTextFormatEvent = {
  // The formatted range is [startIndex, endIndex).
  startIndex: number;
  endIndex: number;
  key: string;
  value: any; // null if unformatted
  previousValue: any;
  // The range's complete new format.
  format: { [key: string]: any };
};
```

Let's briefly discuss Task 2; there's more detail in the Peritext essay. When a new mark mark is added to the log:

- If mark.start is not present in map, go to the left of it until you reach an anchor prev that is in map, then do map.set(mark.start, copy of map.get(prev)). (If you reach the beginning of the list, do map.set(mark.start, []).)
- If mark.end is not present in map, do likewise.
- For each entry (anchor, array) in map such that mark.start <= anchor < mark.end, append mark to array.

Some variations on this section:

- Collabs represents the Map<Anchor, Mark[]> literally, using a LocalList - a local data structure that lets you build an ordered map on top of a separate list CRDT's positions. Alternatively, you can store each Mark[] inline with the list CRDT, at its anchor's location; that is how the Peritext essay does it.
- When saving the state to disk, you can choose whether to save the view, or forget it and recompute it from the mark log on next load. Likewise for state-based merging: you can try to merge the views directly, or just process the non-redundant marks one at a time.
- In each map value (a Mark[]), you can safely forget marks that have LWW-lost to another mark in the same array. Once a mark has been deleted from every map value, you can safely forget it from the log.

Collabs: CRichText; its source code shows all three tasks

Refs: Litt et al. 2021 (Peritext)
# State-Based Counter

The easy way to count events in a collaborative app is to store the events in an append-only log or unique set. This uses more space than the count alone, but you often want that extra info anyway - e.g., to display who liked a post, in addition to the like count.

Nonetheless, the optimized state-based counter CRDT is both interesting and traditional, so let's see it.

The counter's semantics are as in Part 1's passenger counting example: its value is the number of +1 operations in the history, regardless of concurrency.

You can obviously achieve this semantics by storing an append-only log of +1 operations. To merge two states, take the union of log entries, skipping duplicate UIDs. In other words, store the entire operation history, following the simple protocol from the top of this post.

Suppose the append-only log uses dot IDs as its UIDs. Then the log's state will always look something like this:

```
[
  ((a6X7fx, 1), "+1"), ((a6X7fx, 2), "+1"), ((a6X7fx, 3), "+1"), ((a6X7fx, 4), "+1"),
  ((bu91nD, 1), "+1"), ((bu91nD, 2), "+1"), ((bu91nD, 3), "+1"),
  ((yyn898, 1), "+1"), ((yyn898, 2), "+1")
]
```

You can compress this state by storing, for each replicaID, only the range of dot IDs received from that replica. For example, the above log compresses to

```
{
  a6X7fx: 4,
  bu91nD: 3,
  yyn898: 2
}
```

This is the same trick we used in Vector Clocks 1.

Compressing the log in this way leads to the following algorithm, the state-based counter CRDT.

- Per-user state: A Map<ReplicaID, number>, mapping each replicaID to the number of +1 operations received from that replica. (Traditionally, this state is called a vector instead of a map.)
- App-visible state (the count): Sum of the map values.
- Operation +1: Add 1 to your own map entry, treating a missing entry as 0.
- Merge in other state: Take the entry-wise max of values, treating missing entries as 0. That is, for all r, set this.state[r] = max(this.state[r] ?? 0, other.state[r] ?? 0).

For example:

```
// Starting local state:
{
  a6X7fx: 2, // Implies ops ((a6X7fx, 1), "+1"), ((a6X7fx, 2), "+1")
  bu91nD: 3,
}
// Other state:
{
  a6X7fx: 4, // Implies ops ((a6X7fx, 1), "+1"), ..., ((a6X7fx, 4), "+1")
  bu91nD: 1,
  yyn898: 2
}
// Merged result:
{
  a6X7fx: 4, // Implies ops ((a6X7fx, 1), "+1"), ..., ((a6X7fx, 4), "+1"): union of inputs
  bu91nD: 3,
  yyn898: 2
}
```

You can generalize the state-based counter to handle +x operations for arbitrary positive values x. However, to handle both positive and negative additions, you need to use two counters: P for positive additions and N for negative additions. The actual value is (P's value) - (N's value). This is the state-based PN-counter.

Collabs: CCounter

Refs: Shapiro et al. 2011a
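A TypeScript sketch of this counter (illustrative only; the replica ID is fixed at construction):

```ts
class StateBasedCounter {
  // Per-user state: replicaID -> number of +1 ops received from that replica.
  readonly state = new Map<string, number>();

  constructor(private readonly replicaID: string) {}

  // App-visible state: the sum of the map values.
  get value(): number {
    let sum = 0;
    for (const n of this.state.values()) sum += n;
    return sum;
  }

  // Operation +1: bump our own entry, treating a missing entry as 0.
  inc(): void {
    this.state.set(this.replicaID, (this.state.get(this.replicaID) ?? 0) + 1);
  }

  // Merge: entry-wise max, treating missing entries as 0.
  merge(other: ReadonlyMap<string, number>): void {
    for (const [r, n] of other) {
      this.state.set(r, Math.max(this.state.get(r) ?? 0, n));
    }
  }
}
```

Usage mirrors the example above: a.merge(b.state) never loses +1 operations, because each entry only grows.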
# Delta-State Based Counter

You can modify the state-based counter CRDT to also support op-based messages:

- When a user performs a +1 operation, broadcast its dot ID (r, c).
- Recipients add this dot to their compressed log map, by setting map[r] = max(map[r], c).

This hybrid op-based/state-based CRDT is called the delta-state based counter CRDT.

Technically, the delta-state based counter CRDT assumes causal-order delivery for op-based messages. Without this assumption, a replica's uncompressed log might contain gaps like

```
[((a6X7fx, 1), "+1"), ((a6X7fx, 2), "+1"), ((a6X7fx, 3), "+1"), ((a6X7fx, 6), "+1")]
```

which we can't represent as a Map<ReplicaID, number>.

You could argue that the operation ((a6X7fx, 6), "+1") lets you "infer" the prior operations ((a6X7fx, 4), "+1") and ((a6X7fx, 5), "+1"), hence you can just set the map entry to { a6X7fx: 6 }. However, this will give an unexpected counter value if those prior operations were deliberately undone, or if they're tied to some other change that can't be inferred (e.g., a count of comments vs the actual list of comments).

Luckily, you can still compress ranges within the dot IDs that you have received. For example, you could use a run-length encoding:

```
{
  a6X7fx: [1 through 3, 6 through 6]
}
```

or a map plus a set of "extra" dot IDs:

```
{
  map: { a6X7fx: 3 },
  dots: [["a6X7fx", 6]]
}
```

This idea leads to a second delta-state based counter CRDT. Its state-based merge algorithm is somewhat complicated, but it has a simple spec: decompress both inputs, take the union, and re-compress.

Refs: Almeida, Shoker, and Baquero 2016

# State-Based Unique Set

Here is a straightforward state-based CRDT for the unique set:

- Per-user state:
  - A set of elements (id, x), which is the set's literal (app-visible) state.
  - A set of tombstones id, which are the UIDs of all deleted elements.
- App-visible state: Return elements.
- Operation add(x): Generate a new UID id, then add (id, x) to elements.
- Operation delete(id): Delete the pair with the given id from elements, and add id to tombstones.
- State-based merge: To merge in another user's state other = { elements, tombstones }:
  - For each (id, x) in other.elements, if it is not already present in this.elements and id is not in this.tombstones, add (id, x) to this.elements.
  - For each id in other.tombstones, if it is not already present in this.tombstones, add id to this.tombstones and delete the pair with the given id from this.elements (if present).

For example:

```
// Starting local state:
{
  elements: [ (("A84nxi", 1), "milk"), (("A84nxi", 3), "eggs") ],
  tombstones: [ ("A84nxi", 2), ("bu2nVP", 1), ("bu2nVP", 2) ]
}
// Other state:
{
  elements: [
    (("A84nxi", 3), "eggs"), (("bu2nVP", 1), "bread"), (("bu2nVP", 2), "butter"),
    (("bu2nVP", 3), "cereal")
  ],
  tombstones: [ ("A84nxi", 1), ("A84nxi", 2) ]
}
// Merged result:
{
  elements: [ (("A84nxi", 3), "eggs"), (("bu2nVP", 3), "cereal") ],
  tombstones: [ ("A84nxi", 1), ("A84nxi", 2), ("bu2nVP", 1), ("bu2nVP", 2) ]
}
```

The problem with this straightforward algorithm is the tombstone set: it stores a UID for every deleted element, potentially making the CRDT state much larger than the app-visible state.

Luckily, when your UIDs are dot IDs, you can use a "compression" trick similar to the state-based counter CRDT: in place of the tombstone set, store the range of dot IDs received from each replica (deleted or not), as a Map<ReplicaID, number>. That is, store a modified vector clock that only counts add operations. Any dot ID that lies within this range, but is not present in elements, must have been deleted.

For example, the three states above compress to:

```
// Starting local state:
{
  elements: [ (("A84nxi", 1), "milk"), (("A84nxi", 3), "eggs") ],
  vc: { A84nxi: 3, bu2nVP: 2 }
}
// Other state:
{
  elements: [
    (("A84nxi", 3), "eggs"), (("bu2nVP", 1), "bread"), (("bu2nVP", 2), "butter"),
    (("bu2nVP", 3), "cereal")
  ],
  vc: { A84nxi: 3, bu2nVP: 3 }
}
// Merged result:
{
  elements: [ (("A84nxi", 3), "eggs"), (("bu2nVP", 3), "cereal") ],
  vc: { A84nxi: 3, bu2nVP: 3 }
}
```

Compressing the tombstone set in this way leads to the following algorithm, the optimized state-based unique set:

- Per-user state:
  - A set of elements (id, x) = ((r, c), x).
  - A vector clock vc: Map<ReplicaID, number>.
  - A local counter counter, used for dot IDs.
- App-visible state: Return elements.
- Operation add(x): Generate a new dot id = (local replicaID, ++counter), then add (id, x) to elements. Also increment vc[local replicaID].
- Operation delete(id): Merely delete the pair with the given id from elements.
- State-based merge: To merge in another user's state other = { elements, vc }:
  - For each ((r, c), x) in other.elements, if it is not already present in this.elements and c > this.vc[r], add ((r, c), x) to this.elements. (It's new and has not been deleted locally.)
  - For each entry (r, c) in other.vc, for each pair ((r, c'), x) in this.elements, if c >= c' and the pair is not present in other.elements, delete it from this.elements. (It must have been deleted from the other state.)
  - Set this.vc to the entry-wise max of this.vc and other.vc, treating missing entries as 0. That is, for all r, set this.vc[r] = max(this.vc[r] ?? 0, other.vc[r] ?? 0).

You can re-use the same vector clock that you use for tracking operations, if your unique set's dot IDs use the same counter. Nitpicks:

- Your unique set might skip over some dot IDs, because they are used for other operations; that is fine.
- If a single operation can add multiple elements at once, consider using UIDs of the form (dot ID, within-op counter) = (replicaID, per-op counter, within-op counter).

The optimized unique set is especially important because you can implement many other CRDTs on top, which are then also optimized (they avoid tombstones). In particular, Part 2 describes unique set algorithms for the multi-value register, multi-value map, and add-wins set (all assuming causal-order delivery). You can also adapt the optimized unique set to manage deletions in the unique set of CRDTs.

Collabs: CMultiValueMap, CSet

Refs: Based on the Optimized OR-Set from Bieniusa et al. 2012
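Here is a TypeScript sketch of the optimized unique set (illustrative only; dots are keyed by a serialized "replicaID:counter" string for easy lookup):

```ts
type Dot = { r: string; c: number }; // (replicaID, counter)
const dotKey = (dot: Dot) => `${dot.r}:${dot.c}`;

class OptimizedUniqueSet<T> {
  private readonly elements = new Map<string, { dot: Dot; value: T }>();
  // Modified vector clock: counts add operations per replica.
  private readonly vc = new Map<string, number>();
  private counter = 0;

  constructor(private readonly replicaID: string) {}

  add(value: T): Dot {
    const dot: Dot = { r: this.replicaID, c: ++this.counter };
    this.elements.set(dotKey(dot), { dot, value });
    this.vc.set(dot.r, dot.c);
    return dot;
  }

  delete(dot: Dot): void {
    // No tombstone: merely remove the pair.
    this.elements.delete(dotKey(dot));
  }

  merge(other: OptimizedUniqueSet<T>): void {
    // 1. Add other's elements that are new here (never seen locally).
    for (const [key, elt] of other.elements) {
      if (!this.elements.has(key) && elt.dot.c > (this.vc.get(elt.dot.r) ?? 0)) {
        this.elements.set(key, elt);
      }
    }
    // 2. Remove local elements that the other state saw but deleted.
    for (const [key, elt] of this.elements) {
      if (elt.dot.c <= (other.vc.get(elt.dot.r) ?? 0) && !other.elements.has(key)) {
        this.elements.delete(key);
      }
    }
    // 3. vc = entry-wise max.
    for (const [r, c] of other.vc) {
      this.vc.set(r, Math.max(this.vc.get(r) ?? 0, c));
    }
  }
}
```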
(It must have been deleted from the other state.)Set this.vc to the entry-wise max of this.vc and other.vc, treating missing entries as 0. That is, for all r, set this.vc[r] = max(this.vc[r] ?? 0, other.vc[r] ?? 0).You can re-use the same vector clock that you use for tracking operations, if your unique set’s dot IDs use the same counter. Nitpicks:Your unique set might skip over some dot IDs, because they are used for other operations; that is fine.If a single operation can add multiple elements at once, consider using UIDs of the form (dot ID, within-op counter) = (replicaID, per-op counter, within-op counter).The optimized unique set is especially important because you can implement many other CRDTs on top, which are then also optimized (they avoid tombstones). In particular, Part 2 describes unique set algorithms for the multi-value register, multi-value map, and add-wins set (all assuming causal-order delivery). You can also adapt the optimized unique set to manage deletions in the unique set of CRDTs.Collabs: CMultiValueMap, CSetRefs: Based on the Optimized OR-Set from Bieniusa et al. 2012# Delta-State Based Unique SetSimilar to the second delta-state based counter CRDT, you can create a delta-state based unique set that is a hybrid op-based/state-based CRDT and allows non-causal-order message delivery. I’ll leave it as an exercise.Refs: Causal δ-CRDTs in Almeida, Shoker, and Baquero 2016# ConclusionThe last two posts surveyed the two “topics” from Part 1:Semantics: An abstract description of what a collaborative app’s state should be, given its concurrency-aware operation history.Algorithms: Algorithms to efficiently compute the app’s state in specific, practical situations.Together, they covered much of the CRDT theory that I know and use. I hope that you now know it too!However, there are additional CRDT ideas outside my focus area. I’ll give a bibliography for those in the next and final post, Part 4: Further Topics.This blog post is Part 3 of a series.Part 1: IntroductionPart 2: Semantic TechniquesPart 3: Algorithmic TechniquesPart 4: Further TopicsHome • Matthew Weidner • PhD student at CMU CSD • mweidner037 [at] gmail.com • @MatthewWeidner3 • LinkedIn • GitHub
Sym·poly·mathesy, by Chris Krycho

# jj init

What if we actually could replace Git? Jujutsu might give us a real shot.

Assumed audience: People who have worked with Git or other modern version control systems like Mercurial, Darcs, Pijul, Bazaar, etc., and have at least a basic idea of how they work.

Jujutsu is a new version control system from a software engineer at Google, where it is on track to replace Google’s existing version control systems (historically: Perforce, Piper, and Mercurial). I find it interesting both for the approach it takes and for its careful design choices in terms of both implementation details and user interface. It offers one possible answer to a question I first started asking most of a decade ago: What might a next-gen version control system look like — one which actually learned from the best parts of all of this generation’s systems, including Mercurial, Git, Darcs, Fossil, etc.?

To answer that question, it is important to have a sense of what those lessons are. This is trickier than it might seem. Git has substantially the most “mind-share” in the current generation; most software developers learn it and use it not because they have done any investigation of the tool and its alternatives but because it is a de facto standard: a situation which arose in no small part because of its “killer app” in the form of GitHub. Developers who have been around for more than a decade or so have likely seen more than one version control system — but there are many, many developers for whom Git was their first and, so far, last VCS.

The problems with Git are many, though. Most of all, its infamously terrible command line interface results in a terrible user experience. In my experience, very few working developers have a good mental model for Git. Instead, they have a handful of commands they have learned over the years: enough to get by, and little more. The common rejoinder is that developers ought to learn how Git works internally — that everything will make more sense that way.

This is nonsense. Git’s internals are interesting on an implementation level, but frankly add up to an incoherent mess in terms of a user mental model. This is a classic mistake for software developers, and one I have fallen prey to myself any number of times. I do not blame the Git developers for it, exactly. No one should have to understand the internals of the system to use it well, though; that is a simple failure of software design. Moreover, even those internals do not particularly cohere. The index, the number of things labeled “-ish” in the glossary, the way that a “detached HEAD” interacts with branches, the distinction between tags and branches, the important distinctions between commits, refs, and objects… It is not that any one of those things is bad in isolation, but as a set they do not amount to a mental model I can describe charitably. Put in programming language terms: One of the reasons the “surface syntax” of Git is so hard is that its semantics are a bit confused, and that inevitably shows up in the interface to users.

Still, a change in a system so deeply embedded in the software development ecosystem is not cheap. Is it worth the cost of adoption? Well, Jujutsu has a trick up its sleeve: there is no adoption cost. You just install it — brew install jj will do the trick on macOS — and run a single command in an existing Git repository, and… that’s it.
(“There is no step 3.”) I expect that mode will always work, even though there will be a migration step at some point in the future, when Jujutsu’s own, non-Git backend becomes a viable — and ultimately the recommended — option. I am getting ahead of myself though. The first thing to understand is what Jujutsu is, and is not.

Jujutsu is two things:

1. It is a new front-end to Git. This is by far the less interesting of the two things, but in practice it is a substantial part of the experience of using the tool today. In this regard, it sits in the same notional space as something like gitoxide. Jujutsu’s jj is far more usable for day to day work than gitoxide’s gix and ein so far, though, and it also has very different aims. That takes us to:

2. It is a new design for distributed version control. This is by far the more interesting part. In particular, Jujutsu brings to the table a few key concepts — none of which are themselves novel, but the combination of which is really nice to use in practice:

   - Changes are distinct from revisions: an idea borrowed from Mercurial, but quite different from Git’s model.
   - Conflicts are first-class items: an idea borrowed from Pijul and Darcs.
   - The user interface is not only reasonable but actually really good: an idea borrowed from… literally every VCS other than Git.

The combo of those means that you can use it today in your existing Git repos, as I have been for the past six months, and that it is a really good experience using it that way. (Better than Git!) Moreover, given it is being actively developed at and by Google for use as a replacement for its current custom VCS setup, it seems like it has a good future ahead of it. Net: at a minimum you get a better experience for using Git with it. At a maximum, you get an incredibly smooth and shallow on-ramp to what I earnestly hope is the future of version control.

Jujutsu is not trying to do every interesting thing that other Git-alternative DVCS systems out there do. Unlike Pijul, for example, it does not work from a theory of patches such that the order changes are applied is irrelevant. However, as I noted above and show in detail below, jj does distinguish between changes and revisions, and has first-class support for conflicts, which means that many of the benefits of Pijul’s handling come along anyway. Unlike Fossil, Jujutsu is also not trying to be an all-in-one tool. Accordingly: It does not come with a replacement for GitHub or other such “forges”. It does not include bug tracking. It does not support chat or a forum or a wiki. Instead, it is currently aimed at just doing the base VCS operations well.

Finally, there is a thing Jujutsu is not yet: a standalone VCS ready to use without Git. It supports its own, “native” backend for the sake of keeping that door open for future capabilities, and the test suite exercises both the Git and the “native” backend, but the “native” one is not remotely ready for regular use. That said, this one I do expect to see change over time!

One of the really interesting bits about picking up Jujutsu is realizing just how weirdly Git has wired your brain, and re-learning how to think about how a version control system can work. It is one thing to believe — very strongly, in my case! — that Git’s UI design is deeply janky (and its underlying model just so-so); it is something else to experience how much better a VCS UI can be (even without replacing the underlying model!).

[Image: Yoda saying “You must unlearn what you have learned.”]

Time to become a Jedi Knight. Jujutsu Knight? Jujutsu Master?
Jujutsu apprentice, at least. Let’s dig in!

# Using Jujutsu

That is all interesting enough philosophically, but for a tool that, if successful, will end up being one of a software developer’s most-used tools, there is an even more important question: What is it actually like to use?

Setup is painless. Running brew install jj did everything I needed. As with most modern Rust-powered CLI tools,1 Jujutsu comes with great completions right out of the box. I did make one post-install tweak, since I am going to be using this on existing Git projects: I updated my ~/.gitignore_global to ignore .jj directories anywhere on disk.2

Using Jujutsu in an existing Git project is also quite easy.3 You just run jj git init --git-repo <path to repo>.4 That’s the entire flow. After that you can use git and jj commands alike on the repository, and everything Just Works™, right down to correctly handling .gitignore files. I have since run jj git init in every Git repository I am actively working on, and have had no issues in many months. It is also possible to initialize a Jujutsu copy of a Git project without having an existing Git repo, using jj git clone, which I have also done, and which works well.

[Image: Cloning true-myth and initializing it as a Jujutsu repo]

Once a project is initialized, working on it is fairly straightforward, though there are some significant adjustments required if you have deep-seated habits from Git!

# Revisions and revsets

One of the first things to wrap your head around when first coming to Jujutsu is its approach to revisions and revsets, i.e. “sets of revisions”. Revisions are the fundamental elements of changes in Jujutsu, not “commits” as in Git. Revsets are then expressions in a functional language for selecting a set of revisions. Both the idea and the terminology are borrowed directly from Mercurial, though the implementation is totally new. (Many things about Jujutsu borrow from Mercurial — a decision which makes me quite happy.) The vast majority of Jujutsu commands take a --revision/-r option to select a revision. So far that might not sound particularly different from Git’s notion of commits and commit ranges, and they are indeed similar at a surface level. However, the differences start showing up pretty quickly, both in terms of working with revisions and in terms of how revisions are a different notion of change than a Git commit.

The first place you are likely to experience how revisions and revsets are different — and neat! — is with the log command, since looking at the commit log is likely to be something you do pretty early in using a new version control tool. (Certainly it was for me.) When you clone a repo and initialize Jujutsu in it and then run jj log, you will see something rather different from what git log would show you — indeed, rather different from anything I even know how to get git log to show you.
For example, here’s what I see today when running jj log on the Jujutsu repository, limiting it to show just the last 10 revisions:

```
> jj log --limit 10
@  ukvtttmt hello@chriskrycho.com 2024-02-03 09:37:24.000 -07:00 1a0b8773
│  (empty) (no description set)
◉  qppsqonm essiene@google.com 2024-02-03 15:06:09.000 +00:00 main* HEAD@git bcdb9beb
·  cli: Move git_init() from init.rs to git.rs
· ◉  rzwovrll ilyagr@users.noreply.github.com 2024-02-01 14:25:17.000 -08:00
┌─┘  ig/contributing@origin 01e0739d
│  Update contributing.md
◉  nxskksop 49699333+dependabot[bot]@users.noreply.github.com 2024-02-01 08:56:08.000
·  -08:00 fb6c834f
·  cargo: bump the cargo-dependencies group with 3 updates
· ◉  tlsouwqs jonathantanmy@google.com 2024-02-02 21:26:23.000 -08:00
· │  jt/missingop@origin missingop@origin 347817c6
· │  workspace: recover from missing operation
· ◉  zpkmktoy jonathantanmy@google.com 2024-02-02 21:16:32.000 -08:00 2d0a444e
· │  workspace: inline is_stale()
· ◉  qkxullnx jonathantanmy@google.com 2024-02-02 20:58:21.000 -08:00 7abf1689
┌─┘  workspace: refactor for_stale_working_copy
◉  yyqlyqtq yuya@tcha.org 2024-01-31 09:40:52.000 +09:00 976b8012
·  index: on reinit(), delete all segment files to save disk space
· ◉  oqnvqzzq martinvonz@google.com 2024-01-23 10:34:16.000 -08:00
┌─┘  push-oznkpsskqyyw@origin 54bd70ad
│  working_copy: make reset() take a commit instead of a tree
◉  rrxuwsqp stephen.g.jennings@gmail.com 2024-01-23 08:59:43.000 -08:00 57d5abab
·  cli: display which file's conflicts are being resolved
```

Here’s the output for the same basic command in Git — note that I am not trying to get a similar output from Git, just asking what it shows by default (and warning: wall of log output!):

```
> git log -10
commit: bcdb9beb6ce5ba625ae73d4839e4574db3d9e559 HEAD -> main, origin/main
date:   Mon, 15 Jan 2024 22:31:33 +0000
author: Essien Ita Essien

    cli: Move git_init() from init.rs to git.rs

    * Move git_init() to cli/src/commands/git.rs and call it from there.
    * Move print_trackable_remote_branches into cli_util since it's not git specific,
      but would apply to any backend that supports remote branches.
    * A no-op change. A follow up PR will make use of this.

commit: 31e4061bab6cfc835e8ac65d263c29e99c937abf
date:   Mon, 8 Jan 2024 10:41:07 +0000
author: Essien Ita Essien

    cli: Refactor out git_init() to encapsulate all git related work.

    * Create a git_init() function in cli/src/commands/init.rs where all git related
      work is done. This function will be moved to cli/src/commands/git.rs in a
      subsequent PR.

commit: 8423c63a0465ada99c81f87e06f833568a22cb48
date:   Mon, 8 Jan 2024 10:41:07 +0000
author: Essien Ita Essien

    cli: Refactor workspace root directory creation

    * Add file_util::create_or_reuse_dir() which is needed by all init
      functionality regardless of the backend.

commit: b3c47953e807bef202d632c4e309b9a8eb814fde
date:   Wed, 31 Jan 2024 20:53:23 -0800
author: Ilya Grigoriev

    config.md docs: document `jj config edit` and `jj config path`

    This changes the intro section to recommend using `jj config edit` to
    edit the config instead of looking for the files manually.

commit: e9c482c0176d5f0c0c28436f78bd6002aa23a5e2
date:   Wed, 31 Jan 2024 20:53:23 -0800
author: Ilya Grigoriev

    docs: mention in `jj help config edit` that the command can create a file

commit: 98948554f72d4dc2d5f406da36452acb2868e6d7
date:   Wed, 31 Jan 2024 20:53:23 -0800
author: Ilya Grigoriev

    cli `jj config`: add `jj config path` command

commit: 8a4b3966a6ff6b9cc1005c575d71bfc7771bced1
date:   Fri, 2 Feb 2024 22:08:00 -0800
author: Ilya Grigoriev

    test_global_opts: make test_version just a bit nicer when it fails

commit: 42e61327718553fae6b98d7d96dd786b1f050e4c
date:   Fri, 2 Feb 2024 22:03:26 -0800
author: Ilya Grigoriev

    test_global_opts: extract --version to its own test

commit: 42c85b33c7481efbfec01d68c0a3b1ea857196e0
date:   Fri, 2 Feb 2024 15:23:56 +0000
author: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

    cargo: bump the cargo-dependencies group with 1 update

    Bumps the cargo-dependencies group with 1 update: [tokio](https://github.com/tokio-rs/tokio).

    Updates `tokio` from 1.35.1 to 1.36.0
    - [Release notes](https://github.com/tokio-rs/tokio/releases)
    - [Commits](https://github.com/tokio-rs/tokio/compare/tokio-1.35.1...tokio-1.36.0)

    ---
    updated-dependencies:
    - dependency-name: tokio
      dependency-type: direct:production
      update-type: version-update:semver-minor
      dependency-group: cargo-dependencies
    ...

    Signed-off-by: dependabot[bot]

commit: 32c6406e5f04d2ecb6642433b0faae2c6592c151
date:   Fri, 2 Feb 2024 15:22:21 +0000
author: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

    github: bump the github-dependencies group with 1 update

    Bumps the github-dependencies group with 1 update:
    [DeterminateSystems/magic-nix-cache-action](https://github.com/determinatesystems/magic-nix-cache-action).

    Updates `DeterminateSystems/magic-nix-cache-action` from
    1402a2dd8f56a6a6306c015089c5086f5e1ca3ef to eeabdb06718ac63a7021c6132129679a8e22d0c7
    - [Release notes](https://github.com/determinatesystems/magic-nix-cache-action/releases)
    - [Commits](https://github.com/determinatesystems/magic-nix-cache-action/compare/1402a2dd8f56a6a6306c015089c5086f5e1ca3ef...eeabdb06718ac63a7021c6132129679a8e22d0c7)

    ---
    updated-dependencies:
    - dependency-name: DeterminateSystems/magic-nix-cache-action
      dependency-type: direct:production
      dependency-group: github-dependencies
    ...

    Signed-off-by: dependabot[bot]
```

What’s happening in the Jujutsu log output? Per the tutorial’s note on the log command specifically:

> By default, jj log lists your local commits, with some remote commits added for context. The ~ indicates that the commit has parents that are not included in the graph.
> We can use the -r flag to select a different set of revisions to list.

What jj log shows by default was still a bit non-obvious to me, even after that. Which remote commits get added for context, and why? The answer is in the help output for jj log’s -r/--revisions option:

> Which revisions to show. Defaults to the ui.default-revset setting, or @ | ancestors(immutable_heads().., 2) | heads(immutable_heads()) if it is not set

I will come back to this revset in a moment to explain it in detail. First, though, this shows a couple other interesting features of Jujutsu’s approach to revsets and thus the log command. First, it treats some of these operations as functions (ancestors(), immutable_heads(), etc.). There is a whole list of these functions! This is not a surprise if you think about what “expressions in a functional language” implies… but it was a surprise to me because I had not yet read that bit of documentation. Second, it makes “operators” a first-class idea. Git has operators, but this goes a fair bit further:

- It includes - for the parent and + for a child, and these stack and compose, so writing @-+-+ is the same as @ as long as the history is linear. (That is an important distinction!)
- It supports union |, intersection &, and difference ~ operators.
- A leading :: means “ancestors”, and a trailing :: means “descendants”. Using :: between commits gives a view of the directed acyclic graph range between two commits. Notably, <id1>::<id2> is just <id1>:: & ::<id2>.
- There is also a .. operator, which also composes appropriately (and, smartly, is the same as .. in Git when used between two commits, <id1>..<id2>). The trailing version, <id>.., is interesting: it is “revisions that are not ancestors of <id>”. Likewise, the leading version ..<id> is all revisions which are ancestors of <id>.

Now, I used <id> here, but throughout these actually operate on revsets, so you could use them with any revset. For example, ..tags() will give you the ancestors of all tags. This strikes me as extremely interesting: I think it will dodge a lot of pain in dealing with Git histories, because it lets you ask questions about the history in a compositional way using normal set logic. To make that concrete: back in October, Jujutsu contributor @aseipp pointed out how easy it is to use this to get a log which excludes gh-pages. (Anyone who has worked on a repo with a gh-pages branch knows how annoying it is to have it cluttering up your view of the rest of your Git history!) First, you define an alias for the revset that only includes the gh-pages branch: 'gh-pages' = 'remote_branches(exact:"gh-pages")'. Then you can exclude it from other queries with the ~ negation operator: jj log -r "all() ~ ancestors(gh-pages)" would give you a log view for every revision with all() and then exclude every ancestor of the gh-pages branch.

Jujutsu also provides a really capable templating system, which uses “a functional language to customize output of commands”. That functional language is built on top of the functional language that the whole tool uses for describing revisions (described in brief above!), so you can use the same kinds of operators in templates for output as you do for navigating and manipulating the repository. The template format is still evolving, but you can use it to customize the output today… while being aware that you may have to update it in the future. Keywords include things like description and change_id, and these can be customized in Jujutsu’s config.
For example, I made this tweak to mine, overriding the built-in format_short_id alias:

```
[template-aliases]
'format_short_id(id)' = 'id.shortest()'
```

This gives me super short names for changes and commits, which makes for a much nicer experience when reading and working with both in the log output: Jujutsu will give me the shortest unique identifier for a given change or commit, which I can then use with commands like jj new. Additionally, there are a number of built-in templates. For example, to see the equivalent of Git’s log --pretty you can use Jujutsu’s log -T builtin_log_detailed (-T for “template”; you can also use the long form --template). You can define your own templates in a [templates] section, or add your own [template-aliases] block, using the template language and any combination of further functions you define yourself.

That’s all well and good, but even with reading the docs for the revset language and the templating language, it still took me a bit to actually quite make sense out of the default output, much less to get a handle on how to customize the output. Right now, the docs have a bit of a flavor of explanations for people who already have a pretty good handle on version control systems, and the description of what you get from jj log is a good example of that. As the project gains momentum, it will need other kinds of more-introductory material, but the current status is totally fair and reasonable for the stage the project is at. And, to be fair to Jujutsu, both the revset language and the templating language are far easier to understand and work with than the corresponding Git materials.

Returning to the difference between the default output from jj log and git log, the key is that unless you pass -r, Jujutsu uses the ui.default-revset selector to provide a much more informative view than git log does. Again, the default is @ | ancestors(immutable_heads().., 2) | heads(immutable_heads()). Walking through that:

- The @ operator selects the current head revision.
- The | union operator says “or this other revset”, so this will show @ itself and the result of the other two queries.
- The immutable_heads() function gets the list of head revisions which are, well, immutable. By default, this is trunk() | tags(), so whatever the trunk branch is (most commonly main or master) and also any tags in the repository.
- Adding .. to the first immutable_heads() function selects revisions which are not ancestors of those immutable heads. This is basically asking for branches which are not the trunk and which do not end at a tag.
- Then ancestors(immutable_heads().., 2) requests the ancestors of those branches, but only two deep.
- Finally, heads() gets the tips of all branches which appear in the revset passed to it: a head is a commit with no children. Thus, heads(immutable_heads()) gets just the branch tips for the list of revisions computed by immutable_heads().5

When you put those all together, your log view will always show your current head change, all the open branches which have not been merged into your trunk branch, and whatever you have configured to be immutable — out of the box, trunk and all tags. That is vastly more informative than git log’s default output, even if it is a bit surprising the first time you see it. Nor is it particularly possible to get that in a single git log command. By contrast, getting the equivalent of git log is trivial.

To show the full history for a given change, you can use the :: ancestors operator.
Since jj log always gives you the identifier for a revision, you can follow it up with jj log --revision ::<change id>, or jj log -r ::<change id> for short. For example, in one repo where I am trying this, the most recent commit identifier starts with mwoq (Jujutsu helpfully highlights the segment of the change identifier you need to use), so I could write jj log -r ::mwoq, and this will show all the ancestors of mwoq, or jj log -r ..mwoq to get all the ancestors of the commit except the root. (The root is uninteresting.) Net, the equivalent command for “show me all the history for this commit” is:

```
$ jj log -r ..@
```

Revsets are very powerful, very flexible, and yet much easier to use than Git’s operators. That is in part because of the language used to express them. It is also in part because revsets build on a fundamentally different view of the world than Git commits: Jujutsu’s idea of changes.

# Changes

In Git, as in Subversion and Mercurial and other version control systems before them, when you finish with a change, you commit it. In Jujutsu, there is no first-class notion of “committing” code. This took me a fair bit to wrap my head around! Instead, Jujutsu has two discrete operations: describe and new. jj describe lets you provide a descriptive message for any change. jj new starts a new change. You can think of git commit --message "something I did" as being equivalent to jj describe --message "something I did" && jj new. This falls out of the fact that jj describe and jj new are orthogonal, and much more capable than git commit as a result.

The describe command works on any commit. It defaults to the commit that is the current working copy. If you want to rewrite a message earlier in your commit history, though, that is not a special operation like it is in Git, where you have to perform an interactive rebase to do it. You just call jj describe with a --revision (or -r for short, as everywhere in Jujutsu) argument. For example:

```
# long version
$ jj describe --revision abcd --message "An updated message."

# short version
$ jj describe -r abcd -m "An updated message."
```

That’s it. How you choose to integrate that into your workflow is a matter for you and your team to decide, of course. Jujutsu understands that some branches should not have their history rewritten this way, though, and lets you specify what the “immutable heads” revset should be accordingly. This actually makes it safer than Git, where the tool itself does not understand that kind of immutability and we rely on forges to protect certain branches from being targeted by a force push.

The new command is the core of creating any new change, and it does not require there to be only a single parent. You can create a new change with as many parents as is appropriate! Is a given change logically the child of four other changes, with identifiers a, b, c, and d? jj new a b c d. That’s it. One neat consequence that falls out of this: a merge in Jujutsu is just jj new with the requirement that it have at least two parents. (“At least two parents” because having multiple parents for a merge is not a special case as with Git’s “octopus” merges.) Likewise, you do not need a commit command, because you can describe a given change at any time with describe, and you can create a new change at any time with new.
If you already know the next thing you are going to do, you can even describe it by passing -m/--message to new when creating the new change!6

[Image: A demo of using jj new to create a three-parent merge]

Most of the time with Git, I am doing one of two things when I go to commit a change:

1. Committing everything that is in my working copy: git commit --all7 is an extremely common operation for me.
2. Committing a subset of it, not by using Git’s -p to do it via that atrocious interface, but instead opening Fork and doing it with Fork’s staging UI.

In the first case, Jujutsu’s choice to skip Git’s “index” looks like a very good one. In the second case, I was initially skeptical. Once I got the hang of working this way, though, I started to come around. My workflow with Fork looks an awful lot like the workflow that Jujutsu pushes you toward with actually using a diff tool. With Jujutsu, though, any diff tool can work. Want to use Vim? Go for it.

What is more, Jujutsu’s approach to the working copy results in a really interesting shift. In every version control system I have worked with previously (including CVS, PVCS, SVN), the workflow has been some variation on:

- Make a bunch of changes.
- Create a commit and write a message to describe it.

With both Mercurial and Git, it also became possible to rewrite history in various ways. I use Git’s rebase --interactive command extensively when working on large sets of changes. (I did the same with Mercurial’s history rewriting when I was using it a decade ago.) That expanded the list of common operations to include two more:

- Possibly directly amend that set of changes and/or its description.
- Possibly restructure history: breaking apart changes, reordering them, rewriting their message, changing what commit they land on top of, and more.

Jujutsu flips all of that on its head. A change, not a commit, is the fundamental element of the mental and working model. That means that you can describe a change that is still “in progress” as it were. I discovered this while working on a little example code for a blog post I plan to publish later this month: you can describe the change you are working on and then keep working on it. The act of describing the change is distinct from the act of “committing” and thus starting a new change. This falls out naturally from the fact that the working copy state is something you can operate on directly: akin to Git’s index, but without its many pitfalls. (This simplification affects a lot of things, as I will discuss further below; but it is especially important for new learners. Getting my head around the index was one of those things I found quite challenging initially with Git a decade ago.)

When you are ready to start a new change, you use either jj commit to “finalize” this commit with a message, or jj new to “Create a new, empty change and edit it in the working copy”. Implied: jj commit is just a convenience for jj describe followed by jj new, as the sketch below shows.
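To make that equivalence concrete, here is a minimal sketch of the two flows (the message text is hypothetical):

```
# One step:
$ jj commit -m "Add the new welcome screen"

# Equivalent two steps:
$ jj describe -m "Add the new welcome screen"
$ jj new
```

The decomposed form is the interesting one: describe can target any revision and new can start a change anywhere, so “committing” stops being a special moment and becomes two small, orthogonal operations.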
And a bonus: this means that rewording a message earlier in history does not involve some kind of rebase operation; you just jj describe --revision <target>.

What is more, jj new lets you create a new commit anywhere in the history of your project, trivially:

```
-A, --insert-after
        Insert the new change between the target commit(s) and their children
        [aliases: after]

-B, --insert-before
        Insert the new change between the target commit(s) and their parents
        [aliases: before]
```

You can do this using interactive rebasing with Git (or with history rewriting with Mercurial, though I am afraid my hg is rusty enough that I do not remember the details). What you cannot do in Git specifically is say “Start a new change at point x” unless you are in the middle of a rebase operation, which makes it inherently somewhat fragile. To be extra clear: Git allows you to check out and make a new change at any point in your graph, but it creates a branch at that point, and none of the descendants of that original point in your commit graph will come along without explicitly rebasing. Moreover, even once you do an explicit rebase and cherry-pick in the commit, the original commit is still hanging out, so you likely need to delete that branch. With jj new -A <some change ID>, you just insert the change directly into the history. Jujutsu will rebase every child in the history, including any merges if necessary; it “just works”. That does not guarantee you will not have conflicts, of course, but Jujutsu also handles conflicts better — way better — than Git. More on that below.

I never use git reflog so much as when doing interactive rebases. Once I got the hang of Jujutsu’s ability to jj new anywhere, it basically obviates most of the places I have needed Git’s interactive rebase mode, especially when combined with Jujutsu’s aforementioned support for “first-class conflicts”. There is still an escape hatch for mistakes, though: jj op log shows all the operations you have performed on the repo — and frankly, it is much more useful and powerful than git reflog, because it logs all the operations, including whenever Jujutsu updates its view of your working copy via jj status, and when it fetches new revisions from a remote.

Additionally, Jujutsu allows you to see how any change has evolved over time. This handily solves multiple pain points in Git. For example, if you have made changes in your working copy, and would like to split it into multiple changes, Git only has a binary state to let you tease those apart: staged, or not. As a result, that kind of operation ranges in difficulty from merely painful to outright impossible. With its obslog command,8 Jujutsu allows you to see how a change has evolved over time. Since the working copy is just one more kind of “change”, you can very easily retrieve earlier state — any time you did a jj status check, or ran any other command which snapshotted the state of the repository (which is most of them). That applies equally to earlier changes. If you just rebased, for example, and realize you moved some changes to code into the wrong revision, you can use the combination of obslog and new and restore (or move) to pull it back apart into the desired sequence of changes. (This one is hard to describe, so I may put up a video of it later!)

# Split

This also leads to another significant difference with Git: around breaking up your current set of changes on disk. As I noted above, Jujutsu treats the working copy itself as a commit instead of having an “index” like Git.
Git really only lets you break apart a set of changes with the index, using git add --patch. Jujutsu instead has a split command, which launches a diff editor and lets you select what you want to incorporate — rather like git add --patch does. As with all of its commands, though, jj split works exactly the same way on any commit; the working copy commit gets it “for free”.

Philosophically, I really like this. Practically, though, it is a slightly bumpier experience for me than the Git approach at the moment. Recall that I do not use git add --patch directly. Instead, I always stage changes into the Git index using a graphical tool like Fork. That workflow is slightly nicer than editing a diff — at least, as Jujutsu does it today. In Fork (and similar tools), you start with no changes and add what you want to the change set you want. By contrast, jj split launches a diff view with all the changes from a given commit present: splitting the commit involves removing changes from the right side of the diff so that it has only the changes you want to be present in the first of two new commits; whatever is not present in the final version of the right side when you close your diff editor ends up in the second commit.

If this sounds a little complicated, that is because it is — at least for today. That qualifier is important, because a lot of this is down to tooling, and we have about as much dedicated tooling for Jujutsu as Git had in 2007, which is to say: not much. Qualifier notwithstanding, and philosophical elegance notwithstanding, the complexity is still real here in early 2024. There are two big downsides as things stand. First, I find it comes with more cognitive load. It requires thinking in terms of negation rather than addition, and the “second commit” becomes less and less visible over time as you remove it from the first commit. Second, it requires you to repeat the operation when breaking up something into more than two commits. I semi-regularly take a single bucket of changes on disk and chunk it up into many more than just two commits, though! That significantly multiplies the cognitive overhead.

Now, since I started working with Jujutsu, the team has switched the default view for working with these kinds of diffs to scm-diff-editor, a TUI which has a first-class notion of this kind of workflow.9 That TUI works reasonably well, but is much less pleasant to use than something like the nice GUIs of Fork or Tower.

The net is: when I want to break apart changes, at least for the moment I find myself quite tempted to go back to Fork and Git’s index. I do not think this problem is intractable, and I think the idea of jj split is right. It just — “just”! — needs some careful design work. Preferably, the split command would make it straightforward to generate an arbitrary number of commits from one initial commit, and it would allow progressive creation of each commit from a “vs. the previous commit” baseline. This is the upside of the index in Git: it does actually reflect the reality that there are three separate “buckets” in view when splitting apart a change: the baseline before all changes, the set of all the changes, and the set you want to include in the commit. Existing diff tools do not really handle this — other than the integrated index-aware diff tools in Git clients, which then have their own oddities when interacting with Jujutsu, since it ignores the index.

# First-class conflicts

Another huge feature of Jujutsu is its support for first-class conflicts.
Instead of a conflict resulting in a nightmare that has to be resolved before you can move on, Jujutsu can incorporate both the merge and its resolution (whether manual or automatic) directly into commit history. Just having the conflicts in history does not seem that weird. “Okay, you committed the text conflict markers from git, neat.” But: having the conflict and its resolution in history, especially when Jujutsu figured out how to do that resolution for you, as part of a rebase operation? That is just plain wild.

A while back, I was working on a change to a library I maintain10 and decided to flip the order in which I landed two changes to package.json. Unfortunately, those changes were adjacent to each other in the file, and so flipping the order they would land in seemed likely to be painfully difficult. It was actually trivial. First of all, the flow itself was great: instead of launching an editor for interactive rebase, I just explicitly told Jujutsu to do the rebases: jj rebase --revision <source> --destination <target>. I did that for each of the items I wanted to reorder and I was done. (I could also have rebased a whole series of commits; I just did not need to in this case.) Literally, that was it: Jujutsu agreed with me that JSON is a terrible format for changes like this, committed a merge conflict, resolved that conflict via the next rebase command, and simply carried on. (A sketch of that reordering flow appears at the end of this section.)

At a mechanical level, Jujutsu will add conflict markers to a file, not unlike those Git adds in merge conflicts. However, unlike Git, those are not just markers in a file. They are part of a system which understands what conflicts are semantically, and therefore also what resolving a conflict is semantically. This not only produces nice automatic outcomes like the one I described with my library above; it also means that you have more options for how to accomplish a resolution, and for how to treat a conflict. Git trains you to see a conflict between two branches as a problem. It requires you to solve that problem before moving on. Jujutsu allows you to treat a conflict as a problem which must be resolved, but it does not require it. Resolving conflicts in merges in Git is often quite messy. It is even worse when rebasing. I have spent an incredible amount of time attempting merges only to give up and git reset --hard <before the merge>, and possibly even more time trying to resolve a conflict in a rebase only to bail with git rebase --abort. Jujutsu allows you to create a merge, leave the conflict in place, and then introduce a resolution in the next commit, telling the whole story with your change history.

[Image: Conflict resolution with merges]

Likewise with a rebase: depending on whether you require all your intermediate revisions to be able to be built or would rather show a history including conflicts, you could choose to rebase, leave all the intermediate changes conflicted, and resolve it only at the end.

[Image: Conflict resolution with rebases]

Conflicts are inevitable when you have enough people working on a repository. Honestly: conflicts happen even when I am working alone in a repository, as suggested by my anecdote above.
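Here is a rough sketch of that reordering flow, under the same assumptions as the anecdote above: two adjacent changes whose order you want to flip. The change IDs a and b are hypothetical stand-ins for whatever jj log shows you.

```
# History: main … <- a <- b, and we want b to land before a.
$ jj rebase --revision b --destination a-   # move b onto a's parent
$ jj rebase --revision a --destination b    # move a on top of b
# If the two changes touch adjacent lines, the intermediate state may be a
# conflict; Jujutsu records it and carries the resolution through the next
# rebase rather than stopping the world.
```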
Having this ability to keep working with the repository even in a conflicted state, as well as to resolve the conflicts in a more interactive and iterative way, is something I now find difficult to live without.

# Changing changes

There are a few other niceties which fall out of Jujutsu’s distinction between changes and commits, especially when combined with first-class conflicts.

First up, jj squash takes all the changes in a given commit and, well, squashes them into the parent of that commit.11 Given a working copy with a bunch of changes, you can move them straight into the parent by just typing jj squash. If you want to squash some change besides the one you are currently editing, you just pass the -r/--revision flag, as with most Jujutsu commands: jj squash -r abc will squash the change identified by abc into its parent. You can also use the --interactive (-i for short) argument to move just a part of a change into its parent. Using that flag will pop up your configured diff editor just like jj split will and allow you to select which items you want to move into the parent and which you want to keep separate. Or, for an even faster option, if you have specific files to move while leaving others alone, and you do not need to handle subsections of those files, you can pass them as the final arguments to the command, like jj squash ./path/a ./path/c.

As it turns out, this ability to move part of one change into a different change is a really useful thing to be able to do in general. I find it particularly handy when building up a set of changes where I want each one to be coherent — say, for the sake of having a commit history which is easy for others to review. You could do that by doing some combination of jj split and jj new --after <some change ID> and then doing jj rebase to move around the changes… but as usual, Jujutsu has a better way. The squash command is actually just a shortcut for Jujutsu’s move command with some arguments filled in. The move command has --from and --to arguments which let you specify which revisions you want to move between. When you run jj squash with no other arguments, that is the equivalent of jj move --from @ --to @-. When you run jj squash -r abc, that is the equivalent of jj move --from abc --to abc-. Since it takes those arguments explicitly, though, move lets you move changes around between any changes. They do not need to be anywhere near each other in history.

[Image: A demo of using jj move]

This eliminates another entire category of places I have historically had to reach for git rebase --interactive. While there are still a few times where I think Jujutsu could use something akin to Git’s interactive rebase mode, they are legitimately few, and mostly to do with wanting to be able to do batch reordering of commits. To be fair, though, I only want to do that perhaps a few times a year.

# Branches

Branches are another of the very significant differences between Jujutsu and Git — another place where Jujutsu acts a bit more like Mercurial, in fact. In Git, everything happens on named branches. You can operate on anonymous branches in Git, but it will yell at you constantly about being on a “detached HEAD”. Jujutsu inverts this. The normal working mode in Jujutsu is just to make a series of changes, which then naturally form “branches” in the change graph, but which do not require a name out of the gate. You can give a branch a name any time, using jj branch create.
That name is just a pointer to the change you pointed it at, though; it does not automatically “follow” you as you do jj new to create new changes. (Readers familiar with Mercurial may recognize that this is very similar to its bookmarks, though without the notion of “active” and “inactive” bookmarks.)

To update what a branch name points to, you use the branch set command. To completely get rid of a branch, including removing it from any remotes you have pushed the branch to, you use the branch delete command. Handily, if you want to forget all your local branch operations (though not the changes they apply to), you can use the branch forget command. That can come in useful when your local copy of a branch has diverged from what is on the remote, and you don’t want to reconcile the changes but just want to get back to whatever is on the remote for that branch. No need for git reset --hard origin/<branch name>; just jj branch forget <branch name> and then the next time you pull from the remote, you will get back its view of the branch!

(It’s not just me who wants this!)

Jujutsu’s defaulting to anonymous branches took me a bit to get used to, after a decade of doing all of my work in Git and of necessity having to do my work on named branches. As with so many things about Jujutsu, though, I have very much come to appreciate this default. In particular, I find this approach makes really good sense for all the steps where I am not yet sharing a set of changes with others. Even once I am sharing the changes with others, Git’s requirement of a branch name can start to feel kind of silly at times. Especially for the case where I am making some small and self-contained change, the name of a given branch is often just some short, snake-case-ified version of the commit message. The default log template shows me the current set of branches, and their commit messages are usually sufficiently informative that I do not need anything else.

However, there are some downsides to this approach in practice, at least given today’s ecosystem. First, the lack of a “current branch” makes for some extra friction when working with tools like GitHub, GitLab, Gitea, and so on. The GitHub model (which other tools have copied) treats branches as the basis for all work. GitHub displays warning messages about commits which are not on a branch, and will not allow you to create a pull request from an anonymous branch. In many ways, this is simply because Git itself treats branches as special and important. GitHub is just following Git’s example of loud warnings about being on a “detached HEAD” commit, after all.

What this means in practice, though, is that there is an extra operation required any time you want to push your changes to GitHub or a similar forge. With Git, you simply git push after making your changes. (More on Git interop below.) Since Git keeps the current branch pointing at the current HEAD, Git aliases git push with no arguments to git push <configured remote for current branch> <current branch>. Jujutsu does not do this, and given how its branching model works today, cannot do this, because named branches do not “follow” your operations. Instead, you must first explicitly set the branch to the commit you want to push. In the most common case, where you are pushing your latest set of changes, that is just jj branch set <branch name>; it takes the current change automatically. Only then can you run jj git push to actually get an update. This is only a paper cut, but it is a paper cut.
It is one extra command every single time you go to push a change to share with others, or even just to get it off of your machine.12 That might not seem like a lot, but it adds up.

There is a real tension in the design space here, though. On the one hand, the main time I use branches in Jujutsu at this point is for pushing to a Git forge like GitHub. I rarely feel the need for them for just working on a set of changes, where jj log and jj new <some revision> give me everything I need. In that sense, it seems like having the branch “follow along” with my work would be natural: if I have gone to the trouble of creating a name for a branch and pushing it to some remote, then it is very likely I want to keep it up to date as I add changes to the branch I named. On the other hand, there is a big upside to not doing that automatically: pushing changes becomes an intentional act. I cannot count the number of times I have been working on what is essentially just an experiment in a Git repo, forgotten to change from the foo-feature branch to a new foo-feature-experiment branch, and then done a git push. Especially if I am collaborating with others on foo-feature, now I have to force push the branch back to its previous state to reset things, and let others know to wait for that, etc. That never happens with the Jujutsu model. Since updating a named branch is always an intentional act, you can experiment to your heart’s content, and know you will never accidentally push changes to a branch that way. I go back and forth: Maybe the little bit of extra friction when you do want to push a branch is worth it for all the times you do not have to consciously move a branch backwards to avoid pushing changes you are not yet ready to share.

(As you might expect, the default of anonymous branches has some knock-on effects for how it interacts with Git tooling in general; I say more on this below.)

Jujutsu also has a handy little feature for when you have done a bunch of work on an anonymous branch and are ready to push it to a Git forge. The jj git push subcommand takes an optional --change/-c flag, which creates a branch based on your current change ID. It works really well when you only have a single change you are going to push and then continually work on, or any time you are content that your current change will remain the tip of the branch. It works a little less well when you are going to add further changes later, because you need to then actually use the branch name with jj branch set push/<change ID> -r <revision>. (A sketch of the overall push flow appears at the end of this section.)

Taking a step back, though, working with branches in Jujutsu is great overall. The branch command is a particularly good lens for seeing what a well-designed CLI is like and how it can make your work easier. Notice that the various commands there are all of the form jj branch <do something>. There are a handful of other branch subcommands not mentioned so far: list, rename, track, and untrack. Git has slowly improved its design here over the past few years, but still lacks the straightforward coherence of Jujutsu’s design. For one thing, all of these are subcommands in Jujutsu, not like Git’s mishmash of flags which can be combined in some cases but not others, and have different meanings depending on where they are deployed. For another, as with the rest of Jujutsu’s CLI structure, they use the same options to mean the same things. If you want to list all the branches which point to a given set of revisions, you use the -r/--revisions flag, exactly like you do with any other command involving revisions in Jujutsu.
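Here is the push flow sketched as a shell session, based on the commands described above (the branch name is hypothetical):

```
# Share the current change on an existing branch:
$ jj branch set my-feature   # point my-feature at the current change
$ jj git push                # push it to the configured Git remote

# Or create and push a generated branch from the current change ID:
$ jj git push --change @
```

The extra jj branch set step is exactly the “paper cut” described above: pushing is always an intentional, two-step act.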
In general, Jujutsu has a very strong and careful distinction between commands (including subcommands) and options. Git does not. The track and untrack subcommands are a perfect example. In Jujutsu, you track a remote branch by running a command like jj branch track <branch>@<remote>. The corresponding Git command is git branch --set-upstream-to <remote>/<branch>. But to list and filter branches in Git, you also pass flags, e.g. git branch --all is the equivalent of jj branch list --all. The Git one is shorter, but also notably less coherent; there is no way to build a mental model for it. With Jujutsu, the mental model is obvious and consistent: jj <command> <options> or jj <context> <command> <options>, where <context> is something like branch or workspace or op (for operation).

# Git interop

Jujutsu’s native backend exists, and every feature has to work with it, so it will some day be a real feature of the VCS. Today, though, the Git backend is the only one you should use. So much so that if you try to run jj init without passing --git, Jujutsu won’t let you by default:

```
> jj init
Error: The native backend is disallowed by default.
Hint: Did you mean to pass `--git`?
Set `ui.allow-init-native` to allow initializing a repo with the native backend.
```

In practice, you are going to be using the Git backend. In practice, I have been using the Git backend for the last seven months, full time, on every one of my personal repositories and all the open source projects I have contributed to. With the sole exception of someone watching me while we pair, no one has noticed, because the Git integration is that solid and robust. This interop means that adoption can be very low friction. Any individual can simply run jj git init --git-repo . in a given Git repository, and start doing their work with Jujutsu instead of Git, and all that work gets translated directly into operations on the Git repository.

Interoperating with Git also means that there is a two-way street between Jujutsu and Git. You can do a bunch of work with jj commands, and then if you hit something you don’t know how to do with Jujutsu yet, you can flip over and do it the way you already know with a git command. When you next run a jj command, like jj status, it will (very quickly!) import the updates from Git and go back about its normal business. The same thing happens when you run commands like jj git fetch to get the latest updates from a Git remote. All the explicit Git interop commands live under a git subcommand: jj git push, jj git fetch, etc. There are a handful of these, including the ability to explicitly ask to synchronize with the Git repository, but the only ones I use on a day to day basis are jj git push and jj git fetch. Notably, there is no jj git pull, because Jujutsu keeps a distinction between getting the latest changes from the server and changing your local copy’s state. I have not missed git pull at all.

This clean interop does not mean that Git sees everything Jujutsu sees, though. Initializing a Jujutsu repo adds a .jj directory to your project, which is where it stores its extra metadata. This, for example, is where Jujutsu keeps track of its own representation of changes, including how any given change has evolved, in terms of the underlying revisions.
In the case of a Git repository, those revisions just are the Git commits, and although you rarely need to work with or name them directly, they have the same SHAs, so any time you would name a specific Git commit, you can reference it directly as a Jujutsu revision as well. (This is particularly handy when bouncing between jj commands and Git-aware tools which know nothing of Jujutsu’s change identifiers.) The .jj directory also includes the operation log, and in the case of a fresh Jujutsu repo (not one created from an existing Git repository), is where the backing Git repo lives.

This Git integration currently runs on libgit2, so there is effectively no risk of breaking your repo because of a Jujutsu–Git interop issue. To be sure, there can be bugs in Jujutsu itself, and you can do things using Jujutsu that will leave you in a bit of a mess, but the same is true of any tool which works on your Git repository. The risk might be very slightly higher here than with your average GUI Git client, since Jujutsu is mapping different semantics onto the repository, but I have extremely high confidence in the project at this point, and I think you can too.

# Is it ready?

Unsurprisingly, given the scale of the problem domain, there are still some rough edges and gaps. For example: commit signing with GPG or SSH does not yet work. There is an open PR for the basics of the feature with GPG support, and SSH support will be straightforward to add once the basics land, but landed it has not.13 The list of actual gaps or missing features is getting short, though. When I started using Jujutsu back in July 2023, there was not yet any support for sparse checkouts or for workspaces (analogous to Git worktrees). Both of those landed in the interval, and there is consistent forward motion from both Google and non-Google contributors. In fact, the biggest gap I see as a regular user in Jujutsu itself is the lack of the kinds of capabilities that will hopefully come once work starts in earnest on the native backend.

The real gaps and rough edges at this point are down to the lack of an ecosystem of tools around Jujutsu, and the ways that existing Git tools interact with Jujutsu’s design for Git interop. The lack of tooling is obvious: no one has built the equivalent of Fork or Tower, and there is no native integration in IDEs like IntelliJ or Visual Studio or in editors like VS Code or Vim. Since Jujutsu currently works primarily in terms of Git, you will get some useful feedback. All of those tools expect to be working in terms of Git’s index and not in terms of a Jujutsu-style working copy, though. Moreover, most of them (unsurprisingly!) share Git’s own confusion about why you are working on a detached HEAD nearly all the time. On the upside, viewing the history of a repo generally works well, with the exception that some tools will not show anonymous branches/detached HEADs other than one you have actively checked out. Detached heads also tend to confuse tools like GitHub’s gh; you will often need to do a bit of extra manual argument-passing to get them to work. (gh pr create --web --head <name> has been showing up in my history a lot for exactly this reason.)

Some of Jujutsu’s very nice features also make other parts of working on mainstream Git forges a bit wonky. For example, notice what each of these operations has in common:

- Inserting changes at arbitrary points.
- Rewording a change description.
- Rebasing a series of changes.
- Splitting apart commits.
- Combining existing commits.

They are all changes to history.
Is it ready?

Unsurprisingly, given the scale of the problem domain, there are still some rough edges and gaps. For example: commit signing with GPG or SSH does not yet work. There is an open PR for the basics of the feature with GPG support, and SSH support will be straightforward to add once the basics land, but landed it has not.13 The list of actual gaps or missing features is getting short, though. When I started using Jujutsu back in July 2023, there was not yet any support for sparse checkouts or for workspaces (analogous to Git worktrees). Both of those landed in the interval, and there is consistent forward motion from both Google and non-Google contributors. In fact, the biggest gap I see as a regular user in Jujutsu itself is the lack of the kinds of capabilities that will hopefully come once work starts in earnest on the native backend.

The real gaps and rough edges at this point are down to the lack of an ecosystem of tools around Jujutsu, and the ways that existing Git tools interact with Jujutsu’s design for Git interop. The lack of tooling is obvious: no one has built the equivalent of Fork or Tower, and there is no native integration in IDEs like IntelliJ or Visual Studio or in editors like VS Code or Vim. Since Jujutsu currently works primarily in terms of Git, you do get some useful feedback from those tools. All of them expect to be working in terms of Git’s index and not in terms of a Jujutsu-style working copy, though. Moreover, most of them (unsurprisingly!) share Git’s own confusion about why you are working on a detached HEAD nearly all the time. On the upside, viewing the history of a repo generally works well, with the exception that some tools will not show anonymous branches/detached HEADs other than the one you have actively checked out. Detached heads also tend to confuse tools like GitHub’s gh; you will often need to do a bit of extra manual argument-passing to get them to work. (gh pr create --web --head <name> has been showing up in my history a lot for exactly this reason.)

Some of Jujutsu’s very nice features also make other parts of working on mainstream Git forges a bit wonky. For example, notice what each of these operations has in common:

- Inserting changes at arbitrary points.
- Rewording a change description.
- Rebasing a series of changes.
- Splitting apart commits.
- Combining existing commits.

They are all changes to history. If you have pushed a branch to a remote, doing any of these operations with changes on that branch and pushing to the remote again will be a force push. Most mainstream Git forges handle force pushing pretty badly. In particular, GitHub has some support for showing diffs between force pushes, but it is very basic and loses all conversational context. As a result, any workflow which makes heavy use of force pushes will be bumpy. Jujutsu is not to blame for the gaps in those tools, but it certainly does expose them.14 Nor do I blame GitHub for the quirks in interop, though. It is not JujutsuLab, after all, and Jujutsu is doing things which do not perfectly map onto the Git model. Since most open source software development happens on forges like GitHub and GitLab, though, these things do regularly come up and cause some friction.

The biggest place I feel this today is in the lack of tools designed to work with Jujutsu around splitting, moving, and otherwise interactively editing changes. Other than @arxanas’ excellent scm-diff-editor (the TUI which Jujutsu bundles for editing diffs on the command line), there are zero good tools for those operations. I mean it when I say scm-diff-editor is excellent, but I also do not love working in a TUI for this kind of thing, so I have cajoled both Kaleidoscope and BBEdit into working to some degree. As I noted when describing how jj split works, though, it is not a particularly good experience. These tools are simply not designed for this workflow. They understand an index, and they do not understand splitting apart changes. Net, we are going to want new tooling which actually understands Jujutsu.

There are opportunities here beyond implementing the same kinds of capabilities that many editors, IDEs, and dedicated VCS viewers provide today for Git. Given a tool in which rebasing, merging, re-describing changes, and so on are all normal and easy operations, GUI tools could make all of those much easier. Any number of the Git GUIs have tried, but Git’s underlying model simply makes it clunky. That does not have to be the case with Jujutsu. Likewise, surfacing things like Jujutsu’s operation and change evolution logs should be much easier than surfacing the Git reflog, and should provide easier ways to recover lost work or simply to change one’s mind.

Conclusion

Jujutsu has become my version control tool of choice since I picked it up over the summer. The rough edges and gaps I described throughout this write-up notwithstanding, I much prefer it to working with Git directly. I do not hesitate to recommend that you try it out on personal or open source projects. Indeed, I actively recommend it! I have used Jujutsu almost exclusively for the past seven months, and I am not sure what would make me go back to using Git other than Jujutsu being abandoned entirely. Given its apparently-bright future at Google, that seems unlikely.15 Moreover, because using it in existing Git repositories is transparent, there is no inherent reason individual developers or teams cannot use it today. (Your corporate security policy might be a different story.)

Is Jujutsu ready for you to roll out at your Fortune 500 company? Probably not. While it is improving at a steady clip — most of the rough edges I hit in mid-2023 are long since fixed — it is still undergoing breaking changes in design here and there, and there is effectively no material out there about how to use it yet. (This essay exists, in part, as an attempt to change that!)
Beyond Jujutsu itself, there is a lot of work to be done to build an ecosystem around it. Most of the remaining rough edges are squarely to do with the lack of understanding from other tools. The project is marching steadily toward a 1.0 release… someday. As for when that might be, there are as far as I know no plans: there is still too much to do. Above all, I am very eager to see what a native Jujutsu backend would look like. Today, it is “just” a much better model for working with Git repos. A world where the same level of smarts being applied to the front end goes into the backend too is a world well worth looking forward to.

Appendix: Kaleidoscope setup and tips

As alluded to above, I have done my best to make it possible to use Kaleidoscope, my beloved diff-and-merge tool, with Jujutsu. I have had only mixed success. The appropriate setup that gives the best results so far:

Add the following to your Jujutsu config (jj config edit --user) to configure Kaleidoscope for the various diff and merge operations:

    [ui]
    diff-editor = ["ksdiff", "--wait", "$left", "--no-snapshot", "$right", "--no-snapshot"]
    merge-editor = ["ksdiff", "--merge", "--output", "$output", "--base", "$base", "--", "$left", "--snapshot", "$right", "--snapshot"]

I will note, however, that I have still not been 100% successful using Kaleidoscope this way. In particular, jj split does not give me the desired results; it often ends up reporting “Nothing changed” when I close Kaleidoscope.

When opening a file diff, you must Option⎇-double-click, not do a normal double-click, so that it will preserve the --no-snapshot behavior. That --no-snapshot argument to ksdiff is what makes the resulting diff editable, which is what Jujutsu needs for its just-edit-a-diff workflow. I have been in touch with the Kaleidoscope folks about this, which is how I even know about this workaround; they are evaluating whether it is possible to make the normal double-click flow preserve the --no-snapshot in this case so you do not have to do the workaround.

Notes

1. Yes, it is written in Rust, and it is pretty darn fast. But Git is written in C, and is also pretty darn fast. There are of course some safety upsides to using Rust here, but Rust is not particularly core to Jujutsu’s “branding”. It was just a fairly obvious choice for a project like this at this point — which is exactly what I have long hoped Rust would become! ↩︎

2. Pro tip for Mac users: add .DS_Store to your ~/.gitignore_global and live a much less annoyed life — whether using Git or Jujutsu. ↩︎

3. I did have one odd hiccup along the way due to a bug (already fixed, though not in a released version) in how Jujutsu handles a failure when initializing in a directory. While confusing, the problem was fixed in the next release… and this is what I expected of still-relatively-early software. ↩︎

4. The plain jj init command is reserved for initializing with the native backend… which is currently turned off. This is absolutely the right call for now, until the native backend is ready, but it is a mild bit of extra friction (and makes the title of this essay a bit amusing until the native backend comes online…). ↩︎

5. This is not quite the same as Git’s HEAD or as Mercurial’s “tip” — there is only one of either of those, and they are not the same as each other! ↩︎

6. If you look at the jj help output today, you will notice that Jujutsu has checkout, merge, and commit commands.
Each is just an alias for a behavior using new, describe, or both, though:

- checkout is just an alias for new.
- commit is just a shortcut for jj describe -m "<some message>" && jj new.
- merge is just jj new with an implicit @ as the first argument.

All of these are going to go away in the medium term, with both documentation and output from the CLI that teach people to use new instead. ↩︎

7. Actually it is normally git ci -am "<message>", with -a for “all” (--all) and -m for the message, smashed together to avoid any needless extra typing. ↩︎

8. The name is from Mercurial’s evolution feature, where it refers to changes which have become obsolescent, thus obslog is the “obsolescent changes log”. I recently suggested to the Jujutsu maintainers that renaming this might be helpful, because it took me six months of daily use to discover this incredibly helpful tool. ↩︎

9. They also enabled support for a three-pane view in Meld, which allegedly makes it somewhat better. However, Meld is pretty janky on macOS (as GTK apps basically always are), and it has a terrible startup time for reasons that are unclear at this point, which means this was not a great experience in the first place… and Meld crashes on launch on the current version of macOS. ↩︎

10. Yes, this is what I do for fun on my time off. At least: partially. ↩︎

11. For people coming from Git, there is also an amend alias, so you can use jj amend instead, but it does the same thing as squash, and in fact the help text for jj amend makes it clear that it just is squash. ↩︎

12. If that sounds like paranoia, well, you only have to lose everything on your machine once, due to someone spilling a whole cup of water on it at a coffee shop, to learn to be a bit paranoid about having off-machine backups of everything. I git push all the time. ↩︎

13. I care about this feature and have some hopes of helping get it across the line myself here in February 2024, but we will see! ↩︎

14. There are plenty of interesting arguments out there about the GitHub collaboration design, alternatives represented by the Phabricator or Gerrit review models, and so on. This piece is long enough without them! ↩︎

15. Google is famous for killing products, but less so developer tools. ↩︎

Thanks:

Waleed Khan (@arxanas), Joy Reynolds (@joyously), and Isabella Basso (@isinyaaa) all took time to read and comment on earlier drafts of this mammoth essay, and it is substantially better for their feedback!

Posted:

This entry was originally published in Essays on February 2, 2024, and last updated on February 8, 2024 (you can see the full revision history here); it was started on July 1, 2023.

Meaningful changes since creating this page:

- February 8, 2024: Updated to use jj git init instead of plain jj init, to match the 0.14 release.
- February 4, 2024: Filled out the section on Git interop. (How did I miss that before publishing?!?)
- February 3, 2024: Added an example of the log format right up front in that section.
- February 2, 2024: Reworked section on revsets and prepared for publication!
- February 1, 2024: Finished a draft!
  Added one and updated another asciinema for some of the basics, finished up the tooling section, and made a bunch of small edits.
- January 31, 2024: Finished describing first-class conflicts and added asciinema recordings showing conflicts with merges and rebases.
- January 30, 2024: Added a ton of material on branches and on CLI design.
- January 29, 2024: Wrote up a section on “changing changes”, focused on the squash and move commands.
- January 29, 2024: Added a section on obslog, and made sure the text was consistent on use of “Jujutsu” vs. jj for the name of the tool vs. command-line invocations.
- January 18, 2024: Made some further structural revisions, removing some now-defunct copy about the original plan, and substantially expanding the conclusion.
- January 16, 2024: jj init is an essay, and I am rewriting it—not a dev journal, but an essay introduction to the tool.
- November 2, 2023: Added a first pass at a conclusion, and started on the restructuring this needs.
- November 1, 2023: Describing a bit about how jj new -A works and integrates with its story for clean rebases.
- October 31, 2023: Filling in what makes jj interesting, and explaining templates a bit.
- August 7, 2023: A first pass at jj describe and jj new.
- August 7, 2023: YODA! And an introduction to the “Rewiring your Git brain” section.
- August 7, 2023: Adding more structure to the piece, and identifying the next pieces to write.
- July 31, 2023: Starting to do some work on the introduction.
- July 24, 2023: Correcting my description of revision behavior per discussion with the maintainer.
- July 24, 2023: Describing my current feelings about the jj split and auto-committed working copy vs. git add --patch (as mediated by a UI).
- July 13, 2023: Elaborated on the development of version control systems (both personally and in general!)… and added a bunch of <abbr> tags.
- July 12, 2023: Added a section on the experience of having first-class merging Just Work™, added an appendix about Kaleidoscope setup and usage, rewrote the paragraph where I previously mentioned the issues about Kaleidoscope, and iterated on the commit-vs.-change distinction.
- July 9, 2023: Rewrote the jj log section to incorporate info about revsets, rewrote a couple of the existing sections now that I have significantly more experience, and added a bunch of notes to myself about what to tackle next in this write-up.
- July 3, 2023: Wrote up some experience notes on actually using jj describe and jj new: this is pretty wild, and I think I like it?
- July 3, 2023: Reorganized and clarified the existing material a bit.
- July 2, 2023: Added some initial notes about initial setup bumps. And a lot of notes on the things I learned in trying to fix those!
# Document TitleSteno & PLAboutPatch terminologyJan 11, 2024Intended audienceDevelopers of version control systems, specifically jj.Those interested in the version control pedagogy.OriginReprint of research originally published on Google Docs.Surveyed five participants from various "big tech" companies (>$1B valuation).Mood Investigative.MethodologyResultsRemarksConclusionsRelated postsCommentsMethodologyQ: I am doing research for a source control project, can you answer what the following nouns mean to you in the context of source control, if anything? (Ordered alphabetically)a “change”a “commit”a “patch”a “revision”ResultsP1 (Google, uses Piper + CLs):Change: a difference in code. What isn’t committed yet.Commit: a cl. Code that is ready to push and has a description along with itPatch: a commit number that that someone made that may or may not be pushed yet […] A change that’s not yoursRevision: a change to a commit?P2 (big tech, uses GitLab + MRs):Change: added/removed/updated filesCommit: a group of related changes with a descriptionPatch: a textual representation of changes between two versionsRevision: a version of the repository, like the state of all files after a set of commitsP3 (Google, uses Fig + CLs):Change: A change to me is any difference in code. Uncommitted to pushed. I’ve heard people say I’ve pushed the change.Commit: A commit is a saved code diff with a description.Patch: A patch is a diff between any two commits how to turn commit a into into b.Revision: Revisions idk. I think at work they are snapshots of a code base so all changes at a point in time.P4 (Microsoft, uses GitHub + PRs):Change: the entire change I want to check into the codebase this can be multiple commits but it’s what I’m putting up for reviewCommit: a portion of my changePatch: a group of commits or a change I want to use on another repo/branchRevision: An id for a group of commits or a single commitP5 (big tech, uses GitHub + PRs):Change: your update to source files in a repositoryCommit: description of changePatch: I don’t really use this but I would think a quick fix (image, imports, other small changes etc)Revision: some number or set of numbers corresponding to changeRemarksTake-aways:Change: People largely don’t think of a “change” as an physical object, rather just a diff or abstract object.It can potentially range from uncommitted to committed to pushed (P1–P5).Unlike others, P4 thinks of it as a larger unit than a commit (more like a “review”), probably due to the GitHub PR workflow.Commit: Universally, commits are considered to have messages. However, the interpretation of a commit as a snapshot vs diff appears to be implicit (compare P2’s “commit” vs “revision”).Patch: Split between interpretations:Either it represents a diff between two versions of the code (P2, P3).Or it’s a higher-level interpretation of a patch as a transmissible change. Particularly for getting a change from someone else (P1), but can also refer to a change that you want to use on a different branch (P4).P5 merely considers a “patch” to be a “small fix”, which is also a generally accepted meaning, although a little imprecise in terms of source control (refers to the intention of the patch, rather than the mechanics of the patch itself).Revision: This is really interesting. The underlying mental models are very different, but the semantic implications end up aligning, more so than for the term “commit”!P1: Not a specific source control term, just “the effect of revising”.P2, P3: Effect of “applying all commits”. 
    This implies that they consider “commits” as diffs and “revisions” as snapshots.
  - P4, P5: Some notion that it’s specifically the identifier of a change/commit. It’s something that you can reference or send to others.

Surprisingly to me, P2–P5 actually all essentially agree that “revision” means a snapshot of the codebase. The mental models are quite different (“accumulation of diffs” vs “stable identifier”) but they refer to the same ultimate result: a specific state of the codebase (…or a way to refer to it — what’s in a name?). This is essentially the opposite of “commit”, where everyone thinks that they agree on what they are, but they’re actually split — roughly evenly? — into snapshot vs diff mental models.

Conclusions

Conclusions for jj:

- We already knew that “change” is a difficult term, syntactically speaking. It’s also now apparent that it’s semantically unclear. Only P4 thought of it as a “reviewable unit”, which would probably most closely match the jj interpretation. We should switch away from this term.
- People are largely settled on what “commits” are in the ways that we thought.
  - There are two main mental models, where participants appear to implicitly consider them to be either snapshots or diffs, as we know.
  - They have to have messages according to participants (unlike in jj, where a commit/change may not yet have a message).
  - It’s possible this is an artifact of the Git mental model, rather than fundamental. We don’t see a lot of confusion when we tell people “your commits can have empty messages”.
  - I think the real implication is that the set of changes is packaged/finalized into one unit, as opposed to “changes”, which might be in flux or not properly packaged into one unit for publishing/sharing.
- Half of respondents think that “patch” primarily refers to a diff, while half think that it refers to a transmissible change.
  - In my opinion, the “transmissible change” interpretation aligns most closely with jj changes at present. In particular, you put those up for review and people can download them.
  - I also think the “diff” interpretation aligns with the jj interpretation (as you can rebase patches around, and the semantic content of the patch doesn’t change); however, there is a great deal of discussion on Discord suggesting that people think of “patches” as immutable, and this doesn’t match the jj semantics where you can rebase them around (IIUC).
  - Overall, I think “patch” is still the best term we have as a replacement for jj “changes” (unless somebody can propose a better one), and it’s clear that we should move away from “change” as a term.
- “Revision” is much more semantically clear than I thought it was. This means that we can adopt/co-opt the existing term and ascribe the specific “snapshot” meaning that we do today.
  - We already do use “revision” in many places, most notably “revsets”. For consistency, we likely want to standardize on “revision” instead of “commit” as a term.

Related posts

The following are hand-curated posts which you might find interesting.

- 19 Jun 2021: git undo: We can do better
- 12 Oct 2021: Lightning-fast rebases with git-move
- 19 Oct 2022: Build-aware sparse checkouts
- 16 Nov 2022: Bringing revsets to Git
- 05 Jan 2023: Where are my Git UI features from the future?
- 11 Jan 2024: (this post) Patch terminology
# Document Title

Mounting git commits as folders with NFS

• git •

Hello! The other day, I started wondering: has anyone ever made a FUSE filesystem for a git repository where every commit is a folder? It turns out the answer is yes! There’s giblefs, GitMounter, and git9 for Plan 9.

But FUSE is pretty annoying to use on Mac – you need to install a kernel extension, and Mac OS seems to be making it harder and harder to install kernel extensions for security reasons. Also I had a few ideas for how to organize the filesystem differently than those projects.

So I thought it would be fun to experiment with ways to mount filesystems on Mac OS other than FUSE, so I built a project that does that called git-commit-folders. It works (at least on my computer) with both FUSE and NFS, and there’s a broken WebDav implementation too.

It’s pretty experimental (I’m not sure if this is actually a useful piece of software to have or just a fun toy to think about how git works) but it was fun to write and I’ve enjoyed using it myself on small repositories, so here are some of the problems I ran into while writing it.

goal: show how commits are like folders

The main reason I wanted to make this was to give folks some intuition for how git works under the hood. After all, git commits really are very similar to folders – every Git commit contains a directory listing of the files in it, and that directory can have subdirectories, etc.

It’s just that git commits aren’t actually implemented as folders, to save disk space.

So in git-commit-folders, every commit is actually a folder, and if you want to explore your old commits, you can do it just by exploring the filesystem! For example, if I look at the initial commit for my blog, it looks like this:

    $ ls commits/8d/8dc0/8dc0cb0b4b0de3c6f40674198cb2bd44aeee9b86/
    README

and a few commits later, it looks like this:

    $ ls /tmp/git-homepage/commits/c9/c94e/c94e6f531d02e658d96a3b6255bbf424367765e9/
    _config.yml  config.rb  Rakefile  rubypants.rb  source

branches are symlinks

In the filesystem mounted by git-commit-folders, commits are the only real folders – everything else (branches, tags, etc) is a symlink to a commit. This mirrors how git works under the hood.

    $ ls -l branches/
    lr-xr-xr-x 59 bork bazil-fuse -> ../commits/ff/ff56/ff563b089f9d952cd21ac4d68d8f13c94183dcd8
    lr-xr-xr-x 59 bork follow-symlink -> ../commits/7f/7f73/7f73779a8ff79a2a1e21553c6c9cd5d195f33030
    lr-xr-xr-x 59 bork go-mod-branch -> ../commits/91/912d/912da3150d9cfa74523b42fae028bbb320b6804f
    lr-xr-xr-x 59 bork mac-version -> ../commits/30/3008/30082dcd702b59435f71969cf453828f60753e67
    lr-xr-xr-x 59 bork mac-version-debugging -> ../commits/18/18c0/18c0db074ec9b70cb7a28ad9d3f9850082129ce0
    lr-xr-xr-x 59 bork main -> ../commits/04/043e/043e90debbeb0fc6b4e28cf8776e874aa5b6e673

    $ ls -l tags/
    lr-xr-xr-x - bork 31 Dec 1969 test-tag -> ../commits/16/16a3/16a3d776dc163aa8286fb89fde51183ed90c71d0

This definitely doesn’t completely explain how git works (there’s a lot more to it than just “a commit is like a folder!”), but my hope is that it makes the idea that “every commit is like a folder with an old version of your code” feel a little more concrete.
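Because branches are plain symlinks, ordinary shell tools can inspect them; a small sketch using the listings above:

```sh
$ readlink branches/main      # where does main point?
../commits/04/043e/043e90debbeb0fc6b4e28cf8776e874aa5b6e673
$ ls branches/main/           # following the symlink drops you into that commit's tree
```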
why might this be useful?

Before I get into the implementation, I want to talk about why having a filesystem with a folder for every git commit in it might be useful. A lot of my projects I end up never really using at all (like dnspeep), but I did find myself using this project a little bit while I was working on it.

The main uses I’ve found so far are:

- searching for a function I deleted – I can run grep someFunction branch_histories/main/*/commit.go to find an old version of it
- quickly looking at a file on another branch to copy a line from it, like vim branches/other-branch/go.mod
- searching every branch for a function, like grep someFunction branches/*/commit.go

All of these are through symlinks to commits instead of referencing commits directly.

None of these are the most efficient way to do this (you can use git show and git log -S or maybe git grep to accomplish something similar), but personally I always forget the syntax, and navigating a filesystem feels easier to me. git worktree also lets you have multiple branches checked out at the same time, but to me it feels weird to set up an entire worktree just to look at 1 file.

Next I want to talk about some problems I ran into.

problem 1: webdav or NFS?

The two filesystems I could find that were natively supported by Mac OS were WebDav and NFS. I couldn’t tell which would be easier to implement, so I just tried both.

At first webdav seemed easier, and it turns out that golang.org/x/net has a webdav implementation, which was pretty easy to set up.

But that implementation doesn’t support symlinks, I think because it uses the io/fs interface and io/fs doesn’t support symlinks yet. Looks like that’s in progress though. So I gave up on webdav and decided to focus on the NFS implementation, using this go-nfs NFSv3 library.

Someone also mentioned that there’s FileProvider on Mac, but I didn’t look into that.

problem 2: how to keep all the implementations in sync?

I was implementing 3 different filesystems (FUSE, NFS, and WebDav), and it wasn’t clear to me how to avoid a lot of duplicated code.

My friend Dave suggested writing one core implementation and then writing adapters (like fuse2nfs and fuse2dav) to translate it into the NFS and WebDav versions. What this looked like in practice is that I needed to implement 3 filesystem interfaces:

- fs.FS for FUSE
- billy.Filesystem for NFS
- webdav.Filesystem for webdav

So I put all the core logic in the fs.FS interface, and then wrote two functions:

    func Fuse2Dav(fs fs.FS) webdav.FileSystem
    func Fuse2NFS(fs fs.FS) billy.Filesystem

All of the filesystems were kind of similar, so the translation wasn’t too hard, there were just 1 million annoying bugs to fix.

problem 3: I didn’t want to list every commit

Some git repositories have thousands or millions of commits. My first idea for how to address this was to make commits/ appear empty, so that it works like this:

    $ ls commits/
    $ ls commits/80210c25a86f75440110e4bc280e388b2c098fbd/
    fuse  fuse2nfs  go.mod  go.sum  main.go  README.md

So every commit would be available if you reference it directly, but you can’t list them. This is a weird thing for a filesystem to do, but it actually works fine in FUSE. I couldn’t get it to work in NFS though.
I assume what’s going on here is that if you tell NFS that a directory is empty, it’ll interpret that as the directory actually being empty, which is fair.

I ended up handling this by:

- organizing the commits by their 2-character prefix the way .git/objects does (so that ls commits shows 0b 03 05 06 07 09 1b 1e 3e 4a), but doing 2 levels of this, so that 18d46e76d7c2eedd8577fae67e3f1d4db25018b0 is at commits/18/18d4/18d46e76d7c2eedd8577fae67e3f1d4db25018b0
- listing all the packed commit hashes only once at the beginning, caching them in memory, and then only updating the loose objects afterwards. The idea is that almost all of the commits in the repo should be packed, and git doesn’t repack its commits very often.

This seems to work okay on the Linux kernel, which has ~1 million commits. It takes maybe a minute to do the initial load on my machine, and then after that it just needs to do fast incremental updates.

Each commit hash is only 20 bytes, so caching 1 million commit hashes isn’t a big deal, it’s just 20MB.

I think a smarter way to do this would be to load the commit listings lazily – Git sorts its packfiles by commit ID, so you can pretty easily do a binary search to find all commits starting with 1b or 1b8c. The git library I was using doesn’t have great support for this though, because listing all commits in a Git repository is a really weird thing to do. I spent maybe a couple of days trying to implement it, but I didn’t manage to get the performance I wanted, so I gave up.

problem 4: “not a directory”

I kept getting this error:

    "/tmp/mnt2/commits/59/59167d7d09fd7a1d64aa1d5be73bc484f6621894/": Not a directory (os error 20)

This really threw me off at first, but it turns out that it just means that there was an error while listing the directory, and the way the NFS library handles that error is with “Not a directory”. This happened a bunch of times, and I just needed to track the bug down every time.

There were a lot of weird errors like this. I also got cd: system call interrupted, which was pretty upsetting, but ultimately it was just some other bug in my program.

Eventually I realized that I could use Wireshark to look at all the NFS packets being sent back and forth, which made some of this stuff easier to debug.

problem 5: inode numbers

At first I was accidentally setting all my directory inode numbers to 0. This was bad, because if you run find on a directory where the inode number of every directory is 0, it’ll complain about filesystem loops and give up, which is very fair.

I fixed this by defining an inode(string) function which hashed a string to get the inode number, and using the tree ID / blob ID as the string to hash.

problem 6: stale file handles

I kept getting this “Stale NFS file handle” error. The problem is that I need to be able to take an opaque 64-byte NFS “file handle” and map it to the right directory.

The way the NFS library I’m using works is that it generates a file handle for every file and caches those references with a fixed-size cache. This works fine for small repositories, but if there are too many files then it’ll overflow the cache and you’ll start getting stale file handle errors.

This is still a problem and I’m not sure how to fix it. I don’t understand how real NFS servers do this, maybe they just have a really big cache?

The NFS file handle is 64 bytes (64 bytes! not bits!), which is pretty big, so it does seem like you could just encode the entire file path in the handle a lot of the time and not cache it at all.
Maybe I’ll try to implement that at some point.

problem 7: branch histories

The branch_histories/ directory only lists the latest 100 commits for each branch right now. Not sure what the right move is there – it would be nice to be able to list the full history of the branch somehow. Maybe I could use a similar subfolder trick to the commits/ directory.

problem 8: submodules

Git repositories sometimes have submodules. I don’t understand anything about submodules, so right now I’m just ignoring them. So that’s a bug.

problem 9: is NFSv4 better?

I built this with NFSv3 because the only Go library I could find at the time was an NFSv3 library. After I was done, I discovered that the buildbarn project has an NFSv4 server in it. Would it be better to use that?

I don’t know if this is actually a problem or how big of an advantage it would be to use NFSv4. I’m also a little unsure about using the buildbarn NFS library because it’s not clear if they expect other people to use it or not.

that’s all!

There are probably more problems I forgot, but that’s all I can think of for now. I may or may not fix the NFS stale file handle problem or the “it takes 1 minute to start up on the Linux kernel” problem, who knows!

Thanks to my friend vasi who explained one million things about filesystems to me.
# Document Title

Dependencies Belong in Version Control

November 25th, 2023

I believe that all project dependencies belong in version control. Source code, binary assets, third-party libraries, and even compiler toolchains. Everything.

The process of building any project should be trivial. Clone repo, invoke build command, and that's it. It shouldn't require a complex configure script, downloading Strawberry Perl, installing Conda, or any of that bullshit.

In fact I'll go one step further. A user should be able to perform a clean OS install, download a zip of master, disconnect from the internet, and build. The build process shouldn't require installing any extra tools or content. If it's something the build needs, then it belongs in version control.

First Instinct

Your gut reaction may be revulsion. That it's not possible. Or that it's unreasonable.

You're not totally wrong. If you're using Git for version control, then committing ten gigabytes of cross-platform compiler toolchains is infeasible.

That doesn't change my claim. Dependencies do belong in version control, even if it's not practical today due to Git's limitations. More on that later.

Why

Why do dependencies belong in version control? I'll give a few reasons:

- Usability
- Reliability
- Reproducibility
- Sustainability

Usability

Committing dependencies makes projects trivial to build and run. I have regularly failed to build open source projects and given up in a fit of frustrated rage.

My background is C++ gamedev. C++ infamously doesn't have a standard build system. Which means every project has its own bullshit build system, project generator, dependency manager, scripting runtimes, etc.

ML and GenAI projects are a god damned nightmare to build. They're so terrible to build that there are countless meta-projects that exist solely to provide one-click installers (example: EasyDiffusion). These installers are fragile and sometimes need to be run several times to succeed.

Commit your dependencies and everything "just works". My extreme frustration with trying, and failing, to build open source projects is what inspired this post.

Reliability

Have you ever had a build fail because of a network error on some third-party server? Commit your dependencies and that will never happen.

There's a whole class of problems that simply disappear when dependencies are committed. Builds won't break because of an OS update. Network errors don't exist. You eliminate "works on my machine" issues because someone didn't have the right version of CUDA installed.

Reproducibility

Builds are much easier to reproduce when version control contains everything. Great build systems are hermetic and allow for deterministic builds. This is only possible when your build doesn't depend on your system environment.

Lockfiles are only a partial solution to reproducibility. Docker images are a poor man's VCS.

Sustainability

Committing dependencies makes it trivial to recreate old builds. God help you if you try to build a webdev stack from 2013.

In video games it's not uncommon to release old games on new platforms. These games can easily be 10 or 20 years old. How many modern projects will be easy to build in 20 years? Hell, how many will be easy to build in 5?

Commit your dependencies and ancient code bases will be as easy to rebuild as possible. Although new platforms will require new code, of course.

Proof of Life

To prove that this isn't completely crazy, I built a proof of life C++ demo.
My program is exceedingly simple:

    #include <fmt/core.h>

    int main() {
        fmt::print("Hello world from C++ 👋\n");
        fmt::print("goodbye cruel world from C++ ☠️\n");
        return 0;
    }

The folder structure looks like this:

    \root
        \sample_cpp_app
            - main.cpp
        \thirdparty
            \fmt (3 MB)
        \toolchains
            \win
                \cmake (106 MB)
                \LLVM (2.5 GB)
                \mingw64 (577 MB)
                \ninja (570 KB)
                \Python311 (20.5 MB)
        - CMakeLists.txt
        - build.bat
        - build.py

The toolchains folder contains five dependencies – CMake, LLVM, MinGW-w64, Ninja, and Python 3.11. Their combined size is 3.19 gigabytes. No effort was made to trim these folders down in size.

The build.bat file nukes all environment variables and sets PATH=C:\Windows\System32;. This ensures only the included toolchains are used to compile.

The end result is a C++ project that "just works".

But Wait There's More

Here's where it gets fun. I wrote a Python script that scans the directory for "last file accessed time" to track "touched files". This lets me check how many toolchain files are actually needed by the build. It produces this output:

    Checking initial file access times... 🥸👨🔬🔬
    Building... 👷♂️💪🛠️
    Compile success! 😁
    Checking new file access times... 🥸👨🔬🔬
    File Access Stats
        Touched 508 files. Total Size: 272.00 MB
        Untouched 23138 files. Total Size: 2.93 GB
        Touched 2.1% of files
        Touched 8.3% of bytes
    Running program...
        Target exe: c:\temp\code\toolchain_vcs\bin\main.exe
    Hello world from C++ 👋
    goodbye cruel world from C++ ☠️
    Built and ran successfully! 😍

Well will you look at that!

Despite committing 3 gigabytes of toolchains we only actually needed a mere 272 megabytes. Well under 10%! Even better, we touched just 2.1% of repo files.

The largest files touched were:

    clang++.exe      [116.04 MB]
    ld.lld.exe       [86.05 MB]
    llvm-ar.exe      [28.97 MB]
    cmake.exe        [11.26 MB]
    libgcc.a         [5.79 MB]
    libstdc++.dll.a  [5.32 MB]
    libmsvcrt.a      [2.00 MB]
    libstdc++-6.dll  [1.93 MB]
    libkernel32.a    [1.27 MB]

My key takeaway is this: toolchain file sizes are tractable for version control if you can trim the fat.

This sparks my joy. Imagine cloning a repo, clicking build, and having it just work. What a wonderful and delightful world that would be!

A Vision for the Future

I'd like to paint a small dream for what I will call Next Gen Version Control Software (NGVCS). This is my vision for a Git/Perforce successor. Here are some of the key features I want NGVCS to have:

- virtual file system to fetch only files a user touches
- copy-on-write file storage
- system cache for NGVCS files

Let's pretend for a moment that every open source project commits their dependencies. Each one contains a full copy of Python, CUDA, Clang, MSVC, libraries, etc. What would happen?

First, the user clones a random GenAI repo. This is near instantaneous, as files are not prefetched. The user then invokes the build script. As files are accessed they're downloaded. The very first build may download a few hundred megabytes of data. Notably, it does NOT download the entire repo. If the user is on Linux it won't download any binaries for macOS or Windows.

Second, the user clones another GenAI repo and builds. Does this need to re-download gigabytes of duplicated toolchain content? No! Both projects use NGVCS, which has a system-wide file cache. Since we're also using a copy-on-write file system, these files instantly materialize in the second repo at zero cost.

The end result is beautiful. Every project is trivial to fetch, build, and run. And users only have to download the minimum set of files to do so.
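To make the scenario concrete, here is what that flow might look like with a hypothetical ngvcs command line; the tool name, subcommands, and URLs are all invented for illustration:

```sh
# hypothetical NGVCS workflow; "ngvcs" does not exist
$ ngvcs clone example.com/genai-repo-a    # instant: no file contents prefetched
$ cd genai-repo-a && ./build.sh           # faults in ~272 MB of toolchain on first use
$ ngvcs clone example.com/genai-repo-b    # second repo: toolchain already in the system cache
$ cd genai-repo-b && ./build.sh           # copy-on-write: shared files materialize at no cost
```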
The Real World and Counter Arguments

Hopefully I've convinced some of you that committing dependencies is at least a good idea in an ideal world.

Now let's consider the real world and a few counter arguments.

The Elephant in the Room – Git

Unfortunately, I must admit that committing dependencies is not practical today. The problem is Git. One of my unpopular opinions is that Git isn't very good. Among its many sins is terrible support for large files and large repositories.

The root issue is that Git's architecture and default behavior expects all users to have a full copy of the entire repo history. Which means every version of every binary toolchain for every platform. Yikes!

There are various workarounds – Git LFS, Git Submodules, shallow clones, partial clones, etc. The problem is these aren't first-class features. They are, imho, second-class hacks. 😓

In theory Git could be updated to more properly support large projects. I believe Git should be shallow and partial by default. Almost all software projects are de facto centralized. Needing full history isn't the default, it's an edge case. Users should opt in to full history only if they need it.

Containers

An alternative to committing dependencies is to use containers. If you build out of a container you get most, if not all, of the benefits. You can even maintain an archive of docker images that reliably re-build tagged releases.

Congrats, you're now using Docker as your VCS!

My snarky opinion is that Docker and friends primarily exist because modern build systems are so god damned fragile that the only way to reliably build and deploy is to create a full OS image. This is insanity!

Containers shouldn't be required simply to build and run projects. It's embarrassing that's the world we live in.

Licensing

Not all dependencies are authorized for redistribution. I believe MSVC and Xcode both disallow redistribution of compiler toolchains? Game consoles like Sony PlayStation and Nintendo Switch don't publicly release headers, libs, or compilers.

This is mostly ok. If you're working on a console project, then you're already working on a closed source project. Developers already use permission controls to gate access.

The lack of redistribution rights for "normal" toolchains is annoying. However, permissive options are available. If committing dependencies becomes common practice, then I think it's likely that toolchain licenses will update to accommodate.

Updating Dependencies

Committing library dependencies to version control means they need to be updated. If you have lots of repos to update, this could be a moderate pain in the ass.

This is also the opposite of how Linux works. In Linux land you use a hot mess of system libraries sprinkled chaotically across the search path. That way, when there is a security fix, you update a single .so (or three) and your system is safe.

I think this is largely a non-issue. Are you building and running your services out of Docker? Do you have a fleet of machines? Do you have lockfiles? Do you compile any third-party libraries from source? If the answer to any of these questions is yes, and it is, then you already have a non-trivial procedure to apply security fixes.

Committing dependencies to VCS doesn't make security updates much harder. In fact, having a monorepo source of truth can make things easier!
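As a sketch of what that procedure might look like with committed dependencies (the vendored path and version here are hypothetical):

```sh
$ git log --oneline -- thirdparty/zlib      # hypothetical vendored library path
$ rm -rf thirdparty/zlib && cp -r ~/src/zlib-1.3.1 thirdparty/zlib
$ git commit -am "Update vendored zlib to 1.3.1 (security fix)"
$ git push    # every checkout now builds against the patched library
```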
DVCS

One of Git's claims to fame is its distributed nature. At long last, developers can commit work from an internetless cafe or airplane!

My NGVCS dream implies de facto centralization, especially for large projects with large histories. Does that mean an internet connection is required? Absolutely not! Even Perforce, the king of centralized VCS, supports offline mode. Git continues to function locally even when working with shallow and partial Git clones.

Offline mode and decentralization are independent concepts. I don't know why so many people get this wrong.

Libraries

Do I really think that every library, such as fmt, should commit gigabytes of compilers to version control?

That's a good question. For languages like Rust, which have a universal build system, probably not. For languages like C++ and Python, maybe yes! It'd be a hell of a lot easier to contribute to open source projects if step 0 wasn't "spend 8 hours configuring environment to build".

For libraries the answer may be "it depends". For executables I think the answer is "yes, commit everything".

Dreams vs Reality

NGVCS is obviously a dream. It doesn't exist today. Actually, that's not quite true. This is exactly how Google and Meta operate today. In fact, numerous large companies have custom NGVCS equivalents for internal use. Unfortunately there isn't a good solution in the public sphere.

Is committing dependencies reasonable for Git users today? The answer is... almost? It's at least closer than most people realize! A full Python deployment is merely tens to hundreds of megabytes. Clang is only a few gigabytes. A 2TB SSD is only $100. I would enthusiastically donate a few gigabytes of hard drive space in exchange for builds that "just work".

Committing dependencies to Git might be possible to do cleanly today with shallow, sparse, and LFS clones. Maybe. It'd be great if you could run git clone --depth=1 --sparse=windows. Maybe someday.

Conclusion

I strongly believe that dependencies belong in version control. I believe it is "The Right Thing". There are significant benefits to usability, reliability, reproducibility, sustainability, and more.

Committing all dependencies to a Git repo may be more practical than you realize. The actual file size is very reasonable.

Improvements to VCS software can allow repos to commit cross-platform dependencies while allowing users to download the bare minimum amount of content. It's the best of everything.

I hope that I have convinced you that committing dependencies and toolchains is "The Right Thing". I hope that version control systems evolve to accommodate this as a best practice.

Thank you.

Bonus Section

If you read this far, thank you! Here are some extra thoughts I wanted to share but couldn't squeeze into the main article.

Sample Project

The sample project can be downloaded via Dropbox as a 636 MB .7zip file. It should be trivial to download and build! Linux and macOS toolchains aren't included because I only have a Windows machine to test on. It's not on GitHub because they have an unnecessary file size limit.

Git LFS

My dream NGVCS has first-class support for all the features I mentioned and more.

Git LFS is, imho, a hacky, second-class citizen. It works and people use it. But it requires a bunch of extra effort and running extra commands.

Deployment

I have a related rant that not only should all dependencies be checked into the build system, but that deployments should also include all dependencies. Yes, deploy 2 GB+ of CUDA DLLs so your exe will reliably run. No, don't force me to use Docker to run your simple Python project.
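Related to the earlier point about shallow and sparse clones: here is a minimal sketch of what Git's real partial-clone and sparse-checkout features can already do today; the repository URL and paths are hypothetical:

```sh
# fetch no blobs up front, then check out only the directories you need
$ git clone --filter=blob:none --no-checkout https://example.com/big-repo.git
$ cd big-repo
$ git sparse-checkout set sample_cpp_app toolchains/win   # hypothetical paths
$ git checkout master    # blobs for just those paths are fetched on demand
```

It is not the single git clone --depth=1 --sparse=windows command wished for above, but it is in the same spirit.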
No, don't force me to use Docker to run your simple Python project.Git AlternativesThere are a handful of interesting Git alternatives in the pipeline.Jujutsu - Git but betterPijul - Somewhat academic patch-based VCSSapling - Open source version of Meta's VCS. Not fully usable outside of Meta infra.Xethub - Git at 100Tb scale to support massive ML modelsGit isn't going to be replaced anytime soon, unfortunately. But there are a variety of projects exploring different ideas. VCS is far from a solved problem. Be open minded!Package ManagersPackage managers are not necessarily a silver bullet. Rust's Cargo is pretty good. NPM is fine I guess. Meanwhile Python's package ecosystem is an absolute disaster. There may be a compile-time vs run-time distinction here.A good package manager is a decent solution. However package managers exist on a largely per-language basis. And sometimes per-platform. Committing dependencies is a guaranteed good solution for all languages on all platforms.Polyglot projects that involve multiple languages need multiple package managers. Yuck.
# Document Titlegit branches: intuition & reality• git •Hello! I’ve been working on writing a zine about git so I’ve been thinking about git branches a lot. I keep hearing from people that they find the way git branches work to be counterintuitive. It got me thinking: what might an “intuitive” notion of a branch be, and how is it different from how git actually works?So in this post I want to briefly talk aboutan intuitive mental model I think many people havehow git actually represents branches internally (“branches are a pointer to a commit” etc)how the “intuitive model” and the real way it works are actually pretty closely relatedsome limits of the intuitive model and why it might cause problemsNothing in this post is remotely groundbreaking so I’m going to try to keep it pretty short.an intuitive model of a branchOf course, people have many different intuitions about branches. Here’s the one that I think corresponds most closely to the physical “a branch of an apple tree” metaphor.My guess is that a lot of people think about a git branch like this: the 2 commits in pink in this picture are on a “branch”.I think there are two important things about this diagram:the branch has 2 commits on itthe branch has a “parent” (main) which it’s an offshoot ofThat seems pretty reasonable, but that’s not how git defines a branch – most importantly, git doesn’t have any concept of a branch’s “parent”. So how does git define a branch?in git, a branch is the full historyIn git, a branch is the full history of every previous commit, not just the “offshoot” commits. So in our picture above both branches (main and branch) have 4 commits on them.I made an example repository at https://github.com/jvns/branch-example which has its branches set up the same way as in the picture above. Let’s look at the 2 branches:main has 4 commits on it:$ git log --oneline main70f727a df654888 c3997a46 ba74606f aand mybranch has 4 commits on it too. The bottom two commits are shared between both branches.$ git log --oneline mybranch13cb960 y9554dab x3997a46 ba74606f aSo mybranch has 4 commits on it, not just the 2 commits 13cb960 and 9554dab that are “offshoot” commits.You can get git to draw all the commits on both branches like this:$ git log --all --oneline --graph* 70f727a (HEAD -> main, origin/main) d* f654888 c| * 13cb960 (origin/mybranch, mybranch) y| * 9554dab x|/* 3997a46 b* a74606f aa branch is stored as a commit IDInternally in git, branches are stored as tiny text files which have a commit ID in them. That commit is the latest commit on the branch. This is the “technically correct” definition I was talking about at the beginning.Let’s look at the text files for main and mybranch in our example repo:$ cat .git/refs/heads/main70f727acbe9ea3e3ed3092605721d2eda8ebb3f4$ cat .git/refs/heads/mybranch13cb960ad86c78bfa2a85de21cd54818105692bcThis makes sense: 70f727 is the latest commit on main and 13cb96 is the latest commit on mybranch.The reason this works is that every commit contains a pointer to its parent(s), so git can follow the chain of pointers to get every commit on the branch.Like I mentioned before, the thing that’s missing here is any relationship at all between these two branches. 
Now that we’ve talked about how the intuitive notion of a branch is “wrong”, I want to talk about how it’s also right in some very important ways.

people’s intuition is usually not that wrong

I think it’s pretty popular to tell people that their intuition about git is “wrong”. I find that kind of silly – in general, even if people’s intuition about a topic is technically incorrect in some ways, people usually have the intuition they do for very legitimate reasons! “Wrong” models can be super useful.

So let’s talk about 3 ways the intuitive “offshoot” notion of a branch matches up very closely with how we actually use git in practice.

rebases use the “intuitive” notion of a branch

Now let’s go back to our original picture.

When you rebase mybranch on main, it takes the commits on the “intuitive” branch (just the 2 pink commits) and replays them onto main.

The result is that just the 2 commits (x and y) get copied. Here’s what that looks like:

    $ git switch mybranch
    $ git rebase main
    $ git log --oneline mybranch
    952fa64 (HEAD -> mybranch) y
    7d50681 x
    70f727a (origin/main, main) d
    f654888 c
    3997a46 b
    a74606f a

Here git rebase has created two new commits (952fa64 and 7d50681) whose information comes from the previous two x and y commits.

So the intuitive model isn’t THAT wrong! It tells you exactly what happens in a rebase.

But because git doesn’t know that mybranch is an offshoot of main, you need to tell it explicitly where to rebase the branch.

merges use the “intuitive” notion of a branch too

Merges don’t copy commits, but they do need a “base” commit: the way merges work is that it looks at two sets of changes (starting from the shared base) and then merges them.

Let’s undo the rebase we just did and then see what the merge base is.

    $ git switch mybranch
    $ git reset --hard 13cb960   # undo the rebase
    $ git merge-base main mybranch
    3997a466c50d2618f10d435d36ef12d5c6f62f57

This gives us the “base” commit where our branch branched off, 3997a4. That’s exactly the commit you would think it might be based on our intuitive picture.

github pull requests also use the intuitive idea

If we create a pull request on GitHub to merge mybranch into main, it’ll also show us 2 commits: the commits x and y. That makes sense and also matches our intuitive notion of a branch.

I assume if you make a merge request on GitLab it shows you something similar.

intuition is pretty good, but it has some limits

This leaves our intuitive definition of a branch looking pretty good actually! The “intuitive” idea of what a branch is matches exactly with how merges and rebases and GitHub pull requests work.

You do need to explicitly specify the other branch when merging or rebasing or making a pull request (like git rebase main), because git doesn’t know what branch you think your offshoot is based on.

But the intuitive notion of a branch has one fairly serious problem: the way you intuitively think about main and an offshoot branch are very different, and git doesn’t know that.

So let’s talk about the different kinds of git branches.

trunk and offshoot branches

To a human, main and mybranch are pretty different, and you probably have pretty different intentions around how you want to use them.

I think it’s pretty normal to think of some branches as being “trunk” branches, and some branches as being “offshoots”.
Also you can have an offshoot of an offshoot.

Of course, git itself doesn’t make any such distinctions (the term “offshoot” is one I just made up!), but what kind of a branch it is definitely affects how you treat it.

For example:

- you might rebase mybranch onto main, but you probably wouldn’t rebase main onto mybranch – that would be weird!
- in general people are much more careful around rewriting the history on “trunk” branches than short-lived offshoot branches

git lets you do rebases “backwards”

One thing I think throws people off about git is – because git doesn’t have any notion of whether a branch is an “offshoot” of another branch, it won’t give you any guidance about if/when it’s appropriate to rebase branch X on branch Y. You just have to know.

For example, you can do either:

    $ git checkout main
    $ git rebase mybranch

or

    $ git checkout mybranch
    $ git rebase main

Git will happily let you do either one, even though in this case git rebase main is extremely normal and git rebase mybranch is pretty weird. A lot of people said they found this confusing, so here’s a picture of the two kinds of rebases:

Similarly, you can do merges “backwards”, though that’s much more normal than doing a backwards rebase – merging mybranch into main and main into mybranch are both useful things to do for different reasons.

Here’s a diagram of the two ways you can merge:

git’s lack of hierarchy between branches is a little weird

I hear the statement “the main branch is not special” a lot, and I’ve been puzzled about it – in most of the repositories I work in, main is pretty special! Why are people saying it’s not?

I think the point is that even though branches do have relationships between them (main is often special!), git doesn’t know anything about those relationships.

You have to tell git explicitly about the relationship between branches every single time you run a git command like git rebase or git merge, and if you make a mistake things can get really weird.

I don’t know whether git’s design here is “right” or “wrong” (it definitely has some pros and cons, and I’m very tired of reading endless arguments about it), but I do think it’s surprising to a lot of people for good reason.

git’s UI around branches is weird too

Let’s say you want to look at just the “offshoot” commits on a branch, which as we’ve discussed is a completely normal thing to want.

Here’s how to see just the 2 offshoot commits on our branch with git log:

    $ git switch mybranch
    $ git log main..mybranch --oneline
    13cb960 (HEAD -> mybranch, origin/mybranch) y
    9554dab x

You can look at the combined diff for those same 2 commits with git diff like this:

    $ git diff main...mybranch

So to see the 2 commits x and y with git log, you need to use 2 dots (..), but to look at the same commits with git diff, you need to use 3 dots (...).
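One way to remember the asymmetry: two dots in git log means "commits reachable from the right side but not the left", and three dots in git diff means "diff starting from the merge base". A quick sketch in the same example repo (decorations omitted):

```sh
$ git log --oneline mybranch..main    # commits on main that mybranch doesn't have
70f727a d
f654888 c
$ git diff main...mybranch            # same as: git diff 3997a46 mybranch
```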
Personally, I can never remember what .. and ... mean, so I just avoid them completely, even though in principle they seem useful.

in GitHub, the default branch is special

Also, it’s worth mentioning that GitHub does have a “special branch”: every github repo has a “default branch” (in git terms, it’s what HEAD points at), which is special in the following ways:

- it’s what you check out when you git clone the repository
- it’s the default destination for pull requests
- github will suggest that you protect the default branch from force pushes

and probably even more that I’m not thinking of.

that’s all!

This all seems extremely obvious in retrospect, but it took me a long time to figure out what a more “intuitive” idea of a branch even might be, because I was so used to the technical “a branch is a reference to a commit” definition.

I also hadn’t really thought about how git makes you tell it about the hierarchy between your branches every time you run a git rebase or git merge command – for me it’s second nature to do that and it’s not a big deal, but now that I’m thinking about it, it’s pretty easy to see how somebody could get mixed up.
# Document Title

RosettaStone

Translations with other distributed VCS

Basic distributed version control

The following commands have the same name in darcs, git and hg, with minor differences due to differences in concepts: init, clone, pull, push, log.

Branching

| concept | darcs | git | hg |
| branch | na | branch | branch |
| switch branch | na [1] | checkout | update |

[1] No in-repo branching yet, see issue555.

Adding, moving, removing files

| concept | darcs | git | hg |
| track file | add | add | add |
| copy file | na | na | copy |
| move/rename file | move | mv | rename |

Inspecting the working directory

| concept | darcs | git | hg |
| working dir status | whatsnew -s | status | status |
| high-level local diff | whatsnew | na | na |
| diff local | diff [1] | diff | diff |

[1] we tend to use the high-level local diff (darcs whatsnew) instead. This displays the patches themselves (eg ‘mv foo bar’) and not just their effects (eg ‘rm foo’ followed by ‘add bar’).

Committing

| concept | darcs | git | hg |
| commit locally | record | commit | commit |
| amend commit | amend | commit --amend | commit --amend |
| tag changes/revisions | tag | tag | tag |

Inspecting the repository history

| concept | darcs | git | hg |
| log | log | log | log |
| log with diffs | log -v | log -p | log -p |
| manifest | show files | ls-files | manifest |
| summarise outgoing changes | push --dry-run | log origin/master.. | outgoing |
| summarise incoming changes | pull --dry-run | log ..origin/master | incoming |
| diff repos or versions | diff | diff | incoming/outgoing/diff -r |
| blame/annotate | annotate | blame | annotate |

Undoing

| concept | darcs | git | hg |
| revert a file | revert foo | checkout foo | revert foo |
| revert full working copy | revert | reset --hard | revert -a |
| undo commit (leaving working copy untouched) | unrecord | reset --soft | rollback |
| amend commit | amend | commit --amend | commit --amend |
| destroy last patch/changeset | obliterate | delete the commit | strip [1] |
| destroy any patch/changeset | obliterate | rebase -i, delete the commit | strip [1] |
| create anti-changeset | rollback | revert | backout |

[1] requires extension (mq for strip)

Collaborating with others

| concept | darcs | git | hg |
| send by mail | send | send-email | email [1] |

[1] requires extension (patchbomb for email)

Advanced usage

| concept | darcs | git | hg |
| port commit to X | rebase | rebase/cherry-pick | transplant |

Translations from Subversion

- svn checkout → darcs clone
- svn update → darcs pull
- svn status -u → darcs pull --dry-run (summarize remote changes)
- svn status → darcs whatsnew --summary (summarize local changes)
- svn status | grep '?' → darcs whatsnew -ls | grep ^a (list potential files to add)
- svn revert foo.txt → darcs revert foo.txt (revert to foo.txt from repo)
- svn diff → darcs whatsnew (for local changes)
- svn diff → darcs diff (for local and recorded changes)
- svn commit → darcs record + darcs push
- svn diff | mail → darcs send
- svn add → darcs add
- svn log → darcs log

Discrepancies between DVCS

Git has the notion of an index (which affects the meanings of some of the commands); Darcs just has its simple branch-is-repo-is-workspace model.

See also

- Features
- Differences from Git
- Differences from Subversion
- Wikipedia’s comparison of revision control systems

Wiki source: darcs get --lazy http://darcs.net/darcs-wiki

Powered by: gitit + darcs
# How git cherry-pick and revert use 3-way merge

Hello! I was trying to explain to someone how `git cherry-pick` works the other day, and I found myself getting confused.

What went wrong was: I thought that `git cherry-pick` was basically applying a patch, but when I tried to actually do it that way, it didn’t work!

Let’s talk about what I thought cherry-pick did (applying a patch), why that’s not quite true, and what it actually does instead (a “3-way merge”).

This post is extremely in the weeds and you definitely don’t need to understand this stuff to use git effectively. But if you (like me) are curious about git’s internals, let’s talk about it!

## cherry-pick isn’t applying a patch

The way I previously understood `git cherry-pick COMMIT_ID` is:

1. calculate the diff for `COMMIT_ID`, like `git show COMMIT_ID --patch > out.patch`
2. apply the patch to the current branch, like `git apply out.patch`

Before we get into this – I want to be clear that this model is mostly right, and if that’s your mental model that’s fine. But it’s wrong in some subtle ways and I think that’s kind of interesting, so let’s see how it works.

If I try to do the “calculate the diff and apply the patch” thing in a case where there’s a merge conflict, here’s what happens:

```
$ git show 10e96e46 --patch > out.patch
$ git apply out.patch
error: patch failed: content/post/2023-07-28-why-is-dns-still-hard-to-learn-.markdown:17
error: content/post/2023-07-28-why-is-dns-still-hard-to-learn-.markdown: patch does not apply
```

This just fails – it doesn’t give me any way to resolve the conflict or figure out how to solve the problem.

This is quite different from what actually happens when I run `git cherry-pick`, which is that I get a merge conflict:

```
$ git cherry-pick 10e96e46
error: could not apply 10e96e46... wip
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
```

So it seems like the “git is applying a patch” model isn’t quite right. But the error message literally does say “could not apply 10e96e46”, so it’s not quite wrong either. What’s going on?

## so what is cherry-pick doing?

I went digging through git’s source code to see how cherry-pick works, and ended up at this line of code:

```
res = do_recursive_merge(r, base, next, base_label, next_label, &head, &msgbuf, opts);
```

So a cherry-pick is a… merge? What? How? What is it even merging? And how does merging even work in the first place?

I realized that I didn’t really know how git’s merge worked, so I googled it and found out that git does a thing called “3-way merge”. What’s that?

## how git merges files: the 3-way merge

Let’s say I want to merge these 2 files. We’ll call them `v1.py` and `v2.py`.

```
def greet():
    greeting = "hello"
    name = "julia"
    return greeting + " " + name
```

```
def say_hello():
    greeting = "hello"
    name = "aanya"
    return greeting + " " + name
```

There are two lines that differ: we have

- `def greet()` and `def say_hello`
- `name = "aanya"` and `name = "julia"`

How do we know what to pick? It seems impossible!

But what if I told you that the original function was this (`base.py`)?

```
def say_hello():
    greeting = "hello"
    name = "julia"
    return greeting + " " + name
```

Suddenly it seems a lot clearer! v1 changed the function’s name to `greet` and v2 set `name = "aanya"`.
So to merge, we should make both those changes:

```
def greet():
    greeting = "hello"
    name = "aanya"
    return greeting + " " + name
```

We can ask git to do this merge with `git merge-file`, and it gives us exactly the result we expected: it picks `def greet()` and `name = "aanya"`.

```
$ git merge-file v1.py base.py v2.py -p
def greet():
    greeting = "hello"
    name = "aanya"
    return greeting + " " + name
```

This way of merging, where you merge 2 files + their original version, is called a 3-way merge.

If you want to try it out yourself in a browser, I made a little playground at jvns.ca/3-way-merge/. I made it very quickly so it’s not mobile friendly.

## git merges changes, not files

The way I think about the 3-way merge is – git merges changes, not files. We have an original file and 2 possible changes to it, and git tries to combine both of those changes in a reasonable way. Sometimes it can’t (for example if both changes change the same line), and then you get a merge conflict.

Git can also merge more than 2 possible changes: you can have an original file and 8 possible changes, and it can try to reconcile all of them. That’s called an octopus merge, but I don’t know much more than that, I’ve never done one.

## how git uses 3-way merge to apply a patch

Now let’s get a little weird! When we talk about git “applying a patch” (as you do in a rebase or revert or cherry-pick), it’s not actually creating a patch file and applying it. Instead, it’s doing a 3-way merge.

Here’s how applying commit X as a patch to your current commit corresponds to the v1, v2, and base setup from before:

- The version of the file in your current commit is v1.
- The version of the file before commit X is base.
- The version of the file in commit X: call that v2.
- Run `git merge-file v1 base v2` to combine them (technically git does not actually run `git merge-file`, it runs a C function that does it).

Together, you can think of base and v2 as being the “patch”: the diff between them is the change that you want to apply to v1.

## how cherry-pick works

Let’s say we have this commit graph, and we want to cherry-pick Y on to main:

```
A - B (main)
 \
  X - Y - Z
```

How do we turn that into a 3-way merge? Here’s how it translates into our v1, v2 and base from earlier:

- B is v1
- X is the base, Y is v2

So together X and Y are the “patch”.

And `git rebase` is just like `git cherry-pick`, but repeated a bunch of times.

## how revert works

Now let’s say we want to run `git revert Y` on this commit graph:

```
X - Y - Z - A - B
```

- B is v1
- Y is the base, X is v2

This is exactly like a cherry-pick, but with X and Y reversed. We have to flip them because we want to apply a “reverse patch”.

Revert and cherry-pick are so closely related in git that they’re actually implemented in the same file: `revert.c`.

## this “3-way patch” is a really cool trick

This trick of using a 3-way merge to apply a commit as a patch seems really clever and cool and I’m surprised that I’d never heard of it before! I don’t know of a name for it, but I kind of want to call it a “3-way patch”.

The idea is that with a 3-way patch, you specify the patch as 2 files: the file before the patch and after (base and v2 in our language in this post).

So there are 3 files involved: 1 for the original and 2 for the patch.

The point is that the 3-way patch is a much better way to patch than a normal patch, because you have a lot more context for merging when you have both full files.

Here’s more or less what a normal patch for our example looks like:

```
@@ -1,1 +1,1 @@
- def greet():
+ def say_hello():
  greeting = "hello"
```

and a 3-way patch.
This “3-way patch” is not a real file format, it’s just something I made up.

BEFORE: (the full file)

```
def greet():
    greeting = "hello"
    name = "julia"
    return greeting + " " + name
```

AFTER: (the full file)

```
def say_hello():
    greeting = "hello"
    name = "julia"
    return greeting + " " + name
```

## “Building Git” talks about this

The book Building Git by James Coglan is the only place I could find, other than the git source code, explaining how git cherry-pick actually uses 3-way merge under the hood (I thought Pro Git might talk about it, but it didn’t seem to as far as I could tell).

I actually went to buy it and it turned out that I’d already bought it in 2019, so it was a good reference to have here :)

## merging is actually much more complicated than this

There’s more to merging in git than the 3-way merge – there’s something called a “recursive merge” that I don’t understand, and there are a bunch of details about how to deal with handling file deletions and moves, and there are also multiple merge algorithms.

My best idea for where to learn more about this stuff is Building Git, though I haven’t read the whole thing.

## so what does git apply do?

I also went looking through git’s source to find out what `git apply` does, and it seems to (unsurprisingly) be in `apply.c`. That code parses a patch file, and then hunts through the target file to figure out where to apply it. The core logic seems to be around here: I think the idea is to start at the line number that the patch suggested and then hunt forwards and backwards from there to try to find it:

```
/*
 * There's probably some smart way to do this, but I'll leave
 * that to the smart and beautiful people. I'm simple and stupid.
 */
backwards = current;
backwards_lno = line;
forwards = current;
forwards_lno = line;
current_lno = line;

for (i = 0; ; i++) {
	...
```

That all seems pretty intuitive and about what I’d naively expect.

## how git apply --3way works

`git apply` also has a `--3way` flag that does a 3-way merge. So we actually could have more or less implemented `git cherry-pick` with `git apply` like this:

```
$ git show 10e96e46 --patch > out.patch
$ git apply out.patch --3way
Applied patch to 'content/post/2023-07-28-why-is-dns-still-hard-to-learn-.markdown' with conflicts.
U content/post/2023-07-28-why-is-dns-still-hard-to-learn-.markdown
```

`--3way` doesn’t just use the contents of the patch file though! The patch file starts with:

```
index d63ade04..65778fc0 100644
```

`d63ade04` and `65778fc0` are the IDs of the old/new versions of that file in git’s object database, so git can retrieve them to do a 3-way patch application. This won’t work if someone emails you a patch and you don’t have the files for the new/old versions of the file though: if you’re missing the blobs you’ll get this error:

```
$ git apply out.patch
error: repository lacks the necessary blob to perform 3-way merge.
```

## 3-way merge is old

A couple of people pointed out that 3-way merge is much older than git, it’s from the late 70s or something. Here’s a paper from 2007 talking about it.

## that’s all!

I was pretty surprised to learn that I didn’t actually understand the core way that git applies patches internally – it was really cool to learn about!

I have lots of issues with git’s UI but I think this particular thing is not one of them.
The 3-way merge seems like a nice unified way to solve a bunch of different problems, and it’s pretty intuitive for people (the idea of “applying a patch” is one that a lot of programmers are used to thinking about, and the fact that it’s implemented as a 3-way merge under the hood is an implementation detail that nobody actually ever needs to think about).

Also a very quick plug: I’m working on writing a zine about git; if you’re interested in getting an email when it comes out, you can sign up to my very infrequent announcements mailing list.
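If you want to see the cherry-pick-as-merge correspondence for yourself, you can reproduce it by hand with `git merge-file`. This is a sketch under assumptions: a file `file.py` that exists in all three commits, and the B/X/Y names from the graphs above standing in for real commit IDs on your machine:

```
# v1 = the file as of B (our current commit)
$ git show B:file.py > v1.py
# base = the file just before the cherry-picked commit Y (i.e., X)
$ git show Y^:file.py > base.py
# v2 = the file as of Y
$ git show Y:file.py > v2.py
# combine them; on a real conflict you get conflict markers, like cherry-pick
$ git merge-file v1.py base.py v2.py -p
```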
# Document Titlegit rebase: what can go wrong?Hello! While talking with folks about Git, I’ve been seeing a comment over and over to the effect of “I hate rebase”. People seemed to feel pretty strongly about this, and I was really surprised because I don’t run into a lot of problems with rebase and I use it all the time.I’ve found that if many people have a very strong opinion that’s different from mine, usually it’s because they have different experiences around that thing from me.So I asked on Mastodon:today I’m thinking about the tradeoffs of using git rebase a bit. I think the goal of rebase is to have a nice linear commit history, which is something I like.but what are the costs of using rebase? what problems has it caused for you in practice? I’m really only interested in specific bad experiences you’ve had here – not opinions or general statements like “rewriting history is bad”I got a huge number of incredible answers to this, and I’m going to do my best to summarize them here. I’ll also mention solutions or workarounds to those problems in cases where I know of a solution. Here’s the list:fixing the same conflict repeatedly is annoyingrebasing a lot of commits is hardundoing a rebase is hardforce pushing to shared branches can cause lost workforce pushing makes code reviews harderlosing commit metadatamore difficult revertsrebasing can break intermediate commitsaccidentally run git commit –amend instead of git rebase –continuesplitting commits in an interactive rebase is hardcomplex rebases are hardrebasing long lived branches can be annoyingrebase and commit disciplinea “squash and merge” workflowmiscellaneous problemsMy goal with this isn’t to convince anyone that rebase is bad and you shouldn’t use it (I’m certainly going to keep using rebase!). But seeing all these problems made me want to be more cautious about recommending rebase to newcomers without explaining how to use it safely. It also makes me wonder if there’s an easier workflow for cleaning up your commit history that’s harder to accidentally mess up.my git workflow assumptionsFirst, I know that people use a lot of different Git workflows. I’m going to be talking about the workflow I’m used to when working on a team, which is:the team uses a central Github/Gitlab repo to coordinatethere’s one central main branch. It’s protected from force pushes.people write code in feature branches and make pull requests to mainThe web service is deployed from main every time a pull request is merged.the only way to make a change to main is by making a pull request on Github/Gitlab and merging itThis is not the only “correct” git workflow (it’s a very “we run a web service” workflow and open source project or desktop software with releases generally use a slightly different workflow). But it’s what I know so that’s what I’ll talk about.two kinds of rebaseAlso before we start: one big thing I noticed is that there were 2 different kinds of rebase that kept coming up, and only one of them requires you to deal with merge conflicts.rebasing on an ancestor, like git rebase -i HEAD^^^^^^^ to squash many small commits into one. As long as you’re just squashing commits, you’ll never have to resolve a merge conflict while doing this.rebasing onto a branch that has diverged, like git rebase main. 
This can cause merge conflicts.

I think it’s useful to make this distinction because sometimes I’m thinking about rebase type 1 (which is a lot less likely to cause problems), but people who are struggling with it are thinking about rebase type 2.

Now let’s move on to all the problems!

## fixing the same conflict repeatedly is annoying

If you make many tiny commits, sometimes you end up in a hellish loop where you have to fix the same merge conflict 10 times. You can also end up fixing merge conflicts totally unnecessarily (like dealing with a merge conflict in code that a future commit deletes).

There are a few ways to make this better:

- first do a `git rebase -i HEAD^^^^^^^^^^^` to squash all of the tiny commits into 1 big commit, and then a `git rebase main` to rebase onto a different branch. Then you only have to fix the conflicts once.
- use `git rerere` to automate repeatedly resolving the same merge conflicts (“rerere” stands for “reuse recorded resolution”: it’ll record your previous merge conflict resolutions and replay them). I’ve never tried this but I think you need to set `git config rerere.enabled true` and then it’ll automatically help you.

Also if I find myself resolving merge conflicts more than once in a rebase, I’ll usually run `git rebase --abort` to stop it and then squash my commits into one and try again.

## rebasing a lot of commits is hard

Generally when I’m doing a rebase onto a different branch, I’m rebasing 1-2 commits. Maybe sometimes 5! Usually there are no conflicts and it works fine.

Some people described rebasing hundreds of commits by many different people onto a different branch. That sounds really difficult and I don’t envy that task.

## undoing a rebase is hard

I heard from several people that when they were new to rebase, they messed up a rebase and permanently lost a week of work that they then had to redo.

The problem here is that undoing a rebase that went wrong is much more complicated than undoing a merge that went wrong (you can undo a bad merge with something like `git reset --hard HEAD^`). Many newcomers to rebase don’t even realize that undoing a rebase is possible, and I think it’s pretty easy to understand why.

That said, it is possible to undo a rebase that went wrong. Here’s an example of how to undo a rebase using `git reflog`.

step 1: Do a bad rebase (for example run `git rebase -i HEAD^^^^^` and just delete 3 commits).

step 2: Run `git reflog`. You should see something like this:

```
ee244c4 (HEAD -> main) HEAD@{0}: rebase (finish): returning to refs/heads/main
ee244c4 (HEAD -> main) HEAD@{1}: rebase (pick): test
fdb8d73 HEAD@{2}: rebase (start): checkout HEAD^^^^^^^
ca7fe25 HEAD@{3}: commit: 16 bits by default
073bc72 HEAD@{4}: commit: only show tooltips on desktop
```

step 3: Find the entry immediately before `rebase (start)`. In my case that’s `ca7fe25`.

step 4: Run `git reset --hard ca7fe25`.

A couple of other ways to undo a rebase:

- Apparently `@` always refers to your current branch in git, so you can run `git reset --hard @{1}` to reset your branch to its previous location.
- Another solution folks mentioned that avoids having to use the reflog is to make a “backup branch” with `git switch -c backup` before rebasing, so you can easily get back to the old commit.

## force pushing to shared branches can cause lost work

A few people mentioned the following situation:

- You’re collaborating on a branch with someone.
- You push some changes.
- They rebase the branch and run `git push --force` (maybe by accident).
- Now when you run `git pull`, it’s a mess – you get a `fatal: Need to specify how to reconcile divergent branches` error.
- While trying to deal with the fallout you might lose some commits, especially if some of the people involved aren’t very comfortable with git.

This is an even worse situation than the “undoing a rebase is hard” situation, because the missing commits might be split across many different people’s machines, and the only thing worse than having to hunt through the reflog is multiple different people having to hunt through the reflog.

This has never happened to me because the only branch I’ve ever collaborated on is main, and main has always been protected from force pushing (in my experience the only way you can get something into main is through a pull request). So I’ve never even really been in a situation where this could happen. But I can definitely see how this would cause problems.

The main tools I know to avoid this are:

- don’t rebase on shared branches
- use `--force-with-lease` when force pushing, to make sure that nobody else has pushed to the branch since your last fetch

Apparently the “since your last fetch” is important here – if you run `git fetch` immediately before running `git push --force-with-lease`, the `--force-with-lease` won’t protect you at all.

I was curious about why people would run `git push --force` on a shared branch. Some reasons people gave were:

- they’re working on a collaborative feature branch, and the feature branch needs to be rebased onto main. The idea here is that you’re just really careful about coordinating the rebase so nothing gets lost.
- as an open source maintainer, sometimes they need to rebase a contributor’s branch to fix a merge conflict
- they’re new to git, read some instructions online that suggested `git rebase` and `git push --force` as a solution, and followed them without understanding the consequences
- they’re used to doing `git push --force` on a personal branch and ran it on a shared branch by accident

## force pushing makes code reviews harder

The situation here is:

- You make a pull request on GitHub.
- People leave some comments.
- You update the code to address the comments, rebase to clean up your commits, and force push.
- Now when the reviewer comes back, it’s hard for them to tell what you changed since the last time they saw it – all the commits show up as “new”.

One way to avoid this is to push new commits addressing the review comments, and then after the PR is approved do a rebase to reorganize everything.

I think some reviewers are more annoyed by this problem than others, it’s kind of a personal preference. Also this might be a GitHub-specific issue; other code review tools might have better tools for managing this.

## losing commit metadata

If you’re rebasing to squash commits, you can lose important commit metadata like `Co-Authored-By`.
Also if you GPG sign your commits, rebase loses the signatures.There’s probably other commit metadata that you can lose that I’m not thinking of.I haven’t run into this one so I’m not sure how to avoid it. I think GPG signing commits isn’t as popular as it used to be.more difficult revertsSomeone mentioned that it’s important for them to be able to easily revert merging any branch (in case the branch broke something), and if the branch contains multiple commits and was merged with rebase, then you need to do multiple reverts to undo the commits.In a merge workflow, I think you can revert merging any branch just by reverting the merge commit.rebasing can break intermediate commitsIf you’re trying to have a very clean commit history where the tests pass on every commit (very admirable!), rebasing can result in some intermediate commits that are broken and don’t pass the tests, even if the final commit passes the tests.Apparently you can avoid this by using git rebase -x to run the test suite at every step of the rebase and make sure that the tests are still passing. I’ve never done that though.accidentally run git commit --amend instead of git rebase --continueA couple of people mentioned issues with running git commit --amend instead of git rebase --continue when resolving a merge conflict.The reason this is confusing is that there are two reasons when you might want to edit files during a rebase:editing a commit (by using edit in git rebase -i), where you need to write git commit --amend when you’re donea merge conflict, where you need to run git rebase --continue when you’re doneIt’s very easy to get these two cases mixed up because they feel very similar. I think what goes wrong here is that you:Start a rebaseRun into a merge conflictResolve the merge conflict, and run git add file.txtRun git commit because that’s what you’re used to doing after you run git addBut you were supposed to run git rebase --continue! Now you have a weird extra commit, and maybe it has the wrong commit message and/or authorsplitting commits in an interactive rebase is hardThe whole point of rebase is to clean up your commit history, and combining commits with rebase is pretty easy. But what if you want to split up a commit into 2 smaller commits? It’s not as easy, especially if the commit you want to split is a few commits back! I actually don’t really know how to do it even though I feel very comfortable with rebase. I’d probably just do git reset HEAD^^^ or something and use git add -p to redo all my commits from scratch.One person shared their workflow for splitting commits with rebase.complex rebases are hardIf you try to do too many things in a single git rebase -i (reorder commits AND combine commits AND modify a commit), it can get really confusing.To avoid this, I personally prefer to only do 1 thing per rebase, and if I want to do 2 different things I’ll do 2 rebases.rebasing long lived branches can be annoyingIf your branch is long-lived (like for 1 month), having to rebase repeatedly gets painful. 
It might be easier to just do 1 merge at the end and only resolve the conflicts once.The dream is to avoid this problem by not having long-lived branches but it doesn’t always work out that way in practice.miscellaneous problemsA few more issues that I think are not that common:Stopping a rebase wrong: If you try to abort a rebase that’s going badly with git reset --hard instead of git rebase --abort, things will behave weirdly until you stop it properlyWeird interactions with merge commits: A couple of quotes about this: “If you rebase your working copy to keep a clean history for a branch, but the underlying project uses merges, the result can be ugly. If you do rebase -i HEAD~4 and the fourth commit back is a merge, you can see dozens of commits in the interactive editor.“, “I’ve learned the hard way to never rebase if I’ve merged anything from another branch”rebase and commit disciplineI’ve seen a lot of people arguing about rebase. I’ve been thinking about why this is and I’ve noticed that people work at a few different levels of “commit discipline”:Literally anything goes, “wip”, “fix”, “idk”, “add thing”When you make a pull request (on github/gitlab), squash all of your crappy commits into a single commit with a reasonable message (usually the PR title)Atomic Beautiful Commits – every change is split into the appropriate number of commits, where each one has a nice commit message and where they all tell a story around the change you’re makingOften I think different people inside the same company have different levels of commit discipline, and I’ve seen people argue about this a lot. Personally I’m mostly a Level 2 person. I think Level 3 might be what people mean when they say “clean commit history”.I think Level 1 and Level 2 are pretty easy to achieve without rebase – for level 1, you don’t have to do anything, and for level 2, you can either press “squash and merge” in github or run git switch main; git merge --squash mybranch on the command line.But for Level 3, you either need rebase or some other tool (like GitUp) to help you organize your commits to tell a nice story.I’ve been wondering if when people argue about whether people “should” use rebase or not, they’re really arguing about which minimum level of commit discipline should be required.I think how this plays out also depends on how big the changes folks are making – if folks are usually making pretty small pull requests anyway, squashing them into 1 commit isn’t a big deal, but if you’re making a 6000-line change you probably want to split it up into multiple commits.a “squash and merge” workflowA couple of people mentioned using this workflow that doesn’t use rebase:make commitsRun git merge main to merge main into the branch periodically (and fix conflicts if necessary)When you’re done, use GitHub’s “squash and merge” feature (which is the equivalent of running git checkout main; git merge --squash mybranch) to squash all of the changes into 1 commit. 
This gets rid of all the “ugly” merge commits.I originally thought this would make the log of commits on my branch too ugly, but apparently git log main..mybranch will just show you the changes on your branch, like this:$ git log main..mybranch756d4af (HEAD -> mybranch) Merge branch 'main' into mybranch20106fd Merge branch 'main' into mybranchd7da423 some commit on my branch85a5d7d some other commit on my branchOf course, the goal here isn’t to force people who have made beautiful atomic commits to squash their commits – it’s just to provide an easy option for folks to clean up a messy commit history (“add new feature; wip; wip; fix; fix; fix; fix; fix;“) without having to use rebase.I’d be curious to hear about other people who use a workflow like this and if it works well.there are more problems than I expectedI went into this really feeling like “rebase is fine, what could go wrong?” But many of these problems actually have happened to me in the past, it’s just that over the years I’ve learned how to avoid or fix all of them.And I’ve never really seen anyone share best practices for rebase, other than “never force push to a shared branch”. All of these honestly make me a lot more reluctant to recommend using rebase.To recap, I think these are my personal rebase rules I follow:stop a rebase if it’s going badly instead of letting it finish (with git rebase --abort)know how to use git reflog to undo a bad rebasedon’t rebase a million tiny commits (instead do it in 2 steps: git rebase -i HEAD^^^^ and then git rebase main)don’t do more than one thing in a git rebase -i. Keep it simple.never force push to a shared branchnever rebase commits that have already been pushed to mainThanks to Marco Rogers for encouraging me to think about the problems people have with rebase, and to everyone on Mastodon who helped with this.
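Putting those rules together, here is roughly what a cautious rebase of a feature branch looks like. This is a sketch, not gospel; the branch names are made up:

```
$ git switch mybranch
$ git switch -c mybranch-backup     # cheap insurance before rewriting history
$ git switch mybranch
$ git rebase -i HEAD~8              # step 1: squash the tiny commits first
$ git rebase main                   # step 2: rebase onto main, fix conflicts once
$ git push --force-with-lease origin mybranch   # never plain --force
```

And if a rebase starts going badly partway through: `git rebase --abort`, not `git reset --hard`.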
# Confusing git terminology

Hello! I’m slowly working on explaining git. One of my biggest problems is that after almost 15 years of using git, I’ve become very used to git’s idiosyncrasies and it’s easy for me to forget what’s confusing about it.

So I asked people on Mastodon:

> what git jargon do you find confusing? thinking of writing a blog post that explains some of git’s weirder terminology: “detached HEAD state”, “fast-forward”, “index/staging area/staged”, “ahead of ‘origin/main’ by 1 commit”, etc

I got a lot of GREAT answers and I’ll try to summarize some of them here. Here’s a list of the terms:

- HEAD and “heads”
- “detached HEAD state”
- “ours” and “theirs” while merging or rebasing
- “Your branch is up to date with ‘origin/main’”
- HEAD^, HEAD~, HEAD^^, HEAD~~, HEAD^2, HEAD~2
- .. and …
- “can be fast-forwarded”
- “reference”, “symbolic reference”
- refspecs
- “tree-ish”
- “index”, “staged”, “cached”
- “reset”, “revert”, “restore”
- “untracked files”, “remote-tracking branch”, “track remote branch”
- checkout
- reflog
- merge vs rebase vs cherry-pick
- rebase --onto
- commit
- more confusing terms

I’ve done my best to explain what’s going on with these terms, but they cover basically every single major feature of git, which is definitely too much for a single blog post, so it’s pretty patchy in some places.

## HEAD and “heads”

A few people said they were confused by the terms HEAD and refs/heads/main, because it sounds like it’s some complicated technical internal thing.

Here’s a quick summary:

- “heads” are “branches”. Internally in git, branches are stored in a directory called `.git/refs/heads`. (Technically the official git glossary says that the branch is all the commits on it and the head is just the most recent commit, but they’re 2 different ways to think about the same thing.)
- HEAD is the current branch. It’s stored in `.git/HEAD`.

I think that “a head is a branch, HEAD is the current branch” is a good candidate for the weirdest terminology choice in git, but it’s definitely too late for a clearer naming scheme, so let’s move on.

There are some important exceptions to “HEAD is the current branch”, which we’ll talk about next.

## “detached HEAD state”

You’ve probably seen this message:

```
$ git checkout v0.1
You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.
[...]
```

Here’s the deal with this message:

- In Git, usually you have a “current branch” checked out, for example main.
- The place the current branch is stored is called HEAD.
- Any new commits you make will get added to your current branch, and if you run `git merge other_branch`, that will also affect your current branch.

But HEAD doesn’t have to be a branch!
Instead it can be a commit ID.Git calls this state (where HEAD is a commit ID instead of a branch) “detached HEAD state”For example, you can get into detached HEAD state by checking out a tag, because a tag isn’t a branchif you don’t have a current branch, a bunch of things break:git pull doesn’t work at all (since the whole point of it is to update your current branch)neither does git push unless you use it in a special waygit commit, git merge, git rebase, and git cherry-pick do still work, but they’ll leave you with “orphaned” commits that aren’t connected to any branch, so those commits will be hard to findYou can get out of detached HEAD state by either creating a new branch or switching to an existing branch“ours” and “theirs” while merging or rebasingIf you have a merge conflict, you can run git checkout --ours file.txt to pick the version of file.txt from the “ours” side. But which side is “ours” and which side is “theirs”?I always find this confusing and I never use git checkout --ours because of that, but I looked it up to see which is which.For merges, here’s how it works: the current branch is “ours” and the branch you’re merging in is “theirs”, like this. Seems reasonable.$ git checkout merge-into-ours # current branch is "ours"$ git merge from-theirs # branch we're merging in is "theirs"For rebases it’s the opposite – the current branch is “theirs” and the target branch we’re rebasing onto is “ours”, like this:$ git checkout theirs # current branch is "theirs"$ git rebase ours # branch we're rebasing onto is "ours"I think the reason for this is that under the hood git rebase main is repeatedly merging commits from the current branch into a copy of the main branch (you can see what I mean by that in this weird shell script the implements git rebase using git merge. But I still find it confusing.This nice tiny site explains the “ours” and “theirs” terms.A couple of people also mentioned that VSCode calls “ours”/“theirs” “current change”/“incoming change”, and that it’s confusing in the exact same way.“Your branch is up to date with ‘origin/main’”This message seems straightforward – it’s saying that your main branch is up to date with the origin!But it’s actually a little misleading. You might think that this means that your main branch is up to date. It doesn’t. What it actually means is – if you last ran git fetch or git pull 5 days ago, then your main branch is up to date with all the changes as of 5 days ago.So if you don’t realize that, it can give you a false sense of security.I think git could theoretically give you a more useful message like “is up to date with the origin’s main as of your last fetch 5 days ago” because the time that the most recent fetch happened is stored in the reflog, but it doesn’t.HEAD^, HEAD~ HEAD^^, HEAD~~, HEAD^2, HEAD~2I’ve known for a long time that HEAD^ refers to the previous commit, but I’ve been confused for a long time about the difference between HEAD~ and HEAD^.I looked it up, and here’s how these relate to each other:HEAD^ and HEAD~ are the same thing (1 commit ago)HEAD^^^ and HEAD~~~ and HEAD~3 are the same thing (3 commits ago)HEAD^3 refers the the third parent of a commit, and is different from HEAD~3This seems weird – why are HEAD~ and HEAD^ the same thing? And what’s the “third parent”? Is that the same thing as the parent’s parent’s parent? (spoiler: it isn’t) Let’s talk about it!Most commits have only one parent. But merge commits have multiple parents – they’re merging together 2 or more commits. 
In Git, HEAD^ means “the parent of the HEAD commit”. But what if HEAD is a merge commit? What does HEAD^ refer to?

The answer is that HEAD^ refers to the first parent of the merge, HEAD^2 is the second parent, HEAD^3 is the third parent, etc.

But I guess they also wanted a way to refer to “3 commits ago”, so HEAD^3 is the third parent of the current commit (which may have many parents if it’s a merge commit), and HEAD~3 is the parent’s parent’s parent.

I think in the context of the merge commit ours/theirs discussion earlier, HEAD^ is “ours” and HEAD^2 is “theirs”.

## .. and …

Here are two commands:

```
git log main..test
git log main...test
```

What’s the difference between `..` and `...`? I never use these so I had to look it up in `man git-range-diff`. It seems like the answer is that in this case:

```
A - B (main)
 \
  C - D (test)
```

- `main..test` is commits C and D
- `test..main` is commit B
- `main...test` is commits B, C, and D

But it gets worse: apparently `git diff` also supports `..` and `...`, but they do something completely different than they do with `git log`? I think the summary is:

- `git log test..main` shows changes on main that aren’t on test, whereas `git log test...main` shows changes on both sides.
- `git diff test..main` shows test changes and main changes (it diffs B and D), whereas `git diff test...main` diffs A and D (it only shows you the diff on one side).

this blog post talks about it a bit more.

## “can be fast-forwarded”

Here’s a very common message you’ll see in `git status`:

```
$ git status
On branch main
Your branch is behind 'origin/main' by 2 commits, and can be fast-forwarded.
  (use "git pull" to update your local branch)
```

What does “fast-forwarded” mean? Basically it’s trying to say that the two branches look something like this (newest commits are on the right):

```
main:        A - B - C
origin/main: A - B - C - D - E
```

or visualized another way:

```
A - B - C - D - E (origin/main)
        |
        main
```

Here origin/main just has 2 extra commits that main doesn’t have, so it’s easy to bring main up to date – we just need to add those 2 commits. Literally nothing can possibly go wrong – there’s no possibility of merge conflicts. A fast forward merge is a very good thing! It’s the easiest way to combine 2 branches.

After running `git pull`, you’ll end up in this state:

```
main:        A - B - C - D - E
origin/main: A - B - C - D - E
```

Here’s an example of a state which can’t be fast-forwarded:

```
A - B - C - X (main)
        |
        - - D - E (origin/main)
```

Here main has a commit that origin/main doesn’t have (X). So you can’t do a fast forward. In that case, `git status` would say:

```
$ git status
Your branch and 'origin/main' have diverged,
and have 1 and 2 different commits each, respectively.
```

## “reference”, “symbolic reference”

I’ve always found the term “reference” kind of confusing. There are at least 3 things that get called “references” in git:

- branches and tags like `main` and `v0.2`
- HEAD, which is the current branch
- things like `HEAD^^^`, which git will resolve to a commit ID.
Technically these are probably not “references”, I guess git calls them “revision parameters” but I’ve never used that term.“symbolic reference” is a very weird term to me because personally I think the only symbolic reference I’ve ever used is HEAD (the current branch), and HEAD has a very central place in git (most of git’s core commands’ behaviour depends on the value of HEAD), so I’m not sure what the point of having it as a generic concept is.refspecsWhen you configure a git remote in .git/config, there’s this +refs/heads/main:refs/remotes/origin/main thing.[remote "origin"]url = git@github.com:jvns/pandas-cookbookfetch = +refs/heads/main:refs/remotes/origin/mainI don’t really know what this means, I’ve always just used whatever the default is when you do a git clone or git remote add, and I’ve never felt any motivation to learn about it or change it from the default.“tree-ish”The man page for git checkout says:git checkout [-f|--ours|--theirs|-m|--conflict=<style>] [<tree-ish>] [--] <pathspec>...What’s tree-ish??? What git is trying to say here is when you run git checkout THING ., THING can be either:a commit ID (like 182cd3f)a reference to a commit ID (like main or HEAD^^ or v0.3.2)a subdirectory inside a commit (like main:./docs)I think that’s it????Personally I’ve never used the “directory inside a commit” thing and from my perspective “tree-ish” might as well just mean “commit or reference to commit”.“index”, “staged”, “cached”All of these refer to the exact same thing (the file .git/index, which is where your changes are staged when you run git add):git diff --cachedgit rm --cachedgit diff --stagedthe file .git/indexEven though they all ultimately refer to the same file, there’s some variation in how those terms are used in practice:Apparently the flags --index and --cached do not generally mean the same thing. I have personally never used the --index flag so I’m not going to get into it, but this blog post by Junio Hamano (git’s lead maintainer) explains all the gnarly detailsthe “index” lists untracked files (I guess for performance reasons) but you don’t usually think of the “staging area” as including untracked files”“reset”, “revert”, “restore”A bunch of people mentioned that “reset”, “revert” and “restore” are very similar words and it’s hard to differentiate them.I think it’s made worse becausegit reset --hard and git restore . on their own do basically the same thing. (though git reset --hard COMMIT and git restore --source COMMIT . are completely different from each other)the respective man pages don’t give very helpful descriptions:git reset: “Reset current HEAD to the specified state”git revert: “Revert some existing commits”git restore: “Restore working tree files”Those short descriptions do give you a better sense for which noun is being affected (“current HEAD”, “some commits”, “working tree files”) but they assume you know what “reset”, “revert” and “restore” mean in this context.Here are some short descriptions of what they each do:git revert COMMIT: Create a new commit that’s the “opposite” of COMMIT on your current branch (if COMMIT added 3 lines, the new commit will delete those 3 lines)git reset --hard COMMIT: Force your current branch back to the state it was at COMMIT, erasing any new changes since COMMIT. 
Very dangerous operation.git restore --source=COMMIT PATH: Take all the files in PATH back to how they were at COMMIT, without changing any other files or commit history.“untracked files”, “remote-tracking branch”, “track remote branch”Git uses the word “track” in 3 different related ways:Untracked files: in the output of git status. This means those files aren’t managed by Git and won’t be included in commits.a “remote tracking branch” like origin/main. This is a local reference, and it’s the commit ID that main pointed to on the remote origin the last time you ran git pull or git fetch.“branch foo set up to track remote branch bar from origin”The “untracked files” and “remote tracking branch” thing is not too bad – they both use “track”, but the context is very different. No big deal. But I think the other two uses of “track” are actually quite confusing:main is a branch that tracks a remoteorigin/main is a remote-tracking branchBut a “branch that tracks a remote” and a “remote-tracking branch” are different things in Git and the distinction is pretty important! Here’s a quick summary of the differences:main is a branch. You can make commits to it, merge into it, etc. It’s often configured to “track” the remote main in .git/config, which means that you can use git pull and git push to push/pull changes.origin/main is not a branch. It’s a “remote-tracking branch”, which is not a kind of branch (I’m sorry). You can’t make commits to it. The only way you can update it is by running git pull or git fetch to get the latest state of main from the remote.I’d never really thought about this ambiguity before but I think it’s pretty easy to see why folks are confused by it.checkoutCheckout does two totally unrelated things:git checkout BRANCH switches branchesgit checkout file.txt discards your unstaged changes to file.txtThis is well known to be confusing and git has actually split those two functions into git switch and git restore (though you can still use checkout if, like me, you have 15 years of muscle memory around git checkout that you don’t feel like unlearning)Also personally after 15 years I still can’t remember the order of the arguments to git checkout main file.txt for restoring the version of file.txt from the main branch.I think sometimes you need to pass -- to checkout as an argument somewhere to help it figure out which argument is a branch and which ones are paths but I never do that and I’m not sure when it’s needed.reflogLots of people mentioning reading reflog as re-flog and not ref-log. I won’t get deep into the reflog here because this post is REALLY long but:“reference” is an umbrella term git uses for branches, tags, and HEADthe reference log (“reflog”) gives you the history of everything a reference has ever pointed toIt can help get you out of some VERY bad git situations, like if you accidentally delete an important branchI find it one of the most confusing parts of git’s UI and I try to avoid needing to use it.merge vs rebase vs cherry-pickA bunch of people mentioned being confused about the difference between merge and rebase and not understanding what the “base” in rebase was supposed to be.I’ll try to summarize them very briefly here, but I don’t think these 1-line explanations are that useful because people structure their workflows around merge / rebase in pretty different ways and to really understand merge/rebase you need to understand the workflows. Also pictures really help. 
That could really be its whole own blog post though so I’m not going to get into it.merge creates a single new commit that merges the 2 branchesrebase copies commits on the current branch to the target branch, one at a time.cherry-pick is similar to rebase, but with a totally different syntax (one big difference is that rebase copies commits FROM the current branch, cherry-pick copies commits TO the current branch)rebase --ontogit rebase has an flag called onto. This has always seemed confusing to me because the whole point of git rebase main is to rebase the current branch onto main. So what’s the extra onto argument about?I looked it up, and --onto definitely solves a problem that I’ve rarely/never actually had, but I guess I’ll write down my understanding of it anyway.A - B - C (main)\D - E - F - G (mybranch)|otherbranchImagine that for some reason I just want to move commits F and G to be rebased on top of main. I think there’s probably some git workflow where this comes up a lot.Apparently you can run git rebase --onto main otherbranch mybranch to do that. It seems impossible to me to remember the syntax for this (there are 3 different branch names involved, which for me is too many), but I heard about it from a bunch of people so I guess it must be useful.commitSomeone mentioned that they found it confusing that commit is used both as a verb and a noun in git.for example:verb: “Remember to commit often”noun: “the most recent commit on main“My guess is that most folks get used to this relatively quickly, but this use of “commit” is different from how it’s used in SQL databases, where I think “commit” is just a verb (you “COMMIT” to end a transaction) and not a noun.Also in git you can think of a Git commit in 3 different ways:a snapshot of the current state of every filea diff from the parent commita history of every previous commitNone of those are wrong: different commands use commits in all of these ways. For example git show treats a commit as a diff, git log treats it as a history, and git restore treats it as a snapshot.But git’s terminology doesn’t do much to help you understand in which sense a commit is being used by a given command.more confusing termsHere are a bunch more confusing terms. I don’t know what a lot of these mean.things I don’t really understand myself:“the git pickaxe” (maybe this is git log -S and git log -G, for searching the diffs of previous commits?)submodules (all I know is that they don’t work the way I want them to work)“cone mode” in git sparse checkout (no idea what this is but someone mentioned it)things that people mentioned finding confusing but that I left out of this post because it was already 3000 words:blob, treethe direction of “merge”“origin”, “upstream”, “downstream”that push and pull aren’t oppositesthe relationship between fetch and pull (pull = fetch + merge)git porcelainsubtreesworktreesthe stash“master” or “main” (it sounds like it has a special meaning inside git but it doesn’t)when you need to use origin main (like git push origin main) vs origin/maingithub terms people mentioned being confused by:“pull request” (vs “merge request” in gitlab which folks seemed to think was clearer)what “squash and merge” and “rebase and merge” do (I’d never actually heard of git merge --squash until yesterday, I thought “squash and merge” was a special github feature)it’s genuinely “every git term”I was surprised that basically every other core feature of git was mentioned by at least one person as being confusing in some way. 
I’d be interested in hearing more examples of confusing git terms that I missed too.There’s another great post about this from 2012 called the most confusing git terminology. It talks more about how git’s terminology relates to CVS and Subversion’s terminology.If I had to pick the 3 most confusing git terms, I think right now I’d pick:a head is a branch, HEAD is the current branch“remote tracking branch” and “branch that tracks a remote” being different thingshow “index”, “staged”, “cached” all refer to the same thingthat’s all!I learned a lot from writing this – I learned a few new facts about git, but more importantly I feel like I have a slightly better sense now for what someone might mean when they say that everything in git is confusing.I really hadn’t thought about a lot of these issues before – like I’d never realized how “tracking” is used in such a weird way when discussing branches.Also as usual I might have made some mistakes, especially since I ended up in a bunch of corners of git that I hadn’t visited before.Also a very quick plug: I’m working on writing a zine about git, if you’re interested in getting an email when it comes out you can sign up to my very infrequent announcements mailing list.
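One concrete way to demystify a few of these terms at once is to look at what HEAD actually is in your own repository. A small sketch (the `main` branch and `v0.1` tag are examples; output will differ in your repo):

```
$ cat .git/HEAD                # HEAD is literally a one-line file
ref: refs/heads/main
$ git symbolic-ref -q HEAD     # prints the branch; fails quietly if detached
refs/heads/main
$ git checkout v0.1            # checking out a tag detaches HEAD
$ cat .git/HEAD                # now it's a bare commit ID, not a ref
f414f31c8a14d79e4f3ab6ba4b9b3454c47cba52
```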
# Unified Versus Split Diff

Oct 23, 2023

Which is better for code reviews, a unified diff or a split diff?

A split diff looks like this for me:

*(screenshot: a side-by-side split diff)*

And this is a unified one:

*(screenshot: a unified diff)*

If the changes are simple and small, both views are good. But for larger, more complex changes neither works for me.

For a large change, I don’t want to do a “diff review”, I want to do a proper code review of a codebase at a particular instant in time, paying specific attention to the recently changed areas, but mostly just doing general review, as if I am writing the code. I need to run tests, use goto definition and other editor navigation features, apply local changes to check if some things could have been written differently, look at the wider context to notice things that should have been changed, and in general notice anything that might be not quite right with the codebase, irrespective of the historical path to the current state of the code.

So, for me, the ideal diff view would look rather like this:

*(screenshot: editor with the current code on the left, changes marked in the margin, and a unified diff of the visible region on the right)*

On the left, the current state of the code (which is also the on-disk state), with changes subtly highlighted in the margins. On the right, the unified diff for the portion of the codebase currently visible on the left.

Sadly, this format of review isn’t well supported by the tools — everyone seems to be happy reviewing diffs, rather than the actual code?

I have a low-tech and pretty inefficient workflow for this style of review. A `gpr` script for checking out a pull request locally:

```
$ gpr 1234 --review
```

Internally, it does roughly

```
$ git fetch upstream refs/pull/1234/head
$ git switch --detach FETCH_HEAD
$ git reset $(git merge-base HEAD main)
```

The last line is the key — it erases all the commits from the pull request, but keeps all of the changes. This lets me abuse my workflow for staging&committing to do a code review — edamagit shows the list of changed files, I get “go to next/previous change” shortcuts in the editor, I can even use the staging area to mark hunks I have reviewed.

The only thing I don’t get is automatic synchronization between the magit status buffer and the file that’s currently open in the editor. That is, to view the current file and the diff on the side, I have to manually open the diff and scroll it to the point I am currently looking at.

I wish it was easier to get this close to the code without building custom ad-hoc tools!

P.S. This post talks about how to review code, but reviewing the code is not necessarily the primary goal of code review. See this related post: Two Kinds of Code Review.
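The post describes `gpr` only in terms of the three commands it runs, so here is a hypothetical minimal version of it as a shell function (the `upstream` remote name and `main` base branch are assumptions, and this is not the author’s actual script):

```sh
# Sketch of the `gpr` helper described above.
gpr() {
  pr="$1"
  # fetch the PR head without creating a local branch
  git fetch upstream "refs/pull/${pr}/head"
  git switch --detach FETCH_HEAD
  # drop the PR's commits but keep their changes in the working tree
  git reset "$(git merge-base HEAD main)"
}
```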
# Some miscellaneous git facts

I’ve been very slowly working on writing about how Git works. I thought I already knew Git pretty well, but as usual when I try to explain something I’ve been learning some new things.

None of these things feel super surprising in retrospect, but I hadn’t thought about them clearly before.

The facts are:

- the “index”, “staging area” and “--cached” are all the same thing
- the stash is a bunch of commits
- not all references are branches or tags
- merge commits aren’t empty

Let’s talk about them!

## the “index”, “staging area” and “--cached” are all the same thing

When you run `git add file.txt`, and then `git status`, you’ll see something like this:

```
$ git add content/post/2023-10-20-some-miscellaneous-git-facts.markdown
$ git status
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	new file:   content/post/2023-10-20-some-miscellaneous-git-facts.markdown
```

People usually call this “staging a file” or “adding a file to the staging area”.

When you stage a file with `git add`, behind the scenes git adds the file to its object database (in `.git/objects`) and updates a file called `.git/index` to refer to the newly added file.

This “staging area” actually gets referred to by 3 different names in Git. All of these refer to the exact same thing (the file `.git/index`):

- `git diff --cached`
- `git diff --staged`
- the file `.git/index`

I felt like I should have realized this earlier, but I didn’t, so there it is.

## the stash is a bunch of commits

When I run `git stash` to stash my changes, I’ve always been a bit confused about where those changes actually went. It turns out that when you run `git stash`, git makes some commits with your changes and labels them with a reference called `stash` (in `.git/refs/stash`).

Let’s stash this blog post and look at the log of the stash reference:

```
$ git log stash --oneline
6cb983fe (refs/stash) WIP on main: c6ee55ed wip
2ff2c273 index on main: c6ee55ed wip
... some more stuff
```

Now we can look at the commit `2ff2c273` to see what it contains:

```
$ git show 2ff2c273 --stat
commit 2ff2c273357c94a0087104f776a8dd28ee467769
Author: Julia Evans <julia@jvns.ca>
Date:   Fri Oct 20 14:49:20 2023 -0400

    index on main: c6ee55ed wip

 content/post/2023-10-20-some-miscellaneous-git-facts.markdown | 40 ++++++++++++++++++++++++++++++++++++++++
```

Unsurprisingly, it contains this blog post. Makes sense!

`git stash` actually creates 2 separate commits: one for the index, and one for your changes that you haven’t staged yet. I found this kind of heartening, because I’ve been working on a tool to snapshot and restore the state of a git repository (that I may or may not ever release) and I came up with a very similar design, so that made me feel better about my choices.

Apparently older commits in the stash are stored in the reflog.

## not all references are branches or tags

Git’s documentation often refers to “references” in a generic way that I find a little confusing sometimes. Personally 99% of the time when I deal with a “reference” in Git it’s a branch or HEAD, and the other 1% of the time it’s a tag. I actually didn’t know ANY examples of references that weren’t branches or tags or HEAD.

But now I know one example – the stash is a reference, and it’s not a branch or tag! So that’s cool.

Here are all the references in my blog’s git repository (other than HEAD):

```
$ find .git/refs -type f
.git/refs/heads/main
.git/refs/remotes/origin/HEAD
.git/refs/remotes/origin/main
.git/refs/stash
```

Some other references people mentioned in responses to this post:

- `refs/notes/*`, from `git notes`
- `refs/pull/123/head` and `refs/pull/123/merge` for GitHub pull requests (which you can get with `git fetch origin refs/pull/123/merge`)
- `refs/bisect/*`, from `git bisect`

## merge commits aren’t empty

Here’s a toy git repo where I created two branches `x` and `y`, each with 1 file (`x.txt` and `y.txt`), and merged them. Let’s look at the merge commit.

```
$ git log --oneline
96a8afb (HEAD -> y) Merge branch 'x' into y
0931e45 y
1d8bd2d (x) x
```

If I run `git show 96a8afb`, the commit looks “empty”: there’s no diff!

```
$ git show 96a8afb
commit 96a8afbf776c2cebccf8ec0dba7c6c765ea5d987 (HEAD -> y)
Merge: 0931e45 1d8bd2d
Author: Julia Evans <julia@jvns.ca>
Date:   Fri Oct 20 14:07:00 2023 -0400

    Merge branch 'x' into y
```

But if I diff the merge commit against each of its two parent commits separately, you can see that of course there is a diff:

```
$ git diff 0931e45 96a8afb --stat
 x.txt | 1 +
 1 file changed, 1 insertion(+)
$ git diff 1d8bd2d 96a8afb --stat
 y.txt | 1 +
 1 file changed, 1 insertion(+)
```

It seems kind of obvious in retrospect that merge commits aren’t actually “empty” (they’re snapshots of the current state of the repo, just like any other commit), but I’d never thought about why they appear to be empty.

Apparently the reason that these merge diffs are empty is that merge diffs only show conflicts – if I instead create a repo with a merge conflict (one branch added `x` and another branch added `y` to the same file), and show the merge commit where I resolved the conflict, it looks like this:

```
$ git show HEAD
commit 3bfe8311afa4da867426c0bf6343420217486594
Merge: 782b3d5 ac7046d
Author: Julia Evans <julia@jvns.ca>
Date:   Fri Oct 20 15:29:06 2023 -0400

    Merge branch 'x' into y

diff --cc file.txt
index 975fbec,587be6b..b680253
--- a/file.txt
+++ b/file.txt
@@@ -1,1 -1,1 +1,1 @@@
- y
 -x
++z
```

It looks like this is trying to tell me that one branch added `x`, another branch added `y`, and the merge commit resolved it by putting `z` instead. But in the earlier example, there was no conflict, so Git didn’t display a diff at all.

(thanks to Jordi for telling me how merge diffs work)

## that’s all!

I’ll keep this post short, maybe I’ll write another blog post with more git facts as I learn them.
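You can also poke at the stash’s commit structure directly. A small sketch, reusing the hashes from the example above (yours will differ): `git cat-file -p` prints the raw commit object, so you can see its two parents, your HEAD commit and the separate index commit:

```
$ git cat-file -p stash
tree 4b825dc6...
parent c6ee55ed...   # the commit you were on when you stashed
parent 2ff2c273...   # the "index on main" commit
author ...

WIP on main: c6ee55ed wip
$ git reflog stash   # older stash entries live in the stash reflog
6cb983fe stash@{0}: WIP on main: c6ee55ed wip
```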
# Document TitleCRDT Survey, Part 2: Semantic TechniquesMatthew Weidner | Oct 17th, 2023Home | RSS FeedKeywords: CRDTs, collaborative apps, semanticsThis blog post is Part 2 of a series.Part 1: IntroductionPart 2: Semantic TechniquesPart 3: Algorithmic TechniquesPart 4: Further Topics# Semantic TechniquesIn Part 1, I defined a collaborative app’s semantics as an abstract definition of what the app’s state should be, given the operations that users have performed.Your choice of semantics should be informed by users’ intents and expectations: if one user does X while an offline user concurrently does Y, what do the users want to happen when they sync up? Even after you figure out specific scenarios, though, it is tricky to design a strategy that is well-defined in every situation (multi-way concurrency, extensive offline work, etc.).CRDT semantic techniques help you with this goal. Like the data structures and design patterns that you learn about when programming single-user apps, these techniques provide valuable guidance, but they are not a replacement for deciding what your app should actually do.The techniques come in various forms:Specific building blocks - e.g., list CRDT positions. (Single-user app analogy: specific data structures like a hash map.)General-purpose ideas that must be applied wisely - e.g., unique IDs. (Single-user analogy: object-oriented programming techniques.)Example semantics for specific parts of a collaborative app - e.g., a list with a move operation. (Single-user analogy: Learning from an existing app’s architecture.)Some of these techniques will be familiar if you’ve read Designing Data Structures for Collaborative Apps, but I promise there are new ones here as well.# Table of ContentsThis post is meant to be usable as a reference. However, some techniques build on prior techniques. I recommend reading linearly until you reach Composed Examples, then hopping around to whatever interests you.Describing SemanticsCausal OrderBasic TechniquesUnique IDs (UIDs) • Append-Only Log • Unique Set • Lists and Text Editing • Last Writer Wins (LWW) • LWW Map • Multi-Value Register • Multi-Value MapComposition TechniquesViews • Objects • Nested Objects • Map-Like Object • Unique Set of CRDTs • List of CRDTsComposed ExamplesAdd-Wins Set • List-with-Move • Internally-Mutable Register • CRDT-Valued Map • Archiving Collections • Update-Wins Collections • Spreadsheet GridAdvanced TechniquesFormatting Marks (Rich Text) • Spreadsheet Formatting • Global Modifiers • Forests and Trees • Undo/RedoOther TechniquesRemove-Wins Set • PN-Set • Observed-Reset Operations • Querying the Causal Order • Topological SortCapstonesRecipe Editor • Block-Based Rich Text# Describing SemanticsI’ll describe a CRDT’s semantics by specifying a pure function of the operation history: a function that inputs the history of operations that users have performed, and outputs the current app-visible state.A box with six "+1"s labeled "Operation history", an arrow labeled "Semantic function", and a large 6 labeled "App state".Note that I don’t expect you to implement a literal “operation history + pure function”; that would be inefficient. Instead, you are supposed to implement an algorithm that gives the same result. E.g., an op-based CRDT that satisfies: whenever a user has received the messages corresponding to operations S, the user’s state matches the pure function applied to S. 
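To make that definition concrete, here is one way to write it down as a formula. The notation is mine, not the survey's: let \(O_u\) be the set of operations user \(u\) has received, and let \(<\) be the causal order on those operations (defined in the next section). Then a CRDT's semantics is a pure function \(f\), and

\[ \mathrm{state}(u) \;=\; f\bigl(O_u,\; <\bigr) \]

Strong convergence is exactly the statement that the right-hand side depends only on the set \(O_u\) and the causal order, not on the order in which the operations happened to arrive.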
I’ll give a few of these algorithms below, and more in Part 3.

More precisely, I’ll describe a CRDT’s semantics as:

1. A collection of operations that users are allowed to perform on the CRDT. Example: Call inc() to increment a counter.
2. For each operation, a translated version that gets stored in the (abstract) operation history. Example: When a user deletes the ingredient at index 0 in an ingredients list, we might instead store the operation Delete the ingredient with unique ID <xyz>.
3. A pure function that inputs a set of translated operations and some ordering metadata (next paragraph), and outputs the intended state of a user who is aware of those operations. Example: A counter’s semantic function inputs the set of inc() operations and outputs its size, ignoring ordering metadata.

The “ordering metadata” is a collection of arrows indicating which operations were aware of each other. E.g., here is a diagram representing the operation history from Part 1:

[Figure: Operations A-G with arrows A to B, B to C, C to D, D to E, C to F, F to G. The labels are: "Add ingr 'Broc: 1 ct' w/ UID <xyz>"; "Add ingr 'Oil: 15 mL' w/ UID <abc>"; "Add ingr 'Salt: 2 mL' w/ UID <123>"; "Delete ingr <xyz>"; "Set amt <123> to 3 mL"; "Prepend 'Olive ' to ingr <abc>"; "Halve the recipe".]

- One user performs a sequence of operations to create the initial recipe.
- After seeing those operations, two users concurrently do Delete ingredient <xyz> and Prepend "Olive " to ingredient <abc>.
- After seeing each other’s operations, the two users do two more concurrent operations.

I’ll use diagrams like this throughout the post to represent operation histories. You can think of them like git commit graphs, except that each point is labeled with its operation instead of its state/hash, and parallel “heads” (the rightmost points) are implicitly merged.

Example: A user who has received the above operation history already sees the result of both heads "Set amt <123> to 3 mL" and "Halve the recipe", even though there is no “merge commit”. If that user performs another operation, it will get arrows from both heads, like an explicit merge commit:

[Figure: Previous figure with an additional operation H labeled "Delete ingr <abc>" and arrows E to H, G to H.]

Describing semantics in terms of a pure function of the operation history lets us sidestep the usual CRDT rules like “concurrent messages must commute” and “the merge function must be idempotent”. Indeed, the point of those rules is to guarantee that a given CRDT algorithm corresponds to some pure function of the operation history (cf. Part 1’s definition of a CRDT). We instead directly say what pure function we want, then define CRDTs to match (or trust you to do so).

Strong convergence is the property that a CRDT’s state is a pure function of the operation history - i.e., users who have received the same set of ops are in equivalent states. Strong Eventual Consistency (SEC) additionally requires that two users who stop performing operations will eventually be in equivalent states; it follows from strong convergence in any network where users eventually exchange operations (Shapiro et al. 2011b).

These properties are necessary for collaborative apps, but they are not sufficient: you still need to check that your CRDT’s specific semantics are reasonable for your app. It is easy to forget this if you get bogged down in e.g. a proof that concurrent messages commute.

# Causal Order

Formally, arrows in our operation history diagrams indicate the “causal order” on operations.
We will use the causal order to define the multi-value register and some later techniques, so if you want a formal definition, read this section first (else you can skip ahead).

The causal order is the partial order < on pairs of operations defined by:

- If a user had received operation o before performing their own operation p, then o < p. This includes the case that they performed both o and p in that order.
- (Transitivity) If o < p and p < q, then o < q.

Our operation histories indicate o < p by drawing an arrow from o to p. Except, we omit arrows that are implied by transitivity - equivalently, by following a sequence of other arrows.

[Figure: Operations A, B, C, D, with arrows from A to B, B to C, A to D, and D to C.]

Figure 1. One user performs operations A, B, C in sequence. After receiving A but not B, another user performs D; the first user receives that before performing C. The causal order is then A < B, A < C, A < D, B < C, D < C. In the figure, the arrow for A < C is implied.

Some derived terms:

- When o < p, we say that o is causally prior to p / o is a causal predecessor of p, and p is causally greater than o.
- When we neither have o < p nor p < o, we say that o and p are concurrent.
- When o < p, you may also see the phrases “o happened-before p” or “p is causally aware of o”.
- o is an immediate causal predecessor of p if o < p and there is no r such that o < r < p. These are precisely the pairs (o, p) connected by an arrow in our operation histories: all non-immediate causal predecessors are implied by transitivity.

In the above figure, B is causally greater than A, causally prior to C, and concurrent to D. The immediate causal predecessors of C are B and D; A is a causal predecessor, but not an immediate one.

It is easy to track the causal order in a CRDT setting: label each operation by IDs for its immediate causal predecessors (the tails of its incoming arrows). Thus when choosing our “pure function of the operation history”, it is okay if that function queries the causal order. We will see an example of this in the multi-value register.

Often, CRDT-based apps choose to enforce causal-order delivery: a user’s app will not process an operation (updating the app’s state) until after processing all causally-prior operations. (An op may be processed in any order relative to concurrent operations.) In other words, operations are processed in causal order. This simplifies programming and makes sense to users, by providing a guarantee called causal consistency. For example, it ensures that if one user adds an ingredient to a recipe and then writes instructions for it, all users will see the ingredient before the instructions. However, there are times when you might choose to forgo causal-order delivery - e.g., when there are undone operations. (More examples in Part 4.)

In Collabs: CRuntime (causal-order delivery), vectorClock (causal order access)

Refs: Lamport 1978

# Basic Techniques

We begin with basic semantic techniques. Most of these were not invented as CRDTs; instead, they are database techniques or programming folklore. It is often easy to implement them yourself or use them outside of a traditional CRDT framework.

# Unique IDs (UIDs)

To refer to a piece of content, assign it an immutable Unique ID (UID). Use that UID in operations involving the content, instead of using a mutable descriptor like its index in a list.

Example: In a recipe editor, assign each ingredient a UID. When a user edits an ingredient’s amount, indicate which ingredient using its UID.
This solves Part 1’s example.

By “piece of content”, I mean anything that the user views as a distinct “thing”, with its own long-lived identity: an ingredient, a spreadsheet cell, a document, etc. Note that the content may be internally mutable. Other analogies:

- Anything that would be its own object in a single-user app. Its UID is the distributed version of a “pointer” to that object.
- Anything that would get its own row in a normalized database table. Its UID functions as the primary key.

To ensure that all users agree on a piece of content’s UID, the content’s creator should assign the UID at creation time and broadcast it. E.g., include a new ingredient’s UID in the corresponding “Add Ingredient” operation. The assigned UID must be unique even if multiple users create UIDs concurrently; you can ensure that by using UUIDs, or Part 3’s dot IDs.

UIDs are useful even in non-collaborative contexts. For example, a single-user spreadsheet formula that references cell B3 should store the UIDs of its column (B) and row (3) instead of the literal string “B3”. That way, the formula still references “the same cell” even if a new row shifts the cell to B4.

# Append-Only Log

Use an append-only log to record events indefinitely. This is a CRDT with a single operation add(x), where x is an immutable value to store alongside the event. Internally, add(x) gets translated to an operation add(id, x), where id is a new UID; this lets you distinguish events with the same values. Given an operation history made of these add(id, x) events, the current state is just the set of all pairs (id, x).

Example: In a delivery tracking system, each package’s history is an append-only log of events. Each event’s value describes what happened (scanned, delivered, etc.) and the wall-clock time. The app displays the events directly to the user in wall-clock time order. Conflicting concurrent operations indicate a real-world conflict and must be resolved manually.

I usually think of an append-only log as unordered, like a set (despite the word “append”). If you do want to display events in a consistent order, you can include a timestamp in the value and sort by that, or use a list CRDT (below) instead of an append-only log. Consider using a logical timestamp like in LWW, so that the order is compatible with the causal order: o < p implies o appears before p.

Refs: Log in Shapiro et al. 2011b

# Unique Set

A unique set is like an append-only log, but it also allows deletes. It is the basis for any collection that grows and shrinks dynamically: sets, lists, certain maps, etc.

Its operations are:

- add(x): Adds an operation add(id, x) to the history, where id is a new UID. (This is the same as the append-only log’s add(x), except that we call the entry an element instead of an event.)
- delete(id), where id is the UID of the element to be deleted.

Given an operation history, the unique set’s state is the set of pairs (id, x) such that there is an add(id, x) operation but no delete(id) operations.

[Figure: Operations A-F with arrows A to B, A to D, B to C, B to E, D to E, E to F. The labels are: add(ac63, "doc/Hund"); add(x72z, "cat/Katze"); delete(ac63); delete(ac63); add(8f8x, "chicken/Huhn"); delete(x72z).]

Figure 2. In a collaborative flash card app, you could represent the deck of cards as a unique set, using x to hold the flash card's value (its front and back strings). Users can edit the deck by adding a new card or deleting an existing one, and duplicate cards are allowed.
Given the above operation history, the current state is { (8f8x, "chicken/Huhn") }.

You can think of the unique set as an obvious way of working with UID-labeled content. It is analogous to a database table with operations to insert and delete rows, using the UID (= primary key) to identify rows. Or, thinking of UIDs like distributed pointers, add and delete are the distributed versions of new and free.

It’s easy to convert the unique set’s semantics to an op-based CRDT.

- Per-user state: The literal state, which is a set of pairs (id, x).
- Operation add(x): Generate a new UID id, then broadcast add(id, x). Upon receiving this message, each user (including the initiator) adds the pair (id, x) to their local state.
- Operation delete(id): Broadcast delete(id). Upon receiving this message, each user deletes the pair with the given id, if it is still present. Note: this assumes causal-order delivery - otherwise, you might receive delete(id) before add(id, x), then forget that the element is deleted.

A state-based CRDT is more difficult; Part 3 will give a nontrivial optimized algorithm.

Refs: U-Set in Shapiro et al. 2011a

# Lists and Text Editing

In collaborative text editing, users can insert (type) and delete characters in an ordered list. Inserting or deleting a character shifts later characters’ indices, in the style of JavaScript’s Array.splice.

The CRDT way to handle this is: assign each character a unique immutable list CRDT position when it’s typed. These positions are a special kind of UID that are ordered: given two positions p and q, you can ask whether p < q or q < p. Then the text’s state is given by:

1. Sort all list elements (position, char) by position.
2. Display the characters in that order.

Classic list CRDTs have operations insert and delete, which are like the unique set’s add and delete operations, except using positions instead of generic UIDs. A text CRDT is the same but with individual text characters for values. See a previous blog post for details.

But the real semantic technique is the positions themselves. Abstractly, they are “opaque things that are immutable and ordered”. To match users’ expectations, list CRDT positions must satisfy a few rules (Attiya et al. 2016):

1. The order is total: if p and q are distinct positions, then either p < q or q < p, even if p and q were created by different users concurrently.
2. If p < q on one user’s device at one time, then p < q on all users’ devices at all times. Example: characters in a collaborative text document do not reverse order, no matter what happens to characters around them.
3. If p < q and q < r, then p < r. This holds even if q is not currently part of the app’s state.

This definition still gives us some freedom in choosing <. The Fugue paper (myself and Martin Kleppmann, 2023) gives a particular choice of < and motivates why we think you should prefer it over any other. Seph Gentle’s Diamond Types and the Braid group’s Sync9 each independently chose nearly identical semantics (thanks to Michael Toomim and Greg Little for bringing the latter to our attention).

List CRDT positions are our first “real” CRDT technique - they don’t come from databases or programming folklore, and it is not obvious how to implement them. Their algorithms have a reputation for difficulty, but you usually only need to understand the “unique immutable position” abstraction, which is simple.
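To illustrate, here is a minimal sketch of that abstraction (my own toy version; a real list CRDT’s positions are opaque and come with a library-provided total order, not plain strings):

```ts
// Toy stand-in for a real list CRDT's opaque, totally ordered positions.
type Position = string;

function comparePositions(p: Position, q: Position): number {
  return p < q ? -1 : p > q ? 1 : 0;
}

type ListElement = { position: Position; char: string };

// The text's state: sort all elements (position, char) by position,
// then display the characters in that order.
function displayText(elements: ListElement[]): string {
  return [...elements]
    .sort((a, b) => comparePositions(a.position, b.position))
    .map((e) => e.char)
    .join("");
}
```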
You can even use list CRDT positions outside of a traditional CRDT framework, e.g., using my list-positions library.

Collabs: CValueList, Position

Refs: Many - see “Background and Related Work” in the Fugue paper

# Last Writer Wins (LWW)

If multiple users set a value concurrently, and there is no better way to resolve this conflict, just pick the “last” value as the winner. This is the Last Writer Wins (LWW) rule.

Example: Two users concurrently change the color of a pixel in a shared whiteboard. Use LWW to pick the final color.

Traditionally, “last” meant “the last value to reach the central database”. In a CRDT setting, instead, when a user performs an operation, their own device assigns a timestamp for that operation. The operation with the greatest assigned timestamp wins: its value is the one displayed to the user.

Formally, an LWW register is a CRDT representing a single variable, with sole operation set(value, timestamp). Given an operation history made of these set operations, the current state is the value with the largest timestamp.

[Figure: Operations A-E with arrows A to B, A to D, B to C, D to C, and D to E. The labels are: none; set("blue", (3, alice)); set("blue", (6, alice)); set("red", (5, bob)); set("green", (7, bob)).]

Figure 3. Possible operation history for an LWW register using logical timestamps (the pairs (3, alice)). The greatest assigned timestamp is (7, bob), so the current state is "green".

The timestamp should usually be a logical timestamp instead of literal wall-clock time (e.g., a Lamport timestamp). Otherwise, clock skew can cause a confusing situation: you try to overwrite the current local value with a new one, but your clock is behind, so the current value remains the winner. Lamport timestamps also build in a tiebreaker so that the winner is never ambiguous.

Let’s make these semantics concrete by converting them to a hybrid op-based/state-based CRDT. Specifically, we’ll do an LWW register with value type T.

- Per-user state: state = { value: T, time: LogicalTimestamp }.
- Operation set(newValue): Broadcast an op-based CRDT message { newValue, newTime }, where newTime is the current logical time. Upon receiving this message, each user (including the initiator) does: If newTime > state.time, set state = { value: newValue, time: newTime }.
- State-based merge: To merge in another user’s state other = { value, time }, treat it like an op-based message: if other.time > state.time, set state = other.

You can check that state.value always comes from the received operation with the greatest assigned timestamp, matching our semantics above.

When using LWW, pay attention to the granularity of writes. For example, in a slide editor, suppose one user moves an image while another user resizes it concurrently. If you implement both actions as writes to a single LWWRegister<{ x, y, width, height }>, then one action will overwrite the other - probably not what you want. Instead, use two different LWW registers, one for { x, y } and one for { width, height }, so that both actions can take effect.

Collabs: lamportTimestamp

Refs: Johnson and Thomas 1976; Shapiro et al. 2011a
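Here is a sketch of that hybrid algorithm in code (my own illustration; generating newTime, by incrementing a local Lamport clock, is elided):

```ts
// Lamport-style timestamps: a counter plus a user ID as a tiebreaker.
type LogicalTimestamp = { counter: number; userID: string };

function greater(a: LogicalTimestamp, b: LogicalTimestamp): boolean {
  if (a.counter !== b.counter) return a.counter > b.counter;
  return a.userID > b.userID; // arbitrary but consistent tiebreaker
}

class LWWRegister<T> {
  constructor(private state: { value: T; time: LogicalTimestamp }) {}

  // Op-based: the setter broadcasts { newValue, newTime }; every user
  // (including the initiator) calls receive() on it.
  receive(newValue: T, newTime: LogicalTimestamp): void {
    if (greater(newTime, this.state.time)) {
      this.state = { value: newValue, time: newTime };
    }
  }

  // State-based merge: treat the other state like an op-based message.
  merge(other: { value: T; time: LogicalTimestamp }): void {
    this.receive(other.value, other.time);
  }

  get value(): T {
    return this.state.value;
  }
}
```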
# LWW Map

An LWW map applies the last-writer-wins rule to each value in a map. Formally, its operations are set(key, value, timestamp) and delete(key, timestamp). The current state is given by:

- For each key, find the operation on key with the largest timestamp.
- If the operation is set(key, value, timestamp), then the value at key is value.
- Otherwise (i.e., the operation is delete(key, timestamp) or there are no operations on key), key is not present in the map.

Observe that a delete operation behaves just like a set operation with a special value. In particular, when implementing the LWW map, it is not safe to forget about deleted keys: you have to remember their latest timestamps as usual, for future LWW comparisons. Otherwise, your semantics might be ill-defined (not a pure function of the operation history), as pointed out by Kleppmann (2022).

In the next section, we’ll see an alternative semantics that does let you forget about deleted keys: the multi-value map.

# Multi-Value Register

This is another “real” CRDT technique, and our first technique that explicitly references the arrows in an operation history (formally, the causal order).

When multiple users set a value concurrently, sometimes you want to preserve all of the conflicting values, instead of just applying LWW.

Example: One user enters a complex, powerful formula in a spreadsheet cell. Concurrently, another user figures out the intended value by hand and enters that. The first user will be annoyed if the second user’s write erases their hard work.

The multi-value register does this, by following the rule: its current value is the set of all values that have not yet been overwritten. Specifically, it has a single operation set(x). Its current state is the set of all values whose operations are at the heads of the operation history (formally, the maximal operations in the causal order). For example, here the current state is { "gray", "blue" }:

[Figure: Operations A-F with arrows A to B, B to C, B to F, C to D, E to B. The labels are: set("green"); set("red"); set("green"); set("gray"); set("purple"); set("blue").]

Multi-values (also called conflicts) are hard to display, so you should have a single value that you show by default. This displayed value can be chosen arbitrarily (e.g. LWW), or by some semantic rule. For example:

- In a bug tracking system, if a bug has multiple conflicting priorities, display the highest one (Zawirski et al. 2016).
- For a boolean value, if any of the multi-values are true, display true. This yields the enable-wins flag CRDT. Alternatively, if any of the multi-values are false, display false (a disable-wins flag).

Other multi-values can be shown on demand, like in Pixelpusher, or just hidden.

As with LWW, pay attention to the granularity of writes.

The multi-value register sounds hard to implement because it references the causal order. But actually, if your app enforces causal-order delivery, then you can easily implement a multi-value register on top of a unique set.

- Per-user state: A unique set uSet of pairs (id, x). The multi-values are all of the x’s.
- Operation set(x): Locally, loop over uSet calling uSet.delete(id) on every existing element. Then call uSet.add(x).

Convince yourself that this gives the same semantics as above.

Collabs: CVar

Refs: Shapiro et al. 2011a; Zawirski et al. 2016
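A sketch of that construction, eliding the network layer (my own illustration): set() shows the ops the local user would broadcast; add()/delete() are the unique set’s ops, applied by every user. newUID() is a hypothetical UID source.

```ts
declare function newUID(): string;

class MultiValueRegister<T> {
  // The unique set's state: pairs (id, x).
  private uSet = new Map<string, T>();

  // Operation set(x): delete every existing element, then add x.
  // Two concurrent set() ops delete each other's predecessors but not
  // each other's new elements - so both values survive as multi-values.
  set(x: T): void {
    for (const id of [...this.uSet.keys()]) this.delete(id);
    this.add(newUID(), x);
  }

  add(id: string, x: T): void {
    this.uSet.set(id, x);
  }
  delete(id: string): void {
    this.uSet.delete(id);
  }

  // The multi-values are all of the x's.
  get values(): T[] {
    return [...this.uSet.values()];
  }
}
```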
# Multi-Value Map

Like the LWW map, a multi-value map applies the multi-value register semantics to each value in a map. Formally, its operations are set(key, value) and delete(key). The current state is given by:

- For each key, consider the operation history restricted to set(key, value) and delete(key) operations.
- Among those operations, restrict to the heads of the operation history. Equivalently, these are the operations that have not been overwritten by another key operation. Formally, they are the maximal operations in the causal order.
- The value at key is the set of all values appearing among the heads’ set operations. If this set is empty, then key is not present in the map.

[Figure: Operations A-G with arrows A to B, C to D, C to F, E to D, E to F, F to G. The labels are: set("display", "block"); delete("display"); set("margin", "0"); set("margin", "10px"); set("margin", "20px"); set("height", "auto"); delete("margin").]

Figure 4. Multi-value map operations on a CSS class. Obviously key "height" maps to the single value "auto", while key "display" is not present in the map. For key "margin", observe that when restricting to its operations, only set("margin", "10px") and delete("margin") are heads of the operation history (i.e., not overwritten); thus "margin" maps to the single value "10px".

As with the multi-value register, each present key can have a displayed value that you show by default. For example, you could apply LWW to the multi-values. That gives a semantics similar to the LWW map, but when you implement it as an op-based CRDT, you can forget about deleted values. (Hint: Implement the multi-value map on top of a unique set like above.)

Collabs: CValueMap

Refs: Kleppmann 2022

# Composition Techniques

We next move on to composition techniques. These create new CRDTs from existing ones.

Composition has several benefits over making a CRDT from scratch:

- Semantically, you are guaranteed that the composed output is actually a CRDT: its state is always a pure function of the operation history (i.e., users who have received the same set of ops are in equivalent states).
- Algorithmically, you get op-based and state-based CRDT algorithms “for free” from the components. Those components are probably already optimized and tested.
- It is much easier to add a new system feature (e.g., undo/redo) to a few basic CRDTs and composition techniques, than to add it to your app’s top-level state directly.

In particular, it is safe to use a composed algorithm that appears to work well in the situations you care about (e.g., all pairs of concurrent operations), even if you are not sure what it will do in arbitrarily complex scenarios. You are guaranteed that it will at least satisfy strong convergence and have equivalent op-based vs state-based behaviors.

Like most of our basic techniques, these composition techniques are not really CRDT-specific, and you can easily use them outside of a traditional CRDT framework. Figma’s collaboration system is a good example of this.

# Views

Not all app states have a good CRDT representation. But often you can store some underlying state as a CRDT, then compute your app’s state as a view (pure function) of that CRDT state.

Example: Suppose a collaborative text editor represents its state as a linked list of characters. Storing the linked list directly as a CRDT would cause trouble: concurrent operations can easily cause broken links, partitions, and cycles. Instead, store a traditional list CRDT, then construct the linked list representation as a view of that at runtime.

At runtime, one way to obtain the view is to apply a pure function to your CRDT state each time that CRDT state changes, or each time the view is requested. This should sound familiar to web developers (React, Elm, …).

Another way is to “maintain” the view, updating it incrementally each time the CRDT state changes. CRDT libraries usually emit “events” that make this possible.
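As a minimal sketch of the first (recompute) approach, with a hypothetical change-event API standing in for a real library’s events:

```ts
// Hypothetical CRDT API, for illustration: a list CRDT that exposes its
// current values and emits a "change" event after each received op.
declare const listCRDT: {
  values: string[];
  on(event: "change", handler: () => void): void;
};

type LinkedListNode = { char: string; next: LinkedListNode | null };

// Pure function from the CRDT's state to the app's linked-list view.
function computeView(chars: string[]): LinkedListNode | null {
  let head: LinkedListNode | null = null;
  for (let i = chars.length - 1; i >= 0; i--) {
    head = { char: chars[i], next: head };
  }
  return head;
}

// Recompute the whole view each time the CRDT state changes.
let view = computeView(listCRDT.values);
listCRDT.on("change", () => {
  view = computeView(listCRDT.values);
});
```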
View maintenance is a known hard problem, but it is not hard in a CRDT-specific way. Also, it is easy to unit test: you can always compare to the pure-function approach.

Collabs: Events

# Objects

It is natural to wrap a CRDT in an app-specific API: when the user performs an operation in the app, call a corresponding CRDT operation; in the GUI, render an app-specific view of the CRDT’s state.

More generally, you can create a new CRDT by wrapping multiple CRDTs in a single API. I call this object composition. The individual CRDTs (the components) are just used side-by-side; they don’t affect each others’ operations or states.

Example: An ingredient like we saw in Part 1 (reproduced below) can be modeled as the object composition of three CRDTs: a text CRDT for the text, an LWW register for the amount, and another LWW register for the units.

[Figure: An ingredient with contents Olive Oil, 15, mL.]

To distinguish the component CRDTs’ operations, assign each component a distinct name. Then tag each component’s operations with its name:

[Figure: Operations on an ingredient, labeled by component name. "text: insert(...)", "amount: set(15, (5, alice))", "units: set('mL', (6, alice))", "units: set('g', (3, bob))".]

One way to think of the composed CRDT is as a literal CRDT object - a class whose instance fields are the component CRDTs:

```
class IngredientCRDT extends CRDTObject {
  text: TextCRDT;
  amount: LWWRegister<number>;
  units: LWWRegister<Unit>;

  setAmount(newAmount: number) {
    this.amount.set(newAmount);
  }

  ...
}
```

Another way to think of the composed state is as a JSON object mapping names to component states:

```
{
  text: {<text CRDT state...>},
  amount: { value: number, time: LogicalTimestamp },
  units: { value: Unit, time: LogicalTimestamp }
}
```

Collabs: CObject

Refs: See Map-Like Object refs below

# Nested Objects

You can nest objects arbitrarily. This leads to layered object-oriented architectures:

```
class SlideImageCRDT extends CRDTObject {
  dimensions: DimensionCRDT;
  contents: ImageContentsCRDT;
}

class DimensionCRDT extends CRDTObject {
  position: LWWRegister<{ x: number, y: number }>;
  size: LWWRegister<{ width: number, height: number }>;
}

class ImageContentsCRDT ...
```

or to JSON-like trees:

```
{
  dimensions: {
    height: { value: number, time: LogicalTimestamp },
    width: { value: number, time: LogicalTimestamp }
  },
  contents: {
    ...
  }
}
```

Either way, tag each operation with the tree-path leading to its leaf CRDT. For example, to set the width to 75 pixels: { path: "dimensions/width", op: "set('75px', (11, alice))" }.

# Map-Like Object

Instead of a fixed number of component CRDTs with fixed names, you can allow names drawn from some large set (possibly infinite). This gives you a form of CRDT-valued map, which I will call a map-like object. Each map key functions as the name for its own value CRDT.

Example: A geography app lets users add a description to any address on earth. You can model this as a map from address to text CRDT. The map behaves the same as an object that has a text CRDT instance field per address.

The difference from a CRDT object is that in a map-like object, you don’t store every value CRDT explicitly. Instead, each value CRDT exists implicitly, in some default state, until used. In the JSON representation, this leads to behavior like Firebase RTDB, where

```
{ foo: {/* Empty text CRDT */}, bar: {<text CRDT state...>} }
```

is indistinguishable from

```
{ bar: {<text CRDT state...>} }
```

Note that unlike an ordinary map, a map-like object does not have operations to set/delete a key; each key implicitly always exists, with a pre-set value CRDT.
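A sketch of the “implicit value CRDTs” idea (my own illustration): get() lazily instantiates a key’s value CRDT in its default state, so every key behaves as if its CRDT always existed. A real implementation would also avoid storing value CRDTs that are still in their default state.

```ts
class MapLikeObject<K, C> {
  private instantiated = new Map<K, C>();

  constructor(private createDefault: () => C) {}

  get(key: K): C {
    let value = this.instantiated.get(key);
    if (value === undefined) {
      value = this.createDefault();
      this.instantiated.set(key, value);
    }
    return value;
  }
}

// Usage (geography app example), with a hypothetical TextCRDT:
// const descriptions = new MapLikeObject<string, TextCRDT>(() => new TextCRDT());
// descriptions.get("some address").insert(0, "Looks like a school?");
```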
We’ll see a more traditional map with set/delete operations later.

The map-like object and similar CRDTs are often referred to as “map CRDTs” or “CRDT-valued maps” (I’ve done so myself). To avoid confusion, in this blog series, I will reserve those terms for maps with set/delete operations.

Collabs: CLazyMap

Refs: Riak Map; Conway et al. 2012; Kuper and Newton 2013

# Unique Set of CRDTs

Another composition technique uses UIDs as the names of value CRDTs. This gives the unique set of CRDTs.

Its operations are:

- add(initialState): Adds an operation add(id, initialState) to the history, where id is a new UID. This creates a new value CRDT with the given initial state.
- Value CRDT operations: Any user can perform operations on any value CRDT, tagged with the value’s UID. The UID has the same function as object composition’s component names.
- delete(id), where id is the UID of the value CRDT to be deleted.

Given an operation history, the unique set of CRDTs’ current state consists of all added value CRDTs, minus the deleted ones, in their own current states (according to the value CRDT operations). Formally:

```
for each add(id, initialState) operation:
    if there are no delete(id) operations:
        valueOps = all value CRDT operations tagged with id
        currentState = result of value CRDT's semantics applied to valueOps and initialState
        Add (id, currentState) to the set's current state
```

Example: In a collaborative flash card app, you could represent the deck of cards as a unique set of “flash card CRDTs”. Each flash card CRDT is an object containing text CRDTs for the front and back text. Users can edit the deck by adding a new card (with initial text), deleting an existing card, or editing a card’s front/back text. This extends our earlier flash card example.

Observe that once a value CRDT is deleted, it is deleted permanently. Even if another user operates on the value CRDT concurrently, it remains deleted. That allows an implementation to reclaim memory after receiving a delete op - it only needs to store the states of currently-present values. But it is not always the best semantics, so we’ll discuss alternatives below.

Like the unique set of (immutable) values, you can think of the unique set of CRDTs as an obvious way of working with UIDs in a JSON tree. Indeed, Firebase RTDB’s push method works just like add.

```
// JSON representation of the flash card example:
{
  "uid838x": {
    front: {<text CRDT state...>},
    back: {<text CRDT state...>}
  },
  "uid7b9J": {
    front: {<text CRDT state...>},
    back: {<text CRDT state...>}
  },
  ...
}
```

The unique set of CRDTs also matches the semantics you would get from normalized database tables: UIDs in one table; value CRDT operations in another table with the UID as a foreign key. A delete op corresponds to a foreign key cascade-delete.

Firebase RTDB differs from the unique set of CRDTs in that its delete operations are not permanent - concurrent operations on a deleted value are not ignored, although the rest of the value remains deleted (leaving an awkward partial object). You can work around this behavior by tracking the set of not-yet-deleted UIDs separately from the actual values. When displaying the state, loop over the not-yet-deleted UIDs and display the corresponding values (only). Firebase already recommends this for performance reasons.

Collabs: CSet

Refs: Yjs’s Y.Array
# List of CRDTs

By modifying the unique set of CRDTs to use list CRDT positions instead of UIDs, we get a list of CRDTs. Its value CRDTs are ordered.

Example: You can model the list of ingredients from Part 1 as a list of CRDTs, where each value CRDT is an ingredient object from above. Note that operations on a specific ingredient are tagged with its position (a kind of UID) instead of its index, as we anticipated in Part 1.

Collabs: CList

Refs: Yjs’s Y.Array

# Composed Examples

We now turn to semantic techniques that can be described compositionally.

In principle, if your app needed one of these behaviors, you could figure it out yourself: think about the behavior you want, then make it using the above techniques. In practice, it’s good to see examples.

# Add-Wins Set

The add-wins set represents a set of (non-unique) values. Its operations are add(x) and remove(x), where x is an immutable value of type T. Informally, its semantics are:

- Sequential add(x) and remove(x) operations behave in the usual way for a set (e.g. Java’s HashSet).
- If there are concurrent operations add(x) and remove(x), then the add “wins”: x is in the set.

Example: A drawing app includes a palette of custom colors, which users can add or remove. You can model this as an add-wins set of colors.

The informal semantics do not actually cover all cases. Here is a formal description using composition (a code sketch follows at the end of this section):

- For each possible value x, store a multi-value register indicating whether x is currently in the set. Do so using a map-like object whose keys are the values x. In pseudocode: MapLikeObject<T, MultiValueRegister<boolean>>.
- add(x) translates to the operation “set x’s multi-value register to true”.
- remove(x) translates to the operation “set x’s multi-value register to false”.
- If any of x’s multi-values are true, then x is in the set (enable-wins flag semantics). This is how we get the “add-wins” rule.

[Figure: Operations A-F with arrows A to B, A to D, B to C, B to E, D to E, E to F. The labels are: "add('red') -> red: set(true)"; "add('blue') -> blue: set(true)"; "remove('blue') -> blue: set(false)"; "add('blue') -> blue: set(true)"; "remove('red') -> red: set(false)"; "add('gray') -> gray: set(true)".]

Figure 5. Operation history for a color palette's add-wins set of colors, showing (original op) -> (translated op). The current state is { "blue", "gray" }: the bottom add("blue") op wins over the concurrent remove("blue") op.

There is a second way to describe the add-wins set’s semantics using composition, though you must assume causal-order delivery:

- The state is a unique set of entries (id, x). The current state is the set of values x appearing in at least one entry.
- add(x) translates to a unique-set add(x) operation.
- remove(x) translates to: locally, loop over the entries (id, x); for each one, issue delete(id) on the unique set.

The name observed-remove set - a synonym for add-wins set - reflects how this remove(x) operation works: it deletes all entries (id, x) that the local user has “observed”.

Collabs: CValueSet

Refs: Shapiro et al. 2011a; Leijnse, Almeida, and Baquero 2019
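Here is the promised sketch of the first construction, reusing the MultiValueRegister and MapLikeObject sketches from earlier sections (declared here so the block stands alone):

```ts
declare class MultiValueRegister<T> {
  set(x: T): void;
  get values(): T[];
}
declare class MapLikeObject<K, C> {
  constructor(createDefault: () => C);
  get(key: K): C;
}

class AddWinsSet<T> {
  private flags = new MapLikeObject<T, MultiValueRegister<boolean>>(
    () => new MultiValueRegister<boolean>()
  );

  add(x: T): void {
    this.flags.get(x).set(true); // "set x's register to true"
  }
  remove(x: T): void {
    this.flags.get(x).set(false); // "set x's register to false"
  }

  // Enable-wins flag: x is in the set if any of its multi-values are true.
  has(x: T): boolean {
    return this.flags.get(x).values.some((b) => b);
  }
}
```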
# List-with-Move

The lists above fix each element’s position when it is inserted. This is fine for text editing, but for other collaborative lists, you often want to move elements around. Moving an element shouldn’t interfere with concurrent operations on that element.

Example: In a collaborative recipe editor, users should be able to rearrange the order of ingredients using drag-and-drop. If one user edits an ingredient’s text while someone else moves it concurrently, those edits should show up on the moved ingredient, like the typo fix “Bredd” -> “Bread” here:

[Figure: An ingredients list starts with "Bredd" and "Peanut butter". One user swaps the order of ingredients. Concurrently, another user corrects the typo "Bredd" to "Bread". In the final state, the ingredients list is "Peanut butter", "Bread".]

In the example, intuitively, each ingredient has its own identity. That identity is independent of the ingredient’s current position; instead, position is a mutable property of the ingredient.

Here is a general way to achieve those semantics, the list-with-move:

- Assign each list element a UID, independently of its position. (E.g., store the elements in a unique set of CRDTs, not a list of CRDTs.)
- To each element, add a position property, containing its current position in the list.
- Move an element by setting its position to a new list CRDT position at the intended place. In case of concurrent move operations, apply LWW to their positions.

Sample pseudocode:

```
class IngredientCRDT extends CRDTObject {
  position: LWWRegister<Position>; // List CRDT position
  text: TextCRDT;
  ...
}

class IngredientListCRDT {
  ingredients: UniqueSetOfCRDTs<IngredientCRDT>;

  move(ingr: IngredientCRDT, newIndex: number) {
    const newPos = /* new list CRDT position at newIndex */;
    ingr.position.set(newPos);
  }
}
```

Collabs: CList.move

Refs: Kleppmann 2020

# Internally-Mutable Register

The registers above (LWW register, multi-value register) each represent a single immutable value. But sometimes, you want a value that is internally mutable, but can still be blind-set like a register - overriding concurrent mutations.

Example: A bulletin board has an “Employee of the Month” section that shows the employee’s name, photo, and a text box. Coworkers can edit the text box to give congratulations; it uses a text CRDT to allow simultaneous edits. Managers can change the current employee of the month, overwriting all three fields. If a manager changes the current employee while a coworker concurrently congratulates the previous employee, the latter’s edits should be ignored.

An internally-mutable register supports both set operations and internal mutations. Its state consists of:

- uSet, a unique set of CRDTs, used to create value CRDTs.
- reg, a separate LWW or multi-value register whose value is a UID from uSet.

The register’s visible state is the value CRDT indicated by reg. You internally mutate the value by performing operations on that value CRDT. To blind-set the value to initialState (overriding concurrent mutations), create a new value CRDT using uSet.add(initialState), then set reg to its UID.

Creating a new value CRDT is how we ensure that concurrent mutations are ignored: they apply to the old value CRDT, which is no longer shown. The old value CRDT can even be deleted from uSet to save memory.

The CRDT-valued map (next) is the same idea applied to each value in a map.

Collabs: CVar with CollabID values

Refs: true blind updates in Braun, Bieniusa, and Elberzhager (2021)
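A minimal sketch of this construction (my own illustration), with C as the value CRDT type; LWWRegister and LogicalTimestamp come from the earlier LWW sketch, and newUID() is a hypothetical UID source:

```ts
declare function newUID(): string;
declare type LogicalTimestamp = { counter: number; userID: string };
declare class LWWRegister<T> {
  constructor(state: { value: T; time: LogicalTimestamp });
  receive(newValue: T, newTime: LogicalTimestamp): void;
  get value(): T;
}

class InternallyMutableRegister<C> {
  private uSet = new Map<string, C>(); // unique set of value CRDTs
  private reg: LWWRegister<string>; // points at the current value's UID

  constructor(initial: C, time: LogicalTimestamp) {
    const id = newUID();
    this.uSet.set(id, initial);
    this.reg = new LWWRegister({ value: id, time });
  }

  // Blind-set: create a fresh value CRDT, then point reg at it.
  // Concurrent internal mutations hit the old value CRDT, now hidden.
  set(newState: C, time: LogicalTimestamp): void {
    const id = newUID();
    this.uSet.set(id, newState);
    this.reg.receive(id, time);
  }

  // The register's visible state: the value CRDT indicated by reg.
  get value(): C {
    return this.uSet.get(this.reg.value)!;
  }
}
```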
# CRDT-Valued Map

The map-like object above does not have operations to set/delete a key - it is more like an object than a hash map.

Here is a CRDT-valued map that behaves more like a hash map. Its state consists of:

- uSet, a unique set of CRDTs, used to create the value CRDTs.
- lwwMap, a separate last-writer-wins map whose values are UIDs from uSet.

The map’s visible state is: key maps to the value CRDT with UID lwwMap[key]. (If key is not present in lwwMap, then it is also not present in the CRDT-valued map.)

Operations:

- set(key, initialState) translates to { uSet.add(initialState); lwwMap.set(key, (UID from previous op)); }. That is, the local user creates a new value CRDT, then sets key to its UID.
- delete(key) translates to lwwMap.delete(key). Typically, implementations also call uSet.delete on all existing value CRDTs for key, since they are no longer reachable.

Note that if two users concurrently set a key, then one of their set ops will “win”, and the map will only show that user’s value CRDT. (The other value CRDT still exists in uSet.) This can be confusing if the two users meant to perform operations on “the same” value CRDT, merging their edits.

Example: A geography app lets users add a photo and description to any address on earth. Suppose you model the app as a CRDT-valued map from each address to a CRDT object { photo: LWWRegister<Image>, desc: TextCRDT }. If one user adds a photo to an unused address (necessarily calling map.set first), while another user adds a description concurrently (also calling map.set), then one CRDT object will overwrite the other:

[Figure: The address 5000 Forbes Ave starts with a blank description and photo. One user adds the description "Looks like a school?". Concurrently, another user adds a photo of a building. In the final state, the description is "Looks like a school?" but the photo is blank again.]

To avoid this, consider using a map-like object, like the previous geography app example.

More composed constructions that are similar to the CRDT-valued map:

- Same, except you don’t delete value CRDTs from uSet. Instead, they are kept around in an archive. You can “restore” a value CRDT by calling set(key, id) again later, possibly under a different key.
- A unique set of CRDTs where each value CRDT has a mutable key property, controlled by LWW. That way, you can change a value CRDT’s key - e.g., renaming a document. Note that your display must handle the case where multiple value CRDTs have the same key.

Collabs: CMap, CollabID

Refs: Yjs’s Y.Map; Automerge’s Map

# Archiving Collections

The CRDT-valued collections above (unique set, list, map) all have a delete operation that permanently deletes a value CRDT. It is good to have this option for performance reasons, but you often instead want an archive operation, which merely hides an element until it’s restored. (You can recover most of delete’s performance benefits by swapping archived values to disk/cloud.)

Example: In a notes app with cross-device sync, the user should be able to view and restore deleted notes. That way, they cannot lose a note by accident.

To implement archive/restore, add an isPresent field to each value. Values start with isPresent = true. The operation archive(id) sets it to false, and restore(id) sets it to true. In case of concurrent archive/restore operations, you can apply LWW, or use a multi-value register’s displayed value.

Alternate implementation: Use a separate add-wins set to indicate which values are currently present.

Collabs: CList.archive/restore
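A sketch of archive/restore with an isPresent flag per value, using the LWW option for concurrent archive/restore conflicts (my own illustration; LWWRegister and LogicalTimestamp are from the earlier LWW sketch):

```ts
declare type LogicalTimestamp = { counter: number; userID: string };
declare class LWWRegister<T> {
  constructor(state: { value: T; time: LogicalTimestamp });
  receive(newValue: T, newTime: LogicalTimestamp): void;
  get value(): T;
}

class ArchivingSet<V> {
  private elements = new Map<
    string,
    { value: V; isPresent: LWWRegister<boolean> }
  >();

  add(id: string, value: V, time: LogicalTimestamp): void {
    this.elements.set(id, {
      value,
      isPresent: new LWWRegister({ value: true, time }),
    });
  }

  // Merely hide the element; its state is kept and can be restored.
  archive(id: string, time: LogicalTimestamp): void {
    this.elements.get(id)?.isPresent.receive(false, time);
  }
  restore(id: string, time: LogicalTimestamp): void {
    this.elements.get(id)?.isPresent.receive(true, time);
  }

  get present(): V[] {
    return [...this.elements.values()]
      .filter((e) => e.isPresent.value)
      .map((e) => e.value);
  }
}
```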
# Update-Wins Collections

If one user archives a value that another user is still using, you might choose to “auto-restore” that value.

Example: In a spreadsheet, one user deletes a column while another user edits some of that column’s cells concurrently. The second user probably wants to keep that column, and it’s easier if the column restores automatically (Yanakieva, Bird, and Bieniusa 2023).

To implement this on top of an archiving collection, merely call restore(id) each time the local user edits id’s CRDT. So each local operation translates to two ops in the history: the original (value CRDT) operation and restore(id).

To make sure that these “keep” operations win over concurrent archive operations, use an enable-wins flag to control the isPresent field. (I.e., a multi-value register whose displayed value is true if any of the multi-values are true.) Or, use the last section’s alternate implementation: an add-wins set of present values.

Collabs: supported by CList.restore

Refs: Yanakieva, Bird, and Bieniusa 2023

# Spreadsheet Grid

In a spreadsheet, users can insert, delete, and potentially move rows and columns. That is, the collection of rows behaves like a list, as does the collection of columns. Thus the cell grid is the 2D analog of a list.

It’s tempting to model the grid as a list-of-lists. However, that has the wrong semantics in some scenarios. In particular, if one user creates a new row, while another user creates a new column concurrently, then there won’t be a cell at the intersection.

Instead, you should think of the state as:

- A list of rows.
- A list of columns.
- For each pair (row, column), a single cell, uniquely identified by the pair (row id, column id). row id and column id are unique IDs.

The cells are not explicitly created; instead, the state implicitly contains such a cell as soon as its row and column exist. Of course, until a cell is actually used, it remains in a default state (blank) and doesn’t need to be stored in memory. Once a user’s app learns that the row or column was deleted, it can forget the cell’s state, without an explicit “delete cell” operation - like a foreign key cascade-delete.

In terms of the composition techniques above (objects, list-with-move, map-like object):

```
class CellCRDT extends CRDTObject {
  formula: LWWRegister<string>;
  ...
}

rows: ListWithMoveCRDT<Row>;
columns: ListWithMoveCRDT<Column>;
cells: MapLikeObject<(rowID: UID, columnID: UID), CellCRDT>;
// Note: if you use this compositional construction in an implementation,
// you must do extra work to forget deleted cells' states.
```

# Advanced Techniques

The techniques in this section are more advanced. This really means that they come from more recent papers and I am less comfortable with them; they also have fewer existing implementations, if any.

Except for undo/redo, you can think of these as additional composed examples. However, they have more complex views than the previous composed examples.

# Formatting Marks (Rich Text)

Rich text consists of plain text plus inline formatting: bold, font size, hyperlinks, etc. (This section does not consider block formatting like blockquotes or bulleted lists, discussed later.)

Inline formatting traditionally applies not to individual characters, but to spans of text: all characters from index i to j. E.g., atJSON (not a CRDT) uses the following to represent a bold span that affects characters 5 to 11:

```
{
  type: "-offset-bold",
  start: 5,
  end: 11,
  attributes: {}
}
```

Future characters inserted in the middle of the span should get the same format. Likewise for characters inserted at the end of the span, for certain formats. You can override (part of) the span by applying a new formatting span on top.

The inline formatting CRDT lets you use formatting spans in a CRDT setting.
(I’m using this name for a specific part of the Peritext CRDT (Litt et al. 2021).) It consists of:

- An append-only log of CRDT-ified formatting spans, called marks.
- A view of the mark log that tells you the current formatting at each character.

Each mark has the following form:

```
{
  key: string;
  value: any;
  timestamp: LogicalTimestamp;
  start: { pos: Position, type: "before" | "after" }; // anchor
  end: { pos: Position, type: "before" | "after" }; // anchor
}
```

Here timestamp is a logical timestamp for LWW, while each Position is a list CRDT position. This mark sets key to value (e.g. "bold": true) for all characters between start and end. The endpoints are anchors that exist just before or just after their pos:

[Figure: "Some text" with before and after anchors on each character. The middle text "me te" is bold due to a mark labeled 'Bold mark from { pos: (m's pos), type: "before" } to { pos: (x's pos), type: "before" }'.]

LWW takes effect when multiple marks affect the same character and have the same key: the one with the largest timestamp wins. In particular, new marks override (causally) older ones. Note that a new mark might override only part of an older mark’s range.

Formally, the view of the mark log is given by: for each character c, for each format key key, find the mark with the largest timestamp satisfying

- mark.key = key, and
- the interval (mark.start, mark.end) contains c’s position.

Then c’s format value at key is mark.value.

Remarks:

1. To unformat, apply a formatting mark with a null value, e.g., { key: "bold", value: null, ... }. This competes against other “bold” marks in LWW.
2. A formatting mark affects not just (causally) future characters in its range, but also characters inserted concurrently: [Figure: Text starts as "The cat jumped on table.", unbold. One user highlights the entire range and bolds it. Concurrently, another user inserts " the" after "on". The final state is "The cat jumped on the table.", all bold.]
3. Anchors let you choose whether a mark “expands” to affect future and concurrent characters at the beginning or end of its range. For example, the bold mark pictured above expands at the end: a character typed between e and x will still be within the mark’s range because the mark’s end is attached to x.
4. The view of the mark log is difficult to compute and store efficiently. Part 3 will describe an optimized view that can be maintained incrementally and doesn’t store metadata on every character.
5. Sometimes a new character should be bold (etc.) according to local rules, but existing formatting marks don’t make it bold. E.g., a character inserted at the beginning of a paragraph in MS Word inherits the following character’s formatting, but the inline formatting CRDT doesn’t do that automatically. To handle this, when a user types a new character, compute its formatting according to local rules. (Most rich-text editor libraries already do so.) If the inline formatting CRDT currently assigns different formatting to that character, fix it by adding new marks to the log.
6. Fancy extension to (5): Usually the local rules are “extending” a formatting mark in some direction - e.g., backwards from the paragraph’s previous starting character. You can figure out which mark is being extended, then reuse its timestamp instead of making a new one. That way, LWW behaves identically for your new mark vs the one it’s extending.

Collabs: CRichText

Refs: Litt et al. 2021 (Peritext)

# Spreadsheet Formatting

You can also apply inline formatting to non-text lists.
For example, Google Sheets lets you bold a range of rows, with similar behavior to a range of bold text: new rows at the middle or end of the range are also bold. A cell in a bold row renders as bold, unless you override the formatting for that cell.

In more detail, here’s an idea for spreadsheet formatting. Use two inline formatting CRDTs, one for the rows and one for the columns. Also, for each cell, store an LWW map of format key-value pairs; mutate the map when the user formats that individual cell. To compute the current bold format for a cell, consider:

- The current (largest timestamp) bold mark for the cell’s row.
- The current bold mark for its column.
- The value at key “bold” in the cell’s own LWW map.

Then render the mark/value with the largest timestamp out of these three.

This idea lets you format rows, columns, and cells separately. Sequential formatting ops interact in the expected way: for example, if a user bolds a row and then unbolds a column, the intersecting cell is not bold, since the column op has a larger timestamp.

# Global Modifiers

Often you want an operation to do something “for each” element of a collection, including elements added concurrently.

Example: An inline formatting mark affects each character in its range, including characters inserted concurrently (see above).

Example: Suppose a recipe editor has a “Halve the recipe” button, which halves every ingredient’s amount. This should have the semantics: for each ingredient amount, including amounts set concurrently, halve it. If you don’t halve concurrent set ops, the recipe can get out of proportion:

[Figure: An ingredients list starts with 100 g Flour and 80 g Milk. One user edits the amount of Milk to 90 g. Concurrently, another user halves the recipe (50 g Flour, 40 g Milk). The final state is: 50 g Flour, 90 g Milk.]

I’ve talked about these for-each operations before and co-authored a paper formalizing them (Weidner et al. 2023). However, the descriptions so far query the causal order (below), making them difficult to implement.

Instead, I currently recommend implementing these examples using global modifiers. By “global modifier”, I mean a piece of state that affects all elements of a collection/range: causally prior, concurrent, and (causally) future.

The inline formatting marks above have this form: a mark affects each character in its range, regardless of when it was inserted. If a user decides that a future character should not be affected, that user can override the formatting mark with a new one.

To implement the “Halve the recipe” example (a code sketch follows after this list):

- Store a global scale alongside the recipe. This is a number controlled by LWW, which you can think of as the number of servings.
- Store each ingredient’s amount as a scale-independent number. You can think of this as the amount per serving.
- The app displays the product ingrAmount.value * globalScale.value for each ingredient’s amount.
- To halve the recipe, merely set the global scale to half of its current value: globalScale.set(0.5 * globalScale.value). This halves all displayed amounts, including amounts set concurrently and concurrently-added ingredients.
- When the user sets an amount, locally compute the corresponding scale-independent amount, then set that. E.g. if they change flour from 50 g to 55 g but the global scale is 0.5, instead call ingrAmount.set(110).
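Here is that sketch (my own illustration, reusing the earlier LWWRegister sketch):

```ts
declare type LogicalTimestamp = { counter: number; userID: string };
declare class LWWRegister<T> {
  constructor(state: { value: T; time: LogicalTimestamp });
  receive(newValue: T, newTime: LogicalTimestamp): void;
  get value(): T;
}

class ScaledRecipe {
  private globalScale: LWWRegister<number>;
  // Per-ingredient scale-independent amounts ("per serving"):
  private amounts = new Map<string, LWWRegister<number>>();

  constructor(t0: LogicalTimestamp) {
    this.globalScale = new LWWRegister({ value: 1, time: t0 });
  }

  // Displayed amount = scale-independent amount * global scale.
  displayedAmount(ingrID: string): number {
    return this.amounts.get(ingrID)!.value * this.globalScale.value;
  }

  // "Halve the recipe": one LWW set that affects every displayed amount,
  // including amounts set concurrently and concurrently-added ingredients.
  halve(time: LogicalTimestamp): void {
    this.globalScale.receive(0.5 * this.globalScale.value, time);
  }

  // When the user sets a displayed amount, divide out the current scale.
  setAmount(ingrID: string, displayed: number, time: LogicalTimestamp): void {
    this.amounts.get(ingrID)!.receive(displayed / this.globalScale.value, time);
  }
}
```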
In the recipe editor, you could even make the global scale non-collaborative: each user chooses how many servings to display on their own device. But all collaborative edits affect the same single-serving recipe internally.

Refs: Weidner, Miller, and Meiklejohn 2020; Weidner et al. 2023

# Forests and Trees

Many apps include a tree or forest structure. (A forest is a collection of disconnected trees.) Typical operations are creating a new node, deleting a node, and moving a node (changing its parent).

Examples: A file system is a tree whose leaves are files and inner nodes are folders. A Figma document is a tree of rendered objects.

The CRDT way to represent a tree or forest is: each node has a parent node, set via LWW. The parent can either be another node, a special “root” node (in a tree), or “none” (in a forest). You compute the tree/forest as a view of these child->parent relationships (edges) in the obvious way.

When a user deletes a node - implicitly deleting its whole subtree - don’t actually loop over the subtree deleting nodes. That would have weird results if another user concurrently moved some nodes into or out of the subtree. Instead, only delete the top node (or archive it - e.g., set its parent to a special “trash” node). It’s a good idea to let users view the deleted subtree and move nodes out of it.

Everything I’ve said so far is just an application of basic techniques. Cycles are what make forests and trees advanced: it’s possible that one user sets B.parent = A, while concurrently, another user sets A.parent = B. Then it’s unclear what the computed view should be.

[Figure: A tree starts with root C and children A, B. One user moves A under B (sets A.parent = B). Concurrently, another user moves B under A. The final state has C, an A-B cycle, and "??".]

Figure 6. Concurrent tree-move operations - each valid on their own - may create a cycle. When this happens, what state should the app display, given that cycles are not allowed in a forest/tree?

Some ideas for how to handle cycles:

1. Error. Some desktop file sync apps do this in practice (Kleppmann et al. (2022) give an example).
2. Render the cycle nodes (and their descendants) in a special “time-out” zone. They will stay there until some user manually fixes the cycle.
3. Use a server to process move ops. When the server receives an op, if it would create a cycle in the server’s own state, the server rejects it and tells users to do likewise. This is what Figma does. Users can still process move ops optimistically, but they are tentative until confirmed by the server. (Optimistic updates can cause temporary cycles for users; in that case, Figma uses strategy (2): it hides the cycle nodes.)
4. Similar, but use a topological sort (below) instead of a server’s receipt order. When processing ops in the sort order, if an op would create a cycle, skip it (Kleppmann et al. 2022).
5. For forests: Within each cycle, let B.parent = A be the edge whose set operation has the largest LWW timestamp. At render time, “hide” that edge, instead rendering B.parent = "none", but don’t change the actual CRDT state. This hides one of the concurrent edges that created the cycle. To prevent future surprises, users’ apps should follow the rule: before performing any operation that would create or destroy a cycle involving a hidden edge, first “affirm” that hidden edge, by performing an op that sets B.parent = "none".
6. For trees: Similar, except instead of rendering B.parent = "none", render the previous parent for B - as if the bad operation never happened. More generally, you might have to backtrack several operations. Both Hall et al. (2018) and Nair et al. (2022) describe strategies along these lines.
Refs: Graphs in Shapiro et al. 2011a; Martin, Ahmed-Nacer, and Urso 2011; Hall et al. (2018); Nair et al. 2022; Kleppmann et al. 2022; Wallace 2022

# Undo/Redo

In most apps, users should be able to undo and redo their own operations in a stack, using Ctrl+Z / Ctrl+Shift+Z. You might also allow selective undo (undo operations anywhere in the history) or group undo (users can undo each others’ operations) - e.g., for reverting changes.

A simple way to undo an operation is: perform a new operation whose effect locally undoes the target operation. For example, to undo typing a character, perform a new operation that deletes that character.

However, this “local undo” has sub-optimal semantics. For example, suppose one user posts an image, undoes it, then redoes it; concurrently to the undo/redo, another user comments on the image. If you implement the redo as “make a new post with the same contents”, then the comment will be lost: it attaches to the original post, not the re-done one.

Exact undo instead uses the following semantics:

1. In addition to normal app operations, there are operations undo(opID) and redo(opID), where opID identifies a normal operation.
2. For each opID, consider the history of undo(opID) and redo(opID) operations. Apply some boolean-value CRDT to that operation history to decide whether opID is currently (re)done or undone.
3. The current state is the result of applying your app’s semantics to the (re)done operations only.

[Figure: Top: Operations A-F with arrows A to B, A to D, B to C, B to E, D to E, E to F. The labels are: "op1x7: add(red)"; "op33n: add(blue)"; "undo(op33n)"; "undo(op1x7)"; "op91k: add(green)"; "redo(op1x7)". Bottom: Only operations A and D, with an arrow A to D.]

Figure 7. Top: Operation history for an add-wins set with exact undo. Currently, op1x7 is redone, op33n is undone, and op91k is done. Bottom: We filter the (re)done operations only and pass the filtered operation history to the add-wins set's semantics, yielding state { "red", "green" }.

For the boolean-value CRDT, you can use LWW, or a multi-value register’s displayed value (e.g., redo-wins). Or, you can use the maximum causal length: the first undo operation is undo(opID, 1); redoing it is redo(opID, 2); undoing it again is undo(opID, 3), etc.; and the winner is the op with the largest number. (Equivalently, the head of the longest chain of causally-ordered operations - hence the name.)

Maximum causal length makes sense as a general boolean-value CRDT, but I’ve only seen it used for undo/redo.

Step 3 is more difficult than it sounds. Your app might assume causal-order delivery, then give weird results when undone operations violate it. (E.g., our multi-value register algorithm above will not match the intended semantics after undos.) Also, most algorithms do not support removing past operations from the history. But see Brattli and Yu (2021) for a multi-value register that is compatible with exact undo.

Refs: Weiss, Urso, and Molli 2010; Yu, Elvinger, and Ignat 2019

# Other Techniques

This section mentions other techniques that I personally find less useful. Some are designed for distributed data stores instead of collaborative apps; some give reasonable semantics but are hard to implement efficiently; and some are plausibly useful but I have not yet found a good example app.

# Remove-Wins Set

The remove-wins set is like the add-wins set, except if there are concurrent operations add(x) and remove(x), then the remove wins: x is not in the set.
Refs: Weiss, Urso, and Molli 2010; Yu, Elvinger, and Ignat 2019

# Other Techniques

This section mentions other techniques that I personally find less useful. Some are designed for distributed data stores instead of collaborative apps; some give reasonable semantics but are hard to implement efficiently; and some are plausibly useful but I have not yet found a good example app.

# Remove-Wins Set

The remove-wins set is like the add-wins set, except if there are concurrent operations add(x) and remove(x), then the remove wins: x is not in the set. You can implement this similarly to the add-wins set, using a disable-wins flag instead of an enable-wins flag. (Take care that the set starts empty, instead of containing every possible value.) Or, you can implement the remove-wins set using:

- an append-only log of all values that have ever been added, and
- an add-wins set indicating which of those values are currently removed.

In general, any implementation must store all values that have ever been added; this is a practical reason to prefer the add-wins set instead. I also do not know of an example app where I prefer the remove-wins set’s semantics. The exception is apps that already store all values elsewhere, such as an archiving collection: I think a remove-wins set of present values would give reasonable semantics. That is equivalent to using an add-wins set of archived values, or to using a disable-wins flag for each value’s isPresent field.

Refs: Bieniusa et al. 2012; Baquero, Almeida, and Shoker 2017; Baquero et al. 2017

# PN-Set

The PN-Set (Positive-Negative Set) is another alternative to the add-wins set. Its semantics are: for each value x, count the number of add(x) operations in the operation history, minus the number of remove(x) operations; if it’s positive, x is in the set.

These semantics give strange results in the face of concurrent operations, as described by Shapiro et al. (2011a). For example, if two users call add(x) concurrently, then to remove x from the set, you must call remove(x) twice. If two users do that concurrently, it will interact strangely with further add(x) operations, etc.

Like the maximum causal length semantics, the PN-Set was originally proposed for undo/redo.
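For concreteness, here is a sketch of the PN-Set's membership rule applied directly to an operation history (the types are illustrative; a real implementation would keep running counts instead of scanning the history):

```ts
// PN-Set membership: adds minus removes, evaluated over the whole history.
type SetOp<T> = { type: "add" | "remove"; value: T };

function pnSetContains<T>(history: SetOp<T>[], x: T): boolean {
  let count = 0;
  for (const op of history) {
    if (op.value === x) count += op.type === "add" ? 1 : -1;
  }
  return count > 0;
}
```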
Refs: Weiss, Urso, and Molli 2010; Shapiro et al. 2011a

# Observed-Reset Operations

An observed-reset operation cancels out all causally prior operations. That is, when looking at the operation history, you ignore all operations that are causally prior to any reset operation, then compute the state from the remaining operations.

[Figure: Six +1 operations, and one reset() operation that is causally greater than three of the +1 operations (underlined).]

Figure 8. Operation history for a counter with +1 and observed-reset operations. The reset operation cancels out the underlined +1 operations, so the current state is 3.

Observed-reset operations are a tempting way to add a delete(key) operation to a map-like object: make delete(key) be an observed-reset operation on the value CRDT at key. Thus delete(key) restores a value CRDT to its original, unused state, which you can treat as the “key-not-present” state. However, then if one user calls delete(key) while another user operates on the value CRDT concurrently, you’ll end up with an awkward partial state:

[Figure: In a collaborative todo-list with observed-reset deletes, concurrently deleting an item and marking it done results in a nonsense list item with no text field. Image credit: Figure 6 by Kleppmann and Beresford.]

That paper describes a theoretical JSON CRDT, but Firebase RTDB has the same behavior.

I instead prefer to omit delete(key) from the map-like object entirely. If you need deletions, instead use a CRDT-valued map or similar. Those ultimately treat delete(key) as a permanent deletion (from the unique-set of CRDTs) or as an archive operation.

Refs: Deletes in Riak Map

# Querying the Causal Order

Most of our techniques so far don’t use the causal order on operations (arrows in the operation history). However, the multi-value register does: it queries the set of causally-maximal operations, displaying their values. Observed-reset operations also query the causal order, and the add-wins set/remove-wins set reference it indirectly.

One can imagine CRDTs that query the causal order in many more ways. However, I find these too complicated for practical use:

1. It is expensive to track the causal order on all pairs of operations.
2. Semantics that ask “is there an operation concurrent to this one?” generally need to store operations forever, in case a concurrent operation appears later.
3. It is easy to create semantic rules that don’t behave well in all scenarios.

(The multi-value register and add-wins set occupy a special case that avoids (1) and (2).)

As an example of (3), it is tempting to define an add-wins set by: an add(x) operation overrules any concurrent remove(x) operation, so that the add wins. But then in Figure 9’s operation history, both remove(x) operations get overruled by concurrent add(x) operations. That makes x present in the set when it shouldn’t be.

[Figure: Operations A-D with arrows A to B, C to D. The labels are: add(x), remove(x), add(x), remove(x).]

Figure 9. Operation history for an add-wins set. One user calls add(x) and then remove(x); concurrently, another user does likewise. The correct current state is the empty set: the causally-maximal operations on x are both remove(x).

As another example, you might try to define an add-wins set by: if there are concurrent add(x) and remove(x) operations, apply the remove(x) “first”, so that add(x) wins; otherwise apply operations in causal order. But then in the above operation history, the intended order of operations contains a cycle:

[Figure: Operations A-D with "causal order" arrows A to B, C to D, and "remove-first rule" arrows B to C, D to A. The labels are: add(x), remove(x), add(x), remove(x).]

I always try out this operation history when a paper claims to reproduce/understand the add-wins set in a new way.

# Topological Sort

A topological sort is a general way to “derive” CRDT semantics from an ordinary data structure. Given an operation history made out of ordinary data structure operations, the current state is defined by:

1. Sort the operations into some consistent linear order that is compatible with the causal order (o < p implies o appears before p). E.g., sort them by Lamport timestamp.
2. Apply those operations to the starting state in order, returning the final state. If an operation would be invalid (e.g. it would create a cycle in a tree), skip it.

The problem with these semantics is that you don’t know what result you will get - it depends on the sort order, which is fairly arbitrary.

However, topological sort can be useful as a fallback in complex cases, like tree cycles or group-chat permissions. You can think of it like asking a central server for help: the sort order stands in for “the order in which ops reached the server”. (If you do have a server, you can use that instead.)
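A minimal sketch of these two steps, assuming each op carries a Lamport timestamp plus a replicaID for tie-breaking (all names are illustrative):

```ts
// Topological-sort semantics: sort ops consistently, apply in order, skip
// invalid ops.
type Op<S> = {
  lamport: number;
  replicaID: string;
  apply: (state: S) => S; // throws if invalid in the current state
};

function topologicalSortState<S>(history: Op<S>[], start: S): S {
  // 1. A consistent linear order compatible with the causal order:
  // Lamport timestamps respect causality; replicaID breaks ties.
  const sorted = [...history].sort(
    (a, b) => a.lamport - b.lamport || a.replicaID.localeCompare(b.replicaID)
  );
  // 2. Apply in order, skipping operations that would be invalid.
  let state = start;
  for (const op of sorted) {
    try {
      state = op.apply(state);
    } catch {
      // E.g., a tree-move op that would create a cycle: skip it.
    }
  }
  return state;
}
```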
Ref: Kleppmann et al. 2018

# Capstones

Let’s finish by designing novel semantics for two practical but complex collaborative apps.

# Recipe Editor

I’ve mentioned a collaborative recipe editor in several examples. It’s implemented as a Collabs demo: live demo, talk slides, talk video, source code.

[Figure: Recipe editor screenshot showing a recipe for roast broccoli.]

The app’s semantics can be described compositionally using nested objects. Here is a schematic:

```
{
  ingredients: UniqueSetOfCRDTs<{
    text: TextCRDT,
    amount: LWWRegister<number>, // Scale-independent amount
    units: LWWRegister<Unit>,
    position: LWWRegister<Position>, // List CRDT position, for list-with-move
    isPresent: EnableWinsFlag // For update-wins semantics
  }>,
  globalScale: LWWRegister<number>, // For scale ops
  description: {
    text: TextCRDT,
    formatting: InlineFormattingCRDT
  }
}
```

(Links by class name: UniqueSetOfCRDTs, TextCRDT, LWWRegister, EnableWinsFlag, InlineFormattingCRDT.)

Most GUI operations translate directly to operations on this state, but there are some edge cases.

- The ingredient list is a list-with-move: move operations (the arrow buttons) set position.
- Delete operations (the red X’s) use update-wins semantics: delete sets isPresent to false, while each operation on an ingredient (e.g., setting its amount) additionally sets isPresent to true.
- Ingredient amounts, and the “Double the recipe!” / “Halve the recipe!” buttons, treat the scale as a global modifier.

# Block-Based Rich Text

We described inline rich-text formatting above, like bold and italics. Real rich-text editors also support block formatting: headers, lists, blockquotes, etc. Fancy apps like Notion even let you rearrange the order of blocks using drag-and-drop:

[Figure: Notion screenshot of moving block "Lorem ipsum dolor sit amet" from before to after "consectetur adipiscing elit".]

Let’s see if we can design a CRDT semantics that has all of these features: inline formatting, block formatting, and movable blocks. Like the list-with-move, moving a block should not affect concurrent edits within that block. We’d also like nice behavior in tricky cases - e.g., one user moves a block while a concurrent user splits it into two blocks.

This section is experimental; I’ll update it in the future if I learn of improvements (suggestions are welcome).

Refs: Ignat, André, and Oster 2017 (similar Operational Transformation algorithm); Quill line formatting; unpublished notes by Martin Kleppmann (2022); Notion’s data model

# CRDT State

The CRDT state is an object with several components:

- text: A text CRDT. It stores the plain text characters plus two kinds of invisible characters: block markers and hooks. Each block marker indicates the start of a block, while hooks are used to place blocks that have been split or merged.
- format: An inline formatting CRDT on top of text. It controls the inline formatting of the text characters (bold, italic, links, etc.). It has no effect on invisible characters.
- blockList: A separate list CRDT that we will use to order blocks. It does not actually have content; it just serves as a source of list CRDT positions.
- For each block marker (keyed by its list CRDT position), a block CRDT object with the following components:
  - blockType: An LWW register whose value is the block’s type. This can be “heading 2”, “blockquote”, “unordered list item”, “ordered list item”, etc.
  - indent: An LWW register whose value is the block’s indent level (a nonnegative integer).
  - placement: An LWW register that we will explain later. Its value is one of:
    - { case: "pos", target: <position in blockList> }
    - { case: "origin", target: <a hook's list CRDT position> }
    - { case: "parent", target: <a hook's list CRDT position>, prevPlacement: <a "pos" or "origin" placement value> }
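Collecting the above into TypeScript-style type declarations may help; this is a sketch in the style of the recipe schematic, and the position types and CRDT class stubs are assumptions, not a real library API.

```ts
// Stubs for the CRDT building blocks named earlier (assumed, not real APIs):
declare class LWWRegister<T> {}
declare class TextCRDT {}
declare class InlineFormattingCRDT {}
declare class ListCRDT {}

type BlockListPosition = string; // a position in blockList
type HookPosition = string; // a hook's position in text
type BlockMarkerPosition = string; // a block marker's position in text

type Placement =
  | { case: "pos"; target: BlockListPosition }
  | { case: "origin"; target: HookPosition }
  // prevPlacement is always a "pos" or "origin" value.
  | { case: "parent"; target: HookPosition; prevPlacement: Placement };

type BlockCRDT = {
  blockType: LWWRegister<string>; // "heading 2", "blockquote", ...
  indent: LWWRegister<number>; // nonnegative indent level
  placement: LWWRegister<Placement>;
};

type RichTextState = {
  text: TextCRDT; // characters + invisible block markers and hooks
  format: InlineFormattingCRDT; // inline formatting on top of text
  blockList: ListCRDT; // no content; just a source of positions
  blocks: Map<BlockMarkerPosition, BlockCRDT>; // one per block marker
};
```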
# Rendering the App State

Let’s ignore blockCRDT.placement for now. Then rendering the rich-text state resulting from the CRDT state is straightforward:

- Each block marker defines a block. The block’s contents are the text immediately following that block marker in text, ending at the next block marker.
- The block is displayed according to its blockType and indent. For ordered list items, the leading number (e.g. “3.”) is computed at render time according to how many preceding blocks are ordered list items. Unlike in HTML, the CRDT state does not store an explicit “list start” or “list end”.
- The text inside a block is inline-formatted according to format. Note that formatting marks might cross block boundaries; this is fine.

[Figure: Top: "text: _Hello_Okay" with underscores labeled "Block marker n7b3", "Block marker ttx7". Bottom left: 'n7b3: { blockType: “ordered list item”, indent: 0 }', 'ttx7: { blockType: “blockquote”, indent: 0 }'. Bottom right: two rendered blocks, "1. Hello" and "(blockquote) Okay".]

Figure 10. Sample state and rendered text, omitting blockList, hooks, and blockCRDT.placement.

Now we need to explain blockCRDT.placement. It tells you how to order a block relative to other blocks, and whether it is merged into another block.

- If case is "pos": This block stands on its own. Its order relative to other "pos" blocks is given by target’s position in blockList.
- If case is "origin": This block again stands on its own, but it follows another block (its origin) instead of having its own position. Specifically, let block A be the block containing target (i.e., the last block marker prior to target in text). Render this block immediately after block A.
- If case is "parent": This block has been merged into another block (its parent). Specifically, let block A be the block containing target. Render our text as part of block A, immediately after block A’s own text. (Our blockType and indent are ignored.)

[Figure: Top: "blockList: [p32x, p789]". Middle: "text: _Hel^lo_Ok^_ay" with underscores labeled "Block marker n7b3", "Block marker ttx7", "Block marker x1bc", and carets labeled "Hook @ pbj8", "Hook @ p6v6". Bottom left: 'n7b3.placement: { case: “pos”, target: p32x }', 'ttx7.placement: { case: “origin”, target: pbj8 }', 'x1bc.placement: { case: “parent”, target: p6v6 }'. Bottom right: two rendered blocks, "1. Hello" and "(blockquote) Okay".]

Figure 11. Repeat of Figure 10 showing blockList, hooks, and blockCRDT.placement. Observe that "Okay" is the merger of two blocks, "Ok" and "ay".

You might notice some dangerous edge cases here! We’ll address those shortly.

# Move, Split, and Merge

Now we can implement the three interesting block-level operations:

Move. To move block B so that it immediately follows block A, first create a new position in blockList that is immediately after A’s position (or its origin’s (origin’s…) position). Then set block B’s placement to { case: "pos", target: <new blockList position> }.

- If there are any blocks with origin B that you don’t want to move along with B, perform additional move ops to keep them where they currently appear.
- If there are any blocks with origin A, perform additional move ops to move them after B, so that they aren’t rendered between A and B.
- Edge case: to move block B to the start, create the position at the beginning of blockList.

Split. To split a block in two, insert a new hook and block marker at the splitting position (in that order). Set the new block’s placement to { case: "origin", target: <new hook's position> }, and set its blockType and indent as you like.

Why do we point to a hook instead of the previous block’s block marker?
Basically, we want to follow the text just prior to the split, which might someday end up in a different block. (Consider the case where the previous block splits again, then one half moves without the other.)

Merge. To merge block B into the previous block, first find the previous block A in the current rendered state. (This might not be the previous block marker in text.) Insert a new hook at the end of A’s rendered text, then set block B’s placement to { case: "parent", target: <new hook's position>, prevPlacement: <block B's previous placement value> }. The “end of A’s rendered text” might be in a block that was merged into A.

# Edge Cases

It remains to address some dangerous edge cases during rendering.

First, it is possible for two blocks B and C to have the same origin block A. So according to the above rules, they should both be rendered immediately after block A, which is impossible. Instead, render them one after the other, in the same order that their hooks appear in the rendered text. (This might differ from the hooks’ order in text.)

More generally, the relationships “block B has origin A” form a forest. (I believe there will never be a cycle, so we don’t need our advanced technique above.) For each tree in the forest, render that tree’s blocks consecutively, in depth-first pre-order traversal order.

[Figure: Top: "blockList: [p32x, p789]". Bottom: "Forest of origins" with nodes A-E and edges B to A, C to B, D to A. Nodes A and E have lines to p32x and p789, respectively.]

Figure 12. A forest of origin relationships. This renders as block order A, B, C, D, E. Note that the tree roots A, E are ordered by their positions in blockList.

Second, it is likewise possible for two blocks B and C to have the same parent block A. In that case, render both blocks’ text as part of block A, again in order by their hooks.

More generally, we would like the relationships “block B has parent A” to form a forest. However, this time cycles are possible!

[Figure: Starting state: A then B. Top state: AB. Bottom state: (B then A) with an arrow to (BA). Final state: A and B with arrows in a cycle and "??".]

Figure 13. One user merges block B into A. Concurrently, another user moves B above A, then merges A into B. Now their parent relationships form a cycle.

To solve this, use any of the ideas from forests and trees to avoid/hide cycles. I recommend a variant of idea 5: within each cycle, “hide” the edge with largest LWW timestamp, instead rendering its prevPlacement. (The prevPlacement always has case "pos" or "origin", so this won’t create any new cycles. Also, I believe you can ignore idea 5’s sentence about “affirming” hidden edges.)

Third, to make sure there is always at least one block, the app’s initial state should be: blockList contains a single position; text contains a single block marker with placement = { case: "pos", target: <the blockList position> }.

Why does moving a block set its placement.target to a position within blockList, instead of a hook like Split/Merge? This lets us avoid another cycle case: if block B is moved after block A, while concurrently, block A is moved after block B, then blockList gives them a definite order without any fancy cycle-breaking rules.
# Validation

I can’t promise that these semantics will give a reasonable result in every situation. But following the advice in Composition Techniques, we can check all interesting pairs of concurrent operations, then trust that the general case at least satisfies strong convergence (by composition).

The following figure shows what I believe will happen in various concurrent situations. In each case, it matches my preferred behavior. Lines indicate blocks, while A/B/C indicate chunks of text.

[Figure: Split vs Split: ABC to (A BC / AB C) to A B C. Merge vs Merge: A B C to (A BC / AB C) to ABC. Split vs Merge 1: AB C to (A B C / ABC) to A BC. Split vs Merge 2: A BC to (A B C / ABC) to AB C. Move vs Merge: A B C to (B A C / A BC) to BC A. Move vs Split: A BC to (BC A / A B C) to B C A.]

# Conclusion

A CRDT-based app’s semantics describe what the state of the app should be, given a history of collaborative operations. Choosing semantics is ultimately a per-app problem, but the CRDT literature provides many ideas and examples.

This blog post was long because there are indeed many techniques. However, we saw that they are mostly just a few basic ideas (UIDs, list CRDT positions, LWW, multi-value register) plus composed examples.

It remains to describe algorithms implementing these semantics. Although we gave some basic CRDT algorithms in-line, there are additional nuances and optimizations. Those will be the subject of the next post, Part 3: Algorithmic Techniques.

This blog post is Part 2 of a series.

- Part 1: Introduction
- Part 2: Semantic Techniques
- Part 3: Algorithmic Techniques
- Part 4: Further Topics
# Reorient GitHub Pull Requests Around Changesets

I've had the experience of using GitHub as a maintainer for very large open source projects (1000+ contributors), as an engineer for very large closed source corporate projects, and everything smaller. Through those experiences up to today, GitHub pull requests is where I spend almost all of my time while on GitHub, and to me it's also unfortunately the most frustrating part of GitHub.

There are a lot of improvements I would love to see with pull requests, but a massive chunk of my problems would be solved through one major feature: changesets. This blog post describes this suggestion and what I would love to see.

Disclaimer: My ideas here are not original! I do not claim to have come up with these ideas. My suggestions here are based on well-explored Git workflows and also are partially or in full implemented by other products such as Gerrit, Phabricator, or plain ol' email-based patch review.

# The Problem Today

The lifecycle of a GitHub pull request today is effectively one giant mutable changeset. This is a mess!

Here is a typical PR today: A contributor pushes a set of commits to a branch, opens a PR, and the PR now represents that branch. People discuss the PR through comments. When the contributor pushes new changes, they show up directly on the same PR, updating it immediately. Reviewers can leave comments and the contributor can push changes at the same time, and it all updates the same PR.

This has many problems:

1. A reviewer can leave a review for a previous state the PR was in and it can become immediately outdated because while the review was happening the contributor pushed changes.
2. Worse, a review can become partially outdated and the other feedback may not make sense in the context of the changes a contributor pushed. For example, a line comment may say "same feedback as the previous comment" but the previous comment is now gone/hidden because the contributor pushed a change that moved those lines.¹
3. Reviews don't contain any metadata about the commit they were attached to, only a timestamp they were submitted. A user can roughly correlate timestamp to commit but it isn't totally accurate because if a commit comes in during a review, the timestamp will seem to imply it was after the most recent commit but your review may have been against the prior commit. 😕
4. Work-in-progress commits towards addressing review feedback become visible as soon as the branch is pushed. This forces contributors to address all feedback in a single commit, or for reviewers to deal with partially-addressed feedback.
5. You can't easily scrub to prior states of the PR. If you want to review a set of earlier commits while ignoring later commits, you either have to manually build a "compare" view or use a local checkout (I do the latter). But through either approach you only get the code changes, you don't also get the point-in-time reviewer feedback!
6. Similar to the above, if a contributor pushes multiple new commits, you can't easily compare the new set of commits to the old. You can only really scrub one commit at a time. For this, you again have to fall back to local git to build up a diff manually.

And more... I talk about some more later, but I think I've made my point.

I'm sure I'm wrong about some detail about some of the points above. Someone is likely to say "he could've just done this to solve problem 5(a)". That's helpful! But, the point I'm trying to make is that if you step back, the fundamentals causing these problems are the real issue.
Namely, a single mutable changeset tracking a branch on a per-commit basis.

# Changesets

The solution is changesets: A pull request is versionable through a monotonic number (v1, v2, ...). These versions are often called "changesets."

Each changeset points to the state of a branch at a fixed time. These versions are immutable: when new commits are pushed, they become part of a new changeset. If the contributor force pushes the branch, that also becomes part of a new changeset. The previous changeset is saved forever.

A new changeset can be published immediately (per commit) or it can be deferred until the contributor decides to propose a new version for review. The latter allows a contributor to make multiple commits to address prior feedback and only publish those changes when they feel ready.

In the world of changesets, feedback is attached to a changeset. If a reviewer begins reviewing a changeset and a new changeset is published, that's okay because the review as an atomic unit is attached to the prior changeset.

In future changesets, it is often useful to denote that a file or line has unresolved comments in prior changesets. This ensures that feedback on earlier changesets is not lost and must be addressed before any changeset is accepted.

Typically, each changeset is represented by a different Git ref. For example, GitHub pull requests today are available as refs/pull/1234/head and you can use git locally to check out any pull request this way. A changeset would be something like refs/pull/1234/v2 (hypothetical) so you can also check out individual changesets.

Instead of "approving" a PR and merging, reviewers approve a changeset. This means that the contributor can also post multiple changesets with differing approaches to a problem in a single PR and the maintainer can potentially choose a non-latest changeset as the set of changes they want to merge.

# GitHub, Please!

Changesets are a well-established pattern across many open source projects and companies. They're already a well-explored user experience problem in existing products like Gerrit and Phabricator. I also believe changesets can be introduced in a non-breaking way (since current PRs are like single-mutable-changeset mode).

Changesets would make pull requests so much more scalable for larger projects and organizations. Besides the scalability, they make the review process cleaner and safer for both parties involved in pull requests.

Of course, I can only speak for myself and my experience, but this single major feature would dramatically improve my quality of life and capabilities while using GitHub.²

# Footnotes

1. This is a minor, inconvenient issue, but this issue scales up to a serious problem.
2. "Just don't use GitHub!" I've heard this feedback before. There are many other reasons I use GitHub today, so this is not a viable option for me personally right now. If you can get away with not using GitHub, then yes you can find changeset support in other products.
# CRDT Survey, Part 4: Further Topics

Matthew Weidner | Feb 13th, 2024

Keywords: CRDTs, bibliography

This blog post is Part 4 of a series.

- Part 1: Introduction
- Part 2: Semantic Techniques
- Part 3: Algorithmic Techniques
- Part 4: Further Topics

# Further Topics

This post concludes my CRDT survey by describing further topics that are outside my focus area. It is not a proper survey of those topics, but instead a lightweight bibliography, with links to a relevant resource or two. To find more papers about a given topic, see crdt.tech/papers.html.

My previous posts made several implicit assumptions, which I consider part of the “traditional” CRDT model:

1. All operations only target eventual consistency. There is no mechanism for coordination/strong consistency (e.g. a central server), even for just a subset of operations.
2. Users always wish to synchronize all data and all operations, as quickly as possible. There is no concept of partial access to a document, or of sync strategies that deliberately omit operations.
3. All devices are trustworthy and follow the given protocol.

The further topics below concern collaborative apps that violate at least one of these assumptions. I’ll end with some alternatives to CRDT design and an afterword.

# Table of Contents

- Incorporating Strong Consistency: Mixed Consistency • Server-Assisted CRDTs
- Incomplete Synchronization: Undo • Partial Replication • Versions and Branches
- Security: Byzantine Fault Tolerance • Access Control
- Alternatives to CRDT Design: Operational Transformation • Program Synthesis
- Afterword: Why Learn CRDT Theory?

# Incorporating Strong Consistency

Strong consistency is (roughly) the guarantee that all replicas see all operations in the same order. This contrasts with CRDTs’ eventual consistency guarantee, which allows different replicas to see concurrent operations in different orders. Existing work explores systems that mix CRDTs and strong consistency.

- “Consistency in Non-Transactional Distributed Storage Systems”, Viotti and Vukolić (2016). Provides an overview of various consistency guarantees.

# Mixed Consistency

Some apps need strong consistency for certain operations but allow CRDT-style optimistic updates for others. For example, a calendar app might use strong consistency when reserving a room, to prevent double-booking. Several papers describe mixed consistency systems or languages that are designed for such apps.

- “Making Geo-Replicated Systems Fast as Possible, Consistent when Necessary”, Li et al. (2012). Describes a system that offers “RedBlue consistency”, in which operations can be marked as needing strong consistency (red) or only eventual consistency (blue).

# Server-Assisted CRDTs

Besides providing strong consistency for certain operations, a central server can provide strong consistency for parts of the protocol while clients use CRDTs for optimistic local updates. For example, the server can decide which operations are valid or invalid, assign a final total order to operations, or coordinate “garbage collection” actions that are only possible in the absence of concurrent operations (e.g., deleting tombstones).

- “Replicache”, Rocicorp (2024). Client-side sync framework that uses a server to assign a final total order to operations, but allows clients to perform optimistic local updates using a “rebase” technique.
- “Making Operation-Based CRDTs Operation-Based”, Baquero, Almeida, and Shoker (2014). Defines causal stability: the condition that there can be no further operations concurrent to a given operation.
Causal stability is usually necessary for garbage collection actions, but tricky to ensure in a collaborative app where replicas come and go.

# Incomplete Synchronization

# Undo

Part 2 described exact undo, in which you apply your normal semantics to the operation history less undone operations. The same principle applies to other situations where you deliberately omit operations: rejected/reverted changes, partial replication, “cherry-picking” across branches, etc.

However, applying your normal semantics to “the operation history less undone operations” is more difficult than it sounds, because many semantics and algorithms assume causal-order delivery. For example:

- There are two common ways to describe an Add-Wins Set’s semantics, which are equivalent assuming causal-order delivery but inequivalent without it.
- To compare list CRDT positions, you often need some metadata (e.g., an underlying Fugue tree), which is not guaranteed to exist if you are missing prior operations.

It is an open problem to formulate undo-tolerant variants of various CRDTs.

- “Supporting Undo and Redo for Replicated Registers in Collaborative Applications”, Brattli and Yu (2021). Describes an undo-tolerant variant of the multi-value register.

# Partial Replication

Previous posts assumed that all collaborators always load an entire document and synchronize all operations on that document. In partial replication, replicas only store and synchronize parts of a document. Partial replication can be used as part of access control, or as an optimization for large documents - clients can leave most of the document on disk or on a storage server.

Existing CRDT libraries (Yjs, Automerge, etc.) do not have built-in partial replication, but instead encourage you to split collaborative state into multiple documents if needed, so that you can share or load each document separately. However, you then lose causal consistency guarantees between operations in different documents. It is also challenging to design systems that perform the actual replication, e.g., intelligently deciding which documents to fetch from a storage server.

- “Conflict-Free Partially Replicated Data Types”, Briquemont et al. 2015. Describes a system with explicit support for partial replication of CRDTs.
- “Automerge Repo”, Automerge contributors (2024). “Automerge Repo is a wrapper for the Automerge CRDT library which provides facilities to support working with many documents at once”.

# Versions and Branches

In a traditional collaborative app, all users want to see the same state, although they may temporarily diverge due to network latency. Version control systems instead support multiple branches that evolve independently, except during explicit merges. CRDTs are a natural fit for version control because their concurrency-aware semantics may lead to better merge results (operations in parallel branches are considered concurrent).

The original local-first essay already discussed versions and branches on top of CRDTs. Recent work has started exploring these ideas in practice.

- “Upwelling: Combining real-time collaboration with version control for writers”, McKelvey et al. (2023). A Google Docs-style editor with added version control features that uses CRDTs.
- “Proposal: Versioned Collaborative Documents”, Weidner (2023). A workshop paper where I propose an architecture that combines features of git and Google Docs.

# Security

# Byzantine Fault Tolerance

Traditional CRDTs assume that all devices are trustworthy and follow the given protocol.
A malicious or buggy device can easily deviate from the protocol in a way that confuses other users.

In particular, a malicious device could assign the same UID to two versions of an operation, then broadcast each version to a different subset of collaborators. Collaborators will likely process only the first version they receive, then reject the second as redundant. Thus the group will end up in a permanently inconsistent state.

A few papers explore how to make CRDTs Byzantine fault tolerant so that these inconsistent states do not occur, even in the face of arbitrary behavior by malicious collaborators.

- “Byzantine Eventual Consistency and the Fundamental Limits of Peer-to-Peer Databases”, Kleppmann and Howard (2020).
- “Making CRDTs Byzantine fault tolerant”, Kleppmann (2022).

# Access Control

In a purely local-first setting, access control - controlling who can read and write a collaborative document - is tricky to implement or even define. For example:

- If a user loses access to a collaborative document, what happens to their local copy?
- How should we handle tricky concurrent situations, such as a user who performs an operation concurrent to losing access, or two admins who demote each other concurrently?

A few systems implement protocols that handle these situations.

- “Matrix Decomposition: Analysis of an Access Control Approach on Transaction-based DAGs without Finality”, Jacob et al. (2020). Describes the access control protocol used by the Matrix chat network.
- “@localfirst/auth”, Caudill et al. (2024). “@localfirst/auth is a TypeScript library providing decentralized authentication and authorization for team collaboration, using a secure chain of cryptographic signatures.”

# Alternatives to CRDT Design

# Operational Transformation

In a collaborative text document, when one user inserts or deletes text, they shift later characters’ indices. This would cause trouble for other users’ concurrent editing operations if you interpreted their original indices literally:

[Figure: The *gray* cat jumped on **the** table. Alice typed " the" at index 17, but concurrently, Bob typed " gray" in front of her. From Bob's perspective, Alice's insert should happen at index 22.]

The CRDT way to fix this problem is to use immutable list CRDT positions instead of indices. Another class of algorithms, called Operational Transformation (OT), instead “transform” indices that you receive over the network to account for concurrent operations. So in the above figure, Bob would receive “Index 17” from Alice, but add 5 to it before processing the insertion, to account for the 5 characters that he concurrently inserted in front of it.
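To illustrate the transformation step, here is a minimal sketch for the insert-vs-insert case. The types and index values are my own assumptions; real OT systems also handle deletes, ties between equal indices, and server ordering.

```ts
// Transform a concurrent insert's index (insert-vs-insert case only).
type Insert = { index: number; text: string };

// Transform op `a` so it can be applied after concurrent op `b`.
function transformInsert(a: Insert, b: Insert): Insert {
  if (b.index <= a.index) {
    // b's text landed at or before a's position: shift a to the right.
    return { index: a.index + b.text.length, text: a.text };
  }
  return a;
}

// Bob transforms Alice's insert (" the" at index 17) against his own
// concurrent 5-character insert (" gray"): it becomes an insert at index 22.
const transformed = transformInsert(
  { index: 17, text: " the" },
  { index: 4, text: " gray" }
);
console.log(transformed.index); // 22
```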
I personally consider list CRDT positions to be the simpler mental model, because they stay the same over time. They also work in arbitrary networks. (Most deployed OT algorithms require a central server; decentralized OT exists but is notoriously complicated.) Nonetheless, OT algorithms predate CRDTs and are more widely deployed - in particular, by Google Docs.

- “An Integrating, Transformation-Oriented Approach to Concurrency Control and Undo in Group Editors”, Ressel, Nitsche-Ruhland, and Gunzenhäuser (1996). A classic OT paper.
- “Enhancing rich content wikis with real-time collaboration”, Ignat, André, and Oster (2017). Describes a central-server OT algorithm for block-based rich text.

# Program Synthesis

Traditional CRDT papers are algorithm papers that design a CRDT by hand and prove eventual consistency with pen-and-paper. Some more recent papers instead try to synthesize CRDTs automatically from some specification of their behavior - e.g., a sequential (single-threaded) data structure plus some invariants that it should preserve even in the face of concurrency.

Note that while synthesis may be easier than designing a CRDT from scratch, there is little guarantee that the synthesized semantics are reasonable in the eyes of your users. Indeed, some existing papers choose rather unusual semantics.

- “Katara: Synthesizing CRDTs with Verified Lifting”, Laddad et al. 2022. Synthesizes a CRDT from a specification by searching a space of composed state-based CRDTs until it finds one that works.

# Afterword: Why Learn CRDT Theory?

I hope you’ve enjoyed reading this blog series. Besides satisfying your curiosity, though, you might wonder: Why learn this in the first place?

Indeed, one approach to CRDTs is to let experts implement them in a library, then use those implementations without thinking too hard about how they work. That is the approach we take for ordinary local data structures, like Java Collections. It works well when you use a CRDT library for the task it was built for - e.g., Yjs makes it easy to add central-server live collaboration to various rich-text editors.

However, in my experience so far - both using and implementing CRDT libraries - it is hard to use a CRDT library in any way that the library creator didn’t anticipate. So if your app ends up needing some of the Further Topics above, or if you need to tune the collaborative semantics, you may be forced to develop a custom system. For that, CRDT theory will come in handy. (Though still consider using established tools when possible - especially for tricky parts like list CRDT positions.)

Even when you use an existing CRDT library, you might have practical questions about it, like:

- What invariants can I expect to hold?
- What happens if <…> fails?
- How will it perform under different workloads?

Understanding a bit of CRDT theory will help you answer these questions.

This blog post is Part 4 of a series.

- Part 1: Introduction
- Part 2: Semantic Techniques
- Part 3: Algorithmic Techniques
- Part 4: Further Topics
# CRDT Survey, Part 1: Introduction

Matthew Weidner | Sep 26th, 2023

Keywords: CRDTs, collaborative apps

This blog post is Part 1 of a series.

- Part 1: Introduction
- Part 2: Semantic Techniques
- Part 3: Algorithmic Techniques
- Part 4: Further Topics

# What is a CRDT?

Suppose you’re implementing a collaborative app. You’ve heard that Conflict-free Replicated Data Types (CRDTs) are a good fit and you want to know more about them.

If you look up the definition of CRDT, you will find that there are two main kinds, “op-based” and “state-based”, and these are defined using mathematical terms like “commutativity”, “semilattice”, etc. This is probably already more complicated than your mental model of a collaborative app, and I imagine it can be intimidating.

Let’s step back a bit and think about what you’re trying to accomplish.

In a collaborative app, users expect to see their own operations immediately, without waiting for a round-trip to a central server. This is especially true in a local-first app, where users can make edits even when they are offline, or when there is no central server.

Immediate local edits make it possible for users to perform operations concurrently: logically simultaneously, with no agreement on the order of operations. Those users will temporarily see different states. Eventually, though, they will synchronize with each other, combining their operations.

At that point, the collaborative app must decide: What is the state resulting from these concurrent operations? Because the operations were logically simultaneous, we shouldn’t just pretend that they happened in some sequential order. Instead, we need to combine them in a way that matches users’ expectations.

# Example: Ingredient List

Here is a simple example. The app is a collaborative recipe editor (Collabs demo), which includes an ingredient list:

[Figure: A list of ingredients: "Head of broccoli", "Oil", "Salt".]

Suppose that:

1. One user deletes the first ingredient (“Head of broccoli”).
2. Concurrently, a second user edits “Oil” to read “Olive Oil”.

We could try broadcasting the non-collaborative version of these operations: Delete ingredient 0; Prepend "Olive " to ingredient 1. But if another user applies those operations literally in that order, they’ll end up with “Olive Salt”:

[Figure: The wrong result: "Oil", "Olive Salt".]

Instead, you need to interpret those operations in a concurrency-aware way: Prepend "Olive" applies to the “Oil” ingredient regardless of its current index.

[Figure: The intended result: "Olive Oil", "Salt".]

# Semantics

In the above example, the users’ intended outcome was obvious. You can probably also anticipate how to implement this behavior: identify each ingredient by a unique ID instead of its index.
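A minimal sketch of that idea (the types and helper function are illustrative, not from the Collabs demo):

```ts
// Identify ingredients by unique ID, so a concurrent edit applies to the
// right ingredient even if its index changed (e.g., after a delete).
type Ingredient = { id: string; text: string };

// "Prepend" targets an ID, not an index; it is a no-op if the ingredient
// was deleted concurrently.
function prepend(list: Ingredient[], id: string, prefix: string): void {
  const ing = list.find((i) => i.id === id);
  if (ing !== undefined) ing.text = prefix + ing.text;
}
```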
Other situations can be more interesting. For example, starting from the last recipe above, suppose the two users perform two more operations:

1. The first user increases the amount of salt to 3 mL.
2. Concurrently, the second user clicks a “Halve the recipe” button, which halves all amounts.

We’d like to preserve both users’ edits. Also, since it’s a recipe, the ratio of amounts is more important than their absolute values. Thus you should aim for the following result:

[Figure: An ingredients list starts with 15 mL Olive Oil and 2 mL Salt. One user edits the amount of Salt to 3 mL. Concurrently, another user halves the recipe (7.5 mL Olive Oil, 1 mL Salt). The final state is: 7.5 mL Olive Oil, 1.5 mL Salt.]

In general, a collaborative app’s semantics are an abstract description of what the app’s state should be, given the operations that users have performed. This state must be consistent across users: two users who are aware of the same operations must be in the same state.

Choosing an app’s semantics is more difficult than it sounds, because they must be well-defined in arbitrarily complex situations. For example, if a third user performs a bunch of operations concurrently to the four operations above, the collaborative recipe editor must still end up in some consistent state - hopefully one that makes sense to users.

# Algorithms

Once you’ve chosen your collaborative app’s semantics, you need to implement them, using algorithms on each user device.

For example, suppose train conductors at different doors of a train count the number of passengers boarding. When a conductor sees a passenger board through their door, they increment the collaborative count. Increments don’t always synchronize immediately due to flaky internet; this is fine as long as all conductors eventually converge to the correct total count.

The app’s semantics are obvious: its state should be the number of increment operations so far, regardless of concurrency. Specifically, an individual conductor’s app will display the number of increment operations that it is aware of.

# Op-Based CRDTs

Here is one algorithm that implements these semantics:

- Per-user state: The current count, initially 0.
- Operation inc(): Broadcast a message +1 to all devices. Upon receiving this message, each user increments their own state. (The initiator also processes the message, immediately.)

[Figure: Top user performs inc(), changing their state from 0 to 1. They broadcast a "+1" message to other users. Upon receipt, those two users each change their states from 0 to 1.]

It is assumed that when a user broadcasts a message, it is eventually received by all collaborators, without duplication (i.e., exactly-once). For example, each user could send their messages to a server; the server stores these messages and forwards them to other users, who filter out duplicates.

Algorithms in this style are called op-based CRDTs. They work even if the network has unbounded delays or delivers messages in different orders to different users. Indeed, the above algorithm matches our chosen semantics even under those conditions.
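In code, the op-based counter might look like this minimal sketch, assuming the network layer provides the broadcast callback and exactly-once delivery:

```ts
// Sketch of the op-based counter CRDT described above.
class OpBasedCounter {
  count = 0; // per-user state

  constructor(private broadcast: (msg: "+1") => void) {}

  // Operation inc(): broadcast "+1"; every replica (including this one)
  // processes the message via receive().
  inc(): void {
    this.broadcast("+1");
  }

  // Called exactly once for every "+1" message, local or remote.
  receive(msg: "+1"): void {
    this.count++;
  }
}
```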
# State-Based CRDTs

Op-based CRDTs can also be used in peer-to-peer networks without any servers. However, usually each user needs to store a complete message history, in case another user requests old messages. (It is a good idea to let users forward each other’s messages, not just their own.) This has a high storage cost: O(total count) in our example app.

A state-based CRDT is a different kind of algorithm that sometimes reduces this storage cost. It consists of:

- A per-user state. Implicitly, this state encodes the set of operations that the user is aware of, but it is allowed to be a lossy encoding.
- For each operation (e.g. inc()), a function that updates the local state to reflect that operation.
- A merge function that inputs a second state and updates the local state to reflect the union of sets of operations:

[Figure: (Set of +1s mapping to "Current state") plus (set of +1s mapping to "Other state") becomes (set of +1s mapping to "Merged state"). Each state has a box with color-coded +1 ops: Current state has orange, red, green; Other state has orange, red, purple, blue, yellow; Merged state has orange, red, green, purple, blue, yellow.]

We’ll see an optimized state-based counter CRDT later in this blog series. But briefly, instead of storing the complete message history, you store a map from each device to the number of inc() operations performed by that device. This encodes the operation history in a way that permits merging (entrywise max), but has storage cost O(# devices) instead of O(total count).
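A minimal sketch of that optimized state-based counter (the class and field names are my own):

```ts
// State-based counter: a map from each device to its number of inc() ops,
// merged by entrywise max.
class StateBasedCounter {
  private counts = new Map<string, number>(); // deviceID -> # of inc() ops

  constructor(private readonly deviceID: string) {}

  inc(): void {
    this.counts.set(this.deviceID, (this.counts.get(this.deviceID) ?? 0) + 1);
  }

  // Merge another replica's state: entrywise max over all devices. This
  // reflects the union of the two sets of operations.
  merge(other: StateBasedCounter): void {
    for (const [device, count] of other.counts) {
      this.counts.set(device, Math.max(this.counts.get(device) ?? 0, count));
    }
  }

  get value(): number {
    let total = 0;
    for (const count of this.counts.values()) total += count;
    return total;
  }
}
```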
Unless you need state-based merging, the counting app doesn’t really require specialized knowledge - it just counts in the obvious way. But you can still talk about the “op-based counter CRDT” if you want to sound fancy.

# Defining CRDTs

One way to define a CRDT is as “either an op-based CRDT or a state-based CRDT”. In practice, we can be more flexible. For example, CRDT libraries usually implement hybrid op-based/state-based CRDTs, which let you use both op-based messaging and state-based merging in the same app.

I like this broader, informal definition: A CRDT is a distributed algorithm that computes a collaborative app’s state from its operation history. This state must depend only on the operations that users have performed: it must be the same for all users that are aware of the same operations, and it must be computable without extra coordination or help from a single source-of-truth. (Message forwarding servers are okay, but you’re not allowed to put a central server in charge of the state like in a traditional web app.)

Of course, you can also use CRDT techniques even when you do have a central server. For example, a collaborative text editor could use a CRDT to manage the text, but a server DB to manage permissions.

# Outline of this Survey

In this blog series, I will survey CRDT techniques for collaborative apps. My goal is to demystify and summarize the techniques I know about, so that you can learn them too without a PhD’s worth of effort.

The techniques are divided into two “topics”, corresponding to the sections above:

- Semantic techniques help you decide what a collaborative app’s state should be.
- Algorithmic techniques tell you how to compute that state efficiently.

Part 2 (the next post) covers semantic techniques; I’ll cover algorithmic techniques in Part 3.

Since I’m the lead developer for the Collabs CRDT library, I’ll also mention where various techniques appear in Collabs. You can find a summary in the Collabs docs.

This survey is opinionated: I omit or advise against techniques that I believe don’t translate well to collaborative apps. (CRDTs are also used in distributed data stores, which have different requirements.) I also can’t hope to mention every CRDT paper or topic area; for that, see crdt.tech. If you believe that I’ve treated a technique unfairly - including your own! - please feel free to contact me: mweidner037 [at] gmail.com.

I thank Martin Kleppmann, Jake Teton-Landis, and Florian Jacob for feedback on portions of this survey. Any mistakes or bad ideas are my own.

# Sources

I’ll cite specific references in-line, but here are some sources of general inspiration.

- Pure Operation-Based Replicated Data Types, Carlos Baquero, Paulo Sergio Almeida, and Ali Shoker (2017).
- How Figma’s multiplayer technology works, Evan Wallace (2019).
- Tackling Consistency-related Design Challenges of Distributed Data-Intensive Systems - An Action Research Study, Susanne Braun, Stefan Deßloch, Eberhard Wolff, Frank Elberzhager, and Andreas Jedlitschka (2021).
- A Framework for Convergence, Matt Wonlaw (2023).

Next Post - Part 2: Semantics

This blog post is Part 1 of a series.

- Part 1: Introduction
- Part 2: Semantic Techniques
- Part 3: Algorithmic Techniques
- Part 4: Further Topics
so uh yeah my unpopular opinion is thatgit is awful and it needs to die die diedie wowwhy why why why why yeah well because itdoesn't scale uh among other things alsoit's completely unintuitive and it'shonestly it's it'soh God you're gonna get me ranting uhlook first of all I've seenI don't get the unintuitive part I guessbecause I used SVN and perforce thosedon't feel very intuitive eitheras fian emotionally hurt me I did lovetortoise thoughtortoise SVNand a company that bisco and I bothworked atcomputers unlimited which sounds like a1970s computer repair shop so theyrenamed it Tim's which isnow sounds like a fishing shopumthey had a evenrapper around everything calledtrebuchet which everyone just calledtree bucketand I hated it I hated my life I hatedit I hate it I don't want to go back toSVNthe thing is is that I don't mindspecifically on something like this Idon't mind if you on it but I wantto hear a good alternative all rightlet's let's see some Alternatives herevisions of control systems come and goright I started with RCS and then CVSand then SVN and then pert force and itwent on and on Piper at Google and andthen get in Mercurio and I mean get wasjust another one and it had greatmarketing it had great it had some sortof great virality it's really kind ofgarbage the whole thing is just it'svery powerful and flexible but uh and itdoesn't scale like fundamentally all thecompanies that we work with that use githave you know maybe a hundred thousandgit repos what what are you gonna dowith a hundred thousand git repos youknow Android struggled with thismightily when I was uh on the Androidteam at Google right I mean just we hadall these huge rappers around git todeal with multiple repos because of theopen source problem and you know theinternal stuff and I just learned tohate git but is that a git problem wasthat a git problem or was that not a gitproblem you have part of first off youhave part of your thing that's opensource part of it that's not open sourceby the way the not open source partthat's the spying part that's the NSAbackdoor issue you know what I'm talkingabout that's the back door partum anyway so when you're hiding yourback door now you're trying to have liketwo repos that are going togetherI mean that's always difficult like isthat ever easy I don't think it issub modulesdid you just say the word sub modulesub module is the world's greatest ideathat is emotionally painful every timeit happensnobody likes sub moduleseverybody thinks sub modules are goingto be good nobody loves sub moduleswe still use sub modulesI hate it hereI hate git and I think that there's anunhealthy dependence on both git andGitHub I think that's fair I think theGitHub Reliance thing is actually a realthat that's a fair critique which is nowwe we really just wrapped up a singlepoint of failureum and it's also owned by Microsoft so Idon't trust it at all now I'm justwondering when the next monetization isgoing to come out from it other thanusing all my sweet sweet programswhether or not my license says not tofor co-pilot but besides for that I'mjust sayingGitHub is a monopoly they're closedfundamentally I think that that microMicrosoft under Satya has been a veryopen ecosystem right look at vs codeyeah right look at I mean they've beensome really cool stuff but GitHub Idon't I don't trust this guy's opinionanymore this is getting hard for me toreally love this opinionbecause I'm just saying that's that's ahard opinion to believe that it's justreally open when a copilot X Works onlyfor vs code uh 
...they're building these, like, spaces online and everything. They're just trying to take your computer, take everything, and just charge you monthly for it. I don't trust Microsoft one bit. It's an acquisition, and they're still very "oh, we're just doing free, and we love..." [Music] Look, really: when has a company ever been really great for no reason? Come on. Very, very closed off. And I think developers like GitHub too much. I mean, maybe the alternatives aren't that great, but I see this... you like it just too much. Too much liking going on. It will stop, from here on out.

I do get what he's saying, though. I do get what he's saying. I mean, the thing is, it's a good product. Have you used Stash? Stash is not that much fun, okay? I haven't used GitLab; I don't have enough use with GitLab to really understand it.

"...its attachment to it. And I'm like: stuff changes, and you're not holding a high enough bar. It's not good enough. You should never be satisfied with the state of the art if it's killing you, and right now the state of the art is killing us."

How's it killing us? "Yeah: 100,000 repos, every company." Okay, that's the "killing us", but why is that bad? I gotta understand: why is having multiple repos bad? Why is having Netflix's Python (or whatever the hell the language is) repo, the one designed to help produce algorithmic recommendations, separated from the UI library... why is that good or bad? Right? To me, I don't see why they have to be the same code base. I don't even see why they're related.

"But who's feeling that pain?" I'm not feeling that pain. "Well, you don't have that many repos. Is anybody working on this? Who's working on this? Well, Facebook just launched, what is it, Sapling, right? Which looks kind of promising, although it didn't start getting the acceleration that I was looking at. But changing code hosts is a big deal for people." "You guys want an unpopular opinion? I'm giving you, like, potentially the most..." "No, no, we're with you. This is part of the game here. We're playing the game. I'm enjoying this. I'm considering it. I do like GitHub. I'm wondering: you said maybe they like it too much, and I'm thinking the product is good, though. So that's why I like it. It's good. It's decent." "You haven't used Google's tools." "True."

Checkmate. I'm curious about Google's internal tools. I've never worked at Google. It'd be fun to actually use them: what does a company that builds everything themselves look like, right? "I know you've never used Google tools, i.e. you never passed Google's interview, i.e. wrecked." Draw the little square. Proved. But real talk: I would like to try out Google's tools. I've never worked at Google; I think it'd be a lot of fun. Ooh, could be cool. I think I'd get bored pretty quick at Google though, real talk. I think I'd get pretty, pretty bored.

So if Google's tools are that much better, then why doesn't Google make a better revisioning system? And then why doesn't Google create the hosting for it, and why doesn't Google just simply make all the monies from it? Right? There has to be a lot of monies in it. And you've got Google Bard, the subpar AI. Why not just have a hosting service for free and then slurp up all the data from it?

"Some people need a Sourcegraph for their Sourcegraph." Sourcegraph, I would argue, is... you know, because the folks at Sourcegraph actually love GitHub and haven't used Google's tools. I mean, Sourcegraph is better than GitHub in a lot of ways, but Sourcegraph doesn't try to be GitHub, with all the workflows and all that stuff, right?

TJ: wrecked. TJ is in shambles. TJ, are you in shambles right now? Shambled TJ. Can we get a shout-out for TJ? He works at Sourcegraph. "Why would I be in shambles?" Because you're just, like, a not-as-good GitHub, is what I take from this, and GitHub's killing people, so that means you're killing people faster.

"What I mean by that is: Beyang worked there, and Quinn worked there. The history of knowing Google tools, using Google tools, and then being an expat of Google and then doing something without it; that's what I mean. They were the inspiration." Right. "We need somebody to say: okay, I've been in Google, and I've used Google tooling, and we need a non-GitHub that is Google tooling, but better. A startup that knows Google's tools and can then recreate them." Yeah. Who's gonna do that? Who? I think doing that sounds unfun, I'm just gonna throw it out there. I don't think I would like doing that, personally. Who would you bet on to do that? You mean, to come up with a Google-style tool?

I really wish you would qualify what a Google tool is, because I think that would help a lot of us try to understand exactly what he means about why Google is better. I'd love an example, or a diagram, or something; something so I understand why it's better, because I guess I'm a little confused as to why it's better. "Think Angular." I don't want to think about that. Come on, man. That's my... dreams. Oh, I see. So this is not the beating; this might be the beginning of a new story. Sorry. By the way, his volume is very low; it's been hard, I'm trying to get it as good as I can.

"It could be. I like that. So git isn't good enough, and I think you said GitHub is bad. I'm just trying to think of how..."

I would love a compare feature that doesn't require me to have a pull request and then manipulate the URL. Just saying, that'd be kind of nice. Just give me a button that says compare, give me a button that says diff, if I was going to put this into it.

"We're not going to try to... if you've been trying to read the tea leaves: I'm not trying to tackle GitHub or anything like that right now, Karen. Unfortunately, it's the least bad, right? It's the least of all the bad options." Least bad of all, right there. "But I truly believe it could be a lot better, and that AI is going to make it a lot better. And, you know, I love being at Sourcegraph because we can actually bring that experience across."

"There are still Bitbucket users in the world. I'm one." I mean, to be fair, I use Stash, which is, like... that is Bitbucket, so I get it. Netflix is all on Bitbucket, so, you know, I deal with them. Honestly, Bitbucket's not all that bad. We have a good setup. You know, if I want some automation, I click a bunch of buttons on Jenkins for about 20 minutes until I find the job that I actually want, because their search is really confusing and we have multiple subdomains for different builds. Once I find the build I want, I copy the build and the configuration, then when I get the configuration I update it to point at the new Stash URL, and then boom: I got CI. CI, you know what I mean? It's pretty neat.

"Not that Bitbucket's great either, but have you used Fossil? Fossil SCM, from Richard Hipp of SQLite? It's a completely different way of thinking about it. Maybe give that a look. Fossil: you never commit, right? Like, you commit, but everything's always synchronized, around every piece. It's still distributed, but it's always synchronized. It's never on your machine only; you never have to git push master, it's just there. I don't know if it scales or not, but they use it for SQLite."

"Yeah, I think fundamentally we need somebody who comes at this from the perspective of: we need to make this scale up to world-scale code bases. And I think that will ultimately come out of industry. I think Facebook's Sapling might be the closest, but we'll see. Or ex-Google: somebody who leaves Google and says, I need a company. You know what was good? Google's tools. You know what's bad? Git. And I'm gonna try to tackle this. That could happen."

I still want to know what makes it better. Just give me a three-second explanation: what's the thing? Gosh, I'm so curious. Is this actually... is Google hiring a guy to secretly get people to want to work at Google, by not telling them, but telling them how great their tools are? Is this like a psyop of the greatest amount? Because I'm feeling very psyopped right now. I'm feeling super psyopped. All of a sudden I have this great desire to go work at Google, to go touch their tools. I'm getting foreplayed. This is foreplay.

"What are Google's tools in this case? Like, what do they have that would be better, the scale version of it? Can you describe it, or is it under, like, NDA and you can't tell anything about it forever?" I got foreplayed for five minutes straight and now we're hearing about it. All right.

"Yeah, it's just that the code graph gets exposed across the entire workflow. So on GitHub, if you see a symbol in a pull request, you can't click on it, or hover it, or get graph information about it like in your IDE. They don't have IDE-quality indexing. When you're looking at trace logs or debug views or whatever, all of that stuff is completely and fully instrumented at Google. So in other words, their ability to discover and diagnose problems is just unprecedented. There's a Gestalt to it that's really hard to get across."

That sounds like what Sourcegraph's doing. He works at Sourcegraph, right? I mean, I love that idea. I would like that. But to me, what he just described isn't a git problem specifically. Isn't it a tool on top of git? Because is the revision system the real thing that's needed, is that what causes it, or is this actually just a tool on top of code, to be smarter about everything? It feels like it's a thing we've abstracted over git; you can use other version control systems as well. Yeah, that's what I mean: is the version control really the place this should live to begin with? It kind of seems like this is something bigger, right? This is LSP for the future.

How have there been three golden Kappas in this chat? Why are people getting golden Kappas? I never get a golden Kappa. Why is my Kappa never golden? "Okay, yes, that's also what my team works on. We've talked about this for a while." I know, I know, we've talked about this. But that's what I'm saying: I don't get the dislike for git, I guess. That's where I'm struggling. I don't see the connection between git and what he just described. Maybe one could argue that you could... I mean, could you store language-level features in a revision system? Because that's kind of what I see: there's a revision system which would have to store stuff about what's happening. Because how is it supposed to understand JavaScript, or the linking between two libraries that are internal? Like, why would you want to do that? That's why I love Sourcegraph, right?

I want more Sourcegraph in my life, not less: more. I want more TJ. "But it's a very comfy environment, by the way." TJ, if I ever decide to try to get a different job, I would apply to Sourcegraph. Just let them know that they've won me over, okay? And it was you. If I were to apply somewhere else, I'd apply to Sourcegraph.

"It's like a world-scale IDE, almost, except it's distributed internally." Can I use some of your words, from what you wrote? This is in regards to code search at Google, so I would imagine there's some similarity in satisfaction score, potentially, for this intelligence, I suppose. You said (by the way, this guy's voice is the only normalized voice in this whole thing): "Google Code Search is like the Matrix, except for developers. It has a near-perfect satisfaction score on Google's internal surveys, and pretty much every dev who leaves Google misses it." This is in regards to code search. "This is the reason why Sourcegraph exists: because this was only at Google, and everybody else needs it too." You went on to say Google engineers today can navigate and understand their own multi-billion-line code base better than perhaps any other group of devs in such a large environment. So are you saying that they have a tool, like git or GitHub, that gives them intelligence this much better, and no one else has access to this thing? This is their proverbial secret sauce behind the scenes, to be more efficient as an engineering team, despite everyone now having to work on AI and kind of being behind the ball? "That's right. Okay, that's right. Their tooling environment, in fact Google's entire infrastructure stack, not just the tools but everything you use as a developer there, even the docs and stuff, is just unbelievably good. Stupidly good. Like, you just come in and you're like, what? It just makes no sense. The rest of the world just feels like people banging on rocks compared to Google's stuff."

So, I have long thought that how Google does stuff, their technical artifacts, how they do promotions, creates a lot of perverse incentives. But based on what he's saying, maybe there are some things that are good from it. This is Hooli propaganda. I think everyone is correct, which is: if they're so good at all this, why do they kill everything they create? Netflix tooling: we do not have anything like this. Nothing at all. Nothing. I could never convince them to give it to the rest of the developers.

Killed by Google: Google Domains, Google Optimize, Google Cloud IoT Core, Google Album Archive, YouTube Stories, Grasshopper, Conversational Actions, Google Currents, the Google Street View standalone app, Jacquard, Google Code competitions, Google Stadia (RIP Stadia), Google OnHub... okay, YouTube Originals... wow. Oh my goodness, thank you. Oh gosh, I can't read all this.

Okay, some of these are fair, though. Some of these have to be fair. Like, Google killed the Google Desktop Bar: "Desktop Bar was a small inset window on the Windows toolbar to allow users to perform search without leaving the desktop." I could see why this would be killed, right? I think some of this is a little unfair on some of theirs, right? That was 17 years ago. I know, I'm just saying. I mean, obviously it's fun to make fun of Google killing tons of projects, but let's be serious: Google's market cap is 1.7 trillion dollars. They are shipping stuff that powers such an incredible amount of the world. Yes. And so you have to explore a bunch, and you have to create stuff that people like, but if it doesn't make a material difference, you have to kill it, because you don't want to have staff and people hired around stuff that isn't producing anything. Like, I get it. I'm not against what they're doing, but the point still stands: they kill stuff. I mean, Netflix kills a bunch of features too; you just don't hear about it, because our features are all within a player. If only Steve Ballmer had been there screaming about developers, right? Okay.

Falcor! Falcor. Hey, TJ, you know what problem I had just yesterday? My Falcor client couldn't connect to the database server on the current latest build of the TV UI. So guess what: Falcor's still alive. Falcor's still alive! It hasn't been developed on in literally half a decade, but it's still alive. That would do it, that would do it.

Well, this has been a very good unpopular opinion. Maybe. I like it. Yeah, I appreciate it. I'm still thinking about it. I would love to see it. I'd love to see, like, a demo, you know, for the rest of the world. Because sometimes you don't know you're banging on rocks until you see somebody who has a more sophisticated tool. Kind of like when you're in VS Code and you don't realize how sophisticated it can be, and then you see someone in Neovim and you're just like: wow, I've been banging on rocks this whole time. Like, oh, I could do that.

"Yeah, I mean, you can see Google code search if you just type 'chromium code search'; they've indexed it." I actually did. I do like the Chromium code search stuff. I used it quite a bit to explore the V8 engine. It's very, very good. Their Chromium code search stuff is incredible. "This was, on my way out, my swan song at Google, at least on the code search team: indexing the Android and Chromium code bases. So you can play with it. It doesn't have all of the functionality, but it has a lot, and you can see it's very slick. The navigation, it's just really, really good. But that's only the search stuff, and that's actually not even used that much compared to some of the other things, like their quick editor, Cider, and then CitC, their clients in the cloud; they have basically cloud-based clients. And oh my God, the stuff they have is like science fiction. Still. The stuff they had 15 years ago is still science fiction for the rest of the world. They have high-speed networks, and they can do incredible... it's really nice, it's really nice. So yeah, the rest of the world is kind of hurting, and that's why I'm still in this space: because I think the rest of the world needs to get to where Google's at."

Wow. Thank you, Changelog, that was awesome. Hey, that's me in my own video, being promoted to me. Thank you. That was actually really good. I really liked that. Okay, so I'm still confused by the git thing, still. I don't understand why git is the problem in this stuff. But I understand what he's trying to say, which is that the tools we use are absolutely awful compared to what, say, Google has. Either way, this is very interesting. I would love... I mean, I really want to go experience the Google tools, just to feel them, and then take that experience and go: that's what they mean; this is why it's bad. I would love that experience.

Either way, I don't understand why git gets, like, so battered. "It's killing us." It seems kind of intense, but I guess I haven't seen the other side. Maybe I'm just the one with rocks here. Okay. Again...
# Document Title
In a git repository, where do your files live?
• git •

Hello! I was talking to a friend about how git works today, and we got onto the topic: where does git store your files? We know that it's in your .git directory, but where exactly in there are all the versions of your old files?

For example, this blog is in a git repository, and it contains a file called content/post/2019-06-28-brag-doc.markdown. Where is that in my .git folder? And where are the old versions of that file? Let's investigate by writing some very short Python programs.

git stores files in .git/objects

Every previous version of every file in your repository is in .git/objects. For example, for this blog, .git/objects contains 2700 files.

$ find .git/objects/ -type f | wc -l
2761

note: .git/objects actually has more information than "every previous version of every file in your repository", but we're not going to get into that just yet

Here's a very short Python program (find-git-object.py) that finds out where any given file is stored in .git/objects.

import hashlib
import sys

def object_path(content):
    header = f"blob {len(content)}\0"
    data = header.encode() + content
    digest = hashlib.sha1(data).hexdigest()
    return f".git/objects/{digest[:2]}/{digest[2:]}"

with open(sys.argv[1], "rb") as f:
    print(object_path(f.read()))

What this does is:

read the contents of the file
calculate a header (blob 16673\0) and combine it with the contents
calculate the sha1 sum (8ae33121a9af82dd99d6d706d037204251d41d54 in this case)
translate that sha1 sum into a path (.git/objects/8a/e33121a9af82dd99d6d706d037204251d41d54)

We can run it like this:

$ python3 find-git-object.py content/post/2019-06-28-brag-doc.markdown
.git/objects/8a/e33121a9af82dd99d6d706d037204251d41d54

jargon: "content addressed storage"

The term for this storage strategy (where the filename of an object in the database is the same as the hash of the file's contents) is "content addressed storage".

One neat thing about content addressed storage is that if I have two files (or 50 files!) with the exact same contents, that doesn't take up any extra space in Git's database: if the hash of the contents is aabbbbbbbbbbbbbbbbbbbbbbbbb, they'll both be stored in .git/objects/aa/bbbbbbbbbbbbbbbbbbbbb.
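You can see that for yourself with a tiny sketch that reuses the object_path logic from find-git-object.py above:

import hashlib

def object_path(content):
    header = f"blob {len(content)}\0"
    digest = hashlib.sha1(header.encode() + content).hexdigest()
    return f".git/objects/{digest[:2]}/{digest[2:]}"

# two "files" with byte-identical contents land at exactly the same
# path, so git only ever stores one copy of those contents
print(object_path(b"same bytes\n"))
print(object_path(b"same bytes\n"))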
how are those objects encoded?

If I try to look at this file in .git/objects, it gets a bit weird:

$ cat .git/objects/8a/e33121a9af82dd99d6d706d037204251d41d54
x^A<8D><9B>}s<E3>Ƒ<C6><EF>o|<8A>^Q<9D><EC>ju<92><E8><DD>\<9C><9C>*<89>j<FD>^...

What's going on? Let's run file on it:

$ file .git/objects/8a/e33121a9af82dd99d6d706d037204251d41d54
.git/objects/8a/e33121a9af82dd99d6d706d037204251d41d54: zlib compressed data

It's just compressed! We can write another little Python program called decompress.py that uses the zlib module to decompress the data:

import zlib
import sys

with open(sys.argv[1], "rb") as f:
    content = f.read()
    print(zlib.decompress(content).decode())

Now let's decompress it:

$ python3 decompress.py .git/objects/8a/e33121a9af82dd99d6d706d037204251d41d54
blob 16673
---
title: "Get your work recognized: write a brag document"
date: 2019-06-28T18:46:02Z
url: /blog/brag-documents/
categories: []
---
... the entire blog post ...

So this data is encoded in a pretty simple way: there's this blob 16673\0 thing, and then the full contents of the file.

there aren't any diffs

One thing that surprised me the first time I learned it: there aren't any diffs here! That file is the 9th version of that blog post, but the version git stores in .git/objects is the whole file, not the diff from the previous version.

Git actually sometimes also does store files as diffs (when you run git gc it can combine multiple different files into a "packfile" for efficiency), but I have never needed to think about that in my life so we're not going to get into it. Aditya Mukerjee has a great post called Unpacking Git packfiles about how the format works.

what about older versions of the blog post?

Now you might be wondering: if there are 8 previous versions of that blog post (before I fixed some typos), where are they in the .git/objects directory? How do we find them?

First, let's find every commit where that file changed with git log:

$ git log --oneline content/post/2019-06-28-brag-doc.markdown
c6d4db2d
423cd76a
7e91d7d0
f105905a
b6d23643
998a46dd
67a26b04
d9999f17
026c0f52
72442b67

Now let's pick a previous commit, let's say 026c0f52. Commits are also stored in .git/objects, and we can try to look at it there. But the commit isn't there! ls .git/objects/02/6c* doesn't have any results! You know how we mentioned "sometimes git packs objects to save space but we don't need to worry about it"? I guess now is the time that we need to worry about it. So let's take care of that.

let's unpack some objects

So we need to unpack the objects from the pack files. I looked it up on Stack Overflow and apparently you can do it like this:

$ mv .git/objects/pack/pack-adeb3c14576443e593a3161e7e1b202faba73f54.pack .
$ git unpack-objects < pack-adeb3c14576443e593a3161e7e1b202faba73f54.pack

This is weird repository surgery so it's a bit alarming but I can always just clone the repository from Github again if I mess it up, so I wasn't too worried.

After unpacking all the object files, we end up with way more objects: about 20000 instead of about 2700. Neat.

$ find .git/objects/ -type f | wc -l
20138

back to looking at a commit

Now we can go back to looking at our commit 026c0f52. You know how we said that not everything in .git/objects is a file? Some of them are commits! And to figure out where the old version of our post content/post/2019-06-28-brag-doc.markdown is stored, we need to dig pretty deep into this commit. The first step is to look at the commit in .git/objects.

commit step 1: look at the commit

The commit 026c0f52 is now in .git/objects/02/6c0f5208c5ea10608afc9252c4a56c1ac1d7e4 after doing some unpacking and we can look at it like this:

$ python3 decompress.py .git/objects/02/6c0f5208c5ea10608afc9252c4a56c1ac1d7e4
commit 211
tree 01832a9109ab738dac78ee4e95024c74b9b71c27
parent 72442b67590ae1fcbfe05883a351d822454e3826
author Julia Evans <julia@jvns.ca> 1561998673 -0400
committer Julia Evans <julia@jvns.ca> 1561998673 -0400

brag doc

We can also get the same information with git cat-file -p 026c0f52, which does the same thing but does a better job of formatting the data. (the -p option means "format it nicely please")
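As an aside, the commit format is simple enough that we can pull the interesting fields out of it ourselves. Here's a sketch (commit_info is a made-up helper, building on decompress.py above) that extracts the tree and parent ids from a decompressed commit object:

import zlib

def commit_info(path):
    data = zlib.decompress(open(path, "rb").read()).decode()
    body = data.split("\x00", 1)[1]   # drop the "commit 211" header
    info = {}
    for line in body.splitlines():
        if line == "":
            break                     # blank line: the commit message follows
        key, _, value = line.partition(" ")
        # setdefault keeps only the first value if a key repeats
        # (merge commits have two "parent" lines)
        info.setdefault(key, value)
    return info

print(commit_info(".git/objects/02/6c0f5208c5ea10608afc9252c4a56c1ac1d7e4"))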
commit step 2: look at the tree

This commit has a tree. What's that? Well let's take a look. The tree's ID is 01832a9109ab738dac78ee4e95024c74b9b71c27, and we can use our decompress.py script from earlier to look at that git object. (though I had to remove the .decode() to get the script to not crash)

$ python3 decompress.py .git/objects/01/832a9109ab738dac78ee4e95024c74b9b71c27
b'tree 396\x00100644 .gitignore\x00\xc3\xf7`$8\x9b\x8dO\x19/\x18\xb7}|\xc7\xce\x8e:h\xad100644 README.md\x00~\xba\xec\xb3\x11\xa0^\x1c\xa9\xa4?\x1e\xb9\x0f\x1cfG\x96\x0b

This is formatted in kind of an unreadable way. The main display issue here is that the object hashes (\xc3\xf7$8\x9b\x8dO\x19/\x18\xb7}|\xc7\xce\…) are raw bytes instead of being encoded in hexadecimal. So we see \xc3\xf7$8\x9b\x8d instead of c3f76024389b8d. Let's switch over to using git cat-file -p which formats the data in a friendlier way, because I don't feel like writing a parser for that.

$ git cat-file -p 01832a9109ab738dac78ee4e95024c74b9b71c27
100644 blob c3f76024389b8d4f192f18b77d7cc7ce8e3a68ad    .gitignore
100644 blob 7ebaecb311a05e1ca9a43f1eb90f1c6647960bc1    README.md
100644 blob 0f21dc9bf1a73afc89634bac586271384e24b2c9    Rakefile
100644 blob 00b9d54abd71119737d33ee5d29d81ebdcea5a37    config.yaml
040000 tree 61ad34108a327a163cdd66fa1a86342dcef4518e    content <-- this is where we're going next
040000 tree 6d8543e9eeba67748ded7b5f88b781016200db6f    layouts
100644 blob 22a321a88157293c81e4ddcfef4844c6c698c26f    mystery.rb
040000 tree 8157dc84a37fca4cb13e1257f37a7dd35cfe391e    scripts
040000 tree 84fe9c4cb9cef83e78e90a7fbf33a9a799d7be60    static
040000 tree 34fd3aa2625ba784bced4a95db6154806ae1d9ee    themes

This is showing us all of the files I had in the root directory of the repository as of that commit. Looks like I accidentally committed some file called mystery.rb at some point which I later removed.

Our file is in the content directory, so let's look at that tree: 61ad34108a327a163cdd66fa1a86342dcef4518e

commit step 3: yet another tree

$ git cat-file -p 61ad34108a327a163cdd66fa1a86342dcef4518e
040000 tree 1168078878f9d500ea4e7462a9cd29cbdf4f9a56    about
100644 blob e06d03f28d58982a5b8282a61c4d3cd5ca793005    newsletter.markdown
040000 tree 1f94b8103ca9b6714614614ed79254feb1d9676c    post <-- where we're going next!
100644 blob 2d7d22581e64ef9077455d834d18c209a8f05302    profiler-project.markdown
040000 tree 06bd3cee1ed46cf403d9d5a201232af5697527bb    projects
040000 tree 65e9357973f0cc60bedaa511489a9c2eeab73c29    talks
040000 tree 8a9d561d536b955209def58f5255fc7fe9523efd    zines

Still not done…

commit step 4: one more tree….

The file we're looking for is in the post/ directory, so there's one more tree:

$ git cat-file -p 1f94b8103ca9b6714614614ed79254feb1d9676c
.... MANY MANY lines omitted ...
100644 blob 170da7b0e607c4fd6fb4e921d76307397ab89c1e    2019-02-17-organizing-this-blog-into-categories.markdown
100644 blob 7d4f27e9804e3dc80ab3a3912b4f1c890c4d2432    2019-03-15-new-zine--bite-size-networking-.markdown
100644 blob 0d1b9fbc7896e47da6166e9386347f9ff58856aa    2019-03-26-what-are-monoidal-categories.markdown
100644 blob d6949755c3dadbc6fcbdd20cc0d919809d754e56    2019-06-23-a-few-debugging-resources.markdown
100644 blob 3105bdd067f7db16436d2ea85463755c8a772046    2019-06-28-brag-doc.markdown <-- found it!!!!!

Here the 2019-06-28-brag-doc.markdown is the last file listed because it was the most recent blog post when it was published.

commit step 5: we made it!

Finally we have found the object file where a previous version of my blog post lives! Hooray! It has the hash 3105bdd067f7db16436d2ea85463755c8a772046, so it's in .git/objects/31/05bdd067f7db16436d2ea85463755c8a772046.

We can look at it with decompress.py:

$ python3 decompress.py .git/objects/31/05bdd067f7db16436d2ea85463755c8a772046 | head
blob 15924
---
title: "Get your work recognized: write a brag document"
date: 2019-06-28T18:46:02Z
url: /blog/brag-documents/
categories: []
---
... rest of the contents of the file here ...

This is the old version of the post! If I ran git checkout 026c0f52 content/post/2019-06-28-brag-doc.markdown or git restore --source 026c0f52 content/post/2019-06-28-brag-doc.markdown, that's what I'd get.
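This whole walk can also be automated. Here's a sketch (blob_at is a made-up helper; 026c0f52 is the commit from this post, so substitute your own commit and path) that chains git cat-file -p calls to do steps 1 through 4 for us:

import subprocess

def cat_file(obj):
    return subprocess.run(["git", "cat-file", "-p", obj],
                          capture_output=True, text=True, check=True).stdout

def blob_at(commit, path):
    # the first line of a pretty-printed commit is "tree <hash>"
    obj = cat_file(commit).splitlines()[0].split()[1]
    for part in path.split("/"):
        # each tree entry looks like "100644 blob <hash>\t<name>"
        for line in cat_file(obj).splitlines():
            mode, otype, ohash, name = line.split(None, 3)
            if name == part:
                obj = ohash
                break
    return obj

print(blob_at("026c0f52", "content/post/2019-06-28-brag-doc.markdown"))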
this tree traversal is how git log works

This whole process we just went through (find the commit, go through the various directory trees, search for the filename we wanted) seems kind of long and complicated but this is actually what's happening behind the scenes when we run git log content/post/2019-06-28-brag-doc.markdown. It needs to go through every single commit in your history, check the version (for example 3105bdd067f7db16436d2ea85463755c8a772046 in this case) of content/post/2019-06-28-brag-doc.markdown, and see if it changed from the previous commit.

That's why git log FILENAME is a little slow sometimes: I have 3000 commits in this repository and it needs to do a bunch of work for every single commit to figure out if the file changed in that commit or not.

how many previous versions of files do I have?

Right now I have 1530 files tracked in my blog repository:

$ git ls-files | wc -l
1530

But how many historical files are there? We can list everything in .git/objects to see how many object files there are:

$ find .git/objects/ -type f | grep -v pack | awk -F/ '{print $3 $4}' | wc -l
20135

Not all of these represent previous versions of files though: as we saw before, lots of them are commits and directory trees. But we can write another little Python script called find-blobs.py that goes through all of the objects and checks if it starts with blob or not:

import zlib
import sys

for line in sys.stdin:
    line = line.strip()
    filename = f".git/objects/{line[0:2]}/{line[2:]}"
    with open(filename, "rb") as f:
        contents = zlib.decompress(f.read())
        if contents.startswith(b"blob"):
            print(line)

$ find .git/objects/ -type f | grep -v pack | awk -F/ '{print $3 $4}' | python3 find-blobs.py | wc -l
6713

So it looks like there are 6713 - 1530 = 5183 old versions of files lying around in my git repository that git is keeping around for me in case I ever want to get them back. How nice!

that's all!

Here's the gist with all the code for this post. There's not very much.

I thought I already knew how git worked, but I'd never really thought about pack files before so this was a fun exploration. I also don't spend too much time thinking about how much work git log is actually doing when I ask it to track the history of a file, so that was fun to dig into.

As a funny postscript: as soon as I committed this blog post, git got mad about how many objects I had in my repository (I guess 20,000 is too many!) and ran git gc to compress them all into packfiles. So now my .git/objects directory is very small:

$ find .git/objects/ -type f | wc -l
14
# Document Title
Lazygit Turns 5: Musings on Git, TUIs, and Open Source

Written on August 5, 2023

This post is brought to you by my sponsors. If you would like to support me, consider becoming a sponsor.

Lazygit, the world's coolest terminal UI for git, was released to the world on August 5 2018, five years ago today. I say released but I really mean discovered, because I had taken a few stabs at publicising it in the weeks prior which fell on deaf ears. When I eventually posted to Hacker News I was so sure nothing would come of it that I had already forgotten about it by that afternoon, so when I received an email asking what license the code fell under I was deeply confused. And then the journey began!

In this post I'm going to dive into a bunch of topics directly or tangentially related to Lazygit. In honour of the Hacker News commenters whose flamewar over git UIs vs the git CLI likely boosted the debut post to the frontpage, I've been sure to include plenty of juicy hot-takes on various topics I'm underqualified to comment on. It's a pretty long post so feel free to pick and choose whatever topics interest you.

Contents:
Where are we now?
Lessons learnt
What comes next?
Is git even that good?
Weighing in on the CLI vs UI debate
Weighing in on the terminal renaissance
Credits

Where are we now?

Stars

Lazygit has 37 thousand stars on GitHub, placing it at rank 26 in terms of Go projects and rank 263 across all git repos globally.

What's the secret? The number one factor (I hope) is that people actually like using Lazygit enough to star the repo. But there were two decisions I made that have nothing to do with the app itself that I think helped.

Firstly, I don't have a standalone landing page site or docs site. I keep everything in the repo, which means you're always one click away from starring. You can add a GitHub star button to your external site, but it doesn't actually star the repo; it just links to the repo and it's up to you to realise that you actually need to press the star button again. I suspect that is a big deal.

Secondly, Lazygit shows a popup when you first start it which at the very bottom suggests starring the repo:

Thanks for using lazygit! Seriously you rock. Three things to share with you:

1) If you want to learn about lazygit's features, watch this vid:
https://youtu.be/CPLdltN7wgE

2) Be sure to read the latest release notes at:
https://github.com/jesseduffield/lazygit/releases

3) If you're using git, that makes you a programmer!
With your help we can make lazygit better, so consider becoming a contributor and joining the fun at
https://github.com/jesseduffield/lazygit

You can also sponsor me and tell me what to work on by clicking the donate button at the bottom right.

Or even just star the repo to share the love!

I know this all sounds Machiavellian but at the end of the day, a high star count lends credibility to your project which makes users more likely to use it, and that leads to more contributors, which leads to more features, creating a virtuous cycle.

It's important to note that GitHub stars don't necessarily track real world popularity: magit, the de facto standard git UI for emacs, has only 6.1k stars but has north of 3.8 million downloads which as you'll see below blows Lazygit out of the water.

Downloads

Downloads are harder to measure than stars because there are so many sources from which to download Lazygit, and I don't have any telemetry to lean on.

GitHub tells me we've had 359k total direct downloads.
4.6% of Arch Linux users have installed Lazygit.
Homebrew ranks Lazygit at 294th (two below emacs) with 15k installs-on-request in the last year (ignoring the tap with 5k of its own). For comparison tig, the incumbent standalone git TUI at the time of Lazygit's creation, ranks at 480 with 8k installs.

I'm torn on how to interpret these results: being in the top 300 in Homebrew is pretty cool, but 15k installs feels lower than I would expect for that ranking. On the other hand, having almost 1 in 20 Arch Linux users using Lazygit seems huge.

Lessons Learnt

I've maintained Lazygit for 5 years now and it has been a wild ride. Here's some things I've learnt.

Ask for help

I don't know why this didn't occur to me sooner, but there is something unique and magical about writing open source software whose users are developers: any developer who raises an issue has the capacity to fix the issue themselves. All you need to do is ask! Simply asking 'are you up to the challenge of fixing this yourself?' and offering to provide pointers goes a long way.

I've gotten better over time at identifying easy issues and labelling them with the good-first-issue label so that others can help out, with a chance of becoming regular contributors.

Get feedback

If your repo is popular enough, you'll get plenty of feedback through the issues board. But issues are often of the form 'this is a problem that needs fixing' or 'this is a feature that should be added' and the demand for rigour is a source of friction. There are other ways you can reduce the friction on getting feedback. I pinned a google form to the top of the issues page to get general feedback on what people like/dislike about Lazygit.

Something that the google form made clear was that people wanted to know what commands were being run under the hood, so I decided to add a command log (shown by default) that would tell you which commands were being run. This made a huge difference and it's now one of the things people like best about Lazygit.

Something that surprised me was how big of a barrier the language of the project is in deciding whether somebody contributes. And Go of all languages: the one that's intended to be dead-easy to pick up. Maybe I need to do a rewrite in javascript to attract more contributors ;)

MVP is the MVP

This is not so much a 'lesson learnt' as a 'something I got right'. When I first began work on Lazygit I had a plan: hit MVP (Minimum Viable Product) and then release it to the world to see if the world had an appetite for it.
The MVP was pretty basic: allow staging files, committing, checking out branches, and resolving merge conflicts. But it was enough to satisfy my own basic needs at the time and it was enough for many others as well. Development was accelerated post-release thanks to some early contributors who joined the team (shoutout to Mark Kopenga, Dawid Dziurla, Glenn Vriesman, Anthony Hamon, David Chen, and other OG contributors). This not only sped up development but I personally learned a tonne in the process.

Tech debt is a perennial threat

In your day job, tech debt is to be expected: there are deadlines and customers to appease and competitors to race against. In open source, then, you would think that the lack of urgency would mean less tech debt. But I've found that where time is the limiting factor at my day job, motivation is the limiting factor in my spare time, and the siren song of tech debt is just as alluring. Does anybody want to spend their weekend writing a bunch of tests? Does anybody want to spend a week of annual leave on a mind-numbing refactoring? Not me, but I have done those things in order to improve the health of the codebase (and there is still much to improve upon).

Thankfully, open source has natural incentives against tech debt that are absent from proprietary codebases. Firstly, if your codebase sucks, nobody will want to contribute to it. Contrast this to a company where no matter how broken and contemptible a codebase is, there is an amount you can pay a developer to endure it.

Secondly, because your code is public, anybody who considers hiring you in the future can skim through it to get a feel for whether you suck or not. You want your codebase to be a positive reflection on your own skills and values.

So, tech debt is still a problem, but for different reasons than in a proprietary codebase.

Get your testing patterns right as soon as possible

The sooner you get a good test pattern in place with good coverage, the easier life will be.

In the beginning, I was doing manual regression tests before releasing each feature. Although I had unit tests, they didn't inspire much confidence, and I had no end-to-end tests. Later on I introduced a framework based on recorded sessions: each test would have a bash script to prepare a repo, then you would record yourself doing something in Lazygit, and the resultant repo would be saved as a snapshot to compare against when the test was run and the recording was played back. This was great for writing tests but terrible for maintaining them. Looking at a minified JSON containing a sequence of keypresses, it was impossible to glean the intent, and the only way to make a change to the test was to re-record it.

I've spent a lot of time working on an end-to-end test framework where you define your tests with code, and although I still shiver thinking about the time it took to migrate from the old framework to the new one, every day I see evidence that the effort was worth it.
Contributors find it easy to write the tests and I find it easy to read them which tightens the pull request feedback loop.

Here's an example to give you an idea:

// We call them 'integration tests' but they're really end-to-end tests.
var RewordLastCommit = NewIntegrationTest(NewIntegrationTestArgs{
    Description: "Rewords the last (HEAD) commit",
    SetupRepo: func(shell *Shell) {
        shell.CreateNCommits(2)
    },
    Run: func(t *TestDriver, keys config.KeybindingConfig) {
        t.Views().Commits().
            Focus().
            Lines(
                Contains("commit 02").IsSelected(),
                Contains("commit 01"),
            ).
            Press(keys.Commits.RenameCommit).
            Tap(func() {
                t.ExpectPopup().CommitMessagePanel().
                    Title(Equals("Reword commit")).
                    InitialText(Equals("commit 02")).
                    Clear().
                    Type("renamed 02").
                    Confirm()
            }).
            Lines(
                Contains("renamed 02"),
                Contains("commit 01"),
            )
    },
})

I wish I had come up with that framework from the get-go: it would have saved me a lot of time fixing bugs and migrating tests from the old framework.

What comes next?

If I could flick my wrist and secure funding to go fulltime on Lazygit I'd do it in a heartbeat, but given the limited time available, things move slower than I would like. Here are some things I'm excited for:

Bulk actions (e.g. moving multiple commits at once in a rebase)
Repo actions (e.g. pulling in three different repos at once)
Better integration with forges (github, gitlab) (e.g. view PR numbers against branches)
Improved diff functionality
More flexibility in deciding which args are used in a command
More performance improvements
A million small enhancements

I've just wrapped up worktree support, and my current focus is on improving documentation.

If you want to be part of what comes next, join the team! There are plenty of issues to choose from and we're always up to chat in the discord channel.

Okay, you've listened to me ramble about me and my project for long enough. Now onto the juicy stuff.

Is git even that good?

I'm not old enough to compare git with its predecessors, and from what I've heard from those who are old enough, it was a big improvement.

There are many who criticize git for being unnecessarily complex, in part due to its original purpose in serving the needs of linux development. Fossil is a recent (not-recent: released in 2006 as commenter nathell points out) git alternative that optimises for simplicity; serving the needs of small, high-trust teams. I disagree with a few of its design choices, but it might be perfect for you!

My beef with git is not so much its complexity (I'm fine dealing with multiple remotes and the worktree/index distinction) but its UX, including:

lacking high-level commands
no undo feature
merge conflicts aren't first-class

Lacking high-level commands

Consider the common use case of 'remove this file from my git status output'. Depending on the state of the file, the required command is different: for untracked files you do rm <path>, for tracked it's git checkout -- <path>, and for staged files it's git reset -- <path> && git checkout -- <path>. One of the reasons I made Lazygit was so that I could press 'd' on a file in a 'changed files' view and have it just go away.
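To make that concrete, here's a rough sketch (a hypothetical discard helper, not Lazygit's actual code) of the dispatch that a single 'd' keypress has to do on your behalf:

import subprocess

def git(*args):
    return subprocess.run(["git", *args], capture_output=True, text=True).stdout

def discard(path):
    # "git status --porcelain" prefixes each entry with a two-char state code
    status = git("status", "--porcelain", "--", path)[:2]
    if not status:
        return                           # nothing to discard
    if status == "??":                   # untracked: just delete the file
        subprocess.run(["rm", path])
    elif status[0] != " ":               # staged: unstage it first...
        git("reset", "--", path)
        git("checkout", "--", path)      # ...then throw away the changes
    else:                                # tracked, with unstaged changes
        git("checkout", "--", path)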
No undo feature

Git should have an undo feature, and it should support undoing changes to the working tree. Although Lazygit has an undo feature, it depends on the reflog, so we can't undo anything specific to the working tree. If git treated the working tree like its own commit, we would be able to undo pretty much anything.

Merge conflicts aren't first class

I also dislike how merge conflicts aren't first-class: when they show up you have to choose between resolving all of them or aborting an entire rebase (which may have involved other conflicts), and you can't easily switch to another task mid-conflict (though worktrees make this easier).

One project that addresses these concerns is Jujutsu. I highly recommend reading through its readme to realise how many problems you took for granted and how a few structural tweaks can provide a much better experience.

Unlike Fossil which trades power for simplicity, Jujutsu feels more like a reboot of git, representing what git could have been from the start. It's especially encouraging that Jujutsu can use git as a backend. I hope that regardless of Jujutsu's success, git incorporates some of its ideas.

Weighing in on the CLI-UI debate

If my debut hacker news post hadn't sparked a flamewar on the legitimacy of git UIs, it probably would have gone unnoticed and Lazygit would have been relegated to obscurity forever. So thanks, Moloch!

I've had plenty of time to think about this endless war and I have a few things to say.

Here are the main arguments against using git UIs:

git UIs sometimes do things you didn't expect which gets you in trouble
git UIs rarely give you everything you need and you will sometimes need to fall back to the command line
git UIs make you especially vulnerable when you do need to use the CLI
git UIs obscure what's really happening
The CLI is faster

I'm going to address each of these points.

Git UIs sometimes do things you didn't expect

This is plainly true. Lazygit works around this by logging all the git commands that it runs so that you know what's happening under the hood. Also, over time, lazygit's ethos has changed to be less about compensating for git's shortcomings via magic and more making it easier to do the things that you can naturally do in git, which means there are fewer surprises.

Git UIs don't cover the full API

This is indeed an issue. However, as a git UI matures, it expands to cover more and more of git's API (until you end up like magit). And the fact you need to fall back to git is not really a point against the UI: when given the choice between using the CLI 100% of the time and using it 1% of the time, I pick the latter. If you forgive the shameless plug (is it really a plug given the topic of the post?) Lazygit also works around this with a pretty cool custom commands system that lets you invoke that bespoke git command from the UI; making use of the selection state to spare you from typing everything out yourself.

Git UIs make you vulnerable when you need to use the CLI

I've conceded the first two points. Now I go to war.

What people envision with a seasoned CLI user is that they come across some situation they haven't seen before and, using their strong knowledge of the git API and the git object model, they craft an appropriate solution. The reality that I've experienced is that you instead just look for the answer on stack overflow, copy+paste, and then forget about it until next time when you google it again. With the advent of ChatGPT this will increasingly become the norm.

Whenever a new technology comes along that diminishes the need for the previous one, there are outcries that it will make everybody dumber.
Socrates was famously suspicious of the impact that writing would have on society, saying:

Their trust in writing, produced by external characters which are no part of themselves, will discourage the use of their own memory within them. You have invented an elixir not of memory, but of reminding; and you offer your pupils the appearance of wisdom, not true wisdom, for they will read many things without instruction and will therefore seem to know many things, when they are for the most part ignorant…

The argument perfectly applies to UIs, and is just as misguided. The truth is that some people have good memory and some people (i.e. me) have shockingly bad memory and it has little to do with technology (unless the technology is hard drugs in which case yes that does make a difference). I think that many debates about UX are actually debates between people with differing memory ability who therefore have different UX needs. UIs make things more discoverable so you don't need to remember as much, and people with shocking memory who stick to the git CLI have no guarantee of actually remembering any of it. Yes, all abstractions are leaky, but that doesn't mean that we should go without abstractions, any more than we should all revert to writing code in assembly.

What's especially peculiar is that many complex git commands involve a visual component whether you like it or not: the git CLI by default will open up a text editor to prepare for an interactive rebase, which is visual in the sense that you're shown items whose position is meaningful and you can interact with them (e.g. shuffling commits around). The question is whether that interface is easy to use or not, and I find the default behaviour very difficult to use.

For the record, I'm good at helping colleagues fix their git issues, but if I'm in their terminal trying to update their remote URL I have no idea what the command is. Not to worry: I do know how to run brew install lazygit.

[Harry Potter meme]

Git UIs obscure what's really happening

Again, strong disagree. Compared to the CLI, there's nothing to obscure!

When I create a commit, several things happen:

my staged files disappear from my list of file changes
a new commit is appended to my git log
my branch ends up with a new head commit, diverging from its upstream

If you create a commit from the command line, you see none of this. You can query for any of this information after the fact, for example by running git status, but it only gives you one piece of information. If you're a beginner using the git CLI, you want to be learning the relationship between the different entities, and it's almost impossible to do that without seeing how these entities are changed as a direct result of your actions. Lazygit has helped some people better understand git by providing that visual context.

Perhaps UIs aren't visually obscuring the entities, but they are obscuring the commands. Okay, fine, I concede that point. But my caveats in the Git UIs sometimes do things you didn't expect section above still apply.

The CLI is faster

If you're a CLI die-hard you probably have some aliases that speed you up, but when it gets to complex use cases like splitting an old commit in two or only applying a single file from a stash entry to the index, it helps to have a UI that lets you press a few keys to select what you want and then perform an action with it. In fact I'd love to pit a CLI veteran against a UI veteran in a contrived gauntlet of git challenges and see who reaches the finish line first.
You could also determine the speed of light for each approach, i.e. the minimum number of keypresses required to perform the action, and then see which approach wins. Even if you had a thousand aliases, I still think a keyboard-centric UI (with good git API coverage) would win.

Conclusion

As somebody who maintains a git UI, I'm clearly partial. But I also feel for the devs who I see stage files by typing git status, dragging their mouse over the file they want to add, and then typing git add <path>. It makes my stomach turn. There are some pros out there who are happy using the CLI for everything, but the average CLI user I see is taking painstakingly slow approaches to very simple problems.

Weighing in on the terminal renaissance

The terminal is making a comeback. Various companies have sprung up with the intention of improving the developer experience in the terminal:

Warp: a terminal emulator whose killer feature is allowing you to edit your command as if you were in vscode/sublime
Fig: a suite of tools including autocomplete, terminal plugin manager, and some UI helpers for CLI tools
Charm: various tools and libraries for terminals including a terminal UI ecosystem

Warp and Fig both add original elements to the exterior of a terminal emulator to improve the UX, whereas Charm is all about improving things on the inside. All of these projects have the same general idea: rather than replace the terminal, embrace it.

I'm interested to see where this goes.

TUI vs CLI

I would say I'm pro-terminal, but borderline anti-CLI. When I'm interfacing with something I want to know the current state, the available actions, and once I've performed an action, I want to see how the state changes in response. So it's state -> action -> new state. You can use a CLI to give you all that information, but commands and queries are typically separated and it's left as an exercise for the user to piece it all together. The simplest example is that after you run cd, you have to run ls to know which files are in the directory. Compare this to nnn which mimics your OS's file explorer. Another example is running docker compose restart mycontainer and then having to run a separate command to see whether or not your container died as soon as it started (compared to using Lazydocker). Even programs like npm can benefit from some visualisation when it comes to linking packages (which is why I created Lazynpm). CLI interfaces are great for scripts and composition but as a direct interface, the lack of feedback about state is jarring.

All this to say that when I see demos that show slick auto-complete functionality added to a CLI tool, I can see that it solves the problem of knowing what actions are available, but I'd rather solve the issue of exposing state.

I want to drive home how easy it is to improve on the design of many CLIs. It's not hard to pick an existing CLI and think about what entities are involved and how they could be represented visually. A random example: asdf is an all-in-one version manager that can manage versions of multiple programs. So you have programs, the available versions of each program, the currently selected version, and you have some actions like CRUD operations and setting a given version as the default. This is perfectly suited to a UI!
It just so happens that somebody has gone and made a TUI for it: lazyasdf (I'm proud to have started a trend with the naming convention!).

TUI vs Web

So, I've said my piece about how TUIs can improve upon CLIs, but what about this separate trend of re-imagining web/desktop applications in the terminal?

nsf, the author of the now unmaintained terminal UI framework termbox, says the following at the top of the readme (emphasis mine):

This library is no longer maintained. It's pretty small, if you have a big project that relies on it, just maintain it yourself. Or look for forks. Or look for alternatives. Or better - avoid using terminals for UI.

When you think about it, the only thing that separates terminals and standalone applications is that terminals only render text. The need for terminals was obvious when there were literally no alternatives. Now that we have shiny standalone applications for many things that were once confined to the terminal, it's harder to justify extending our terminals beyond CLI programs. But there are some reasons:

TUIs guarantee a keyboard-centric UX

There is nothing stopping a non-TUI application from having a keyboard-centric UX, but few do. Likewise, TUIs can be mouse-centric, but I've never encountered one that is.

TUIs have minimalistic designs

In a TUI, not only can you only render text, but you're often space-constrained as well. This leads to compact designs with little in the way of superfluous clutter. On the other hand, it's nice when your UI can render bespoke icons and images and render some text in a small font so that the info is there if you need it but it's not taking up space. It is interesting that Charm's UI library seems to go for more whitespace and padding than the typical TUI design: I suspect that trend will be shortlived and in the long run terminal apps will lean compact and utilitarian (No doubt Charm has room for both designs in its ecosystem).

TUIs are often faster than non-TUI counterparts

In one sense, this is a no-brainer: all you're rendering is text, so your computer doesn't need to work as hard to render it. But I don't actually think that's the main factor. Rather, terminal users expect TUIs to be fast, because they value speed more than other people. So TUI devs put extra effort in towards speed in order to satisfy that desire. I've spent enough time on Lazygit's performance to know that it doesn't come for free.

Conclusion

So, let's see where this TUI renaissance goes. Even if the renaissance's only long-term impact is to support more keyboard-centric UIs in web apps, it will have been worth it.

Credits

Well, that wraps up this anniversary mega-post. Now I'd like to thank some people who've helped Lazygit become what it is today.

First of all, a HUGE thankyou to the 206 people who have contributed to Lazygit over the years, and those who have supported me with donations.

I'd like to shoutout contributors who've been part of the journey at different stages: Ryoga, Mark Kopenga, Dawid Dziurla, Glenn Vriesman, Anthony Hamon, David Chen, Flavio Miamoto, and many many others. Thankyou all so much. I also want to thank loyal users who've given lots of useful feedback including Dean Herbert, Oliver Joseph Ash, and others.

I want to give a special shoutout to Stefan Haller and Luka Markušić who currently comprise the core team. You've both been invaluable for Lazygit's development, maintenance, and direction.
I also hereby publicly award Stefan the prize of 'most arguments won against maintainer' ;)

I also want to shoutout Appwrite who generously sponsored me for a year. It warms my heart when companies donate to open source projects.

As for you, dear reader: if you would like to support Lazygit's development you can join the team by picking up an issue or expressing your intent to help out in the discord channel.

And as always, if you want to support me, please consider donating <3

I now leave you with a gif of our new explosion animation when nuking the worktree.

Discussion links
Hacker News

Shameless plug: I recently quit my job to co-found Subble, a web app that helps you manage your company's SaaS subscriptions. Your company is almost certainly wasting time and money on unused subscriptions and Subble can fix that. Check it out at subble.com
thank you

My name is [...]. I work on way too many projects at the same time. Right now I'm working on Cotton X, which is a way to share renewable electricity with your neighbors efficiently, using mathematics to do that, and I've been working for a few years now on Pijul, a version control system also based on mathematics, which is what I'm going to talk about today.

So, I have way too many things to tell you. I'll start by talking about what version control is, and do a brief recap on where it comes from, how it started, and where we're at now. Then I'll talk about our solution and the principles behind it, and then I'll talk about the implementation of that version control system, including one of the fastest database backends in the world, which I've been forced to write in order to implement Pijul correctly. And then, finally, I'll have some announcements, some surprise announcements, to make about the hosting platform for hosting repositories.

First: version control is actually very simple, and it's not specific to coders. It's when one or more co-authors edit trees of documents concurrently. One key feature of version control, compared to things like Google Docs for example, is the ability to do asynchronous edits. This sounds like it should be easier, since you're giving more flexibility to your users; it's actually the opposite. When you allow co-authors to choose when they want to sync, or merge their changes or their work, things become much more complicated. The main reason is that edits may conflict, and conflicts happen in human work. I'm not claiming here to have a universal solution to all the conflicts humans may have; I'm merely trying to help them model their conflicts in a proper way.

Then, not finally, but yet another feature of version control that we might like, is the ability to review a project's history: to tell when a feature or a bug was introduced, who introduced it, and sometimes that gives an indication of how to fix it.

Many of you here, or many people I talk to about this project, think that version control is a solved problem, because of our tools: git, Mercurial, SVN, CVS; people sometimes mention Fossil, Perforce. We have a huge collection of tools. Our tools are probably considered one of the greatest achievements of our industry, and yet nobody outside of us uses them. These days we have silverware provided by NASA, with materials that can go to Mars, but our greatest achievements cannot be used even by editors of legal documents, or parliaments; even the video game industry doesn't use them.

It's not because they're too young: they've been around for quite a number of decades now. Lately there's been a trend of doing distributed version control. That's cool, there's no central server. Except that the tools are unusable if you don't use a central server, and actually worse than a central server: a global central server, universal to all projects.

And our current tools require strong work discipline and planning; you have to plan your things in advance. The picture on the right is a simple example of a "workflow considered useful", and I personally don't understand it. (Well, I actually do, but...) Onboarding people, letting diverse people from outside, without for example formal training in computer science, know about this tool... they'd be like, "are you crazy?" And that's just a small part of what I'm about to say, because there are flows that any engineer in any other industry in the world would just laugh at us for, if they knew about them. So my claim is that by using these tools we are wasting significant human work time at a global scale. I don't know how many millions of engineer-hours are wasted every year on fixing that rebase that didn't work, or re-fixing that conflict again, or, wait, re-re-fixing it. There's actually a command in git called rerere.

Some improvements have been proposed using mathematics, like Darcs for example, but unfortunately they don't really scale. Just a note before I go any further: Pijul is open source, and I'm not considering non-open version control systems, because git is good enough. It's an okay system, it's phenomenal in some ways, but if you're going for a commercial system, then you'd better be at least as good as that, and I don't know any system that achieves that anyway.

So, our demands for a version control system. We want associative merges. Associativity is a mathematical term, and here it means that when you take changes A and B together, they should be the same as A followed by B. That sounds like something absolutely trivial, and I'll give you a picture in a minute that will make it even clearer. Next, we want commutative merges: we want the property that if A and B were produced independently, the order in which you apply them doesn't matter. You have Alice and Bob working together: if Alice pulls Bob's changes, it should result in the same thing as if Bob pulls Alice's changes.

Then we want branches. Well, or maybe not: we have branches in Pijul, but they are less fundamental than in git, in order not to encourage the same kind of "workflows considered useful" that I showed on the previous slide. And obviously we want low algorithmic complexity, and ideally fast implementations; and actually, here, we have both.

More about the two properties, starting with associative merges. This is really easy: you have Alice producing a commit A, and Bob producing commits B in parallel, and Alice wants to first review Bob's first commit and merge it, and then review Bob's second commit and merge it. This should do the same thing as if she had merged both commits at once. I'm not reordering commits here, I'm just merging. And actually, in git or SVN or Mercurial or CVS or RCS, or any system based on three-way merge, this is not the case. Such a simple property isn't even satisfied by three-way merge.

Here's a counterexample to associativity in git. You start with a document with only two lines. Alice (she's following the top path) will start by introducing a G, and then she will later add, in another commit, two new lines above that G: A and B. Bob, in parallel to that, will just add an X between the original A and B. And if you try that scenario on git today, you will see that Bob's new line gets merged into Alice's new lines. That's a giant line reshuffling, and git just does it silently; this isn't even a conflict. The reason for that is that it's trying to run a hack to optimize some metric, and it turns out there may be several solutions sometimes, and it just doesn't know about them: it just picks one and says, "ah, okay, done."
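If you want to poke at this yourself, here's a small sketch (hypothetical file and branch names; the exact merge result depends on your git version and diff/merge settings) that sets up the scenario just described:

import os, subprocess, tempfile

def git(*args, check=True):
    return subprocess.run(["git", *args], check=check,
                          capture_output=True, text=True)

os.chdir(tempfile.mkdtemp())
git("init", "-b", "main")   # -b requires git >= 2.28
git("config", "user.email", "demo@example.com")
git("config", "user.name", "demo")

def commit(lines, msg):
    with open("doc.txt", "w") as f:
        f.write("\n".join(lines) + "\n")
    git("add", "doc.txt")
    git("commit", "-m", msg)

commit(["a", "b"], "base: two lines")
git("checkout", "-b", "alice")
commit(["a", "b", "g"], "alice 1: introduce g")
commit(["a", "b", "a", "b", "g"], "alice 2: add a, b above g")
git("checkout", "main")
git("checkout", "-b", "bob")
commit(["a", "x", "b"], "bob: add x between the original a and b")

# Merge Alice's work into Bob's and see where x ends up. You can also
# merge alice~1 first and then alice, to compare with merging all at once.
git("merge", "alice", "-m", "merge alice", check=False)
print(open("doc.txt").read())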
More about the two properties, starting with associative merges. This one is really easy: you have Alice producing a commit A and Bob producing a commit B in parallel, and Alice wants to first review Bob's first commit and merge it, then review Bob's second commit and merge it. This should do the same thing as if she had merged both commits at once. I'm not reordering commits here, I'm just merging. And actually, in Git or SVN or Mercurial or CVS or RCS, in any system based on three-way merge, this is not the case.

Such a simple property isn't even satisfied by three-way merge. Here's a counter-example. You start with a document with only two lines, A and B. Alice, following the top path, starts by introducing a G, and later adds another commit with two new lines, above that G. Bob, in parallel, just adds an X between the original A and B. If you try that scenario on Git today, you will see that Bob's new line gets merged into Alice's new lines: a giant line reshuffling, and Git just does it silently. This isn't even a conflict. The reason is that three-way merge runs a heuristic to optimize some metric, and it turns out there may sometimes be several solutions; it doesn't know about the ambiguity, it just picks one and says, okay, done.

I don't know about you, but if I were running high-security applications, or writing security-related code, this would absolutely terrify me. The fact that your tool can silently reshuffle your lines, even if it doesn't happen often, is just super scary. It also means that the code you review is not the code that gets merged, so you should review your pull requests before the merge and after the merge: double the work. You should test after the merge anyway, but you shouldn't have to be as careful in your review, and tests don't catch all bugs.

Now, commutative merges. That's a slightly less trivial thing, because all the tools other than Darcs and Pijul explicitly prevent it. Commutativity of merges means exactly what's in this diagram: Alice produces a commit A, Bob produces a change B, and they want to be able to pull each other's changes, and it should just happen transparently. The order is important, in the sense that your local repository history is something super important, but it shouldn't matter, in the sense that there is no global way to order things that happened in parallel, and that should be reflected in how the tool handles parallelism.

All right, so why would we even want that, beyond academic curiosity, especially since the tools we're using right now are never commutative and even explicitly prevent it? One reason is that you might want to unapply old changes. For example, you pull a change and push it into production, because you've tested it thoroughly and it seems to work; then you push more changes; and after a while you realize the initial change was wrong, and you want to unapply it quickly, without having to change the entire sequence of patches that came afterwards. If you disallow commutation, you can't. Commutation allows you to move that buggy change to the top of the list and unapply it simply.

Then you might want to do cherry-picking. Cherry-picking is: my colleague produced a nice bug fix while working on a feature; I want the bug fix, but the feature is not quite ready. How do I take just the fix, without changing its entire identity, and without solving conflicts, and re-solving them, and re-re-solving them? Another reason is partial clones: I have a giant monorepo and I want to pull just the patches related to a tiny sub-project. That's the way we handle monorepos in Pijul: you don't need submodules, you don't need hacks, you don't need LFS, you don't need any of that. It just works, and it's just a standard situation.
Okay, so how do we do that? First we have to change perspective, zoom out, and look at what we're doing, fundamentally, when we work. Let's talk about states, or snapshots, and changes, also called patches. All our tools today store states, snapshots, and they only compute changes when needed; for example, three-way merge computes changes, and changes between lists of changes. But what if we did the opposite? What if we changed perspective and started considering changes as the first-class citizens? Why would we want to do that? Well, my claim, and it's not backed by anything, is that when we work, what we're fundamentally producing is not a new version of the world, or a new version of a project; what we're producing is changes to a project. This seems to match the way we work, or the way we think about work, more closely, so we can probably get some benefits out of it. And what if we did a hybrid system where we stored both? That's actually what we do.

All right, so this has been looked at before. I'll give you two examples of ideas in that space that some of you may already know about. The first one is operational transforms, the idea behind Google Docs, for example. In operational transforms you have transforms, or changes, on an initial state; here, a document with three letters, ABC. When two transforms come in concurrently, each one may be rewritten so that they can be applied in sequence. For example, on the downward path we're inserting an X at the very beginning of the file, that's T1, and on the path to the right we're doing T2, which deletes the letter C. What happens when you combine these two changes? If you follow the top path, you're first deleting the C, and since T1 was at the beginning of the file and the deletion was at the end, you don't need to do anything special. On the other path, going first downwards and then to the right, you first insert something, and that shifts the index of your deletion: instead of deleting the character at position 2, you're now deleting the character at position 3. Darcs, for example, does this: it rewrites its patches as it goes. It actually does something really clever to detect conflicts; I don't have time to get into the details, but what they do there is really cool.

Unfortunately, this technique leads to a quadratic explosion of cases: if you have n different types of changes, you have n(n-1)/2 different cases to consider. When you're just doing insertions and deletions, that's easy; when you're doing anything more complicated, it becomes a nightmare to implement. And here I'm actually quoting: "a nightmare to implement" is a quote from Google engineers who tried to implement it for Google Docs. It actually is a nightmare.
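For illustration, here's a minimal sketch (mine, not Google's or Darcs's code) of the transform step for just these two operation kinds; every pair of kinds needs its own rule, which is where the quadratic growth comes from:

```rust
// Sketch: operational transform for two operation kinds. Each pair of
// kinds needs its own transformation rule, so adding more kinds
// multiplies the number of cases.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Op {
    Insert(usize), // insert one character at this position
    Delete(usize), // delete the character at this position
}

// Transform `b` so it can be applied after `a` has been applied.
fn transform(b: Op, a: Op) -> Op {
    use Op::*;
    match (b, a) {
        (Insert(i), Insert(j)) if i >= j => Insert(i + 1),
        (Insert(i), Delete(j)) if i > j => Insert(i - 1),
        (Delete(i), Insert(j)) if i >= j => Delete(i + 1),
        (Delete(i), Delete(j)) if i > j => Delete(i - 1),
        _ => b, // edits strictly before `a`'s position are unaffected
    }
}

fn main() {
    // T1 inserts at the beginning, T2 deletes the 'C' at index 2.
    let t1 = Op::Insert(0);
    let t2 = Op::Delete(2);
    // After T1, T2's target has shifted from index 2 to index 3.
    assert_eq!(transform(t2, t1), Op::Delete(3));
    // After T2 (at index 2), T1 (at index 0) is unaffected.
    assert_eq!(transform(t1, t2), Op::Insert(0));
}
```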
Another approach some of you may have heard about is CRDTs, or conflict-free replicated data types. The general principle is to design a structure and its operations at the same time, together, so that all operations have the properties we want: they are associative and commutative. The natural examples, the easiest ones you come across when learning about CRDTs, are increment-only counters, where the only operation on the counter is to increment it, and insert-only or append-only sets. Those are easy. Now, when you want to do deletions, you get into the more subtle examples of CRDTs: you start needing tombstones and Lamport clocks and all those things from distributed programming.

So, I've done the natural and the subtle; now let's move on to the useless: if you consider a full Git repository, that's a CRDT. So what are we even doing here? Well, the reason I claim this is useless is that saying a Git repository is a CRDT just means you can clone it, and you can design a protocol to clone it, and that's it. If you consider a head, which is the thing we're actually interested in, the current state of your repository, then that's absolutely not a CRDT, simply because, as I said, concurrent changes don't commute.

That was a really brief look at the literature. Now let's move on to our solution, Pijul. This all started because we were looking at conflicts. The easy cases, where you can just merge and everything goes right, are not super interesting; conflicts are where we need a good tool the most, because conflicts are confusing, and you want to be able to talk about the fundamental thing behind the conflict, "we disagree on something", and not about how your tool models the conflict. The exact definition depends on the tool; different tools have different definitions of what a conflict is. For example, one commonly accepted definition is Alice and Bob writing to the same file at the same place: that's obviously a conflict, there's no way to order their changes. Another example is Alice renaming a file from F to G while Bob, in parallel, renames it to H; that's also a conflict, though again it depends on the tool. Another example, which very few systems handle, and which Pijul doesn't handle either, is Alice renaming a function f while Bob adds a call to f. That's extremely tricky; Darcs tries to do it, but unfortunately it's undecidable to tell whether Bob actually added a call to f or did something else. That's one of the reasons we don't handle it; there are many other reasons, but that one is good enough for me.

Okay, so how did reflecting on conflicts help us shape a new tool? We were inspired by a paper by Samuel Mimram and Cinzia Di Giusto about using category theory to solve that problem. Category theory is a very general theory in mathematics that lets you model many different kinds of proofs in a 2D framework, with points and arrows between the points; that's most of what there is in category theory. It's very simple and very abstract at the same time.

What we want is that for any two patches f and g produced from an initial state X, so f leads to Y and g leads to Z, there is a unique state P such that for any state Q we can reach after both f and g, that is, for anything Alice and Bob could do to reach a common state in the future, they could start by merging, reaching a minimal common state P, and then go on from P to reach Q. In other words, for any two patches you can start by finding a minimal common state, and then do something to reach any other future common state. I realize I'm going a bit fast on this slide, but category theorists have a tool for exactly this: they say that if P exists (and the definition implies its uniqueness), P is called the pushout of f and g.
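For reference, the universal property being invoked is the standard one (textbook notation, not specific to the talk):

```latex
% Universal property of the pushout of f: X -> Y and g: X -> Z.
% P, with maps in_Y and in_Z, is a pushout if every Q that both
% Y and Z map into (compatibly) factors through P uniquely.
\[
\begin{aligned}
&f : X \to Y, \qquad g : X \to Z,\\
&\mathrm{in}_Y : Y \to P,\ \ \mathrm{in}_Z : Z \to P
\quad\text{with}\quad \mathrm{in}_Y \circ f = \mathrm{in}_Z \circ g,\\
&\forall\, (q_Y : Y \to Q,\ q_Z : Z \to Q)\ \text{with}\ q_Y \circ f = q_Z \circ g:\\
&\qquad \exists!\, u : P \to Q \quad\text{such that}\quad
u \circ \mathrm{in}_Y = q_Y \ \text{and}\ u \circ \mathrm{in}_Z = q_Z.
\end{aligned}
\]
```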
Why is this important? Well, as you can imagine, pushouts don't always exist, and that is strictly equivalent to saying that sometimes there are conflicts between our edits. So how do we deal with that? Category theory gives you a new question to look at: the question becomes how to generalize the representation of states (the X, Y, Z, P, Q) so that all pairs of changes f and g have a pushout.

The solution is that the minimal extension of files that can model conflicts is directed graphs, where vertices are bytes, or byte intervals, and edges represent the union of all known orders between bytes. That probably sounds a little abstract, so I'll give you a few examples.

Let's see how we deal with insertions, that is, adding some bytes to an existing file. First, some details: vertices in Pijul are labeled by a change number (the change that introduced the vertex) and an interval within that change, and edges are labeled by the change that introduced them. For example, here we're starting with just one vertex, C0:[0,n), the first n bytes of change C0, and we're trying to insert m bytes between positions i-1 and i of that vertex. What we do is split the vertex in two, giving two vertices C0:[0,i) and C0:[i,n), and insert a new vertex between the two halves of the split. Super easy. And now we can tell from the graph that our file has three blocks: the first i bytes of C0, followed by the m bytes of C1, followed by bytes i to n of C0.

Okay, that was easy enough. Now, how do we delete bytes? A good thing about version control is that we need to keep the entire history of the repository anyway, so it doesn't cost more to just keep the deleted parts, and that's what we do here. Starting from the graph we obtained on the last slide, I'm now deleting a contiguous interval of bytes: bytes j to i from C0 and 0 to k from C1, so starting from j, I'm deleting i - j + k bytes. The way I do it is exactly the same as for insertions: I start by splitting the relevant vertices at the relevant positions, and then the way to mark bytes as deleted is just to modify the labels of the edges. Here I'm marking my edges as deleted by turning them into dashed lines.

And that's all we need. Pijul is a bit more complicated than that, but fundamentally these are the two constructs we need; there's a lot of stuff built on top, but at the very base it's just that: a vertex is deleted in the context of its parents and children, by the change doing the deletion, through the edge labels.
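Here's a rough sketch of that graph model in Rust (my rendering of the idea, not Pijul's actual types), showing a vertex split followed by an insertion:

```rust
// Sketch of the graph model: vertices are byte intervals of changes,
// edges carry the change that introduced them plus a liveness flag.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct Vertex {
    change: u64, // the change that introduced these bytes
    start: usize,
    end: usize, // half-open interval [start, end)
}

#[derive(Clone, Copy, Debug)]
struct Edge {
    from: Vertex,
    to: Vertex,
    introduced_by: u64,
    deleted: bool, // deletion only flips this flag; bytes are kept
}

// Split [start, end) at `at`, so new bytes can be wired in between.
fn split(v: Vertex, at: usize) -> (Vertex, Vertex) {
    assert!(v.start < at && at < v.end);
    (Vertex { end: at, ..v }, Vertex { start: at, ..v })
}

fn main() {
    let c0 = Vertex { change: 0, start: 0, end: 10 };
    let (left, right) = split(c0, 4); // insert after byte 3
    let new = Vertex { change: 1, start: 0, end: 3 }; // 3 new bytes
    let edges = [
        Edge { from: left, to: new, introduced_by: 1, deleted: false },
        Edge { from: new, to: right, introduced_by: 1, deleted: false },
    ];
    // The file now reads: c0[0..4), then c1[0..3), then c0[4..10).
    for e in &edges {
        println!("{:?} -> {:?}", e.from, e.to);
    }
}
```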
How does that handle conflicts? I won't dive too deep into this, for reasons of time, but I'll state the definition of conflicts and stop there. Before getting to conflicts: live vertices, by definition, are the vertices whose incoming edges are all alive; dead vertices are the vertices whose incoming edges are all dead; and all the other vertices, those with both alive and dead edges pointing to them, are called zombies. Now I'm ready to state my definition of conflicts: a graph has no conflicts if and only if it has no zombies and all its alive vertices are totally ordered. And this actually matches what you expect: a conflict-free file is just a sequence of bytes that can be ordered unambiguously, where each byte is either alive or dead, but not both at the same time. There's an extension of this to files and directories and so on, but it's significantly more involved, so I won't talk about it.
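A sketch of that classification in Rust (again my illustration; the names are invented):

```rust
// Liveness classification behind the conflict definition: a vertex is
// alive, dead, or a zombie, depending on its incoming edge labels.
#[derive(Debug, PartialEq)]
enum Status {
    Alive,
    Dead,
    Zombie, // some incoming edges alive, some deleted: a conflict
}

fn classify(incoming_deleted_flags: &[bool]) -> Status {
    let deleted = incoming_deleted_flags.iter().filter(|&&d| d).count();
    match (deleted, incoming_deleted_flags.len()) {
        (0, _) => Status::Alive,
        (d, n) if d == n => Status::Dead,
        _ => Status::Zombie,
    }
}

fn main() {
    assert_eq!(classify(&[false, false]), Status::Alive);
    assert_eq!(classify(&[true]), Status::Dead);
    // One change deleted these bytes while another change still
    // orders content around them: the vertex becomes a zombie.
    assert_eq!(classify(&[true, false]), Status::Zombie);
}
```

The other half of the definition, that the alive vertices are totally ordered, amounts to checking that there is only one way to read the alive bytes off the graph.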
Some concluding remarks on that part. Changes: I said I wanted them to be commutative, and I can get that using this framework. They're not completely commutative, in the sense that changes are partially ordered by their dependencies on other changes: each change explicitly encodes the dependencies that are required in order to apply it. For example, you cannot write to a file before introducing that file, and you cannot delete a line that doesn't exist yet; those are the basic dependencies.

Now, cherry-picking: there isn't even a cherry-pick command in Pijul, because cherry-picking is the same as applying a patch. We don't need to do anything special. There's no "git rerere" either. Git rerere is about solving a conflict, or rather having to solve the same conflict several times; I don't know if many of you have used that command, but the goal (I think it's somewhat automated now) is that once you've solved a conflict, you record the resolution, and Git will maybe replay it sometime in the future, if it works, and it doesn't always work. In Pijul, conflicts are just the normal case; they're solved by changes, and changes can be cherry-picked, so if you've solved a conflict in one context, you don't need to solve it again in another context.

For partial clones and monorepos, which I already mentioned: they're easy to implement as long as wide patches are disallowed. For example, if you do a global reformatting, a patch that reformats your entire repository at once (I don't know why you'd want that), then you obviously introduce unwanted dependencies between changes. If you want a global reformatting, one thing you can do is make one reformatting patch per sub-project, and then you can keep going.

For large files: one thing I haven't really talked about in detail is that patches actually have two parts. One part is the description of what they do, inserting some bytes, deleting some bytes, and the other part is the actual bytes that are inserted or deleted. We handle large files by splitting patches into the description of what they do, the operational part, and the actual contents; the operational part can be exponentially smaller than the contents. For example, if you work at a video game company and one of your artists produced ten versions of a two-gigabyte asset during the day, you don't need to download all ten versions; you only need to download the bytes that are still alive at the end of the day. That lets you handle large files easily: you still need to download some content, but much less than downloading every version.

All right, let's move on to some implementation tricks, some things I like and some things I'm proud of in the implementation of this system. The main challenge was working with large graphs on disk. When you're using any data structure more complicated than plain files, the question arises of how to store it on disk so you don't have to load the entire thing each time, because then the cost would be proportional to the size of the history, and that's just unacceptable. We want it to be logarithmic in the size of history, and that's what we achieve. Since we can't load the entire graph each time, we keep it on disk and manipulate it from there, and the trick is to store the graph in a key/value store, with vertices mapping to their surrounding edges.

Another thing we absolutely want is transactions, with passive crash safety. The goal with Pijul is to be much more intuitive than all the existing tools; my goal is to introduce it to lawyers, artists, maybe Lego builders or Sonic Pi composers, these kinds of people. And these people cannot tolerate non-passive crash safety: they cannot possibly tolerate some repair operation on a log that has to be run after you've unplugged the machine, or after a crash. So we absolutely want that. And the next feature we want is branches. They're not as useful as in Git, but we still want them, so we want an efficiently forkable store: we want to be able to take a database and just clone it without copying a single byte.
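Back to the first of those requirements for a second: stored that way, the working graph is a map from vertices to their surrounding edges, so applying a change only touches the keys near the edit site. A sketch, with a plain BTreeMap standing in for the transactional store:

```rust
// Sketch: adjacency stored in an on-disk key/value store, so the
// graph never has to be fully loaded. A BTreeMap stands in for the
// transactional store (Sanakirja, in Pijul's case).
use std::collections::BTreeMap;

type VertexKey = (u64, usize, usize); // (change, start, end)

#[derive(Debug)]
struct EdgeRecord {
    to: VertexKey,
    introduced_by: u64,
    deleted: bool,
}

fn main() {
    let mut adjacency: BTreeMap<VertexKey, Vec<EdgeRecord>> = BTreeMap::new();
    adjacency.entry((0, 0, 4)).or_default().push(EdgeRecord {
        to: (1, 0, 3),
        introduced_by: 1,
        deleted: false,
    });
    // Applying a change touches only the keys around the edit site,
    // so the cost is logarithmic in history, not linear.
    for (v, edges) in &adjacency {
        println!("{:?} -> {:?}", v, edges);
    }
}
```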
To solve these problems, I've written a library called Sanakirja; it's a Finnish word meaning "dictionary". It's an on-disk transactional key/value store, but it's not just that: it's more of a general-purpose file block allocator. It allocates blocks in the file in a transactional way, so if you unplug the machine at any time, and I really do mean any time, all your allocations will go away and memory will be automatically freed. It gets crash safety from referential transparency and copy-on-write tricks: it never modifies the previous version, it just creates a new version. And that comes at essentially no cost, because when you're working with disk files you already need to read them, and reading is such an expensive operation that one extra copy (I do at most one copy) each time you read a block from a file is negligible compared to the cost of the read itself. Lookups are O(log n), and that logarithmic bound is an absolute worst case in the total number of keys and values. And it's written in Rust, which might make some of you feel it's probably safe to use, and so on. But it actually has a super tricky API, because it's way too generic, and it's actually super hard to use; anyone who wants to use Sanakirja often has to write a layer on top of it just to provide the normal safety guarantees you'd expect from a Rust library. It also uses a generic underlying storage layer, so I can store things in an mmapped file, but I can also do my reads and writes individually and manually, or use io_uring, the fancy new I/O system in Linux, or do the other thing I'll talk about later in this talk.

Now, a really brief description of how I manage crash safety, using this system and multiple B-trees and roots. B-trees are these magical data structures that always stay balanced without your having to do anything special. A B-tree is a search tree with more than one element per node; in Sanakirja, nodes are limited to the size of one memory page or one disk sector, four kilobytes, and I store as many keys and values as I can in each block. For the sake of this example, I've limited my block size to just two elements, to keep the picture simple.

Say I want to insert a five. First I decide where to insert it, routing from the top: I know I need to insert it between three and seven, because five is between three and seven, so I go down to that child, and now I know I need to insert the five between the four and the six. But that node is already full (I told you the limit is two elements), so this causes a split: I get two leaves, 4 and 6, and since I wasn't able to insert the 5 into either of them, I have to insert it into the parent, between the three and the seven. But that node is full again, it's already at maximum capacity, so I need to split it too, and this is what I get. This is magical, because it's a super simple way of doing insertions that keeps the tree balanced: the only way the depth can increase is by splitting the root, and that automatically gives you the guarantee that all paths have the same length. I really love that idea. It's one of the oldest data structures, but it's still really cool, and it's very suitable for storing stuff on disk.
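Here's a compact toy version of that insertion-with-splitting, using the capacity of two elements per node from the slide (illustration only; Sanakirja's nodes are 4 KB pages holding many entries):

```rust
// Sketch of B-tree insertion with node splitting, capacity 2 as in
// the talk's example. An in-memory toy, not Sanakirja's code.
#[derive(Debug)]
struct Node {
    keys: Vec<u32>,      // at most 2 keys per node here
    children: Vec<Node>, // empty for leaves
}

// Insert `k`; if the node overflows, split it and return
// (middle key, new right sibling) for the parent to absorb.
fn insert(node: &mut Node, k: u32) -> Option<(u32, Node)> {
    let pos = node.keys.iter().position(|&x| k < x).unwrap_or(node.keys.len());
    if node.children.is_empty() {
        node.keys.insert(pos, k);
    } else if let Some((mid, right)) = insert(&mut node.children[pos], k) {
        node.keys.insert(pos, mid);
        node.children.insert(pos + 1, right);
    }
    if node.keys.len() <= 2 {
        return None;
    }
    // Overflow: keep the left key, push up the middle key, move the
    // right key (and right children) into a new sibling.
    let right_keys = node.keys.split_off(2);
    let mid = node.keys.pop().unwrap();
    let right_children = if node.children.is_empty() {
        Vec::new()
    } else {
        node.children.split_off(2)
    };
    Some((mid, Node { keys: right_keys, children: right_children }))
}

fn main() {
    let mut root = Node { keys: vec![3, 7], children: vec![
        Node { keys: vec![1, 2], children: vec![] },
        Node { keys: vec![4, 6], children: vec![] },
        Node { keys: vec![8, 9], children: vec![] },
    ]};
    if let Some((mid, right)) = insert(&mut root, 5) {
        // The root itself split: the tree grows one level, the only
        // way a B-tree's depth ever increases.
        root = Node { keys: vec![mid], children: vec![root, right] };
    }
    println!("{:#?}", root);
}
```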
Now, a bit about crash safety: how we use all this to keep our data safe. The way we do it is by having a number of sectors at the beginning of the file, each pointing to one copy of the entire database. For example, here, the first page points to the old version of my B-tree, this version here, and on the next one I'm building the new version by modifying some stuff. For the new version I don't have to copy everything; I copy just the bits I edit, and everything that hasn't been modified is shared with the previous version.

And that's all we do. So what happens when you unplug the machine, at any time, really? That part simply never gets written: the pages at the beginning of the file don't get written, so nothing happens, and the allocations go back to what they were before you started the transaction. The commit of a transaction happens when we change the first eight bytes of the file. Hard drives usually guarantee that you can write a full sector; they have a little battery inside that keeps the write of at least one full sector going. But often they tell you it's "best effort", so there's no actual guarantee that they do it. They guarantee it, but with no actual guarantee; I don't really know what that means. What I do know is that writing eight bytes should be okay: if they make a best effort for 4096 bytes, then there are certainly eight bytes they can write, with high probability.

Another feature of this system is that writers don't block readers, because the old versions are still available: if you start a read-only transaction while you're writing something, you can still read the old version. That's really cool as well; it's not super useful in Pijul itself, well, unless you start running Pijul in the cloud, as I'll show in a minute.

And while this sounds super fancy, with lots of redundancy, crash safety, copy-on-write, and should therefore be super expensive, it's actually the fastest key/value store I've tested. These two curves show how long it takes to retrieve things from, and insert things into, my B-trees. This is not specific to Pijul and not particularly optimized for Pijul; the only Pijul-related thing is that I haven't implemented long values yet, just because I've never needed them. Here I'm comparing four implementations of key/value stores. The slowest is a Rust crate called sled. Sled is slow in this test, but it's also really cool: it uses state-of-the-art technology to do lock-free transactions on the database, so you can have a giant computer with hundreds or, more realistically, thousands of cores, and your transactions won't block each other while still having ACID guarantees. Super cool, but unfortunately still a research prototype, and for the kind of stuff I'm doing on a single core it's not super relevant. The green line is the fastest C library, LMDB; it's battle-tested, and it's claimed in many places to be the fastest possible. Then there's Sanakirja, the system I've just introduced. And the orange line is a benchmark of something that cannot be achieved: the B-tree implementation in the Rust standard library, which doesn't store anything on disk. If you're storing things on disk, it will obviously take more time; the reason I added it is just to see how close we are, and we're not paying a lot for storage (on an SSD, obviously), because we minimize the number of reads and writes to the disk. The put curve shows similar performance. Removing sled, we can see it more clearly: this is about twice as fast as the fastest C equivalent.

And this was actually unexpected: performance was never the goal. The goal was just to be able to fork. Initially I contacted the author of LMDB to get him to introduce a fork primitive, but it was apparently impossible with that design, so I had to write my own.
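The forkability falls out of the same copy-on-write structure. A sketch of the idea (not Sanakirja's API; `Rc` stands in for shared on-disk pages):

```rust
// Sketch: why copy-on-write makes forking a database O(1). Versions
// share unmodified pages; a "fork" is just a new root pointing at
// the same pages.
use std::rc::Rc;

#[derive(Debug)]
enum Page {
    Leaf(Vec<(u32, u32)>),
    Node(Vec<Rc<Page>>),
}

fn main() {
    // Version 1 of the "database": one root sharing two leaves.
    let leaf_a = Rc::new(Page::Leaf(vec![(1, 10), (2, 20)]));
    let leaf_b = Rc::new(Page::Leaf(vec![(3, 30)]));
    let root_v1 = Rc::new(Page::Node(vec![leaf_a.clone(), leaf_b.clone()]));

    // Fork: a second root referencing the same pages. No bytes copied.
    let fork = root_v1.clone();

    // A later write copies only the page it touches (leaf_b here);
    // leaf_a stays shared between all versions.
    let new_leaf_b = Rc::new(Page::Leaf(vec![(3, 30), (4, 40)]));
    let root_v2 = Rc::new(Page::Node(vec![leaf_a.clone(), new_leaf_b]));

    println!("v1 = {:?}", root_v1);
    println!("fork before write = {:?}", fork);
    println!("v2 after write = {:?}", root_v2);
}
```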
All right, now some announcements: a hosting platform. We all like working together, but we don't like setting up servers. So how do we collaborate and share repositories? One way to do it in Pijul, and that's been the case since the beginning, is self-hosted repositories over SSH, but that's not always convenient: you have to set up a machine in order to work together. So I wanted to build a hosting platform, and I actually built it. The first version was released quite a while ago, in 2016. It's written entirely in Rust, just like Pijul, with PostgreSQL to handle user accounts, discussions, text and so on. It ran for a while on a single machine, and it went through all the iterations of the Rust asynchronous ecosystem, which means a lot of refactoring and rewriting, and it's never been really stable.

The worst time for stability was definitely the OVH Strasbourg data center fire in March 2021. I saw a slide yesterday, in one of the talks, where someone talked about your server being on fire, but I don't think they really meant it; here I do really mean it: there was an actual fire in the actual data center, and the machines were down for two weeks. Because it was an experimental prototype, there were no real backups, no replication, nothing of the kind in place. During those two weeks, I took advantage of that little break in my work to rebuild a replicated setup, using the fact that Pijul is itself a CRDT, so it's easy to replicate, and using the Raft protocol to replicate Postgres. At the time this was also convenient because my two largest contributors were using the Southern Cross cable, if you know what that means: they were communicating with the server in Strasbourg by first going from New Zealand and Australia to San Francisco, then across the US, across the Atlantic Ocean, and across France to Strasbourg. They had absolutely unbearable latencies, so this was cool, because I was finally able to give them a proper server with short latencies and short response times.

It's been working okay for a little over two years now, but the problem is that at the moment this is a personal project, totally unfunded, so the machines are really small, and I'm using Postgres in ways that aren't really intended. The core of my database is actually Sanakirja and Pijul; it's not stored in Postgres, so I need to communicate between these two databases, and I need them to be located close to each other. The replicas are not just backups; they're backups and caches at the same time. The consequence is that when the machines are under too high a load, Postgres takes a little more time to answer, the Raft layer interprets that as a total failure, and it triggers a switchover of the cluster leader. That would be okay if it just meant some downtime, but the consequence is actually much worse than downtime: it's data loss. Having small machines is fine; I don't mind if some of my users, on my experimental system, see it crash sometimes, or see it go down for a little while. That doesn't really matter. But when they start losing data, that's a problem. So I've decided to rewrite it.
Because I had been working with Cloudflare Workers and function-as-a-service in other projects, my renewable energy projects, I started thinking about how we could use them for Pijul. Really quickly: function-as-a-service is different from traditional architectures, where you have a big process overhead, or a virtual machine overhead, for each little piece of server you're running. Instead, you share the machine, and you share a single giant JavaScript runtime between lots of different functions, even from other users; Cloudflare runs, on each machine, one giant runtime shared by all its customers. That's really cool, because you can answer from all of Cloudflare's 250 data centers, which gives you optimal latency. It's also very easy to write: the minimal example taken from their documentation just responds "hello worker" to a request.

Now the question becomes: can we run, or at least simulate, a Pijul repository in a pure function-as-a-service framework? The storage options are fairly limited in function-as-a-service: you don't have access to a hard drive, you don't even have an actual machine. So how do you do that? How do you pretend to be a full-fledged Pijul repository when, in fact, you're just some replicated, eventually consistent key/value store in the cloud? That was the main challenge, and it's completely at odds with my hypotheses when I first wrote Sanakirja and Pijul: there is no hard drive at all.

The solution is to compile Sanakirja to WebAssembly, because you can run wasm on Cloudflare Workers, and to store pseudo memory pages in their storage engine: instead of using disk sectors, I'm using keys and values in their storage engine. The main problem now becomes eventual consistency. I solve that using the multiple roots I talked about earlier: I keep the older roots, because I know the changes I'm making to my key/value store may not have propagated to all data centers yet. Cloudflare guarantees, for example, a one-minute propagation time, so that's what I use to decide how long to keep the older roots, in order to avoid the data centers stepping on each other's toes. And we don't need a full Pijul there: checking dependencies and maintaining a list of patches is enough.
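A sketch of that storage indirection (a hypothetical trait, not Sanakirja's actual one):

```rust
// Sketch: the "pseudo memory pages" idea. The store normally talks
// to 4 KB file sectors; behind a trait like this, the same B-tree
// code can run on wasm with pages kept in a cloud key/value store.
use std::collections::HashMap;

const PAGE_SIZE: usize = 4096;

trait PageStore {
    fn read(&self, page_no: u64) -> Option<Vec<u8>>;
    fn write(&mut self, page_no: u64, data: Vec<u8>);
}

// In production this would wrap the provider's KV API; here a
// HashMap stands in for the replicated store.
struct KvPages {
    kv: HashMap<String, Vec<u8>>,
}

impl PageStore for KvPages {
    fn read(&self, page_no: u64) -> Option<Vec<u8>> {
        self.kv.get(&format!("page:{page_no}")).cloned()
    }
    fn write(&mut self, page_no: u64, data: Vec<u8>) {
        assert_eq!(data.len(), PAGE_SIZE);
        self.kv.insert(format!("page:{page_no}"), data);
    }
}

fn main() {
    let mut store = KvPages { kv: HashMap::new() };
    store.write(0, vec![0u8; PAGE_SIZE]); // the "first sector": root pointers
    assert!(store.read(0).is_some());
}
```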
Some technical details, as almost my conclusion. This service uses TypeScript for the web parts and the UI, and Rust compiled to WebAssembly for the Pijul parts. It can be self-hosted, although I've never tested that yet, using Cloudflare's workers runtime, which they've released as an open source project. It's open source, AGPL-licensed, and it will be released progressively, because there's a lot of stuff to release that's currently just experimental and prototypal. And it's starting today: I opened it just before the beginning of this talk, so you can now connect to nest.pijul.org and start creating accounts. There's no documentation yet, things may crash, there are probably lots of bugs, but all of that will come in the next few days or weeks.

As a conclusion: this is a new open source version control system based on proper algorithms, rather than the collections of hacks we've had for some time. It's scalable to monorepos and large files, and it's potentially usable by non-coders. The farthest stretch I've seen in discussions around this project is using it as a tool to help parliaments do their jobs. Parliaments are giant version control systems operated manually by highly qualified and highly paid lawyers, who are paid to check the logical consistency of the law, but actually spend a significant share of their time editing Word documents to apply changes that have been voted by members of parliament. They're doing manual version control, and they're wasting lots of time on it. I've collaborated with the French parliament, which would have been a good test case, except that we're not actually using our parliament at the moment, since the cabinet passes its bills as it wishes; it's like the test mode of an API. It could also be usable by artists; I've talked to lawyers as well; maybe by Sonic Pi composers, we had a really cool discussion last night about that; and why not by Lego builders wanting to build larger projects?

The hosting service is available since today, as I said. And another conclusion, a personal one: I have a tendency to work on way too many things at the same time, and it never works well, until it does. For example, working on electricity sharing at the same time as a version control system helped me see how the two fit together, and share ideas across projects.

To conclude, I would like to acknowledge some of my co-authors and contributors: Florent Becker, for all the discussions, inspiration and early contributions; tankf33der, the most patient tester I've ever seen, who is still there after many years, patiently checking all my bugs, so a huge thanks to him; Rohan Hart and Angus Finch, the two folks using the Southern Cross cable, who have contributed really cool stuff to Pijul; and Chris Bailey, who has bridged the gap between lawyers and legal people and what I'm doing. All right, thanks for your attention.
Hi everyone. First of all, thank you so much for coming. This is a project I've been thinking about and working on for a couple of years, and this is the first time I'm presenting it at a conference, so I'm excited and nervous, and I really appreciate your interest. If you're watching this at home: thank you so much for watching this video; I watch tons of these NDC videos myself.

Before I start, there's this disclaimer, which you've seen on Twitter and elsewhere, and I need to double down on it: yes, I work for GitHub, and yes, I'm talking about source control, but this is a side project, a personal project. I'm not announcing anything on behalf of GitHub today at all. Don't get me in trouble. And my friend here is watching you, so don't get it wrong.

Here's what we're going to talk about. A lot of these tech presentations are either product presentations, about what the product does, or philosophy presentations, or code presentations; this one gives you a little bit of all three. I'll talk about why it's important to be working on source control right now; I'll talk about Grace, of course, and what I've done with it; I'll talk about its architecture and do a little demo; then I'm going to talk philosophy about why I chose F#, which is my favorite language; and then we're going to look at some code at the end, and hopefully it all makes sense.

So why on earth am I doing a source control system? I mean, Git has won, right? What's wrong with me? With all due respect to Git and to its honorable competitors: Git has wiped the floor with everyone. We all use it by default, and it handles almost every source control scenario, though large files are a problem, as you probably know. I think Git won because its branching mechanics are really brilliant compared to the older source control systems; the lightweight branches are great, and I love the ephemeral working directory in Git. GitHub is of course a big part of why Git won, and, you know, Linus Torvalds, I've heard he's famous, so that clearly had a part in it.

In terms of doing a new source control system, I have to talk about some of the things I don't like about Git, so I have to say some mean things about Git on the next slide. But before I do, I want to say really clearly how much I respect Git and its maintainers. I'm privileged to be at GitHub and to know some of the core maintainers of Git, and they are brilliant engineers; one day I hope to grow up and be half as good as they are. They do world-changing engineering, they're super professional, they're just amazing. Their blog posts, if you ever want to read the stuff from Taylor Blau and Derrick Stolee on the GitHub blog, are mandatory reading. They're great. I really, deeply respect Git.

However. Git has terrible UX. Absolutely terrible, no good, horrible UX. It's designed by a kernel developer, and I don't want my UX designed by a kernel developer, ever. I mean, my second programming language, when I was 11, was 6502 assembly; I've done three kinds of assembler; I get it, I love being down at the metal, it's super interesting. But I don't want kernel developers designing my UX. And Git was really designed for 2005's computing conditions: much lower network bandwidth, smaller computers, smaller disks, smaller networks.
We don't have those constraints anymore. If you're here at NDC, or watching this video, odds are you have abundant bandwidth, abundant disk space, and a connection to the cloud; a lot of us can't even do our jobs without an internet connection anymore. So things have moved on. There's also tons of research on how confusing Git is, but what I really want to pound the table on is that we need to stop blaming users for not understanding it. Git is really hard to understand.

There's this really interesting quote from a research paper from Google, from ten years ago or so: even one of the more experienced Git users requested that someone else perform an operation because, quote, it scares the **** out of him. So it's not just me saying this. Mark Russinovich, by the way, is a pretty smart guy: "git is the bane of my development process." Being smart is not the ticket out of understanding Git. Or: "an incantation known only to the git wizards who share their spells." You know who else knows Git is hard? GitHub. This is what you see when you go to download GitHub Desktop: "Focus on what matters instead of fighting with Git." So we know.

Here are some interesting questions from Quora, from about a year ago, when I was doing this research, and some of them are kind of funny. Why is it so hard to learn? What makes you hate Git? Is Git poorly designed? "Git is awful": that's plain enough. They're funny, but there's pain in these questions too, which I really feel. And this one broke my heart: "If I think Git is too hard to learn, does it mean that I don't have the potential to be a developer?" Imagine some young person who's just getting started, messing around with some JavaScript or Python, and they're thinking, wow, I think I could do this, this seems fun. And then someone hands them Git, and they go: what is this? We deserve better. We need to do better. And instead of complaining about it, I've decided to devote pretty much all my free time for the last couple of years to doing something about it.

I also just want to point out that we're in an industry where everything changes; nothing lasts forever. Git right now is so dominant that it's hard to imagine something new coming along and replacing it, but if we don't imagine it, we can't get there. We're driven by trends just as much as we are by good technology, and I've had a little bit of an education in trend analysis. I'm very lucky: my partner was a fashion designer, a clothing designer, and her job for 15 years was to think about what women would want to wear, what women would want to feel like, a year and a half or two years from now, and then design clothes for it. She was very successful; she's very good at it. And just in conversations over all the years we've been together, I've picked up a few things; I've picked up that perspective. She sells way better than I ever will. Some of our trends have shorter cycles, like web UI frameworks, which come and go every six months or so (I mean, it's stabilized now), and some are longer: for 20 years, when you said the word "database", what you meant was a relational database, and now there's key/value, there's document, and with Hadoop we got MapReduce. Things do change. Most importantly, no product that's ever gotten to 90 percent has just gotten there and stayed there forever.
Git is currently, according to the Stack Overflow developer survey (and there should be a new survey coming out in about a month), at 93.9 percent. And something with UX that bad is not going to be the thing that breaks the trend: Git is going to go. There will be source control after Git. And what I want to say about it is that it won't be "Git++". I've had this discussion a lot over the last couple of years: can't you put another layer on top of Git, use Git as a back end and put a new front end on it? That's been tried a number of times over the last many years, and none of those attempts got any market share. I feel like once people go through the challenge of learning Git, they don't want to re-challenge themselves by learning something else just to use the thing they already know. And I want to say that adding features to Git won't prevent Git from being replaced; it's not that if Git makes some changes, it'll extend its lifespan. It won't. It's too late for that. I do think that whatever replaces Git will have to be cloud native, because, oh look, it's 2023. And I think the thing that will attract users to the new thing is that it has to have features that Git doesn't have, or can't have. I've tried to build some.

Okay, let's talk about Grace. "Source control is easy and helpful." Imagine that. That's really what I've tried to do; it's been my north star from the first evening I was sitting on my porch, during pandemic lockdown, dreaming about it, thinking: how can I make something super easy? My inspiration, believe it or not, was the OneDrive sync client. The OneDrive sync client used to be problematic a few years ago, but starting about three or four years ago it's been great, rock solid; it just works. I really like it. Substitute Dropbox, iCloud, whatever you want: those things just work, they're easy, they sync. I was thinking about how to get stuff off my machine and into the cloud. So Grace has features that make being a developer easier, not just source control, but things that try to make your life easier every day, and it feels lightweight and automatic; that's what I'm mostly going for. It hopefully reduces merge conflicts, which in Grace you'll see called promotion conflicts. And yes, of course, it's cloud native, thanks to Dapr.

Let's talk about basic usage real quick; I'm going to show you lots of pictures and a short demo. The first thing is "grace watch". You can think of it as the thing that runs in the system tray on Windows, in the lower right-hand corner; on a Mac, it runs with that little icon near the clock. It's a background process that watches your working directory for any changes, and every time you save a file, it uploads that file to the cloud, to the repo, and marks it as a save, so that you get a full new version of the repo, a new root directory version with a newly computed SHA, every single time you save a file. It's automatic; it just works in the background. I'll show you some detailed examples of that.
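To illustrate the idea, here's a toy polling loop (in Rust, like the other sketches in this document; Grace itself is written in F#, and presumably uses real filesystem events rather than polling):

```rust
// Toy sketch of a "watch" loop: poll a directory, and when a file's
// modification time changes, pretend to upload it and record a save.
// Illustration of the save-on-every-write idea only.
use std::collections::HashMap;
use std::time::{Duration, SystemTime};
use std::{fs, path::PathBuf, thread};

fn upload_and_mark_save(path: &PathBuf) {
    // In Grace this would upload the file to object storage and
    // create a new root directory version with a new SHA.
    println!("save: uploaded {:?}", path);
}

fn main() -> std::io::Result<()> {
    let mut seen: HashMap<PathBuf, SystemTime> = HashMap::new();
    loop {
        for entry in fs::read_dir(".")? {
            let path = entry?.path();
            if !path.is_file() {
                continue;
            }
            let modified = fs::metadata(&path)?.modified()?;
            // First sighting or changed mtime: record a save.
            if seen.insert(path.clone(), modified) != Some(modified) {
                upload_and_mark_save(&path);
            }
        }
        thread::sleep(Duration::from_millis(500));
    }
}
```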
Aside from that background process, the main commands you're going to see are these. "grace save" is the thing grace watch runs to upload after every save on disk. Then there are checkpoint and commit. In Git, commit is an overloaded concept, right? Commit means "I'm partially done": git commit -m "I'm done with part one of the PR", "I'm done with part two", and then you do a final git commit, "I'm ready for the PR", and then we have this debate about whether to squash. There's no squashing in Grace; you don't need it. Checkpoint is that intermediate step, and it's just for you: it's for you to mark your own time, it doesn't affect anyone else, and because of that we can do some cool things I'll talk about later. There is of course a "grace commit"; a commit is a candidate for merge, or what I call promotion. And eventually, when you're ready to promote, you do a "grace promote"; I'm going to show you how the branching works in a little bit. There's also, of course, "grace tag", which is a label, if you want to label a particular version: "this is version three of our production code".

Those five things, save, checkpoint, commit, promote and tag, are what I call references, and a reference in Grace is something that points to a specific root version of the repo; those root versions just come and go. Of course there are other commands. You need a status; "grace switch" to switch branches; "grace refs" to list the references you've been working on, all your saves and checkpoints and commits, or just your checkpoints, or just your commits, whatever. There's of course a diff, there's a rebase, and there's going to be a "grace share" command. Share is kind of interesting. I haven't written it yet; some of the stuff I'm talking about I haven't written yet. This is an early alpha, just to be clear; I should have said that. It kind of works. Share is this idea where I'm working on my code, I have a problem, and I might want to get in a chat with a team member and go, hey, could you look at this code for me? You say "grace share" and it spits out a command line that you can copy and paste and give to that person. Because with Grace everything is automatically uploaded, and the status of your repo is always saved, you don't ever need to stash; they can literally take that command, paste it, hit enter, and now their version of the repo is exactly the one you were working on, and when they're ready, they just do grace switch back to their own branch. That's the kind of workflow I'm trying to enable.

A quick general overview. Locally, there's grace watch, the background filesystem watcher that watches your directory. Switching branches is really fast and lightweight; I love that about Git, so I kept it. Of course I have a .grace directory, like a .git directory, with objects in it, and config, and stuff. That's the local side, and your working directory is still totally ephemeral. The Grace server is actually just a Web API. There's nothing fancy, no special tightly-packed binary protocols; it's just a Web API, and you can code against it too. I use Dapr, if you're familiar with Dapr, to enable it to run on any cloud or any infrastructure. And because everyone's running grace watch (you don't have to run grace watch, but you really want to), the server has a kind of real-time view of what everyone's up to, and that enables some really interesting features to be built. Also, because everything that happens in Grace is an event that gets saved, you immediately get an eventing model that you can automate from. And I just want to say: the system is written almost entirely in F#, and I'll talk about that, of course.
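The five reference kinds can be pictured as a single record type (field names here are my guesses, not Grace's actual schema):

```rust
// Sketch of Grace's "reference" idea as described in the talk: every
// gesture creates a record pointing at a root directory version.
#[derive(Debug, Clone, Copy)]
enum ReferenceKind {
    Save,       // uploaded automatically on every file save
    Checkpoint, // personal marker, kept on your branch
    Commit,     // candidate for promotion
    Promotion,  // a commit's root recorded on the parent branch
    Tag,        // a label such as "production v4.2"
}

#[derive(Debug)]
struct Reference {
    kind: ReferenceKind,
    branch: String,
    root_sha: String, // the root directory version it points to
    message: Option<String>,
}

fn main() {
    // A promotion is "just a database record": the same root SHA as
    // the commit it came from, recorded on the parent branch.
    let commit = Reference {
        kind: ReferenceKind::Commit,
        branch: "alice".into(),
        root_sha: "ed3f".into(),
        message: Some("ready for promotion".into()),
    };
    let promotion = Reference {
        kind: ReferenceKind::Promotion,
        branch: "main".into(),
        root_sha: commit.root_sha.clone(),
        message: Some("promoting".into()),
    };
    println!("{:?}\n{:?}", commit, promotion);
}
```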
Now, some of you might be thinking: wait just a damn minute, you're uploading every save? That seems like a lot. And the answer is: yeah, but we don't keep them for very long, unless you specifically checkpoint or commit or promote. By default I'm starting with seven days; I don't know if that's the right number, maybe it's three, maybe it's 30, whatever it is, we'll figure it out, and it's going to be settable per repo. But let's say seven days: after that, we just delete the saves. They're there for you to look back on when you want to, and then they just go away. Commits and promotions are kept forever.

And that gets me thinking about what features we can build. One thing I wanted to build: if you've used the Mac backup, Time Machine, that interface was sort of an inspiration I had in my head. I'm not going to design it quite like that, but it's this idea where you can just go back and leaf through the diffs of the changes you've made, and hopefully do some cool stuff like reload state in your mind faster. You know that often-quoted number, that it takes 18 minutes to reload state in your head after an interruption? What if we could use source control to get that down to two minutes, where you could just quickly leaf through your saves and go: oh, I see, I see, I see what I was doing.

You can also do cool stuff like this: if the server can see what everyone's doing, and it sees that Alice is working in a particular file and Bob is working in the same file right now, well, that's a possible conflict, so we can notify both of them in real time: hey, your teammate is working in the same file; here's a link to the version your teammate is changing. Maybe it's in a totally different part of the file and you don't care; maybe it's in the same part, and now you get on chat and talk about it before you get to a conflict, when you think you're done with your work. I hate merge conflicts. I hate them so much. I'm sure I'm alone in that. Another thing we can do: when there is a promotion to main, I can auto-rebase everyone in the repo on that new version immediately. You'll see that this is super important in the way Grace's branching is designed, and it happens within seconds.

So what is the branching model? I call it single-step branching, and the idea is that, like Git, there are these versions of the repo, and they have a particular SHA value, and all of the commits and labels in Git just point to a particular version. That's what I do with Grace: a promotion is just a new reference, a database record, in other words; it's a new database record that points to the same root directory that your commit pointed to. I do grace commit -m "I'm ready to promote", then I do grace promote to main, and all grace promote does is create a database record. That's all it does, because the files are already uploaded; you already did the commit. And a child branch cannot promote to the parent branch unless it's based on the latest promotion. That prevents a lot of conflicts.
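The rule can be stated in a few lines (my paraphrase of the talk, with invented names):

```rust
// Sketch of the single-step branching rule: a child branch may
// promote only when it is based on the parent's latest promotion.
struct Branch {
    name: String,
    based_on: String,         // promotion SHA this branch is rebased on
    latest_promotion: String, // only meaningful for the parent branch
}

fn can_promote(child: &Branch, parent: &Branch) -> Result<(), String> {
    if child.based_on == parent.latest_promotion {
        Ok(())
    } else {
        Err(format!(
            "{} is based on {}, but {} is now at {}; rebase first",
            child.name, child.based_on, parent.name, parent.latest_promotion
        ))
    }
}

fn main() {
    let main = Branch {
        name: "main".into(),
        based_on: "ed3f".into(),
        latest_promotion: "a1b2".into(), // Alice just promoted
    };
    let bob = Branch {
        name: "bob".into(),
        based_on: "ed3f".into(), // still on the previous promotion
        latest_promotion: String::new(),
    };
    // Bob can't promote until grace watch auto-rebases him onto a1b2.
    println!("{:?}", can_promote(&bob, &main));
}
```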
So let me show you a little bit of what that looks like. Here's a diagram that I have. Oops, not that. This. Here's my repo, by the way. I want stars. I want lots of stars; if you're watching this at home, I want all the stars. Star the repo: it's the only way I can let people know. So here's a little document I have on my branching strategy, and I'm going to show you four quick pictures to walk you through it really quickly.

Here are Alice and Bob. When they're not busy holding two entangled particles at improbably large distances (if you're into physics, you see Alice and Bob a lot), they're working on a repo together. They're starting out in good shape: main is at this particular SHA value, ed3f-something, and Alice and Bob are both based on that version. Everything's cool. Their SHA values are different from each other, because they're updating different files: Alice is updating her files, Bob is updating his, but they're both doing their thing.

Now Alice makes one more change, runs some tests, and goes: cool, I'm ready to promote. She types grace commit -m "I'm done with my PR", and then she types grace promote -m "I'm promoting". Her files are already uploaded, because she did the commit, so the promote just creates a new database record on the main branch that points to that particular version of the repo. After she does that, and let's assume it's successful, we're in a state where Alice is based on the version she just promoted; in fact, she's identical to the version she just promoted, because she just promoted it. But Bob is still based on the previous version of main, and that means that, for the moment, Bob can't promote. Poor Bob. However, heroically, grace watch comes to the rescue (that's so hacky, I'm sorry). Grace watch gets a notification that a promotion has happened and immediately auto-rebases Bob, within seconds. As long as there are no conflicts, that is, as long as the files that changed aren't the ones he's changed, those new versions get downloaded, Bob's branch is marked as now being rebased on the latest version of main, and a new save is created, because Bob's branch now has the files he was changing plus whatever just came down from the promotion. After that's done, Bob has a new SHA value, which reflects the files he changed plus the ones that were in the promotion, and Alice at this point has the same exact SHA value as main, because she literally just committed and promoted.

So that's branching in Grace. It's really simple, as simple as I could possibly make it, mostly because, like, I'm not that bright; I mean, why make it hard on myself? But really, it's simple to understand.

Let's go back to the deck. I'm going to show you some pictures of how this works from the server's point of view. Let's say I'm working on my branch, and these dots are directories in a structure, in a repo, and I save a file in this bottom directory over here. Grace watch sees it, uploads that version of the file to object storage, and has to compute new SHA values for those two directories, so it creates what I call directory versions. It uploads those to the server, the server double-checks them, and now we have a save reference pointing to this brand-new version of the repo that doesn't exist in anyone else's branch. Cool. At the same time, my teammate Mia is working, and she saves a file in that bottom directory; that file gets uploaded, and three new directory versions are computed, all the way up to the root, uploaded to the server, and double-checked. Now she runs some tests and goes: cool, I like this version, I'm going to checkpoint it. When she does the checkpoint, all it does is create a database record, because there's nothing to upload; it's already uploaded. So that takes about a second.
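A sketch of that recomputation (the hashing scheme here is invented for illustration; it uses the widely available sha2 crate):

```rust
// Sketch: recomputing directory versions from a changed file up to
// the root, which is what grace watch does on every save.
use sha2::{Digest, Sha256};

fn sha_hex(bytes: &[u8]) -> String {
    format!("{:x}", Sha256::digest(bytes))
}

fn main() {
    // A changed file at depth 3: say, src/core/store.rs.
    let file_sha = sha_hex(b"new file contents");

    // Each directory's version hashes its entries' SHAs; a change
    // deep in the tree therefore re-hashes only the path to the root
    // (3 directory versions here), not the whole repository.
    let core_sha = sha_hex(format!("store.rs:{file_sha}").as_bytes());
    let src_sha = sha_hex(format!("core:{core_sha}").as_bytes());
    let root_sha = sha_hex(format!("src:{src_sha}").as_bytes());

    println!("new root directory version: {root_sha}");
}
```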
I've really tried to make a lot of the gestures as fast as I possibly could, despite the fact that there's always a network hop involved in Grace, and I've aimed for that roughly one-second timing. When Git does local stuff, there are things Git is just going to be super fast at, things I can't compete with if there's a network hop; but there are things that involve the network where I can be faster than Git.

Anyway, here's me again: I save a file down there, and this time it's four directory versions that get recomputed and uploaded. Cool. Now teammate Lorenzo saves a file in that directory; the file gets uploaded, there are three directory versions, and he likes it, so he commits it, and he's thinking about promoting it. Meanwhile Mia is working: she saves a file in this bottom orange directory, she likes it, she runs tests, everything's cool, and she does a grace commit and then a grace promote. Now there's a reference pointing to that same exact root SHA. We have three references pointing at the same exact directory version: one of them is on main, two of them are on Mia's branch. In fact, we like it so much we're going to put a label on it and say this is our new production version, 4.2. Cool. So what do we have from that? We have five different versions of the repo and ten references pointing at those five versions.

Okay, seven days later. I said we keep saves for seven days or so and then we get rid of them; let's see how that works. Here's the save I did; these are the same five versions we just saw. Let's say it gets deleted. Now the root directory version says, "I don't have anything pointing at me anymore," and a recursive process starts: go down and check whether anything was unique to that version that doesn't appear anywhere else, and delete it. So that version is now gone. Now remember, Mia did a save and a checkpoint: we delete the save reference, but the checkpoint is still there, so that version of the repo stays. Here's me again: I did that save and didn't do anything else with it; the save gets deleted, and that version goes away. Here's Lorenzo again: the save goes away, but the commit is still there, so that version of the repo is still there. Now this last one: again, the save gets deleted, but we still have three references pointing at that directory version. And let's imagine Mia's branch gets deleted. No problem, we still have references pointing to that version from the main branch. So after seven days, we have three versions of the repo and five references pointing at them. And really, saves are going to outnumber everything else, something like 25 to 1 (I'm making up a number; it depends on your workflow), maybe 25 to 1 or 50 to 1, and they'll come and they'll go, and it's not a big deal.
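Here is a rough sketch of that recursive cleanup, under the same caveat that the names and shapes are invented for illustration rather than taken from Grace:

```fsharp
/// Delete a directory version when nothing references it, then consider its
/// children the same way. `referencesTo` counts remaining saves, checkpoints,
/// commits, promotions, and tags pointing at a version; `appearsElsewhere`
/// says whether another surviving repo version still shares this node.
let rec deleteIfUnreferenced (referencesTo: string -> int)
                             (appearsElsewhere: string -> bool)
                             (childrenOf: string -> string list)
                             (delete: string -> unit)
                             (sha: string) =
    if referencesTo sha = 0 && not (appearsElsewhere sha) then
        // Check the children first, then remove this node itself.
        for child in childrenOf sha do
            deleteIfUnreferenced referencesTo appearsElsewhere childrenOf delete child
        delete sha
```

The key point is the guard: a directory version only disappears when no reference points at it and no other surviving version of the repo shares it, which is why Mia's checkpointed version and Lorenzo's committed version stay put.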
So, congratulations: you now understand Grace. Really, that's as simple as it is. Obviously there's a tiny bit more, but fundamentally, in ten minutes I just explained Grace to you, and if you recall your experience learning Git, I'm going to guess it took more than ten minutes. One of the questions I like to ask people is: would you rather teach somebody Git, or would you rather teach them what a monad is? And you've got to think about that. I'd rather teach them what a monad is, to be honest.

Okay. Demo gods: I'm going to do a very short, light demo, just to show you a little of what it feels like, for one user. Here's my friend Alexi. (I'm a New York Rangers fan, so I pick players from the Rangers for names.) Here's Alexi with a file, and on the bottom here I'm running Grace watch; in fact, I'm just going to... oops... and here's Grace watch running. Right now Grace watch is a command-line thing with a lot of debug output; this is an early alpha. Now I'm going to run grace status, and here's the status of Alexi's... oh, it's funny, the font got changed; let me switch so we can see the full screen.

So here's the status of Alexi's branch right now. I was doing a little bit of testing earlier today: there was a save and a checkpoint. See, I checkpointed that same save, so they have the same SHA value. It's based on this particular promotion that Alexi did, and here's the message from that. And here's main, the parent branch that I'm based on. So you can see the status of your branch and the status of the branch you're based on.

Back to Grace watch. I'm going to update this file and add a comment somewhere... cool, new version of the file. I hit save, and there you go: file uploaded, new directory versions computed, server has all the information. That was strangely slow, I guess because I hadn't done it in a while; let's do another one just to see. In fact, let's update this comment. (Thank you GitHub, thank you Copilot.) I hit save again, and there you go: file uploaded, new directory versions. That's how fast it should be. This server I'm talking to, by the way, is my desktop computer in Seattle, 4,500 miles from here, 7,300 kilometers, and it's pretty quick. Not bad. That feeling of how quick it is in the background is what I'm trying to do with Grace. It's automatic: I'm showing you what Grace watch is doing, but the truth is you're just in here doing stuff, hitting save, and magic stuff happens in the background.

Let's say I like that version we just made and I want to checkpoint it. By the way, let's do grace status again: you'll notice that previously the save was the ca21 version; now I've done a couple of saves, and I have this new cd79 version. In fact, if I do grace refs, I can see all the references on my branch, and again, about 1.7 seconds, with everything in debug mode and the server a few thousand miles away. It's fast; I'm aiming for everything to be fast enough to keep you in flow. So that's all the things; you can see everything you've been doing. I'm going to do a grace commit, because I like this version and I did some tests... "NDC"... oh, slow... yay. Cool, there's my commit. Again, everything's already uploaded, so the commit is just creating a database record, and it's pretty quick. I like it, so I'm going to do a promote, a promotion from Alexi. Again, all I'm doing is creating a database record, and then I'm rebasing my own branch (because, well, I am). And now if I do grace status, what we'll see is: the save I did was that cd79 version, the commit I did was that same version, the checkpoint was something I did four hours ago, and main is now that same version, and I'm based on it. This flow is what I'm aiming for. It's just this quick; most of it happens in the background, most of it's automatic, and when you take an action, I'm aiming for that one-to-two-second window for all the commands.
That's the quick demo; that's really what it feels like, and like I said, you already understand it, I hope. Cool. Well, thank you. I wasn't really begging for applause, but thank you so much; after two years, I really do appreciate it. I'll totally take it. [Laughter] All right, let's keep going.

But wait, there's so much more I wish I had time to tell you about; I'm going to go through some of this really quickly. I'm using SignalR for two-way communication with Grace watch, so I have live two-way communication from the server to everyone. Right now I'm showing you the example of Alice and Bob, but what if it's Alice and 20 other contributors in the repo? What if it's 100 people in the repo, and they're all over the world? Well, now I can do cool stuff like auto-rebasing people all over the world. I don't even know all the features I want to build on this eventing model; I'm not the one who's going to think of them all. But I know I have the architecture to do it, and I know we can do some really cool things in real time. Especially now: I know people are going back to the office, but we're still going to be partially to mostly remote, and I feel like there are some interesting ways we can connect with our teammates at the source control level that have never been done before, and that can't be done in a distributed version control system. I don't know what's happening in your clone of Git until you do a push; but here, I always know what you're doing.
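The talk doesn't show this code, but as a sketch of what subscribing to promotion events over SignalR could look like from a watcher process (the hub route, event name, and payload below are invented for illustration):

```fsharp
open Microsoft.AspNetCore.SignalR.Client

/// Connect to the server's event hub and auto-rebase whenever a promotion
/// is broadcast. "PromotionHappened" and the "/events" route are made up here.
let startWatching (serverUrl: string) (autoRebase: string -> unit) =
    task {
        let connection =
            HubConnectionBuilder()
                .WithUrl(serverUrl + "/events")
                .WithAutomaticReconnect()
                .Build()
        // Register a handler for the server-to-client event.
        connection.On<string>("PromotionHappened", fun (promotionSha: string) ->
            autoRebase promotionSha)
        |> ignore
        do! connection.StartAsync()
        return connection
    }
```

The design point is that the server pushes to clients, rather than clients polling, which is what makes the within-seconds auto-rebase possible.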
With Grace in an open source capacity, you don't have to fork anything. You don't fork the entire repo; you just create a personal branch. Imagine, say, asp.net core: all you do is walk up to it and create a branch. You own the branch; they don't have any control over it; you do whatever you want. You can keep your branch private or public, visible or not, up to you. You can do work in it and do a PR off of it, which in Grace is not a pull request, it's a promotion request. There is no pull gesture in Grace, and there is no push gesture in Grace.

I have tested Grace so far on repositories as large as a hundred thousand files and fifteen thousand directories, and as long as you're running Grace watch, you still get that one-to-two-second timing. It just works. If you're not running Grace watch, there are a lot of other checks I have to do when you do a checkpoint or a commit or whatever: I have to check that all the things are uploaded, and if I'm walking through fifteen thousand directories to do it, it'll take a few seconds. With Grace watch, I know it's up to date. I've also tested it on 10-gigabyte files, because Grace is backed by object storage. (In my testing, I'm an Azure and .NET guy, so I've been running it on Azure storage.) It just works. To be clear, I don't think a 10-gig file should be in source control, but I've tested it; it totally works.

I have not built ACLs down to the file level yet, but I have an idea for how I'm going to do it, and I want to do it; I think it's super important, and Git can't do that. Custom branch permissions: let's say on main I want to enable promotions and tags but disable saves, checkpoints, and commits, because you should never do a checkpoint or commit on main. Okay, cool, that's already there. On your own branch, maybe you never enable promote, because why would you promote on your own branch?

You can delete versions. Obviously I'm deleting saves, so we have the mechanics for deleting versions. Unfortunately, we do all occasionally check in a secret, and I can speak for GitHub here: GitHub has worked extensively to create secret scanning, so when you do a push, we see that you're checking in a connection string or whatever. Still, people are going to do that, and deleting it out of Git is a nightmare. In Grace I'll have a gesture (I don't have it yet) that keeps the audit trail. It will say: okay, you promoted something with a secret, and we have to get rid of that version, but here's a record saying this version was this SHA value, and it was deleted by this person, at this time, for exactly this reason. So there's permanent auditability.

I've really had to think about enterprise features, of course; enterprises are going to be a big user of any version control system, and I'm aiming for that for sure. File locking is another one. One of the main reasons users still use Perforce is gaming, where there are big binary files. I had a conversation with somebody who does source control at Xbox, and he gave the example of somebody working on Forza who is editing a track file, the graphical definition of a track, and whose task is to add ten trees around turn three on some track. They have to lock that file, because there's no way to merge a binary file; if someone else edits it, they've just thrown away hours of work. So we have to have optional file locking. It won't be on by default, but I want to provide it in Grace.

As I said, there's a simple web API, and yes, it's super fast, and consistently fast. I'm a big believer in perceived performance, and perceived performance for me consists of two things: number one, is it fast, and number two, is it consistent. Your muscle memory becomes: if I type grace checkpoint and hit Enter, I know my command line is going to come back in a second, a second and a half, every single time. That's perceived performance to me; that's what I'm going for.

I very much want to build all the interfaces. Of course I'm building a CLI, and I will build a Blazor web UI. But I'm going to go on record and say: I hate Electron. I hate Electron so much. I hate WebView. We all have these first-class pieces of hardware, and we're running these second-rate apps on them; it drives me nuts. I believe in native apps. I'm going to try doing it with Avalonia; I haven't started that yet, but I hope and expect it will work well, and if not, I'll do .NET MAUI. But I will have native apps, including code browsing, some of these gestures, and that time-machine-like history view. If we have time at the end, I'll show you a little sketch of that I did. (What are we doing, 20 minutes? Oh my God, I have to hurry up.) Of course I'm borrowing stuff from Git; I respect Git enormously. I'm really trying.

Let's talk about the architecture really quickly. In case it's not obvious, Grace is a centralized version control system.
Decentralized version control, to me, is why we have a hard time understanding Git; being centralized simplifies everything. There have been lots of attempts at making a simple Git, or a simpler distributed system. By the way, there are a couple of other great projects going on right now. If you're interested in source control, you should really check out Pijul (p-i-j-u-l), a really interesting distributed version control system based on category-theoretic mathematics; it's patch-based. In fact, a couple of days ago the Stack Overflow podcast did an interview with its creator; it's a really nice project. Another one I really like is JJ, which comes from Google: martinvonz/jj on GitHub. I actually know that team and speak to them a little bit; they're great people, trying to do something really interesting and move Google's internal source control forward. I really like it; more friendly competition. I'm in this to win.

But I just want to point out: if you like distributed, you're not doing distributed today anyway. You're using GitHub or GitLab or Atlassian or whoever; you're doing hub-and-spoke; you're doing centralized version control. You never push from your machine to production. You always push to the center, the center runs some CI/CD stuff, and that's how things get promoted to production. You're doing centralized version control; you're just using a distributed version control system to do a dance around it. Why? And Git will be around for 15 or 20 years if you still want to use it.

As I mentioned, Grace is all about events; it's event-sourced, fundamentally. I really like CQRS; event sourcing goes really well with functional code and a reactive perspective on code, and I do a lot of projections off those events. I'll show you a little about that in a second. I use the actor pattern; actors are built into Dapr. I really like the actor pattern: it makes it super easy to reason about everything that's happening. Each individual actor is single-threaded, but you can have thousands of them running, and they scale up enormously. I use them for event processing and as a networked in-memory cache. They're really great, and you can have multiple state representations for them that last however long you want.

Grace is cloud-native thanks to Dapr. Dapr currently supports around 110 different platform-as-a-service products, and it's growing all the time; the community is growing, usage of Dapr is growing, and I'm really happy about the bet I made on Dapr back at version 0.9 or so. Azure, AWS, GCP, on-premises, open source and containers: whatever you want to use under it. Here's a little picture of it: it does service mesh, state management, pub/sub; it has the actors; it has monitoring and observability pieces. You can talk to it from any language; there are SDKs, and of course I'm using .NET. It communicates with any infrastructure.
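Here is the kind of portability that buys, as a tiny hypothetical sketch using the Dapr .NET SDK from F#. The code names a state-store component, and whether that component is backed by Redis, Cassandra, or Cosmos DB is decided purely by configuration; the type and key below are illustrative, not Grace's actual schema:

```fsharp
open Dapr.Client

type BranchDto = { BranchId: string; BasedOn: string }

/// Save a branch record through Dapr's state API. The "statestore" component
/// can be Redis locally and Cosmos DB in Azure without any code changes.
let saveBranch (dapr: DaprClient) (branch: BranchDto) =
    task {
        do! dapr.SaveStateAsync("statestore", branch.BranchId, branch)
    }

// Typical setup (hypothetical usage):
//   let dapr = DaprClientBuilder().Build()
//   saveBranch dapr { BranchId = "alexi-branch"; BasedOn = "ed3f..." }
```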
Here's a little example. If you were going to distribute Grace using open source products: here are my Grace containers, Grace Server with the Dapr sidecar, running in some Kubernetes cluster somewhere. For object storage I'm using MinIO or something compatible with it; Cassandra as a document database; maybe Redis, and Kafka for pub/sub; OpenTelemetry with Zipkin for monitoring; and HashiCorp Vault for secrets. Cool, that's my open source deployment. Now, if I'm deploying Grace to Azure: same exact Grace container, same code, same everything, but at runtime I configure Dapr differently. I put Cosmos DB under it, Service Bus and Event Hubs for pub/sub, Azure Monitor, and Azure Key Vault. And again, I did the Azure picture because I'm an Azure guy; fill out your AWS picture, your GCP picture, whatever you want. That's Dapr; it works very well, and I'm happy with it.

Projections. Imagine I have all these updates streaming in, saves and checkpoints and commits and whatever, and they hit the server. As soon as they hit the server, I'm going to trade some CPU and some disk space for your performance and your UX. So I'll do things like: when you do a commit, I might do a test promotion. Can I promote this? If I find a problem with the promotion, I might let you know. In fact, what I can do today, thanks to large language models, is detect a promotion conflict and actually send that code off to a language model for a suggestion to give back to you right away, so I don't just tell you there's a problem; I tell you and give you a suggestion for how you might deal with it. I might do temp branches; I might generate diffs right away, so that when you ask for a diff it just shows up in under a second. I can do all these projections off the data I have and just keep the events. And again: saves go away after seven days, but commits and promotions I keep until the end of time. Checkpoints, by the way, you don't need to keep forever; maybe you keep them for six months, or a year. I don't think you need to look back on your checkpoints from three years ago. Okay, that's Grace; that's the architecture of Grace.

I want to talk a little philosophy and a little programming language; I'm going to talk about why F# is not that radical. Programming languages sit on a spectrum from imperative to declarative. At one end there's the hardware: tell the hardware exactly what to do, step by step. (I like assembly language; it's really interesting.) At the other end there's the mathematical, category-theoretic, more declarative side. Here's a smattering of languages and roughly where they fit on this line; you might agree or disagree with where any particular one sits, but there's this idea that programming languages can be translated in either direction. In fact, we do that: when you move from the right side of this toward the left, it's called compiling, and when you move the other way, it's called decompiling. All I can say is that from age 11 to age 45 I spent my entire life on the left side of this, and I finally got curious about the mathematical side.

So why F#? Well, C# compiles to IL; so does F#. It's the same .NET runtime, the same everything that you love about C#; it just happens to be there, and then it gets compiled to assembly and runs just as fast. So I'm not an ivory-tower PhD; I'm really not, and that's not the statement I'm trying to make by using F#. I'm trying to say that it makes my code better. All I'm doing is moving a little toward the declarative side, and being over here gives me access to some cool stuff while I still have all the stuff over there. I'm not going over to Haskell (which is cool if you do, but that's not what I'm doing). So why F#? Don Syme, the creator of F#, describes it as succinct, robust, and performant, and .NET is
unbelievably fast. That's why. It's functional-first, but objects are part of the language, and especially if you're interfacing with NuGet, that matters: everything in NuGet is C#, all based on classes and methods and inheriting things, and F# has all that. It's also got a great community. Any F# people here? Yay.

And here's my philosophical statement: I think the industry has hit a ceiling on quality and comprehensibility with object-oriented-ish code, and I mean C#, Java, Ruby, TypeScript. For a small code base it's fine; a medium code base, fine; but when you get to a large code base with object orientation, if you don't make a serious investment in keeping that code clean, you're going to run into major problems. Functional code lasts longer. It really does; it stays clean longer, assuming you don't do stupid things.

I want to tell you my story very quickly: I wrote three versions of Grace, as my learning curve. The first version I wrote as I was getting deeper into F# and functional thinking, and I ended up accidentally writing C# in F#, because I hadn't yet made the mental leap to functional thinking and composability. That didn't smell good; it wasn't what I was hoping to get out of it, so I threw that version away. Then I said, all right, I'm going to relearn my category theory and I'm going to monad all the things. That didn't work either; it made some really awkward code. But I learned things from it, and I threw that version away too. The version you're seeing is the third version, which is more balanced and more practical: I use objects, I use classes, but I use them very functionally; I think about them functionally.

So why F#? This is why. My field report, as someone who spent his whole life on the object-oriented and assembly side: thinking functionally gives you better, safer, more beautiful code. It really, really does. It's worth the journey to learn; it's going to be painful, it's going to take a little while, but it's so worth doing, and I'm very happy with the code.

Are there challenges with F#? Sure. Serializing and deserializing records is painful: if you add a field to a record, now you can't deserialize anything you ever saved before. So I'm thinking I'll probably switch those to objects, in other words classes, but treat them functionally: I'm not going to object-orient them, I'm just going to use objects and treat them as immutable. Here's an interesting one I would never have imagined two years ago: Codex, the OpenAI model underneath GitHub Copilot, is trained on C#, not really on F#, and it turns out the suggestions I get out of Copilot aren't that great in F#. So I use few-shot prompting: I actually go to ChatGPT and use few-shot prompting to get better suggestions. And in F#, let's be honest, there are fewer samples; there's less example code. So what's the solution? Work harder, peasant. But it's not that bad; once you learn it, it doesn't take much.

I want to close with a little bit of code, so I'm going to pull up SharpLab really quickly and walk you through
a little bit of F#. I have to do this first... and then I can go to SharpLab. If you've never used SharpLab: it's a really cool site that lets you type code, compiles it in memory, and shows you the decompilation on the other side. So I have F# code on one side and C# code on the other.

Here's some very basic F# code. If you've never seen F# before, it's going to feel a little weird: wait a minute, where's the class? You have these let statements; what are they, just floating in space? What are they hanging off of? It feels weird. What I want to show you is what that compiles to. There's a namespace, there's a module, and there are these lets, which are definitions of values or functions or whatever. What does that compile to in C#? Well, here's my namespace, and that module is a static class. Static classes are awesome for composition; I'm going to talk about that in a bit. The let-bound value becomes just a property with a getter. Again, .NET is fundamentally object-oriented: the runtime knows classes and properties and methods; that's what it knows. So when F# compiles to IL, it has to compile to that; F# translates, again, from the right side of that diagram to the left side. So it translates the value into a property, and this function is just a static method. Cool.

Let's add something to that: a regular class, the kind you've seen in C#, with two properties, a time and a name. This is how you say "public class" in F#, and this is how you declare a property. If you look at what they compile to: here are your backing fields, here's your property with a getter and a setter. You've seen this in C#; it's no big deal.

Let's add a little more: a method in that class. What does it compile to? It compiles to a method returning void, the kind of method you might have on a class in your code base, where it takes in a value and changes something on the object. A very common shape in object-oriented code. Lovely, but not composable, and that's the problem; that's what I want to highlight.

Now let's add a couple of functions. We're out of the class, back in that module, which means we're in a static class. I'm going to add two functions, a change-name and a change-time, and you can see what they do: you pass in a name (or a time), you take in an instance of that class, and you return an instance of the class. That pattern, where I take in a type and output that same type, doing something to it along the way, is where you start to get composable code. (It's a monad. Don't tell anyone.) So what does that compile to in C# terms? Guess what: a static method on a static class. And that's how you get composable code. My number one tip for anyone who wants to write more composable, functional code in C# or Java or whatever: stop thinking in terms of methods on classes, and start thinking in terms of static classes with static methods that you use to manipulate your types. That will get you a long way down the road of being composable and testable.
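Here is a reconstruction along the lines of the SharpLab example; the names are approximate, not the exact code on screen. The module compiles to a static class, each let to a static member, and the two functions take and return the same type, which is what makes them chainable:

```fsharp
module MyModule =

    /// A plain class, as you'd write it in C#: two mutable auto-properties.
    type MyClass() =
        member val Name = "" with get, set
        member val Time = System.DateTime.MinValue with get, set

    // "Type in, same type out": update the instance, then hand it back,
    // so the next function in the chain can pick it up.
    let changeName (name: string) (c: MyClass) = c.Name <- name; c
    let changeTime (time: System.DateTime) (c: MyClass) = c.Time <- time; c

    /// Composition: thread one instance through both functions.
    let updatedMyClass (c: MyClass) =
        c |> changeName "Scott" |> changeTime System.DateTime.UtcNow
```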
These functions are super testable; they're pure functions. If you give me the same class and the same name every time, I will give you the same exact result, over and over and over. So now that I have these pure functions, what can I do? I can compose them. I have this new function called updatedMyClass: I take in an instance of the class, give it "Scott" and the current time, and look: composition. I know this might look weird if you've never seen F# or other functional languages before, but this style of coding gets you much more repeatable, testable, beautiful, easy-to-understand code. And I've really tried hard not just to make the UX of Grace beautiful, but to make the code beautiful; I want maintainers to like it too. I haven't entirely succeeded, but I've mostly succeeded; I'm mostly happy with it.

I have a few minutes left, and I want to show you a little bit of Grace's code itself. Let's do a little bit of validation; this is kind of interesting. This is the CLI part: I want to validate things on the CLI before I even send them to the server. You can't rely on that validation alone (the server has to check too), but I just don't want to let you send stupid stuff to the server. So here's a bunch of validations, like when you have an owner ID and an owner name. By the way, if you look at the status here, I have an owner, an org, and a repository: Grace is multi-tenant from the start. That's because I'm from GitHub and I've seen some of the pain of trying to retrofit multi-tenancy, so I built some light multi-tenancy in from the start.

Here's part of my code from when I was in stage two, monad-ing all the things. I have these validation functions, which take in a parse result, the parameters from the command line, broken into a structure that understands what they are. And what do they return? A Result type, a success-or-failure type. When a validation succeeds, it returns the same exact two parameters I passed in, which means, just like I showed, it's composable. When it fails, it returns this thing called GraceError, a special error type I have. All of these functions do the same thing: they take in a parse result and parameters, and they return a Result of a parse result and parameters, or an error. So what can I do with that? Oh my God, I can do bind. Monadic bind magic. I know this is new, and if you've never seen it before it looks a little weird, but isn't it beautiful? Just looking at that code is kind of nice: all these things flow, and I gave really nice, verbose names to my validations, just like you would for a test case. If any of them fails at any point, the whole thing kicks out the error of the one that failed and stops processing. That's monadic bind. It's really nice.
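Here is a simplified sketch of that chaining with plain Result and Result.bind; Grace's real validations use its own parse-result and error types, so everything here is illustrative:

```fsharp
type GraceError = { Message: string }

// Each validation takes the parameters and, on success, returns the same
// parameters, so the next validation can run on them.
let ownerIdIsValid (args: Map<string, string>) =
    if args.ContainsKey "ownerId" then Ok args
    else Error { Message = "Owner ID is required." }

let branchNameIsValid (args: Map<string, string>) =
    if args.ContainsKey "branchName" then Ok args
    else Error { Message = "Branch name is required." }

/// Chain the validations: the first Error short-circuits everything after it.
let validate args =
    Ok args
    |> Result.bind ownerIdIsValid
    |> Result.bind branchNameIsValid
```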
ownername you might pass in I want tovalidate the org ID the repository IDthe branch ID these are actuallyparameters that get passed in by thecommand line in the background that youdidn't see but like it does it for you Iwant to check uh or you've given mevalidines or does the organization andowner and repository in the branch dothey even existI want to check all these things so whatdo I have here I have a list ofvalidations and if you actually look atall of my endpoint these are webendpointsthis is like the equivalent ofcontrollers and actions this is a webendpointand I just have this pattern over andover again where I have thesevalidations and what do thesevalidations do they take inumwhat is validation validation is thething that takes in parameters andreturns a list of results really a listof task of results but and because ofthat I can start doing so so like Icouldn't do the inputs the same as theoutput here for reasons but the outputsthe same on every single one of thesevalidations which then lets me whichgets me back into functional land whereI can apply functional code to say justspin through these and if there's aproblem stopso like I just like I I've really I'mreally proud of this one I really likethis codeumanywayum there's so much more I could show youum I want to like here's here's somecool stuff with types like I know Csharp is finally getting this but thisidea of domain nouns or I can renameit's not just a good or a string it's abranch ID and a branch name and acontainer name and a reference ID andall that kind of stuff so like when I'mlooking at the code it's very clear tome what's going onumthere's so much more I'd love to showyou so but time to wrap upas I said number one tip for c-sharpprogrammers static classes and staticmethods like just do it go home pick anypick any uh a class from any code baseyou want just for funand try to rewrite it where all themethods that mod that modify State allthe methods that mutate State pull themout write them as data classes withstatic methods and you'll love what youseesothis is my like I believe Elegancematters like correctness concisenessmaintainability and code matters morethan getting everything else that kindof performanceit really does and I get most of themillsockets don't get I don't leave manyon the floor butum anyway thank you so much again thankyou so much for coming thank you so muchfor watching this videoum I'd love your feedback I want thestars star the repo if you have notstarted the repo I want to do itno excusesum I'm in this to win this is real thisis not an experiment like I'mgit will be replaced like I'm in thisa hundred percent like I want to winthisum and you know programming is art andcraft don't ever forget that it's allabout art and craft and thank you thankyou thank you[Applause][Music]
You're listening to Software Unscripted; I'm your host, Richard Feldman. On today's episode, I'm talking with Ryan Haskell-Glatz, one of my co-workers at Vendr and author of the open source Elm projects elm-spa and Elm Land. We get into things like new-user onboarding experiences, framework churn, and dynamics between authors and users in open source communities. And now: growing programming communities.

Ryan, thanks so much for joining me.

Thanks for having me.

So, a question I've always been interested to ask people who get involved in community building, and you and I have both been involved in different eras of Elm's community building, me kind of at the very beginning and you currently leading the charge: one of the tools you've built that I'd say is pretty important for community building is this Elm Land tool. I'm curious, how did you arrive at "this is the thing I want to build to solve a problem I see in the world"?

Totally, yeah. It's exciting that it's like, oh yeah, this is a cool, essential tool. So, I don't know, this might be a crazy tangent; we'll see how it goes. When I first heard of Elm, it was through "Let's Be Mainstream," Evan's talk at Curry On Prague, I want to say in 2015. I saw it around September of 2016. I was a web developer doing a lot of Vue.js; I had just sold my company on, hey, we've got to use Vue.js, it's this great thing, everyone should use it. I used to check out Medium. I lived in the suburbs and I'd commute, so I'd pull out the Medium app to learn new things or see what was going on, and I saw this article that said, "So you want to be a functional programmer?" And I was like, I don't know... do I? Maybe?

So I read it, and it was interesting, and at the bottom it said, oh, if you like this, check out Elm. And I'm like, yeah, yeah, whatever. I went on to part
two (there were four parts to it), and I kept saying, no, no, I don't care about Elm, just give me the content. And then I got to part four, and that was it. I was like, I need more. So I finally followed the Elm link. I went to, I think it was a Facebook page, some Elm Facebook group, which shows how old it was.

Yeah, I didn't even know there was one of those.

And there was a link to "Let's Be Mainstream." That talk really connected with me, because I had never done functional programming before; I just knew the pain points of getting it wrong in JavaScript every now and then, butting my head up against stuff at work. And Evan had that rainbow logo; the Elm Land rainbow was taken from that talk. It was the happy place.

Oh yeah, I remember that. On the graph.

Exactly. He had a whole setup where he's like: hey, how do we get Haskell people to a place where it's easier to do functional programming? That's not what I'm looking for. I'm looking for how we make things more reliable for JavaScript developers. And I'm like, hey, that's me. I think that really aligned me with the vision.

So Elm Land, I guess this was back last year: I was like, I really want to return to that feeling of, this is really designed to be easy and approachable for people doing front-end development. And how do we get there? When that project started, it was: okay, I need to take a good look at how React does stuff, how Vue does stuff, how Svelte does stuff, and can I make that learning curve a little more familiar for people? That's how Elm Land started; that was the inspiration for it.

So basically, wanting to take that feeling you had back in 2015 watching Evan's talk and turn it into a tool, something that not only inspires people to try Elm but helps them actually achieve that goal: you can get to this wonderful land of reliability and such, but get up and running faster and easier.

Exactly. Yeah, that's exactly it. Because I
feel like in that space there's a lot of, I don't know if "dependency" is too strong a word, but if you're doing front-end development, there are a lot of tools where you type a few letters in your terminal and, oops, I have an application. So before I did Elm, when I was using Vue, there was Nuxt.js, and I was like, oh, I can just build an app and it'll tell me how to organize things; it seemed to streamline the process of trying something out and getting up and running.

I think this is a really important point, because I've talked to Evan about this in the past, and one of the things he really likes from a teaching perspective is to reassure people that you only need to download the Elm executable, and then you can start with a Main.elm and just go from there; that's all you need to get up and running. However, there are a lot of people for whom that's just not what they're used to. That's not the workflow that's normal to them. They're used to: no, I want a CLI that's going to generate a bunch of stuff for me. I don't want to have to start from hello world in plain text, no style sheets, building everything up from scratch. Some people like that, but there's always been a bit of a mismatch between the only early-Elm experience that was available and what a lot of developers, especially JavaScript developers, are used to and looking for as an on-ramp. And what I love about your approach is that you still have the same destination: you still end up getting the Elm experience. It's not like you took Elm and made it into something else. It's more like: if this is the onboarding experience you're used to and looking for, here it is. It's the same experience, but you're going to get to a much nicer destination once you go through it.

Yeah, it's funny. It reminds me of the talk Evan gave, I think on storytelling, where he compared radio and television. At one point he was talking about how television takes away the work of having to visualize what you're hearing. And I feel like Elm Land just gives you an app and takes away the work of, oh, I could do this, that means I could build GitHub or something. It's like: no, just show them GitHub.
Now you don't have to do that extra step. But yeah, totally; I think that's a great summary of it: making it a familiar experience, what they might be used to.

What was the point at which you were like, "this is the missing piece"? Often I've found that when I decide to invest as much time into a tool as I'm sure you have into Elm Land, there's a moment, kind of a trigger, where I think: okay, I want this to exist badly enough that I'm going to put a bunch of hours into it. Did you have a moment like that? Or was it more an accumulation of hearing stories from people, like, where's X-but-for-Elm, where's Nuxt for Elm, something like that?

Yeah, I think this goes back to before Elm Land, when I was working on elm-spa, which was more focused on the routing and scaffolding side of things. And then I, maybe "shelled out" isn't the right word, but I kind of shelled out the rest of the app-building experience: go look up how to use elm-ui, go look up how to do that. It was out of scope.

Yeah, not my wheelhouse, man.

But I feel like, for a long time, and I guess to this day, the Elm community is a bunch of independent contributors who build really focused libraries, designed well enough that they can plug in together. But the thing I would see a lot in the Elm Slack, people asking, is: how do I build an app? And that means: how do I glue all this stuff together? Even in the elm-spa users channel in the Slack, they'd be like, hey, can I use elm-ui with this? The question was "can I use it," not even "how"; is this even viable at a base level? Does it work?

Yeah.

I feel like there was a moment where I thought: I just need to answer the high-level question of how do I do it. How do I make something real, how do I build something? Because there are not a lot of tools, at least not a lot in comparison to npm, but there are a lot of separate projects, and it's not clear to a newcomer how they're all supposed to work together.
Right. But it wasn't just an answer in the sense of an FAQ entry, where it's like, oh, the answer is yes, you could use elm-ui with that, go forth and do it. It's more like: here's an answer to the whole set of questions, and here's a starting point. It reminds me a little of, and I'm sure this was on your radar, Create React App. There seemed to be a similar motivation story there: a bunch of people saying, well, React is only designed to be responsible for rendering, but there's all this other stuff that goes into making an app; how do I glue all these pieces together? And Create React App was an answer to that. And then you could eject, and whatever else. I haven't really kept up with the React world; I don't know if that's still a thing people do, but I know that was the original motivation, or at least that's what I heard.

Totally, yeah. I saw a tweet recently, someone saying, like, junior developers asking "what's a CRA?" People these days don't know what it is anymore.

So has it fallen out of favor? I don't even know.

I don't know. I haven't really been in the React ecosystem too much, but I think people are aware of it and still reference it as a thing. I mean, the JavaScript ecosystem has blown up. There are all these new frameworks: there's Solid, there's Astro, there's Qwik. There are all these new things that kind of look like React but work a little differently.

It's good to know that some things never change. For a minute there, it felt like there was going to be a consensus in the JavaScript ecosystem. I guess that didn't last long.

No, yeah. It's actually really cool to see what's going on now, because the frameworks are getting more nuanced. I've heard you talk about this on the podcast before: things try to convince you they're good for everything, and you kind of find out later that's not really true. If you look at the projects coming out now, I've noticed two big changes since 2015 or 2016, when I was really in the space. The framework wars seem to have died down, or at least the strategy has changed,
where everyone's polite online. Maybe they're secretly at war. But everyone seems very reasonable and honest about trade-offs. Ryan Carniato (I think that's how you pronounce his last name), the author of SolidJS, is like: it's really fast, here are the benchmarks, it's for this. And the people making Astro are like: hey, this is great for making websites. I feel like there's more nuance now, which it feels like there always should have been. For example, Elm Land: don't build a website with it that needs SSR. It doesn't do SSR, it doesn't do SEO; that's not the focus. If you're trying to build an app behind a login, if you're working at, say, Vendr (if you've ever heard of that company), your app's behind a login. You don't need server-side rendering on every page; you just want a really nice, reliable experience for the customer, and you want to be able to add features quickly. That's what Elm Land supports. It's not for everything, but the web platform is so expansive that it can be a blurry line between those things sometimes, and I feel like there's a lot more nuance these days, which is just great to see.

Yeah, that framework-wars comment takes me back. There was a conference, more than five years ago now, I think, called Framework Summit, and the theme was: let's get all the JavaScript frameworks together, give people presentations about them, and let people understand which one is for them and which one isn't. They also had this creators day, the day before the presentations: me representing Elm, Tom Dale from Ember, Andrew from React, and some people whose names I'm forgetting from Vue.js and AngularJS. We all got together and talked about stuff that affected all of us, which was a pretty interesting discussion. There was this spirit of: okay, we're not really competing, we're just hanging out and having fun; we're all doing our own thing with our own trade-offs, with some commonalities and some differences. And so the next day, the organizers had said: okay,
when everybody presents, you get, I forget how long it was, something like a 15-to-20-minute pitch for the latest version of your thing. So I was talking about the latest in Elm. And the organizers were like: please don't hate on the other frameworks; if you have to make comparisons, be respectful. Everybody pretty much took this to heart, except that at the very beginning of his talk, Tom Dale from Ember stands up and says, "All right, I'd like to welcome everybody to the Comedy Central roast of JavaScript frameworks," and then proceeds to just roast all the other frameworks.

Oh my gosh. He started it off.

Yeah. I don't think his was the first presentation, but that was how he started off his comparison of Ember to them. Now, what's funny in retrospect is the dig he had on Elm. He said: Elm is here, really glad to see Elm represented; it's nice to see Elm here, because it makes Ember look popular by comparison. Which maybe at the time was true, but I actually don't think that's true anymore. I think it's probably the other way around: I think Elm has almost certainly, at this point, eclipsed Ember in terms of present-day use.

Interesting.

Could be wrong; I have no idea. Based on things like State of JS surveys, and I don't know how indicative those are. On the one hand, maybe there are a lot of people using Ember apps who have been using them so long that they don't bother responding to State of JS, because they're not interested in the latest, most cutting-edge stuff. But then again, I also know a lot of Elm people just don't care about JS anymore; they've moved on. So who knows what percentage of Elm developers respond to State of JS. There are a lot of factors there, but it's interesting.

I'm one of the crazy ones: every State of JS, I'm like, get in there, you've got to put Elm in there. It's funny, if you look at the last State of JS, it was like most of the write-ins were Elm people just saying, please include me on these lists. But it's fun.

I'm in the same boat, in that I used to look at State of JS before I got into Elm,
14:06.000] And then like, since I got into Elm, like, yeah, I, I still, I just like, kind[14:06.000 --> 14:09.160] of always want to make sure it's like, yeah, you know, like just so you know,[14:10.080 --> 14:12.640] I'm not using JS anymore, but, but FYI Elm.[14:12.960 --> 14:13.200] Yeah.[14:13.200 --> 14:14.880] And I'm sure some number of people do that.[14:14.880 --> 14:18.280] Like I always see on Elm Slack, somebody posts a link to State of JS every year.[14:18.280 --> 14:22.000] It's like, hey, you know, don't forget, the JavaScript people don't know we exist[14:22.000 --> 14:27.680] unless we tell them what we do, but it's weird because that's, it feels to me like State[14:27.680 --> 14:31.200] of JS, for a lot of Elm programmers who are, who are like using it professionally,[14:31.440 --> 14:35.040] that's their main interaction with JavaScript anymore, or the world of JavaScript.[14:35.040 --> 14:36.160] And maybe you do some like interop.[14:36.200 --> 14:39.640] And so that's how you like interact with JavaScript code, but it's like the JS[14:39.640 --> 14:42.920] community and all the different frameworks and the SolidJS and, you know, whatever the latest thing is, it's[00:44.560 --> 00:48.240] like, I hear about those things, you know, but it's, it's almost even like,[00:48.640 --> 00:52.480] as if I'm not plugged into the front end world at all, because so much of the[00:52.480 --> 00:56.240] front end world is just like JavaScript, you know, I don't want to say drama,[00:56.240 --> 01:02.120] but like, you know, JavaScript framework churn. There's, there's always so much[01:02.120 --> 01:06.160] like new stuff that seems like it's some tweak on the last thing.[01:06.160 --> 01:08.720] Whereas in the Elm community, I don't really get that sense.[01:08.720 --> 01:12.880] It seems like it's, it's much more common that you'll have an existing[01:12.880 --> 01:14.560] thing that continues to evolve.[01:14.840 --> 01:19.200] Like for example, Elm CSS, which like I started out and worked on for many years[01:19.200 --> 01:22.840] and kind of, I've not had time anymore because all of my, well, first of all,[01:23.200 --> 01:26.800] back when I had free time, before I had a kid, all of that time was going into[01:26.800 --> 01:30.560] Roc and so I've just like, all of my non-Roc things just
kind of slid to[01:30.560 --> 01:32.920] the back burner by default, it's funny.[01:32.920 --> 01:36.240] I was, I was catching up with Evan, this was last year, two years ago, whatever.[01:36.240 --> 01:38.680] I, at some point I was in Copenhagen for a conference.[01:38.680 --> 01:41.720] So I hung out with him and Teresa and we caught up about various things.[01:41.720 --> 01:45.560] And I was commenting on how like, I don't really like have time to maintain a lot[01:45.560 --> 01:48.960] of my Elm projects anymore, because I just, every weekend, I'm like this[01:48.960 --> 01:51.160] weekend, I'm going to like go through PRs on this thing.[01:51.520 --> 01:54.200] And then by the end of the weekend, I would have done a bunch of Roc stuff.[01:54.240 --> 01:56.480] And I was like, and I still had more Roc stuff to do, but I didn't even[01:56.480 --> 01:57.600] get to any of the Elm stuff.[01:57.960 --> 01:58.200] All right.[01:58.200 --> 01:59.120] Next weekend, next weekend.[01:59.360 --> 02:00.520] And then that would just keep happening.[02:01.280 --> 02:02.520] So, I was joking to Evan.[02:02.520 --> 02:04.920] I was like, yeah, it turns out like making a programming language,[02:05.160 --> 02:06.560] it's really, really time consuming.[02:06.880 --> 02:07.640] He thought that was funny.[02:07.640 --> 02:08.200] He just laughed.[02:08.200 --> 02:13.280] It's like, yeah, it turns out, I'm sure that's a universal thing.[02:13.680 --> 02:16.680] I mean, I guess like if you're making a toy language, that's like just for you[02:16.680 --> 02:18.640] and like just a hobby thing, then that's, that's one thing.[02:18.680 --> 02:21.080] But if you're like trying to make something that other people are actually[02:21.080 --> 02:24.520] going to use like professionally, it's like kind of a, yeah, there's a lot there.[02:24.920 --> 02:28.640] But I was thinking about this in the context of this sort of like framework[02:28.640 --> 02:32.000] churn in the JavaScript ecosystem, but you'd never use the word churn to describe[02:32.000 --> 02:33.920] what happens in the Elm package ecosystem.[02:33.960 --> 02:37.720] And like in the Elm CSS case, it's like, okay, I'm not working on that actively[02:37.720 --> 02:42.000] anymore, but there's, there's a longtime contributor who had been working on[02:42.000 --> 02:46.480] this sort of like big performance oriented under the hood rewrite that I'd gotten[02:46.520 --> 02:48.400] started and never got all the way through.[02:48.560 --> 02:52.240] He just was like, hey, is it cool if I like fork this and like continue that work?[02:52.240 --> 02:53.400] And I was like, yes, please do that.[02:53.440 --> 02:56.960] That's awesome because it's not like you're redoing the whole thing.[02:56.960 --> 02:58.080] Like fingers crossed.[02:58.080 --> 03:01.920] I would love to see him finish that, and publish it, because if he can[03:01.920 --> 03:05.680] actually make it across the finish line, it should feel like using the most[03:05.680 --> 03:09.840] recent release of the Elm CSS that I built up, but it should run way faster.[03:10.200 --> 03:12.000] Which in my mind is, is just like, awesome.[03:12.000 --> 03:15.240] If you can get something where it's like, this is already the experience that[03:15.240 --> 03:17.880] people want and are happy with, but it runs way faster.[03:17.880 --> 03:20.080] That's an amazing way to like evolve something.[03:20.640 --> 03:24.520] Whereas the like, well, we redid everything from scratch, but it was, you[03:24.520 -->
03:27.000] know, you use that description of like, it's kind of like React, but with a[03:27.000 --> 03:28.520] twist or like a little bit different.[03:28.920 --> 03:32.000] I'm really glad we don't, we don't see that happening in the Elm community.[03:32.120 --> 03:32.640] Yeah.[03:32.680 --> 03:35.800] I feel like every now and then on the Elm Slack there'll be, there'll[03:35.800 --> 03:39.160] be something new, something new will come out in the React space and I'll see someone like,[03:39.160 --> 03:43.400] oh, like how, how can we do like hooks in Elm, or like, how do we, Svelte[03:43.400 --> 03:46.880] doesn't do virtual DOM, like how do we do Elm without virtual DOM?[03:46.880 --> 03:51.000] And like, I see posts like that, but yeah, I don't think they get too much traction,[03:51.000 --> 03:54.840] but I feel like it kind of, I think there's a, just a general anxiety.[03:55.320 --> 04:00.520] Just like, if we're not doing the latest thing, like is, is, are we dead or something?[04:00.520 --> 04:02.680] You know, there's like that kind of energy to it.[04:02.840 --> 04:06.520] So I'm glad you brought that up because I, the way that I've seen those discussions[04:06.520 --> 04:10.280] typically go on Elm Slack is someone will post that and then two or three people[04:10.280 --> 04:12.080] will respond like, no, everything's cool.[04:12.360 --> 04:12.680] Yeah.[04:12.760 --> 04:14.120] Do we need to create a problem here?[04:15.040 --> 04:15.760] Like we're good.[04:15.800 --> 04:18.480] Like what's, what's the actual problem we're trying to solve here?[04:18.480 --> 04:19.400] Is it just FOMO?[04:19.480 --> 04:22.280] Like what's the user experience problem that we have here?[04:22.280 --> 04:23.840] And then like, let's figure out a solution to that.[04:24.040 --> 04:25.640] Is there a user experience problem here?[04:25.640 --> 04:28.280] Or is this just like, someone else is doing X.[04:28.280 --> 04:29.360] Shouldn't we be doing X?[04:29.360 --> 04:29.880] It's like, no.[04:29.880 --> 04:33.960] And I think that's, and maybe I'm being dismissive here, but it feels like a[04:33.960 --> 04:37.960] cultural carryover from JavaScript because that's totally a cultural norm in[04:37.960 --> 04:41.800] the JavaScript community, is just like, oh man, like X came out, like,[04:42.240 --> 04:43.400] shouldn't we be doing X?[04:43.480 --> 04:46.880] And, and there's just like, kind of this, like this constant magnetic pull[04:46.880 --> 04:48.280] towards the latest shiny thing.[04:48.640 --> 04:51.160] And there's almost like a, I mean, at least among people who've been around[04:51.160 --> 04:54.280] the community long enough, like in Elm, it seems like there's a, there's a, an[04:54.280 --> 04:58.440] instinctive resistance to that, where it's like, anytime, like the XY problem[04:58.440 --> 05:02.240] is the classic example of this, where it's like, and people are always citing that[05:02.240 --> 05:05.160] and linking, I think it's like xyproblem.info or something is the[05:05.160 --> 05:05.480] link.[05:05.920 --> 05:06.760] Yeah, something like that.[05:06.800 --> 05:07.040] Yeah.[05:07.280 --> 05:08.960] That's where I learned about XY problems.[05:09.000 --> 05:11.120] I think Elm Slack educated me.[05:11.800 --> 05:12.800] Yeah, it's Elm.[05:12.800 --> 05:13.440] Yeah, me too.[05:13.840 --> 05:17.080] I, I'd never heard of it before Elm, but yeah, it's like this, for those who[05:17.080 --> 05:20.840] aren't familiar, it's, it's this idea of like, you know, you
say like hooks,[05:20.840 --> 05:21.840] let's use that as an example.[05:21.840 --> 05:24.800] You come in saying like, hey, you know, how does Elm do hooks?[05:24.880 --> 05:28.200] And you say, well, hang on, let's, let's take a step back and ask like, what's[05:28.200 --> 05:29.160] the real problem here?[05:29.160 --> 05:30.480] Like, what's the direct problem?[05:30.480 --> 05:34.520] Like we're starting to work on a solution and we have a question about the solution,[05:34.560 --> 05:38.080] but let's step, let's step all the way back and see like, what's the immediate[05:38.080 --> 05:38.440] problem?[05:38.440 --> 05:40.360] What's the pain point that we're trying to solve?[05:40.640 --> 05:44.040] And then we can talk about solutions kind of from scratch and maybe we'll end up[05:44.040 --> 05:48.240] going down the same road that this solution is like presupposing, but maybe[05:48.240 --> 05:51.480] not, maybe it'll turn out that there's actually a better category of solution[05:51.480 --> 05:51.720] here.[05:53.440 --> 05:58.040] And hooks are an interesting example because I remember when, when Hooks and[05:58.040 --> 06:02.080] Suspense were announced, which I think might've been the same talk or it might[06:02.080 --> 06:02.840] have been different talks.[06:02.880 --> 06:06.520] I don't remember, but I remember hearing about them.[06:06.520 --> 06:10.840] And I was at that point like very into Elm and really had not been[06:10.840 --> 06:14.360] spending any time with React in a while, like months or years.[06:15.240 --> 06:19.440] And I remember hearing it and I was like, I don't understand what problem this is[06:19.440 --> 06:19.840] solving if you're not like Facebook.[06:19.840 --> 06:23.680] If you're like literally Facebook and you have like[06:23.680 --> 06:27.200] a gazillion different widgets on the screen and, and, you know, they're all like[06:27.200 --> 06:28.680] customizable in different ways.[06:28.680 --> 06:32.040] And some of them need to be like real time, like chat, but then others don't[06:32.040 --> 06:33.240] need to be, like the newsfeed.[06:33.240 --> 06:37.440] And I was like, okay, if you're literally Facebook, I can see how this might be[06:37.440 --> 06:42.200] solving a practical problem, but if you're not literally Facebook, and there's[06:42.200 --> 06:47.480] like 99.9% of, you know, that's a huge underestimate, basically everyone else,[06:47.760 --> 06:51.120] like, like what, why, why are people excited about this?[06:51.160 --> 06:55.720] And it felt to me like an XY problem example where it's like, you know, yeah,[06:56.000 --> 06:58.800] you can see getting excited about it, you know, for the sake of, oh, it's a new[06:58.800 --> 07:01.360] shiny thing, at least conceptually.
(upbeat music)

- Hi there, I'm Muir Manders, a software developer on the source control team at Meta.

(upbeat music)

First, I think it's useful to start with a definition of source control. Source control is the practice of tracking changes to source code. Fundamentally, source control helps software developers read, write, and maintain source code. Another way to think about it is source control helps developers collaborate by sending and receiving changes with other developers and by tracking different branches of work. Source control also provides critical metadata to developers so that they can understand when and why source code has changed.

Now, looking at the current landscape of source control, I think it's safe to say that it's dominated by Git. Git is popular for a reason, it does a lot of things right. But when the Sapling project first started, Git didn't quite meet our scalability needs.

(upbeat music)

As I mentioned before, initially scalability was the primary focus of the Sapling project. To keep up with the pace of code growth over the years, we've redesigned many aspects of our source control system. One of the key elements to our scalability is lazy fetching. By lazy, I mean that Sapling doesn't fetch data until it's needed. For example, file history and the commit graph are both lazy, and more than just the repo data behind the scenes, your working copy is lazy as well. We use a virtualized file system to defer fetching of file contents until it's accessed. Together, this means you can clone a repo with tens of millions of files in a matter of seconds. It also means that giant repo can fit on your laptop. There is a catch with the laziness: you must be online to perform many source control operations. This trade-off is worth it for us, but it may not be worth it for smaller repos.

Beyond scalability, we've focused a lot on the user experience. We aim to hide unnecessary complexity, while providing a rich set of tools right out of the box. A good example to start with is undo. Just like in any software, when you make a mistake, or you just change your mind, you wanna undo your changes. In Sapling, undoing most operations is as easy as "sl undo". Undo demonstrates how Sapling has developed first-class integrated concepts that improve the developer experience.

In the same vein as undo, but perhaps even more core to Sapling, is the concept of stacked commits. A commit stack is a sequence of local commits, similar on the surface to a Git branch. Commit stacks differ from Git branches in two main ways. First, a Git branch is essentially a name that points to a commit. With a Sapling stack, there is no indirection, the stack is the set of commits. What does that mean? For one, you don't even have to give your stack a name if you don't want, and if you check out a commit in the middle of the stack, you're still on your stack, and you can use normal commands to amend that commit. Another difference between Sapling stacks and Git branches is that stack commits don't have to be merged all or nothing. As early commits in your stack are being code reviewed and merged, you can continue pushing more commits to that same line of work. Similarly, if you push a large stack of commits, you can incrementally merge the early commits while you continue to iterate on the later commits.

(upbeat music)

In November 2022, we released the Sapling client, which is compatible with Git repos. To try it out, go to sapling-scm.com and follow the instructions to install the Sapling client and clone an existing Git repo. There's a couple other cool things I want to mention. On the website, under the add-ons section, you can see that Sapling comes with a fully featured GUI called the Interactive Smartlog, it's really a game changer. It's a high level UI that hides unnecessary details while still giving the user powerful tools like drag and drop rebase. Also, we've released a proof of concept code review website called ReviewStack that's designed for the stacked commit workflow.

Finally, I'd like to note that the Sapling client is just one piece of Meta's source control system. In the future, we hope to release our virtualized file system and our server implementation. Together, these three integrated components really take source control scalability and developer experience to the next level. If you wanna learn more about Sapling, please visit sapling-scm.com. If you're interested in getting involved directly, please check out our GitHub project page, we welcome contributions.

(upbeat music)
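As a minimal sketch of the undo and stack workflows described in this talk (the repository URL is just an example, the commit hash is a placeholder, and exact flags may vary between Sapling releases):

$ sl clone https://github.com/facebook/sapling   # the Sapling client can clone existing Git repos
$ cd sapling
$ sl                    # smartlog: your local commits at a glance
$ sl goto abc123def     # check out a commit in the middle of your stack (hypothetical hash)
$ sl amend              # amend it; the commits above it are restacked
$ sl undo               # changed your mind? restore the previous state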
# Beyond Git: The other version control systems developers use

Our developer survey found 93% of developers use Git. But what are the other 7% using?

At my first job out of college (pre-Y2K), I got my first taste of version control systems. We used Microsoft's Visual SourceSafe (VSS), which had a repository of all the files needed for a release, which was then burned onto a disk and sent to people through the mail. If you wanted to work on one of those files, you had to check it out from the repo—literally, like a library book. That file would be locked until you checked it back in; no one else could edit it. In essence, VSS was a shield on top of a shared file folder.

Microsoft discontinued VSS in 2005, coincidentally the same year as the first release of Git. While technology has shifted and improved quite a bit since then, Git has come out as the dominant choice for version control systems. This year, we asked what version control systems people used, and Git came out as the clear overall winner.

But it's not quite a blowout; there are two other systems on the list: SVN (Apache Subversion) and Mercurial. There was a time when both of these were prominent in the market, but not everyone remembers those days. Stack Overflow engineering has used both of these in the past, though we now use Git like almost everybody else.

This article will look at what those version control systems are and why they still have a hold on some engineering teams.

Apache Subversion

Subversion (SVN) is an open-source version control system that maintains source code in a central server; anyone looking to change code accesses these files from clients. This client-server model is an older style, compared to the distributed model Git uses, where changes can be stored locally and then distributed to the central history (and other branches) when pushed to an upstream repository. In fact, SVN builds on historical version control—it was initially intended to be a mostly compatible successor to CVS (Concurrent Versions System), which is itself a front end and expansion to Revision Control System (RCS), initially released way back in 1982.

This earlier generation of version control worked great for the way software was built ten to fifteen-plus years ago. A piece of software would be built as a central repository, with any and all feature additions merged into a trunk. Branches were rare and eventually absorbed into the mainline. Important files, particularly large binaries, could be "locked" to prevent other developers from changing them while you worked on them. And everything existed as directories—files, branches, tags, etc. This model worked great for a centrally located team that eventually shipped a release, whether as a disc or a download.

SVN is a free, open-source version of this model. One of the paid client-server version control systems, Perforce (more on this below), had some traction at enterprise-scale companies, notably Google, but for those unwilling to pay the price for it, SVN was a good option. Plenty of smaller companies (including us at the beginning) used centralized version control to manage their code, and I'm sure plenty of folks still do, whether out of habit or preference.

But the way that engineering organizations work has changed pretty drastically in the last dozen years. There is no longer a central dev team working on a single codebase; you have multiple independent teams each responsible for one or more services.
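To make the centralized model concrete, here is a minimal sketch of the classic SVN loop (the server URL and file paths are hypothetical):

$ svn checkout https://svn.example.com/repos/project/trunk project   # get the latest code from the central server
$ cd project
$ svn lock art/title-screen.png -m "editing the title screen"        # lock a binary so no one else can change it
$ svn commit -m "Update title screen"                                # goes straight to the central server; requires being online
$ svn update                                                         # pull down everyone else's changes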
Stack Overflow user VonC has made himself a bit of a version control expert and has guided plenty of companies away from SVN. He sees it as a technology built for a less agile way of working. "It does get in the way, in terms of management, repository creation, registration, and the general development workflow. As opposed to a distributed model, which is much more agile in those aspects. I suspect the recent developments with remote working will not help those closed environment systems."

The other reason that SVN grew less used was that Git showed how things could be better. Quentin Headen, Senior Software Engineer here at Stack Overflow, used SVN early in his career. "In my opinion, the two biggest drawbacks of SVN are that first, it is centralized, which requires the SVN server to be up for you to commit changes. If your internet is down, you can't commit at all. Second, the branching is very heavy. Once a branch is created, you can't delete it (if I remember correctly). I think there is a command to remove it, but it stays in history regardless. Git branches are cheap and can be deleted easily if need be."

Clearly, SVN lost prominence when the new generation of version control arrived. But Git wasn't the only member of that generation.

Mercurial

Mercurial first arrived the same year as Git—2005—and the two became the primary players of the distributed version control generation. Early on, many people wondered what differences, if any, the two systems had. When Stack Overflow moved away from SVN, Mercurial won out mostly because we had easy access to hosting through Fog Creek Software (now Glitch), another of our co-founder Joel Spolsky's companies. Eventually, we too gave in to Git.

Initially, Mercurial seemed to be the natural fit for developers coming from earlier VC systems. VonC notes, "It's the story of VHS versus Betamax."

I reached out to Raphaël Gomès and Pierre-Yves David, both Mercurial core developers, about where Mercurial fits into the VC landscape. They said that plenty of large companies still use Mercurial in one form or another, including Mozilla, Facebook (though they may have moved to a Mercurial fork ported to Rust called Eden), Google (though as part of a custom VC codebase called Piper), Nokia, and Jane Street. "One of the main advantages of Mercurial these days is its ability to scale on a very large project (millions of commits, millions of files). Over the years, companies have contributed performance improvements and dedicated features that make Mercurial a viable option for extreme scale monorepos."

Ry4an Brase, who works at Google and uses their VC, expanded on why: "git is wed to the file system. Even GitHub accesses repositories as files on disk. The concurrency requirements of very large user bases on a single repo scale past filesystem access, and both Google and Facebook found Mercurial could be adapted to a database-like datastore and git could not." However, with the recent release of Git v2.38 and Scalar, that advantage may be lessened.

But another reason that Mercurial may stay at these companies with massive monorepos is that it's portable and extensible. It's written in Python, which means it doesn't need to be compiled to native code, and therefore it can be a viable VC option on any OS with a Python interpreter. It also has a robust extension system.
“The extension system allows modifying any and all aspects of Mercurial and is usually greatly appreciated in corporate contexts to customize behavior or to connect to existing systems,” said Gomès and David.

Mercurial still has some big fans. Personally, I had never heard of it until some very enthusiastic Mercurialists commented on an article of ours, A look under the hood: how branches work in Git.

babaloomer: Branches in mercurial are so simple and efficient! You never struggle to find the origin of a branch. Each commit has the name of its branch embedded in it, you can't get lost! I don't know how many times I had to drill down git history just to find the origin of a branch.

Scott: Mercurial did this much more intuitively than Git. You can tell the system is flawed when the standard practice in many workflows is to use "push -f" to force things. As with any tool, if you have to force it something is wrong.

Of course, different developers have different takes on this. Brase doesn't think that Mercurial's branching is necessarily better. "Mercurial has four ways to do branches," he said, "and the one that was exactly like git's was called 'bookmarks', which the core developers were slow to support. What Mercurial called branches have no equivalent in git (every commit is on one and only one branch and it's part of the commit info and revision hash), but no one wanted that kind." Well, maybe not no one.

Mercurial is still an active project, as Gomès and David attest. They contribute to the code, manage the release cycles, and hold yearly conferences. While not the leading tool, it still has a place.

Other version control systems

In talking to people about version control, I found a few other interesting use cases, primarily around paid version control products.

Remember when I said I'd have more on Perforce? It turns out that several people mentioned it even though it didn't even register on our survey. Perforce has a strong presence in the video game industry—some even consider it the standard there. Rob Oates, an industry veteran who is currently the senior director of technology and partnerships at Exploding Kittens, said, "Perforce still sees use in the game industry because video game projects (by variety, count, and total volume of assets) are almost entirely not code."

He gave four requirements that any version control system would need to fulfill in order to work for video game development:

- Must be useable by laypersons - Artists and designers will be working in this system day-to-day.
- Must lock certain files/types on checkout - Many of our files cannot be conceptually or technically merged.
- Must be architected to handle many large files as the primary use case - Many of our files will be dozens or hundreds of megabytes.
- Must avoid degenerate case with delta compression schemes - Many of our large files change entirely between revisions.

Perforce, because of its centralized server and file locking mechanism, fits perfectly. So why not separate the presentation layer from the simulation logic and store the big binary assets in one place and the code in a distributed system that excels at merging changes? The code in video games often depends on the assets. "For example, it would not be unusual for a game's combat system to depend on the driving code, the animations, the models, and the tuning data," said Oates. "Or a pathfinding system may depend on a navigation mesh generated from the level art.
Keeping these concerns in one repo is faster and less confusing when a team of programmers, artists, and designers are working to rapidly iterate on the 'feel' of the combat system."

The engineers at these companies often prefer Git. When they have projects that don't have artists and designers, they can git what they want. "Game engines and middleware have an easier time living on distributed version control as their contributors are mostly, if not entirely, engineers," said Oates. Unfortunately for the devs on video games, most projects have a lot of people creating non-code assets.

Another one mentioned was Team Foundation Version Control (TFVC). This was a Microsoft product originally included in Team Foundation Server and still supported in Azure DevOps. It's considered the spiritual successor to VSS and is another central-server-style VC system. Art Gola, a solutions architect with Federated Hermes, told me about it. "It was great for its time. It had an API, was supported on Linux (Team Explorer Everywhere), and tons of people using it that no one ever heard from since they were enterprise devs."

But Gola's team is actively trying to move their code out of the TFVC systems they have, and he suspects that a lot of other enterprise shops are too. Compared to the agility Git provides, TFVC felt clunky. "It requires you to have a connection to the central server. Later versions allow you to work offline, but you only had the latest version of the code, unlike git. There is no built-in pull request type of process. Branching was a pain."

One could assume that now that the age of centralized version control is waning and distributed version control is ascendant, there is no innovation in the VC space. But you'd be mistaken. "There are a lot of cool experiments in the VCS space," said Patrick Thomson, a GitHub engineer who compared Git and Mercurial in 2008. "Pijul and the theory of patch algebra, especially—but Git, being the most performant DVCS, is the only one I use in industry. I work on very large codebases."

Why did Git win?

After seeing what the version control landscape looks like in 2022, it may be obvious why distributed version control won out as the VC of choice for software developers. But it may not be immediately obvious why Git has such a commanding share of the market over Mercurial. Both of them first came out around the same time and have similar features, though certainly not one to one. Certainly, many people prefer Mercurial. "For personal projects, I pick Mercurial. If I was starting another company, I'd use Git to avoid having to retrain and argue with new hires," said Brase.

In fact, Mercurial should have had an advantage because it was familiar to SVN users and the centralized way of doing things. "Mercurial was certainly the most easy to use and more familiar to use because it was a bit like using Subversion, but in a distributed fashion," said VonC. But that fealty to the old ways may have hurt it as well. "That is also one aspect which was ultimately against Mercurial, because just having the vision of using an old tool in a distributed fashion was not necessarily the best fit to develop in a decentralized way."

The short answer why it won comes down to a strong platform and built-in user base. "Mercurial lost the popularity battle in the early 2010s to Git.
It's something we attribute in large part to the soaring rise of GitHub at that time, and to the natural endorsement of Git by the Linux community," said Gomès and David.

Mercurial may have started out in a better position, but it may have lost ground over time. "Mercurial's original fit was a curated, coherent user experience with a built-in web UI," said Brase. "GitHub gave git the good web UI, and 'coherent' couldn't beat the feature avalanche from Git contributors and the star power of its founder."

That feature avalanche and focus on user needs may have been a hidden factor in pushing adoption. Thomson, in his comparison nearly fifteen years ago, likened Git to MacGyver and Mercurial to James Bond. Git let you scrape together a bespoke solution to nearly every problem if you were a command-line wizard, while Mercurial—if given the right job—could be fast and efficient. So where does Thomson stand now? "My main objection to Git—the UI—has improved over the years (I now use an Emacs-based Git frontend, which is terrific), whereas Mercurial's primary drawback, its slow speed on large repositories, is still, as far as I can tell, an extant problem."

Like MacGyver, Git has been improvising and adapting to fit whatever challenges come its way. Like James Bond, Mercurial has its way of doing things. It works great for some situations, but it has a distinct point of view. "My favorite example of a difference in how git and Mercurial approach new features is the `config` command," said Brase. "Both `git config` and `hg config` are commands to edit settings such as the user's email address. The `git config` command modifies `~/.gitconfig` for you and usually gets it right. The Mercurial author refused all contributions that edited a config file for you. Instead `hg config` launched your text editor on `~/.hgrc`, saying 'What is it with coders who are intimidated by text-based config files? Like doctors that can't stand blood.'"

Regardless, it seems that while Git feels like the only version control game in town, it isn't. Options for how to solve your problems are always a plus, so if you've been frustrated with the way it seems that everyone does things, know that there are other ways of working, and commit to learning more.
# Where are my Git UI features from the future?

Steno & PL, Jan 5, 2023

Git sucks

The Git version control system has been causing us misery for 15+ years. Since its inception, a thousand people have tried to make new clients for Git to improve usability.

But practically everyone has focused on providing a pretty facade to do more or less the same operations as Git on the command-line — as if Git's command-line interface were already the pinnacle of usability.

@SeanScherer — One thing I didn't get about Pijul (which seems to be a pretty promising approach to VCS - in case you've not stumbled over it before) - is that they seem to be aiming to more or less emulate the Git workflow (when the main developer himself argues that the backend offers far smoother ones... :/ ).

No one bothers to consider: what are the workflows that people actually want to do? What are the features that would make those workflows easier? So instead we get clients which treat git rebase -i as the best possible way to reword a commit message, or edit an old commit, or split a commit, or even as something worth exposing in the UI.

Rubric

I thought about some of the workflows I carry out frequently, and examined several Git clients (some of which are GUIs and some of which are TUIs) to see how well they supported these workflows.

Many of my readers won't care for these workflows, but it's not just about the workflows themselves; it's about the resolve to improve workflows by not using the faulty set of primitives offered by Git. I do not care to argue about which workflows are best or should be supported.

Workflows:

reword: It should be possible to update the commit message of a commit which isn't currently checked out.
- Rewording a commit is guaranteed to not cause a merge conflict, so requiring that the commit be checked out is unnecessary.
- It should also be possible to reword a commit which is the ancestor of multiple branches without abandoning some of those branches, but let's not get our hopes up…

@Azeirah — I found that Sublime Merge does support renaming commits. Just right click any commit, select "edit" and choose the "edit commit message" option

sync: It should be possible to sync all of my branches (or some subset) via merge or rebase, in a single operation.
- I do this all the time! Practically the first thing every morning when coming into work.

@yasyf — I've found Graphite (graphite.dev) makes this pretty easy!
If your whole team is using it, that is.

split: There should be a specific command to split a commit into two or more commits, including commits which aren't currently checked out.
- Splitting a commit is guaranteed to not cause a merge conflict, so requiring that the commit be checked out is unnecessary.
- Not accepting git rebase -i solutions, as it's very confusing to examine the state of the repository during a rebase.

preview: Before carrying out a merge or rebase, it should be possible to preview the result, including any conflicts that would arise.
- That way, I don't have to start the merge/rebase operation in order to see if it will succeed or whether it will be hard to resolve conflicts.
- Merge conflicts are perhaps the worst part about using Git, so it should be much easier to work with them (and avoid dealing with them!).
- The only people who seem to want this feature are people who come from other version control systems.

undo: I should be able to undo arbitrary operations, ideally including tracked but uncommitted changes.
- This is not the same as reverting a commit. Reverting a commit creates an altogether new commit with the inverse changes, whereas undoing an operation should restore the repository to the state it was in before the operation was carried out, so there would be no original commit to revert.

large-load: The UI should load large repositories quickly.
- The UI shouldn't hang at any point, and should show useful information as soon as it's loaded. You shouldn't have to wait for the entire repository to load before you can examine commits or branches.
- The program is allowed to be slow on the first invocation to build any necessary caches, but must be responsive on subsequent invocations.

large-ops: The UI should be responsive when carrying out various operations, such as examining commits and branches, or merging or rebasing.

Extra points:
- I will award honorary negative points for any client which dares to treat git rebase -i as if it were a fundamental primitive.
- I will award honorary bonus points for any client which seems to respect the empirical usability research for Git (or other VCSes). Examples: Gitless (https://gitless.com/) and IASGE (https://investigating-archiving-git.gitlab.io/).

Since I didn't actually note down any of this, these criteria are just so that any vendors of these clients can know whether I am impressed or disappointed by them.

Clients

I picked some clients arbitrarily from this list of clients. I am surely wrong about some of these points (or they've changed since I last looked), so leave a comment.

Update 2023-01-09: Added IntelliJ.
Update 2023-01-10: Added Tower.
Update 2023-05-28: Upgraded Magit's reword rating.

I included my own project git-branchless, so it doesn't really count as an example of innovation in the industry. I'm including it to demonstrate that many of these workflows are very much possible.

| Workflow | Git CLI | GitKraken | Fork | Sourcetree | Sublime Merge | SmartGit | Tower | GitUp | IntelliJ | Magit | Lazygit | Gitui | git-branchless | Jujutsu |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| reword | ❌ 1 | ❌ | ❌ | ❌ | ⚠️ 2 | ❌ | ⚠️ 2 | ✅ | ⚠️ 2 | ⚠️ 2 | ❌ | ❌ | ✅ | ✅ |
| sync | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ |
| split | ❌ 1 | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ |
| preview | ❌ | ❌ | ⚠️ 3 | ❌ | ⚠️ 3 | ❌ | ⚠️ 3 | ❌ | ❌ | ✅ 4 | ❌ | ❌ | ⚠️ 5 | ✅ 6 |
| undo | ❌ | ✅ | ❓ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ⚠️ 7 | ❌ | ✅ | ✅ |
| large-load | ✅ 8 | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ 9 | ✅ | ✅ | ✅ | ❌ |
| large-ops | ✅ 8 | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ 9 | ✅ | ✅ | ✅ | ❌ |

Notes:
1 It can be done via git rebase -i or equivalent, but it's not ergonomic, and it only works for commits reachable from HEAD instead of from other branches.
2 Rewording can be done without checking out the commit, but only for commits reachable from HEAD.
There may be additional limitations.
3 Partial support; it can show whether the merge is fast-forward or not, but no additional details.
4 Can be done via magit-merge-preview.
5 Partial support; if an operation would cause a merge conflict and --merge wasn't passed, then it instead aborts and shows the number of files that would conflict.
6 Jujutsu doesn't let you preview merge conflicts per se, but merges and rebases always succeed and the conflicts are stored in the commit, and then you can undo the operation if you don't want to deal with the merge conflicts. You can even restore the old version of the commit well after you carried out the merge/rebase, if desired. This avoids interrupting your workflow, which is the ultimate goal of this feature, so I'm scoring it as a pass for this category.
7 Undo support is experimental and based on the reflog, which can't undo all types of operations.
8 Git struggles with some operations on large repositories and can be improved upon, but we'll consider this to be the baseline performance for large repositories.
9 Presumably Magit has the same performance as Git, but I didn't check because I don't use Emacs.

Awards

Commendations:
- GitUp: the most innovative Git GUI of the above.
- GitKraken: innovating in some spaces, such as improved support for centralized workflows by warning about concurrently-edited files. These areas aren't reflected above; I just noticed them on other occasions.
- Sublime Merge: incredibly responsive, as to be expected from the folks responsible for Sublime Text.
- Tower: for having a pleasing undo implementation.

Demerits:
- Fork: for making it really hard to search for documentation ("git fork undo" mostly produces results for undoing forking in general, not for the Fork client).
- SmartGit: for being deficient in every category tested.

Related posts

The following are hand-curated posts which you might find interesting.

- 19 Jun 2021: git undo: We can do better
- 12 Oct 2021: Lightning-fast rebases with git-move
- 19 Oct 2022: Build-aware sparse checkouts
- 16 Nov 2022: Bringing revsets to Git
- 05 Jan 2023: (this post) Where are my Git UI features from the future?
- 11 Jan 2024: Patch terminology

Comments
- Discussion on Hacker News
- Discussion on Lobsters
# Bringing revsets to Git

Nov 16, 2022

Intended audience:
- Intermediate to advanced Git users.
- Developers of version control systems.

Origin:
- Experience with Mercurial at present-day Meta.
- My work on git-branchless.

Revsets are a declarative language from the Mercurial version control system. Most commands in Mercurial that accept a commit can instead accept a revset expression to specify one or more commits meeting certain criteria. The git-branchless suite of tools introduces its own revset language which can be used with Git.

Try it out

To try out revsets, install git-branchless, or see Prior work for alternatives.

Existing Git syntax

Git already supports its own revision specification language (see gitrevisions(7)). You may have already written e.g. HEAD~ to mean the immediate parent of HEAD.

However, Git's revision specification language doesn't integrate well with the rest of Git. You can write git log foo..bar to list the commits between foo and bar, but you can't write git rebase foo..bar to rebase that same range of commits.

It can also be difficult to express certain sets of commits:
- You can only express contiguous ranges of the commits, not arbitrary sets.
- You can't directly query for the children of a given commit.

git-branchless introduces a revset language which can be used directly via its git query or with its other commands, such as git smartlog, git move, and git test.

The rest of this article shows a few things you can do with revsets. You can also read the Revset recipes thread on the git-branchless discussion board.

Better scripting

Revsets can compose to form complex queries in ways that Git can't express natively.

In git log, you could write this to filter commits by a certain author:

$ git log --author="Foo"

But negating this pattern is quite difficult; see the Stack Overflow question "equivalence of: git log --exclude-author?".

With revsets, the same search can be straightforwardly negated with not:

$ git query 'not(author.name(Foo))'

It's easy to add more filters to refine your query. To additionally limit to files which match a certain pattern and commit messages which contain a certain string, you could write this:

$ git query 'not(author.name(Foo)) & paths.changed(path/to/file) & message(Ticket-123)'

You can express complicated ad-hoc queries in this way without having to write a custom script.

Better graph view

Git has a graph view available with git log --graph, which is a useful way to orient yourself in the commit graph. However, it's somewhat limited in what it can render. There's no way to filter commits to only those matching a certain condition.

git-branchless offers a "smartlog" command which attempts to show you only relevant commits. By default, it includes all of your local work up until the main branch, but not other people's commits. Mine looks like this right now:

[Image: the smartlog view with a few draft commits and branches.]

But you can also filter commits using revsets.
To show only my draft work which touches the git-branchless-lib/src/git directory, I can issue this command:

[Image: the smartlog view as before, but with only two draft commits visible (excluding those on the main branch).]

Another common use-case might be to render the relative topology of branches in just this stack:

[Image: a different smartlog view, showing branch-1 and branch-3 with an omitted commit between them.]

You can also render commits which have already been checked into the main branch, if so desired.

Better rebasing

Not only can you render the commit graph with revsets, but you can also modify it. Revsets are quite useful when used with "patch-stack" workflows, such as those used for the Git and Linux projects, or at certain tech companies practicing trunk-based development.

For example, suppose you have some refactoring changes to the file foo on your current branch, and you want to separate them into a new branch for review:

[Image: a feature branch with four commits. Each commit shows two touched files underneath it.]

You can use revsets to select just the commits touching foo in the current branch:

[Image: the same feature branch as before, but with the first and third commits outlined in red and the touched file 'foo' in red.]

Then use git move to pull them out:

$ git move --exact 'stack() & paths.changed(foo)' --dest 'main'

[Image: the same feature branch as before, but the first and third commits have been moved to a new feature branch, still preserving their relative order; dotted outlines indicate their former positions.]

If you want to reorder the commits so that they're at the base of the current branch, you can just add --insert:

$ git move --exact 'stack() & paths.changed(foo)' --dest 'main' --insert

[Image: the same feature branch as before, but the first and third commits have been moved to the beginning of that same feature branch, before the second and fourth commits, with dotted outlines indicating their former positions; both groups preserve their relative order.]

Of course, you can use a number of different predicates to specify the commits to move.
See the full revset reference.

Better testing

You can use revsets with git-branchless's git test command to help you run (or re-run) tests on various commits. For example, to run pytest on all of your branches in parallel and cache the results, you can run:

$ git test run --exec 'pytest' --jobs 4 'branches()'

You can also use revsets to aid the investigation of a bug with git test. If you know that a bug was introduced between commits A and B, and has to be in a commit touching file foo, then you can use git test like this to find the first commit which introduced the bug:

$ git test run --exec 'cargo test' 'A:B & paths.changed(foo)'

This can be an easy way to skip commits which you know aren't relevant to the change.

Prior work

This isn't the first introduction of revsets to version control. Prior work:

- Of course, Mercurial itself introduced revsets. See the documentation here: https://www.mercurial-scm.org/repo/hg/help/revsets
- https://github.com/quark-zju/gitrevset: the immediate predecessor of this work. git-branchless uses the same back-end "segmented changelog" library (from Sapling SCM, then called Eden SCM) to manage the commit graph. The advantage of using revsets with git-branchless is that it integrates with several other commands in the git-branchless suite of tools.
- https://sapling-scm.com/: also an immediate predecessor of this work, as it originally published the segmented changelog library which gitrevset and git-branchless use. git-branchless was inspired by Sapling's design, and has similar but non-overlapping functionality. See https://github.com/arxanas/git-branchless/discussions/654 for more details.
- https://github.com/martinvonz/jj: Jujutsu is a Git-compatible VCS which also offers revsets. git-branchless and jj have similar but non-overlapping functionality. It's worth checking out if you want to use a more principled version control system but still seamlessly interoperate with Git repositories. I expect git-branchless's unique features to make their way into Jujutsu over time.
# Sapling: Source control that's user-friendly and scalable

By Durham Goode, Michael Bolin

- Sapling is a new Git-compatible source control client.
- Sapling emphasizes usability while also scaling to the largest repositories in the world.
- ReviewStack is a demonstration code review UI for GitHub pull requests that integrates with Sapling to make reviewing stacks of commits easy.
- You can get started using Sapling today.

Source control is one of the most important tools for modern developers, and through tools such as Git and GitHub, it has become a foundation for the entire software industry. At Meta, source control is responsible for storing developers' in-progress code, storing the history of all code, and serving code to developer services such as build and test infrastructure. It is a critical part of our developer experience and our ability to move fast, and we've invested heavily to build a world-class source control experience.

We've spent the past 10 years building Sapling, a scalable, user-friendly source control system, and today we're open-sourcing the Sapling client. You can now try its various features using Sapling's built-in Git support to clone any of your existing repositories. This is the first step in a longer process of making the entire Sapling system available to the world.

What is Sapling?

Sapling is a source control system used at Meta that emphasizes usability and scalability. Git and Mercurial users will find that many of the basic concepts are familiar — and that workflows like understanding your repository, working with stacks of commits, and recovering from mistakes are substantially easier.

When used with our Sapling-compatible server and virtual file system (we hope to open-source these in the future), Sapling can serve Meta's internal repository with tens of millions of files, tens of millions of commits, and tens of millions of branches. At Meta, Sapling is primarily used for our large monolithic repository (or monorepo, for short), but the Sapling client also supports cloning and interacting with Git repositories and can be used by individual developers to work with GitHub and other Git hosting services.

Why build a new source control system?

Sapling began 10 years ago as an initiative to make our monorepo scale in the face of tremendous growth. Public source control systems were not, and still are not, capable of handling repositories of this size. Breaking up the repository was also out of the question, as it would mean losing monorepo's benefits, such as simplified dependency management and the ability to make broad changes quickly. Instead, we decided to go all in and make our source control system scale.

Starting as an extension to the Mercurial open source project, it rapidly grew into a system of its own with new storage formats, wire protocols, algorithms, and behaviors. Our ambitions grew along with it, and we began thinking about how we could improve not only the scale but also the actual experience of using source control.

Sapling's user experience

Historically, the usability of version control systems has left a lot to be desired; developers are expected to maintain a complex mental picture of the repository, and they are often forced to use esoteric commands to accomplish seemingly simple goals. We aimed to fix that with Sapling.

A Git user who sits down with Sapling will initially find the basic commands familiar. Users clone a repository, make commits, amend, rebase, and push the commits back to the server.
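As a rough sketch of that basic loop (command names are from the Sapling documentation; the repository URL is just an example, and push flags may vary by hosting setup):

$ sl clone https://github.com/git/git   # works directly against a Git repo
$ cd git
$ sl commit -m "Fix a typo"    # no staging area: changed files are committed directly
$ sl amend                     # revise the commit in place
$ sl rebase -d main            # rebase your work onto main
$ sl push                      # push the commits back to the server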
What will stand out, though, is how every command is designed for simplicity and ease of use. Each command does one thing. Local branch names are optional. There is no staging area. The list goes on.

It's impossible to cover the entire user experience in a single blog post, so check out our user experience documentation to learn more. Below, we'll explore three particular areas of the user experience that have been so successful within Meta that we've had requests for them outside of Meta as well.

Smartlog: Your repo at a glance

The smartlog is one of the most important Sapling commands and the centerpiece of the entire user experience. By simply running the Sapling client with no arguments, sl, you can see all your local commits, where you are, where important remote branches are, what files have changed, and which commits are old and have new versions. Equally important, the smartlog hides all the information you don't care about. Remote branches you don't care about are not shown. Thousands of irrelevant commits in main are hidden behind a dashed line. The result is a clear, concise picture of your repository that's tailored to what matters to you, no matter how large your repo.

Having this view at your fingertips changes how people approach source control. For new users, it gives them the right mental model from day one. It allows them to visually see the before-and-after effects of the commands they run. Overall, it makes people more confident in using source control.

We've even made an interactive smartlog web UI for people who are more comfortable with graphical interfaces. Simply run sl web to launch it in your browser. From there you can view your smartlog, commit, amend, checkout, and more.

Fixing mistakes with ease

The most frustrating aspect of many version control systems is trying to recover from mistakes. Understanding what you did is hard. Finding your old data is hard. Figuring out what command you should run to get the old data back is hard. The Sapling development team is small, and in order to support our tens of thousands of internal developers, we needed to make it as easy as possible to solve your own issues and get unblocked.

To this end, Sapling provides a wide array of tools for understanding what you did and undoing it. Commands like sl undo, sl redo, sl uncommit, and sl unamend allow you to easily undo many operations. Commands like sl hide and sl unhide allow you to trivially and safely hide commits and bring them back to life. There is even an sl undo -i command for Mac and Linux that allows you to interactively scroll through old smartlog views to revert back to a specific point in time or just find the commit hash of an old commit you lost. Never again should you have to delete your repository and clone again to get things working.

See our UX doc for a more extensive overview of our many recovery features.

First-class commit stacks

At Meta, working with stacks of commits is a common part of our workflow. First, an engineer building a feature will send out the small first step of that feature as a commit for code review. While it's being reviewed, they will start on the next step as a second commit that will later be sent for code review as well. A full feature will consist of many of these small, incremental, individually reviewed commits on top of one another.

Working with stacks of commits is particularly difficult in many source control systems. It requires complex stateful commands like git rebase -i to add a single line to a commit earlier in the stack.
## First-class commit stacks

At Meta, working with stacks of commits is a common part of our workflow. First, an engineer building a feature will send out the small first step of that feature as a commit for code review. While it’s being reviewed, they will start on the next step as a second commit that will later be sent for code review as well. A full feature will consist of many of these small, incremental, individually reviewed commits on top of one another.

Working with stacks of commits is particularly difficult in many source control systems. It requires complex stateful commands like git rebase -i just to add a single line to a commit earlier in the stack. Sapling makes this easy by providing explicit commands and workflows that let even the newest engineer edit, rearrange, and understand the commits in the stack.

At its most basic, when you want to edit a commit in a stack, you simply check out that commit via sl goto COMMIT, make your change, and amend it via sl amend. Sapling automatically moves, or rebases, the top of your stack onto the newly amended commit, allowing you to resolve any conflicts immediately. If you choose not to fix the conflicts now, you can continue working on that commit, and later run sl restack to bring your stack back together once again. Inspired by Mercurial’s Evolve extension, Sapling keeps track of the mutation history of each commit under the hood, allowing it to algorithmically rebuild the stack later, no matter how many times you edit the stack.

Beyond simply amending and restacking commits, Sapling offers a variety of commands for navigating your stack (sl next, sl prev, sl goto top/bottom), adjusting your stack (sl fold, sl split), and even for automatically pulling uncommitted changes from your working copy down into the appropriate commit in the middle of your stack (sl absorb, sl amend --to COMMIT).
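Putting those commands together, a mid-stack edit might look roughly like this; COMMIT is a placeholder hash.

```
$ sl goto COMMIT     # jump to a commit in the middle of the stack
$ vi server.py       # make the fix
$ sl amend           # descendants are rebased onto the amended commit automatically
$ sl restack         # if conflicts were deferred, reassemble the stack now
$ sl absorb          # or: push pending working-copy edits down into the right commits
```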
## ReviewStack: Stack-oriented code review

Making it easy to work with stacks has many benefits: commits become smaller, easier to reason about, and easier to review. But effectively reviewing stacks requires a code review tool that is tailored to them. Unfortunately, many external code review tools are optimized for reviewing the entire pull request at once instead of the individual commits within it. This makes it hard to have a conversation about individual commits and negates many of the benefits of having a stack of small, incremental, easy-to-understand commits.

Therefore, we put together a demonstration website that shows just how intuitive and powerful stacked-commit review flows could be. Check out our example stacked GitHub pull request, or try it on your own pull request by visiting ReviewStack. You’ll see how you can view the conversation and signal pertaining to a specific commit on a single page, and how you can easily move between different parts of the stack with the drop-down and navigation buttons at the top.

## Scaling Sapling

Note: Many of our scale features require using a Sapling-specific server and are therefore unavailable in our initial client release. We describe them here as a preview of things to come. When using Sapling with a Git repository, some of these optimizations will not apply.

Source control has numerous axes of growth, and making it scale requires addressing all of them: number of commits, files, branches, merges, length of file histories, size of files, and more. At its core, though, it breaks down into two parts: the history and the working copy.

## Scaling history: Segmented Changelog and the art of being lazy

For large repositories, the history can be much larger than the size of the working copy you actually use. For instance, three-quarters of the 5.5 GB Linux kernel repo is the history. In Sapling, cloning the repository downloads almost no history. Instead, as you use the repository we download just the commits, trees, and files you actually need, which allows you to work with a repository that may be terabytes in size without having to download all of it. Although this requires being online, efficient caching and indexes let us maintain a configurable ability to work offline in many common flows, like making a commit.

Beyond just lazily downloading data, we need to be able to efficiently query history. We cannot afford to download millions of commits just to find the common ancestor of two commits or to draw the smartlog graph. To solve this, we developed the Segmented Changelog, which allows downloading the high-level shape of the commit graph from the server, taking just a few megabytes, and lazily filling in individual commit data later as necessary. This enables querying the graph relationship between any two commits in O(number-of-merges) time, with nothing but the segments and the positions of the two commits within them. The result is that commands like smartlog take less than a second, regardless of how big the repository is.

Segmented Changelog speeds up other algorithms as well. When running log or blame on a file, we’re able to bisect the segment graph to find the history in O(log n) time instead of O(n), even in Git repositories. When used with our Sapling-specific server, we go even further, maintaining per-file history graphs that allow answering sl log FILE in less than a second, regardless of how old the file is.

## Scaling the working copy: Virtual or Sparse

To scale the working copy, we’ve developed a virtual file system (not yet publicly available) that makes it look and act as if you have the entire repository. Clones and checkouts become very fast, and while accessing a file for the first time requires a network request, subsequent accesses are fast and prefetching mechanisms can warm the cache for your project.

Even without the virtual file system, we speed up sl status by utilizing Meta’s Watchman file system monitor to query which files have changed without scanning the entire working copy, and we have special support for sparse checkouts to allow checking out only part of the repository.

Sparse checkouts are particularly designed for easy use within large organizations. Instead of each developer configuring and maintaining their own list of which files should be included, organizations can commit “sparse profiles” into the repository. When a developer clones the repository, they can choose to enable the sparse profile for their particular product. As the product’s dependencies change over time, the sparse profile can be updated by the person changing the dependencies, and every other engineer automatically receives the new sparse configuration when they check out or rebase forward. This allows thousands of engineers to work on a constantly shifting subset of the repository without ever having to think about it.

To handle large files, Sapling even supports using a Git LFS server.

## More to Come

The Sapling client is just the first chapter of this story. In the future, we aim to open-source the Sapling-compatible virtual file system, which enables working with arbitrarily large working copies and makes checkouts fast, no matter how many files have changed.

Beyond that, we hope to open-source the Sapling-compatible server: the scalable, distributed source control Rust service we use at Meta to serve Sapling and (soon) Git repositories. The server enables a multitude of new source control experiences. With the server, you can incrementally migrate repositories into (or out of) the monorepo, allowing you to experiment with monorepos before committing to them.
It also enables Commit Cloud, where all commits in your organization are uploaded as soon as they are made, and sharing code is as simple as sending your colleague a commit hash and having them run sl goto HASH.

The release of this post marks my 10th year of working on Sapling at Meta, almost to the day. It’s been a crazy journey, and a single blog post cannot cover all the amazing work the team has done over the last decade. I highly encourage you to check out our armchair walkthrough of Sapling’s cool features. I’d also like to thank the Mercurial open source community for all their collaboration and inspiration in the early days of Sapling, which started the journey to what it is today.

I hope you find Sapling as pleasant to use as we do, and that Sapling might start a conversation about the current state of source control and how we can all hold the bar higher for the source control of tomorrow. See the Getting Started page to try Sapling today.
# Build-aware sparse checkouts

Steno & PL, by Waleed Khan (@arxanas). Oct 19, 2022.

My team and I recently gave a talk at Git Merge 2022 on our tool Focus, which integrates Git sparse checkouts with the Bazel build system. A recording of the talk (~15 minutes) is available, along with some slides I published internally; the slides go into the technical details of content-hashing, and also have doodles.

Related posts:

- 19 Jun 2021: git undo: We can do better
- 12 Oct 2021: Lightning-fast rebases with git-move
- 19 Oct 2022: (this post) Build-aware sparse checkouts
- 16 Nov 2022: Bringing revsets to Git
- 05 Jan 2023: Where are my Git UI features from the future?
- 11 Jan 2024: Patch terminology

This is a personal blog. Unless otherwise stated, the opinions expressed here are my own, and not those of my past or present employers.
# Jujutsu (jj): a Git-compatible VCS (conference talk transcript)

My name is Martin von Zweigbergk. I expected my speaker notes to be here somewhere, but anyway: I work on source control at Google, and I'm going to talk about a project I've been working on for almost three years. It's a Git-compatible VCS called Jujutsu, and in case you're wondering, the name has nothing to do with Jujutsu Kaisen, the anime. It's called Jujutsu just because the binary is called jj, and the binary is called jj because it's easy to type.

Here's an overview of the presentation. First I'm going to give you some background about me and about the history of source control at Google. Then I'm going to go through the workflows and architecture of jj, and at the end I'll explain what's next for the open source project and for our integration at Google.

Some background about me: after graduating I worked for Ericsson for about seven years, and while there I think it's fair to say I drove the migration to Git from ClearCase. I also cleaned up some of Git's rebase scripts in my spare time. Then I joined Google and worked on a compensation app, and for the last eight years I've worked on Fig, which is a project to integrate Mercurial as a client for our in-house monorepo.

For context, let me tell you a bit about the history of version control at Google. A long time ago we supposedly started with CVS, and then we switched to Perforce. After a while Perforce wasn't able to handle the repository anymore because it got too large, so we wrote our own VCS called Piper. But the working copy was still too big, so we created a virtual file system called CitC on top of Piper, and that's what almost every user at Google uses now. People were still missing the DVCS workflows they were used to from outside Google, so we added Mercurial on top of that, and as I said, that's what I've been working on for the last eight years. In case you didn't know, our monorepo at Google is extremely large and has, like, all the source code at Google; you can watch the talk by Rachel Potvin from @Scale if you're curious.

Generally people really like Fig, but there are still some major problems. Probably the biggest one is performance, partly because Python is slow and partly because of eager data structures that don't scale to the size of the repo. Another problem is consistency: we're seeing write races, because Mercurial is not designed for distributed storage, so we get corruption when we store it on top of our distributed file system. Another pain has been integration, because we're calling the Mercurial CLI and parsing the output, which is not fun.

A few years ago we started talking about what we would want from a next-gen VCS, and one of the ideas that came up was to automatically make a commit from every save in your editor. I got really excited by that idea and started jj to experiment with it. I worked on it as my 20% project for about two years, and this spring we decided to invest more, so now it's my 100% project.

You may be wondering why I didn't decide to just add these features to Git. As I said, I wanted to experiment with a different UX, and I think that would end up being a completely separate set of commands inside of Git, which would be really ugly and probably wouldn't get accepted upstream. Also, I wanted to be able to integrate it into Google's ecosystem, and we had already decided with Fig against using Git there because of the problems with integrating it into our ecosystem at Google. One of those problems is that there are multiple implementations of Git that read from the file system, so we would have to add any new features in at least two or three places.

Okay, so let's take a look at the workflows. The first feature is anonymous branches, which is something I copied from Mercurial. Instead of Git's scary detached-HEAD workflow or state, jj keeps track of all your commits without you having to name them. That means they show up in log output and they will not be garbage-collected. It may seem like you would get a very cluttered log output very quickly, but whenever you rewrite a commit (if you amend it or rebase it, for example) the old version gets hidden, and you can also manually hide commits with jj abandon.

One of the first things you'll notice if you start using jj is that the working copy is an actual commit that gets automatically committed. It shows up in the log output with an @ symbol, and whenever you make changes in the working copy and then run any jj command, they get automatically amended into that commit. The jj checkout command actually creates a new commit on top of the commit you specify, to store your working-copy changes; if you instead want to resume editing an existing commit, you can use jj edit.

This has some very interesting consequences. An important one is that the working copy is never dirty: if you check out a different commit, or rebase, or anything else, it will never tell you that you have unstaged changes. You get automatic backups, because every command you run creates another snapshot for you. It makes stash unnecessary, because your working-copy commit is effectively a stash. It also makes commit unnecessary, because your working copy is already committed whenever you run a command; if you want to set the description (the commit message), you can run jj describe to do that at any time. We get a more consistent CLI, because the working-copy commit behaves just like any other commit, so there are no special flags to act on the working copy. For example, jj restore can restore files between any two commits; it defaults to restoring from the parent of the working copy into the working copy, just like git restore, I think, but you can pass any two commits. Also, uncommitted changes stay in place: they don't move around with you like they do with git checkout.
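A rough sketch of that working-copy-as-commit workflow, using only the commands named in the talk; hashes are placeholders, and jj's command set has evolved since, so exact names may differ.

```
$ jj log                       # the working copy appears as a commit marked with @
$ echo change >> file.txt      # edit; the @ commit absorbs this on the next jj command
$ jj describe -m "My change"   # set the commit message whenever you like
$ jj checkout xyz              # new working-copy commit on top of commit xyz
$ jj edit abc                  # or resume editing an existing commit
$ jj abandon def               # manually hide a commit you no longer want
```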
Another one of jj's distinguishing features is first-class conflicts. If you look at the screenshot, we merge the working copy with some other commit from another branch, and that succeeds even though there are conflicts; we can see in the log output afterwards that there are conflicts in the working copy. As you can see, the working-copy commit there, with the @, is a merge commit with conflicts. These conflicts are recorded in the commit in a structured way; they're not just conflict markers stored in a file.

This design leads to even more magic. The most obvious part is that you can delay resolving conflicts until you feel like it: in this case, when we had this conflict, we could just check out another commit and deal with the conflicts later if we wanted to. You can also collaborate on conflict resolutions: you can resolve some of the conflicts in these files and leave the rest for your co-worker, for example.

Maybe less obvious is that a rebase never fails. We saw here that merge doesn't fail, and the same is true for rebase and all the other similar commands, which makes the --continue and --abort variants of rebase, cherry-pick, and all those similar commands unnecessary. We also get a more consistent conflict-resolution flow: you don't need to remember which command created your conflicts. You just check out the commit with the conflict, resolve the conflict in the working copy, and then squash that conflict resolution into the parent that had the conflict. Or, as in the screenshot here, we were already editing that commit, so we just resolve the conflict in the working copy and it's gone from the log output. So in the second screenshot the conflict has been resolved in the working copy: the merge commit that is the working copy resolves the conflict. We'll come back to that in the next slide.

A very important feature, which took me a year to realize was possible, is that because rebase always succeeds, we can always rebase all descendants. If you check out a commit in the middle of a stack and amend in some changes, jj will always rebase all the descendants on top, so you don't have to do that yourself, and it moves branches over and so on.

We can actually even rebase conflicts and conflict resolutions; I don't have time to explain how that works, but it does. In the top screenshot here, continuing from the previous slide, we rebased one side of the merge, conflicting change 2 and its descendants, onto conflicting change 1, and of course we get the same conflict as we had in the merge commit before, and it stays in the descendant, the non-conflicting change. But in the working copy, because we had the conflict resolution in the working copy (which was a merge commit before), that resolution gets rebased over and stays in the working copy, so we don't have a conflict in the working copy here.

In the second screenshot, we ran jj move --to the first commit with the conflict. That command is a bit like the autosquash someone talked about earlier today, but much more generic: you can move a change from any commit into any other commit. In this case we're moving the changes from the working copy (that's the default when there's no --from) into the first commit with the conflict, and then the conflict is gone from the entire stack and the working copy in this case becomes empty.

We can also rebase merges correctly, by rebasing the diff compared to the auto-merged parents onto the new auto-merged parents. That's actually how jj treats all diffs or changes in commits, not just when rebasing: when you diff a merge commit, it's compared to the auto-merge of the parents, just like the three-way merge I think Elijah talked about before.
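As a sketch of that conflict-resolution flow, using command names from the talk's era of jj (hashes are placeholders):

```
$ jj rebase -d main          # never fails; conflicts are recorded inside the commits
$ jj log                     # conflicted commits are marked in the output
$ jj checkout <conflicted>   # check the conflicted commit out like any other
# ...fix up the conflicted files in the working copy...
$ jj squash                  # fold the resolution into the conflicted parent
```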
The last feature I was going to talk about is what I call the operation log. You can think of it as the entire repo being under source control, and by the entire repo I mean the refs, the anonymous heads, and the working copy's position in each worktree. It's kind of like Git's reflogs, but across all the refs at once, and with some extra metadata: in the top screenshot you can see the username, hostname, and time of each operation.

As you can imagine, having these snapshots of the repository at different points in time lets us go back and look at the repository from a previous snapshot. In the middle screenshot, we run jj status at a particular operation, just before we had set the description on the commit, so we see that the working-copy commit has no description but has a modified file, because that was the state at that point in time. You can run this with any command, not just status: log, or diff, or anything. Of course this is very useful when you're trying to figure out how a repository got into a certain state, especially if it's not your own repository and you don't remember what happened.

The last screenshot shows how we restored the entire repo to an earlier state, in this case before the file was even added: that's the second operation from the bottom, when we had just created the repository, so the working copy is empty and there's no description. Restoring back to an earlier state like that is very useful when you have made a mistake, but jj actually even lets you undo a single operation without undoing all the later operations. That works kind of like git revert does, but instead of acting on the files in a commit, it acts on the refs in an operation. The screenshots here show how we undo the second-to-last operation, which had abandoned a commit; we undo that operation and the commit becomes visible again.

The operation log also gives us safe concurrency, even if you run commands on a repository on different machines and sync them via Dropbox, for example. You can get conflicts, of course, but you will not lose data or get corruption, and in those cases, if you sync two operations via Dropbox, you'll see a merge in the operation log output.

We also get a simpler programming model, because transactions never fail. When you start a command, the repository is loaded at the latest operation; any changes that happen concurrently will not be seen by that command, and when the command is done it commits the transaction as a new operation, which gets recorded and can't fail, just like writing a Git commit can't fail. This is actually why I developed the operation log: for the simpler programming model. The undo and time-travel stuff is just an accident.
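A sketch of the operation-log commands implied above; operation IDs are placeholders, and I'm assuming the jj op subcommand layout, which may differ by version.

```
$ jj op log                   # who did what, when, and on which host
$ jj status --at-op <op-id>   # view the repo as of an earlier operation
$ jj undo <op-id>             # undo one operation without undoing later ones
$ jj op restore <op-id>       # restore the entire repo to an earlier state
```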
Okay, that's all about features and workflows; let's take a look at the architecture. jj is written in Rust, and it's written as a library, to be easy to reuse. The CLI is currently the only user of that library, but the goal is to be able to use it in a server, a CLI, a GUI, or an IDE, for example. I also hope that by making it a library we reduce the temptation to rewrite it and end up with the same problem as Git.

Because I wanted jj to be able to integrate with the ecosystem at Google, I tried to make it easy to replace different modules. For example, it comes with two different commit-storage backends by default: one is the Git backend, which stores commits as Git commits, and the other is just a very simple custom one; but it should be possible to write one that stores commits in the cloud, for example. Same thing with the working copy: it's normally stored on local disk, but the goal is to be able to replace that with an implementation that writes the working copy to a VFS, integrating with a smarter VFS by not actually writing all the files.

It also needs to be scalable to very large repositories, which means we can't have any operations that need all the ancestors of a commit, for example. We achieve that mostly through laziness, not fetching objects we don't need, and by trying to be careful when designing algorithms so they don't scale with the size of the repository or the size of a tree unless necessary.

Another important design decision was to perform operations in the repository first, not in the working copy. All operations write commits and update references to point to those commits, and only afterwards, after the transaction commits, is the working copy updated to match the new state, if the working copy was even changed by the operation. It helps here that the working copy is a commit, because even when you're running a command that updates the working copy, you just commit the transaction that modifies the working-copy commit and then update the working copy afterwards. We get a simpler programming model because we always just create whatever commits we need and don't have to worry about what's in the working copy; the same thing as Elijah was talking about, I think, with merge-ort. This also makes things much faster, by not touching the working copy unnecessarily, and it helps with the laziness, because you don't need to fetch objects from the backend (which might be a server) just to update the working copy and then update it back later.

As I said, jj is Git-compatible, and you can start using it on a Git project independently of others who work on the same project. This compatibility happens at a few different levels. At the lowest level there's commit storage: there's a commit-storage backend that stores commits in a backing Git repository, reading commits from the Git object store and converting them into the in-memory representation that jj expects. There's also functionality for importing Git refs into jj and exporting refs to Git, and for interacting with Git remotes. Of course, these commands only work when the backend is the Git backend, not with the custom backend.
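A sketch of what that Git interop looks like from the CLI, assuming the jj git subcommands of that era; the URL is a placeholder.

```
$ jj git clone https://example.com/project.git   # a jj repo backed by a Git repo
$ jj git fetch                                   # talk to ordinary Git remotes
$ jj git import                                  # pick up refs changed by plain git
$ jj git export                                  # make jj's refs visible to git
$ jj git push                                    # push branches back upstream
```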
Since I worked on Mercurial for many years, there are many good things from it I want to replicate, such as the simple, clean UX, with for example its revsets language for selecting revisions, and the simpler workflows without the staging area; I copied those. Mercurial's history-rewriting support is also pretty good, with hg split and hg fold for splitting and squashing commits, and hg rebase, which rebases a whole tree of commits, not just a linear branch; I copied those things as well. Mercurial is also very customizable, mostly thanks to being written in Python. Since jj is written in Rust we can't just copy that; you can't use monkey-patching in the same way. I hope the modular design I talked about earlier at least helps there, but it still has a long way to go to be as customizable as Mercurial.

Okay, let's take a look at our plans for the open source project and our integration at Google. Remember, this has been my 20% project for most of its lifetime, so there's a lot of features and functionality missing. For example, GC and repacking are not implemented, so we need to implement that for both commits and operations. For backends that lazily fetch objects over the network, we need to do some batched prefetching and caching to make that perform okay. There is no support for copies or renames, so we have to decide whether we want to track them, like Mercurial or BitKeeper do, or detect them, like Git does; I'm happy to hear suggestions. Lots of other features are missing as well; on this particular one I actually got a pull request a few days ago, but still, a blame command, for example, is not there. We need to make it easier to replace different components, for example adding a custom command or a custom backend through the API, without having to patch the source code. Language bindings: again, we want to avoid rewrites, so I hope people can use the API instead, and we want to make that as easy as possible from different languages. Security hardening. And the custom, non-Git backend has very little support; there is no support for push and pull, for example, so that will need to be done at some point, but the Git backend is what I would recommend to everyone for many years ahead anyway. Of course, contributions are very welcome; the URL for the project is there. We don't want this to be just Google running the project.

As I said, this is now my full-time project at Google, and the goal of that project is to improve the internal development ecosystem. There are two big parts to it. One is to replace Mercurial with jj internally. The other is to move the commit storage and repository storage out of the file system and into the cloud. We're hoping to create a single huge commit graph, with all the commits from all Google developers in just one graph, and by having the repositories available in the cloud, you should be able to access them from anywhere: a cloud IDE, for example, or a review tool should be able to access your repository and modify it.

This diagram shows how we're planning to accomplish that. Most of it is the same as on the previous slide; I've added the new systems and integrations in red. One of the first things we're going to have to do is replace the commit graph component (called the commit index in the diagram), because it currently assumes that all ancestors are available, which of course doesn't scale to the size of our monorepo. We're going to have to rewrite it to make it lazy, probably using something called Segmented Changelog, which our friends at Meta developed for their Mercurial-based VCS. Then we're going to add new backends for the internal cloud-based storage of commits and operation logs (operation logs are basically repositories), and we'll add a custom working-copy implementation for our VFS. We're hoping not to use sparse checkouts: instead, with our VFS, we can basically just tell the VFS which commit to check out, plus a few files on top. We'll also add a bunch of commands for integrating with our existing review tooling, for starting reviews, merging into mainline, and things like that. And then we're going to add a server on top of all this that will be used by the cloud IDE and the review tool.

That was all I had, so thanks for listening. If you have any questions, you can find a link to our Discord chat on the project's GitHub page, and feel free to email me at martinvonz@google.com. Thank you.
# Fossil and Pijul: in search of the next version control system (conference talk transcript)

Okay, good afternoon, great to see so many of you here; welcome to this talk. My name is Hanno, I'm from the Netherlands, and in the Netherlands I work as an IT consultant at Info Support. My Twitter handle is @hannotify. I thought it was rather clever, "get notified about things that Hanno is doing", but if you don't think so, that's not a problem; my wife also doesn't think it's very clever, but it's good enough for me. I tweet about stuff that I spend way too much time on: programming, Java, music (I'm a musician, so also whenever a pop song makes me think of software development), version control, and things I encounter in my projects. If you like that, feel free to give me a follow.

When it comes to version control, the topic of today's talk, I have come a long way. When I was in college I didn't use Git or Subversion; I did version control by email. Who recognizes this? Version control by email: you take the code repository, do a few changes, zip it up, and send it via mail to your classmates: "this is the final version." Then you get a reply ten minutes later: "no, THIS is the final version." You get it. When I started my first job in IT in 2007, I was really eager to learn something new in the world of version control, but at that place of work they used version control by USB stick. Also very modern, you know: you put your changeset on a USB stick, walk over to a central computer, merge it manually over there, and go back to your workstation with a fresh copy. A disaster.

Thank goodness it's 15 years later today, and I have managed to gain some experience with proper version control systems like Subversion and Git. I even teach a Git course at Info Support: a one-day course where I teach our junior colleagues how to use Git as a developer, so lots of command-line stuff, and also the pros and cons of distributed version control and how it compares to earlier systems like CVS and Subversion. This is actually one of the core slides: it displays which version control systems have emerged until now, with their publication dates, and it tries to put this into perspective by comparing each publication date to the most modern phone available at the time. For example, the Nokia 3310 (who had one of those? Nokia friends, hi!) correlates to Subversion: both this phone and Subversion are indestructible, and the battery was excellent, though Subversion doesn't have a battery, so that comparison doesn't make any sense at all. Whereas CVS, published in 1990 and obviously ancient, refers to this one here; it even has a power block, also ancient.

At the end of one particular course, a student came up to me and said: "I really like that you taught us Git, and I'm going to be happy using it. But it seems like kind of an old system, right? It was published in 2005, and it's now 2019. Why are we still using it 14 years later? Isn't that kind of old for a version control system? What's going to be the next big thing?"

To be honest, I couldn't answer her question. I wasn't sure; I had no idea. I thought, well, it's probably going to be around forever, but of course I didn't say that out loud, because nothing lasts forever in the IT world, right? Next week there will probably be a new JavaScript framework. So I couldn't tell her that, and I really didn't like that, so I decided to investigate for a bit. Things got kind of out of hand, because I didn't just tell her the answer; I had to investigate it, and the investigation turned into a conference talk, and it's the one you're attending right now. So welcome to you all, thank you for attending, and let's try to find the answer to this student's question together, shall we?

I think that to answer the question, we have to discover what factors influence the popularity of version control today. Let's take Git as an example: why do we think Git became so popular? If we look at this graph for a moment, we see one of the reasons. This is a graph of the popularity of version control systems until 2016, and we see that from the moment GitHub was launched, right here, Git started growing more rapidly. And when Bitbucket, a hosting service that initially supported Mercurial, started supporting Git around 2012, Git started to rise even more in popularity. So a very important factor is hosting platform support.

But I think there are more reasons, for example killer features. When Git was published it was meant as a replacement for BitKeeper, the only distributed version control system at the time, which had a commercial license; when that commercial license was going to apply to open source development as well, a lot of new projects emerged, Mercurial and also Git. So: free to use. It was also meant to be fast ("everyday operations should always take less than a second" is what Linus Torvalds said about this), and it supported easy branching, unlike CVS, for example, where a branch is just a carbon copy of a directory, which could take a very long time with a lot of nesting and small files.

Hosting platform support for Git is of course superb: there are 18 different websites that offer public Git repositories as of 2022. And open source community support is also an important reason, I think: Git has been the driving force behind global open source development, with contributors spread across the globe.

I'd like to use these strengths of Git as prediction variables: factors that will influence our prediction of the next big thing. And I would like to add another one, which I call the handicap of the head start, because it's always hardest for the top product to remain the top product. Only if you innovate regularly will you remain on top, and not every product has managed this. For example, Internet Explorer was just declared dead at the start of this year: it failed to innovate, failed to add tabbed browsing during the browser wars, and got overtaken by Firefox and later Google Chrome. The same thing happened to Subversion, or maybe to your favorite football team. I'm a Feyenoord supporter, for the Dutchies in the room, so I really know what it is to get on top and not stay there; we only become champions about once every 18 years.

When we put the graph data into a table, this is what we see: in 2021 Subversion is down to eight percent and Git is up to 74. Now, the graph didn't run until 2021, so I had to extrapolate a bit, but I think these numbers are about right. The question is: what will happen in 2032? Will Git still be the top one? Of course, we also have to keep in mind that new products will emerge, so let's add some new products to the mix. In the rest of the talk I'll add two newer version control systems to this table; they are called Fossil and Pijul. Let's talk about Fossil first.
Fossil has the following features. It is a distributed version control system, published a few years after Git, and it has integrated project management: bug tracking, a wiki, tech notes. Sometimes it's called "GitHub in a box", and all of these features you can pull up on your local development machine through the command fossil ui; I'll show you in a bit. There's really no need to use any third-party products here, because you get a fully featured developer website with the version control system.

So, like I said, there's this built-in web interface. There's also an autosync merge mode, which means, in Git terms, that after you commit you automatically push if there are no conflicts; but you can use a manual merge mode if you want. Fossil can also show the descendants of a check-in, whereas Git can only show ancestors; with Fossil you can also list descendants, so relationships between check-ins are effectively bi-directional. It supports operations on multiple repositories, so if you're doing microservices and every application is a separate repository, Fossil can perform operations on all the repositories it knows of. And it has a "preserve all history" philosophy, which means there is no rebase in Fossil: you can't change the history. "Fossil" would also be a very bad name for a version control system that could change the history, so the name is well chosen, I think.

It was written specifically to support the development of SQLite, and the SQLite project uses Fossil for its development. The two projects reap the benefits of one another, because Fossil uses an SQLite database to store the relationships between check-ins. There is free code hosting at chiselapp.com, so you can use that if you want to try out Fossil; but really, Fossil just needs an SQLite database file, so I guess any hosting provider could do the job. It's kind of flexible; like I said, you can just host it yourself. A repository in Fossil is a single SQLite database file, and it contains the relations between the check-ins, so it can produce both ancestors and descendants.

Let's see how Fossil works in a quick demo; I'll switch to a different screen for you. We are in an empty directory right now, and we're going to create a new Fossil repository: let's create a demo directory and create a new repository. With Fossil, the repository file is kept outside the working directory. This is different from Git, where there's the nested .git folder; with Fossil you keep the repository file outside the directory. Let's change into demo, and then we have to connect this directory to the repo file in the folder above us; now we've connected them. Let's create a Java file and add it to our repository. We can see the changes if we want, there they are, and we can commit, like so: fossil commit -m "initial commit". There we go, we've got a new commit. And if we want to see the developer website, we just run fossil ui, and there it is: it shows the check-ins and the history of our repository, and you can browse the files if you want.

I've called this file Random.java, so the idea is to provide some random implementation here. Let's create a class Random with a main method that prints the result of the random() call. Now, every time I want to generate a random number, I always try to roll a die, like in the XKCD comic; someone recognizes it. So I'm just rolling it: you two people over there, maybe you can help me out and see which number it lands on. The result is one, ladies and gentlemen, so let's return 1, generated by a die, guaranteed to be random. Right. Yes, thanks. I'll cover that later.

Okay, so let's return to our Fossil repository and close the UI down, just for the sake of sanity. Oh, I've forgotten something here; the demo effect, thank you folks very much. There we go. Let's add this new random implementation to Fossil and do a fossil commit again: "implement random". We have committed it, but the Fossil UI isn't running because I killed it; restart it, and there's our "implement random" check-in. If you created a branch in Fossil, you could also take a check-in and see its descendants. This is a short talk, so I'm not going to be able to show much of that, but it's a nice feature of Fossil, in my opinion.
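Reconstructing the first half of the demo as a shell session (file names as in the talk; output omitted):

```
$ fossil init ../demo.fossil    # the repository is a single SQLite database file,
                                # kept outside the working directory
$ mkdir demo && cd demo
$ fossil open ../demo.fossil    # connect this directory to the repository
$ touch Random.java
$ fossil add Random.java
$ fossil changes                # list uncommitted changes
$ fossil commit -m "initial commit"
$ fossil ui                     # launch the built-in developer website
```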
Returning to the slide deck: Git versus Fossil, actually. Fossil is really meant for small-scale teams. The Fossil team and the SQLite team are both small, four or five developers, and they all know each other personally; whereas Git serves a major global open source project, with some 5,000 contributors to the Linux kernel. So the purposes of these two version control systems are very different. The engagement is global with Git and personal with SQLite and Fossil, and the workflows are also very different. With Git, the "dictator and lieutenants" workflow was conceived by Linus Torvalds and his co-workers: because the Linux kernel is so vast, so many lines of code, Linus only trusts a few specific people with certain parts of the kernel, which is why he calls them his lieutenants. With Fossil, a very small team, the workflow is built on trust.

So here is something to think about: why are most projects using Git, a version control system that was designed to support a globally developed, major open source project like the Linux kernel? Are our own projects actually like that, or are a lot of projects actually like SQLite and Fossil: a small-scale team with a limited number of members? I'm not saying you should switch to Fossil; I'm just saying, maybe try it out for a change and see if you like it.

The second version control system is Pijul; time to dive into that, and the first thing we need to address is the name. It's a Spanish word for a bird that's native to Mexico, a bird known to do collaborative nest building, and I kind of like the analogy of collaboratively building something using a version control system. They tried different names, but this one is very uniquely googleable, and, well, they have a fair point, so Pijul it is. This version control system was first published in 2015, and it's a patch-based system (also distributed, by the way), and the patch-based design is kind of neat. It's based on a sound theory of patches: they have mathematical proofs that their patch theory works and that it doesn't lead to as many conflicts, with all kinds of examples. I'm not a mathematician, but if you like that stuff you should really check out their website. It's by no means the first patch-based version control system, because Darcs is also based on patches, but Darcs never got all that popular, because it was kind of slow, and one of the main goals of Pijul is to fix the slowness of Darcs.
Pijul also has a feature called interactive recording, which means that when you have added files to a patch (in Pijul, a changeset is called a patch), you can still choose which parts of each file you want to include in your change and which parts you want to leave out.

Some quick facts about Pijul: it's written in Rust, one of the faster languages around if you look at language speed comparisons, and it's been bootstrapped since 2017, which means the Pijul developers use Pijul as the version control system for Pijul's own source code, just like the people at IntelliJ use their own IDEs to develop their own IDEs. They eat their own dog food, basically. You can use free code hosting at nest.pijul.com; their hosting platform is called the Pijul Nest, like the birds that do the collaborative nest building, so it all makes sense now.

Like I said, Pijul is based on a patch-oriented design. A patch is an intuitive, atomic unit of work, and the system focuses on changes instead of differences between snapshots. With Git, you create a version of a file, then a next version, and Git saves the two states of the file; with changes, Pijul stores the things you added to or removed from the file, not the file states themselves. This also means that applying or unapplying a patch doesn't change its identity. With Git it does change identity: when you apply a commit to a different branch it gets a different ID, and it's not exactly the same commit internally; the contents will be similar, but not identical. The end result of applying several patches in Pijul is always the same, regardless of the order in which they were applied. Pijul achieves this by keeping track of dependent patches: if I create a file in patch 1 and modify it in patch 2, then when I want to apply patch 2 to a different file set, Pijul knows it has to apply patch 1 first, and after that it applies patch 2.

This also means that rebase and merge don't really exist: applying a patch is just applying a patch. If you really wanted to merge two Pijul "branches" (a branch in Pijul is called a channel), you would just apply the patches that are the difference between those two channels; there is no merge command, and there is certainly no rebase command. Pijul does model conflicts, so conflicts can happen, but to resolve one you just add a third patch on top of the two conflicting patches, and because patch identities never change, you can always reuse that third resolution patch in another spot, or on another channel, to fix the same conflict. So you don't have to fix conflicts over and over again, like you sometimes do with Git.

Let's make it more visual; I'll use the second slide, actually: if commits were bank transactions, this is the difference between snapshot-based and patch-based. If this is my bank account and my initial balance is 100, Git stores "100" and Pijul stores "+100": apparently you've gained 100 (euros or dollars, doesn't matter). When I get paid by my employer (it's an okay salary, not much, but, you know, 300 euros), Git stores "400", because that's the end result of the change, but Pijul stores "+300", because that's the actual change. Then I have to pay my heating bill, and of course the whole balance will be gone, because heating is very expensive nowadays; so Git stores "0" and Pijul stores "-400".
This is essentially the difference between snapshot-based and patch-based systems. Okay, let's demo this for a bit and see how it works. Here I've got an empty directory with Pijul installed, and the first thing we need to do is initialize the repository, so let's run pijul init demo: we get a new folder, just like with Git, so that seems easy enough. Now, I want to keep track of the movies I've seen; this is my use case. This week I'm watching one movie; next week I'll watch another one. So I'll create a channel for next week, and if I ask Pijul which channels it knows, there's a main channel and a next-week channel. Right now we're on the main channel.

Let's create a movies file. This week I'm at Devoxx, and in my spare time I can watch the movies I like, because I'm all alone right now, and I was planning to watch a movie by Quentin Tarantino. I saw it was on Netflix; it's been a long time coming, I should have seen it already, but didn't, so let's make sure it happens this week. Let's tell Pijul that this file exists, and let's record it: pijul record -m "watched a movie". Now we get an edit screen: it shows the message, the author, and my author key. That's very inclusive, by the way: I can change my name later to Hanna instead of Hanno and it will still know it was me; I really like that feature. And here I can say "yes, add this line", or I can skip it; if there were multiple lines, this is the interactive recording, so I could leave things out here if I wanted, but I'm not going to do that right now; I'm just going to save it as is.

I think I'm going to watch another movie, so let's edit the movies file again; but this time it's for next week, and I won't be here. I'll be home with my kids, and my kids want to watch The Little Mermaid, so let's add that movie to the list as well and record it: "watch another movie". (A syntax error; I'm not sure what I did... oh, that's what I did, sorry.)

Now, if we ask for the Pijul log, we see all the patches, and what I want to do is take this patch, so I'll copy its hash to my clipboard and change channels to the next-week one: pijul channel switch (I should have known that). There we go. Of course, those patches are not present in this channel, but I can apply the one I copied right now. And look at what happened: it has applied two patches, because it can't apply an addition to a file that wasn't there in the first place. It knows there's a dependent patch and that it needs to apply that one first. I won't be able to show you, but it's just as easy with unrecording: I can unrecord a patch, and Pijul will know which patches are dependents of the one I'm unrecording. That's really nice.

These are all the commands, actually. Another command I like is credit. I really hate git blame; that's so negative. Pijul just says "credit", you know, because developers are creating good stuff. A nice touch by the Pijul team; I like that.
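The demo, reconstructed as a shell session. Command names are from the Pijul 1.0-beta era and may have changed; the channel-creation command (pijul fork) and the patch hash are my assumptions, since the talk doesn't spell them out.

```
$ pijul init demo && cd demo
$ pijul fork next-week               # assumed: create a second channel
$ pijul channel                      # list channels: main, next-week
$ echo "Movie 1" > movies.txt
$ pijul add movies.txt
$ pijul record -m "watched a movie"  # interactive: pick the hunks to include
$ echo "The Little Mermaid" >> movies.txt
$ pijul record -m "watch another movie"
$ pijul log                          # copy the hash of the second patch
$ pijul channel switch next-week
$ pijul apply <HASH>                 # pulls in the dependent first patch too
$ pijul credit movies.txt            # Pijul's friendlier take on git blame
```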
Let's return to the slides to wrap the talk up. Can we use this in production? Well, it was a research project for quite some time, but it reached version 1.0 at the beginning of this year, and they are currently in a beta phase. I'm going to keep track of it and see if I can introduce it in my own projects. So far I've only used it for keeping track of slide decks, working on them with other people collaboratively, and I've really liked it. I think it's quite promising, because it's quite simple to work with patches, and you don't get that familiar feeling of "oh, I've done something wrong with Git and now my repository has gone to hell and I don't know how to fix it". These things happen a lot less often with Pijul.

But of course there are a few drawbacks, and I'll try to plug them into our prediction variables. These were the variables we defined just before the demos of Fossil and Pijul: killer features, hosting platform support, open source community support, and the handicap of the head start. Now that we have seen a bit of Fossil and Pijul, let's score them.

When we talk about features, I gave Pijul the highest score, because I like its fast, patch-based versioning; especially in teams with both technical and non-technical contributors, it could be very good for cooperation, because it's a version control system that doesn't require knowing so much about its internals. With Git, basic usage is fine, but when you do something you didn't intend, you need some bizarre magic command that you can't remember and have to google every time you need it; with Pijul this could be a lot easier. I also like the show-descendants feature of Fossil, and the fact that it focuses on small teams (I think there are more small teams in the world than large ones), but I scored it just a bit lower.

When it comes to hosting capabilities, of course Git wins this one, because, like I said, there are 18 different places on the internet where you can host public Git repositories. For Mercurial repositories there are all right, I guess, seven or eight. Team Foundation has only one (Azure DevOps), and Fossil and Pijul both have one. So they have the potential to grow, but it's not their strong suit right now, and we'll have to wait and see how popular they get.

I think Git and Mercurial have both proven that they are superb at supporting the open source community, and we have to wait and see how this goes with Fossil; although I can tell you already that because Fossil can be hosted by yourself on your own server, that alone will not help it in the open source community, whereas with the Pijul Nest you can share your projects, so there's a bit more potential there. And the dominant products suffer the most from the handicap of the head start, so Git and Mercurial suffer the most from that.

When we add it all together, I think Git and Mercurial come out okay. Team Foundation: I don't think we'll be using much of that anymore, especially now that Azure DevOps supports Git repositories. Fossil stays about the same, and Pijul, I think, will gain some. When we put this into percentages (of course this is a lot of guesswork, but it's the best I could do with the data that I have right now), I think we'll see something like this: Git already has a very high top, and compared to 2021 I think it will grow a bit further. Subversion won't grow; it will only decline, and Mercurial will also decline a bit. Team Foundation will decline too: in my Git course I get a lot of people who are used to Team Foundation and want to learn Git now. And I think Fossil and Pijul will grow a bit. They are good projects, and I think they'll gain some traction in their own communities, though of course we'll have to wait and see whether they gain global traction, so I've put them at six percent and three percent. These are all guesses, but, like I said, it depends very much on whether more hosting platforms decide to support these version control systems. I've sorted the table by popularity for your convenience, right here.

If you're really into the topic, I've compiled a list of articles that I read in preparation for this talk; I'll tweet the slides, so you can read them later if you want.

The basic question at the end of the talk is: now what? What should I do with this information? Well, two of the things I learned by doing this research. First, a lot of projects are nothing like the Linux kernel; I think most of the projects I worked on during my working hours were nothing like the Linux kernel, so why don't you try Fossil for a change and see if you like it? And secondly, Git's snapshotting, and the situations you can find yourself in after doing an operation you didn't intend, might be too technical for the average user, so why don't you try Pijul for a change, maybe for a side project you work on with a few people, and see if you like it? Like I said, you can, because it is production-ready and stable.

It's a short talk, so there's not really much time for public questions, but if you have some, come see me after the talk. I would like to thank you for your attention; have a great conference day.
# Is it time to look past Git?

#git #scm #devops

I clearly remember the rough days of CVS and how liberating the switch to Subversion felt all those years ago. At the time I was pretty sure that nothing would ever get as good as Subversion. As it turns out, I was wrong, and Git showed it to me! So, having learned distributed version control concepts and embraced Git, I was pretty zealous about my newfound super powers. Again, I felt sure that nothing would ever surpass it. Again, it turned out I was wrong.

At the time of this writing, Git's been with us for over a decade and a half. During that time the ecosystem has absolutely exploded. From the rise of GitHub and GitLab to the myriad of new subcommands (just look at all this cool stuff), clearly Git has seen widespread adoption and success.

## So what's wrong with Git?

Well, for starters, I'm not the only person who thinks that Git is simply too hard. While some apparently yearn for the simpler Subversion days, my frustration comes from working with Git itself, some of which others have been pointing out for the last decade.

My frustrations are:

- I find the git-* family of utilities entirely too low-level for comfort. I once read a quote that Git is a version control construction kit more than a tool to be used directly. I think about this quote a lot.
- Git's underlying data model is unintuitive, and I love me a good graph data structure. Conceiving of a project's history as a non-causal treeish arrangement has been really hard for me to explain to folks and often hard for me to think about. Don't get me started on git rebase and friends.
- I can't trust Git's 3-way merge to preserve the exact code I reviewed. More here and here.

Two interesting projects I've discovered recently aim to add a Git-compatible layer on top of Git's on-disk format in order to simplify things: if not to simplify the conceptual model, then at least to provide an alternative UX.

One such project is the Gitless initiative, which has a Python wrapper around Git proper providing far simpler workflows based on some solid research. Unfortunately it doesn't look like Gitless' Python codebase has had active development recently, which doesn't inspire much confidence.

An alternative project to both Git and Gitless is a little Rust project called Jujutsu. According to the project page, Jujutsu aims, at least in part, to preserve compatibility with Git's on-disk format. Jujutsu also boasts its own "native" format, letting you choose between using the jj binary as a front-end for Git data or as a net-new tool.

## What have we learned since Git?

Well, technically one alternative I am going to bring up predates Git by several years, and that's DARCS. Fans of DARCS have written plenty of material on Git's perceived weaknesses. While DARCS' Haskell codebase apparently had some issues, its underlying "change" semantics have remained influential. For example, Pijul is a Rust-based contender currently in beta. It embraces a huge number of the paradigms that made DARCS effective and ergonomic, including the concept of commutative changes. In my estimation, the very concept of change commutation greatly improves any graph-based model of repositories. This led me to have a really wonderful user experience when taking Pijul out for a spin on some smallish local projects.
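To illustrate what commutation buys you, here is a hypothetical Pijul session (the hash is a placeholder): the point is that patches touching independent parts of the tree can be applied or unapplied in any order without changing their identity.

```
$ pijul record -m "A: fix typo in README"      # touches only README
$ pijul record -m "B: add logging to app.rs"   # touches only app.rs
$ pijul unrecord <HASH-OF-A>                   # A commutes past B: removing it
                                               # leaves B's patch identity intact
```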
Fossil styles itself as the batteries-included SCM, and it has long been used as the primary VCS for development of the venerable SQLite project. In trying out Fossil, I would rate its usability and simplicity as being roughly on par with Pijul, plus you get all the extras like a built-in web interface complete with forums, wiki, and an issue tracker.

Ultimately I prefer the Pijul model more than Fossil's, but still can't realistically see either of them replacing Git.

Why I'm not ditching Git (yet)

While my frustrations have led me to look past Git, I've come to the conclusion that I can't ditch it quite yet. Git's dominance is largely cemented and preserved by its expansive ecosystem. It's so much more than the Git integration which ships standard in VSCode, and even more than the ubiquity of platforms like GitHub and GitLab. It comes down to hearts and minds, insofar as folks have spent so much time "getting good at Git" that they lack the motivation to switch right now. Coupled with the pervasiveness of Git in software development shops, it's hard to envision any alternatives achieving the requisite critical mass to become a proper success.

Granted, just as Git had a built-in feature to ease migration from Subversion, Fossil has a similar feature to help interface with Git. That certainly helps, but I honestly don't know if it'll be enough.

Top comments (2)

jessekphillips (Jul 10 '22): I wish tools would emphasize a better flow. You point to rebase as an issue, but it makes for much more clarity of changes. I have not looked at the tools you noted, but I tend to find that these things go back to basics rather than empower. People should utilize git as a way to craft a change, not just saving some arbitrary state of the source code.

sasham1 (Jan 24): Hi Jonathan, you might want to check out Diversion version control - we just launched on HN. Would love to hear your thoughts!
# Pijul for Git users

## Introduction

Pijul is an in-development distributed version control system that implements repositories using a different [model][model] than Git's. This model enables [cool features][why-pijul] and avoids [common problems][bad-merge] which we are going to explore in this tutorial. (The good stuff appears from the "Maybe we don't need branches" section.) It is assumed that the reader uses Git on a daily basis, i.e. knows not only how to commit, push and pull but also has had the need to cherry-pick commits and rebase branches at least once.

## Creating a repo

```
$ mkdir pijul-tutorial
$ cd pijul-tutorial
$ pijul init
```

Nothing new here.

## Adding files to the repository

Just like in Git, after we create a file it must be explicitly added to the repository:

```
$ pijul add <files>
```

There's a difference though: in Pijul a file is just a UNIX file, i.e. directories are also files, so we don't need to create `.keep` files to add *empty* directories to our repos. Try this:

```
$ mkdir a-dir
$ touch a-dir/a-file
$ pijul add a-dir/
$ pijul status
```

The output will be:

```
On branch master

Changes not yet recorded:
  (use "pijul record ..." to record a new patch)

        new file: a-dir

Untracked files:
  (use "pijul add <file>..." to track them)

        a-dir/a-file
```

To add files recursively we must use the `--recursive` flag.

## Signing keys

Pijul can sign patches automatically, so let's create a signing key before we record our first patch:

```
$ pijul key gen --signing
```

The key pair will be located in `~/.local/share/pijul/config`. At the moment the private key is created without a password, so treat it with care.

## Recording patches

From the user perspective this is the equivalent of Git's commit operation, but it is interactive by default:

```
$ pijul record
added file a-dir
Shall I record this change? (1/2) [ynkadi?] y
added file a-dir/a-file
Shall I record this change? (2/2) [ynkadi?] y
What is your name <and email address>? Someone's name
What is the name of this patch? Add a dir and a file
Recorded patch 6fHCAzzT5UYCsSJi7cpNEqvZypMw1maoLgscWgi7m5JFsDjKcDNk7A84Cj93ZrKcmqHyPxXZebmvFarDA5tuX1jL
```

Here `y` means yes, `n` means no, `k` means undo and remake the last decision, `a` means include this and all remaining patches, `d` means include neither this patch nor the remaining patches, and `i` means ignore this file locally (i.e. it is added to `.pijul/local/ignore`).

Let's change `a-file`:

```
$ echo Hello > a-dir/a-file
$ pijul record
In file "a-dir/a-file"

+ Hello

Shall I record this change? (1/1) [ynkad?] y
What is the name of this patch? Add a greeting
Recorded patch 9NrFXxyNATX5qgdq4tywLU1ZqTLMbjMCjrzS3obcV2kSdGKEHzC8j4i8VPBpCq8Qjs7WmCYt8eCTN6s1VSqjrBB4
```

## Ignoring files

We saw that when recording a patch we can choose to locally ignore a file, but we can also create a `.pijulignore` or `.ignore` file in the root of our repository and record it. All those files accept the same patterns as a `.gitignore` file.

Just like in Git, if we want to ignore a file that was recorded in a previous patch we must remove that file from the repository.

## Removing files from the repository

```
$ pijul remove <files>
```

The files will be shown as untracked again whether they were recorded with a previous patch or not, so this has the effect of `git reset <files>` or `git rm --cached` depending on the previous state of these files.

## Removing a patch

```
$ pijul unrecord
```

This command is interactive. Alternatively, one can use `pijul unrecord <patch>` to remove one or more patches, knowing their hash. Patch hashes can be obtained with `pijul log`.
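For instance, a minimal session might look like this, continuing the repository from the earlier steps (the hash is an illustrative placeholder for whatever `pijul log` prints, and only the commands documented above are used):

```
$ pijul log                  # list patches and their hashes
$ pijul unrecord 9NrFXxyN... # remove the "Add a greeting" patch
$ pijul status               # the edit shows up as not yet recorded again
$ pijul record               # recording it again restores the same state
```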
Unrecording and recording the same patch again will leave the repository in the same state.

There are cases where a patch depends on a previous one. For example, if a patch edits (and only edits) file A, it will depend on the patch that created that file. We can see these dependencies with `pijul dependencies`, and they are managed automatically. This is why `pijul unrecord <patch>` might sometimes refuse to work.

## Discarding changes

```
$ pijul revert
```

This is like `git checkout` applied to files (instead of branches).

## Branches

To create a new branch we use the `pijul fork <branch-name>` command, and to switch to another branch we use `pijul checkout <branch-name>`.

To apply a patch from another branch we use the `pijul apply <patch-hash>` command. Notice that this doesn't produce a different patch with a different hash like `git cherry-pick` does.

Finally, to delete a branch we have the `delete-branch` subcommand, but:

## Maybe we don't need branches

Because in Git each commit is related to a parent (except for the first one), branches are useful to avoid mixing up unrelated work. We don't want our history to look like this:

```
* More work for feature 3
|
* More work for feature 1
|
* Work for feature 3
|
* Work for feature 2
|
* Work for feature 1
```

And if we need to push a fix for a bug ASAP, we don't want to also push commits that are still a work in progress, so we create branches for every new feature and work on them in isolation.

But in Pijul patches usually commute: in the same way that 3 + 4 + 8 produces exactly the same result as 4 + 3 + 8, if we apply patch B to our repo before we apply patch A and then C, the result will be exactly the same as what our coworkers will get if they apply patch A before patch C and then patch B. Now if patch C has a dependency called D (as we saw in "Removing a patch") they cannot commute, but the entire graph commutes with other patches, i.e. if I apply patch A before patch B and then patches CD, I will get the same repository state as if I applied patch B before patches CD and then patch A. So Alice could have the same history as in the previous example while Bob could have

```
* More work for feature 1
|
* Work for feature 2
|
* More work for feature 3
|
* Work for feature 3
|
* Work for feature 1
```

And the repos would be equivalent; that is, the files would be the same. Why is that useful?

We can start working on a new feature without realizing that it is actually a new feature and that we need a new branch. We can create all the patches we need for that feature (e.g. the patches that implement it, the patches that fix the bugs introduced by it, and the patches that fix typos) in whatever order we want. Then we can unrecord these patches and record them again as just one patch, without a rebase. (There's actually no rebase operation in Pijul.)

But this model really shines when we start to work with:

## Remotes

At the moment, pushing works over SSH; the server only needs to have Pijul installed. [The Nest][nest] is a free service that hosts public repositories.
We can reuse our current SSH key pair or create a new pair with

```
$ pijul key gen --ssh
```

This new key pair will be stored in the same directory used for the signing keys, and we can add it to [The Nest][nest] like we do with SSH keys in GitHub.

Now that we have an account on [The Nest][nest] we can upload our signing key with `pijul key upload`.

Now let's push something:

```
$ pijul push <our-nest-user-name>@nest.pijul.com:<our-repo>
```

Unless we pass the `--all` flag, Pijul will ask us which patches we want to push. So we can keep a patch locally, unrecord it, record it again, decide that actually we don't need it and kill it forever, or push it a year later when we finally decide that the world needs it. All without branches.

If we don't want to specify the remote every time we push, we can set it as default with the `--set-default` flag.

Of course, to pull changes we have the `pijul pull` command.

Both commands have `--from-branch` (source branch), `--to-branch` (destination branch) and `--set-remote` (create a local name for the remote) options.

BTW, if we can keep patches for ourselves, can we pull only the patches we want? Yes, that's called "partial clone". [It was introduced in version 0.11][partial-clone] and works like this:

```
$ pijul pull --path <patch-hash> <remote>
```

Of course it will bring the patch and all its dependencies.

As we have seen, we need neither branches, cherry-picking nor rebasing, because of the patch theory behind Pijul.

## Contributing with a remote

With Pijul we don't need forking either. The steps to contribute to a repo are:

1. Clone it with `pijul clone <repo-url>`
2. Make some patches!
3. Go to the page of the repo in [The Nest][nest] and open a new discussion
4. [The Nest][nest] will create a branch with the number of the discussion as a name
5. Push the patches with `pijul push <our-user-name>@nest.pijul.com:<repo-owner-user-name>/<repo-name> --to-branch :<discussion-number>`

Then the repo owner can apply our patches to the master branch.

You can also attach patches from your repos to a discussion when you create or participate in one.

## Tags

A tag in Pijul is a patch that specifies that all the previous patches depend on each other to recreate the current state of the repo.

To create a tag we have the `pijul tag` command, which will ask for a tag name.

After new patches are added to the repo, we can recreate the state of any tag by creating a new branch:

```
$ pijul fork --patch <hash-of-the-tag> <name-of-the-new-branch>
```

Because tags are just patches, we can look for their hashes with `pijul log`.

## In summary

Forget about bad merges, feature branches, rebasing and conflicts produced by merges after cherry-picking.

## Learning more

[Pijul has an on-line manual][manual] but currently it is a little bit outdated. The best way to learn more is by executing `pijul help`. This will list all the subcommands, and we can read more about any of them by running `pijul help <subcommand>`.

The subcommands are interactive by default, but we can pass data to them directly from the command line to avoid being asked questions. All these options are explained in each subcommand's help.

For more information on the theory behind Pijul refer to Joe Neeman's [blog post on the matter][theory]. He also wrote a post [that explains how Pijul implements it][implementation].

## A work in progress

As we said, Pijul is an in-development tool: [the UI could change in the future][ui-changes] and there are some missing features. ([Something][bisect] like `bisect` would be super helpful.)
But that's also an opportunity: the developers seem quite open to [receiving feedback][discourse].

[bad-merge]: https://tahoe-lafs.org/~zooko/badmerge/simple.html
[bisect]: https://discourse.pijul.org/t/equivalent-of-checking-out-an-old-commit/176
[discourse]: https://discourse.pijul.org
[implementation]: https://jneem.github.io/pijul/
[nest]: https://nest.pijul.com
[manual]: https://pijul.org/manual/
[model]: https://pijul.org/model/
[partial-clone]: https://pijul.org/posts/2018-11-20-pijul-0.11/
[theory]: https://jneem.github.io/merging/
[ui-changes]: https://discourse.pijul.org/t/equivalent-of-checking-out-an-old-commit/176
[why-pijul]: https://pijul.org/manual/why_pijul.html
# Designing Data Structures for Collaborative Apps

Matthew Weidner | Feb 10th, 2022

Keywords: collaborative apps, CRDTs, composition

I've put many of the ideas from this post into practice in a library, Collabs. You can learn more about Collabs, and see how open-source collaborative apps might work in practice, in my Local-First Web talk: Video, Slides, Live demo.

Suppose you're building a collaborative app, along the lines of Google Docs/Sheets/Slides, Figma, Notion, etc. One challenge you'll face is the actual collaboration: when one user changes the shared state, their changes need to show up for every other user. For example, if multiple users type at the same time in a text field, the result should reflect all of their changes and be consistent (identical for all users).

Conflict-free Replicated Data Types (CRDTs) provide a solution to this challenge. They are data structures that look like ordinary data structures (maps, sets, text strings, etc.), except that they are collaborative: when one user updates their copy of a CRDT, their changes automatically show up for everyone else. Each user sees their own changes immediately, while under the hood, the CRDT broadcasts a message describing the change to everyone else. Other users see the change once they receive this message.

Figure: CRDTs broadcast messages to relay changes.

Note that multiple users might make changes at the same time, e.g., both typing at once. Since each user sees their own changes immediately, their views of the document will temporarily diverge. However, CRDTs guarantee that once the users receive each others' messages, they'll see identical document states again: this is the definition of CRDT correctness. Ideally, this state will also be "reasonable", i.e., it will incorporate both of their edits in the way that the users expect.

In distributed systems terms, CRDTs are Available, Partition tolerant, and have Strong Eventual Consistency.

CRDTs work even if messages might be arbitrarily delayed, or delivered to different users in different orders. This lets you make collaborative experiences that don't need a central server, work offline, and/or are end-to-end encrypted (local-first software).

Figure: Google Docs doesn't let you type while offline; CRDTs allow offline editing, unlike Google Docs.

I'm particularly excited by the potential for open-source collaborative apps that anyone can distribute or modify, without requiring app-specific hosting.

# The Challenge: Designing CRDTs

Having read all that, let's say you choose to use a CRDT for your collaborative app. All you need is a CRDT representing your app's state, a frontend UI, and a network of your choice (or a way for users to pick the network themselves). But where do you get a CRDT for your specific app?

If you're lucky, it's described in a paper, or even better, implemented in a library. But those tend to have simple or one-size-fits-all data structures: maps, text strings, JSON, etc. You can usually rearrange your app's state to make it fit in these CRDTs; and if users make changes at the same time, CRDT correctness guarantees that you'll get some consistent result. However, it might not be what you or your users expect.
Worse, you have little leeway to customize this behavior.

Figure: an anomaly in many map CRDTs. In a collaborative todo-list, representing items with "title" and "done" fields, concurrently deleting an item and marking it done results in a nonsense item {"done": true} with no "title" field. (Image credit: Figure 6 by Kleppmann and Beresford.)

Likewise, in a collaborative spreadsheet, concurrently moving a column and entering data might delete the entered data. If a user asks you to change some behavior that comes from a CRDT library, but you don't understand the library inside and out, then it will be a hard fix.

This blog post will instead teach you how to design CRDTs from the ground up. I'll present a few simple CRDTs that are obviously correct, plus ways to compose them together into complicated whole-app CRDTs that are still obviously correct. I'll also present principles of CRDT design to help guide you through the process. To cap it off, we'll design a CRDT for a collaborative spreadsheet.

Ultimately, I hope that you will gain not just an understanding of some existing CRDT designs, but also the confidence to tweak them and create your own!

# Basic Designs

I'll start by going over some basic CRDT designs.

# Unique Set CRDT

Our foundational CRDT is the Unique Set. It is a set in which each added element is considered unique.

Formally, the user-facing operations on the set, and their collaborative implementations, are as follows:

- add(x): Adds an element e = (t, x) to the set, where t is a unique new tag, used to ensure that (t, x) is unique. To implement this, the adding user generates t, e.g., as a pair (device id, device-specific counter), then serializes (t, x) and broadcasts it to the other users. The receivers deserialize (t, x) and add it to their local copy of the set.
- delete(t): Deletes the element e = (t, x) from the set. To implement this, the deleting user serializes t and broadcasts it to the other users. The receivers deserialize t and remove the element with tag t from their local copy, if it has not been deleted already.

Figure: the lifecycle of an add or delete operation. In response to user input, the operator calls "Output message"; the message is then delivered to every user's "Receive & Update display" function.

When displaying the set to the user, you ignore the tags and just list out the data values x, keeping in mind that (1) they are not ordered (at least not consistently across different users), and (2) there may be duplicates.

Example: In a collaborative flash card app, you could represent the deck of cards as a Unique Set, using x to hold the flash card's value (e.g., its front and back strings). Users can edit the deck by adding a new card or deleting an existing one, and duplicate cards are allowed.
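To make the add/delete mechanics concrete, here is a minimal TypeScript sketch of a Unique Set replica. The names (`UniqueSet`, `Tag`, `broadcast`) are hypothetical, not from any particular library, and the network layer is assumed to deliver messages reliably and in causal order, as discussed next.

```
// Minimal Unique Set sketch. Assumes a network layer that delivers
// messages reliably and in causal order (see below).
type Tag = string;
type Message<X> =
  | { kind: "add"; tag: Tag; value: X }
  | { kind: "delete"; tag: Tag };

class UniqueSet<X> {
  private elements = new Map<Tag, X>();
  private counter = 0;

  constructor(
    private readonly replicaID: string,
    private readonly broadcast: (msg: Message<X>) => void
  ) {}

  add(value: X): void {
    // Unique tag: (device id, device-specific counter).
    const tag = `${this.replicaID}:${this.counter++}`;
    const msg: Message<X> = { kind: "add", tag, value };
    this.receive(msg);   // apply locally...
    this.broadcast(msg); // ...and relay to everyone else
  }

  delete(tag: Tag): void {
    const msg: Message<X> = { kind: "delete", tag };
    this.receive(msg);
    this.broadcast(msg);
  }

  // Called for both local and remote messages.
  receive(msg: Message<X>): void {
    if (msg.kind === "add") this.elements.set(msg.tag, msg.value);
    else this.elements.delete(msg.tag); // no-op if already deleted
  }

  values(): X[] {
    return [...this.elements.values()]; // unordered, may contain duplicates
  }
}
```

Causal delivery guarantees that a delete message never arrives before its corresponding add, which is why `receive` needs no other bookkeeping.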
When broadcasting messages, we require that they are delivered reliably and in causal order, but it's okay if they are arbitrarily delayed. (These rules apply to all CRDTs, not just the Unique Set.) Delivery in causal order means that if a user sends a message m after receiving or sending a message m', then all users delay receiving m until after receiving m'. This is the strictest ordering we can implement without a central server and without extra round-trips between users, e.g., by using vector clocks.

Messages that aren't ordered by the causal order are concurrent, and different users might receive them in different orders. But for CRDT correctness, we must ensure that all users end up in the same state regardless, once they have received the same messages.

For the Unique Set, it is obvious that the state of the set, as seen by a specific user, is always the set of elements for which they have received an add message but no delete messages. This holds regardless of the order in which they received concurrent messages. Thus the Unique Set is correct.

Note that delivery in causal order is important: a delete operation only works if it is received after its corresponding add operation.

We now have our first principle of CRDT design:

Principle 1. Use the Unique Set CRDT for operations that "add" or "create" a unique new thing.

Although it is simple, the Unique Set forms the basis for the rest of our CRDTs.

Aside. Traditionally, one proves CRDT correctness by proving that concurrent messages commute, i.e., they have the same effect regardless of delivery order (Shapiro et al. 2011), or that the final state is a function of the causally-ordered message history (Baquero, Almeida, and Shoker 2014). However, as long as you stick to the techniques in this blog post, you won't need explicit proofs: everything builds on the Unique Set in ways that trivially preserve CRDT correctness. For example, a deterministic view of a Unique Set (or any CRDT) is obviously still a CRDT.

Aside. I have described the Unique Set in terms of operations and broadcast messages, i.e., as an operation-based CRDT. However, with some extra metadata, it is also possible to implement a merge function for the Unique Set, in the style of a state-based CRDT. Or, you can perform VCS-style 3-way merges without needing extra metadata.

# List CRDT

Our next CRDT is a List CRDT. It represents a list of elements, with insert and delete operations. For example, you can use a List CRDT of characters to store the text in a collaborative text editor, using insert to type a new character and delete for backspace.

Formally, the operations on a List CRDT are:

- insert(i, x): Inserts a new element with value x at index i, between the existing elements at indices i and i+1. All later elements (index >= i+1) are shifted one to the right.
- delete(i): Deletes the element at index i. All later elements (index >= i+1) are shifted one to the left.

We now need to decide on the semantics, i.e., what is the result of various insert and delete operations, possibly concurrent. The fact that insertions are unique suggests using a Unique Set (Principle 1). However, we also have to account for indices and the list order.

One approach would use indices directly: when a user calls insert(i, x), they send (i, x) to the other users, who use i to insert x at the appropriate location. The challenge is that your intended insertion index might move around as a result of users' inserting/deleting in front of i.

Figure: "The *gray* cat jumped on **the** table." Alice typed " the" at index 17, but concurrently, Bob typed " gray" in front of her. From Bob's perspective, Alice's insert should happen at index 22.

It's possible to work around this by "transforming" i to account for concurrent edits. That idea leads to Operational Transformation (OT), the earliest-invented approach to collaborative text editing, and the one used in Google Docs and most existing apps.
Unfortunately, OT algorithms are quite complicated, leading to numerous flawed algorithms. You can reduce complexity by using a central server to manage the document, like Google Docs does, but that precludes decentralized networks, end-to-end encryption, and server-optional open-source apps.

List CRDTs use a different perspective from OT. When you type a character in a text document, you probably don't think of its position as "index 17" or whatever; instead, its position is at a certain place within the existing text.

"A certain place within the existing text" is vague, but at a minimum, it should be between the characters left and right of your insertion point ("on" and " table" in the example above). Also, unlike an index, this intuitive position doesn't change if other users concurrently type earlier in the document; your new text should go between the same characters as before. That is, the position is immutable.

This leads to the following implementation. The list's state is a Unique Set whose values are pairs (p, x), where x is the actual value (e.g., a character), and p is a unique immutable position drawn from some abstract total order. The user-visible state of the list is the list of values x ordered by their positions p. Operations are implemented as:

- insert(i, x): The inserting user looks up the positions pL, pR of the values to the left and right (indices i and i+1), generates a unique new position p such that pL < p < pR, and calls add((p, x)) on the Unique Set.
- delete(i): The deleting user finds the Unique Set tag t of the value at index i, then calls delete(t) on the Unique Set.

Of course, we need a way to create the positions p. That's the hard part (in fact, the hardest part of any CRDT) and I don't have space to go into it here; you should use an existing algorithm (e.g., RGA) or implementation (e.g., Yjs's Y.Array). Update: I've since published two libraries for creating and using CRDT-style positions in TypeScript: list-positions (most efficient), position-strings (most flexible). Both use the Fugue algorithm.

The important lesson here is that we had to translate indices (the language of normal, non-CRDT lists) into unique immutable positions (what the user intuitively means when they say "insert here"). That leads to our second principle of CRDT design:

Principle 2. Express operations in terms of user intention, i.e., what the operation means to the user, intuitively. This might differ from the closest ordinary data type operation.

This principle works because users often have some idea what one operation should do in the face of concurrent operations. If you can capture that intuition, then the resulting operations won't conflict.

# Registers

Our last basic CRDT is the Register. This is a variable that holds an arbitrary value that can be set and get. If multiple users set the value at the same time, you pick one of them arbitrarily, or perhaps average them together.

Example uses for Registers:

- The font size of a character in a collaborative rich-text editor.
- The name of a document.
- The color of a specific pixel in a collaborative whiteboard.
- Basically, anything where you're fine with users overwriting each others' concurrent changes and you don't want to use a more complicated CRDT.

Registers are very useful and suffice for many tasks (e.g., Figma and Hex use them almost exclusively).

The only operation on a Register is set(x), which sets the value to x (in the absence of concurrent operations).
We can't perform these operations literally, since if two users receive concurrent set operations in different orders, they'll end up with different values.

However, we can add the value x to a Unique Set, following Principle 1. The state is now a set of values instead of a single value, but we'll address that soon. We can also delete old values each time set(x) is called, overwriting them.

Thus the implementation of set(x) becomes: for each element e in the Unique Set, call delete(e) on the Unique Set; then call add(x). The result is that at any time, the Register's state is the set of all the most recent concurrently-set values.

Loops of the form "for each element of a collection, do something" are common in programming. We just saw a way to extend them to CRDTs: "for each element of a Unique Set, do some CRDT operation". I call this a causal for-each operation because it only affects elements that are prior to the for-each operation in the causal order. It's useful enough that we make it our next principle of CRDT design:

Principle 3a. For operations that do something "for each" element of a collection, one option is to use a causal for-each operation on a Unique Set (or List CRDT).

(Later we will expand on this with Principle 3b, which also concerns for-each operations.)

Returning to Registers, we still need to handle the fact that our state is a set of values, instead of a specific value.

One option is to accept this as the state, and present all conflicting values to the user. That gives the Multi-Value Register (MVR).

Another option is to pick a value arbitrarily but deterministically. E.g., the Last-Writer Wins (LWW) Register tags each value with the wall-clock time when it is set, then picks the value with the latest timestamp.

Figure: In Pixelpusher, a collaborative pixel art editor, each pixel shows one color by default (LWW Register), but you can click to pop out all conflicting colors (MVR). (Image credit: Peter van Hardenberg.)

In general, you can define the value getter to be an arbitrary deterministic function of the set of values. Examples:

- If the values are colors, you can average their RGB coordinates. That seems like fine behavior for pixels in a collaborative whiteboard.
- If the values are booleans, you can choose to prefer true values, i.e., the Register's value is true if its set contains any true values. That gives the Enable-Wins Flag.

# Composing CRDTs

We now have enough basic CRDTs to start making more complicated data structures through composition. I'll describe three techniques: CRDT Objects, CRDT-Valued Maps, and collections of CRDTs.

# CRDT Objects

The simplest composition technique is to use multiple CRDTs side-by-side. By making them instance fields in a class, you obtain a CRDT Object, which is itself a CRDT (trivially correct). The power of CRDT Objects comes from using standard OOP techniques, e.g., implementation hiding.

Examples:

- In a collaborative flash card app, to make individual cards editable, you could represent each card as a CRDT Object with two text CRDT (List CRDT of characters) instance fields, one for the front and one for the back.
- You can represent the position and size of an image in a collaborative slide editor by using separate Registers for the left, top, width, and height (sketched below).
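As an illustration, here is a minimal sketch of an LWW Register plus a CRDT Object composing four of them for the image example. Class and field names are hypothetical, and the broadcast/tagging plumbing (described next) is elided; a real implementation would also discard entries that are causally overwritten by a remote set.

```
// LWW Register sketch: the state is a set of (timestamp, value) entries;
// set(x) overwrites all causally prior entries locally, and the getter
// deterministically picks the entry with the latest wall-clock timestamp.
class LWWRegister<X> {
  private entries: { time: number; value: X }[] = [];

  set(value: X): void {
    // Local effect only; the broadcast to other users is elided here.
    this.entries = [{ time: Date.now(), value }];
  }

  // Called when a concurrent set() arrives from another user.
  receiveConcurrentSet(time: number, value: X): void {
    this.entries.push({ time, value });
  }

  get(): X | undefined {
    let best: { time: number; value: X } | undefined;
    for (const e of this.entries) {
      if (best === undefined || e.time > best.time) best = e;
    }
    return best?.value;
  }
}

// CRDT Object for the image example: independent Registers for
// independent properties, following Principle 4 below.
class ImageCRDT {
  readonly left = new LWWRegister<number>();
  readonly top = new LWWRegister<number>();
  readonly width = new LWWRegister<number>();
  readonly height = new LWWRegister<number>();
}
```

Swapping the getter for one that returns all entries would turn this same state into a Multi-Value Register.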
To implement a CRDT Object, each time an instance field requests to broadcast a message, the CRDT Object broadcasts that message tagged with the field's name. Receivers then deliver the message to their own instance field with the same name.

# CRDT-Valued Map

A CRDT-Valued Map is like a CRDT Object but with potentially infinite instance fields, one for each allowed map key. Every key/value pair is implicitly always present in the map, but values are only explicitly constructed in memory as needed, using a predefined factory method (like Apache Commons' LazyMap).

Examples:

- Consider a shared notes app in which users can archive notes, then restore them later. To indicate which notes are normal (not archived), we want to store them in a set. A Unique Set won't work, since the same note can be added (restored) multiple times. Instead, you can use a CRDT-Valued Map whose keys are the documents and whose values are Enable-Wins Flags; the value of the flag for key doc indicates whether doc is in the set. This gives the Add-Wins Set.
- Quill lets you easily display and edit rich text in a browser app. In a Quill document, each character has an attributes map, which contains arbitrary key-value pairs describing formatting (e.g., "bold": true). You can model this using a CRDT-Valued Map with arbitrary keys and LWW Register values; the value of the Register for key attr indicates the current value for attr.

A CRDT-Valued Map is implemented like a CRDT Object: each message broadcast by a value CRDT is tagged with its serialized key. Internally, the map stores only the explicitly-constructed key-value pairs; each value is constructed using the factory method the first time it is accessed by the local user or receives a message. However, this is not visible externally: from the outside, the other values still appear present, just in their initial states. (If you want an explicit set of "present" keys, you can track them using an Add-Wins Set.)

# Collections of CRDTs

Our above definition of a Unique Set implicitly assumed that the data values x were immutable and serializable (capable of being sent over the network). However, we can also make a Unique Set of CRDTs, whose values are dynamically-created CRDTs.

To add a new value CRDT, a user sends a unique new tag and any arguments needed to construct the value. Each recipient passes those arguments to a predefined factory method, then stores the returned CRDT in their copy of the set. When a value CRDT is deleted, it is forgotten and can no longer be used.

Note that unlike in a CRDT-Valued Map, values are explicitly created (with dynamic constructor arguments) and deleted: the set effectively provides collaborative new and free operations.

We can likewise make a List of CRDTs.

Examples:

- In a shared folder containing multiple collaborative documents, you can define your document CRDT, then use a Unique Set of document CRDTs to model the whole folder. (You can also use a CRDT-Valued Map from names to documents, but then documents can't be renamed, and documents "created" concurrently with the same name will end up merged.)
- Continuing the Quill rich-text example from the previous section, you can model a rich-text document as a List of "rich character CRDTs", where each "rich character CRDT" consists of an immutable (non-CRDT) character plus the attributes map CRDT. This is sufficient to build a simple but inefficient Google Docs-style app with CRDTs.
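Before moving on, here is a minimal sketch of the lazy-construction idea behind the CRDT-Valued Map described above. The names are hypothetical and the per-key message routing is elided.

```
// Lazy CRDT-Valued Map sketch: values are constructed on first use by a
// factory, so every key implicitly "exists" in its initial state.
class CRDTValuedMap<K, C> {
  private explicit = new Map<string, C>();

  constructor(
    private readonly factory: () => C,
    private readonly serialize: (key: K) => string
  ) {}

  // get() never returns undefined: unseen keys get a fresh value CRDT in
  // its initial state, matching the "implicitly always present" rule.
  get(key: K): C {
    const k = this.serialize(key);
    let value = this.explicit.get(k);
    if (value === undefined) {
      value = this.factory();
      this.explicit.set(k, value);
    }
    return value;
  }

  // Incoming messages from other users are routed by serialized key;
  // receiving a message likewise constructs the value if needed.
}
```

For example, the Add-Wins Set above is just `new CRDTValuedMap(() => new EnableWinsFlag(), serializeDoc)`, assuming an `EnableWinsFlag` CRDT and a document serializer.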
# Using Composition

You can use the above composition techniques and basic CRDTs to design CRDTs for many collaborative apps. Choosing the exact structure, and how operations and user-visible state map onto that structure, is the main challenge.

A good starting point is to design an ordinary (non-CRDT) data model, using ordinary objects, collections, etc., then convert it to a CRDT version. So variables become Registers, objects become CRDT Objects, lists become List CRDTs, sets become Unique Sets or Add-Wins Sets, etc. You can then tweak the design as needed to accommodate extra operations or fix weird concurrent behaviors.

To accommodate as many operations as possible while preserving user intention, I recommend:

Principle 4. Independent operations (in the user's mind) should act on independent state.

Examples:

- As mentioned earlier, you can represent the position and size of an image in a collaborative slide editor by using separate Registers for the left, top, width, and height. If you wanted, you could instead use a single Register whose value is a tuple (left, top, width, height), but this would violate Principle 4. Indeed, then if one user moved the image while another resized it, one of their changes would overwrite the other, instead of both moving and resizing.
- Again in a collaborative slide editor, you might initially model the slide list as a List of slide CRDTs. However, this provides no way for users to move slides around in the list, e.g., swap the order of two slides. You could implement a move operation using cut-and-paste, but then slide edits concurrent to a move will be lost, even though they are intuitively independent operations. Following Principle 4, you should instead implement move operations by modifying some state independent of the slide itself. You can do this by replacing the List of slide CRDTs with a Unique Set of CRDT Objects { slide, positionReg }, where positionReg is an LWW Register indicating the position. To move a slide, you create a unique new position like in a List CRDT, then set the value of positionReg equal to that position. This construction gives the List-with-Move CRDT.

# New: Concurrent+Causal For-Each Operations

There's one more trick I want to show you. Sometimes, when performing a for-each operation on a Unique Set or List CRDT (Principle 3a), you don't just want to affect existing (causally prior) elements. You also want to affect elements that are added/inserted concurrently.

For example:

- In a rich text editor, if one user bolds a range of text, while concurrently, another user types in the middle of the range, the latter text should also be bolded. In other words, the first user's intended operation is "for each character in the range, including ones inserted concurrently, bold it".
- In a collaborative recipe editor, if one user clicks a "double the recipe" button, while concurrently, another user edits an amount, then their edit should also be doubled. Otherwise, the recipe will be out of proportion, and the meal will be ruined!

I call such an operation a concurrent+causal for-each operation. To accommodate the above examples, I propose the following addendum to Principle 3a:

Principle 3b. For operations that do something "for each" element of a collection, another option is to use a concurrent+causal for-each operation on a Unique Set (or List CRDT).

To implement this, the initiating user first does a causal (normal) for-each operation. They then send a message describing how to perform the operation on concurrently added elements.
The receivers apply the operation to any concurrently added elements they've received already (and haven't yet deleted), then store the message in a log. Later, each time they receive a new element, they check if it's concurrent to the stored message; if so, they apply the operation.

Aside. It would be more general to split Principle 3 into "causal for-each" and "concurrent for-each" operations, and indeed, this is how the previous paragraph describes it. However, I haven't yet found a good use case for a concurrent for-each operation that isn't part of a concurrent+causal for-each.

Concurrent+causal for-each operations are novel as far as I'm aware. They are based on a paper I, Heather Miller, and Christopher Meiklejohn wrote last year, about a composition technique we call the semidirect product, which can implement them (albeit in a confusing way). Unfortunately, the paper doesn't make clear what the semidirect product is doing intuitively, since we didn't understand this ourselves! My current opinion is that concurrent+causal for-each operations are what it's really trying to do; the semidirect product itself is (a special case of) an optimized implementation, improving memory usage at the cost of simplicity.

# Summary: CRDT Design Techniques

That's it for our CRDT design techniques. Before continuing to the spreadsheet case study, here is a summary cheat sheet.

Start with our basic CRDTs: Unique Set, List CRDT, and Registers. Compose these into steadily more complex pieces using CRDT Objects, CRDT-Valued Maps, and Collections of CRDTs. When choosing basic CRDTs or how to compose things, keep in mind these principles:

- Principle 1. Use the Unique Set CRDT for operations that "add" or "create" a unique new thing.
- Principle 2. Express operations in terms of user intention, i.e., what the operation means to the user, intuitively. This might differ from the closest ordinary data type operation.
- Principle 3(a, b). For operations that do something "for each" element of a collection, use a causal for-each operation or a concurrent+causal for-each operation on a Unique Set (or List CRDT).
- Principle 4. Independent operations (in the user's mind) should act on independent state.

# Case Study: A Collaborative Spreadsheet

Now let's get practical: we're going to design a CRDT for a collaborative spreadsheet editor (think Google Sheets).

As practice, try sketching a design yourself before reading any further. The rest of this section describes how I would do it, but don't worry if you come up with something different; there's no one right answer! The point of this blog post is to give you the confidence to design and tweak CRDTs like this yourself, not to dictate "the one true spreadsheet CRDT™".

# Design Walkthrough

To start off, consider an individual cell. Fundamentally, it consists of a text string. We could make this a Text (List) CRDT, but usually, you don't edit individual cells collaboratively; instead, you type the new value of the cell, hit enter, and then its value shows up for everyone else. This suggests instead using a Register, e.g., an LWW Register.

Besides the text content, a cell can have properties like its font size, whether word wrap is enabled, etc. Since changes to these properties are all independent operations, following Principle 4, they should have independent state. This suggests using a CRDT Object to represent the cell, with a different CRDT instance field for each property.
In pseudocode (using extends CRDTObject to indicate CRDT Objects):

```
class Cell extends CRDTObject {
  content: LWWRegister<string>;
  fontSize: LWWRegister<number>;
  wordWrap: EnableWinsFlag;
  // ...
}
```

The spreadsheet itself is a grid of cells. Each cell is indexed by its location (row, column), suggesting a map from locations to cells. (A 2D list could work too, but then we'd have to put rows and columns on an unequal footing, which might cause trouble later.) Thus let's use a Cell-CRDT-Valued Map.

What about the map keys? It's tempting to use conventional row-column indicators like "A1", "B3", etc. However, then we can't easily insert or delete rows/columns, since doing so renames other cells' indicators. (We could try making a "rename" operation, but that violates Principle 2, since it does not match the user's original intention: inserting/deleting a different row/column.)

Instead, let's identify cell locations using pairs (row, column), where "row" means "the line of cells horizontally adjacent to this cell", independent of that row's literal location (1, 2, etc.), and likewise for "column". That is, we create an opaque Row object to represent each row, and likewise for columns, then use pairs (Row, Column) for our map keys.

The word "create" suggests using Unique Sets (Principle 1), although since the rows and columns are ordered, we actually want List CRDTs. Hence our app state looks like:

```
rows: ListCRDT<Row>;
columns: ListCRDT<Column>;
cells: CRDTValuedMap<[row: Row, column: Column], Cell>;
```

Now you can insert or delete rows and columns by calling the appropriate operations on columns and rows, without affecting the cells map at all. (Due to the lazy nature of the map, we don't have to explicitly create cells to fill a new row or column; they implicitly already exist.)

Speaking of rows and columns, there's more we can do here. For example, rows have editable properties like their height, whether they are visible, etc. These properties are independent, so they should have independent states (Principle 4). This suggests making Row into a CRDT Object:

```
class Row extends CRDTObject {
  height: LWWRegister<number>;
  isVisible: EnableWinsFlag;
  // ...
}
```

Also, we want to be able to move rows and columns around. We already described how to do this using a List-with-Move CRDT:

```
class MovableListEntry<C> extends CRDTObject {
  value: C;
  positionReg: LWWRegister<UniqueImmutablePosition>;
}

class MovableListOfCRDTs<C> extends CRDTObject {
  state: UniqueSetOfCRDTs<MovableListEntry<C>>;
}

rows: MovableListOfCRDTs<Row>;
columns: MovableListOfCRDTs<Column>;
```

Next, we can also perform operations on every cell in a row, like changing the font size of every cell. For each such operation, we have three options:

- Use a causal for-each operation (Principle 3a). This will affect all current cells in the row, but not any cells that are created concurrently (when a new column is inserted). E.g., a "clear" operation that sets every cell's value to "".
- Use a concurrent+causal for-each operation (Principle 3b). This will affect all current cells in the row and any created concurrently. E.g., changing the font size of a whole row.
- Use an independent state that affects the row itself, not the cells (Principle 4). E.g., our usage of Row.height for the height of a row.
# Finished Design

In summary, the state of our spreadsheet is as follows.

```
// ---- CRDT Objects ----

class Row extends CRDTObject {
  height: LWWRegister<number>;
  isVisible: EnableWinsFlag;
  // ...
}

class Column extends CRDTObject {
  width: LWWRegister<number>;
  isVisible: EnableWinsFlag;
  // ...
}

class Cell extends CRDTObject {
  content: LWWRegister<string>;
  fontSize: LWWRegister<number>;
  wordWrap: EnableWinsFlag;
  // ...
}

class MovableListEntry<C> extends CRDTObject {
  value: C;
  positionReg: LWWRegister<UniqueImmutablePosition>;
}

class MovableListOfCRDTs<C> extends CRDTObject {
  state: UniqueSetOfCRDTs<MovableListEntry<C>>;
}

// ---- App state ----

rows: MovableListOfCRDTs<Row>;
columns: MovableListOfCRDTs<Column>;
cells: CRDTValuedMap<[row: Row, column: Column], Cell>;
```

Note that I never explicitly mentioned CRDT correctness, i.e., the claim that all users see the same document state after receiving the same messages. Because we assembled the design from existing CRDTs using composition techniques that preserve CRDT correctness, it is trivially correct. Plus, it should be straightforward to reason out what would happen in various concurrency scenarios.

As exercises, here are some further tweaks you can make to this design, phrased as user requests:

- "I'd like to have multiple sheets in the same document, accessible by tabs at the bottom of the screen, like in Excel." Hint (highlight to reveal): Use a List of CRDTs.
- "I've noticed that if I change the font size of a cell, while at the same time someone else changes the font size for the whole row, sometimes their change overwrites mine. I'd rather keep my change, since it's more specific." Hint: Use a Register with a custom getter.
- "I want to reference other cells in formulas, e.g., = A2 + B3. Later, if B3 moves to C3, its references should update too." Hint: Store the reference as something immutable.

# Conclusion

I hope you've gained an understanding of how CRDTs work, plus perhaps a desire to apply them in your own apps. We covered a lot:

- Traditional CRDTs: Unique Set, List/Text, LWW Register, Enable-Wins Flag, Add-Wins Set, CRDT-Valued Map, and List-with-Move.
- Novel operations: concurrent+causal for-each operations on a Unique Set or List CRDT.
- Whole apps: spreadsheet, rich text, and pieces of various other apps.

For more info, crdt.tech collects most CRDT resources in one place. For traditional CRDTs, the classic reference is Shapiro et al. 2011, while Preguiça 2018 gives a more modern overview.

I've also put many of these ideas into practice in a library, Collabs. You can learn more about Collabs, and see how open-source collaborative apps might work in practice, in my Local-First Web talk: Video, Slides, Live demo.

# Related Work

This blog post's approach to CRDT design, using simple CRDTs plus composition techniques, draws inspiration from a number of sources. Most similar is the way Figma and Hex describe their collaboration platforms; they likewise support complex apps by composing simple, easy-to-reason-about pieces. Relative to those platforms, I incorporate more academic CRDT designs, enabling more flexible behavior and server-free operation.

The specific CRDTs I describe are based on Shapiro et al. 2011 unless noted otherwise. Note that they abbreviate "Unique Set" to "U-Set".
For composition techniques, a concept analogous to CRDT Objects appears in BloomL; CRDT-Valued Maps appear in Riak and BloomL; and the Collections of CRDTs are inspired by how Yjs's Y.Array and Y.Map handle nested CRDTs.

Similar to how I build everything on top of a Unique Set CRDT, Mergeable Replicated Data Types are all built on top of a (non-unique) set with a 3-way merge function, but in a quite different way.

Other systems that allow arbitrary nesting of CRDTs include Riak, Automerge, Yjs, and OWebSync.

# Acknowledgments

I thank Heather Miller, Ria Pradeep, and Benito Geordie for numerous CRDT design discussions that led to these ideas. Jonathan Aldrich, Justine Sherry, and Pratik Fegade reviewed a version of this post that appears on the CMU CSD PhD Blog. I am funded by an NDSEG Fellowship sponsored by the US Office of Naval Research.

# Appendix: What's Missing?

The CRDT design techniques I describe above are not sufficient to reproduce all published CRDT designs, or more generally, all possible Strong Eventual Consistency data structures. This is deliberate: I want to restrict to (what I consider) reasonable semantics and simple implementations.

In this section, I briefly discuss some classes of CRDTs that aren't covered by the above design techniques.

# Optimized Implementations

Given a CRDT designed using the above techniques, you can often find an alternative implementation that has the same semantics (user-visible behavior) but is more efficient.

For example, suppose you are counting clicks by all users. According to the above techniques, you should use a Unique Set CRDT and add a new element to the set for each click (Principle 1), then use the size of the set as the number of clicks. But there is a much more efficient implementation: merely store the number of clicks, and each time a user clicks, send a message instructing everyone to increment their number. This is the classic Counter CRDT.
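A minimal sketch of that optimization (hypothetical names, network plumbing elided):

```
// Classic Counter CRDT sketch: increments commute, so replicas can
// apply them in any order and still converge on the same count.
class Counter {
  private count = 0;

  constructor(private readonly broadcast: (msg: "increment") => void) {}

  increment(): void {
    this.receive("increment");   // apply locally
    this.broadcast("increment"); // relay to everyone else
  }

  receive(_msg: "increment"): void {
    this.count++;
  }

  get value(): number {
    return this.count;
  }
}
```

This has the same user-visible behavior as the Unique Set version (the count of all clicks ever made), but stores one integer instead of one element per click.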
Peritext is a rich-text CRDT that allows formatting operations to also affect concurrently inserted characters, like our example above. Instead of using concurrent+causal for-each operations, it stores formatting info at the start and end of the range, then does some magic to make sure that everything works correctly. This is much more efficient than applying formatting to every affected character like in our example, especially for memory usage.

My advice is to start with an implementation using the techniques in this blog post. That way, you can pin down the semantics and get a proof-of-concept running quickly. Later, if needed, you can make an alternate, optimized implementation that has the exact same user-visible behavior as the original (enforced by mathematical proofs, unit tests, or both).

Doing things in this order (above techniques, then optimize) should help you avoid some of the difficulties of traditional, from-scratch CRDT design. It also ensures that your resulting CRDT is both correct and reasonable. In other words: beware premature optimization! (3 out of 4 Peritext authors mention that it was difficult to get working: here, here, and here.)

One way to optimize is with a complete rewrite at the CRDT level. For example, relative to the rich text CRDT that we sketched above (enhanced with concurrent+causal for-each operations), Peritext looks like a complete rewrite. (In reality, Peritext came first.)

Another option is to "compress" the CRDT's state/messages in some way that is easy to map back to the original CRDT. That is, in your mind (and in the code comments), you are still using the CRDT derived from this blog post, but the actual code operates on some optimized representation of the same state/messages.

For example, in the rich text CRDT sketched above, if storing separate formatting registers for each character uses too much memory, you could compress the state by deduplicating the identical formatting entries that result when a user formats a range of text. Then the next time you receive an operation, you decompress the state, apply that operation, and recompress. Likewise, if formatting a range of characters individually generates too much network traffic (since there is one CRDT message per character), you could instead send a single message that describes the whole formatting operation, then have recipients decompress it to yield the original messages.

# Optimized Semantics

Some existing CRDTs deliberately choose behavior that may look odd to users, but has efficiency benefits. The techniques in this blog post don't always allow you to construct those CRDTs.

The example that comes to my mind concerns what to do when one user deletes a CRDT from a CRDT collection, while concurrently, another user makes changes to that CRDT. E.g., one user deletes a presentation slide while someone else is editing the slide. There are a few possible behaviors:

- "Deleting" semantics: The deletion wins, and concurrent edits to the deleted CRDT are lost. This is the semantics we adopt for the Collections of CRDTs above; I attribute it to Yjs. It is memory-efficient (deleted CRDTs are not kept around forever), but can lose data.
- "Archiving" semantics: CRDTs are never actually deleted, only archived. Concurrent edits to an archived CRDT apply to that CRDT as usual, and if the new content is interesting, users can choose to un-archive. We described how to do this using the Add-Wins Set above. This is the nicest semantics for users, but it means that once created, a CRDT stays in memory forever.
- "Resetting" semantics: A delete operation "resets" its target CRDT, undoing the effect of all (causally) prior operations. Concurrent edits to the deleted CRDT are applied to this reset state. E.g., if a user increments a counter concurrent to its deletion, then the resulting counter value will be 1, regardless of what its state was before deletion. This semantics is adopted by the Riak Map and a JSON CRDT paper, but it is not possible using the techniques in this blog post. It is memory-efficient (deleted CRDTs are not kept around forever) and does not lose data, but has weird behavior. E.g., if you add some text to a slide concurrent to its deletion, then the result is a slide that has only the text you just entered. This is also the cause of the map anomaly shown earlier.

I am okay with not supporting odd semantics because user experience seems like a first priority. If your desired semantics results in poor performance (e.g. "archiving" semantics leading to unbounded memory usage), you can work around it once it becomes a bottleneck, e.g., by persisting some state to disk.

# Hard Semantics

Some CRDTs seem useful and not prematurely optimized, but I don't know how to implement them using the techniques in this blog post. Two examples:

- A Tree CRDT with a "move" operation, suitable for modeling a filesystem in which you can cut-and-paste files and folders. The tricky part is that two users might concurrently move folders inside of each other, creating a cycle, and you have to somehow resolve this (e.g., pick one move and reject the other).
  Papers: Kleppmann et al., Nair et al.

- A CRDT for managing group membership, in which users can add or remove other users. The tricky part is that one user might remove another, concurrent to that user taking some action, and then you have to decide whether to allow it or not. This is an area of active research, but there have been some proposed CRDTs, none of which appear to fit into this blog post.

Aside. There is a sense in which the Unique Set CRDT (hence this blog post) is "CRDT-complete", i.e., it can be used to implement any CRDT semantics: you use a Unique Set to store the complete operation history together with causal ordering info, then compute the state as a function of this history, like in pure op-based CRDTs. However, this violates the spirit of the blog post, which is to give you guidance on how to design your CRDT.

# Beyond CRDTs

CRDTs, and more broadly Strong Eventual Consistency, are not everything. Some systems, including some collaborative apps, really need Strong Consistency: the guarantee that events happen in a serial order, agreed upon by everyone. E.g., monetary transactions. So you may need to mix CRDTs with strongly consistent data structures; there are a number of papers about such "Mixed Consistency" systems, e.g., RedBlue consistency.
# Lightning-fast rebases with git-move

Oct 12, 2021

You can use git move as a drop-in 10x faster replacement for git rebase (see the demo). The basic syntax is

```
$ git move -b <branch> -d <dest>
```

How do I install it? The git move command is part of the git-branchless suite of tools. See the installation instructions.

What does "rebase" mean? In Git, to "rebase" a commit means to apply a commit's diff against its parent commit as a patch to another target commit. Essentially, it "moves" the commit from one place to another.

How much faster is it? See Timing. If the branch is currently checked out, then 10x is a reasonable estimate. If the branch is not checked out, then it's even faster.

Is performance the only added feature? git move also offers several other quality-of-life improvements over git rebase. For example, it can move entire subtrees, not just branches. See the git move documentation for more information.

Timing

I tested on the Git mirror of Mozilla's gecko-dev repository. This is a large repository with ~750k commits and ~250k working copy files, so it's good for stress tests.

It takes about 10 seconds to rebase 20 commits with git rebase, versus about 1 second with git move. These timings are not scientific, and there are optimizations that can be applied to both, but the order of magnitude is roughly correct in my experience.

Since git move can operate entirely in-memory, it can also rebase branches which aren't checked out. This is much faster than using git rebase, because it doesn't have to touch the working copy at all.

Why is it faster?

There are two main problems with the Git rebase process:

1. It touches disk.
2. It uses the index data structure to create tree objects.

With a stock Git rebase, you have to check out to the target commit, and then apply each of the commits' contents individually to disk. After each commit's application to disk, Git will implicitly check the status of files on disk again. This isn't strictly necessary for many rebases, and can be quite slow on sizable repos.

When Git is ready to apply one of the commits, it first populates the "index" data structure, which is essentially a sorted list of all of the files in the working copy. It can be expensive for Git to convert the index into a "tree" object, which is used to store commits internally, as it has to insert or re-insert many already-existing entries into the object database. (There are some optimizations that can improve this, such as the cache tree extension.)

Work is already well underway on upstream Git to support the features which would make in-memory rebases feasible, so hopefully we'll see mainstream Git enjoy similar performance gains in the future.

What about merge conflicts?

If an in-memory rebase produces a merge conflict, git move will cancel it and restart it as an on-disk rebase, so that the user can resolve merge conflicts. Since in-memory rebases are typically very fast, this doesn't usually impede the developer experience.

Of course, it's possible in principle to resolve merge conflicts in-memory as well.

Related work

In-memory rebases are not a new idea:

- GitUp (2015), a GUI client for Git with a focus on manipulating the commit graph. Unfortunately, in my experience, it doesn't perform too well on large repositories. To my knowledge, no other Git GUI client offers in-memory rebases.
What about merge conflicts?

If an in-memory rebase produces a merge conflict, git move will cancel it and restart it as an on-disk rebase, so that the user can resolve the merge conflicts. Since in-memory rebases are typically very fast, this doesn't usually impede the developer experience. Of course, it's possible in principle to resolve merge conflicts in-memory as well.

Related work

In-memory rebases are not a new idea:

GitUp (2015), a GUI client for Git with a focus on manipulating the commit graph. Unfortunately, in my experience, it doesn't perform too well on large repositories. To my knowledge, no other Git GUI client offers in-memory rebases; please let me know of others, so that I can update this comparison.

git-revise (2019), a command-line utility which allows various in-memory edits to commits. git-revise is a replacement for git rebase -i, not git rebase. It can reorder commits, but it isn't intended to move commits from one base to another. See Interactive rebase below.

Other source control systems have in-memory rebases, such as Mercurial and Jujutsu.

The goal of the git-branchless project is to improve developer velocity with various features that can be incrementally adopted by users, such as in-memory rebases. Performance is an explicit feature: it's designed to work with monorepo-scale codebases.

Interactive rebase

Interactive rebase (git rebase -i) is a feature which can be used to modify, reorder, combine, etc. several commits in sequence. git move does not do this at present, but this functionality is planned for a future git-branchless release. Watch the GitHub repository to be notified of new releases. In the meantime, you can use git-revise. Unfortunately, git-branchless and git-revise do not interoperate well, due to git-revise's lack of support for the post-rewrite hook (see this issue).

Related posts

The following are hand-curated posts which you might find interesting.

19 Jun 2021: git undo: We can do better
12 Oct 2021: (this post) Lightning-fast rebases with git-move
19 Oct 2022: Build-aware sparse checkouts
16 Nov 2022: Bringing revsets to Git
05 Jan 2023: Where are my Git UI features from the future?
11 Jan 2024: Patch terminology

Comments: Discussion on Lobsters
# Document Title
Steno & PL

git undo: We can do better

Jun 19, 2021

Update for future readers: Are you looking for a way to undo something with Git? The git undo command won't help with your current issue (it needs to be installed ahead of time), but it can make dealing with future issues a lot easier. Try installing git-branchless, and then see the documentation for git undo.

Motivation

Git is a version control system with robust underlying principles, and yet novice users are terrified of it. When they make a mistake, many would rather delete and re-clone the repository than try to fix it. Even proficient users can find wading through the reflog tedious.

Why? How is it so easy to "lose" your data in a system that's supposed to never lose your data?

Well, it's not that it's too easy to lose your data, but rather that it's too difficult to recover it. For each operation you want to recover from, there's a different "magic" incantation to undo it. All the data is still there in principle, but it's not accessible to many in practice.

Here's my theory: novice and intermediate users would significantly improve their understanding and efficacy with Git if they weren't afraid of making mistakes.

Solution

To address this problem, I offer git undo, part of the git-branchless suite of tools. To my knowledge, this is the most capable undo tool currently available for Git. For example, it can undo bad merges and rebases with ease, and there are even some rare operations that git undo can undo which can't be undone with git reflog.

Update 2021-06-21: user gldnspud on Hacker News points out that the GitUp client also supports undo/redo via snapshots, also by adding additional plumbing on top of Git.

I've presented demos below, and briefly discussed the implementation at the end of the article. My hope is that by making it easier to fix mistakes, novice Git users will be able to experiment more freely and learn more effectively.

Demos

Undoing an amended commit, and undoing a merge conflict that was resolved wrongly. (The original post embeds recordings of both.)

Implementation

git undo is made possible by a recent addition to Git: the reference-transaction hook. This hook triggers whenever a change is made to a reference, such as a branch. By recording all reference moves, we can rebuild the state of the commit graph at any previous point in time. Then we accomplish the undo operation by restoring all references to their previous positions in time (possibly creating or deleting references in the process).

I originally built git-branchless in order to replicate a certain Mercurial workflow, but the data structures turned out to be flexible enough to give us a git undo feature nearly for free. You can find more detail at the Architecture page for the project.
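The hook's interface is small. Here is a minimal sketch of a reference-transaction hook that just logs every committed reference move; it illustrates the mechanism described above, and is not git-branchless's actual implementation:

#!/bin/sh
# .git/hooks/reference-transaction
# Git invokes this hook with the transaction state ("prepared", "committed",
# or "aborted") as $1, and feeds one "<old-oid> <new-oid> <ref-name>" line
# per updated reference on stdin.
[ "$1" = "committed" ] || exit 0
while read -r old new ref; do
    printf '%s %s %s\n' "$old" "$new" "$ref" >> "${GIT_DIR:-.git}/ref-moves.log"
done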
Related posts

The following are hand-curated posts which you might find interesting.

19 Jun 2021: (this post) git undo: We can do better
12 Oct 2021: Lightning-fast rebases with git-move
19 Oct 2022: Build-aware sparse checkouts
16 Nov 2022: Bringing revsets to Git
05 Jan 2023: Where are my Git UI features from the future?
11 Jan 2024: Patch terminology

Comments: Discussion on Hacker News

# Document Title
Darcs Because Git Won (talk transcript)

Hi everyone. This is a re-recording of the talk I wanted to give for BOB Conf; we had a couple of technical issues, so I decided to do a re-recording so we can have the talk as it was intended. So yeah, this is "Darcs because Git won". Admittedly that's a little bit clickbaity, but it's meant to be a bit tongue-in-cheek, so bear with me.

First of all, who am I? I'm Raichoo. I've been in the Haskell community for around 10 years now, and I've been writing software for 30 years: lots of C++, lots of Java, but 10 years ago I found this programming language called Haskell, which I enjoy quite a lot, and I've been using it for lots of stuff since then. I'm the author of various open source projects; some of you might know haskell-vim, a plugin for Vim that does syntax highlighting and indentation and things like that, which for some reason has become quite popular and which people seem to enjoy. I speak regularly at conferences, so if you look me up on YouTube you will find a couple of talks I've done over the years. I'm also co-founder of a hackspace in Bielefeld called Acme Labs. Hackspaces are these loose clubhouses, if you want, where people meet and do projects: art, software, whatever comes to mind. And I'm also the co-founder and CTO of a software consultancy in Bielefeld; we write software for companies that want certain problems solved, and we deliver that. Well, at least that's the intention. Maybe you've heard of us: we also did the Advent of Haskell 2020, which I think was quite popular and which a lot of people enjoyed; it produced a couple of interesting blog posts that you can still read online.

So yeah, I said the talk is a little bit clickbaity, so I want to talk about what this talk is not. This talk is not "Git bad, Darcs good". I don't want to convince you to switch everything to Darcs and abandon Git; I just want to present an alternative to what is basically the mainstream at the moment, because I think Darcs is a very interesting version control system. It does things in a genuinely different way: it's not just a different UI for the same concept, it is conceptually very different. This talk is also not a transition guide. I can't just take a Git workflow and transform it into a Darcs workflow; workflows grow organically. This talk will only give you hints at what Darcs can do; maybe that fits your workflow, maybe it doesn't, and you will have to find ways to make it work for you if you're interested.

Okay, let's get started. Well, not quite: first we have to talk a little bit about Darcs' image. When you hear about Darcs, there are a couple of quite common misconceptions about the system.

Some people consider it to be dead. I can assure you it's not: just recently we had a major release, 2.16, and we're now at version 2.16.3. It's very active, it's still being developed by very clever people, and it's alive and kicking.

Another thing I hear quite a lot is that people complain that it's slow, and that statement doesn't necessarily make sense on its own: when you say something is slow, you have to compare it to something. When you compare it to Git, it's certainly slower; Git is pretty much built around the idea that it has to be optimized
in every nook and cranny; there are optimizations everywhere. To me personally, it really isn't that big of a deal if an operation takes 500 milliseconds instead of 20 milliseconds. That might be an issue for you, but with most of the projects I'm working on, the version control system is not the bottleneck. The bottleneck is how I use it and what kind of workflows it enables, and to me that's way more important than shaving three seconds off my push.

Another thing I hear quite a lot about is the exponential merge problem, and people talk about that a lot on forums and communication platforms. That has been an issue with earlier versions of Darcs, where it was quite prominent. The issue is that when you're merging things you can run into exponential runtime, which is certainly not something you would desire, but the Darcs people have put a lot of work into finding those edge cases, fixing them, and making the algorithm and its performance much better. Personally, I've been using Darcs for years and I never encountered the problem. There are certainly still places where these edge conditions can happen, but they are exceptionally rare. And that brings me back to the idea that Darcs is dead: the developers are working on a new patch theory whose goal is to eliminate all those edge cases and make merging more performant. That's very interesting ongoing work, and I'm going to link you to the paper that lays the foundation for it.

This one is my favorite: "everyone knows how to use Git". What does that even mean? Whenever I hear that, I think of this, and please don't get me wrong: you can do the same things with Darcs or any other version control system. To me, the notion that you "know how to use" something is pretty much like saying, of a programming language, "I know the syntax". What's more important is that you can produce something meaningful with that language, and to me it's the same with version control. Yes, you can use the version control system: you can create patches, you can push stuff. But the most important thing is how to build meaningful patches, especially in a working context. I know that's a hard thing if you're doing a prototype, and I admit I'm sometimes guilty of doing just that, especially in a prototype situation where we're not 100% sure what the end result is going to be. To me, a version control system has to really give you something back when you're crafting patches in a meaningful way; you have to be able to use them as well as possible, and Darcs pretty much delivers that. I hope I can illustrate it later in the demo.

So now we're really getting started: what's the difference anyway? From the outside looking in, it might look like Darcs is just another version control system with a slightly different UI, where underneath everything works completely the same. This is the point where you start to differentiate between a snapshot-based version control system, something like Git or Mercurial, and a patch-based version control system. This notion might seem a little bit weird because it centers on patches.
What does that mean? I will show you in the demo, but first let's look at a situation with four characters: we have Alice, Bob, Charlie, and Eve. Alice and Bob both have a patch A; Charlie has been developing a patch B, a new feature or documentation or whatever you want; and Eve has been working on a patch C.

As it turns out, Alice and Charlie are working closely together. They are on a team together, but at the moment they're not in the same office; they have to work remotely. There are different situations where that might occur, like a global pandemic. Those two communicate on a regular basis, so Alice knows about Charlie's patch, and she's eager to try it, so she pulls that patch into her repository. A couple of minutes later she hears about Eve's patch. Eve is not on her team, so they don't communicate frequently, but now Alice thinks: that's actually something I want to pull in as well, because the work I'm currently doing would benefit from it. So she pulls that in too. On the other side of the city, or the planet, Bob suddenly realizes that Eve, who he's working closely with, has written patch C, and he's like: great, I've been waiting for that patch for a long time, and pulls it in. A couple of minutes later he hears about Charlie's patch and thinks: I should pull that in as well. So now Alice's repository has the patches in the order she pulled them in, A, B, C, and Bob has A, C, B, and you can maybe sense that something interesting is going on here.

How do their repositories look now? With Git, we would have something like this: we start at a common starting point; Alice pulls in Charlie's feature, she pulls in Eve's feature, and then she merges. What you can see is that the head commits of the two repositories differ, because Charlie's and Eve's patches diverged from the common starting point and now have to be brought back together, and those merge commits are created by the people who pulled in the patches. It's not a fast-forward; something has to be merged. What's happening is that when you pull in patches in a different order, things turn out very different. It's like standing in front of a banana and an apple: if you take the apple first and then the banana, you end up in a different state than if you take the banana first and then the apple.

Now I want to show you how Darcs deals with the same situation. It's demo time; hopefully this is going to work out. I'm going to start with a clean slate: there's nothing there, and I'm going to make a repository "alice", go into it, and show you how Darcs works. I'm on my remote machine at the moment. I've initialized the repository, and you can see this _darcs folder, which is where Darcs does all its bookkeeping. Now we're going to write a little file (and I want to apologize for the color scheme); so this is the file.
the repoandthis file is now being tracked and wewant to recordum our initial patchinitial record yeah and now you can seethe docs workflow is like superinteractive yeswhat changes do we want to record yes wehave added a fileand we've added the content so darks isbasically prompting us of what we wantto do which islike super helpful if you want to domore complex operationsso if you want to take a look at therepository now we can see thatwe have recorded a patch so now let'sjustcreate all thoseall those repositories that we have hadbefore we've createdbob we create charlie and wecreate so all these repositories are nowin the same stateso let's go to um let's go to tocharlie first and uhbuild charlie's patch here and whichonly basically does issomething like super simple it's not akeyboard i know so it's a little bitweirdoh we just add a to the top of the fileand nowwe can ask darts what's new and it saysokay atline one we are going to add character aso let's record thatyes yes this is charlieand let's do the same thing with eveso what eve is doing she's essentiallyjust going to add bto the end of the fileyes we're going to add that and this isso let's go back to alice and reproducethe workflow that we just hadsoalice is going to pull from charliefirstso yeah and now we can look at the fileexcuse me you can say we can see thepatch has been appliedso do the same thing to eand we can see those two files have beenmergedso yeah if we look at the history we cannow seeum we've pulled from charlie first thenwe pull from eve and what's importanthereis those hashes those those hashes areessentially the identityof of what's going on hereso of the identity of the patch to bemore precise so let's go to boband pull we do that the other way in abrown the first thing is we do we pullfrom eve becausethat's who charlie's been working withand we pull thatpull that patch and let's take a look atthatso let's let's let's let's look at thosethings side by side let's go into umalice's repository and the darks loghereand here you can see so this is this iswhatcharlie is seeing uh excuse me aboutwhat bob is seeingand you can see that even thoughso this um just to be precise herewhat's happening here is a cherry pickdart is basically constantly doingcherry picks and mergingand even though i've cherry picked thispatchfrom um from charlie's repository fromexcuse me from if's repositoryum they actually have the samethe same identity the same hash eventhoughthey have been pulled in differentorders that have been cherry picked indifferent orders so what darksdoes is that it doesn't change a patchidentity when it does cherry pickingwhich is like immensely neat i thinkso let's do the same thing with charlieyes yes and as you can seeit's the same thing you can even dosomething like we can get to go fromthat repository andumdear test test repositoryand it just to demonstrate the cherrypicking property here if wepull from let's say alice she's got allthe patchesum yes i want to take the initial recorddo i want to take charlie'sno but i want to take uselet's look at that let's first look atthe fileoh come on yeah and now you can seeeven though i've cherry picked the bjust the bpatch and left the a patch behind umi can still end up with result andthis thing this patch this has still thesame identityso how can we see that torepositories have the same statethat's that's the important part we canlook at show repoand here is this thing which is calledthe weak hash and the weak hashhas is computed about all the set of theset of all 
Something else I find quite nice is that Darcs can produce graph output (I'm usually using different tools for this), and here you can see what's actually happening: Darcs can figure out that these two patches do not actually depend on each other, even in a situation where you're in the same repository and recording both patches yourself. The patch that Charlie wrote, which just added "a" to the file, only depends on the initial record being there, and it's the same with the change that added "b". So we can pick and choose, and this becomes even more interesting when there are many more patches and features in the repository: we can pick and choose all the different things we want and look at how they interact with each other. Of course, we can also state dependencies explicitly, saying "I want this patch to depend on that patch", but I'm not going to show that here. Darcs basically treats a repository as a set of patches, and I mean set in the mathematical sense: the order of the elements doesn't matter; if I have {B, A} and {A, B}, it's the same set.

So let's talk a little bit about how Darcs does that: let's talk about patch theory. We Haskellers all like our theory a lot, but let me get this straight: you don't need to know the theory to enjoy or be able to work with Darcs. It's the same as with Haskell: you don't have to know category theory to know Haskell. If you went to a category theorist and explained, from a Haskell perspective, what a monad is, they would say "that's not a monad, that's just a very special case". Even the theory we use in Haskell is not exactly the real deal, and there are very passionate discussions about that. So even though the theory is intellectually stimulating, it's not mandatory.

If we're talking about patch theory, let's talk about what a patch is. A patch A takes a repository from some state o to state a; that's what it does: adding a file, writing content to a file, that's pretty simple. If we have two patches, and patch A takes a repository from o to a, and patch B takes it from a to b, we can compose them. We're Haskellers, that's what we like, we compose things all the time, and so does patch theory: we can sequence those patches up.

We also want to be able to invert a patch, to take changes back: if we added a line, we want to be able to get rid of that line. So for every patch there should be an inverse: if patch A goes from o to a, A inverted goes from a to o, so we can go back.

And we should be able to commute patches. This is a little bit more tricky (and I apologize for the notation), but it's important for understanding how Darcs does what I just showed you. Take patches A and B that do not depend on each other. What does that mean? If patch A adds a file and patch B edits the content of that file, of course we can't swap those operations around: you can't edit the content of a file that hasn't been added yet. But in the situation we just had, patch A adds something to the top of the file, at line one, and patch B adds something further down, say at line six. Those we can swap around. The subscripts in the notation mean that the swapped patches are not literally the same: if we swap them, B's addition shifts up a line, so B1 now adds at line five, while A1 still adds at line one. The patches are equivalent (A to A1, B to B1): they represent the same change, even though one adds at line six and its counterpart at line five. What's important is that the context we end up in is identical: applying A then B ends in the same state as applying B1 then A1.

So how does a merge work, then? A merge takes two patches made in parallel and brings them together. Say patch B takes the repository from state a to b, and patch C takes it from a to c. We can't just concatenate C after B, because C starts from state a and we are now in a repository at state b. But with the mechanisms we have, we can do it: first we invert B, so we're back in state a; then we can apply C. The information of B is still in there, even though it has been reverted. Now we commute B inverted and C, ending up with C1 and B1 inverted, and then we simply throw away the B1 part. And voilà: we have a merge; we ended up with the state we wanted, with both patches pulled in. The nice thing is that we can now reason algebraically about patches, which I find really exciting.
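Written out in the talk's notation, the merge of two patches recorded against the same context goes like this; this is a sketch of the construction as described in the talk, and Darcs' actual patch theory is more careful about contexts and about when commutation is defined:

\begin{align*}
B &: a \to b, \qquad C : a \to c && \text{(two patches from a common context)}\\
B^{-1} &: b \to a && \text{(invert $B$ to step back to the common context)}\\
B^{-1}C &: b \to c && \text{(a valid sequence, but it undoes $B$)}\\
B^{-1}C &\longleftrightarrow C_1 B_1^{-1} && \text{(commute the pair; $C_1$ now applies on top of $B$)}\\
\mathrm{merge}(B, C) &= B\,C_1 && \text{(keep $C_1$, discard $B_1^{-1}$)}
\end{align*}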
Another thing that's super exciting to me is that merging is symmetric: it doesn't matter in which order I pull in those patches. The good thing about this: we first started using Darcs in our hackspace, and hackspaces are very loose groups of hackers working together. Not everyone works against a single central repository; everyone picks and pulls from everyone else. In this situation it's super useful that you don't have to maintain the same order. We can work together with a very loose workflow and still pull things together the way we want to, neglect some patches, and take the ones that seem fit for our current situation. That's pretty neat; I enjoy it quite a lot.

So what applications does this have? I have a Wayland compositor project, about 20,000 lines of C code, so a reasonably large project, and I keep three repositories for it. I use the model that Darcs enables quite heavily: I have a "current" repository where I pull in all the patches; all the work happens there. (On the slide I highlighted in red the part of the version number that each repository's releases increase.) Whenever there's a new feature that doesn't break functionality or introduce breaking changes, I pull it over to the "stable" repository, and whenever I have a bug fix that doesn't introduce any new features, I pull it over to "release". This is essentially the workflow we're using at my company as well. It has proven to be very solid, and it lets me do some very fluent and, to me, fun release engineering; I enjoy it way more than I did with Git, but that's a personal preference.
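As commands, that flow looks roughly like this; a sketch, with the repository names taken from the talk, and each pull being an interactive pick:

$ cd stable
$ darcs pull ../current     # take only the patches that don't break anything
$ cd ../release
$ darcs pull ../stable      # take only the bug fixes for the point release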
Another thing: PassVeil is an open source project that we are going to open-source within the next few days. It's essentially a password manager, something like password store. Password store uses Git underneath, and we thought the model of Darcs is super neat if you have a very distributed password manager: lots of people working on lots of different passwords, and you constantly have to merge all those states. Since Darcs is very sophisticated at merging, we thought it would be an interesting foundation for a password manager. We also model things like trust: if I know a password and want to give it to someone else, we can use PassVeil to transfer it, so we don't have to use insecure channels like post-its, chat, or email; we can use the password manager itself to transfer passwords to other people and work together with them. This has also proven quite beneficial, especially in the very distributed situation we are in at the moment.

Now, tooling. If you are using Git, the tooling is great, of course: a lot of people use it, and a lot of people write really great tools. We are in the situation where we have to build a lot of stuff ourselves, which is fun, but it's also work, and it's something we've been doing for the last few months. I built something called darcsgutter, a plugin for Neovim written in Lua that does what git-gutter does: it shows changes, lets us look at the different hunks in a file, jump to hunks, and essentially grep through the project to figure out where the changes are and what has been changed. This helps me a lot when I'm working on a rather large code base and want to craft my patches: this goes in that patch, and this goes in there. Another thing I have is a Darcs prompt for fish, the friendly interactive shell, which is very useful: you can see that a repository has changed, that there are untracked files, and so on, everything you know from a Git prompt. And something else we've been working on is called Darcslab, our hosting platform. We're using it internally at the moment and haven't open-sourced it yet, but we're planning to. It's what we use to manage our repositories, and at the moment it's completely command-line based, which some of you might actually enjoy; we enjoy it quite a
lot. Everything you need is just Darcs and SSH, and it works on all our machines. To us it's great: we don't need a web interface, and we can manage all our privileges and rights management on the command line. That's actually rather exciting, and we are looking forward to open-sourcing it for you.

That pretty much brings us to the end. I have a couple of resources you might find interesting. If you really want to learn Darcs, I have written a little book, a Darcs book, because I didn't like the documentation that was out there and thought writing a book would be a good idea. It's hosted at our hackspace, because I wrote it for the hackspace initially; we use Darcs there, and I thought that if we were going to use it, there had to be decent documentation. It's very simple, baby's first version control system, not super sophisticated stuff: you can start right off the bat even if you've never used a version control system before. And here is the paper I've been talking about, which lays the foundation for the new patch theory that people have been working on. There's an experimental implementation in the latest major release of Darcs: the version 3 patch format is an option, but it's still experimental, so please don't use it for critical code.

That's about it. If you want to get a hold of me, I'm on Twitter, and you can drop me an email. I hope you enjoyed that; if you have any questions, just drop me an email or hit me up on chat. Thanks, bye.
# Document Title
(brought to you by boringcactus)

Can We Please Move Past Git? (22 Feb 2021)

"Git is fundamentally a content-addressable filesystem with a VCS user interface written on top of it." (Pro Git §10.1)

Most software development is not like the Linux kernel's development; as such, Git is not designed for most software development. Like Samuel Hayden tapping the forces of Hell itself to generate electricity, the foundations on which Git is built are overkill on the largest scale, and when the interface concealing all that complexity cracks, the nightmares which emerge can only be dealt with by the Doom Slayer that is Oh Shit Git. Of course, the far more common error handling method is to start over from scratch.

Git is bad. But are version control systems like operating systems, in that they're all various kinds of bad, or is an actually good VCS possible? I don't know, but I can test some things and see what comes up.

Mercurial

Mercurial is a distributed VCS that's around the same age as Git, and I've seen it called the Betamax to Git's VHS, which my boomer friends tell me is an apt analogy, but I'm too young for that to carry meaning. So let me see what all the fuss is about.

Well, I have some bad news. From the download page, under "Requirements": "Mercurial uses Python (version 2.7). Most ready-to-run Mercurial distributions include Python or use the Python that comes with your operating system." Emphasis theirs, but I'd have added it myself otherwise. Python 2 has been dead for a very long time now, and saying you require Python 2 makes me stop caring faster than referring to "GNU/Linux". If you've updated it to Python 3, cool, don't say it uses Python 2. Saying it uses Python 2 makes me think you don't have your shit together, and in fairness, that makes two of us, but I'm not asking people to use my version control system (so far, at least).

You can't be better than Git if you're that outdated. (Although you can totally be better than Git by developing a reputation for having a better UI than Git; word of mouth helps a lot.)

Subversion

I am a fan of subverting things, and I have to respect wordplay. So let's take a look at Subversion (sorry, "Apache® Subversion®").

There are no official binaries at all, and the most-plausible-looking blessed unofficial binary for Windows is TortoiseSVN. I'm looking through the manual, and I must say, the fact that branches and tags aren't actually part of the VCS, but instead conventions on top of it, isn't good. When I want to make a new branch, it's usually "I want to try an experiment, and I want to make it easy to give up on this experiment." Also, I'm not married to the idea of distributed VCSes, but I do tend to start a project well before I've set up server-side infrastructure for it, and Subversion is not designed for that sort of thing at all. So I think I'll pass.

You can't be better than Git if the server setup precedes the client setup when you're starting a new project. (Although you can totally be better than Git by having monotonically-ish increasing revision numbers.)

Fossil

Fossil is kinda nifty: it handles not just code but also issue tracking, documentation authoring, and a bunch of the other things that services like GitHub staple on after the fact. Where Git was designed for the Linux kernel, which has a fuckton of contributors and needs to scale absurdly widely, Fossil was designed for SQLite, which has a very small number of contributors and does not solicit patches. My projects tend to only have one contributor, so this should in principle work fine for me.

However, a few things about Fossil fail to spark joy. The fact that repository metadata is stored as an independent file separate from the working directory, for example, is a design decision that doesn't mesh well with my existing setup. If I were to move my website into Fossil, I would need somewhere to put boringcactus.com.fossil outside of D:\Melody\Projects\boringcactus.com, where the working directory currently resides. The documentation suggests ~/Fossils as a folder in which repository metadata can be stored, but that makes my directory structure uglier. The rationale for doing it this way, instead of having .fossil in the working directory like .git etc., is that multiple checkouts of the same repository are simpler when repository metadata is outside each of them. Presumably the SQLite developers do that sort of thing a lot, but I don't, and I don't know anyone who does, and I've only ever done it once (back in the days when the only way to use GitHub Pages was to make a separate gh-pages branch). Cluttering up my filesystem just so you can support a weird edge case that I don't need isn't a great pitch.
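For readers who haven't used Fossil, the separation looks like this in practice (a sketch with made-up paths):

$ fossil clone https://example.org/project ~/Fossils/project.fossil   # repository: one file
$ mkdir project && cd project
$ fossil open ~/Fossils/project.fossil                                # check-out: populated here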
But sure, let's check this out. The docs have instructions for importing a Git repo into Fossil, so let's follow them:

PS D:\Melody\Projects\boringcactus.com> git fast-export --all | fossil import --git D:\Melody\Projects\misc\boringcactus.com.fossil
]ad fast-import line: [S IN THE

Well, then. You can't be better than Git if your instructions for importing from Git don't actually work. (Although you can totally be better than Git if you can keep track of issues etc. alongside the code.)

Darcs

Darcs is a distributed VCS that's a little different from Git etc. Git etc. have the commit as the fundamental unit on which all else is built, whereas Darcs has the patch as its fundamental unit. This means that a branch in Darcs refers to a set of patches, not a commit. As such, Darcs can be more flexible with its history than Git can: a Git commit depends on its temporal ancestor ("parent"), whereas a Darcs patch depends only on its logical ancestor (e.g. creating a file before adding text to it). This approach also improves the way that some types of merge are handled; I'm not sure how often this sort of thing actually comes up, but the fact that it could in Git is definitely suboptimal.

So that's pretty cool; let's take a look for ourselves. Oh. Well, then. The download page is only served over plain HTTP (there's just nothing listening on that server over HTTPS), and the downloaded binaries are also served over plain HTTP. That's not a good idea. I'll pass, thanks.

You can't be better than Git while serving binaries over plain HTTP. (Although you can totally be better than Git by having nonlinear history and doing interesting things with patches.)

Pijul

Pijul is (per the manual) "the first distributed version control system to be based on a sound mathematical theory of changes. It is inspired by Darcs, but aims at solving the soundness and performance issues of Darcs."

Inspired by Darcs but better, you say? You have my attention. Also of note is that the developers are building their own GitHub clone, which they use to host Pijul itself; that gives a really nice view of how a GitHub clone built on top of Pijul would work, and also offers free hosting.

The manual gives installation instructions for a couple Linuces and OS X, but not Windows, and not Alpine Linux, which is the only WSL distro I have installed.
However, someone involved in the project showed up in my mentions to say that it works on Windows, so we'll just follow the generic instructions and see what happens:

PS D:\Melody\Projects> cargo install pijul --version "~1.0.0-alpha"
Updating crates.io index
Installing pijul v1.0.0-alpha.38
Downloaded <a bunch of stuff>
Compiling <a bunch of stuff>
error: linking with `link.exe` failed: exit code: 1181
= note: "C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\BuildTools\\VC\\Tools\\MSVC\\14.27.29110\\bin\\HostX64\\x64\\link.exe" <lots of bullshit>
= note: LINK : fatal error LNK1181: cannot open input file 'zstd.lib'
error: aborting due to previous error

So it doesn't work for me on Windows. (There's a chance that instructions would help, but in the absence of those, I will simply give up.) Let's try it over on Linux:

UberPC-V3:~$ cargo install pijul --version "~1.0.0-alpha"
<lots of output>
error: linking with `cc` failed: exit code: 1
= note: /usr/lib/gcc/x86_64-alpine-linux-musl/9.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lzstd
/usr/lib/gcc/x86_64-alpine-linux-musl/9.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lxxhash
collect2: error: ld returned 1 exit status
error: aborting due to previous error
UberPC-V3:~$ sudo apk add zstd-dev xxhash-dev
UberPC-V3:~$ cargo install pijul --version "~1.0.0-alpha"
<lots of output again because cargo install forgets dependencies immediately smdh>
Installed package `pijul v1.0.0-alpha.38` (executable `pijul`)

Oh hey, would you look at that, it actually worked, and all I had to do was wait six months for each compile to finish (and make an educated guess about what packages to install). So for the sake of giving back, let's add those instructions to the manual, so nobody else has to bang their head against the wall like I'd done the past few times I tried to get Pijul working for myself.

First, clone the repository for the manual:

UberPC-V3:~$ pijul clone https://nest.pijul.com/pijul/manual
Segmentation fault

Oh my god. That's extremely funny. Oh fuck that's hilarious - I sent that to a friend and her reaction reminded me that Pijul is written in Rust. This VCS so profoundly doesn't work on my machine that it manages to segfault in a language that's supposed to make segfaults impossible. Presumably the segfault came from C code FFI'd with unsafe preconditions that weren't met, but still, that's just amazing.

Update 2021-02-24: One of the Pijul authors reached out to me to help debug things. Apparently mmap on WSL is just broken, which explains the segfault. They also pointed me towards the state of the art in getting Pijul to work on Windows, which I confirmed worked locally, and then set up automated Windows builds using GitHub Actions. So if we have a working Pijul install, let's see if we can add that CI setup to the manual:

PS D:\Melody\Projects\misc> pijul clone https://nest.pijul.com/pijul/manual pijul-manual
✓ Updating remote changelist
✓ Applying changes 47/47
✓ Downloading changes 47/47
✓ Outputting repository

Hey, that actually works! We can throw in some text to the installation page (and more text to the getting started page) and then use pijul record to commit our changes. That pulls up Notepad as the default text editor, which fails to spark joy, but that's a papercut that's entirely understandable for alpha software not primarily developed on this OS.
Instead of having "issues" and "pull requests" as two disjoint things, the Pijul Nest lets you add changes to any discussion, which I very much like. Once we've recorded our change and made a discussion on the repository, we can pijul push boringcactus@nest.pijul.com:pijul/manual --to-channel :34 and it'll attach the change we just made to discussion #34. (It appears to be having trouble finding my SSH keys or persisting known SSH hosts, which means I have to re-accept the fingerprint and re-enter my Nest password every time, but that's not the end of the world.)

So yeah, Pijul definitely still isn't production-ready, but it shows some real promise. That said, you can't be better than Git if you aren't production-ready. (Although you can totally be better than Git by having your own officially-blessed GitHub clone sorted out already.) (And maybe, with time, you can be eventually better than Git.)

what next?

None of the existing VCSes that I looked at were unreservedly better than Git, but they all had aspects that would help beat Git. A tool which is actually better than Git should start by being no worse than Git:

- allow importing existing Git repositories
- don't require Git users to relearn every single thing; we already had to learn Git, we've been through enough

Then, to pick and choose the best parts of other VCSes, it should

- have a UI that's better, or at least perceived as better, than Git's; ideally minimalism and intuitiveness will get you there, but user testing is gonna be the main thing
- avoid opaque hashes as the primary identifier for things (r62 carries more meaning than 7c7bb33), but not at the expense of features that are actually important
- go beyond just source code, and cover issues, documentation wikis, and similar items, so that (for at least the easy cases) the entire state of the project is contained within version control
- approach history as not just a linear sequence of facts but a story
- offer hosting to other developers who want to use your VCS, so they don't have to figure that out themselves to get started in a robust way

And just for kicks, a couple of extra features that nobody has but everybody should:

- the CLI takes a back seat to the GUI (or TUI, I guess): seeing the state gets easier that way, discovering features gets easier that way, teaching people who aren't CLI-literate gets easier that way
- contributor names & emails aren't immutable; trans people exist, and git filter-repo makes it about as difficult to change my name as the state of Colorado did
- if you build in issue/wiki/whatever tracking, also build in CI in some way
- avoid internal jargon: either say things in plain $LANG or develop a consistent and intuitive metaphor and use it literally everywhere

I probably don't have the skills, and I certainly don't have the free time, to build an Actually Good VCS myself. But if you want to, here's what you're aiming for. Good luck. If you can pull it off, you'll be a hero. And if you can't, you'll be in good company.
# Document Title
Some things a potential Git replacement probably needs to provide

Recently there has been renewed interest in revision control systems. This is great, as improvements to tools are always welcome. Git is, sadly, extremely entrenched, and trying to replace it will be an uphill battle. This is not due to technical but social issues. What this means is that approaches like "basically Git, but with a mathematically proven model for X" are not going to fly. While having this extra feature is great in theory, in practice it is not sufficient. The sheer amount of work needed to switch a revision control system, and the ongoing burden of using a niche, nonstandard system, is just too much. People will keep using their existing system.

What would it take, then, to create a system that is compelling enough to make the change? In cases like these you typically need a "big design thing" that makes the new system 10× better in some way and which the old system can not do. Alternatively, the new system needs to have many small things that are better, but then the total improvement needs to be something like 20×, because the human brain perceives things nonlinearly. I have no idea what this "major feature" would be, but below is a list of random things that a potential replacement system should probably handle.

Better server integration

One of Git's design principles was that everyone should have all the history all the time, so that every checkout is fully independent. This is a good feature to have, and one that should be supported by any replacement system. However, it is not how revision control systems are commonly used. 99% of the time developers are working against some sort of centralised server, be it GitLab, GitHub, or a corporation's internal revision control server. The user interface should be designed so that this common case is as smooth as possible.

As an example, let's look at keeping a feature branch up to date. In Git you have to rebase your branch and then force push it. If your branch had any changes you don't have in your current checkout (because they were done on a different OS, for example), they are now gone. In practice you can't have more than one person working on a feature branch because of this (unless you use merges, which you should not do). This should be more reliable. The system should store, somehow, that a rebase has happened and offer to fix out-of-date checkouts automatically. Once the feature branch gets to trunk, it is ok to throw this information away. But not before that.

Another thing one could do is let repository maintainers mandate things like "pull requests must not contain merges from trunk to the feature branch", and the system would then automatically prohibit these. Telling people to remove merges from their pull requests and to use rebase instead is something I have to do over and over again. It would be nice to be able to prohibit the creation of said merges rather than manually detecting and fixing things afterwards.

Keep rebasing as a first class feature

One of the reasons Git won was that it embraced rebasing. Competing systems like Bzr and Mercurial did not, and advocated merges instead. It turns out that people really want their linear history, and that rebasing is a great way to achieve it. It also helps code review, as fixes can be done in the original commits rather than in new commits afterwards. The counterargument is that rebasing loses history. This is true, but on the other hand it also means that your commit history gets littered with messages like "Some more typo fixes #3, lol." In practice people seem to strongly prefer the former to the latter.
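For reference, today's manual version of the feature-branch dance described under "Better server integration" looks something like this (a sketch; --force-with-lease at least refuses to overwrite commits you haven't fetched, which is the closest stock Git gets to the safety asked for above):

$ git fetch origin
$ git rebase origin/main my-feature
$ git push --force-with-lease origin my-feature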
Make it scalable

Git does not scale. The fact that Git-LFS exists is proof enough. Git only scales within its original, narrow design spec of "must be scalable for a process that only deals in plain text source files, where the main collaboration method is sending patches over email", and even then it does not do it particularly well. If you try to do anything else, Git just falls over. This is one of the main reasons why game developers and the like use other revision control systems. The final art assets for a single level of a modern game can be many, many times bigger than the entire development history of the Linux kernel.

A replacement system should handle huge repos like these effortlessly. By default a checkout should only download those files that are needed, not the entire development history. If you need to do something like bisection, then files missing from your local cache (and only those) should be downloaded transparently during checkout operations. There should be a command to download the entire history, of course, but it should not be done by default.

Further, it should be possible to do only partial checkouts. People working on low level code should be able to get just their bits and not have to download hundreds of gigs of textures and videos they don't need to do their work.
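For comparison, stock Git's opt-in approximations of this behaviour look like the following (a sketch with an assumed URL; neither mechanism is the default, and neither covers the asset workflow described next):

$ git clone --filter=blob:none https://example.org/big-repo   # partial clone: fetch file contents lazily
$ cd big-repo
$ git sparse-checkout set src/engine docs                     # partial checkout: materialize only these paths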
Support file locking

This is the one feature all coders hate: the ability to lock a file in trunk so that no-one else can edit it. It is disruptive, annoying, and just plain wrong. It is also necessary. Practice has shown that artists at large either can not or will not use revision control systems. There are many studios where the revision control system for artists is a shared network drive, with file names like character_model_v3_final_realfinal_approved.mdl. It "works for them", and trying to mandate a more process-heavy revision control system can easily lead to an open revolt.

Converting these people means providing them with a better workflow. Something like this:

1. They open their proprietary tool, be it Photoshop, Final Cut Pro or whatever.
2. They click on a GUI item to open a new resource.
3. A window pops up where they can browse the files directly from the server, as if they were local.
4. They open a file.
5. They edit it.
6. They save it. Changes go directly in trunk.
7. They close the file.

There might be a review step as well, but it should be automatic. Merge requests should be filed and kept up to date without the need to create a branch or to even know that such a thing exists. Anything else will not work. Specifically, doing any sort of conflict resolution does not work, even if it were the "right" thing to do. The only way around this (that we know of) is to provide file locking. Obviously this should be limitable to binary files only.

Provide all functionality via a C API

The above means that you need to be able to deeply integrate the revision control system with existing artist tools. This means plugins written in native code using a stable plain C API. The system can still be implemented in whatever SuperDuperLanguage you want, but its one true entry point must be a C API. It should be full-featured enough that the official command line client should be implementable using only functions in the public C API.

Provide transparent Git support

Even if a project wanted to move to something else, the sad truth is that for the time being the majority of contributors only know Git. They don't want to learn a whole new tool just to contribute to the project. Thus the server should serve its data in two different formats: once in its native format, and once as a regular Git endpoint. Anyone with a Git client should be able to check out the code and not even know that the actual backend is not Git. They should even be able to submit merge requests, though they might need to jump through some minor hoops for that. This allows you to do incremental upgrades, which is the only feasible way to get changes like these done.

Posted by Jussi at 12:57 PM

10 comments:

Unknown (December 28, 2020 at 4:43 PM): Thank you for this thoughtful assessment of the technical landscape. I think you have provided eloquent design points along this roadmap.

Thomas DA (December 28, 2020 at 10:07 PM): The example about a feature branch, isn't that what force-with-lease is about?

Jussi (December 29, 2020 at 12:23 AM): Possibly. I tried to read that documentation page and I could not for the life of me understand what it is supposed to do. Which gives us yet another thing the replacement should provide: have documentation that is actually readable and understandable by humans.

Rickyx (December 29, 2020 at 2:05 AM): As an architect (not of software but of buildings) and artist, I totally agree. Currently I have found no good solution to version binary CAD, 3D files, images... The scheme currently used in major studios? 2020-11-09_my house.file, 2020-11-10_my house.file, 2020-11-22_my house Erik fix.file, 2020-12-02_my house changed roof.file...

not empty (December 29, 2020 at 6:48 PM): Are you aware of Perforce/Helix? I don't want to advertise this (commercial) application, but some of the points you're making seem to match its features (server-based, file-locking). I haven't used it in production (especially not what seems to be a git-compatible interface) but have looked at it as a way to handle large binary files. Since a creative studio relies on a lot of different applications, using a plain file server still seems to be the most compatible and transparent way to handle files. You just have to make it easy for artists to use a versioning scheme so you don't end up with what you've described (my_file_v5_final_approved_...).

Jussi (December 30, 2020 at 12:05 AM): From what I remember, Perforce is server-only. And terrible. Only used it for a while, ages ago.

Unknown (December 29, 2020 at 10:12 PM): I think the best solution would be to have git for mergable/diffable files and perforce for binary assets. It would also be good if artists/designers/architects etc. used mergable file formats for their work, but that is not really possible today.

Jussi (December 30, 2020 at 12:06 AM): The most important thing a version control system gives you is atomic state. For that you need only one system; two separate ones don't really work.

Unknown (December 30, 2020 at 5:41 PM): I think that atomic state is way less important for binary/unmergable assets (compared to code).
So from a game dev perspective, a dual system should work way better than a single one.

Apichat (December 31, 2020 at 9:43 AM): You don't even mention the name of the project "basically Git, but with a mathematically proven model for X". I think you might refer to Pijul https://pijul.org/ : "Pijul is a free and open source (GPL2) distributed version control system. Its distinctive feature is to be based on a sound theory of patches, which makes it easy to learn and use, and really distributed." It has gotten recent news: https://pijul.org/posts/2020-11-07-towards-1.0/ and https://pijul.org/posts/2020-12-19-partials/ and "Pijul - The Mathematically Sound Version Control System Written in Rust" https://initialcommit.com/blog/pijul-version-control-system and "Q&A with the Creator of the Pijul Version Control System" https://initialcommit.com/blog/pijul-creator
# Document Title
Introduction

This article is a Q&A with Pierre-Étienne Meunier, the creator and lead developer of the Pijul VCS (version control system).

Q: What is your background in computer science and software development?

A: I've been an academic researcher for about 10 years, and I've recently left academia to found a science company working on issues related to energy savings and decentralised energy production. I'm the only computer scientist, and we work with sociologists and energy engineers.

While I was in academia, my main area of work was asynchronous and geometric computing, whose goal is to understand how systems with lots of simple components can interact in a geometric space to make computation happen. My current work is also somewhat related to this, although in a different way.

My favourite simple components differ from man-made "silicon cores" in that they occur in nature, or at least in the real world, and are not made for the specific purpose of computation. For many years I've worked on getting molecules to self-organise in a test tube to form meaningful shapes. And I've also worked on Pijul, where different authors edit a document in a disorderly way, without necessarily agreeing on "good practices" beforehand.

The idea of Pijul came while Florent Becker and I were writing a paper on self-assembly. At some point we started thinking about the shortcomings of Darcs (Florent was one of the core contributors of Darcs at the time). We decided that we had to do something about it, to keep the "mathematically designed" family of version control systems alive.

Q: What is your approach to learning, furthering your knowledge and understanding on a topic? What resources do you use?

A: The thing I love the most about computer science is that it can be used to think about a wide variety of subjects of human knowledge, and yet you can get a very concrete and very cheap experience of it by programming a computer. In all disciplines, the main way for virtually anyone to learn technical and scientific things is to play with them.

For example, you can read a book about economics and then immediately write little simulations of games from game theory, or a basic simulator of macroeconomics. You can read the Wikipedia page about wave functions and then start writing code to simulate a quantum computer. These simulations will not be efficient or realistic, but they will allow their authors to formalise their ideas in a concrete and precise way.

By following this route, you not only get a door into the subject you wanted to understand initially, you also get access to philosophical questions that seemed very abstract before, since computer science is a gateway between the entire world of "pure reason" (logic and mathematics), as Kant would say, and the physical world. What can we know for a fact? Is there a reality beyond language? And suddenly you get a glimpse of what Hume, Kant, Wittgenstein... were after.

Q: What was the first programming language you learned and how did you get into it?

A: I think I started with C when I was around 12. My uncle had left France to take an administrative position high up at the Vatican, and left most of his things with his brothers and sisters. My mother got his computer, a Victor V286, which was already pretty old when we got it (especially in the days when Moore's law was at its peak). Almost nothing was supplied with it: MS-DOS, a text processor, and a C IDE. So if I wanted to use it, I had little choice.
I don't remember playing much with the text processor.

Q: Why did you decide to start a Version Control System? What is it about VCS that interests you?

A: My first contact with version control systems was with SVN at university, and I remember being impressed when I first started using it for the toy project we were working on.

Then, when I did my PhD, my friends and colleagues convinced me to switch to Darcs. As the aspirant mathematician I was, the idea of a rigorous patch theory, where detecting conflicts was done entirely by patch commutation, was appealing. Like everybody else, I ran every now and then into Darcs' issues with conflicts, until Florent and I noticed that (1) it had become nearly impossible to convince our skeptical colleagues to install and use it and (2) this particular corner of version control systems was surprisingly close to the rest of our research interests, and that's how we got started.

Q: How do you decide what projects to work on?

A: This is one of my biggest problems in life. I'm generally interested in a large number of things, and I have little time to explore all of them in the depth they deserve. When I have to choose, I try to do what I think will teach me (and hopefully others) the most, or what will change things the most.

Q: Why did you choose to write Pijul in Rust?

A: We didn't really choose, Rust was the only language at the time that ticked all the boxes:

- Statically typed with automatic memory management, because we were writing mathematical code, and we were just two coders, so we needed as much help from compilers as we could get. That essentially meant one of OCaml, Haskell, Scala, Rust, Idris.

- Fast, because we knew we would be benchmarked against Git. That ruled out Scala (because of startup times) and Idris, which was experimental at the time. Also, Haskell can be super fast, but the performance is hard to guarantee deterministically.

- Could call C functions on Windows. That ruled out OCaml (this might have been fixed since then), and was a strong argument for Rust, since the absence of a GC makes this particularly easy.

An additional argument for Rust is that compiling stuff is super easy. Sure, there are system dependencies, but most of the time, things "just work". It works so well that people are encouraged to split their work into multiple crates rather than bundling it up into a large monolithic library. As a consequence, even very simple programs can easily have dozens of dependencies.

Q: When you encounter a tough development problem, what is your process to solve it? Especially if the problem is theoretical in nature?

A: It really depends on the problem, but I tend to start all my projects by playing with toy examples, until I understand something new. I take notes about the playing, and then try to harden the reasoning by making the intuitive parts rigorous. This is often where bugs (both in mathematical proofs and in software) are hidden. Often, this needs a new iteration of playing, and sometimes many more. I usually find that coding is a good way to play with a problem, even though it isn't always possible, especially in later iterations of the project.

Of course, for things like Pijul, code is needed all the way to the end, but it is a different sort of code, not the "prototype" kind used to understand a problem.

Apart from that, I find that taking a break to walk outside, without forcing myself to stay too focused on the problem, is very useful.
I also do other, more intensive sports, but they rarely help solve my problems.

Q: How big is the Pijul team? Who is it made up of?

A: For quite a while there was just Florent and myself, then a few enthusiasts joined us to work on the previous version: we had about 10 contributors, including Thomas Letan who, in addition to his technical contributions, organised the community, welcomed people, etc. And Tae Sandoval, who wrote a popular tutorial (Pijul for Git users), and has been convincing people on social media at an impressive rate for a few years.

Now that we have a candidate version that seems applicable to real-life projects, the team has started growing again, and the early alpha releases of version 1.0.0, even though they still have a few bugs, are attracting new contributors and testers.

Q: What are your goals for Pijul capabilities, growth, and adoption?

A: When Pijul becomes stable, it will be usable for very large projects, and bring more sanity to source code management, by making things more deterministic.

This has the potential to save a large number of engineering hours globally, and to use continuous integration tools more wisely: indeed, on large repositories, millions of CPU-hours are wasted each year just to check that Git and others didn't shuffle lines around during a merge.

Smaller projects, beginners and non-technical people could also benefit from version control, but they aren't using it now, or at least not enough, because the barrier is just too high. Moreover, in some fields of industry and administration, people are doing version control manually, which seems like a giant waste of human time to me, but I also know that no current tool can do that job.

About growth and adoption, I also want to mention that Pijul started as an open source project, and that we are strongly committed to keeping it open source. Since there is a large amount of tooling to be developed around it to encourage adoption (such as text editor plugins, CI/CD tooling…), we are currently trying to build a commercial side as well, in order to fund these developments. This could be in the form of support, hosting, or maybe specialised applications of Pijul, or all that at the same time.

Q: Do you think Pijul (or any other VCS) could ever overtake Git?

A: I think there could be a space for both. Git is meant as a content-addressable storage engine, and is extraordinarily efficient at that. One thing that is particularly cool with focusing on versions is that diffs can be computed after the fact, and I can see how this is desirable sometimes. For example, for cases where only trivial merges happen (for example adding a file), such as scientific data, this models the reality quite well, as there is no "intent" of an "author" to convey.

In Pijul, we can have different diff algorithms, for instance taking a specific file format into account, but they have to be chosen at record time. The advantages are that the merge is unambiguous, in the sense that there is only one way to merge things, and that way satisfies a number of important properties not satisfied by Git. For example, the fact that merging two changes at once has the same effect as merging the first one, and then the other (as bizarre as it sounds, Git doesn't always do that).

However, because Git does not operate on changes (or "diffs"), it fails to model how collaboration really works.
For example, when merging changes made on different branches, Git insists on ordering them, which is not actually what happened: in reality, the authors of the two branches worked in parallel, and merged each other's changes in different orders. This sounds like a minor difference, but in real life, it forces people to reorder their history all the time, letting Git guess how to reshuffle their precious source code in ways that are not rigorous at all.

Concretely, if Alice and Bob each produce a commit in parallel, then when they pull each other's change, they should get the exact same result, and see the same conflicts if there are conflicts. It shouldn't matter whether it is Alice or Bob who solves the conflicts, as long as the resolution works for both of them. If they later decide to push the result to another repository, there is no reason why the conflicts should reappear. They sometimes do in Git (which is the reason for the git rerere command), and the theory of Pijul guarantees that this is never the case.

Q: If "distributed version control" is the 3rd generation of VCS tools, do you anticipate a 4th generation? If so, what might that look like? What might be the distinguishing feature/aspect of the next generation VCS?

A: I believe "asynchronous" is the keyword here. Git (Mercurial, Fossil, etc.) are distributed in the sense that each instance is a server, but they are really just replicated instances of a central source of authority (usually hosted on GitHub or GitLab and named "master").

In contrast to this, asynchronous systems can work independently from each other, with no central authority. These systems are typically harder to design, since there is a rather large number of cases, and even figuring out how to make a list of cases isn't obvious.

Now of course, project leaders will always want to choose a particular version for release, but this should be dictated by human factors only, not by technical factors.

Q: What is your personal favorite VCS feature?

A: I believe commutativity is the thing every version control system is trying to simulate, with varying degrees of success. Merges and rebases in Git are trying to make things commute (well, when they work), and Darcs does it (except for conflicts). So this is really my favourite feature, and the fact that no one else was doing it rigorously got me into this field.

Q: If you could snap your fingers and have any VCS feature magically appear in Pijul, what would it be?

A: This is an excellent question. One thing we do not capture very well yet is keeping the identity of code blocks across refactorings. If you split or merge a file, or swap two functions in a file, then these changes will not commute with changes that happen in parallel.

We are not alone: Darcs doesn't model that at all, and Git and Mercurial use their merge heuristics to try and solve it. But as I said before, these heuristics are not rigorous, and can sometimes reshuffle files in unexpected ways without telling the user, which has non-trivial security implications.

I have ideas on how to do it in Pijul, but they are not fully formalised yet. I think the format is now extensible enough to support them. What was I saying about playing with toy examples and walking outside? ;-)
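As a deliberately naive toy model of the convergence property Meunier describes (this is not Pijul's patch theory, only a sketch of the algebraic idea): if the repository state is a set of patches and the working copy is a pure function of that set, then pulling in either order gives the same result.

```python
# Toy model of order-independent merging, for illustration only:
# the repository state is a *set* of patches, and applying patches
# is set union, which is commutative, associative and idempotent.

def merge(state: frozenset, *patches: str) -> frozenset:
    """Add patches to the state; the order of merging cannot matter."""
    return state.union(patches)

base = frozenset({"p0"})
alice = merge(base, "pA")   # Alice records patch pA in parallel...
bob = merge(base, "pB")     # ...while Bob records patch pB

# Each pulls the other's patch; both end up in the exact same state.
assert merge(alice, "pB") == merge(bob, "pA")
```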
Conclusion

In this article, we presented a Q&A session with the creator and lead developer of the Pijul project, Pierre-Étienne Meunier.

If you're interested in learning more about how version control systems work under the hood, check out our Baby Git Guidebook for Developers, which dives into Git's code in an accessible way. We wrote it for curious developers to learn how version control systems work at the code level. To do this we documented the first version of Git's code and discuss it in detail.

We hope you enjoyed this post! Feel free to shoot me an email at jacob@initialcommit.io with any questions or comments.
# Document Title

Conflict-Free Replicated Relations for Multi-Synchronous Database Management at Edge

Weihai Yu (UIT - The Arctic University of Norway, N-9037 Tromsø, Norway, weihai.yu@uit.no)
Claudia-Lavinia Ignat (Université de Lorraine, CNRS, Inria, LORIA, F-54000 Nancy, France, claudia.ignat@inria.fr)

HAL Id: hal-02983557, https://inria.hal.science/hal-02983557, submitted on 30 Oct 2020. IEEE International Conference on Smart Data Services, 2020 IEEE World Congress on Services, Oct 2020, Beijing, China.

Abstract—In a cloud-edge environment, edge devices may not always be connected to the network. Still, applications may need to access the data on edge devices even when they are not connected. With support for multi-synchronous access, data on an edge device are kept synchronous with the data in the cloud as long as the device is online. When the device is off-line, the application can still access the data on the device, asynchronously with concurrent data updates either in the cloud or on other edge devices. Conflict-free Replicated Data Types (CRDTs) emerged as a technology for multi-synchronous data access. CRDTs guarantee that when all sites have applied the same set of updates, the replicated data converge. However, CRDTs have not been successfully applied to relational databases (RDBs) for multi-synchronous access. In this paper, we present Conflict-free Replicated Relations (CRRs) that apply CRDTs to RDBs for support of multi-synchronous data access. With CRR, existing RDB applications, with very little modification, can be enhanced with multi-synchronous access. We also present a prototype implementation of CRR with some preliminary performance results.

Index Terms—CRDT; relational database; eventual consistency; integrity constraints

I. INTRODUCTION

The cloud technology makes data globally shareable and accessible, as long as the end-user's device is connected to the network. When the user's device is not connected, the data become inaccessible.

There are scenarios where the users may want to access their data even when their devices are off-line. The researchers may want to access their research data when they are at field work without any network facility. The project managers and engineers may want to access project data when they are under extreme conditions such as inside a tunnel or in the ocean. People may want to use their personal applications during flight, in a foreign country or during mountain hiking.

In a cloud-edge environment, the cloud consists of servers with powerful computation and reliable data storage capacities. The edge consists of the devices outside the operation domain of the cloud. The above-mentioned scenarios suggest multi-synchronous access of data on edge devices.
That is, there are two modes of accessing data on edge devices: asynchronous mode—the user can always access (read and write) the data on the device, even when the device is off-line, and synchronous mode—the data on the device is kept synchronous with the data stored in the cloud, as long as the device is online.

One of the main challenges of multi-synchronous data access is the limitation of a networked system stated in the CAP theorem [1,2]: it is impossible to simultaneously ensure all three desirable properties, namely (C) consistency equivalent to a single up-to-date copy of data, (A) availability of the data for update and (P) tolerance to network partition.

CRDTs, or Conflict-free Replicated Data Types, emerged to address the CAP challenges [3]. With CRDT, a site updates its local replica without coordination with other sites. The states of replicas converge when they have applied the same set of updates (referred to as strong eventual consistency in [3]). CRDTs have been adopted in the construction of distributed key-value stores [4], collaborative editors [5]–[8] and local-first software [9,10]. All these bear a similar goal to multi-synchronous data access. There has also been active research on CRDT-based transaction processing [11], mainly to achieve low latency for geo-replicated data stores.

Despite the above success stories, CRDTs have not been applied to multi-synchronous access of data stored in relational databases (RDBs). To support multi-synchronous data access, the application typically has to be implemented using specific key-value stores built with CRDTs. These key-value stores do not support some important RDB features, including advanced queries and integrity constraints. Consequently, the large number of existing applications that use RDBs have to be re-implemented for multi-synchronous access.

In this paper, we propose Conflict-Free Replicated Relations (CRRs) for multi-synchronous access to RDB data. Our contributions include:

- A set CRDT for RDBs. One of the hurdles for adopting CRDTs to RDBs is that there is no appropriate set CRDT for RDBs (Section III-A). We present the CLSet CRDT (causal-length set) that is particularly suitable for RDBs (Section III-C).
- An augmentation of existing RDB schema with CRR in a two-layer system (Section IV). With CRR, we are able to enhance existing RDB applications with very little modification.
- Automatic handling of integrity violations at merge (Section V).
- A prototype implementation (Section VI). The implementation is independent of the underlying database management system (DBMS). This allows different devices of the same application to deploy different DBMSs. This is particularly useful in a cloud-edge environment that is inherently heterogeneous. We also report some initial performance results of the prototype.

The paper is organized as follows. Section II gives an overview of CRR, together with the system model our work applies to. Section III reviews the background of CRDTs and presents the CLSet CRDT that is an underlying CRDT of CRR. Section IV describes CRR in detail. Section VI presents an implementation of CRR and some preliminary performance results. Section VII discusses related work. Section VIII concludes.

II. OVERVIEW OF CRR

The RDB supporting CRR consists of two layers: an Application Relation (AR) layer and a Conflict-free Replicated Relation (CRR) layer (see Figure 1). The AR layer presents the same RDB schema and API as a conventional RDB system. Application programs interact with the database at the AR layer. The CRR layer supports conflict-free replication of relations.

Fig. 1. A two-layer relational database system (figure omitted: queries and updates are issued at the AR layer; updates are translated to the CRR layer, which refreshes the AR cache and exchanges updates with remote sites through anti-entropy merges).
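As a first intuition for this two-layer split, here is a minimal sketch (names and data shapes are illustrative assumptions, not the paper's code) of an AR-layer cache being refreshed from CRR-layer rows whose liveness is tracked by a causal length, a notion developed in Section III:

```python
# Illustrative sketch of the two-layer design: the CRR layer stores
# augmented rows; the AR layer is a cache refreshed after every update.

crr = {}    # CRR layer: key -> (value, timestamp, causal_length)
ar = {}     # AR layer:  key -> value (what applications query)

def refresh(k):
    """Refresh the AR-layer cache for key k after a CRR-layer update."""
    row = crr.get(k)
    if row is not None and row[2] % 2 == 1:   # odd causal length: row lives
        ar[k] = row[0]
    else:                                      # even: row absent or deleted
        ar.pop(k, None)
```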
The AR-layer database schema $R$ has an augmented CRR schema $\tilde{R}$. A site $S_i$ maintains both an instance $r_i$ of $R$ and an instance $\tilde{r}_i$ of $\tilde{R}$. A query $q$ on $r_i$ is performed directly on $r_i$. A request for update $u$ on $r_i$ is translated into $\tilde{u}$ and performed on the augmented relation instance $\tilde{r}_i$. The update is later propagated to remote sites through an anti-entropy protocol. Every update in $\tilde{r}_i$, either as the execution of a local update $\tilde{u}(\tilde{r}_i)$ or as the merge with a remote update $\tilde{u}(\tilde{r}_j)$, refreshes $r_i$.

CRR has the property that when both sites $S_i$ and $S_j$ have applied the same set of updates, the relation instances at the two sites are equivalent, i.e. $r_i = r_j$ and $\tilde{r}_i = \tilde{r}_j$.

The two-layered system also maintains the integrity constraints defined at the AR layer. Any violation of an integrity constraint is caught at the AR layer. A local update of $\tilde{r}_i$ and refresh of $r_i$ are wrapped in an atomic transaction: a violation would cause the rollback of the transaction. A merge at $\tilde{r}_i$ and a refresh at $r_i$ are also wrapped in a transaction: a failed merge would cause some compensation updates.

An application developer does not need to care about how the CRR layer works. Little change is needed for an existing RDB application to function with CRR support.

We use different CRDTs for CRRs. Since a relation instance is a set of tuples or rows, we use a set CRDT for relation instances. We design a new set CRDT (Section III), as none of the existing set CRDTs are suitable for CRRs (Section III-A). A row consists of a number of attributes. Each of the attributes can be individually updated. We use the LWW (last-write-wins) register CRDT [12,13] for attributes in general cases. Since LWW registers may lose some concurrent updates, we use the counter CRDT [3] for numeric attributes with additive updates.

System Model

A distributed system consists of sites with globally unique identifiers. Sites do not share memory. They maintain durable states. Sites may crash, but will eventually recover to the durable state at the time of the last crash.

A site can send messages to any other site in the system through an asynchronous and unreliable network. There is no upper bound on message delay. The network may discard, reorder or duplicate messages, but it cannot corrupt messages. Through re-sending, messages will eventually be delivered. The implication is that there can be network partitions, but disconnected sites will eventually get connected.

III. A SET CRDT FOR CRRS

In this section, we first present a background on CRDTs. As a relation instance is a set of tuples, CRR needs an appropriate set CRDT as a building block. We present limitations of existing set CRDTs for RDBs. We then describe a new set CRDT that addresses the limitations.

A. CRDT Background

A CRDT is a data abstraction specifically designed for data replicated at different sites. A site queries and updates its local replica without coordination with other sites. The data is always available for update, but the data states at different sites may diverge. From time to time, the sites send their updates asynchronously to other sites with an anti-entropy protocol. To apply the updates made at the other sites, a site merges the received updates with its local replica.
A CRDT has the property that when all sites have applied the same set of updates, the replicas converge.

There are two families of CRDT approaches, namely operation-based and state-based [3]. Our work is based on state-based CRDTs, where a message for updates consists of the data state of a replica in its entirety. A site applies the updates by merging its local state with the state in the received message. The possible states of a state-based CRDT must form a join-semilattice [14], which implies convergence. Briefly, the states form a join-semilattice if they are partially ordered with $\sqsubseteq$ and a join $\sqcup$ of any two states (which gives the least upper bound of the two states) always exists. State updates must be inflationary. That is, the new state supersedes the old one in $\sqsubseteq$. The merge of two states is the result of a join.

Figure 2 (left) shows GSet, a state-based CRDT for grow-only sets (or add-only sets [3]), where $E$ is a set of possible elements, $\sqsubseteq \stackrel{def}{=} \subseteq$, $\sqcup \stackrel{def}{=} \cup$, $insert$ is a mutator (update operation) and $in$ is a query. Obviously, an update through $insert(s, e)$ is an inflation, because $s \subseteq \{e\} \cup s$. The definitions in the figure are:

$$\begin{aligned}
GSet(E) &\stackrel{def}{=} \mathcal{P}(E) \\
insert(s, e) &\stackrel{def}{=} \{e\} \cup s \\
insert^{\delta}(s, e) &\stackrel{def}{=} \begin{cases} \{e\} & \text{if } e \notin s \\ \{\} & \text{otherwise} \end{cases} \\
s \sqcup s' &\stackrel{def}{=} s \cup s' \\
in(s, e) &\stackrel{def}{=} e \in s
\end{aligned}$$

Fig. 2. GSet CRDT (left) and the Hasse diagram of its states (right).
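Transcribing these definitions into executable form is straightforward; the following sketch (Python chosen purely for illustration) mirrors the mutator, delta-mutator, join and query of Figure 2:

```python
# State-based grow-only set (GSet): states are sets ordered by inclusion,
# insert is union with a singleton, and the join (merge) is set union.

class GSet:
    def __init__(self, elems=()):
        self.elems = set(elems)

    def insert(self, e):
        """Mutator: inflationary, since s is a subset of {e} | s."""
        self.elems.add(e)

    def insert_delta(self, e):
        """Delta-mutator: the join-irreducible state {e}, or {} if known."""
        return set() if e in self.elems else {e}

    def merge(self, state):
        """Join with another state (or delta): least upper bound."""
        self.elems |= state

    def __contains__(self, e):   # the 'in' query
        return e in self.elems
```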
We achieve this with the abstraction ofcausal length, which is based on two observations.First, the insertions and deletions of a given element occurin turns, one causally dependent on the other. A deletion is anSite 퐴 Site 퐵 Site 퐶{} 푠0퐴 {} 푠0퐵 {} 푠0퐶푎1퐴 : insert(푠0퐴, 푎){푎} 푠1퐴푎1퐵 : insert(푠0퐵, 푎){푎} 푠1퐵{푎} 푠2퐴{푎} 푠1퐶푎2퐴 : delete(푠2퐴, 푎){} 푠3퐴푎2퐵 : delete(푠1퐵, 푎){} 푠2퐵{} 푠3퐵{} 푠4퐵푎2퐶 : delete(푠1퐶 , 푎){} 푠2퐶푎3퐵 : insert(푠4퐵, 푎){푎} 푠5퐵{푎} 푠6퐵{} 푠3퐶{푎} 푠4퐶푎4퐶 : delete(푠4퐶 , 푎){} 푠5퐶Fig. 3. A scenario of concurrent set updatesinverse of the last insertion it sees. Similarly, an insertion isan inversion of the last deletion it sees (or none, if the elementhas never been inserted).Second, two concurrent executions of the same mutation ofan anonymous CRDT (Section III-A) fulfill the same purposeand therefore are regarded as the same update. Seeing onemeans seeing both (such as the concurrent insertions of thesame element in GSet). Two concurrent inverses of the sameupdate are also regarded as the same one.Figure 3 shows a scenario where three sites 퐴, 퐵 and 퐶concurrently insert and delete element 푎. When sites 퐴 and퐵 concurrently insert 푎 for the first time, with updates 푎1퐴and 푎1퐵, they achieve the same effect. Seeing either one of theupdates is the same as seeing both. Consequently, states 푠1퐴,푠2퐴, 푠1퐵 and 푠1퐶 are equivalent as far as the insertion of 푎 isconcerned.Following the same logic, the concurrent deletions on theseequivalent states (with respect to the first insertion of 푎) arealso regarded as achieving the same effect. Seeing one of themis the same as seeing all. Therefore, states 푠3퐴, 푠2퐵, 푠3퐵, 푠4퐵, 푠2퐶and 푠3퐶 are equivalent with regard to the deletion of 푎.Now we present the states of element 푎 as the equivalenceclasses of the updates, as shown in the second column ofTable I. The concurrent updates that see equivalent states andachieve the same effect are in the same equivalence classes.The columns 푟 and ˜푟 in Table I correspond to the states of atuple (as an element of a set) in relation instances in the ARand CRR layers (Section II).When a site makes a new local update, it adds to its state anew equivalence class that contains only the new update. Forexample, when site 퐵 inserts element 푎 at state 푠0퐵 with update푎1퐵, it adds a new equivalence class {푎1퐵 } and the new state 푠1퐵becomes {{푎1퐵 }}. When site 퐵 then deletes the element withupdate 푎2퐵, it adds a new equivalence class {푎2퐵 } and the newTABLE ISTATES OF A SET ELEMENT푠 states in terms of equivalence groups ˜푟 푟푠0퐴 { } { } { }푠1퐴 { {푎1퐴 } } { 〈푎, 1〉 } {푎 }푠2퐴 { {푎1퐴, 푎1퐵 } } { 〈푎, 1〉 } {푎 }푠3퐴 { {푎1퐴, 푎1퐵 }, {푎2퐴 } } { 〈푎, 2〉 } { }푠0퐵 { } { } { }푠1퐵 { {푎1퐵 } } { 〈푎, 1〉 } {푎 }푠2퐵 { {푎1퐵 }, {푎2퐵 } } { 〈푎, 2〉 } { }푠3퐵 { {푎1퐴, 푎1퐵 }, {푎2퐵 } } { 〈푎, 2〉 } { }푠4퐵 { {푎1퐴, 푎1퐵 }, {푎2퐴, 푎2퐵 } } { 〈푎, 2〉 } { }푠5퐵 { {푎1퐴, 푎1퐵 }, {푎2퐴, 푎2퐵 }, {푎3퐵 } } { 〈푎, 3〉 } {푎 }푠6퐵 { {푎1퐴, 푎1퐵 }, {푎2퐴, 푎2퐵 , 푎2퐶 }, {푎3퐵 } } { 〈푎, 3〉 } {푎 }푠0퐶 { } { } { }푠1퐶 { {푎1퐵 } } { 〈푎, 1〉 } {푎 }푠2퐶 { {푎1퐵 }, {푎2퐶 } } { 〈푎, 2〉 } { }푠3퐶 { {푎1퐵 }, {푎2퐵 , 푎2퐶 } } { 〈푎, 2〉 } { }푠4퐶 { {푎1퐴, 푎1퐵 }, {푎2퐴, 푎2퐵 , 푎2퐶 }, {푎3퐵 } } { 〈푎, 3〉 } {푎 }푠5퐶 { {푎1퐴, 푎1퐵 }, {푎2퐴, 푎2퐵 , 푎2퐶 }, {푎3퐵 }, {푎4퐶 } } { 〈푎, 4〉 } { }state 푠2퐵 becomes {{푎1퐵 }, {푎2퐵 }}.The merge of two states is handled as the union of theequivalence classes. 
For example, when states $s^1_A$ and $s^2_B$ merge, the new state $s^3_B$ becomes $\{\{a^1_A\} \cup \{a^1_B\}, \{a^2_B\}\} = \{\{a^1_A, a^1_B\}, \{a^2_B\}\}$.

Now, observe that whether element $a$ is in the set is determined by the number of equivalence classes, rather than the specific updates contained in the equivalence classes. For example, as there is one equivalence class in state $s^1_B$, the element $a$ is in the set. As there are two equivalence classes in states $s^2_B$, $s^3_B$ and $s^4_B$, the element $a$ is not in the set.

Due to the last observation, we can represent the states of an element with a single number, the number of equivalence classes. We call that number the causal length of the element. An element is in the set when its causal length is an odd number. The element is not in the set when its causal length is an even number. As shown in Table I, a CRR-layer relation $\tilde{r}$ augments an AR-layer relation $r$ with causal lengths.

C. CLSet CRDT

Figure 4 shows the CLSet CRDT. The states are a partial function $s : E \hookrightarrow \mathbb{N}$, meaning that when $e$ is not in the domain of $s$, $s(e) = 0$ (0 is the bottom element of $\mathbb{N}$, i.e. $\bot_{\mathbb{N}} = 0$). Using a partial function conveniently simplifies the specification of $insert$, $\sqcup$ and $in$. Without explicit initialization, the causal length of any unknown element is 0. In the figure, $insert^{\delta}$ and $delete^{\delta}$ are the delta-counterparts of $insert$ and $delete$ respectively.

$$\begin{aligned}
CLSet(E) &\stackrel{def}{=} E \hookrightarrow \mathbb{N} \\
insert(s, e) &\stackrel{def}{=} \begin{cases} s\{e \mapsto s(e) + 1\} & \text{if } even(s(e)) \\ s & \text{if } odd(s(e)) \end{cases} \\
insert^{\delta}(s, e) &\stackrel{def}{=} \begin{cases} \{e \mapsto s(e) + 1\} & \text{if } even(s(e)) \\ \{\} & \text{if } odd(s(e)) \end{cases} \\
delete(s, e) &\stackrel{def}{=} \begin{cases} s & \text{if } even(s(e)) \\ s\{e \mapsto s(e) + 1\} & \text{if } odd(s(e)) \end{cases} \\
delete^{\delta}(s, e) &\stackrel{def}{=} \begin{cases} \{\} & \text{if } even(s(e)) \\ \{e \mapsto s(e) + 1\} & \text{if } odd(s(e)) \end{cases} \\
(s \sqcup s')(e) &\stackrel{def}{=} \max(s(e), s'(e)) \\
in(s, e) &\stackrel{def}{=} odd(s(e))
\end{aligned}$$

Fig. 4. CLSet CRDT.

An element $e$ is in the set when its causal length is an odd number. A local insertion has effect only when the element is not in the set. Similarly, a local deletion has effect only when the element is actually in the set. A local insertion or deletion simply increments the causal length of the element by one. For every element $e$ in $s$ and/or $s'$, the new causal length of $e$ after merging $s$ and $s'$ is the maximum of the causal lengths of $e$ in $s$ and $s'$.
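The following executable sketch of Figure 4 (again Python, for illustration only) represents a state as a dictionary from elements to causal lengths and replays the first concurrent insertion of the Figure 3 scenario:

```python
# CLSet: element -> causal length (missing key means 0). An element is
# in the set iff its causal length is odd; merge is a pointwise maximum.

class CLSet:
    def __init__(self):
        self.cl = {}

    def insert(self, e):
        """Effective only if e is absent (even causal length); returns delta."""
        if self.cl.get(e, 0) % 2 == 0:
            self.cl[e] = self.cl.get(e, 0) + 1
            return {e: self.cl[e]}
        return {}

    def delete(self, e):
        """Effective only if e is present (odd causal length); returns delta."""
        if self.cl.get(e, 0) % 2 == 1:
            self.cl[e] += 1
            return {e: self.cl[e]}
        return {}

    def merge(self, delta):
        for e, n in delta.items():
            self.cl[e] = max(self.cl.get(e, 0), n)

    def __contains__(self, e):
        return self.cl.get(e, 0) % 2 == 1

# Sites A and B concurrently insert the same element, then exchange deltas:
a, b = CLSet(), CLSet()
da, db = a.insert("x"), b.insert("x")
a.merge(db); b.merge(da)
assert ("x" in a) and a.cl == b.cl == {"x": 1}   # same update, seen once
```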
IV. CONFLICT-FREE REPLICATED RELATIONS

In this section we describe the design of the CRR layer. We focus particularly on the handling of updates.

Without loss of generality, we represent the schema of an application-layer relation with $R(K, A)$, where $R$ is the relation name, $K$ is the primary key and $A$ is a non-key attribute. For a relation instance $r$ of $R(K, A)$, we use $A(r)$ for the projection $\pi_A r$. We also use $r(k)$ for the row identified with key $k$, and $(k, a)$ for row $r(k)$ whose $A$ attribute has value $a$.

A. Augmentation and caching

In a two-layer system as highlighted in Section II, we augment an AR-layer relation schema $R(K, A)$ to a CRR-layer schema $\tilde{R}(K, A, T_A, L)$ where $T_A$ is the update timestamp of attribute $A$ and $L$ is the causal length of rows. Basically, $\tilde{R}(K, L)$ implements a CLSet CRDT (Section III-C) where the rows identified by keys are elements, and $\tilde{R}(K, A, T_A)$ implements the LWW-register CRDT [12,13] where an attribute of each row is a register.

We use hybrid logical clocks [20] for timestamps, which are UTC-time compatible, and for two events $e_1$ and $e_2$ with clocks $\tau_1$ and $\tau_2$, $\tau_1 < \tau_2$ if $e_1$ happens before $e_2$.

For a relation instance $\tilde{r}$ of an augmented schema $\tilde{R}$, the relation instance $r$ of the AR-layer schema $R$ is a cache of $\tilde{r}$.

For a relation operation $op$ on $r$, we use $\tilde{op}$ as the corresponding operation on $\tilde{r}$. For example, we use $\tilde{\in}$ on $\tilde{r}$ to detect whether a row exists in $r$. That is, $\tilde{r}(k) \,\tilde{\in}\, \tilde{r} \Leftrightarrow r(k) \in r$. According to CLSet (Figure 4), we define $\tilde{\in}$ and $\tilde{\notin}$ as

$$\begin{aligned}
\tilde{r}(k) \,\tilde{\in}\, \tilde{r} &\stackrel{def}{=} \tilde{r}(k) \in \tilde{r} \wedge odd(L(\tilde{r}(k))) \\
\tilde{r}(k) \,\tilde{\notin}\, \tilde{r} &\stackrel{def}{=} \tilde{r}(k) \notin \tilde{r} \vee even(L(\tilde{r}(k)))
\end{aligned}$$

For an update operation $u(r, k, a)$ on $r$, we first perform $\tilde{u}(\tilde{r}, k, a, \tau_A, l)$ on $\tilde{r}$. This results in the new instance $\tilde{r}'$ and row $\tilde{r}'(k) = (k, a, \tau'_A, l')$. We then refresh the cache $r$ as the following:

- Insert $(k, a)$ into $r$ if $\tilde{r}'(k) \,\tilde{\in}\, \tilde{r}'$ and $r(k) \notin r$.
- Delete $r(k)$ from $r$ if $\tilde{r}'(k) \,\tilde{\notin}\, \tilde{r}'$ and $r(k) \in r$.
- Update $r(k)$ with $(k, a)$ if $\tilde{r}'(k) \,\tilde{\in}\, \tilde{r}'$, $r(k) \in r$ and $A(r(k)) \neq a$.

All AR-layer queries are performed on the cached instance $r$ without any involvement of the CRR layer.

The following subsections describe how the CRR layer handles the different update operations.

B. Update operations

The CRR layer handles a local row insertion as below:

$$\widehat{u}_{insert}(\tilde{r}, (k, a)) \stackrel{def}{=} \begin{cases}
insert(\tilde{r}, (k, a, now, 1)) & \text{if } \tilde{r}(k) \notin \tilde{r} \\
update(\tilde{r}, k, (a, now, L(\tilde{r}(k)) + 1)) & \text{if } \tilde{r}(k) \in \tilde{r} \wedge \tilde{r}(k) \,\tilde{\notin}\, \tilde{r} \\
skip & \text{otherwise}
\end{cases}$$

A row insertion attempts to achieve two effects: to insert row $r(k)$ into $r$ and to assign value $a$ to attribute $A$ of $r(k)$. If $r(k)$ has never been inserted, we simply insert $(k, a, now, 1)$ into $\tilde{r}$. If $r(k)$ has been inserted but later deleted, we re-insert it with the causal length incremented by 1. Otherwise, there is already a row $r(k)$ in $r$, thus we do nothing.

There could be an alternative handling in the last case. Instead of doing nothing, we could update the value of attribute $A$ with $a$. We choose to do nothing, because this behavior is in line with the SQL convention.

The CRR layer handles a local row deletion as below:

$$\widehat{u}_{delete}(\tilde{r}, k) \stackrel{def}{=} \begin{cases}
update(\tilde{r}, k, (-, L(\tilde{r}(k)) + 1)) & \text{if } \tilde{r}(k) \,\tilde{\in}\, \tilde{r} \\
skip & \text{otherwise}
\end{cases}$$

If there is a row $r(k)$ in $r$, we increment the causal length by 1. We do nothing otherwise. In the expression, we use the "$-$" sign for the irrelevant attributes.

The CRR layer handles a local attribute update as below:

$$\widehat{u}_{update}(\tilde{r}, k, a) \stackrel{def}{=} \begin{cases}
update(\tilde{r}, k, (a, now, L(\tilde{r}(k)))) & \text{if } \tilde{r}(k) \,\tilde{\in}\, \tilde{r} \\
skip & \text{otherwise}
\end{cases}$$

If there is a row $r(k)$ in $r$, we update the attribute and set a new timestamp. We do nothing otherwise.

When the CRR layer handles one of the above update operations and results in a new row $\tilde{r}(k)$ in $\tilde{r}$ (either inserted or updated), the relation instance with a single row $\{\tilde{r}(k)\}$ is a join-irreducible state in the possible instances of schema $\tilde{R}$. The CRR layer later sends $\{\tilde{r}(k)\}$ to the remote sites in the anti-entropy protocol.

A site merges a received join-irreducible state $\{(k, a, \tau, l)\}$ with its current state $\tilde{r}$ using a join:

$$\tilde{r} \sqcup \{(k, a, \tau, l)\} \stackrel{def}{=} \begin{cases}
insert(\tilde{r}, (k, a, \tau, l)) & \text{if } \tilde{r}(k) \notin \tilde{r} \\
update(\tilde{r}, k, (a, \max(T(\tilde{r}), \tau), \max(L(\tilde{r}), l))) & \text{if } L(\tilde{r}) < l \vee T(\tilde{r}) < \tau \\
skip & \text{otherwise}
\end{cases}$$

If there is no $\tilde{r}(k)$ in $\tilde{r}$, we insert the received row into $\tilde{r}$. If the received row has either a longer causal length or a newer timestamp, we update $\tilde{r}$ with the received row. Otherwise, we keep $\tilde{r}$ unchanged. Notice that a newer update on a row that is concurrently deleted is still applied. The update is therefore not lost. The row will have the latest updated attribute value when the row deletion is later undone.

It is easy to verify that a new relation instance, resulting either from one of the local updates or from a merge with a received state, is always an inflation of the current relation instance regarding causal lengths and timestamps.
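For a single row, the merge rule above boils down to a few comparisons. The sketch below is an illustrative reading of it (simplified: timestamps are opaque comparable values, and a row state is a (value, timestamp, causal_length) triple):

```python
# Join of a local augmented row with an incoming join-irreducible state.
# Liveness is decided by the causal length; the value is LWW by timestamp.

def merge_row(local, incoming):
    if local is None:                       # row never seen: adopt it
        return incoming
    (a, t, l), (a2, t2, l2) = local, incoming
    if l2 > l or t2 > t:                    # longer causal length or newer
        return (a2, max(t, t2), max(l, l2))
    return local                            # incoming is already subsumed

def visible(row):
    """The AR-layer cache keeps the row iff its causal length is odd."""
    return row is not None and row[2] % 2 == 1
```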
C. Counters

The general handling of attribute updates described earlier in Section IV-B is based on the LWW-register CRDT. This is not ideal, because the effect of some concurrent updates may get lost. For some numeric attributes, we can use the counter CRDT [3] to handle updates as increments and decrements.

We make a special (meta) relation $Ct$ for counter CRDTs at the CRR layer: $Ct(Rel, Attr, Key, Sid, Icr, Dcr)$, where $(Rel, Attr, Key, Sid)$ is a candidate key of $Ct$.

The attribute $Rel$ is for relation names, $Attr$ for names of numeric attributes and $Key$ for primary key values. For an AR-layer relation $R(K, A)$, and respectively the CRR-layer relation $\tilde{R}(K, A, T_A, L)$, where $K$ is a primary key and $A$ is a numeric attribute, $(R, A, k)$ of $Ct$ refers to the numeric value $A(r(k))$ (and $A(\tilde{r}(k))$).

The attribute $Sid$ is for site identifiers. Our system model requires that sites are uniquely identified (Section II). The counter CRDT is named (as opposed to anonymous, described in Section III-A), meaning that a site can only update its specific part of the data structure.

The attributes $Icr$ and $Dcr$ are for the increment and decrement of numeric attributes from their initial values, set by given sites.

We set the initial value of a numeric attribute to: the default value, if the attribute is defined with a default value; the lower bound, if there is an integrity constraint that specifies a lower bound of the attribute; 0 otherwise.

If the initial value of the numeric attribute $A$ of an AR-layer relation $R(K, A)$ is $v_0$, we can calculate the current value of $A(r(k))$ as $v_0 + sum(Icr) - sum(Dcr)$, as an aggregation on the rows identified by $(R, A, k)$ in the current instance of $Ct$.

We can translate a local update of a numeric attribute into an increment or decrement operation. If the current value of the attribute is $v$ and the new updated value is $v'$, the update is an increment of $v' - v$ if $v' > v$, or a decrement of $v - v'$ if $v' < v$.

Site $s$ with the current $Ct$ instance $ct$ handles a local increment or decrement operation as below:

$$\widetilde{inc}_s(\tilde{r}, A, k, v) \stackrel{def}{=} \begin{cases}
update(ct, R, A, k, s, (i + v, d)) & \text{if } (R, A, k, s, i, d) \in ct \\
insert(ct, (R, A, k, s, (v, 0))) & \text{otherwise}
\end{cases}$$

$$\widetilde{dec}_s(\tilde{r}, A, k, v) \stackrel{def}{=} \begin{cases}
update(ct, R, A, k, s, (i, d + v)) & \text{if } (R, A, k, s, i, d) \in ct \\
insert(ct, (R, A, k, s, (0, v))) & \text{otherwise}
\end{cases}$$

If it is the first time the site updates the attribute, we insert a new row into $Ct$. Otherwise, we set the new increment or decrement value accordingly. Note that $v > 0$ and the updates in $Ct$ are always inflationary. The relation consisting of a single row of the update is a join-irreducible state in the possible instances of $Ct$ and is later sent to remote sites.

A site with the current $Ct$ instance $ct$ merges a received join-irreducible state with a join:

$$ct \sqcup \{(R, A, k, s, i, d)\} \stackrel{def}{=} \begin{cases}
insert(ct, (R, A, k, s, i, d)) & \text{if } (R, A, k, s) \notin ct \\
update(ct, R, A, k, s, (\max(i, i'), \max(d, d'))) & \text{if } (R, A, k, s, i', d') \in ct \wedge (i' < i \vee d' < d) \\
skip & \text{otherwise}
\end{cases}$$

If the site has not seen any update from site $s$, we insert the received row into $ct$. If it has applied some update from $s$ and the received update makes an inflation, we update the corresponding row for site $s$. Otherwise, it has already applied a later update from $s$ and we keep $ct$ unchanged.

Notice that turning an update into an increment or decrement might not always be an appropriate way to handle an update. For example, two sites may concurrently update the temperature measurement from 11°C to 15°C. Increasing the value twice leads to a wrong value of 19°C. In such situations, it is more appropriate to handle updates as a LWW-register.
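A minimal executable reading of this counter scheme (illustrative only; in the paper the per-site increments and decrements live in the $Ct$ relation) keeps one (Icr, Dcr) pair per site and merges by pointwise maxima:

```python
# Named counter CRDT: each site only inflates its own entry, so the merge
# of two states is a per-site maximum; the value is an aggregation.

class Counter:
    def __init__(self, v0=0):
        self.v0 = v0          # initial value (default / lower bound / 0)
        self.icr = {}         # site -> total increment
        self.dcr = {}         # site -> total decrement

    def value(self):
        return self.v0 + sum(self.icr.values()) - sum(self.dcr.values())

    def update(self, site, new_value):
        """Translate an attribute update into an increment or decrement."""
        delta = new_value - self.value()
        if delta > 0:
            self.icr[site] = self.icr.get(site, 0) + delta
        elif delta < 0:
            self.dcr[site] = self.dcr.get(site, 0) - delta

    def merge(self, other):
        for s, i in other.icr.items():
            self.icr[s] = max(self.icr.get(s, 0), i)
        for s, d in other.dcr.items():
            self.dcr[s] = max(self.dcr.get(s, 0), d)
```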
V. INTEGRITY CONSTRAINTS

Applications define integrity constraints at the AR layer. A DBMS detects violations of integrity constraints when we refresh the cached AR-layer relations.

We perform updates on a CRR-layer instance $\tilde{r}$ and the corresponding refresh of an AR-layer instance $r$ in a single atomic transaction. A violation of any integrity constraint causes the entire transaction to be rolled back. Therefore a local update that violates any integrity constraint has no effect on either $\tilde{r}$ or $r$.

When a site detects a violation at merge, it may perform an undo on an offending update. We first describe how to perform an undo. Then, we present certain rules to decide which updates to undo. The purpose of making such decisions is to avoid undesirable effects. In more general cases not described below, we simply undo the incoming updates that cause constraint violations, although this may result in unnecessary undo of too many updates.

A. Undoing an update

For delta-state CRDTs, a site sends join-irreducible states as remote updates. In our case, a remote update is a single row (or more strictly, a relation instance containing the row).

To undo a row insertion or deletion, we simply increment the causal length by one.

To undo an attribute update augmented with the LWW-register CRDT, we set the attribute to the old value, using the current clock value as the timestamp. In order to be able to generate the undo update, the message of an attribute update also includes the old value of the attribute.

We handle undo of counter updates with more care, because counter updates are not idempotent and multiple sites may perform the same inverse operation concurrently. As a result, the same increment (or decrement) might be mistakenly reversed multiple times.

To address this problem, we create a new (meta) relation for counter undo: $CtU(Rel, Attr, Key, Sid, T)$.

Similar to relation $Ct$ (Section IV-C), attributes $Rel$ and $Attr$ are for the names of the relations and the counter attributes, and attributes $Key$ and $Sid$ are for key values and sites that originated the update. $T$ is the timestamp of the original counter update. A row $(R, A, k, s, \tau)$ in $CtU$ uniquely identifies a counter update.

A message for a counter update contains the timestamp and the delta of the update, in addition to the join-irreducible state of the update. The timestamp helps us uniquely identify the update. The delta helps us generate the inverse update.

Relation $CtU$ works as a GSet (Figure 2). When a site undoes a counter update, it inserts a row identifying the update into $CtU$. When a site receives an undo message, it performs the undo only when the row of the update is not in $CtU$.

B. Uniqueness

An application may set a uniqueness constraint on an attribute (or a set of attributes). For instance, the email address of a registered user must be unique. Two concurrent updates may violate a uniqueness constraint, though neither of them violates the constraint locally.

When we detect the violation of a uniqueness constraint at the time of merge, we decide a winning update and undo the losing one. Following the common sequential cases, an earlier update wins. That is, the update with the smaller timestamp wins.

As violations of uniqueness constraints occur only for row insertions and attribute updates (as LWW-registers), the only possible undo operations are row deletions and attribute updates (as LWW-registers).
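The $CtU$ guard of Section V-A above is essentially a grow-only set of update identifiers consulted before reversing a counter delta. A hedged sketch (the identifier tuple and the counters mapping are illustrative assumptions):

```python
# Idempotent undo of counter updates: a grow-only set remembers which
# updates have already been reversed, so concurrent undoers of the same
# (relation, attr, key, site, timestamp) update reverse it only once.

undone = set()                     # plays the role of the CtU relation

def undo_counter_update(counters, update_id, key, delta):
    if update_id in undone:        # this update was already reversed
        return
    undone.add(update_id)
    counters[key] -= delta         # reverse the original increment/decrement
```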
C. Reference integrity

A reference-integrity constraint may be violated due to concurrent updates of a reference and the referenced row. Suppose relation $R_1(K_1, F)$ has a foreign key $F$ referencing $R_2(K_2, A)$. Assume two sites $S_1$ and $S_2$ have rows $r_{21}(k_2)$ and $r_{22}(k_2)$ respectively. Site $S_1$ inserts $r_{11}(k_1, k_2)$ and site $S_2$ concurrently deletes $r_{22}(k_2)$. The sites will fail to merge the concurrent remote updates, because they violate the reference-integrity constraint.

Obviously, both sites undoing the incoming remote update is undesirable. We choose to undo the updates that add references. The updates are likely row insertions. One reason for this choice is that handling uniqueness violations often leads to row deletions (Section V-B). If a row deletion as undo for a uniqueness violation conflicts with an update that adds a reference, undoing the row deletion will violate the uniqueness constraint again, resulting in an endless cycle of undo updates.

Notice that two seemingly concurrent updates may not be truly concurrent. One site might have already indirectly seen the effect of the remote update, reflected as a longer causal length. Two updates are truly concurrent only when the referenced rows at the two sites have the same causal length at the time of the concurrent updates. When the two updates are not truly concurrent, a merge of an incoming update at one of the sites would have no effect. In the above example, the merge of the deletion at site $S_2$ has no effect if $L(\tilde{r}_{21}(k_2)) > L(\tilde{r}_{22}(k_2))$.

D. Numeric constraints

An application may set numeric constraints, such as a lower bound on the balance of a bank account. A set of concurrent updates may together violate a numeric constraint, though the individual local updates do not.

When the merge of an incoming update fails due to a violation of a numeric constraint, we undo the latest update (or updates) of the numeric attribute. This, however, may undo more updates than necessary. We have not yet investigated how to avoid undoing too many updates that violate a numeric constraint.

VI. IMPLEMENTATION

We have implemented a CRR prototype, Crecto, on top of Ecto¹, a data mapping library for Elixir².

For every application-defined database table, we generate a CRR-augmented table. We also generate two additional tables in Crecto, $Ct$ for counters (Section IV-C) and $CtU$ for counter undo (Section V-A).

For every table, Ecto automatically generates a primary-key attribute which defaults to an auto-increment integer. Instead, we enforce the primary keys to be UUIDs to avoid collision of key values generated at different sites.

In the CRR-augmented tables, every attribute has an associated timestamp for the last update. We implemented the hybrid logical clock [20] as timestamps.

¹ https://github.com/elixir-ecto/ecto
² https://elixir-lang.org

In Ecto, applications interact with databases through a single Repo module. This allows us to localize our implementation efforts to a Crecto.Repo module on top of the Ecto.Repo module.

Since queries do not involve any CRR-layer relations, Crecto.Repo forwards all queries unchanged to Ecto.Repo.

A local update includes first update(s) of CRR-layer relation(s) and then a refresh of the cached relation at the AR layer. All these are wrapped in a single atomic transaction. Any constraint violation is caught at cache refreshment, which causes the transaction to be rolled back.

A site keeps outgoing messages in a queue. When the site is online, it sends the messages in the queue to remote sites. The queue is stored in a file so that it survives system crashes.

A merge of an incoming update also includes first update(s) of CRR-layer relation(s) and then a refresh of the cached relation at the AR layer. These are also wrapped in an atomic transaction.
Again, any constraint violation is caught at cache refreshment, which causes the transaction to be rolled back. When this happens, we undo an offending update (Section V). If we undo the incoming update, we generate an undo update, insert an entry in $CtU$ if the update is a counter increment or decrement, and send the generated undo update to remote sites. If we undo an update that has already been performed locally, we re-merge the incoming update after the undo of the already performed update.

For an incoming remote update, we can generate the undo update using the additional information attached in the message (Section V-A). For an already performed row insertion or deletion, we can generate the undo update with an increment of the causal length. For an already performed attribute update, we can generate the undo update in one of two ways. We can use the messages stored in the queue. Or we can simply wait. An undo update will eventually arrive, because this update will violate the same constraint at a remote site.

Performance

To study the performance of CRR, we implemented an advertisement-counter application [18,21] in Crecto. Advertisements are displayed on edge devices (such as mobile phones) according to some vendor contract, even when the devices are not online. The local counters of displayed advertisements are merged upstream in the cloud when the devices are online. An advertisement is disabled when it has reached a certain level of impression (total displayed number).

We are mainly interested in the latency of database updates, which is primarily dependent on disk IOs. Table II shows the latency measured at a Lenovo ThinkPad T540p laptop, configured with an Intel quad-core i7-4710MQ CPU, 16 GiB RAM, 512 GiB SSD, running Linux 5.4. The Crecto prototype is implemented on Ecto 3.2 in Elixir 1.9.4 OTP 22, deployed with PostgreSQL³ 12.1.

³ https://www.postgresql.org

To understand the extra overhead of multi-synchronous support, we measured the latency of the updates of the application implemented in both Ecto and Crecto. Table II only gives an approximate indication, since the latency depends on the load of the system as well as the sizes of the tables. For example, the measured latency of counter updates actually varied from 1.1 ms to 3.8 ms to get an average of 2.8 ms. Notice also that whether an update is wrapped in a transaction also makes a difference. We thus include the measured latency of Ecto updates that are either included (In-tx) or not included (No-tx) in transactions.

TABLE II. LATENCY OF DATABASE UPDATES (IN MS)

| Ecto | insertion | deletion | update |
|---|---|---|---|
| No-tx | 0.9 | 0.9 | 0.9 |
| In-tx | 1.5 | 1.5 | 1.5 |

| Crecto | insertion | deletion | LWW | counter |
|---|---|---|---|---|
| Local | 2.1 | 2.1 | 2.1 | 2.8 |
| Merge | 2.1 | 2.1 | 2.1 | 2.8 |

We expected that the latency of the updates in Crecto would be 2–3 times higher than that of the corresponding updates in Ecto, since handling an update request involves two or three updates in the two layers. The measured latency is less than that. One explanation is that several updates may share some common data-mapping overhead.
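Crecto's update timestamps are hybrid logical clocks. As a rough sketch of the algorithm from [20] (not Crecto's Elixir code; millisecond physical time is an assumption here), the clock keeps a logical part l and a counter c:

```python
import time

class HLC:
    """Hybrid logical clock sketch after Kulkarni et al. [20]."""
    def __init__(self):
        self.l = 0   # logical time: highest physical time seen so far
        self.c = 0   # counter ordering events that share the same l

    def _pt(self):
        return int(time.time() * 1000)

    def now(self):
        """Timestamp a local or send event."""
        pt = self._pt()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def recv(self, lm, cm):
        """Timestamp a receive event carrying remote clock (lm, cm)."""
        pt, l0 = self._pt(), self.l
        self.l = max(l0, lm, pt)
        if self.l == l0 and self.l == lm:
            self.c = max(self.c, cm) + 1
        elif self.l == l0:
            self.c += 1
        elif self.l == lm:
            self.c = cm + 1
        else:
            self.c = 0
        return (self.l, self.c)
```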
VII. RELATED WORK

CRDTs for multi-synchronous data access

One important feature of local-first software [10] is multi-synchronous data access. In particular, data should be first stored on user devices and be always immediately accessible. Current prototypes of local-first software reported in [10] are implemented using JSON CRDTs [9]. There is no support for RDB features such as SQL-like queries and integrity constraints.

The work presented in [22] aims at low latency of data access at the edge. It uses hybrid logical clocks [20] to detect concurrent operations and operational transformation [23] to resolve conflicts. In particular, it resolves conflicting attribute updates as LWW-registers. It requires a central server in the cloud to disseminate operations, and the data model is restricted to the MongoDB⁴ model.

⁴ https://www.mongodb.com

Lasp [18,21] is a programming model for large-scale distributed programming. One of its key features is to allow local copies of replicated state to change during periods without network connectivity. It is based on the ORSet CRDT [12] and provides functional programming primitives such as map, filter, product and union.

CRDTs for geo-replicated data stores

Researchers have applied CRDTs to achieve low latency for data access in geo-replicated databases. Unlike multi-synchronous access, such systems require all or a quorum of the replicas to be available.

RedBlue consistency [11] allows eventual consistency for blue operations and guarantees strong consistency for red operations. In particular, it applies CRDTs for blue operations and globally serializes red operations.

[24] presents an approach to support global boundary constraints for counter CRDTs. For a bounded counter, the "rights" to increment and decrement the counter value are allocated to the replicas. A replica can only update the counter using its allocated right. It can "borrow" rights from other replicas when its allocated right is insufficient. This approach requires that replicas with sufficient rights be available in order to successfully perform an update.

Our work does not support global transactions [11] or global boundary constraints [24]. It is known [25] that it is impossible to support certain global constraints without synchronization and coordination among different sites. We repair asynchronously temporary violations of global constraints through undo. Support of undo for row insertion and deletion in CRR is straightforward, as CLSet (Section III-C) is a specialization of a generic undo support for CRDTs [19].

AntidoteSQL [26] provides a SQL interface to a geo-replicated key-value store that adopts CRDT approaches including those reported in [11,24]. An application can declare certain rules for resolving conflicting updates. Similar to our work, an application can choose between LWW-register and counter CRDTs for conflicting attribute updates. In AntidoteSQL, a deleted record is associated with a "deleted" flag, hence excluding the possibility for the record to be inserted back again. As AntidoteSQL is built on a key-value store, it does not benefit from the maturity of the RDB industry.

Set CRDTs

The CLSet CRDT (Section III) is based on our earlier work on undo support for state-based CRDTs [19]. Obviously, deletion (insertion) of an element can be regarded as an undo of a previous insertion (deletion) of the element.

As discussed in Section III-A, there exist CRDTs for general-purpose sets [12,15,17,18]. The meta data that existing set CRDTs associate with each element are much larger than the single number (causal length) in our CLSet CRDT. In [27], we compare CLSet with existing set CRDTs in more detail.

VIII. CONCLUSION

We presented a two-layer system for multi-synchronous access to relational databases in a cloud-edge environment. The underlying CRR layer uses CRDTs to allow immediate data access at the edge and to guarantee data convergence when the edge devices are online. It also resolves violations of integrity constraints at merge by undoing offending updates.
A key enabling contribution is a new set CRDT based on causal lengths. Applications access databases through the AR layer with little or no concern about the underlying mechanism. We implemented a prototype on top of a data mapping library that is independent of any specific DBMS. Therefore servers in the cloud and edge devices can deploy different DBMSs.

IX. ACKNOWLEDGMENT

The authors would like to thank the members of the COAST team at Inria Nancy-Grand Est/Loria, in particular Victorien Elvinger, for inspiring discussions.

REFERENCES

[1] A. Fox and E. A. Brewer, "Harvest, yield and scalable tolerant systems," in The Seventh Workshop on Hot Topics in Operating Systems, 1999, pp. 174–178.
[2] S. Gilbert and N. Lynch, "Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services," SIGACT News, vol. 33, no. 2, pp. 51–59, 2002.
[3] M. Shapiro, N. M. Preguiça, C. Baquero, and M. Zawirski, "Conflict-free replicated data types," in 13th International Symposium on Stabilization, Safety, and Security of Distributed Systems (SSS 2011), 2011, pp. 386–400.
[4] R. Brown, S. Cribbs, C. Meiklejohn, and S. Elliott, "Riak DT map: a composable, convergent replicated dictionary," in The First Workshop on the Principles and Practice of Eventual Consistency, 2014, pp. 1–1.
[5] G. Oster, P. Urso, P. Molli, and A. Imine, "Data consistency for P2P collaborative editing," in CSCW. ACM, 2006, pp. 259–268.
[6] N. M. Preguiça, J. M. Marquès, M. Shapiro, and M. Letia, "A commutative replicated data type for cooperative editing," in ICDCS, 2009, pp. 395–403.
[7] W. Yu, L. André, and C.-L. Ignat, "A CRDT supporting selective undo for collaborative text editing," in DAIS, 2015, pp. 193–206.
[8] M. Nicolas, V. Elvinger, G. Oster, C.-L. Ignat, and F. Charoy, "MUTE: A peer-to-peer web-based real-time collaborative editor," in ECSCW Panels, Demos and Posters, 2017.
[9] M. Kleppmann and A. R. Beresford, "A conflict-free replicated JSON datatype," IEEE Trans. Parallel Distrib. Syst., vol. 28, no. 10, pp. 2733–2746, 2017.
[10] M. Kleppmann, A. Wiggins, P. van Hardenberg, and M. McGranaghan, "Local-first software: you own your data, in spite of the cloud," in Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software (Onward! 2019), 2019, pp. 154–178.
[11] C. Li, D. Porto, A. Clement, J. Gehrke, N. M. Preguiça, and R. Rodrigues, "Making geo-replicated systems fast as possible, consistent when necessary," in 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2012, pp. 265–278.
[12] M. Shapiro, N. M. Preguiça, C. Baquero, and M. Zawirski, "A comprehensive study of convergent and commutative replicated data types," Rapport de recherche, vol. 7506, January 2011.
[13] P. Johnson and R. Thomas, "The maintenance of duplicate databases," Internet Request for Comments RFC 677, January 1976.
[14] V. K. Garg, Introduction to Lattice Theory with Computer Science Applications. Wiley, 2015.
[15] P. S. Almeida, A. Shoker, and C. Baquero, "Delta state replicated data types," J. Parallel Distrib. Comput., vol. 111, pp. 162–173, 2018.
[16] V. Enes, P. S. Almeida, C. Baquero, and J. Leitão, "Efficient synchronization of state-based CRDTs," in IEEE 35th International Conference on Data Engineering (ICDE), April 2019.
[17] A. Bieniusa, M. Zawirski, N. M. Preguiça, M. Shapiro, C. Baquero, V. Balegas, and S. Duarte, "An optimized conflict-free replicated set," Rapport de recherche, vol. 8083, October 2012.
[18] C. Meiklejohn and P. van Roy, "Lasp: a language for distributed, coordination-free programming," in the 17th International Symposium on Principles and Practice of Declarative Programming, 2015, pp. 184–195.
[19] W. Yu, V. Elvinger, and C.-L. Ignat, "A generic undo support for state-based CRDTs," in 23rd International Conference on Principles of Distributed Systems (OPODIS 2019), ser. LIPIcs, vol. 153, 2020, pp. 14:1–14:17.
[20] S. S. Kulkarni, M. Demirbas, D. Madappa, B. Avva, and M. Leone, "Logical physical clocks," in Principles of Distributed Systems (OPODIS), ser. LNCS, vol. 8878. Springer, 2014, pp. 17–32.
[21] C. S. Meiklejohn, V. Enes, J. Yoo, C. Baquero, P. van Roy, and A. Bieniusa, "Practical evaluation of the Lasp programming model at large scale: an experience report," in the 19th International Symposium on Principles and Practice of Declarative Programming, 2017, pp. 109–114.
[22] D. Mealha, N. Preguiça, M. C. Gomes, and J. Leitão, "Data replication on the cloud/edge," in Proceedings of the 6th Workshop on the Principles and Practice of Consistency for Distributed Data (PaPoC), 2019, pp. 7:1–7:7.
[23] C. A. Ellis and S. J. Gibbs, "Concurrency control in groupware systems," in SIGMOD. ACM, 1989, pp. 399–407.
[24] V. Balegas, D. Serra, S. Duarte, C. Ferreira, M. Shapiro, R. Rodrigues, and N. M. Preguiça, "Extending eventually consistent cloud databases for enforcing numeric invariants," in 34th IEEE Symposium on Reliable Distributed Systems (SRDS), 2015, pp. 31–36.
[25] P. Bailis, A. Fekete, M. J. Franklin, A. Ghodsi, J. M. Hellerstein, and I. Stoica, "Coordination avoidance in database systems," Proc. VLDB Endow., vol. 8, no. 3, pp. 185–196, 2014.
[26] P. Lopes, J. Sousa, V. Balegas, C. Ferreira, S. Duarte, A. Bieniusa, R. Rodrigues, and N. M. Preguiça, "Antidote SQL: relaxed when possible, strict when necessary," CoRR, vol. abs/1902.03576, 2019.
[27] W. Yu and S. Rostad, "A low-cost set CRDT based on causal lengths," in Proceedings of the 7th Workshop on the Principles and Practice of Consistency for Distributed Data (PaPoC), 2020, pp. 5:1–5:6.
# Document Title
Pondering a Monorepo Version Control System

Monorepos have desirable features, but git is the wrong version control system (VCS) to realize one. Here I document what the right one would look like.

Why Monorepo?

The idea of a monorepo is to put "everything" into a single version control system. This is in contrast to multirepo, where developers regularly use multiple repositories. I don't have a more precise definition.

Google uses a monorepo. In 2016 Google had an 86 TB repository and eventually developed a custom version control system for that. They report 30,000 commits per day and up to 800,000 file read queries per second. Microsoft has a 300 GB git repository which requires significant changes and extensions to normal git. Facebook is doing something similar with Mercurial.

An implication of this monorepo approach is that it usually contains packages for multiple programming languages. So you cannot rely on language-specific tools like pip or maven. Bazel seems to be a fitting and advanced build system for this, which is unsurprising given that it stems from Google's internal "Blaze". For more tooling considerations, there is an Awesome Monorepo list.

Often, people try to store all dependencies in the monorepo as well. This might include tools like compilers and IDEs but probably not the operating system. The goal is reproducibility, and external dependencies should vanish.

If you want to read more discussions on monorepos, read Advantages of monorepos by Dan Luu and browse All You Always Wanted to Know About Monorepo But Were Afraid to Ask.

Most of the arguments for and against monorepos are strawman rants in my opinion. A monorepo does not guarantee a "single lint, build, test and release process", for example. You can have chaos in a monorepo and you can have order with multirepos. This is a question of process and not of repository structure.

There is only one advantage: in a monorepo you can do an atomic change across everything. This is what enables you to change the API and update all users in a single commit. With multirepo you have an inherent race condition, and eventually there will be special tooling around this fact. However, this "atomic change across everything" also requires special tooling eventually. Google invests heavily in the clang ecosystem for this reason. Nothing is for free.

That said, let's assume for now we want to go for a monorepo.

Why not git?

If you talk about version control systems these days, people usually think about git. Everybody wants to use it despite git's well-known flaws and its UI, which does not even try to hide implementation details. In the context of monorepos, the relevant flaw is that git scales poorly beyond some gigabytes of data.

Git needs LFS or Annex to deal with large binary files. Plenty of git operations (e.g. git status) check every file in the checkout; walking the whole checkout is not feasible for big monorepos. To lessen some pain, git can download only a limited part of the history (shallow clone) and show only parts of the repository (sparse checkout). Still, there is no way around the fact that you need a full copy of the current HEAD on your disk. This is not feasible for big monorepos either.

As an alternative, I would suggest Subversion. The Apache Foundation has a big Subversion repository which contains, among others, OpenOffice, Hadoop, Maven, httpd, CouchDB, Zookeeper, Tomcat, and Xerces. Of course, Subversion has flaws itself, and its development is very slow these days. For example, merging is painful.
However, Google does not have branches in its monorepo. Instead, they use feature toggles. Maybe branches are conceptually wrong for big monorepos and we should not even try?

I considered creating a new VCS because I see an open niche there which none of the Open Source VCSs can fill. One central insight is that a client will probably never look at every single file, so we must avoid any need to download a full copy. Working on the repository will be more like working on a network file system. Thus we lose the ability to work offline as a tradeoff.

Integrated build system

I already wrote about merging version control and build system. In the discussions on that, I learned about proprietary solutions like Rational ClearCase and Vesta SCM. Here is the summary of these thoughts:

We already committed to storing the repo on the network. The build system also wants to store artifacts on the network, so why not store them in the repo as well? It implies that the VCS knows which files are generated.

We also committed to putting all tooling in the repo for the sake of reproducibility. Thus, the build system can be simple. There is no need for a configure step because there is no relevant environment outside of the repo.

Now consider the fact that no single machine might be able to generate all the artifacts. Some might require a Windows or Linux machine. Some might require special hardware. However, the artifact generation is reproducible, so it does not matter which client does it. We might as well integrate continuous integration (CI) jobs into the system.

Imagine this: you edit the code and commit a change. You already ran unit tests locally, so these artifacts are already there. During the next minutes you see more artifacts pop up which depended on your changes. These artifacts might be a Windows and a Linux and an OS X build. All these artifacts are naturally part of the version you committed, so if you switch to a different branch, the artifacts change automatically. There is no explicit back and forth with a CI system. Instead, the version-control-and-build-system just fills in the missing generated files for a new committed version.

To implement this, we need a notification mechanism in our version control system. We still need special clients which continuously wait for new commits and generate artifacts. The VCS must manage these clients and propagate artifacts as needed. Certainly, this is very much beyond the usual job of a VCS.

More Design Aspects

Since we want to store "everything" in the repo, we also want non-technical people to use it. It already resembles a network file system, so it should provide an interface nearly as easy to use. We want to enable designers to store Photoshop files in there. Managers should store Excel and Powerpoint files in there. I say "nearly" because we need additional concepts like versions and a commit mechanism which cannot be hidden from users.

The wide range of users and security concerns require an access control mechanism with fine granularity (per file/directory). Since big monorepos exist in big organizations, it naturally must be integrated into the surrounding system (LDAP, Exchange, etc.).

Monorepo is not for Open Source

The VCS described above sounds great for big companies and many big projects. However, the Open Source world consists of small independent projects which are more loosely coupled. This loose coupling provides a certain resilience and diversity.
While a monorepo allows you to atomically change something everywhere, it also forces you to do it to some degree. Looser coupling means more flexibility on when you update dependencies, for example. The tradeoff is the inevitable chaos, and you wonder if we really need so many build systems and scripting languages.

Open Source projects usually start with a single developer and no long-term goals. For that use case git is perfect. Maybe certain big projects may benefit. For example, would Gnome adopt this? Even there, it seems the partition into multiple git repos works well enough.

Why not build a Monorepo VCS?

OSS projects have no need for it. Small companies are fine with git, because their repositories are small enough. Essentially, this is a product for enterprises.

It makes no sense to build it as Open Source software in my spare time. It would be a worthwhile startup idea, but I have a family and no need to take the risk right now, and my current job is interesting enough. It is not a small project that can be done on the side at work. This is why I will not build it in the foreseeable future, although it is a fascinating technical challenge. Maybe someone else can, so I documented my thoughts here and can hopefully focus on other stuff now. Maybe Google is already building it.

Some discussion on lobste.rs.

The alternative for big companies is package management, in my opinion. I looked around, and ZeroInstall seems to be the only usable cross-platform language-independent package manager. Of course, it cares only about distributing artifacts. Its distributed approach provides a lot of flexibility on how you generate the artifacts, which can be an advantage.

Also, Amazon's build system provides valuable insights for manyrepo environments.

© 2019-11-07
# Document Title
1.0 Don't Stress!

The feature sets of Fossil and Git overlap in many ways. Both are distributed version control systems which store a tree of check-in objects to a local repository clone. In both systems, the local clone starts out as a full copy of the remote parent. New content gets added to the local clone and then later optionally pushed up to the remote, and changes to the remote can be pulled down to the local clone at will. Both systems offer diffing, patching, branching, merging, cherrypicking, bisecting, private branches, a stash, etc.

Fossil has inbound and outbound Git conversion features, so if you start out using one DVCS and later decide you like the other better, you can easily move your version-controlled file content.¹

In this document, we set all of that similarity and interoperability aside and focus on the important differences between the two, especially those that impact the user experience.

Keep in mind that you are reading this on a Fossil website, and though we try to be fair, the information here might be biased in favor of Fossil, if only because we spend most of our time using Fossil, not Git. Ask around for second opinions from people who have used both Fossil and Git.

2.0 Differences Between Fossil And Git

Differences between Fossil and Git are summarized by the following table, with further description in the text that follows.

| GIT | FOSSIL |
| File versioning only | VCS, tickets, wiki, docs, notes, forum, UI, RBAC |
| Sprawling, incoherent, and inefficient | Self-contained and efficient |
| Ad-hoc pile-of-files key/value database | The most popular database in the world |
| Portable to POSIX systems only | Runs just about anywhere |
| Bazaar-style development | Cathedral-style development |
| Designed for Linux kernel development | Designed for SQLite development |
| Many contributors | Select contributors |
| Focus on individual branches | Focus on the entire tree of changes |
| One check-out per repository | Many check-outs per repository |
| Remembers what you should have done | Remembers what you actually did |
| SHA-2 | SHA-3 |

2.1 Featureful

Git provides file versioning services only, whereas Fossil adds an integrated wiki, ticketing & bug tracking, embedded documentation, technical notes, and a web forum, all within a single nicely-designed skinnable web UI, protected by a fine-grained role-based access control system. These additional capabilities are available for Git as 3rd-party add-ons, but with Fossil they are integrated into the design. One way to describe Fossil is that it is "GitHub-in-a-box."

For developers who choose to self-host projects (rather than using a 3rd-party service such as GitHub) Fossil is much easier to set up, since the stand-alone Fossil executable together with a 2-line CGI script suffice to instantiate a full-featured developer website. To accomplish the same using Git requires locating, installing, configuring, integrating, and managing a wide assortment of separate tools. Standing up a developer website using Fossil can be done in minutes, whereas doing the same using Git requires hours or days.

Fossil is small, complete, and self-contained. If you clone Git's self-hosting repository, you get just Git's source code.
If you clone Fossil's self-hosting repository, you get the entire Fossil website — source code, documentation, ticket history, and so forth.² That means you get a copy of this very article and all of its historical versions, plus the same for all of the other public content on this site.

2.2 Efficient

Git is actually a collection of many small tools, each doing one small part of the job, which can be recombined (by experts) to perform powerful operations. Git has a lot of complexity and many dependencies, so that most people end up installing it via some kind of package manager, simply because the creation of complicated binary packages is best delegated to people skilled in their creation. Normal Git users are not expected to build Git from source and install it themselves.

Fossil is a single self-contained stand-alone executable with hardly any dependencies. Fossil can be run inside a minimally configured chroot jail, from a Windows memory stick, off a Raspberry Pi with a tiny SD card, etc. To install Fossil, one merely puts the executable somewhere in the $PATH. Fossil is straightforward to build and install, so that many Fossil users do in fact build and install "trunk" versions to get new features between formal releases.

Some say that Git more closely adheres to the Unix philosophy, summarized as "many small tools, loosely joined," but we have many examples of other successful Unix software that violates that principle to good effect, from Apache to Python to ZFS. We can infer from that that this is not an absolute principle of good software design. Sometimes "many features, tightly-coupled" works better. What actually matters is effectiveness and efficiency. We believe Fossil achieves this.

Git fails on efficiency once you add to it all of the third-party software needed to give it a Fossil-equivalent feature set. Consider GitLab, a third-party extension to Git wrapping it in many features, making it roughly Fossil-equivalent, though much more resource hungry and hence more costly to run than the equivalent Fossil setup. GitLab's basic requirements are easy to accept when you're dedicating a local rack server or blade to it, since its minimum requirements are more or less a description of the smallest thing you could call a "server" these days, but when you go to host that in the cloud, you can expect to pay about 8⨉ as much to comfortably host GitLab as for Fossil.³ This difference is largely due to basic technology choices: Ruby and PostgreSQL vs C and SQLite.

The Fossil project itself is hosted on a very small VPS, and we've received many reports on the Fossil forum about people successfully hosting Fossil service on bare-bones $5/month VPS hosts, spare Raspberry Pi boards, and other small hosts.

2.3 Durable

The baseline data structures for Fossil and Git are the same, modulo formatting details. Both systems manage a directed acyclic graph (DAG) of Merkle tree / block chain structured check-in objects. Check-ins are identified by a cryptographic hash of the check-in contents, and each check-in refers to its parent via its hash.

The difference is that Git stores its objects as individual files in the .git folder or compressed into bespoke pack-files, whereas Fossil stores its objects in a SQLite database file using a hybrid NoSQL/relational data model of the check-in history. Git's data storage system is an ad-hoc pile-of-files key/value database, whereas Fossil uses a proven, heavily-tested, general-purpose, durable SQL database. This difference is more than an implementation detail.
It has important practical consequences.

With Git, one can easily locate the ancestors of a particular check-in by following the pointers embedded in the check-in object, but it is difficult to go the other direction and locate the descendants of a check-in. It is so difficult, in fact, that neither native Git nor GitHub provide this capability short of groveling the commit log. With Git, if you are looking at some historical check-in then you cannot ask "What came next?" or "What are the children of this check-in?"

Fossil, on the other hand, parses essential information about check-ins (parents, children, committers, comments, files changed, etc.) into a relational database that can be easily queried using concise SQL statements to find both ancestors and descendants of a check-in. This is the hybrid data model mentioned above: Fossil manages your check-in and other data in a NoSQL block chain structured data store, but that's backed by a set of relational lookup tables for quick indexing into that artifact store. (See "Thoughts On The Design Of The Fossil DVCS" for more details.)

Leaf check-ins in Git that lack a "ref" become "detached," making them difficult to locate and subject to garbage collection. This detached head state problem has caused untold grief for countless Git users. With Fossil, detached heads are simply impossible because we can always find our way back into the block chain using one or more of the relational indices it automatically manages for you.

This design difference shows up in several other places within each tool. It is why Fossil's timeline is generally more detailed yet more clear than those available in Git front-ends. (Contrast this Fossil timeline with its closest equivalent in GitHub.) It's why there is no inverse of the cryptic @~ notation in Git, meaning "the parent of HEAD," which Fossil simply calls "prev", but there is a "next" special check-in name in Fossil. It is why Fossil has so many built-in status reports to help maintain situational awareness, aid comprehension, and avoid errors.

These differences are due, in part, to Fossil starting a year later than Git: we were able to learn from its key design mistakes.

2.4 Portable

Fossil is largely written in ISO C, almost purely conforming to the original 1989 standard. We make very little use of C99, and we do not knowingly make any use of C11. Fossil does call POSIX and Windows APIs where necessary, but it's about as portable as you can ask given that ISO C doesn't define all of the facilities Fossil needs to do its thing. (Network sockets, file locking, etc.) There are certainly well-known platforms Fossil hasn't been ported to yet, but that's most likely due to lack of interest rather than inherent difficulties in doing the port. We believe the most stringent limit on its portability is that it assumes at least a 32-bit CPU and several megs of flat-addressed memory.⁴ Fossil isn't quite as portable as SQLite, but it's close.

Over half of the C code in Fossil is actually an embedded copy of the current version of SQLite. Much of what is Fossil-specific after you set SQLite itself aside is SQL code calling into SQLite. The number of lines of SQL code in Fossil isn't large by percentage, but since SQL is such an expressive, declarative language, it has an outsized contribution to Fossil's user-visible functionality.
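To make the flavor of those concise queries concrete, here is a minimal sketch, not code from Fossil itself. It assumes the rusqlite crate and Fossil's parent/child link table, referred to here as plink(pid, cid); treat those schema details as an assumption rather than a documented contract. Because the repository is a single SQLite file, "find all descendants of a check-in" reduces to one recursive query:

```rust
// Hedged sketch: querying a Fossil repository file directly with SQLite.
// Assumes the `rusqlite` crate and a `plink` table of (pid, cid)
// parent/child check-in links; the exact schema is an assumption here.
use rusqlite::{params, Connection, Result};

/// Return the record IDs of every descendant of check-in `rid` by
/// walking the parent->child link table with a recursive CTE.
fn descendants(repo: &Connection, rid: i64) -> Result<Vec<i64>> {
    let mut stmt = repo.prepare(
        "WITH RECURSIVE d(rid) AS (
             VALUES (?1)
             UNION
             SELECT plink.cid FROM plink JOIN d ON plink.pid = d.rid
         )
         SELECT rid FROM d WHERE rid <> ?1",
    )?;
    let rows = stmt.query_map(params![rid], |row| row.get(0))?;
    rows.collect()
}

fn main() -> Result<()> {
    // A Fossil repository is just a SQLite database file on disk.
    let repo = Connection::open("project.fossil")?;
    for rid in descendants(&repo, 1)? {
        println!("descendant check-in: {rid}");
    }
    Ok(())
}
```

Answering the same "what came next?" question in native Git means walking the whole commit graph from every ref, since child links are not stored anywhere.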
Fossil isn't entirely C and SQL code. Its web UI uses JavaScript where necessary.⁵ The server-side UI scripting uses a custom minimal Tcl dialect called TH1, which is embedded into Fossil itself. Fossil's build system and test suite are largely based on Tcl.⁶ All of this is quite portable.

About half of Git's code is POSIX C, and about a third is POSIX shell code. This is largely why the so-called "Git for Windows" distributions (both first-party and third-party) are actually an MSYS POSIX portability environment bundled with all of the Git stuff, because it would be too painful to port Git natively to Windows. Git is a foreign citizen on Windows, speaking to it only through a translator.⁷

While Fossil does lean toward POSIX norms when given a choice — LF-only line endings are treated as first-class citizens over CR+LF, for example — the Windows build of Fossil is truly native.

The third-party extensions to Git tend to follow this same pattern. GitLab isn't portable to Windows at all, for example. For that matter, GitLab isn't even officially supported on macOS, the BSDs, or uncommon Linuxes! We have many users who regularly build and run Fossil on all of these systems.

2.5 Linux vs. SQLite

Fossil and Git promote different development styles because each one was specifically designed to support the creator's main software development project: Linus Torvalds designed Git to support development of the Linux kernel, and D. Richard Hipp designed Fossil to support the development of SQLite. Both projects must rank high on any objective list of "most important FOSS projects," yet these two projects are almost entirely unlike one another, so it is natural that the DVCSes created to support these projects also differ in many ways.

In the following sections, we will explain how four key differences between the Linux and SQLite software development projects dictated the design of each DVCS's low-friction usage path.

When deciding between these two DVCSes, you should ask yourself, "Is my project more like Linux or more like SQLite?"

2.5.1 Development Organization

Eric S. Raymond's seminal essay-turned-book "The Cathedral and the Bazaar" details the two major development organization styles found in FOSS projects. As it happens, Linux and SQLite fall on opposite sides of this dichotomy. Differing development organization styles dictate a different design and low-friction usage path in the tools created to support each project.

Git promotes the Linux kernel's bazaar development style, in which a loosely-associated mass of developers contribute their work through a hierarchy of lieutenants who manage and clean up these contributions for consideration by Linus Torvalds, who has the power to cherrypick individual contributions into his version of the Linux kernel. Git allows an anonymous developer to rebase and push specific locally-named private branches, so that a Git repo clone often isn't really a clone at all: it may have an arbitrary number of differences relative to the repository it originally cloned from. Git encourages siloed development. Select work in a developer's local repository may remain private indefinitely.

All of this is exactly what one wants when doing bazaar-style development.

Fossil's normal mode of operation differs on every one of these points, with the specific designed-in goal of promoting SQLite's cathedral development model:

Personal engagement: SQLite's developers know each other by name and work together daily on the project.

Trust over hierarchy: SQLite's developers check changes into their local repository, and these are immediately and automatically sync'd up to the central repository; there is no "dictator and lieutenants" hierarchy as with Linux kernel contributions.
D. Richard Hipp rarely overrides decisions made by those he has trusted with commit access on his repositories. Fossil allows you to give some users more power over what they can do with the repository, but Fossil does not otherwise directly support the enforcement of a development organization's social and power hierarchies. Fossil is a great fit for flat organizations.

No easy drive-by contributions: Git pull requests offer a low-friction path to accepting drive-by contributions. Fossil's closest equivalent is its unique bundle feature, which requires higher engagement than firing off a PR.⁸ This difference comes directly from the initial designed purpose for each tool: the SQLite project doesn't accept outside contributions from previously-unknown developers, but the Linux kernel does.

No rebasing: When your local repo clone syncs changes up to its parent, those changes are sent exactly as they were committed locally. There is no rebasing mechanism in Fossil, on purpose.

Sync over push: Explicit pushes are uncommon in Fossil-based projects: the default is to rely on autosync mode instead, in which each commit syncs immediately to its parent repository. Autosync is a mode, so you can turn it off temporarily when needed, such as when working offline. Fossil is still a truly distributed version control system; it's just that its starting default is to assume you're rarely out of communication with the parent repo.

This is not merely a reflection of modern always-connected computing environments. It is a conscious decision in direct support of SQLite's cathedral development model: we don't want developers going dark, then showing up weeks later with a massive bolus of changes for us to integrate all at once. Jim McCarthy put it well in his book on software project management, Dynamics of Software Development: "Beware of a guy in a room."

Branch names sync: Unlike in Git, branch names in Fossil are not purely local labels. They sync along with everything else, so everyone sees the same set of branch names. Fossil's design choice here is a direct reflection of the Linux vs. SQLite project outlook: SQLite's developers collaborate closely on a single coherent project, whereas Linux's developers go off on tangents and occasionally sync changes up with each other.

Private branches are rare: Private branches exist in Fossil, but they're normally used to handle rare exception cases, whereas in many Git projects, they're part of the straight-line development process.

Identical clones: Fossil's autosync system tries to keep local clones identical to the repository it cloned from.

Where Git encourages siloed development, Fossil fights against it. Fossil places a lot of emphasis on synchronizing everyone's work and on reporting on the state of the project and the work of its developers, so that everyone — especially the project leader — can maintain a better mental picture of what is happening, leading to better situational awareness.

Each DVCS can be used in the opposite style, but doing so works against their low-friction paths.

2.5.2 Scale

The Linux kernel has a far bigger developer community than that of SQLite: there are thousands and thousands of contributors to Linux, most of whom do not know each other's names. These thousands are responsible for producing roughly 89⨉ more code than is in SQLite. (10.7 MLOC vs. 0.12 MLOC according to SLOCCount.)
The Linux kernel and its development process were already uncommonly large back in 2005 when Git was designed, specifically to support the consequences of having such a large set of developers working on such a large code base.

95% of the code in SQLite comes from just four programmers, and 64% of it is from the lead developer alone. The SQLite developers know each other well and interact daily. Fossil was designed for this development model.

We think you should ask yourself whether you have Linus Torvalds scale software configuration management problems or D. Richard Hipp scale problems when choosing your DVCS. An automotive air impact wrench running at 8000 RPM driving an M8 socket-cap bolt at 16 cm/s is not the best way to hang a picture on the living room wall.

2.5.3 Accepting Contributions

As of this writing, Git has received about 4.5⨉ as many commits as Fossil, resulting in about 2.5⨉ as many lines of source code. The line count excludes tests and in-tree third-party dependencies. It does not exclude the default GUI for each, since it's integral for Fossil, so we count the size of gitk in this.

It is obvious that Git is bigger in part because of its first-mover advantage, which resulted in a larger user community, which results in more contributions. But is that the only reason? We believe there are other relevant differences that also play into this which fall out of the "Linux vs. SQLite" framing: licensing, community structure, and how we react to drive-by contributions. In brief, it's harder to get a new feature into Fossil than into Git.

A larger feature set is not necessarily a good thing. Git's command line interface is famously arcane. Masters of the arcane are able to do wizardly things, but only by studying their art deeply for years. This strikes us as a good thing only in cases where use of the tool itself is the primary point of that user's work.

Almost no one uses a DVCS for its own sake; very few people get paid specifically in order to drive a DVCS. We use DVCSes as a tool to support some other effort, so we do not necessarily want the DVCS with the most features. We want a DVCS with easily internalized behavior so we can thoroughly master it despite spending only a small fraction of our working time thinking about the DVCS. We want to pick the tool up, use it quickly, and then set it aside in order to get back to our actual job as quickly as possible.

Professional software developers in particular are prone to focusing on feature set sizes when choosing tools because this is sometimes a highly important consideration. They spend all day, every day, in their favorite text editors, and time they spend learning all of the arcana of their favorite programming languages is well-spent. Skills with these tools are direct productivity drivers, which in turn directly drives how much money a developer can make. (Or how much idle time they can afford to take, which amounts to the same thing.) But if you are a professional software developer, we want you to ask yourself a question: "How do I get paid more by mastering arcane features of my DVCS?"
Unless you have a good answer to that, you probably do not want to be choosing a DVCS based on how many arcane features it has.

The argument is similar for other types of users: if you are a hobbyist, how much time do you want to spend mastering your DVCS instead of on the hobby supported by use of that DVCS?

There is some minimal set of features required to achieve the purposes that drive our selection of a DVCS, but there is a level beyond which more features only slow us down while we're learning the tool, since we must plow through documentation on features we're not likely to ever use. When the number of features grows to the point where people of normal motivation cannot spend the time to master them all, the tool becomes less productive to use.

The core developers of the Fossil project achieve a balance between feature set size and ease of use by carefully choosing which users to give commit bits to, then in being choosy about which of the contributed feature branches to merge down to trunk. We say "no" to a lot of feature proposals.

The end result is that Fossil more closely adheres to the principle of least astonishment than Git does.

2.5.4 Individual Branches vs. The Entire Change History

Both Fossil and Git store history as a directed acyclic graph (DAG) of changes, but Git tends to focus more on individual branches of the DAG, whereas Fossil puts more emphasis on the entire DAG.

For example, the default "sync" behavior in Git is to only sync a single branch, whereas with Fossil the only sync option is to sync the entire DAG. Git commands, GitHub, and GitLab tend to show only a single branch at a time, whereas Fossil usually shows all parallel branches at once. Git has commands like "rebase" that help keep all relevant changes on a single branch, whereas Fossil encourages a style of many concurrent branches constantly springing into existence, undergoing active development in parallel for a few days or weeks, then merging back into the main line and disappearing.

This difference in emphasis arises from the different purposes of the two systems. Git focuses on individual branches, because that is exactly what you want for a highly-distributed bazaar-style project such as Linux. Linus Torvalds does not want to see every check-in by every contributor to Linux, as such extreme visibility does not scale well. But Fossil was written for the cathedral-style SQLite project with just a handful of active committers. Seeing all changes on all branches all at once helps keep the whole team up-to-date with what everybody else is doing, resulting in a more tightly focused and cohesive implementation.

2.6 One vs. Many Check-outs per Repository

A "repository" in Git is a pile-of-files in the .git subdirectory of a single check-out. The working check-out directory and the .git repository subdirectory are normally in the same directory within the filesystem.

With Fossil, a "repository" is a single SQLite database file that can be stored anywhere. There can be multiple active check-outs from the same repository, perhaps open on different branches or on different snapshots of the same branch. It is common in Fossil to switch branches with a "cd" command between two check-out directories rather than switching to another branch in place within a single working directory. Long-running tests or builds can be running in one check-out while changes are being committed in another.

From the start, Git has allowed symlinks to this .git directory from multiple working directories.
The git init command offers the --separate-git-dir option to set this up automatically. Then in version 2.5, Git added the "git-worktree" feature to provide a higher-level management interface atop this basic mechanism. Use of this more closely emulates Fossil's decoupling of repository and working directory, but the fact remains that it is far more common in Git usage to simply switch a single working directory among branches in place.

The main downside of that working style is that it invalidates all build objects created from files that change in switching between branches. When you have multiple working directories for a single repository, you can have a completely independent state in each working directory which is untouched by the "cd" command you use to switch among them.

There are also practical consequences of the way .git links work that make multiple working directories in Git not quite interchangeable, as they are in Fossil.

2.7 What you should have done vs. What you actually did

Git puts a lot of emphasis on maintaining a "clean" check-in history. Extraneous and experimental branches by individual developers often never make it into the main repository. And branches are often rebased before being pushed, to make it appear as if development had been linear. Git strives to record what the development of a project should have looked like had there been no mistakes.

Fossil, in contrast, puts more emphasis on recording exactly what happened, including all of the messy errors, dead-ends, experimental branches, and so forth. One might argue that this makes the history of a Fossil project "messy." But another point of view is that this makes the history "accurate." In actual practice, the superior reporting tools available in Fossil mean that the added "mess" is not a factor.

Like Git, Fossil has an amend command for modifying prior commits, but unlike in Git, this works not by replacing data in the repository, but by adding a correction record to the repository that affects how later Fossil operations present the corrected data. The old information is still there in the repository, it is just overridden from the amendment point forward. For extreme situations, Fossil adds the shunning mechanism, but it has strict limitations that prevent global history rewrites.

One commentator characterized Git as recording history according to the victors, whereas Fossil records history as it actually happened.

2.8 Hash Algorithm: SHA-3 vs SHA-2 vs SHA-1

Fossil started out using 160-bit SHA-1 hashes to identify check-ins, just as in Git. That changed in early 2017 when news of the SHAttered attack broke, demonstrating that SHA-1 collisions were now practical to create. Two weeks later, the creator of Fossil delivered a new release allowing a clean migration to 256-bit SHA-3 with full backwards compatibility to old SHA-1 based repositories.
Here in mid-2019, that feature is now in every OS and package repository known to include Fossil, so that the next release as of this writing (Fossil 2.10) will enforce SHA-3 hashes by default. This not only solves the SHAttered problem, it should prevent a recurrence for the foreseeable future. Only repositories created before the transition to Fossil 2 are still using SHA-1, and then only if the repository's maintainer chose not to switch them into SHA-3 mode some time over the past 2 years.

Meanwhile, the Git community took until August 2018 to announce their plan for solving the same problem by moving to SHA-256 (a variant of the older SHA-2 algorithm) and until February 2019 to release a version containing the change. It's looking like this will take years more to percolate through the community.

The practical impact of SHAttered on structured data stores like the one in Git and Fossil isn't clear, but you want to have your repositories moved over to a stronger hash algorithm before someone figures out how to make use of the weaknesses in the old one. Fossil's developers moved on this problem quickly and had a widely-deployed solution to it years ago.

3.0 Missing Features

Although there is a large overlap in capability between Fossil and Git, there are many areas where one system has a feature that is simply missing in the other. We covered most of those above, but there are a few remaining feature differences we haven't gotten to yet.

3.1 Features found in Fossil but missing from Git

The fossil all command

Fossil keeps track of all repositories and check-outs and allows operations over all of them with a single command. For example, in Fossil it is possible to request a pull of all repositories on a laptop from their respective servers, prior to taking the laptop off network. Or it is possible to do "fossil all changes" to see if there are any uncommitted changes that were overlooked prior to the end of the workday.

The fossil undo command

Whenever Fossil is told to modify the local checkout in some destructive way (fossil rm, fossil update, fossil revert, etc.) Fossil remembers the prior state and is able to return the local check-out directory to its prior state with a simple "fossil undo" command. You cannot undo a commit, since writes to the actual repository — as opposed to the local check-out directory — are more or less permanent, on purpose, but as long as the change is simply staged locally, Fossil makes undo easier than in Git.

3.2 Features found in Git but missing from Fossil

Rebase

Because of its emphasis on recording history exactly as it happened, rather than as we would have liked it to happen, Fossil deliberately does not provide a "rebase" command. One can rebase manually in Fossil, with sufficient perseverance, but it is not something that can be done with a single command.

Push or pull a single branch

The fossil push, fossil pull, and fossil sync commands do not provide the capability to push or pull individual branches. Pushing and pulling in Fossil is all or nothing. This is in keeping with Fossil's emphasis on maintaining a complete record and on sharing everything between all developers.

Asides and Digressions

Many things are lost in making a Git mirror of a Fossil repo due to limitations of Git relative to Fossil. GitHub adds some of these missing features to stock Git, but because they're not part of Git proper, exporting a Fossil repository to GitHub will still not include them; Fossil tickets do not become GitHub issues, for example.

The fossil-scm.org web site is actually hosted in several parts, so that it is not strictly true that "everything" on it is in the self-hosting Fossil project repo.
The web forum is hosted as a separate Fossil repo from the main Fossil self-hosting repo for administration reasons, and the Download page content isn't normally sync'd with a "fossil clone" command unless you add the "-u" option. (See "How the Download Page Works" for details.) There may also be some purely static elements of the web site served via D. Richard Hipp's own lightweight web server, althttpd, which is configured as a front end to Fossil running in CGI mode on these sites.

That estimate is based on pricing at Digital Ocean in mid-2019: Fossil will run just fine on the smallest instance they offer, at US $5/month, but the closest match to GitLab's minimum requirements among Digital Ocean's offerings currently costs $40/month.

This means you can give up waiting for Fossil to be ported to the PDP-11, but we remain hopeful that someone may eventually port it to z/OS.

We try to keep use of Javascript to a minimum in the web UI, and we always try to provide sensible fallbacks for those that run their browsers with Javascript disabled. Some features of the web UI simply won't run without Javascript, but the UI behavior does degrade gracefully.

"Why is there all this Tcl in and around Fossil?" you may ask. It is because D. Richard Hipp is a long-time Tcl user and contributor. SQLite started out as an embedded database for Tcl specifically. ([Reference]) When he then created Fossil to manage the development of SQLite, it was natural for him to use Tcl-based tools for its scripting, build system, test system, etc. It came full circle in 2011 when the Tcl and Tk projects moved from CVS to Fossil.

A minority of the pieces of the Git core software suite are written in other languages, primarily Perl, Python, and Tcl. (e.g. git-send-mail, git-p4, and gitk, respectively.) Although these interpreters are quite portable, they aren't installed by default everywhere, and on some platforms you can't count on them at all. (Not just Windows, but also the BSDs and many other non-Linux platforms.) This expands the dependency footprint of Git considerably. It is why the current Git for Windows distribution is 44.7 MiB but the current fossil.exe zip file for Windows is 2.24 MiB. Fossil is much smaller despite using a roughly similar amount of high-level scripting code because its interpreters are compact and built into Fossil itself.

Both Fossil and Git support patch(1) files, a common way to allow drive-by contributions, but it's a lossy contribution path for both systems. Unlike Git PRs and Fossil bundles, patch files collapse multiple checkins together, they don't include check-in comments, and they cannot encode changes made above the individual file content layer: you lose branching decisions, tag changes, file renames, and more when using patch files.
# Document Title
Part 5: Pseudo-edges
May 07, 2019

This is the fifth (and final planned) post in a series on some new ideas in version control. To start at the beginning, go here.

The goal of this post is to describe pseudo-edges: what they are, how to compute them efficiently, and how to update them efficiently upon small changes. To recall the important points from the last post:

- We (pretend, for now, that we) represent the state of the repository as a graph in memory: one node for every line, with directed edges that enforce ordering constraints between two lines. Each line has a flag that says whether it is deleted or not.
- The current output of the repository consists of just those nodes that are not deleted, and there is an ordering constraint between two nodes if there is a path in the graph between them, but note that the path is allowed to go through deleted nodes.
- Applying a patch to the repository is very efficient: the complexity of applying a patch is proportional to the number of changes it makes.
- Rendering the current output to a file is potentially very expensive: it requires traversing the entire graph, including nodes that are marked as deleted. To the extent we can, we'd like to reduce this complexity to the number of live nodes in the graph.

The main idea for solving this is to add "pseudo-edges" to the graph: for every path that connects two live nodes through a sequence of deleted nodes, add a corresponding edge to the graph. Once this is done, we can render the current output without traversing the deleted parts of the graph, because every ordering constraint that used to depend on some deleted parts is now represented by some pseudo-edge. Here's an example: the deleted nodes are in gray, and the pseudo-edge that they induce is the dashed arrow.

We haven't really solved anything yet, though: once we have the pseudo-edges, we can efficiently render the output, but how do we compute the pseudo-edges? The naive algorithm (look at every pair of live nodes, and check if they're connected by a path of deleted nodes) still depends on the number of deleted nodes. Clearly, what we need is some sort of incremental way to update the pseudo-edges.

Deferring pseudo-edges

The easiest way that we can reduce the amount of time required for computing pseudo-edges is simply to do it rarely. Specifically, remember that applying a patch can be very fast, and that pseudo-edges only need to be computed when outputting a file. So, obviously, we should only update the pseudo-edges when it's time to actually output the file. This sounds trivial, but it can actually be significant. Imagine, for example, that you're cloning a repository that has a long history; let's say it has n patches, each of which has a constant size, and let's assume that computing pseudo-edges takes time O(m), where m is the size of the history. Cloning a repository involves downloading all of those patches, and then applying them one-by-one. If we recompute the pseudo-edges after every patch application, the total amount of time required to clone the repository is O(n^2); if we apply all the patches first and only compute the pseudo-edges at the end, the total time is O(n).

You can see how ojo implements this deferred pseudo-edge computation here: first, it applies all of the patches; then it recomputes the pseudo-edges.
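As a minimal sketch of where the cost goes (with stand-in types and method names, not ojo's actual API), the clone loop defers all pseudo-edge work to a single final pass:

```rust
// Stand-in types and methods, not ojo's API: just enough to show
// the cost structure of cloning a repository of n patches.
struct Patch;
struct Graggle;

impl Graggle {
    fn apply(&mut self, _p: &Patch) {
        // O(|patch|): add nodes/edges, flip deleted flags; no pseudo-edge work.
    }
    fn recompute_pseudo_edges(&mut self) {
        // O(m), where m is the size of the history.
    }
}

fn clone_repo(graggle: &mut Graggle, patches: &[Patch]) {
    for p in patches {
        graggle.apply(p); // recomputing here instead would cost O(n * m) total
    }
    graggle.recompute_pseudo_edges(); // paid once at the end
}

fn main() {
    clone_repo(&mut Graggle, &[Patch, Patch, Patch]);
}
```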
Connected deleted components

Deferring the pseudo-edge computation certainly helps, but we'd also like to speed up the computation itself. The main idea is to avoid unnecessary recomputation by only examining parts of the graph that might have actually changed. At this point, I need to admit that I don't know whether what I'm about to propose is the best way of updating the pseudo-edges. In particular, its efficiency rests on a bunch of assumptions about what sort of graphs we're likely to encounter. I haven't made any attempt to test these assumptions on actual large repositories (although that's something I'd like to try in the future).

The main assumption is that while there may be many deleted nodes, they tend to be collected into a large number of connected components, each of which tends to be small. What's more, each patch (I'll assume) tends to only affect a small number of these connected components. In other words, the plan will be:

- keep track (incrementally) of connected components made up of deleted nodes,
- when applying or reverting a patch, figure out which connected components were touched, and only recompute paths among the live nodes that are on the boundary of one of the dirty connected components.

Before talking about algorithms, here are some pictures that should help unpack what it is that I actually mean. Here is a graph containing three connected components of deleted nodes (represented by the rounded rectangles):

When I delete node h, it gets added to one of the connected components, and I can update relevant pseudo-edges without looking at the other two connected components:

If I delete node d then it will cause all of the connected components to merge:

This isn't hard to handle, it just means that we should run our pseudo-edge-checking algorithm on the merged component.

Maintaining the components

To maintain the partition of deleted nodes into connected components, we use a disjoint-set data structure. This is very fast (pretty close to constant time) when applying patches, because applying patches can only enlarge deleted components. It's slower when reverting patches, because the disjoint-set algorithm doesn't allow splitting: when reverting patches, connected components could split into smaller ones. Our approach is to defer the splitting: we just mark the original connected component as dirty. When it comes time to compute the pseudo-edges, we explore the original component, and figure out what the new connected pieces are.

The disjoint-set data structure is implemented in the ojo_partition subcrate. It appears in the Graggle struct; note also the dirty_reps member: that's for keeping track of which parts in the partition have been modified by a patch and require recomputing pseudo-edges.

We recompute the components here. Specifically, we consider the subgraph consisting only of nodes that belong to one of the dirty connected components. We run Tarjan's algorithm on that subgraph to find out what the new connected components are. On each of those components, we recompute the pseudo-edges.

Recomputing the pseudo-edges

The algorithm for this is: after deleting the node, look at the deleted connected component that it belongs to, including the "boundary" consisting of live nodes:

Using depth-first search, check which of the live boundary nodes (in this case, just a and i) are connected by a path within that component (in this case, they are). If so, add a pseudo-edge. The complexity of this algorithm is O(nm), where n is the number of boundary nodes, and m is the total number of nodes in the component, including the boundary (because we need to run n DFSes, and each one takes O(m) time). The hope here is that m and n are small, even for large histories. For example, I hope that n is almost always 2; at least, this is the case if the final live graph is totally ordered.

This algorithm is implemented here.
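Here is a self-contained sketch of that boundary-to-boundary search, using plain adjacency lists and hash sets rather than ojo's actual representation:

```rust
use std::collections::{HashMap, HashSet};

type Node = usize;

/// One deleted connected component, plus its "boundary" of adjacent
/// live nodes (as produced by the disjoint-set bookkeeping above).
struct Component {
    deleted: HashSet<Node>,
    boundary: HashSet<Node>,
}

/// Out-edges of the full graph, deleted nodes included.
type Adj = HashMap<Node, Vec<Node>>;

/// For each boundary node, run one DFS confined to this component's
/// deleted nodes and emit a pseudo-edge to every other boundary node
/// it can reach: n DFSes of O(m) each, hence O(nm).
fn pseudo_edges(adj: &Adj, comp: &Component) -> Vec<(Node, Node)> {
    let mut out = Vec::new();
    for &start in &comp.boundary {
        // A pseudo-edge must cross at least one deleted node; direct
        // live-to-live edges are already in the graph.
        let mut stack: Vec<Node> = adj
            .get(&start)
            .into_iter()
            .flatten()
            .copied()
            .filter(|n| comp.deleted.contains(n))
            .collect();
        let mut seen: HashSet<Node> = HashSet::new();
        let mut reachable: HashSet<Node> = HashSet::new();
        while let Some(v) = stack.pop() {
            if !seen.insert(v) {
                continue; // already visited this deleted node
            }
            for &w in adj.get(&v).into_iter().flatten() {
                if comp.deleted.contains(&w) {
                    stack.push(w); // stay inside the component
                } else if w != start && comp.boundary.contains(&w) {
                    reachable.insert(w); // live node reached via deleted path
                }
            }
        }
        out.extend(reachable.into_iter().map(|end| (start, end)));
    }
    out
}

fn main() {
    // Tiny smoke test: a -> x -> b with x deleted yields pseudo-edge a -> b.
    let adj: Adj = HashMap::from([(0, vec![1]), (1, vec![2]), (2, vec![])]);
    let comp = Component {
        deleted: HashSet::from([1]),
        boundary: HashSet::from([0, 2]),
    };
    assert_eq!(pseudo_edges(&adj, &comp), vec![(0, 2)]);
}
```

Each DFS stays inside a single component: it only walks deleted nodes, recording any live boundary node it touches, which is exactly where the O(nm) bound comes from.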
Unapplying, and pseudo-edge reasons

There's one more wrinkle in the pseudo-edge computation, and it has to do with reverting patches: if applying a patch created a pseudo-edge, removing a patch might cause that pseudo-edge to get deleted. But we have to be very careful when doing so, because a pseudo-edge might have multiple reasons for existing. You can see why in this example from before:

The pseudo-edge from a to d is caused independently by both the b -> c component and the cy -> cl -> e component. If by unapplying some patch we destroy the b -> c component but leave the cy -> cl -> e component untouched, we have to be sure not to delete the pseudo-edge from a to d.

The solution to this is to track the "reasons" for pseudo-edges, where each "reason" is a deleted connected component. This is a many-to-many mapping between connected deleted components and pseudo-edges, and it's stored in the pseudo_edge_reasons and reason_pseudo_edges members of the GraggleData struct. Once we store pseudo-edge reasons, it's easy to figure out when a pseudo-edge needs deleting: whenever its last reason becomes obsolete.

Pseudo-edge spamming: an optimization

We've finished describing ojo's algorithm for keeping pseudo-edges up to date, but there's still room for improvement. Here, I'll describe a potential optimization that I haven't implemented yet. It's based on a simple, but not-quite-correct, algorithm for adding pseudo-edges incrementally: every time you mark a node as deleted, add a pseudo-edge from each of its in-neighbors to each of its out-neighbors. I call this "pseudo-edge spamming" because it just eagerly throws in as many pseudo-edges as needed. In pictures, if we have this graph

and we delete the "deleted" line, then we'll add a pseudo-edge from the in-neighbor of "deleted" (namely, "first") to the out-neighbor of "deleted" (namely, "last").

This algorithm has two problems. The first is that it isn't complete: you might also need to add pseudo-edges when adding an edge where at least one end is deleted. Consider this example, where our graph consists of two disconnected parts.

If we add an edge from "deleted 1" to "deleted 2", clearly we also need to add a pseudo-edge between each of the "first" nodes and each of the "last" nodes. In order to handle this case, we really do need to explore the deleted connected component (which could be slow).

The second problem with our pseudo-edge spamming algorithm is that it doesn't handle reverting patches: it only describes how to add pseudo-edges, not delete them.

The nice thing about pseudo-edge spamming is that even if it isn't completely correct, it can be used as a fast-path in the correct algorithm: when applying a patch, if it modifies the boundary of a deleted connected component that isn't already dirty, use pseudo-edge spamming to update the pseudo-edges (and don't mark the component as dirty). In every other case, fall back to the previous algorithm.
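For illustration, here is the not-quite-correct spamming rule rendered literally, with an invented in-memory graph type (not ojo's): deleting a node connects each of its in-neighbors directly to each of its out-neighbors.

```rust
use std::collections::{HashMap, HashSet};

type Node = usize;

/// An invented toy graph, not ojo's: out/in adjacency plus a deleted
/// set and an accumulated list of pseudo-edges.
struct Graph {
    out_edges: HashMap<Node, Vec<Node>>,
    in_edges: HashMap<Node, Vec<Node>>,
    deleted: HashSet<Node>,
    pseudo: Vec<(Node, Node)>,
}

impl Graph {
    /// Pseudo-edge spamming: on deleting `v`, connect every in-neighbor
    /// of `v` to every out-neighbor of `v`. As noted above, this is
    /// incomplete (it misses edges later added between deleted nodes)
    /// and it cannot undo anything, so it is only safe as a fast path
    /// on components that are not already dirty.
    fn delete_node(&mut self, v: Node) {
        self.deleted.insert(v);
        let ins = self.in_edges.get(&v).cloned().unwrap_or_default();
        let outs = self.out_edges.get(&v).cloned().unwrap_or_default();
        for &a in &ins {
            for &b in &outs {
                self.pseudo.push((a, b));
            }
        }
    }
}

fn main() {
    // first -> deleted -> last: deleting the middle node spams first -> last.
    let mut g = Graph {
        out_edges: HashMap::from([(0, vec![1]), (1, vec![2])]),
        in_edges: HashMap::from([(1, vec![0]), (2, vec![1])]),
        deleted: HashSet::new(),
        pseudo: Vec::new(),
    };
    g.delete_node(1);
    assert_eq!(g.pseudo, vec![(0, 2)]);
}
```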
# Document Title
Local-first software
You own your data, in spite of the cloud

Martin Kleppmann
Adam Wiggins
Peter van Hardenberg
Mark McGranaghan

April 2019

Cloud apps like Google Docs and Trello are popular because they enable real-time collaboration with colleagues, and they make it easy for us to access our work from all of our devices. However, by centralizing data storage on servers, cloud apps also take away ownership and agency from users. If a service shuts down, the software stops functioning, and data created with that software is lost.

In this article we propose “local-first software”: a set of principles for software that enables both collaboration and ownership for users. Local-first ideals include the ability to work offline and collaborate across multiple devices, while also improving the security, privacy, long-term preservation, and user control of data.

We survey existing approaches to data storage and sharing, ranging from email attachments to web apps to Firebase-backed mobile apps, and we examine the trade-offs of each. We look at Conflict-free Replicated Data Types (CRDTs): data structures that are multi-user from the ground up while also being fundamentally local and private. CRDTs have the potential to be a foundational technology for realizing local-first software.

We share some of our findings from developing local-first software prototypes at Ink & Switch over the course of several years. These experiments test the viability of CRDTs in practice, and explore the user interface challenges for this new data model. Lastly, we suggest some next steps for moving towards local-first software: for researchers, for app developers, and a startup opportunity for entrepreneurs.

This article has also been published in PDF format in the proceedings of the Onward! 2019 conference. Please cite it as:

Martin Kleppmann, Adam Wiggins, Peter van Hardenberg, and Mark McGranaghan. Local-first software: you own your data, in spite of the cloud. 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software (Onward!), October 2019, pages 154–178. doi:10.1145/3359591.3359737

We welcome your feedback: @inkandswitch or hello@inkandswitch.com.

Contents

Motivation: collaboration and ownership
Seven ideals for local-first software
1. No spinners: your work at your fingertips
2. Your work is not trapped on one device
3. The network is optional
4. Seamless collaboration with your colleagues
5. The Long Now
6. Security and privacy by default
7. You retain ultimate ownership and control
Existing data storage and sharing models
How application architecture affects user experience
Developer infrastructure for building apps
Towards a better future
CRDTs as a foundational technology
Ink & Switch prototypes
How you can help
Conclusions
Acknowledgments

Motivation: collaboration and ownership

It’s amazing how easily we can collaborate online nowadays. We use Google Docs to collaborate on documents, spreadsheets and presentations; in Figma we work together on user interface designs; we communicate with colleagues using Slack; we track tasks in Trello; and so on. We depend on these and many other online services, e.g. for taking notes, planning projects or events, remembering contacts, and a whole raft of business uses.

We will call these services “cloud apps,” but you could just as well call them “SaaS” or “web-based apps.” What they have in common is that we typically access them through a web browser or through mobile apps, and that they store their data on a server.

Today’s cloud apps offer big benefits compared to earlier generations of software: seamless collaboration, and being able to access data from any device. As we run more and more of our lives and work through these cloud apps, they become more and more critical to us. The more time we invest in using one of these apps, the more valuable the data in it becomes to us.

However, in our research we have spoken to a lot of creative professionals, and in that process we have also learned about the downsides of cloud apps.

When you have put a lot of creative energy and effort into making something, you tend to have a deep emotional attachment to it. If you do creative work, this probably seems familiar. (When we say “creative work,” we mean not just visual art, or music, or poetry — many other activities, such as explaining a technical topic, implementing an intricate algorithm, designing a user interface, or figuring out how to lead a team towards some goal are also creative efforts.)

Our research on software that supports the creative process is discussed further in our articles Capstone, a tablet for thinking and The iPad as a fast, precise tool for creativity.

In the process of performing that creative work, you typically produce files and data: documents, presentations, spreadsheets, code, notes, drawings, and so on. And you will want to keep that data: for reference and inspiration in the future, to include it in a portfolio, or simply to archive because you feel proud of it. It is important to feel ownership of that data, because the creative expression is something so personal.

Unfortunately, cloud apps are problematic in this regard. Although they let you access your data anywhere, all data access must go via the server, and you can only do the things that the server will let you do. In a sense, you don’t have full ownership of that data — the cloud provider does. In the words of a bumper sticker: “There is no cloud, it’s just someone else’s computer.”

We use the term “ownership” not in the sense of intellectual property law and copyright, but rather as the creator’s perceived relationship to their data. We discuss this notion in a later section.

When data is stored on “someone else’s computer”, that third party assumes a degree of control over that data. Cloud apps are provided as a service; if the service is unavailable, you cannot use the software, and you can no longer access your data created with that software. If the service shuts down, even though you might be able to export your data, without the servers there is normally no way for you to continue running your own copy of that software. Thus, you are at the mercy of the company providing the service.

Before web apps came along, we had what we might call “old-fashioned” apps: programs running on your local computer, reading and writing files on the local disk.
We still use a lot of applications of this type today: text editors and IDEs, Git and other version control systems, and many specialized software packages such as graphics applications or CAD software fall in this category.

The software we are talking about in this article consists of apps for creating documents or files (such as text, graphics, spreadsheets, CAD drawings, or music), or personal data repositories (such as notes, calendars, to-do lists, or password managers). We are not talking about implementing things like banking services, e-commerce, social networking, ride-sharing, or similar services, which are well served by centralized systems.

In old-fashioned apps, the data lives in files on your local disk, so you have full agency and ownership of that data: you can do anything you like, including long-term archiving, making backups, manipulating the files using other programs, or deleting the files if you no longer want them. You don’t need anybody’s permission to access your files, since they are yours. You don’t have to depend on servers operated by another company.

To sum up: the cloud gives us collaboration, but old-fashioned apps give us ownership. Can’t we have the best of both worlds?

We would like both the convenient cross-device access and real-time collaboration provided by cloud apps, and also the personal ownership of your own data embodied by “old-fashioned” software.

## Seven ideals for local-first software

We believe that data ownership and real-time collaboration are not at odds with each other. It is possible to create software that has all the advantages of cloud apps, while also allowing you to retain full ownership of the data, documents and files you create.

We call this type of software local-first software, since it prioritizes the use of local storage (the disk built into your computer) and local networks (such as your home WiFi) over servers in remote datacenters.

In cloud apps, the data on the server is treated as the primary, authoritative copy of the data; if a client has a copy of the data, it is merely a cache that is subordinate to the server. Any data modification must be sent to the server, otherwise it “didn’t happen.” In local-first applications we swap these roles: we treat the copy of the data on your local device — your laptop, tablet, or phone — as the primary copy. Servers still exist, but they hold secondary copies of your data in order to assist with access from multiple devices. As we shall see, this change in perspective has profound implications.

Here are seven ideals we would like to strive for in local-first software.

### 1. No spinners: your work at your fingertips

Much of today’s software feels slower than previous generations of software. Even though CPUs have become ever faster, there is often a perceptible delay between some user input (e.g. clicking a button, or hitting a key) and the corresponding result appearing on the display. In previous work we measured the performance of modern software and analyzed why these delays occur.

[Figure: Server-to-server round-trip times between AWS datacenters in various locations worldwide. Data from: Peter Bailis, Aaron Davidson, Alan Fekete, et al.: “Highly Available Transactions: Virtues and Limitations,” VLDB 2014.]

With cloud apps, since the primary copy of the data is on a server, all data modifications, and many data lookups, require a round-trip to a server.
Depending on where you live, the server may well be located on another continent, so the speed of light places a limit on how fast the software can be.

The user interface may try to hide that latency by showing the operation as if it were complete, even though the request is still in progress — a pattern known as Optimistic UI — but until the request is complete, there is always the possibility that it may fail (for example, due to an unstable Internet connection). Thus, an optimistic UI still sometimes exposes the latency of the network round-trip when an error occurs.

Local-first software is different: because it keeps the primary copy of the data on the local device, there is never a need for the user to wait for a request to a server to complete. All operations can be handled by reading and writing files on the local disk, and data synchronization with other devices happens quietly in the background.

While this by itself does not guarantee that the software will be fast, we expect that local-first software has the potential to respond near-instantaneously to user input, never needing to show you a spinner while you wait, and allowing you to operate with your data at your fingertips.

### 2. Your work is not trapped on one device

Users today rely on several computing devices to do their work, and modern applications must support such workflows. For example, users may capture ideas on the go using their smartphone, organize and think through those ideas on a tablet, and then type up the outcome as a document on their laptop.

This means that while local-first apps keep their data in local storage on each device, it is also necessary for that data to be synchronized across all of the devices on which a user does their work. Various data synchronization technologies exist, and we discuss them in detail in a later section.

Most cross-device sync services also store a copy of the data on a server, which provides a convenient off-site backup for the data. These solutions work quite well as long as each file is only edited by one person at a time. If several people edit the same file at the same time, conflicts may arise, which we discuss in the section on collaboration.

### 3. The network is optional

Personal mobile devices move through areas of varying network availability: unreliable coffee shop WiFi, while on a plane or on a train going through a tunnel, in an elevator or a parking garage. In developing countries or rural areas, infrastructure for Internet access is sometimes patchy. While traveling internationally, many mobile users disable cellular data due to the cost of roaming. Overall, there is plenty of need for offline-capable apps, such as for researchers or journalists who need to write while in the field.

“Old-fashioned” apps work fine without an Internet connection, but cloud apps typically don’t work while offline. For several years the Offline First movement has been encouraging developers of web and mobile apps to improve offline support, but in practice it has been difficult to retrofit offline support to cloud apps, because tools and libraries designed for a server-centric model do not easily adapt to situations in which users make edits while offline.

Although it is possible to make web apps work offline, it can be difficult for a user to know whether all the necessary code and data for an application have been downloaded.

Since local-first applications store the primary copy of their data in each device’s local filesystem, the user can read and write this data anytime, even while offline.
It is then synchronized with other devices sometime later, when a network connection is available. The data synchronization need not necessarily go via the Internet: local-first apps could also use Bluetooth or local WiFi to sync data to nearby devices.

Moreover, for good offline support it is desirable for the software to run as a locally installed executable on your device, rather than a tab in a web browser. For mobile apps it is already standard that the whole app is downloaded and installed before it is used.

### 4. Seamless collaboration with your colleagues

Collaboration typically requires that several people contribute material to a document or file. However, in old-fashioned software it is problematic for several people to work on the same file at the same time: the result is often a conflict. In text files such as source code, resolving conflicts is tedious and annoying, and the task quickly becomes very difficult or impossible for complex file formats such as spreadsheets or graphics documents. Hence, collaborators may have to agree up front who is going to edit a file, and only have one person at a time who may make changes.

[Figure: A “conflicted copy” on Dropbox. The user must merge the changes manually.]

[Figure: In Evernote, if a note is changed concurrently, it is moved to a “conflicting changes” notebook, and there is nothing to support the user in resolving the situation — not even a facility to compare the different versions of a note.]

[Figure: In Git and other version control systems, several people may modify the same file in different commits. Combining those changes often results in merge conflicts, which can be resolved using specialized tools (such as DiffMerge, shown here). These tools are primarily designed for line-oriented text files such as source code; for other file formats, tool support is much weaker.]

On the other hand, cloud apps such as Google Docs have vastly simplified collaboration by allowing multiple users to edit a document simultaneously, without having to send files back and forth by email and without worrying about conflicts. Users have come to expect this kind of seamless real-time collaboration in a wide range of applications.

In local-first apps, our ideal is to support real-time collaboration that is on par with the best cloud apps today, or better. Achieving this goal is one of the biggest challenges in realizing local-first software, but we believe it is possible: in a later section we discuss technologies that enable real-time collaboration in a local-first setting.

Moreover, we expect that local-first apps can support various workflows for collaboration. Besides having several people edit the same document in real-time, it is sometimes useful for one person to tentatively propose changes that can be reviewed and selectively applied by someone else. Google Docs supports this workflow with its suggesting mode, and pull requests serve this purpose in GitHub.

[Figure: In Google Docs, collaborators can either edit the document directly, or they can suggest changes, which can then be accepted or rejected by the document owner.]

[Figure: The collaboration workflow on GitHub is based on pull requests. A user may change multiple source files in multiple commits, and submit them as a proposed change to a project. Other users may review and amend the pull request before it is finally merged or rejected.]
### 5. The Long Now

An important aspect of data ownership is that you can continue accessing the data for a long time in the future. When you do some work with local-first software, your work should continue to be accessible indefinitely, even after the company that produced the software is gone.

[Figure: Cuneiform script on clay tablet, ca. 3000 BCE. Image from Wikimedia Commons.]

“Old-fashioned” apps continue to work forever, as long as you have a copy of the data and some way of running the software. Even if the software author goes bust, you can continue running the last released version of the software. Even if the operating system and the computer it runs on become obsolete, you can still run the software in a virtual machine or emulator. As storage media evolve over the decades, you can copy your files to new storage media and continue to access them.

The Internet Archive maintains a collection of historical software that can be run using an emulator in a modern web browser; enthusiasts at the English Amiga Board share tips on running historical software.

On the other hand, cloud apps depend on the service continuing to be available: if the service is unavailable, you cannot use the software, and you can no longer access your data created with that software. This means you are betting that the creators of the software will continue supporting it for a long time — at least as long as you care about the data.

Our incredible journey is a blog that documents startup products getting shut down after an acquisition.

Although there does not seem to be a great danger of Google shutting down Google Docs anytime soon, popular products do sometimes get shut down or lose data, so we know to be careful. And even with long-lived software there is the risk that the pricing or features change in a way you don’t like, and with a cloud app, continuing to use the old version is not an option — you will be upgraded whether you like it or not.

Local-first software enables greater longevity because your data, and the software that is needed to read and modify your data, are all stored locally on your computer. We believe this is important not just for your own sake, but also for future historians who will want to read the documents we create today. Without longevity of our data, we risk creating what Vint Cerf calls a “digital Dark Age.”

We have previously written about long-term archiving of web pages. For an interesting discussion of long-term data preservation, see “The Cuneiform Tablets of 2015”, a paper by Long Tien Nguyen and Alan Kay at Onward! 2015.

Some file formats (such as plain text, JPEG, and PDF) are so ubiquitous that they will probably be readable for centuries to come. The US Library of Congress also recommends XML, JSON, or SQLite as archival formats for datasets. However, in order to read less common file formats and to preserve interactivity, you need to be able to run the original software (if necessary, in a virtual machine or emulator). Local-first software enables this.

### 6. Security and privacy by default

One problem with the architecture of cloud apps is that they store all the data from all of their users in a centralized database. This large collection of data is an attractive target for attackers: a rogue employee, or a hacker who gains access to the company’s servers, can read and tamper with all of your data.
Such security breaches are sadly all too common, and with cloud apps we are unfortunately at the mercy of the provider.

While Google has a world-class security team, the sad reality is that most companies do not. And while Google is good at defending your data against external attackers, the company internally is free to use your data in myriad ways, such as feeding your data into its machine learning systems.

Quoting from the Google Drive terms of service: “Our automated systems analyze your content to provide you personally relevant product features, such as customized search results, and spam and malware detection.”

Maybe you feel that your data would not be of interest to any attacker. However, for many professions, dealing with sensitive data is an important part of their work. For example, medical professionals handle sensitive patient data, investigative journalists handle confidential information from sources, governments and diplomatic representatives conduct sensitive negotiations, and so on. Many of these professionals cannot use cloud apps due to regulatory compliance and confidentiality obligations.

Local-first apps, on the other hand, have better privacy and security built in at the core. Your local devices store only your own data, avoiding the centralized cloud database holding everybody’s data. Local-first apps can use end-to-end encryption so that any servers that store a copy of your files only hold encrypted data that they cannot read.

Modern messaging apps like iMessage, WhatsApp and Signal already use end-to-end encryption, Keybase provides encrypted file sharing and messaging, and Tarsnap takes this approach for backups. We hope to see this trend expand to other kinds of software as well.

### 7. You retain ultimate ownership and control

With cloud apps, the service provider has the power to restrict user access: for example, in October 2017, several Google Docs users were locked out of their documents because an automated system incorrectly flagged these documents as abusive. In local-first apps, the ownership of data is vested in the user.

To disambiguate “ownership” in this context: we don’t mean it in the legal sense of intellectual property. A word processor, for example, should be oblivious to the question of who owns the copyright in the text being edited. Instead we mean ownership in the sense of user agency, autonomy, and control over data. You should be able to copy and modify data in any way, write down any thought, and no company should restrict what you are allowed to do.

Under the European Convention on Human Rights, your freedom of thought and opinion is unconditional — the state may never interfere with it, since it is yours alone — whereas freedom of expression (including freedom of speech) can be restricted in certain ways, since it affects other people. Communication services like social networks convey expression, but the raw notes and unpublished work of a creative person are a way of developing thoughts and opinions, and thus warrant unconditional protection.

In cloud apps, the ways in which you can access and modify your data are limited by the APIs, user interfaces, and terms of service of the service provider.
With local-first software, all of the bytes that comprise your data are stored on your own device, so you have the freedom to process this data in arbitrary ways.

With data ownership comes responsibility: maintaining backups or other preventative measures against data loss, protecting against ransomware, and general organizing and managing of file archives. For many professional and creative users, as discussed in the introduction, we believe that the trade-off of more responsibility in exchange for more ownership is desirable. Consider a significant personal creation, such as a PhD thesis or the raw footage of a film. For these you might be willing to take responsibility for storage and backups in order to be certain that your data is safe and fully under your control.

In our opinion, maintaining control and ownership of data does not mean that the software must necessarily be open source. Although the freedom to modify software enhances user agency, it is possible for commercial and closed-source software to satisfy the local-first ideals, as long as it does not artificially restrict what users can do with their files. Examples of such artificial restrictions are PDF files that disable operations like printing, eBook readers that interfere with copy-paste, and DRM on media files.

## Existing data storage and sharing models

We believe professional and creative users deserve software that realizes the local-first goals, helping them collaborate seamlessly while also allowing them to retain full ownership of their work. If we can give users these qualities in the software they use to do their most important work, we can help them be better at what they do, and potentially make a significant difference to many people’s professional lives.

However, while the ideals of local-first software may resonate with you, you may still be wondering how achievable they are in practice. Are they just utopian thinking?

In the remainder of this article we discuss what it means to realize local-first software in practice. We look at a wide range of existing technologies and break down how well they satisfy the local-first ideals. In the following tables, ✓ means the technology meets the ideal, — means it partially meets the ideal, and ✗ means it does not meet the ideal.

As we shall see, many technologies satisfy some of the goals, but none are able to satisfy them all. Finally, we examine a technique from the cutting edge of computer science research that might be a foundational piece in realizing local-first software in the future.

### How application architecture affects user experience

Let’s start by examining software from the end user’s perspective, and break down how well different software architectures meet the seven goals of local-first software. In the next section we compare storage technologies and APIs that are used by software engineers to build applications.

#### Files and email attachments

| | 1. Fast | 2. Multi-device | 3. Offline | 4. Collaboration | 5. Longevity | 6. Privacy | 7. User control |
|---|---|---|---|---|---|---|---|
| Files + email attachments | ✓ | — | ✓ | ✗ | ✓ | — | ✓ |

Viewed through the lens of our seven goals, traditional files have many desirable properties: they can be viewed and edited offline, they give full control to users, and they can readily be backed up and preserved for the long term. Software relying on local files also has the potential to be very fast.

However, accessing files from multiple devices is trickier.
It is possible to transfer a file across devices using various technologies:

- Sending it back and forth by email;
- Passing a USB drive back and forth;
- Via a distributed file system such as a NAS server, NFS, FTP, or rsync;
- Using a cloud file storage service like Dropbox, Google Drive, or OneDrive (see later section);
- Using a version control system such as Git (see later section).

Of these, email attachments are probably the most common sharing mechanism, especially among users who are not technical experts. Attachments are easy to understand and trustworthy. Once you have a copy of a document, it does not spontaneously change: if you view an email six months later, the attachments are still there in their original form. Unlike a web app, an attachment can be opened without any additional login process.

The weakest point of email attachments is collaboration. Generally, only one person at a time can make changes to a file, otherwise a difficult manual merge is required. File versioning quickly becomes messy: a back-and-forth email thread with attachments often leads to filenames such as Budget draft 2 (Jane's version) final final 3.xls.

Nevertheless, for apps that want to incorporate local-first ideas, a good starting point is to offer an export feature that produces a widely-supported file format (e.g. plain text, PDF, PNG, or JPEG) and allows it to be shared e.g. via email attachment, Slack, or WhatsApp.

#### Web apps: Google Docs, Trello, Figma, Pinterest, etc.

| | 1. Fast | 2. Multi-device | 3. Offline | 4. Collaboration | 5. Longevity | 6. Privacy | 7. User control |
|---|---|---|---|---|---|---|---|
| Google Docs | — | ✓ | — | ✓ | — | ✗ | — |
| Trello | — | ✓ | — | ✓ | — | ✗ | ✗ |
| Pinterest | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ |

At the opposite end of the spectrum are pure web apps, where the user’s local software (web browser or mobile app) is a thin client and the data storage resides on a server. The server typically uses a large-scale database in which the data of millions of users are all mixed together in one giant collection.

Web apps have set the standard for real-time collaboration. As a user you can trust that when you open a document on any device, you are seeing the most current and up-to-date version. This is so overwhelmingly useful for team work that these applications have become dominant. Even traditionally local-only software like Microsoft Office is making the transition to cloud services, with Office 365 eclipsing locally-installed Office as of 2017.

With the rise of remote work and distributed teams, real-time collaborative productivity tools are becoming even more important. Ten users on a team video call can bring up the same Trello board and each make edits on their own computer while simultaneously seeing what other users are doing.

The flip side to this is a total loss of ownership and control: the data on the server is what counts, and any data on your client device is unimportant — it is merely a cache. Most web apps have little or no support for offline working: if your network hiccups for even a moment, you are locked out of your work mid-sentence.

[Figure: If Google Docs detects that it is offline, it blocks editing of the document.]

A few of the best web apps hide the latency of server communication using JavaScript, and try to provide limited offline support (for example, the Google Docs offline plugin). However, these efforts appear retrofitted to an application architecture that is fundamentally centered on synchronous interaction with a server.
Users report mixed results when trying to work offline.

[Figure: A negative user review of the Google Docs offline extension.]

Some web apps, for example Milanote and Figma, offer installable desktop clients that are essentially repackaged web browsers. If you try to use these clients to access your work while your network is intermittent, while the vendor’s servers are experiencing an outage, or after the vendor has been acquired and shut down, it becomes clear that your work was never truly yours.

[Figure: The Figma desktop client showing an offline error message.]

#### Dropbox, Google Drive, Box, OneDrive, etc.

| | 1. Fast | 2. Multi-device | 3. Offline | 4. Collaboration | 5. Longevity | 6. Privacy | 7. User control |
|---|---|---|---|---|---|---|---|
| Dropbox | ✓ | — | — | ✗ | ✓ | — | ✓ |

Cloud-based file sync products like Dropbox, Google Drive, Box, or OneDrive make files available on multiple devices. On desktop operating systems (Windows, Linux, Mac OS) these tools work by watching a designated folder on the local file system. Any software on your computer can read and write files in this folder, and whenever a file is changed on one computer, it is automatically copied to all of your other computers.

As these tools use the local filesystem, they have many attractive properties: access to local files is fast, and working offline is no problem (files edited offline are synced the next time an Internet connection is available). If the sync service were shut down, your files would still remain unharmed on your local disk, and it would be easy to switch to a different syncing service. If your computer’s hard drive fails, you can restore your work simply by installing the app and waiting for it to sync. This provides good longevity and control over your data.

However, on mobile platforms (iOS and Android), Dropbox and its cousins use a completely different model. The mobile apps do not synchronize an entire folder — instead, they are thin clients that fetch your data from a server one file at a time, and by default they do not work offline. There is a “Make available offline” option, but you need to remember to invoke it ahead of going offline; it is clumsy, and it only works while the app is open. The Dropbox API is also very server-centric.

[Figure: Users of the Dropbox mobile app spend a lot of time looking at spinners, a stark contrast to the at-your-fingertips feeling of the Dropbox desktop product.]

The weakest point of file sync products is the lack of real-time collaboration: if the same file is edited on two different devices, the result is a conflict that needs to be merged manually, as discussed previously. The fact that these tools synchronize files in any format is both a strength (compatibility with any application) and a weakness (inability to perform format-specific merges).

#### Git and GitHub

| | 1. Fast | 2. Multi-device | 3. Offline | 4. Collaboration | 5. Longevity | 6. Privacy | 7. User control |
|---|---|---|---|---|---|---|---|
| Git+GitHub | ✓ | — | ✓ | — | ✓ | — | ✓ |

Git and GitHub are primarily used by software engineers to collaborate on source code. They are perhaps the closest thing we have to a true local-first software package: compared to server-centric version control systems such as Subversion, Git works fully offline, it is fast, it gives full control to users, and it is suitable for long-term preservation of data.
This is the case because a Git repository on your local filesystem is a primary copy of the data, and is not subordinate to any server.

We focus on Git/GitHub here as the most successful examples, but these lessons also apply to other distributed revision control tools like Mercurial or Darcs, and other repository hosting services such as GitLab or Bitbucket. In principle it is possible to collaborate without a repository service, e.g. by sending patch files by email, but the majority of Git users rely on GitHub.

A repository hosting service like GitHub enables collaboration around Git repositories, accessing data from multiple devices, as well as providing a backup and archival location. Support for mobile devices is currently weak, although Working Copy is a promising Git client for iOS. GitHub stores repositories unencrypted; if stronger privacy is required, it is possible for you to run your own repository server.

We think the Git model points the way toward a future for local-first software. However, as it currently stands, Git has two major weaknesses:

1. Git is excellent for asynchronous collaboration, especially using pull requests, which take a coarse-grained set of changes and allow them to be discussed and amended before merging them into the shared master branch. But Git has no capability for real-time, fine-grained collaboration, such as the automatic, instantaneous merging that occurs in tools like Google Docs, Trello, and Figma.
2. Git is highly optimized for code and similar line-based text files; other file formats are treated as binary blobs that cannot meaningfully be edited or merged. Despite GitHub’s efforts to display and compare images, prose, and CAD files, non-textual file formats remain second-class in Git.

It’s interesting to note that most software engineers have been reluctant to embrace cloud software for their editors, IDEs, runtime environments, and build tools. In theory, we might expect this demographic of sophisticated users to embrace newer technologies sooner than other types of users. But if you ask an engineer why they don’t use a cloud-based editor like Cloud9 or Repl.it, or a runtime environment like Colaboratory, the answers will usually include “it’s too slow” or “I don’t trust it” or “I want my code on my local system.” These sentiments seem to reflect some of the same motivations as local-first software. If we as developers want these things for ourselves and our work, perhaps we might imagine that other types of creative professionals would want these same qualities for their own work.

### Developer infrastructure for building apps

Now that we have examined the user experience of a range of applications through the lens of the local-first ideals, let’s switch mindsets to that of an application developer. If you are creating an app and want to offer users some or all of the local-first experience, what are your options for data storage and synchronization infrastructure?

#### Web app (thin client)

| | 1. Fast | 2. Multi-device | 3. Offline | 4. Collaboration | 5. Longevity | 6. Privacy | 7. User control |
|---|---|---|---|---|---|---|---|
| Web apps | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ |

A web app in its purest form is usually a Rails, Django, PHP, or Node.js program running on a server, storing its data in a SQL or NoSQL database, and serving web pages over HTTPS.
All of the data is on the server, and the user’s web browser is only a thin client.

This architecture offers many benefits: zero installation (just visit a URL), and nothing for the user to manage, as all data is stored and managed in one place by the engineering and DevOps professionals who deploy the application. Users can access the application from all of their devices, and colleagues can easily collaborate by logging in to the same application.

JavaScript frameworks such as Meteor and ShareDB, and services such as Pusher and Ably, make it easier to add real-time collaboration features to web applications, building on top of lower-level protocols such as WebSocket.

On the other hand, a web app that needs to perform a request to a server for every user action is going to be slow. It is possible to hide the round-trip times in some cases by using client-side JavaScript, but these approaches quickly break down if the user’s internet connection is unstable.

Despite many efforts to make web browsers more offline-friendly (manifests, localStorage, service workers, and Progressive Web Apps, among others), the architecture of web apps remains fundamentally server-centric. Offline support is an afterthought in most web apps, and the result is accordingly fragile. In many web browsers, if the user clears their cookies, all data in local storage is also deleted; while this is not a problem for a cache, it makes the browser’s local storage unsuitable for storing data of any long-term importance.

News website The Guardian documents how they used service workers to build an offline experience for their users.

Relying on third-party web apps also scores poorly in terms of longevity, privacy, and user control. It is possible to improve these properties if the web app is open source and users are willing to self-host their own instances of the server. However, we believe that self-hosting is not a viable option for the vast majority of users who do not want to become system administrators; moreover, most web apps are closed source, ruling out this option entirely.

All in all, we speculate that web apps will never be able to provide all the local-first properties we are looking for, due to the fundamental thin-client nature of the platform. By choosing to build a web app, you are choosing the path of data belonging to you and your company, not to your users.

#### Mobile app with local storage (thick client)

| | 1. Fast | 2. Multi-device | 3. Offline | 4. Collaboration | 5. Longevity | 6. Privacy | 7. User control |
|---|---|---|---|---|---|---|---|
| Thick client | ✓ | — | ✓ | ✗ | — | ✗ | ✗ |

iOS and Android apps are locally installed software, with the entire app binary downloaded and installed before the app is run. Many apps are nevertheless thin clients, similarly to web apps, which require a server in order to function (for example, Twitter, Yelp, or Facebook). Without a reliable Internet connection, these apps give you spinners, error messages, and unexpected behavior.

However, there is another category of mobile apps that are more in line with the local-first ideals. These apps store data on the local device in the first instance, using a persistence layer like SQLite, Core Data, or just plain files. Some of these (such as Clue or Things) started life as a single-user app without any server, and then added a cloud backend later, as a way to sync between devices or share data with other users.

These thick-client apps have the advantage of being fast and working offline, because the server sync happens in the background. They generally continue working if the server is shut down.
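To make the thick-client pattern concrete, here is a minimal sketch in JavaScript (our illustration, not code from any of the apps named above; the file name and sync endpoint are placeholders). The local write is the primary operation, and server sync is a best-effort background task:

```js
// Sketch of the thick-client pattern: persist locally first, sync in
// the background. Assumes Node 18+ (for the global fetch).
const fs = require('fs')

const NOTES_FILE = 'notes.json'  // placeholder for the app's local store

function loadNotes() {
  try {
    return JSON.parse(fs.readFileSync(NOTES_FILE, 'utf8'))
  } catch {
    return []  // first run: no local file yet
  }
}

function saveNote(note) {
  const notes = loadNotes()
  notes.push(note)
  // The local write is what makes the app usable, with or without a
  // network connection.
  fs.writeFileSync(NOTES_FILE, JSON.stringify(notes, null, 2))
  syncInBackground(notes)
}

function syncInBackground(notes) {
  // Best-effort upload; failure does not affect the local copy.
  fetch('https://example.com/api/sync', {   // placeholder endpoint
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(notes),
  }).catch(() => { /* offline: retry later */ })
}

saveNote({ text: 'Written on a plane', createdAt: Date.now() })
```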
The degree to which they offer privacy and user control over data varies depending on the app in question.

Things get more difficult if the data may be modified on multiple devices or by multiple collaborating users. The developers of mobile apps are generally experts in end-user app development, not in distributed systems. We have seen multiple app development teams writing their own ad-hoc diffing, merging, and conflict resolution algorithms, and the resulting data sync solutions are often unreliable and brittle. A more specialized storage backend, as discussed in the next section, can help.

#### Backend-as-a-Service: Firebase, CloudKit, Realm

| | 1. Fast | 2. Multi-device | 3. Offline | 4. Collaboration | 5. Longevity | 6. Privacy | 7. User control |
|---|---|---|---|---|---|---|---|
| Firebase, CloudKit, Realm | — | ✓ | ✓ | — | ✗ | ✗ | ✗ |

Firebase is the most successful of the mobile backend-as-a-service options. It is essentially a local on-device database combined with a cloud database service and data synchronization between the two. Firebase allows sharing of data across multiple devices, and it supports offline use. However, as a proprietary hosted service, we give it a low score for privacy and longevity.

Another popular backend-as-a-service was Parse, but it was acquired by Facebook and shut down in 2017. Apps relying on it were forced to move to other backend services, underlining the importance of longevity.

Firebase offers a great experience for you, the developer: you can view, edit, and delete data in a free-form way in the Firebase console. But the user does not have a comparable way of accessing, manipulating and managing their data, leaving the user with little ownership and control.

[Figure: The Firebase console: great for developers, off-limits for the end user.]

Apple’s CloudKit offers a Firebase-like experience for apps willing to limit themselves to the iOS and Mac platforms. It is a key-value store with syncing, good offline capabilities, and it has the added benefit of being built into the platform (thereby sidestepping the clumsiness of users having to create an account and log in). It’s a great choice for indie iOS developers and is used to good effect by tools like Ulysses, Bear, Overcast, and many more.

[Figure: With one checkbox, Ulysses syncs work across all of the user’s connected devices, thanks to its use of CloudKit.]

Another project in this vein is Realm. This persistence library for iOS gained popularity compared to Core Data due to its cleaner API. The client-side library for local persistence is called Realm Database, while the associated Firebase-like backend service is called Realm Object Server. Notably, the object server is open source and self-hostable, which reduces the risk of being locked in to a service that might one day disappear.

Mobile apps that treat the on-device data as the primary copy (or at least more than a disposable cache), and use sync services like Firebase or iCloud, get us a good bit of the way toward local-first software.

#### CouchDB

| | 1. Fast | 2. Multi-device | 3. Offline | 4. Collaboration | 5. Longevity | 6. Privacy | 7. User control |
|---|---|---|---|---|---|---|---|
| CouchDB | — | — | ✓ | ✗ | — | — | — |

CouchDB is a database that is notable for pioneering a multi-master replication approach: several machines each have a fully-fledged copy of the database, each replica can independently make changes to the data, and any pair of replicas can synchronize with each other to exchange the latest changes.
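As an illustration of this replication model, here is a minimal sketch using PouchDB, a client-side sibling project of CouchDB that speaks the same sync protocol (introduced in the next paragraph). The database name and remote URL are placeholders:

```js
// Sketch of CouchDB-style multi-master replication using PouchDB.
const PouchDB = require('pouchdb')

const local = new PouchDB('notes')                          // replica on this device
const remote = new PouchDB('http://localhost:5984/notes')   // any other replica

// Each replica accepts writes independently, even while offline...
local.put({ _id: 'note:2019-04-01', text: 'Written offline' })
  .catch(err => console.error(err))

// ...and any pair of replicas can sync with each other to exchange
// the latest changes. Conflicts are surfaced to application code.
local.sync(remote, { live: true, retry: true })
  .on('change', info => console.log('replicated:', info.direction))
  .on('error', err => console.error(err))
```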
CouchDB is designed for use on servers; Cloudant provides a hosted version; PouchDB and Hoodie are sibling projects that use the same sync protocol but are designed to run on end-user devices.

Philosophically, CouchDB is closely aligned to the local-first principles, as evidenced in particular by the CouchDB book, which provides an excellent introduction to relevant topics such as distributed consistency, replication, change notifications, and multiversion concurrency control.

While CouchDB/PouchDB allow multiple devices to concurrently make changes to a database, these changes lead to conflicts that need to be explicitly resolved by application code. This conflict resolution code is difficult to write correctly, making CouchDB impractical for applications with very fine-grained collaboration, like in Google Docs, where every keystroke is potentially an individual change.

In practice, the CouchDB model has not been widely adopted. Various reasons have been cited for this: scalability problems when a separate database per user is required; difficulty embedding the JavaScript client in native apps on iOS and Android; the problem of conflict resolution; the unfamiliar MapReduce model for performing queries; and more. All in all, while we agree with much of the philosophy behind CouchDB, we feel that the implementation has not been able to realize the local-first vision in practice.

## Towards a better future

As we have shown, none of the existing data layers for application development fully satisfy the local-first ideals. Thus, three years ago, our lab set out to search for a solution that gives seven green checkmarks.

| | 1. Fast | 2. Multi-device | 3. Offline | 4. Collaboration | 5. Longevity | 6. Privacy | 7. User control |
|---|---|---|---|---|---|---|---|
| ??? | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |

We have found some technologies that appear to be promising foundations for local-first ideals. Most notable is the family of distributed systems algorithms called Conflict-free Replicated Data Types (CRDTs).

### CRDTs as a foundational technology

CRDTs emerged from academic computer science research in 2011. They are general-purpose data structures, like hash maps and lists, but the special thing about them is that they are multi-user from the ground up.

Every application needs some data structures to store its document state. For example, if your application is a text editor, the core data structure is the array of characters that make up the document. If your application is a spreadsheet, the data structure is a matrix of cells containing text, numbers, or formulas referencing other cells. If it is a vector graphics application, the data structure is a tree of graphical objects such as text objects, rectangles, lines, and other shapes.

If you are building a single-user application, you would maintain those data structures in memory using model objects, hash maps, lists, records/structs and the like. If you are building a collaborative multi-user application, you can swap out those data structures for CRDTs.

[Figure: Two devices initially have the same to-do list. On device 1, a new item is added to the list using the .push() method, which appends a new item to the end of a list. Concurrently, the first item is marked as done on device 2. After the two devices communicate, the CRDT automatically merges the states so that both changes take effect.]

The diagram above shows an example of a to-do list application backed by a CRDT with a JSON data model. Users can view and modify the application state on their local device, even while offline.
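To make this scenario concrete, here is a minimal sketch using the classic JavaScript API of Automerge, the CRDT library we introduce below (the document shape is our own illustration):

```js
// Sketch of the to-do scenario in the figure above, using Automerge.
const Automerge = require('automerge')

// Both devices start from the same document.
let device1 = Automerge.from({ todos: [{ title: 'Buy milk', done: false }] })
let device2 = Automerge.merge(Automerge.init(), device1)

// Concurrently: device 1 appends a new item to the list...
device1 = Automerge.change(device1, doc => {
  doc.todos.push({ title: 'Water plants', done: false })
})

// ...while device 2 marks the first item as done.
device2 = Automerge.change(device2, doc => {
  doc.todos[0].done = true
})

// After the devices communicate, both changes take effect:
// [ { title: 'Buy milk', done: true },
//   { title: 'Water plants', done: false } ]
const merged = Automerge.merge(device1, device2)
console.log(merged.todos)
```

Because the two concurrent changes touch different parts of the document, the merge needs no manual conflict resolution.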
The CRDT keeps track of any changes that are made, and syncs the changes with other devices in the background when a network connection is available.

If the state was concurrently modified on different devices, the CRDT merges those changes. For example, if users concurrently add new items to the to-do list on different devices, the merged state contains all of the added items in a consistent order. Concurrent changes to different objects can also be merged easily. The only type of change that a CRDT cannot automatically resolve is when multiple users concurrently update the same property of the same object; in this case, the CRDT keeps track of the conflicting values, and leaves it to be resolved by the application or the user.

Thus, CRDTs have some similarity to version control systems like Git, except that they operate on richer data types than text files. CRDTs can sync their state via any communication channel (e.g. via a server, over a peer-to-peer connection, by Bluetooth between local devices, or even on a USB stick). The changes tracked by a CRDT can be as small as a single keystroke, enabling Google Docs-style real-time collaboration. But you could also collect a larger set of changes and send them to collaborators as a batch, more like a pull request in Git. Because the data structures are general-purpose, we can develop general-purpose tools for storage, communication, and management of CRDTs, saving us from having to re-implement those things in every single app.

For a more technical introduction to CRDTs we suggest:

- Alexei Baboulevitch’s Data Laced with History
- Martin Kleppmann’s Convergence vs Consensus (slides)
- Shapiro et al.’s comprehensive survey
- Attiya et al.’s formal specification of collaborative text editing
- Gomes et al.’s formal verification of CRDTs

Ink & Switch has developed an open-source, JavaScript CRDT implementation called Automerge. It is based on our earlier research on JSON CRDTs. We then combined Automerge with the Dat networking stack to form Hypermerge. We do not claim that these libraries fully realize local-first ideals — more work is still required.

However, based on our experience with them, we believe that CRDTs have the potential to be a foundation for a new generation of software. Just as packet switching was an enabling technology for the Internet and the web, or as capacitive touchscreens were an enabling technology for smartphones, so we think CRDTs may be the foundation for collaborative software that gives users full ownership of their data.

### Ink & Switch prototypes

While academic research has made good progress designing the algorithms for CRDTs and verifying their theoretical correctness, there is so far relatively little industrial use of these technologies. Moreover, most industrial CRDT use has been in server-centric computing, but we believe this technology has significant potential in client-side applications for creative work.

Server-centric systems using CRDTs include Azure Cosmos DB, Redis, Riak, Weave Mesh, SoundCloud’s Roshi, and Facebook’s OpenR. However, we are most interested in the use of CRDTs on end-user devices.

This was the motivation for our lab to embark on a series of experimental prototypes with collaborative, local-first applications built on CRDTs. Each prototype offered an end-user experience modeled after an existing app for creative work such as Trello, Figma, or Milanote.

These experiments explored questions in three areas:

**Technology viability.** How close are CRDTs to being usable for working software?
What do we need for network communication, or installation of the software to begin with?

**User experience.** How does local-first software feel to use? Can we get a seamless Google Docs-like real-time collaboration experience without an authoritative centralized server? How about a Git-like, offline-friendly, asynchronous collaboration experience for data types other than source code? And generally, how are user interfaces different without a centralized server?

**Developer experience.** For an app developer, how does the use of a CRDT-based data layer compare to existing storage layers like a SQL database, a filesystem, or Core Data? Is a distributed system harder to write software for? Do we need schemas and type checking? What will developers use for debugging and introspection of their application’s data layer?

We built three prototypes using Electron, JavaScript, and React. This gave us the rapid development capability of web technologies while also giving our users a piece of software they can download and install, which we discovered is an important part of the local-first feeling of ownership.

#### Kanban board

Trellis is a Kanban board modeled after the popular Trello project management software.

[Figure: Trellis offers a Trello-like experience with local-first software. The change history on the right reflects changes made by all users active in the document.]

On this project we experimented with WebRTC for the network communication layer.

On the user experience side, we designed a rudimentary “change history” inspired by Git and Google Docs’ “See New Changes” that allows users to see the operations on their Kanban board. This includes stepping back in time to view earlier states of the document.

Watch Trellis in action with the demo video or download a release and try it yourself.

#### Collaborative drawing

PixelPusher is a collaborative drawing program, bringing a Figma-like real-time experience to Javier Valencia’s Pixel Art to CSS.

[Figure: Drawing together in real-time. A URL at the top offers a quick way to share this document with other users. The “Versions” panel on the right shows all branches of the current document. The arrow buttons offer instant merging between branches.]

On this project we experimented with network communication via peer-to-peer libraries from the Dat project.

User experience experiments include URLs for document sharing, a visual branch/merge facility inspired by Git, a conflict-resolution mechanism that highlights conflicted pixels in red, and basic user identity via user-drawn avatars.

Read the full project report or download a release to try it yourself.

#### Media canvas

PushPin is a mixed media canvas workspace similar to Miro or Milanote. As our third project built on Automerge, it’s the most fully-realized of these three. Real use by our team and external test users put more strain on the underlying data layer.

[Figure: PushPin’s canvas mixes text, images, discussion threads, and web links. Users see each other via presence avatars in the toolbar, and navigate between their own documents using the URL bar.]
PushPin explored nested and connected shared documents, varied renderers for CRDT documents, a more advanced identity system that included an “outbox” model for sharing, and support for sharing ephemeral data such as selection highlights.

Watch the PushPin demo video or download a release and try it yourself.

#### Findings

Our goal in developing the three prototypes Trellis, PixelPusher and PushPin was to evaluate the technology viability, user experience, and developer experience of local-first software and CRDTs. We tested the prototypes by regularly using them within the development team (consisting of five members), reflecting critically on our experiences developing the software, and by conducting individual usability tests with approximately ten external users. The external users included professional designers, product managers, and software engineers. We did not follow a formal evaluation methodology, but rather took an exploratory approach to discovering the strengths and weaknesses of our prototypes.

In this section we outline the lessons we learned from building and using these prototypes. While these findings are somewhat subjective, we believe they nevertheless contain valuable insights, because we have gone further than other projects down the path towards production-ready local-first applications based on CRDTs.

**CRDT technology works.** From the beginning we were pleasantly surprised by the reliability of Automerge. App developers on our team were able to integrate the library with relative ease, and the automatic merging of data was almost always straightforward and seamless.

**The user experience with offline work is splendid.** The process of going offline, continuing to work for as long as you want, and then reconnecting to merge changes with colleagues worked well. While other applications on the system threw up errors (“offline! warning!”) and blocked the user from working, the local-first prototypes function normally regardless of network status. Unlike browser-based systems, there is never any anxiety about whether the application will work or the data will be there when the user needs it. This gives the user a feeling of ownership over their tools and their work, just as we had hoped.

**Developer experience is viable when combined with Functional Reactive Programming (FRP).** The FRP model of React fits well with CRDTs. A data layer based on CRDTs means the user’s document is simultaneously getting updates from the local user (e.g. as they type into a text document) but also from the network (as other users and other devices make changes to the document).

Because the FRP model reliably synchronizes the visible state of the application with the underlying state of the shared document, the developer is freed from the tedious work of tracking changes arriving from other users and reconciling them with the current view. Also, by ensuring all changes to the underlying state are made through a single function (a “reducer”), it’s easy to ensure that all relevant local changes are sent to other users.

The result of this model was that all of our prototypes realized real-time collaboration and full offline capability with little effort from the application developer.
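As a sketch of this update loop (our illustration, not code from the prototypes): all local edits flow through one change function, all remote updates flow through one merge function, and the UI re-renders from the single resulting document. Here `render` stands in for a React subscription, and the network layer is omitted:

```js
// Sketch of the single-source-of-truth update loop described above.
const Automerge = require('automerge')

let doc = Automerge.from({ cards: [] })

function render(state) {
  // Placeholder for a React re-render triggered by the new state.
  console.log('re-render with', state.cards.length, 'cards')
}

// All local edits go through one function, reducer-style...
function changeDoc(changeFn) {
  doc = Automerge.change(doc, changeFn)
  render(doc)
}

// ...and all remote updates go through another.
function receiveRemote(remoteDoc) {
  doc = Automerge.merge(doc, remoteDoc)
  render(doc)
}

changeDoc(d => d.cards.push({ text: 'hello' }))  // e.g. a local keystroke
```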
This is a significant benefit as it allows app developers to focus on their application rather than the challenges of data distribution.Conflicts are not as significant a problem as we feared.We are often asked about the effectiveness of automatic merging, and many people assume that application-specific conflict resolution mechanisms are required. However, we found that users surprisingly rarely encounter conflicts in their work when collaborating with others, and that generic resolution mechanisms work well. The reasons for this are:Automerge tracks changes at a fine-grained level, and takes datatype semantics into account. For example, if two users concurrently insert items at the same position into an array, Automerge combines these changes by positioning the two new items in a deterministic order. In contrast, a textual version control system like Git would treat this situation as a conflict requiring manual resolution.Users have an intuitive sense of human collaboration and avoid creating conflicts with their collaborators. For example, when users are collaboratively editing an article, they may agree in advance who will be working on which section for a period of time, and avoid concurrently modifying the same section.When different users concurrently modify different parts of the document state, Automerge will merge these changes cleanly without difficulty. With the Kanban app, for example, one user could post a comment on a card and another could move it to another column, and the merged result will reflect both of these changes. Conflicts arise only if users concurrently modify the same property of the same object: for example, if two users concurrently change the position of the same image object on a canvas. In such cases, it is often arbitrary how they are resolved and satisfactory either way.Automerge’s data structures come with a small set of default resolution policies for concurrent changes. In principle, one might expect different applications to require different merge semantics. However, in all the prototypes we developed, we found that the default merge semantics to be sufficient, and we have so far not identified any case requiring customised semantics. We hypothesise that this is the case generally, and we hope that future research will be able to further test this hypothesis.Visualizing document history is important.In a distributed collaborative system another user can deliver any number of changes to you at any moment. Unlike centralized systems, where servers mediate change, local-first applications need to find their own solutions to these problems. Without the right tools, it can be difficult to understand how a document came to look the way it does, what versions of the document exist, or where contributions came from.In the Trellis project we experimented with a “time travel” interface, allowing a user to move back in time to see earlier states of a merged document, and automatically highlighting recently changed elements as changes are received from other users. The ability to traverse a potentially complex merged document history in a linear fashion helps to provide context and could become a universal tool for understanding collaboration.URLs are a good mechanism for sharing.We experimented with a number of mechanisms for sharing documents with other users, and found that a URL model, inspired by the web, makes the most sense to users and developers. URLs can be copied and pasted, and shared via communication channels such as email or chat. 
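As a sketch of what we mean (the URL scheme below is hypothetical, not PushPin's actual format): a document URL is simply an encoding of a document identifier, which the receiving app hands to its sync layer to fetch and subscribe to that document.

```js
// Hypothetical sketch of URL-based document sharing. The scheme name
// and structure are our illustration only.
function shareUrl(docId) {
  return `localfirst://document/${encodeURIComponent(docId)}`
}

function openSharedUrl(url) {
  const m = url.match(/^localfirst:\/\/document\/(.+)$/)
  if (!m) throw new Error('not a document URL')
  const docId = decodeURIComponent(m[1])
  // Hand the ID to the sync layer, which fetches the document from any
  // reachable peer and subscribes to future changes (omitted here).
  return docId
}

console.log(openSharedUrl(shareUrl('a1b2c3')))  // -> 'a1b2c3'
```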
Access permissions for documents beyond secret URLs remain an open research question.

**Peer-to-peer systems are never fully “online” or “offline,” and it can be hard to reason about how data moves in them.** A traditional centralized system is generally “up” or “down,” states defined, for each client, by its ability to maintain a steady network connection to the server. The server determines the truth of a given piece of data.

In a decentralized system, we can have a kaleidoscopic complexity to our data. Any user may have a different perspective on what data they either have, choose to share, or accept. For example, one user’s edits to a document might be on their laptop on an airplane; when the plane lands and the computer reconnects, those changes are distributed to other users. Other users might choose to accept all, some, or none of those changes to their version of the document.

Different versions of a document can lead to confusion. As with a Git repository, what a particular user sees in the “master” branch is a function of the last time they communicated with other users. Newly arriving changes might unexpectedly modify parts of the document you are working on, but manually merging every change from every user is tedious. Decentralized documents enable users to be in control over their own data, but further study is needed to understand what this means in practical user-interface terms.

**CRDTs accumulate a large change history, which creates performance problems.** Our team used PushPin for “real” documents such as sprint planning. Performance and memory/disk usage quickly became a problem because CRDTs store all history, including character-by-character text edits. These pile up, but can’t easily be truncated because it’s impossible to know when someone might reconnect to your shared document after six months away and need to merge changes from that point forward.

We continue to optimize Automerge, but this is a major area of ongoing work.

**Network communication remains an unsolved problem.** CRDT algorithms provide only for the merging of data, but say nothing about how different users’ edits arrive on the same physical computer.

In these experiments we tried network communication via WebRTC; a “sneakernet” implementation of copying files around with Dropbox and USB keys; possible use of the IPFS protocols; and eventually settled on the Hypercore peer-to-peer libraries from Dat.

CRDTs do not require a peer-to-peer networking layer; using a server for communication is fine for CRDTs. However, to fully realize the longevity goal of local-first software, we want applications to outlive any backend services managed by their vendors, so a decentralized solution is the logical end goal.

The use of P2P technologies in our prototypes yielded mixed results. On one hand, these technologies are nowhere near production-ready: NAT traversal, in particular, is unreliable depending on the particular router or network topology where the user is currently connected. But the promise suggested by P2P protocols and the Decentralized Web community is substantial. Live collaboration between computers without Internet access feels like magic in a world that has come to depend on centralized APIs.

**Cloud servers still have their place for discovery, backup, and burst compute.** A real-time collaborative prototype like PushPin lets users share their documents with other users without an intermediating server.
Hashbase is an example of a cloud peer and bridge for Dat and Beaker Browser.

Similarly, cloud peers could be:

- an archival/backup location (especially for phones or other devices with limited storage);
- a bridge to traditional server APIs (such as weather forecasts or stock tickers);
- a provider of burst computing resources (like rendering a video using a powerful GPU).

The key difference between traditional systems and local-first systems is not an absence of servers, but a change in their responsibilities: they are in a supporting role, not the source of truth.

How you can help

These experiments suggest that local-first software is possible. Collaboration and ownership are not at odds with each other — we can get the best of both worlds, and users can benefit.

However, the underlying technologies are still a work in progress. They are good for developing prototypes, and we hope that they will evolve and stabilize in the coming years, but realistically, it is not yet advisable to replace a proven product like Firebase with an experimental project like Automerge in a production setting today.

If you believe in a local-first future, as we do, what can you (and all of us in the technology field) do to move us toward it? Here are some suggestions.

For distributed systems and programming languages researchers

Local-first software has benefited tremendously from recent research into distributed systems, including CRDTs and peer-to-peer technologies. The research community is making excellent progress in improving the performance and power of CRDTs, and we eagerly await further results from that work. Still, there are interesting opportunities for further work.

Most CRDT research operates in a model where all collaborators immediately apply their edits to a single version of a document. However, practical local-first applications require more flexibility: users must have the freedom to reject edits made by another collaborator, or to make private changes to a version of the document that is not shared with others. A user might want to apply changes speculatively or reformat their change history. These concepts are well understood in the distributed source control world as "branches," "forks," "rebasing," and so on. There is little work to date on understanding the algorithms and programming models for collaboration in situations where multiple document versions and branches exist side by side.

We see further interesting problems around types, schema migrations, and compatibility. Different collaborators may be using different versions of an application, potentially with different features. As there is no central database server, there is no authoritative "current" schema for the data. How can we write software so that varying application versions can safely interoperate, even as data formats evolve?
This question has analogues in cloud-based API design, but a local-first setting brings additional challenges.

For Human-Computer Interaction (HCI) researchers

For centralized systems, there are ample examples in the field today of applications that indicate their "sync" state with a server. Decentralized systems bring a whole host of interesting new user interface challenges to explore.

We hope researchers will consider how to communicate online and offline states — or available and unavailable states — in systems where any other user may hold a different copy of the data. How should we think about connectivity when everyone is a peer? What does it mean to be "online" when we can collaborate directly with other nodes without access to the wider Internet?

[Figure: an example Git commit history as visualized by GitX — the "railroad track" model used for visualizing the structure of source code history in a Git repository.]

When every document can develop a complex version history, simply through daily operation, an acute problem arises: how do we communicate this version history to users? How should users think about versioning, share and accept changes, and understand how their documents came to be a certain way when there is no central source of truth? Today there are two mainstream models for change management: a source-code model of diffs and patches, and a Google Docs model of suggestions and comments. Are these the best we can do? How do we generalize these ideas to data formats that are not text? We are eager to see what can be discovered.

While centralized systems rely heavily on access control and permissions, the same concepts do not directly apply in a local-first context. For example, any user who has a copy of some data cannot be prevented from locally modifying it; however, other users may choose whether or not to subscribe to those changes. How should users think about sharing, permissions, and feedback? If we can't remove documents from others' computers, what does it mean to "stop sharing" with someone?

We believe that the assumption of centralization is deeply ingrained in our user experiences today, and we are only beginning to discover the consequences of changing that assumption. We hope these open questions will inspire researchers to explore what we believe is an untapped area.

For practitioners

If you're a software engineer, designer, product manager, or independent app developer working on production-ready software today, how can you help? We suggest taking incremental steps toward a local-first future. Start by scoring your app against the seven ideals:

1. Fast
2. Multi-device
3. Offline
4. Collaboration
5. Longevity
6. Privacy
7. User control

Then consider some strategies for improving each area:

Fast. Aggressive caching and downloading resources ahead of time can be a way to prevent the user from seeing spinners when they open your app or a document they previously had open. Trust the local cache by default instead of making the user wait for a network fetch.

Multi-device. Syncing infrastructure like Firebase and iCloud makes multi-device support relatively painless, although it does introduce longevity and privacy concerns. Self-hosted infrastructure like Realm Object Server provides an alternative trade-off.

Offline. In the web world, Progressive Web Apps offer features like Service Workers and app manifests that can help. In the mobile world, be aware of WebKit frames and other network-dependent components.
Test your app by turning off your WiFi, or by using traffic shapers such as the Chrome DevTools network condition simulator or the iOS network link conditioner.

Collaboration. Besides CRDTs, the more established technology for real-time collaboration is Operational Transformation (OT), as implemented e.g. in ShareDB.

Longevity. Make sure your software can easily export to flattened, standard formats like JSON or PDF. For example: mass export such as Google Takeout; continuous backup into stable file formats, as in GoodNotes; and JSON download of documents, as in Trello.

Privacy. Cloud apps are fundamentally non-private, with employees of the company and governments able to peek at user data at any time. But for mobile or desktop applications, try to make clear to users when the data is stored only on their device versus being transmitted to a backend.

User control. Can users easily back up, duplicate, or delete some or all of their documents within your application? Often this involves re-implementing all the basic filesystem operations, as Google Docs has done with Google Drive.

Call for startups

If you are an entrepreneur interested in building developer infrastructure, all of the above suggests an interesting market opportunity: "Firebase for CRDTs."

Such a startup would need to offer a great developer experience and a local persistence library (something like SQLite or Realm). It would need to be available for mobile platforms (iOS, Android), native desktop (Windows, Mac, Linux), and web technologies (Electron, Progressive Web Apps).

User control, privacy, multi-device support, and collaboration would all be baked in. Application developers could focus on building their app, knowing that the easiest implementation path would also give them top marks on the local-first scorecard. As a litmus test to see if you have succeeded, we suggest: do all your customers' apps continue working in perpetuity, even if all servers are shut down?

We believe the "Firebase for CRDTs" opportunity will be huge as CRDTs come of age. We'd like to hear from you if you're working on this.

Conclusions

Computers are one of the most important creative tools mankind has ever produced. Software has become the conduit through which our work is done and the repository in which that work resides.

In the pursuit of better tools we moved many applications to the cloud. Cloud software is in many regards superior to "old-fashioned" software: it offers collaborative, always-up-to-date applications, accessible from anywhere in the world. We no longer worry about what software version we are running, or what machine a file lives on.

However, in the cloud, ownership of data is vested in the servers, not the users, and so we became borrowers of our own data. The documents created in cloud apps are destined to disappear when the creators of those services cease to maintain them. Cloud services defy long-term preservation. No Wayback Machine can restore a sunsetted web application. The Internet Archive cannot preserve your Google Docs.

In this article we explored a new way forward for software of the future. We have shown that it is possible for users to retain ownership and control of their data, while also benefiting from the features we associate with the cloud: seamless collaboration and access from anywhere. It is possible to get the best of both worlds.

But more work is needed to realize the local-first approach in practice.
Application developers can take incremental steps, such as improving offline support and making better use of on-device storage. Researchers can continue improving the algorithms, programming models, and user interfaces for local-first software. Entrepreneurs can develop foundational technologies such as CRDTs and peer-to-peer networking into mature products able to power the next generation of applications.

Today it is easy to create a web application in which the server takes ownership of all the data. But it is too hard to build collaborative software that respects users' ownership and agency. In order to shift the balance, we need to improve the tools for developing local-first software. We hope that you will join us.

We welcome your thoughts, questions, or critique: @inkandswitch or hello@inkandswitch.com.

Acknowledgments

Martin Kleppmann is supported by a grant from The Boeing Company. Thank you to our collaborators at Ink & Switch who worked on the prototypes discussed above: Julia Roggatz, Orion Henry, Roshan Choxi, Jeff Peterson, Jim Pick, and Ignatius Gilfedder. Thank you also to Heidi Howard, Roly Perera, and to the anonymous reviewers from Onward! for feedback on a draft of this article.
# Part 4: Line IDs

February 25, 2019

I've written quite a bit about the theory of patches and merging, but nothing yet about how to actually implement anything efficiently. That will be the subject of this post, and probably some future posts too. Algorithms and efficiency are not really discussed in the original paper, so most of this material I learned from reading the pijul source code. Having said that, my main focus here is on broader ideas and algorithms, so you shouldn't assume that anything written here is an accurate reflection of pijul (plus, my pijul knowledge is about 2 years out of date by now).

Three sizes

Before getting to the juicy details, we have to decide what it means for things to be fast. In a VCS, there are three different size scales that we need to think about. From smallest to biggest, we have:

1. The size of the change. Like, if we're changing one line in a giant file, then the size of the change is just the length of the one line that we're changing.
2. The size of the current output. In the case of ojo (which just tracks a single file), this is just the size of the file.
3. The size of the history. This includes everything that has ever been in the repository, so if the repository has been active for years then the size of the history could be much bigger than the current size.

The first obvious requirement is that the size of a patch should be proportional to the size of the change. This sounds almost too obvious to mention, but remember the definition of a patch from here: a patch consists of a source file, a target file, and a function from one to the other that has certain additional properties. If we were to naively translate this definition into code, the size of a patch would be proportional to the size of the entire file.

Of course, this is a solved problem in the world of UNIX-style diffs (which I mentioned all the way back in the first post). The problem is to adapt the diff approach to our mathematical patch framework; for example, the fact that our files need not even be ordered means that it doesn't make sense to talk about inserting a line "after line 62."

The key to solving this turns out to be unique IDs: give every line in the entire history of the repository a unique ID. This isn't even very difficult: we can give every patch in the history of the repository a unique ID by hashing its contents. For each patch, we can enumerate the lines that it adds, and then for the rest of time we can uniquely refer to those lines with names like "the third line added by patch Ar8f."

Representing patches

Once we've added unique IDs to every line, it becomes pretty easy to encode patches compactly. For example, suppose we want to describe this patch:

[Figure: the example patch described below.]

Here, dA is the unique ID of the patch that introduced the to-do, shoes, and garbage lines, and x5 is the unique ID of the patch that we want to describe. Anyway, the patch is now easy to describe by using the unique IDs to specify what we want to do: delete the line with ID dA/1, add the line with ID x5/0 and contents "work", and add an edge from the line dA/2 to the line x5/0.

Let's have a quick look at how this is implemented in ojo, by taking a peek at the API docs. Patches, funnily enough, are represented by the Patch struct, which basically consists of metadata (author, commit message, timestamp) and a list of Changes.
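Before looking at the changes themselves, here is a sketch of what those IDs might amount to (field names are illustrative, not necessarily ojo's exact definitions):

```rust
// A sketch of the ID scheme described above: a patch ID is a hash of
// the patch contents, and a line ID is that patch ID plus the index of
// the line among the lines the patch added.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct PatchId([u8; 32]); // e.g., a hash of the patch contents

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct NodeId {
    patch: PatchId, // which patch introduced the line
    line: u64,      // "the third line added by patch Ar8f" => line == 2
}
```

With this scheme, a line's identity never changes, no matter how the file around it is edited.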
The Changes are the most interesting part, and they look like this:

```rust
pub enum Change {
    NewNode { id: NodeId, contents: Vec<u8> },
    DeleteNode { id: NodeId },
    NewEdge { src: NodeId, dest: NodeId },
}
```

In other words, the example that we saw above is basically all there is to it, as far as patches go.

If you want to see what actual patches look like in actual usage, you can do that too, because ojo keeps all of its data in human-readable text. After installing ojo (with cargo install ojo), you can create a new repository (with ojo init), edit the file ojo_file.txt with your favorite editor, and then:

```
$ ojo patch create -m "Initial commit" -a Me
Created patch PMyANESmvMQ8WR8ccSKpnH8pLc-uyt0jzGkauJBWeqx4=
$ ojo patch export -o out.txt PSc97nCk9oRrRl-2IW3H8TYVtA0hArdVtj5F0f4YSqqs=
Successfully wrote the file 'out.txt'
```

Now look in out.txt to see your NewNodes and NewEdges in all their glory.

Antiquing

I introduced unique IDs as a way to achieve compact representations of patches, but it turns out that they also solve a problem that I promised to explain two years ago: how do I compute the "most antique" version of a patch? Or equivalently, if I have some patch but I want to apply it to a slightly different repository, how do I know whether I can do that? With our description of patches above, this is completely trivial: a patch can only add lines, delete lines, or add edges. Adding lines is always valid, no matter what the repository contains. Deleting lines and adding edges can be done if and only if the lines to delete, or the lines to connect, exist in the repository. Since lines have unique IDs, checking this is unambiguous. Actually, it's really easy, because the line IDs are tied to the patch that introduced them: a patch can be applied if and only if all the patch IDs that it refers to have already been applied. For obvious reasons, we refer to these as "dependencies": the dependencies of a patch are all the other patches that it refers to in DeleteNode and NewEdge commands. You can see this in action here.

By the way, this method will always give a minimal set of dependencies (in other words, the most antique version of a patch), but that isn't necessarily the right thing to do. For example, if a patch deletes a line, then it seems reasonable for it to also depend on the lines adjacent to the deleted line. Ojo might do this in the future, but for now it sticks to the minimal dependencies.
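As a sketch of that dependency computation (my own illustration, not ojo's actual code, written in terms of the Change enum above and the PatchId/NodeId types sketched earlier):

```rust
use std::collections::HashSet;

// Collect the patch IDs referenced by DeleteNode and NewEdge changes.
// NewNode adds no dependencies, because adding lines is always valid.
fn dependencies(changes: &[Change], this_patch: PatchId) -> HashSet<PatchId> {
    let mut deps = HashSet::new();
    for ch in changes {
        match ch {
            Change::NewNode { .. } => {} // always applicable
            Change::DeleteNode { id } => {
                deps.insert(id.patch);
            }
            Change::NewEdge { src, dest } => {
                deps.insert(src.patch);
                deps.insert(dest.patch);
            }
        }
    }
    // A patch may refer to lines it created itself; that is not a
    // dependency on another patch.
    deps.remove(&this_patch);
    deps
}
```

A patch is then applicable to a repository exactly when every patch ID in this set has already been applied.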
Applying patches

Now that we know how to compactly represent patches, how quickly can we apply them? To get really into detail here, we'd need to talk about how the state of the repository is represented on disk (which is an interesting topic on its own, but a bit out of scope for this post). Let's just pretend for now that the current state of the repository is stored as a graph in memory, using some general-purpose crate (like, say, petgraph). Each node in the graph needs to store the contents of the corresponding line, as well as a "tombstone" saying whether it has been deleted (see the first post). Assuming we can add nodes and edges in constant time (like, say, in petgraph), applying a single change is a constant-time operation. That means the time it takes to apply the whole patch is proportional to the number of changes. That's the best we could hope for, so we're done, right? What was even the point of the part about three size scales?

Revenge of the ghosts

Imagine you have a file that contains three lines:

```
first line
second line
last line
```

but behind the scenes, there are a bunch of lines that used to be there. So ojo's representation of your file might look like:

[Figure: the three live lines, interleaved with many deleted lines.]

Now let's imagine that we delete "second line." The patch to do this consists of a single DeleteNode command, and it takes almost no time to apply. Now that we have this internal representation, ojo needs to create a file on disk showing the new state. That is, we want to somehow go from the internal representation above to the file

```
first line
last line
```

Do you see the problem? Even though the output file is only two lines long, in order to produce it we need to visit all of the lines that used to be there but have since been deleted. In other words, we can apply patches quickly (in timescale 1), but rendering the output file is slow (in timescale 3). For a real VCS that tracks decade-old repositories, that clearly isn't going to fly.

Pseudo-edges

There are several ingredients that go into supporting fast rendering of output files (fast here means "timescale 2, most of the time," which is the best that we can hope for). Those are going to be the subject of the next post. So that you have something to think about until then, let me get you started: the key idea is to introduce "pseudo-edges," which are edges that we insert on our own in order to allow us to "skip" large regions of deleted lines. In the example above, the goal is to actually generate this graph:

[Figure: the same graph, plus a pseudo-edge jumping from "first line" directly to "last line."]

These extra edges will allow us to quickly render output files, but they open up a new can of worms (and were the source of several subtle bugs in pijul as of two years ago, when I last used it): how do we know when to add (or remove, or update) pseudo-edges? Keep in mind that we aren't willing to traverse the entire graph to compute the pseudo-edges, because that would defeat the purpose.
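To make the payoff concrete, here is a minimal sketch (my own illustration — the field names are invented, and it is simplified to a fully ordered graph with at most one successor per line) of how pseudo-edges let rendering skip tombstones:

```rust
// Assumption: every maximal run of deleted lines is bridged by a
// pseudo-edge pointing at the next live line.
struct Node {
    contents: String,
    deleted: bool,        // the "tombstone" flag
    next: Option<usize>,  // ordinary successor (may be a tombstone)
    skip: Option<usize>,  // pseudo-edge to the next live line, if any
}

// Rendering touches only live lines, so the work done is proportional
// to the size of the *output* (timescale 2), not the history.
fn render(nodes: &[Node], first: usize) -> String {
    let mut out = String::new();
    let mut cur = Some(first);
    while let Some(i) = cur {
        let n = &nodes[i];
        if !n.deleted {
            out.push_str(&n.contents);
            out.push('\n');
        }
        // Prefer the pseudo-edge: it jumps over runs of deleted lines.
        cur = n.skip.or(n.next);
    }
    out
}
```

The hard part, as the post says, is keeping those skip edges up to date as patches add and delete lines — without ever traversing the whole graph to recompute them.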
# Is there a VCS for a person who doesn't want to spend every day thinking about how they are supposed to use their VCS?

Mercurial. It is quite good. I used it for quite a while for my personal stuff, but migrated to Git because it is a useful thing to know if I ever decide to take a dev job.

CVS is easy to set up for local, personal use (set CVS_HOME to a local directory). I used it for a few years, initially out of curiosity, before using Mercurial.

I still use RCS for single files (mostly plain text documents).

Darcs and Pijul are interesting projects that have patches as a primitive. Darcs is known to be slow on bigger projects, and AFAIK Pijul is an attempt at a similar but faster system using Rust (Darcs is written in Haskell).

8 votes

hereticalgorithm — February 22, 2019 (edited February 22, 2019)

It might be a problem of git's model clashing with your intuition of how a VCS should work.

Darcs (more stable, but has performance issues with merging) & Pijul (experimental, faster algorithms) both operate on a patch-based model. Instead of tracking a series of snapshots in time (and branches horizontally), they track sets of changes (and possible combinations).

Users report that this model eliminates the ~~eldritch horror~~ unexpected behavior that ~~lurks dreaming beneath R'lyeh~~ is sometimes obscured by the porcelain interface but still lurks within the plumbing.

That being said, those two projects are a step closer towards "academic elegance/purity" (reflecting their theoretical origins) and away from "fast and dirty hacks" (Git was an emergency replacement for a proprietary VCS that suddenly cut ties w/ the Linux Foundation). This may make them a bad cultural/philosophical fit for an EE (or exactly what you've been missing!). YMMV

6 votes

teaearlgraycold — February 22, 2019

Fossil is also an option.

Personally I like git and would use it even for small projects.

4 votes

babypuncher — February 22, 2019

Git is as simple or complicated as you want it to be. Find a good GUI for it and forget about all the features you don't want to use. I don't think any "simpler" VCS will actually be easier in your day-to-day use than Git, if you're using the right client that offers a workflow you can learn easily.

3 votes

InherentlyGloomy — February 22, 2019

Mercurial is a well-established and widely used VCS. Subversion is another one, although I rarely hear people speak well of it.

CVS is an older one that's usually found in legacy systems. It does handle large files and binaries pretty well, to its credit.

I've heard good things about Bazaar, but I haven't used it personally.

1 vote

Silbern — February 22, 2019

Subversion was a very popular system back in its day, and tons of open source projects used it. The thing about Subversion is that it's a hierarchical VCS, vs. Git, which is distributed. Some of the greatest proponents of Git, and the greatest opponents of Subversion, were open source projects, for whom Git worked vastly better with their workflow of merging patches freeform and experimenting with different branches. If you want a hierarchical system, with its pros and cons, Subversion is actually a pretty good choice.

4 votes

Amarok — February 22, 2019 (edited February 22, 2019)

That's what we were using where I worked... once I managed to pull the 16 GB-ish of data the company had accumulated over decades inside Visual SourceSafe out and process it into something that Subversion could import without losing all of the versioning history. That was a fun science project.
The Perl scripts took almost three days to finish processing it all.

We got pretty drunk to celebrate the day we retired VSS. Good riddance.

We picked Subversion partly because it was an optimal target, built similar to VSS... but more than that, the Windows SVN server tied properly into Active Directory, so I could manage access to the source code using AD groups just like everything else. At the time there weren't many alternatives that could do that, and Git was a young pup, freshly minted, without that capability. TortoiseSVN integrated perfectly with Windows Explorer, and we had modules that made Subversion into a Visual Studio native. Everything just worked: no hassles, good security, lots of convenience.

1 vote

meghan — February 22, 2019

Git is too complicated on purpose because the core program is a CLI app. If you don't want to have to deal with git, then get a GUI app for it and you'll never have to touch the commands again if you don't want to. Some good options are https://desktop.github.com/ and https://www.sourcetreeapp.com/

mftrhu — February 22, 2019

If you just want to track changes to a single file, RCS is ancient but it works well enough.

I also use it for another reason (AKA laziness): the comma-vee (,v) files it creates are pretty distinctive, and I check in my system configuration files when I fiddle with them — both to keep a history, and to quickly find the ones I modified via locate.
# Part 3: Graggles can have cycles

February 19, 2019

Almost two years ago, I promised a series of three posts about version control. The first two (here and here) introduced a new (at the time) framework for version control. The third post, which I never finished, was going to talk about the data structures and algorithms used in pijul, a version control system built around that new framework. The problem is that pijul is a complex piece of software, and so I had lots of trouble wrapping my head around it.

Two years later, I'm finally ready to continue with this series of posts (but, having learned from my earlier mistakes, I'm not going to predict the total number of posts ahead of time). In the meantime, I've written my own toy version control system (VCS) to help me understand what's going on. It's called ojo, and it's extremely primitive: to start with, it can only track a single file. However, it is (just barely) sophisticated enough to demonstrate the important ideas. I'm also doing my best to make the code clear and well-documented.

Graggles can have cycles

As I try and ease back into this whole blogging business, let me just start with a short answer to something that several people have asked me (and which also confused me at some point). Graggles (which, as described in the earlier posts, are a kind of generalized file in which the lines are not necessarily ordered, but instead form a directed graph) are not DAGs; that is, they can have cycles. To see why, suppose we start out with this graggle:

[Figure: a graggle in which the "shoes" and "garbage" lines have no prescribed order.]

The reason this thing isn't a file is because there's no prescribed order between the "shoes" line and the "garbage" line. Now suppose that my wife and I independently flatten this graggle, but in different ways (because apparently she doesn't care if I get my feet wet).

[Figure: the two flattenings — one with "shoes" before "garbage", the other with "garbage" before "shoes".]

Merging these two flattenings will produce the following graggle:

[Figure: the merged graggle.]

Notice the cycle between "shoes" and "garbage"!

Although I was surprised when I first noticed that graggles could have cycles, if you think about it a bit more then it makes a lot of sense: one graggle put a "garbage" dependency on "shoes" and the other put a "shoes" dependency on "garbage," and so when you merge them a cycle naturally pops out.
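A tiny sketch of that example (the line indices and edges are invented to match the description above): each flattening contributes its ordering edges, and the merged edge set contains both orderings — that is, a cycle.

```rust
fn main() {
    let lines = ["to-do", "shoes", "garbage", "work"];
    // Each flattening is a total order, written as (from, to) edges.
    let mine = [(0, 1), (1, 2), (2, 3)]; // to-do, shoes, garbage, work
    let hers = [(0, 2), (2, 1), (1, 3)]; // to-do, garbage, shoes, work
    // The merge keeps the ordering edges from both flattenings.
    let merged: Vec<(usize, usize)> =
        mine.iter().chain(hers.iter()).copied().collect();
    // Both "shoes" -> "garbage" and "garbage" -> "shoes" survive:
    assert!(merged.contains(&(1, 2)) && merged.contains(&(2, 1)));
    println!("cycle between {} and {}", lines[1], lines[2]);
}
```

Neither edge can be dropped without losing one author's stated ordering, which is exactly why the cycle "naturally pops out" of the merge.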
A while ago I did a video on Git and why I thought Git needed to get good, as I put it, and wound up talking about Fossil, which is normally my go-to source control manager — or revision control system, version control system, or whatever you use to refer to these things. I got a request to talk about Fossil, and this was something I actually had planned to do. I have been busy and had some other things I wanted to do, but I'm more or less prepared to do this now.

I'm not going to get hardcore into details, mostly because Richard Hipp already has an absolutely great presentation on some issues with Git and why Fossil has a better approach, in his opinion. He's the author of Fossil, so, you know, obvious biases — but inevitably, when you are the author of a product, you designed it a certain way because you had needs for it to be that certain way.

We'll start off with installing it. If you're on Linux or BSD or anything like that, the installation procedure is fairly straightforward: use your package manager, and it's going to be called fossil, end of story. On Windows, if you are running a recent enough version of Windows, you will be able to use Find-Package and fossil, and it'll say there's a fossil package available from the Chocolatey source. If you are not on a recent enough version of Windows and Find-Package and the other package-management cmdlets are not present, you can install Chocolatey directly, although you can also just go to the Fossil web page and download the binary there. The only thing you have to be aware of with Windows is that by default the search path is a little not what command-line junkies typically expect, but it's not particularly hard to get it set up. In my case I have it set up so that anything Chocolatey and others wind up installing gets automatically put in there, so what I can do is Install-Package and fossil — because there's only one provider, I don't need to specify the provider — and then I can just go ahead and get this installed, and now it should be good to go. Granted, with a very odd-looking command name — you can tell it gets a bit confused — but, you know, it still runs exactly the same; on, say, Linux that would look a lot cleaner.

Fossil does something a little bit different from many other version control systems, especially distributed version control systems, that sort of confuses some people: Fossil is almost like a hybrid of centralized version control and decentralized version control, and it follows this in a lot of ways. One thing you'll need to keep in mind, much like with centralized version control, is that the repository is separate from your working directory. So we'll create the actual repository first. Now I'm going to move over to a different drive, which apparently... [unintelligible] ... oh yeah, I have an issue to resolve involving this connection not being made initially, so it was trying to connect when I was trying to move to the drive, before the connection had been made. It used to auto-mount it, and then after an update that stopped, and I just haven't tracked down the error because it's not that big of a deal, but I obviously need to get around to doing that.

So I actually want to create the repository itself over here, and this is done through fossil — I think just new will work. I mean, using Git so much recently — and much as I hate Git, using it for all the stuff that I hope people will contribute to — has caused me to forget some of the Fossil commands, but I think it's just init. So I have the actual repository name, and what this actually is — it should be unsurprising if you
recognize who Richard Hipp is — is actually a SQLite database. It's just a single file that is the repository, and it's obviously set up a special way, but it's just a SQLite database. You can name this anything; the extension doesn't matter. In my case, I'm just going to create a repository for the spin objects. I haven't really done anything with them yet, but I have a test file for the VS Code extension I was developing, so we can check that in there and it should work fine. ...Should not take this long... okay, looks like it went through. Okay.

So one of the first things you're going to want to do, of course, is configure it, and this is one of the actually really big selling points behind Fossil: I don't have to do anything — there's already a website, with branches and tags and even a ticketing system and a wiki system — and configuration is done through here. Assuming that it... okay, we're good, we're good.

One of the things I'm going to do — that's an interesting rendering error — I would not recommend doing this in the majority of cases, but since I am the only developer and I don't want to mess around with other things, is just giving myself full permissions. In a team setting — like, if anybody else winds up contributing to this, which is unlikely because the spin objects thing is not actually going to be public — then I would scale back my own permissions and, you know, assign them just the permissions they need, and similar. But it's a very convenient way of managing this kind of stuff. And of course contact info — I'm not going to bother with that — and a password; I will deal with this later.

One of the other admin tasks I strongly recommend doing at the beginning — I believe it's here — yes, the actual project name, which of course you do not want left as an unnamed Fossil project. Now, we'll just leave it as that; it's not really that important. It's an internal project, so it's not like this would be something that search engines would really scan, because it's internal. And that should go through — I'm not entirely sure why it doesn't go through the very first time when you hit Apply, but as you can see it did actually go through.

One setting I strongly recommend doing if you are intermixing operating systems
they can all havetheir working directories and then justcommit to the same repository then pushout from the business on to say get wellthat's not going to be github but thatthat same idea they can push out tochisel app or wherever else or in thecase of say multiple teams within anorganization they can each team can havetheir own repository on their system andthen you know push and pull throughoutthe entire organization as opposed towith say get where the repository andthe working directory are the same thingyou're left with say III I've seensomebody complain once that they had 32different instances of the samerepository on the machine just becausethey needed four but the entire team 32different checkoutsit's a little crazyespecially trying to synchronize thechanges between all of themso what we can do for this I'm justgoing to move back over to the desktopand we can go into the existingdirectory and then it should just befossil open and spin opticsand then we're openthis directory just became the workingdirectory for this repository and I canshow that through I think its fossilstatusyes so we have a National empty check-innow one of the things I commented onreally liking about fossil and justfinding absolutely obnoxious about gitis the way changes to existing files aredone in fossil when we want to add afile it can be done through a simple adand then I can just go ahead and add thethe test object now when I want tocommit that I can go ahead and commitand give it a message by default if youjust fossil commit it'll bring up thesystem default text editor I don't wantto configure this on Windows so I'm justpassing it the the messages through thezoo the command line which is finebecause both the command prompt andPowerShell have what's the term for themthe command line editors anyways sowe're not talking about the really olddays where you just did in once and youcould you know move through and edit themessageand of course because that's what it isnow even though we're in the workingdirectory and not on the repositorybecause the working directory knows whatrepository it came from we can still dothisthe only difference is because we're inthe working directory it opens from thetime line because that's usually themost interesting thing when you'reworking on things sothere was nothing for me to configurehere I didn't eat need to set up awebserver I didn't need to set up somethird-party component to display thisstuff I have avisual representation of the timelineyoubut this doesn't really show off what Iwas talking about just yet because thefile there already a you know addingthis the first time is basically thesame way as it's done in and get justcalled code so Bo we open up vs code andedit the filenothing in here at all yet so let's justgo through andso we've got the changes saved with getwhat you would need to do is add it addwhat's called staging you staged thechanges and then when you go to committhe changes it actually commits themwith fossil it just immediatelyrecognized that has changed because thefile already exists in the repository itgoes through and checks like is this onethe same do I need to do anything withit and just since it's changedautomatically knows that those changesshould be committed now this is thedefault behavior it is still possible todo staging like with git with fossilit's just that the default behavior iswhat at least hip believes is the morecommon case and at least with the way Idevelop this approach is definitely mycommon case very rarely do I want toonly commit part of 
what I've actually changed. We can go through and — now remember, I didn't actually add this back or anything, so if I were to just commit that, like with Git, it would not have actually committed the changes made. If we go view the file for this specific check-in, you can see that it actually did automatically commit this.

This is absolutely the single biggest reason why I prefer Fossil over Git and over most version control systems. It might just be the way I do development, but overwhelmingly, the changes I make, I want to check them in all at once; I don't work on things unrelated to what I am checking in, if that makes sense. I don't ever wind up having a need — shouldn't say that — I rarely ever wind up having a need to do partial check-ins. I've done them twice, out of a lot of commits — easily, you know, easily 600-plus commits — I've done them twice. So to have something like this, where the only time you are ever adding a file is when it's truly a new file, is just a major help, because I never wind up with situations in Fossil where I forgot to stage part of my commit and then have to make a commit immediately after, going "oh yeah, I forgot to add these things, here they are."

Now, the UI — the automatic web interface — and the fact that you can very easily couple this into an existing website using CGI are just absolutely huge reasons why I prefer Fossil, along with, like I had mentioned, the fact that the repository and the working directory are separate. So in team scenarios you can have the team share one repository and have several working directories from it, and then, you know, all the decentralized interactions that you need can be done based on the different repositories for the different teams, instead of a repository for every single developer. You can get this kind of setup with Git, although Fossil has another concept that makes it quite a bit easier to work with, called auto-sync — gotta remember where it was... auto-sync, yep, auto-sync is on. If I were to clone the repository I had set up — so, like, on another machine, create a clone based on this repository we just set up here — if auto-sync is on, what happens when you commit to it is that it also automatically syncs with the repository that it was cloned from. I mean cloned, not checked out — check-outs, sorry, are for working directories — I mean cloned; it'll auto-sync between the clones, which is extremely useful behavior in the typical way that, at least as I'm seeing it, things are done.

I would absolutely love something like this to exist for Git, because I've seen it all too often: when I migrated my stuff over to Git, I would commit changes and totally forget to push them to GitHub, and so I can rack up change sets — two, three, I've seen six before — wondering why they're not showing up on GitHub, until I realize that, oh yeah, I didn't sync those as well. Now, a lot of editors will have a commit-and-sync or commit-and-push button that you can press, which helps if you're using that specific editor, but doesn't help if you're using the command line, and it's just sort of a hack around a feature that Git would benefit from but just doesn't have.

There are more technical matters as well for why I prefer Fossil over Git. Like I had said way earlier, Richard Hipp has a great presentation on this that I will link to down in
the video description — definitely check that out. It goes into considerably more detail; I'm just talking here about my biggest reasons for preferring Fossil, and roughly how things are done in Fossil, so that you can compare and see kind of where I'm coming from. Yeah, the link for Richard Hipp's talk will be down in the video description — definitely check that out. There are a lot of technical reasons as well — like auditing purposes and stuff — why Fossil, I would say, is the superior version control. Now go to
# Beyond Git

by Paweł Świątkowski
27 May 2017

If you are a young developer, chances are that Git is the only version control system you know and use. There is nothing wrong with it, as it is probably the best one. However, it's worth knowing that other systems exist too, and some of them offer interesting features as well.

In this post I present a list of VCSs I came across during my programming education and career.

Concurrent Versions System (CVS)

I admit I never used this one. The only contact I had with it was, often, when I criticized SVN. More experienced developers used to say:

> Dude, you haven't used CVS, so shut up!

This is probably the oldest VCS, at least of the ones that were widely adopted. It was first released in 1986, and the last update is from 2009. Initially it was a set of shell scripts, but later it was rewritten in C. The "features" that led it to die out included:

- Non-atomic operations – when you pulled (whatever it was called) from the server and, for example, your connection broke, it was possible that only some of the files were updated.
- No diffs for renamed files – a rename was stored as a deletion and an addition anew.
- No file deletion.

Subversion (SVN)

Subversion is a more or less direct successor to CVS. It was designed to solve CVS's most dire problems, and it became a de facto standard for many years. I used SVN in my second job for a long time (after that we managed to migrate to Git).

SVN offered atomic operations and the ability to rename or delete files. However, compared to Git it still suffers from a poor branching system. Branches are in fact separate directories in the repository, and to apply some changes to many branches you have to prepare a patch on one branch and apply it to the others. There is no branch merging, and creating a new branch is not simple and cost-free.

One more important difference: SVN (and CVS) are not distributed, which means they require a central server where the main repository copy is stored. It also means that they can't really be used offline.

Still, Subversion has (or had) some upsides over Git:

- Because of its directory-based structure, it is possible to check out any part of the tree. For example, you could check out only the assets if you are a graphic designer. This led to the custom of keeping many projects in one huge repository.
- Subversion was slightly better at handling binary files. First of all, Git was reported to crash with too many binaries in the repo, while SVN worked fine. Secondly, those files tended to take up a bit less space. However, Git LFS was later released, which tipped the balance in Git's favor.

Because of its features, SVN is still used today, not only as a legacy VCS but sometimes as a valid choice. For example, Arch Linux's ABS system was ported from rsync to SVN not too long ago.

Mercurial (HG)

Mercurial and Git appeared at more or less the same time. The reason was that a previously popular VCS I was not aware of (BitKeeper) stopped offering its free version (now it's back again). The problem was that the Linux kernel was developed using BitKeeper. Both Mercurial and Git were started as successors to BK, but finally Git won. Mercurial is written mostly in Python, with parts in C. It was released in 2005 and is still actively developed.

I haven't actually ever found any crucial differences between HG and Git. For a long period of time I preferred the former, as I felt it was better at conflict resolution (Git was failing to resolve basically every time back then).
Also, Bitbucket was a hosting service for Mercurial repositories, and it offered private repos, which GitHub did not. It's worth noting that back then Bitbucket did not support Git (it started in 2008, and Git support was added in late 2011).

I know a couple of teams that choose Mercurial over Git for their projects today.

Darcs

Darcs always felt a bit eerie to me. It's written in Haskell, and the language's ecosystem is probably the only place where it gained some popularity. It was advertised as a truly distributed VCS, but Git has probably caught up since then, because the differences between the two seem not that important. Plus, Darcs still does not have local branches.

The latest release of Darcs is from September 2016.

Bazaar

With its name probably derived from Eric Raymond's famous essay, Bazaar is another child of the year 2005 (like Mercurial or Git). It's written in Python and is part of the GNU Project. Because of that, it gained some popularity in the open source world. It is also sponsored and maintained by Canonical, the creators of Ubuntu.

Bazaar offers a choice between centralized (like SVN) and distributed (like Git) workflows. Its repositories may be hosted on GNU Savannah, Launchpad, and SourceForge. The last release was in February 2016.

Fossil

Written in C, born in 2006, Fossil is an interesting, though not really popular, alternative to other version control systems. It includes not only files and revisions, but also a bug tracker and a wiki for the project. It also has a built-in web server. It's under active development. The most prominent project using Fossil is probably Tcl/Tk.

Pijul

Pijul is the new kid on the block. It's written in Rust and attempts to solve the performance issues of Darcs and the security issues (!) of Git. It is advertised as one of the fastest VCSs, and as having one of the best conflict-resolution mechanisms. It is also based on a "mathematically sound theory of patches."

It does not have a version 1.0 yet, though, but it's definitely worth watching, as it might be a thing some day. You can host your Pijul repository at nest.pijul.com.

Summary

| Name | Language | First release | Last release (as of May 2017) |
| ---- | -------- | ------------- | ----------------------------- |
| CVS | C | 1986 | May 2008 |
| Subversion | C | 2000 | November 2016 |
| Mercurial | Python, C | 2005 | May 2017 |
| Darcs | Haskell | 2003 | September 2016 |
| Bazaar | Python | 2005 | February 2016 |
| Fossil | C | 2006 | May 2017 |
| Pijul | Rust | 2015 (?) | May 2017 |
| Git | C | 2005 | May 2017 |
# Part 2: Merging, patches, and pijul

May 13, 2017

In the last post, I talked about a mathematical framework for a version control system (VCS) without merge conflicts. In this post I'll explore pijul, which is a VCS based on a similar system. Note that pijul is under heavy development; this post is based on a development snapshot (I almost called it a "git" snapshot by mistake), and might be out of date by the time you read it.

The main goal of this post is to describe how pijul handles what other VCSes call conflicts. We'll see some examples where pijul's approach works better than git's, and I'll discuss why.

Some basics

I don't want to write a full pijul tutorial here, but I do need to mention the basic commands if you're to have any hope of understanding the rest of the post. Fortunately, pijul commands have pretty close analogues in other VCSes.

- pijul init creates a pijul repository, much like git init or hg init.
- pijul add tells pijul that it should start tracking a file, much like git add or hg add.
- pijul record looks for changes in the working directory and records a patch with those changes, so it's similar to git commit or hg commit. Unlike those two (and much like darcs record), pijul record asks a million questions before doing anything; you probably want to use the -a option to stop it.
- pijul fork creates a new branch, like git branch. Unlike git branch, which creates a copy of the current branch, pijul fork defaults to creating a copy of the master branch. (This is a bug, apparently.)
- pijul apply adds a patch to the current branch, like git cherry-pick.
- pijul pull fetches and merges another branch into your current branch. The other branch could be a remote branch, but it could also just be a branch in the local repository.

Dealing with conflicts

As I explained in the last post, pijul differs from other VCSes by not having merge conflicts. Instead, it has (what I call) graggles, which are different from files in that their lines form a directed acyclic graph instead of a totally ordered list. The thing about graggles is that you can't really work with them (for example, by opening them in an editor), so pijul doesn't let you actually see the graggles: it stores them as graggles internally, but renders them as files for you to edit. As an example, we'll create a graggle by asking pijul to perform the following merge:

[Figure: merging one patch that adds a "* shoes" line with another that adds a "* garbage" line.]

Here are the pijul commands to do this:

```
$ pijul init
# Create the initial file and record it.
$ cat > todo.txt << EOF
> to-do
> * work
> EOF
$ pijul add todo.txt
$ pijul record -a -m todo
# Switch to a new branch and add the shoes line.
$ pijul fork --branch=master shoes
$ sed -i '2i* shoes' todo.txt
$ pijul record -a -m shoes
# Switch to a third branch and add the garbage line.
$ pijul fork --branch=master garbage
$ sed -i '2i* garbage' todo.txt
$ pijul record -a -m garbage
# Now merge in the "shoes" change to the "garbage" branch.
$ pijul pull . --from-branch shoes
```

The first thing to notice after running those commands is that pijul doesn't complain about any conflicts (this is not intentional; it's a known issue). Anyway, if you run the above commands then the final, merged version of todo.txt will look like this:

[Figure: the rendered todo.txt, with the two new lines separated by conflict-marker lines.]

That's… a little disappointing, maybe, especially since pijul was supposed to free us from merge conflicts, and this looks a lot like a merge conflict. The point, though, is that pijul has to somehow produce a file – one that the operating system and your editor can understand – from the graggle that it maintains internally. The output format just happens to look a bit like what other VCSes output when they need you to resolve a merge conflict.

As it stands, pijul doesn't have a very user-friendly way to actually see its internal graggles. But with a little effort, you can figure it out. The secret is the command

```
RUST_LOG="libpijul::backend=debug" pijul info --debug
```

For every branch, this will create a file named debug_<branchname> which describes, in graphviz's dot format, the graggles contained in that branch. That file's a bit hard to read, since it doesn't directly tell you the actual contents of any line; in place of, for example, "to-do", it just has a giant hex string corresponding to pijul's internal identifiers for that line. To decode everything, you'll need to look at the terminal output of the pijul command above. Part of it should look like this:

```
DEBUG:libpijul::backend::dump: ============= dumping Contents
DEBUG:libpijul::backend::dump: > Key { patch: PatchId 0x0414005c0c2122ca, line: LineId(0x0200000000000000) } Value (0) { value: [Ok("")] }
DEBUG:libpijul::backend::dump: > Key { patch: PatchId 0x0414005c0c2122ca, line: LineId(0x0300000000000000) } Value (12) { value: [Ok("to-do\n")] }
```

By cross-referencing that output with the contents of debug_<branchname>, you can reconstruct pijul's internal graggles. Just this once, I've done it for you, and the result is exactly as it should be:

[Figure: the merged graggle — "to-do" at the top, the "* garbage" and "* shoes" lines unordered in the middle, "* work" at the bottom.]
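To give a feel for what such a dump amounts to, here is a sketch (my own illustration, not pijul's actual dump code) of emitting a graggle as graphviz dot, with line contents substituted for pijul's internal identifiers:

```rust
// Emit a graggle (lines plus ordering edges) in graphviz dot format.
fn to_dot(lines: &[&str], edges: &[(usize, usize)]) -> String {
    let mut dot = String::from("digraph graggle {\n");
    for (i, line) in lines.iter().enumerate() {
        // {:?} quotes the string, which is valid dot label syntax.
        dot.push_str(&format!("  n{} [label={:?}];\n", i, line));
    }
    for (src, dest) in edges {
        dot.push_str(&format!("  n{} -> n{};\n", src, dest));
    }
    dot.push_str("}\n");
    dot
}
```

For the merged example above, something like to_dot(&["to-do", "* garbage", "* shoes", "* work"], &[(0, 1), (0, 2), (1, 3), (2, 3)]) would draw the two unordered lines side by side between "to-do" and "* work".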
What should I do with a conflict?

Since pijul will happily work with graggles internally, you could in principle ignore a conflict and work on other things. That's probably a bad idea for several reasons (for starters, there are no good tools for working with graggles, and their presence will probably break your build). So here's my unsolicited opinion: when you have a conflict, you should resolve it ASAP. In the example above, all we need to do is remove the >>> and <<< lines and then record the changes:

```
$ sed -i '3D;5D' todo.txt
$ pijul record -a -m resolve
```

To back up my recommendation for immediate flattening, I'll give an example where pijul's graggle-to-file rendering is lossy. Here are two different graggles:

[Figure: two different graggles involving "shop" and "home" lines.]

But pijul renders both in the same way:

[Figure: the single file that pijul renders for both graggles.]

This is a perfectly good representation of the graggle on the right, but it loses information from the one on the left (such as the fact that both "home" lines are the same, and the fact that "shop" and "home" don't have a prescribed order). The good news here is that as long as your graggle came from merging two files, pijul's rendering is lossless. That means you can avoid the problem by flattening your graggles to files after every merge (i.e., by resolving your merge conflicts immediately). Like cockroaches, graggles are important for the ecosystem as a whole, but you should still flatten them as soon as they appear.

Case study 1: reverting an old commit

It's (unfortunately) common to discover that an old commit introduced a show-stopper bug. On the bright side, every VCS worth its salt has some way of undoing the problematic commit without throwing away everything else you've written since then. But if the problematic commit predates a merge conflict, undoing it can be painful.

As an illustration of what pijul brings to the table, we'll look at a situation where pijul's conflict-avoidance saves the day (at least, compared to git; darcs also does ok here). We'll start with the example merge from before, including our manual graggle resolution. Then we'll ask pijul to revert the "shoes" patch:

```
$ pijul unrecord --patch=<hash-of-shoes-patch>
$ pijul revert
```

The result?
We didn't have any conflicts while reverting the old patch, and the final file is exactly what we expected:

[Figure: todo.txt containing just the to-do, garbage, and work lines.]

Let's try the same thing with git:

```
$ git init
# Create the initial file and record it.
$ cat > todo.txt << EOF
> to-do
> * work
> EOF
$ git add todo.txt
$ git commit -a -m todo
# Switch to a new branch and add the shoes line.
$ git checkout -b shoes
$ sed -i '2i* shoes' todo.txt
$ git commit -a -m shoes
# Switch to a third branch and add the garbage line.
$ git checkout -b garbage master
$ sed -i '2i* garbage' todo.txt
$ git commit -a -m garbage
# Now merge in the "shoes" change to the "garbage" branch.
$ git merge shoes
Auto-merging todo.txt
CONFLICT (content): Merge conflict in todo.txt
Automatic merge failed; fix conflicts and then commit the result.
```

That was expected: there's a conflict, so we have to resolve it. So I edited todo.txt and manually resolved the conflict. Then,

```
# Commit the manual resolution.
$ git commit -a -m merge
# Try to revert the shoes patch.
$ git revert <hash-of-shoes-patch>
error: could not revert 4dcf1ae... shoes
hint: after resolving the conflicts, mark the corrected paths
hint: with 'git add <paths>' or 'git rm <paths>'
hint: and commit the result with 'git commit'
```

Since git can't "see through" my manual merge resolution, it can't handle reverting the patch by itself. I have to manually resolve the conflicting patches both when applying and reverting.

I won't bore you with long command listings for other VCSes, but you can test them out yourself! I've tried mercurial (which does about the same as git in this example) and darcs (which does about the same as pijul in this example).

A little warning about pijul unrecord

I'm doing my best to present roughly equivalent command sequences for pijul and git, but there's something important you should know about the difference between pijul unrecord and git revert: pijul unrecord modifies the history of the repository, as though the unrecorded patch never existed. In this way, pijul unrecord is a bit like a selective version of git reset. This is probably not the functionality that you want, especially if you're working on a public repository. Pijul actually does have the internal capability to do something closer to git revert (i.e., undo a patch while keeping it in the history), but it isn't yet user-accessible.

Sets of patches

The time has come again to throw around some fancy math words. First, associativity. As you might remember, a binary operator (call it +) is associative if (x + y) + z = x + (y + z) for any x, y, and z. The great thing about associative operators is that you never need parentheses: you can just write x + y + z and there's no ambiguity. Associativity automatically extends to more than three things: there's also no ambiguity with w + x + y + z.

The previous paragraph is relevant to patches because perfect merging is associative, in the following sense: if I have multiple patches (let's say three, to keep the diagrams manageable), then there's a unique way to perfectly merge them all together. That three-way merge can be written as combinations of two-way merges in multiple different ways, but every way that I write it gives the same result.
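In symbols (the notation here is mine, not the post's): writing $p \vee q$ for the result of perfectly merging patches $p$ and $q$ out of a common file — that is, the combined patch from the original file to the merged file — the claim is that

$$ (p \vee q) \vee r \;=\; p \vee (q \vee r), $$

so the perfect merge of any finite collection of patches is well defined, no matter how the two-way merges are parenthesized.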
Let's have some pictures. Here are my three patches:

And here's one way I could merge them all together: first, merge patches p and q:

Then, merge patches pm (remember, that's the patch I get from applying p and then m, which in the diagram above is the same as qn) and r:

Another way would be to first merge q and r, and then merge p into the result:

Yet a third way would be to merge p and q, then merge q and r, and finally merge the results of those merges. This one gives a nice, symmetric picture:

The great thing about our mathematical foundation from the previous post is that all these merges produce the same result. And I don't just mean that they give the same final file: they also result in the same patches, meaning that everyone will always agree on which lines in the final file came from where. There isn't even anything special about the initial configuration (three patches coming out of a single file). I could start with an arbitrarily complex history, and there would be an unambiguous way to merge together all of the patches that it contains. In this sense, we can say that the current state of a pijul branch is determined by a set of patches; this is in contrast to most existing VCSes, where the order in which patches are merged also matters.

Reordering and antiquing patches

One of the things you might have heard about pijul is that it can reorder patches (i.e., that they are commutative). This is not 100% accurate, and it might also be a bit confusing if you paid attention in my last post. That's because a patch, according to the definition I gave before, includes its input file. So if you have a patch p that turns file A into file B and a patch q that turns file B into file C, then it makes sense to apply p and then q, but not the other way around. It turns out that pijul has a nice trick up its sleeve, which allows you to reorder patches as long as they don't "depend" on each other (and I'll explain precisely what that means).

The key idea behind reordering patches is something I call "antiquing." Consider the following sequenced patches:

According to how we defined patches, the second patch (let's call it the garbage patch) has to be applied after the first one (the shoes patch). On the other hand, it's pretty obvious just by staring at them that the garbage patch doesn't depend on the shoes patch. In particular, the following parallel patches convey exactly the same information, without the dependencies:

How do I know for sure that they convey the same information? Because if we take the perfect merge of the diagram above, then we get back the original sequenced diagram by following the top path in the merge!

This example motivates the following definition: given a pair of patches p and q in sequence:

we say that q can be antiqued if there exists some patch a(q) starting at O such that the perfect merge between p and a(q) involves q:

In a case like this, we can just forget about q entirely, since a(q) carries the same information. I call it antiquing because it's like making q look older than it really is.

One great thing about the "sets of patches" idea above is that it lets us easily generalize antiquing from pairs of patches to arbitrarily complicated histories. I'll skip the details, but the idea is that you keep antiquing a patch – moving it back and back in the history – until you can't any more. The fact that perfect merges are associative implies, as it turns out, that every patch has a unique "most antique" version. The set of patches leading into the most antique version of q are called q's dependencies.
For example, here is a pair of patches where the second one cannot be antiqued (as an exercise, try to explain why not):

Since the second patch can't be made any more antique, the first patch above is a dependency of the second one. In my next post, I'll come back to antiquing (and specifically, the question of how to efficiently find the most antique version of a patch).

I promised to talk about reordering patches, so why did I spend paragraphs going on about antiques? The point is that (again, because of the associative property of perfect merges) patches in "parallel" can be applied in any order. The point of antiquing is to make patches as parallel as possible, so that we can be maximally flexible about ordering them.

That last bit is important, so it's worth saying again (and with a picture): patches in sequence

cannot be re-ordered; the same information represented in parallel using an antique of q

is much more flexible.

Case study 2: parallel development

Since I've gone on for so long about reordering patches, let's have an example showing what it's good for. Let me start with some good news: you don't need to know about antiquing to use pijul, because pijul does it all for you: whenever pijul records a patch, it automatically records the most antique version of that patch. All you'll notice is the extra flexibility it brings.

We'll simulate (a toy example of) a common scenario: you're maintaining a long-running branch of a project that's under active development (maybe you're working on a large experimental feature). Occasionally, you need to exchange some changes with the master branch. Finally (maybe your experimental feature was a huge success) you want to merge everything back into master.

Specifically, we're going to do the following experiment in both pijul and git. The master branch will evolve in the following sequence:

On our private branch, we'll begin from the same initial file. We'll start by applying the urgent fix from the master branch (it fixed a critical bug, so we can't wait):

Then we'll get to implementing our fancy experimental features:

I'll leave out the (long) command listings needed to implement the steps above in pijul and git, but let me mention the one step that we didn't cover before: in order to apply the urgent fix from master, we say

$ pijul changes --branch master # Look for the patch you want
$ pijul apply <hash-of-the-patch>

In git, of course, we'll use cherry-pick.

Now for the results. In pijul, merging our branch with the master branch gives no surprises:

In git, we get a conflict:

There's something else a bit funny with git's behavior here: if we resolve the conflict and look at the history, there are two copies of the urgent fix, with two different hashes. Since git doesn't understand patch reordering like pijul does, git cherry-pick and pijul apply work in slightly different ways: pijul apply just adds another patch into your set of patches, while git cherry-pick actually creates a new patch that looks a bit like the original. From then on, git sees the original patch and its cherry-picked copy as two different patches, which (as we've seen) creates problems for merging down the line. And it gets worse: reverting one of the copies of the urgent fix (try it!) gives pretty strange results.

By playing around with this example, you can get git to do some slightly surprising things. (For example, by inserting an extra merge in the right place, you can get the conflict to go away. That's because git has a heuristic where if it sees two different patches doing the same thing, it suppresses the conflict.)
Pijul, on the other hand, understood that the urgent fix could be incorporated into my private branch with no lossy modifications. That's because pijul silently antiqued the urgent fix, so that the divergence between the master branch and my own branch became irrelevant.

Conclusion

So hopefully you have some idea now of what pijul can and can't do for you. It's an actively developed implementation of an exciting (for me, at least) new way of looking at patches and merges, and it has a simple, fast, and totally lossless merge algorithm with nice properties.

Will it dethrone git? Certainly not yet. For a start, it's still alpha-quality and under heavy development; not only should you be worried about your data, it has several UI warts as well. Looking toward the future, I can see reasonable arguments in both directions.

Arguing against pijul's future world domination, you could question the relevance of the examples I've shown. How often do you really end up tripping on git's little corner cases? Would the time saved by pijul's improvements actually justify the cost of switching? Those are totally reasonable questions, and I don't know the answer.

But here's a more optimistic point of view: pijul's effortless merging and reordering might really lead to new and productive workflows. Are you old enough to remember when git was new and most people were still on SVN (or even CVS)? Lots of people were (quite reasonably) skeptical. "Who cares about easy branching? It's better to merge changes immediately anyway." Or, "who cares about distributed repositories? We have a central server, so we may as well use it." Those arguments sound silly now that we're all used to DVCSes and the workflow improvements that they bring, but it took time and experimentation to develop those workflows, and the gains weren't always obvious beforehand. Could the same progression happen with pijul?

In the next post, I'll take a look at pijul's innards, focussing particularly on how it represents your precious data.

Acknowledgement

I'd like to thank Pierre-Étienne Meunier for his comments and corrections on a draft of this post. Of course, any errors that remain are my own responsibility.
# Document Title
A new version

Part 1: Merging and patches
May 08, 2017

A recent paper suggested a new mathematical point of view on version control. I first found out about it from pijul, a new version control system (VCS) that is loosely inspired by that paper. But if you poke around the pijul home page, you won't find many details about what makes it different from existing VCSes. So I did a bit of digging, and this series of blog posts is the result.

In the first part (i.e. this one), I'll go over some of the theory developed in the paper. In particular, I'll describe a way to think about patches and merging that is guaranteed to never, ever have a merge conflict. In the second part, I'll show how pijul puts that theory into action, and in the third part I'll dig into pijul's implementation.

Before getting into some patch theory, a quick caveat: any real VCS needs to deal with a lot of tedious details (directories, binary files, file renaming, etc.). In order to get straight to the interesting new ideas, I'll be skipping all that. For the purposes of these posts, a VCS only needs to keep track of a single file, which you should think of as a list of lines.

Patches

A patch is the difference between two files. Later in this series we'll be looking at some wild new ideas, so let's start with something familiar and comforting. The kind of patches we'll discuss here go back to the early days of Unix:

a patch works line-by-line (as opposed to, for example, word-by-word); and
a patch can add new lines, but not modify existing lines.

In order to actually have a useful VCS, you need to be able to delete lines also. But deleting lines turns out to add some complications, so we'll deal with them later.

For an example, let's start with a simple file: my to-do list for this morning.

Looking back at the list, I realize that I forgot something important. Here's the new one:

To go from the original to-do list to the new one, I added the line with the socks. In the format of the original Unix "diff" utility, the patch would look like this:

The "1a2" line is a code saying that we're going to add something after line 1 of the input file, and the next bit is obviously telling us what to insert.

Since this blog isn't a command line tool, we'll represent patches with pretty diagrams instead of flat files. Here's how we'll draw the patch above:

Hopefully it's self-explanatory, but just in case: an arrow goes from left to right to indicate that the line on the right is the same as the one on the left. Lines on the right with no arrow coming in are the ones that got added. Since patches aren't allowed to re-order the lines, the lines are guaranteed not to cross.

There's something implicit in our notation that really needs to be said out loud: for us, a patch is tied to a specific input file. This is the first point where we diverge from the classic Unix ways: the classic Unix patch that we produced using "diff" could in principle be applied to any input file, and it would still insert "* put on socks" after the first line. In many cases that wouldn't be what you want, but sometimes it is.

Merging

The best thing about patches is that they can enable multiple people to edit the same file and then merge their changes afterwards. Let's suppose that my wife also decides to put things on my to-do list: she takes the original file and adds a line:

Now there are two new versions of my to-do list: mine with the socks, and my wife's with the garbage.
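To make the "a patch is tied to a specific input file" idea concrete, here's a minimal sketch in Rust (my own toy encoding, not pijul's actual data structures): every output line either points back at an input line, in order, or is newly inserted.

/// A toy, add-only patch in the spirit of the post: it is tied to a
/// specific input file, and every output line is either copied from
/// the input (preserving order) or newly inserted. A real version
/// would also insist that the FromInput indices appear in increasing
/// order, since lines aren't allowed to cross.
#[derive(Clone, Debug, PartialEq)]
enum OutLine {
    FromInput(usize), // index into the input file's lines
    Inserted(String), // a brand-new line
}

#[derive(Clone, Debug)]
struct Patch {
    input: Vec<String>,   // the file this patch applies to
    output: Vec<OutLine>, // the file it produces
}

impl Patch {
    /// Apply the patch, producing the new file. Panics if used on a
    /// file other than the one it was made for.
    fn apply(&self, file: &[String]) -> Vec<String> {
        assert_eq!(file, &self.input[..], "patch is tied to its input file");
        self.output
            .iter()
            .map(|l| match l {
                OutLine::FromInput(i) => self.input[*i].clone(),
                OutLine::Inserted(s) => s.clone(),
            })
            .collect()
    }
}

fn main() {
    let todo = vec!["to-do".to_string(), "* work".to_string()];
    // The "socks" patch: insert one line after line 1, like "1a2".
    let socks = Patch {
        input: todo.clone(),
        output: vec![
            OutLine::FromInput(0),
            OutLine::Inserted("* put on socks".to_string()),
            OutLine::FromInput(1),
        ],
    };
    assert_eq!(socks.apply(&todo)[1], "* put on socks");
}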
Let's draw them all together:

This brings us to merging: since I'd prefer to have my to-do list as a single file, I want to merge my wife's changes and my own. In this example, it's pretty obvious what the result should be, but let's look at the general problem of merging. We'll do this slowly and carefully, and our endpoint might be different from what you're used to.

Patch composition

First, I need to introduce some notation for an obvious concept: the composition of two patches is the patch that you would get by applying one patch and then applying the other. Since a "patch" for us also includes the original file, you can't just compose any two old patches. If p is a patch taking the file O to the file A and r is a patch taking A to B, then you can compose the two (but only in one order!) to obtain a patch from O to B. I'll write this composition as pr: first apply p, then r.

It's pretty easy to visualize patch composition using our diagrams: to compute the composition of two patches, just "follow the arrows" to get the (dotted red) patch going from O to B.

Merging as composition

I'm going to define carefully what a merge is in terms of patch composition. I'll do this in a very math-professor kind of way: I'll give a precise definition, followed by some examples, and only afterwards will I explain why the definition makes sense. So here's the definition: if p and q are two different patches taking the file O to the files A and B respectively, a merge of p and q is a pair of patches r and s such that

r and s take A and B respectively to a common output file M, and
pr = qs.

We can illustrate this definition with a simple diagram, where the capital letters denote files, and the lower-case letters are patches going between them:

Instead of saying that pr = qs, a mathematician (or anyone who wants to sound fancy) would say that the diagram above commutes.

Here is an example of a merge:

And here is an example of something that is not a merge:

This is not a merge because it fails the condition pr = qs: composing the patches along the top path gives

but composing them along the bottom path gives

Specifically, the two patches disagree on which of the shoes in the final list came from the original file. This is the real meaning underlying the condition pr = qs: it means that there will never be any ambiguity about which lines came from where. If you're used to using blame or annotate commands with your favorite VCS, you can probably imagine why this sort of ambiguity would be bad.

A historical note

Merging patches is an old idea, of course, and so I just want to briefly explain how the presentation above differs from "traditional" merging: traditionally, merging was defined by algorithms (of which there are many). These algorithms would try to automatically find a good merge; if they couldn't, you would be asked to supply one instead.

We'll take a different approach: instead of starting with an algorithm, we'll start with a list of properties that we want a good merge to satisfy. At the end, we'll find that there's a unique merge that satisfies all these properties (and fortunately for us, there will also be an efficient algorithm to find it).

Merges aren't unique

The main problem with merges is that they aren't unique. This isn't a huge problem by itself: lots of great things aren't unique. The problem is that we usually want to merge automatically, and an automatic system needs an unambiguous answer. Eventually, we'll deal with this by defining a special class of merges (called perfect merges) which will be unique.
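Continuing the toy Rust encoding from the sketch above (again, my own illustration, not pijul's API; the Patch and OutLine types are assumed in scope), both composition and the "is this a merge?" condition are mechanical to check:

impl Patch {
    /// Compose `self` (O -> A) with `other` (A -> B), giving O -> B.
    /// Only defined in this order: `other`'s input must be `self`'s output.
    fn compose(&self, other: &Patch) -> Patch {
        let mid: Vec<String> = self.apply(&self.input);
        assert_eq!(mid, other.input, "composition only works in one order");
        let output = other
            .output
            .iter()
            .map(|l| match l {
                // A line copied from the middle file traces back through
                // `self` to either an original line or an inserted one.
                OutLine::FromInput(i) => self.output[*i].clone(),
                OutLine::Inserted(s) => OutLine::Inserted(s.clone()),
            })
            .collect();
        Patch { input: self.input.clone(), output }
    }
}

/// The merge condition from the post: (r, s) merges (p, q) iff
/// pr = qs, i.e. the diagram commutes, so both sides agree about
/// which output lines came from where.
fn is_merge(p: &Patch, q: &Patch, r: &Patch, s: &Patch) -> bool {
    let pr = p.compose(r);
    let qs = q.compose(s);
    pr.input == qs.input && pr.output == qs.output
}

Note that is_merge compares the provenance of every output line, not just the rendered text: that is exactly what rules out the "not a merge" example, where the two sides disagree about which of the shoes came from the original file.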
Before that, we'll explore the problem with some examples.

A silly example

Let's start with a silly example, in which our merge tool decides to add some extra nonsense:

No sane merge tool would ever do that, of course, but it's still a valid merge according to our rule in the last section. Clearly, we'll have to tighten up the rules to exclude this case.

A serious example

Here is a more difficult situation with two merges that are actually reasonable:

Both of these merges are valid according to our rules above, but you need to actually know what the lines mean in order to decide that the first merge is better (especially if it's raining outside). Any reasonable automatic merging tool would refuse to choose, instead requiring its user to do the merge manually.

The examples above are pretty simple, but how would you decide in general whether a merge is unambiguous and can be performed automatically? In existing tools, the details depend on the merging algorithm. Since we started off with a non-algorithmic approach, let's see where that leads: instead of specifying explicitly which merges we can do, we'll describe the properties that an ideal merge should have.

Perfect merges

The main idea behind the definition I'm about to give is that it will never cause any regrets. That is, no matter what happens in the future, we can always represent the history just as well through the merge as we could using the original branches. Obviously, that's a nice property to have; personally, I think it's non-obvious why it's a good choice as the defining property of the ideal merge, but we'll get to that later.

Ok, here it comes. Consider a merge:

And now suppose that the original creators of patches p and q continued working on their own personal branches, which merged sometime in the future at the file F:

We say that the merge (r, s) is a perfect merge if for every possible choice of the merge (u, v), there is a unique patch w so that u = rw and v = sw. (In math terms, the diagram commutes.) We're going to call w a continuation, since it tells us how to continue working from the merged file. To repeat: a merge is perfect if for every possible future, there is a unique continuation.

A perfect merge

Let's do a few examples to explore the various corners of our definition. First, an example of a perfect merge:

It takes a bit of effort to actually prove that this is a perfect merge; I'll leave that as an exercise. It's more interesting to see some examples that fail to be perfect.

A silly example

Let's start with the silly example of a merge that introduced an unnecessary line:

This turns out (surprise, surprise) not to be a perfect merge. To understand how our definition of merge perfection excludes merges like this, here is an example of a possible future without a continuation:

Since our patches can't delete lines, there's no way to get from merged to future.

A serious example

Here's another example, the case where there is an ambiguity in the order of two lines in the merged file:

This one fails to be a perfect merge because there is a future with no valid continuation: imagine that my wife and I manually created the desired merge.

Now what patch (call it w) could be put between merged and future to make everything commute? The only possibility is

which isn't a legal patch because patches aren't allowed to swap lines.

Terminological remarks

If you've been casually reading about pijul, you might have encountered the word "pushout." It turns out that the pattern we used for defining a perfect merge is very common in math. Specifically, in category theory, suppose you have the following diagram (in which capital letters are objects and lowercase letters are morphisms):

If for every u and v there is a unique w such that the diagram commutes, then (r, s) is said to be the pushout of (p, q). In other words, what we called a "perfect merge" above could also be called a "pushout in the category with files as objects and patches as morphisms."
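Spelled out in the usual notation (this is the standard category-theory definition, written with the post's left-to-right convention for composition, so $pr$ means "$p$ then $r$"):

$$(r, s) \text{ is a pushout of } (p, q) \iff pr = qs \ \text{ and } \ \forall (u, v) \text{ with } pu = qv,\ \exists!\, w \text{ such that } u = rw \text{ and } v = sw.$$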
For most of this article, we'll ignore the general math terminology in favor of language that's more intuitive and specific to files and patches.

Conflicts and graggles

The main problem with perfect merges is that they don't always exist. In fact, we already saw an example:

The pair of patches above has no perfect merge. We haven't actually proved it, but intuitively it's pretty clear, and we also discussed earlier why one potential merge fails to be perfect. Ok, so not every pair of patches can be merged perfectly. You probably knew that already, since that's where merge conflicts come from: the VCS doesn't know how to merge patches on its own, so you need to manually resolve some conflicts.

Now we come to the coolest part of the paper: a totally different idea for dealing with merge conflicts. The critical part is that instead of making do with an imperfect merge, we enlarge the set of objects that the merge can produce. That is, not every pair of patches can be perfectly merged to a file, but maybe they can be merged to something else. This idea is extremely common in math, and there's even some general abstract nonsense showing that it can always be done: there's an abstract way to generalize files so that every pair of patches of generalized files can be perfectly merged. The miraculous part here is that in this particular case, the abstract nonsense condenses into something completely explicit and manageable.

Graggles

A file is an ordered list of lines. A graggle¹ (a mixture of "graph" and "file") is a directed graph of lines. (Yes, I know it's a terrible name, but it's better than "object in the free finite cocompletion of the category of files and patches," which is what the paper calls it.) In other words, whereas a file insists on having its lines in a strict linear order, a graggle allows them to be any directed graph. It's pretty easy to see how relaxing the strict ordering of lines solves our earlier merging issues. For example, here's a perfect merge of the sort that caused us problems before:

In retrospect, this is a pretty obvious solution: if we don't know what order shoes and garbage should go in, we should just produce an output that doesn't specify the order. What's a bit less obvious (but is proved in the paper) is that when we work in the world of graggles instead of the world of files, every pair of patches has a unique perfect merge. What's even cooler is that the perfect merge is easy to compute. I'll describe it in a second, but first I have to say how patches generalize to graggles.

A patch between two graggles (say, A and B) is a function (call it p) from the lines of A to the lines of B that respects the partial order, in the sense that if there is a path from x to y in A then there is a path from p(x) to p(y) in B. (This condition is an extension of the fact that a patch between two files isn't allowed to change the order.) Here's an example:

The perfect merge

And now for the merge algorithm: let's say we have a patch p going from the graggle A to the graggle B and another patch q going from A to C. To compute the perfect merge of p and q,

1. write down the graggles B and C next to each other, and then
2. whenever a line in B and a line in C share a "parent" in A, collapse them into a single line.

That's it: two steps. Here's the algorithm at work on our previous example: we want to merge these two patches:

So first, we write down the two to-be-merged files next to each other:

For the second step, we see that both of the "to-do" lines came from the same line in the original file, so we combine those two into one. After doing the same to the "work" lines, we get the desired output:
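Here is that two-step algorithm as a small Rust sketch (my own toy encoding of graggles; pijul's real representation is different). Each line records which line of the common ancestor A it came from, which is exactly the information the collapse step needs:

use std::collections::HashMap;

/// A toy graggle: a directed graph of lines. `parent_in_a[i]` records
/// which line of the common ancestor A the i-th line came from
/// (None for lines added by the patch). `edges` are the ordering arrows.
#[derive(Clone, Debug)]
struct Graggle {
    lines: Vec<String>,
    parent_in_a: Vec<Option<usize>>,
    edges: Vec<(usize, usize)>, // (from, to) indices into `lines`
}

/// The two-step perfect merge from the post: write B and C side by
/// side, then collapse any pair of lines that share a parent in A.
fn perfect_merge(b: &Graggle, c: &Graggle) -> Graggle {
    let mut lines = Vec::new();
    let mut parent_in_a = Vec::new();
    // Step 1: write down B, remembering where each of its lines went.
    let b_index: Vec<usize> = (0..b.lines.len())
        .map(|i| {
            lines.push(b.lines[i].clone());
            parent_in_a.push(b.parent_in_a[i]);
            lines.len() - 1
        })
        .collect();
    // Map from "parent line in A" to its already-placed merged line.
    let by_parent: HashMap<usize, usize> = b_index
        .iter()
        .enumerate()
        .filter_map(|(i, &m)| b.parent_in_a[i].map(|p| (p, m)))
        .collect();
    // Step 2: write down C, collapsing lines that share a parent in A.
    let c_index: Vec<usize> = (0..c.lines.len())
        .map(|i| match c.parent_in_a[i].and_then(|p| by_parent.get(&p)) {
            Some(&m) => m, // collapse onto B's copy
            None => {
                lines.push(c.lines[i].clone());
                parent_in_a.push(c.parent_in_a[i]);
                lines.len() - 1
            }
        })
        .collect();
    // Carry over both edge sets, re-indexed into the merged graggle.
    let mut edges: Vec<(usize, usize)> =
        b.edges.iter().map(|&(x, y)| (b_index[x], b_index[y])).collect();
    edges.extend(c.edges.iter().map(|&(x, y)| (c_index[x], c_index[y])));
    edges.sort();
    edges.dedup(); // collapsed lines can produce duplicate arrows
    Graggle { lines, parent_in_a, edges }
}

fn main() {
    // B: the list with "shoes" added; C: the same list with "garbage" added.
    let b = Graggle {
        lines: vec!["to-do".into(), "* shoes".into(), "* work".into()],
        parent_in_a: vec![Some(0), None, Some(1)],
        edges: vec![(0, 1), (1, 2)],
    };
    let c = Graggle {
        lines: vec!["to-do".into(), "* garbage".into(), "* work".into()],
        parent_in_a: vec![Some(0), None, Some(1)],
        edges: vec![(0, 1), (1, 2)],
    };
    let m = perfect_merge(&b, &c);
    assert_eq!(m.lines.len(), 4); // "to-do" and "* work" were collapsed
}

The ghost-line extension described below fits into this sketch naturally: give each line a ghost flag, and when two lines collapse, make the merged line a ghost if either side's line was one.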
Working with graggles

By generalizing files to graggles, we got a very nice benefit: every pair of patches has a (unique) perfect merge, and we can compute it easily. But there's an obvious flaw: all the tools that we use (editors, compilers, etc.) work on files, not graggles. This is where the paper stops providing guidance, but there is an easy solution: whenever a merge results in something that isn't a file, just make a new patch that turns it into a file. We'll call this flattening, and here's an example:

That looks like a merge conflict!

If your eyes haven't glazed over by now (sorry, it's been a long post), you might be feeling a bit cheated: I promised you a new framework that avoids the pitfalls of manual merge resolution, but flattening looks an awful lot like manual merge resolution. I'll answer this criticism in more detail in the next post, where I demonstrate the pijul tool and how it differs from git. But here's a little teaser: the difference between flattening and manual merge resolution is that flattening is completely transparent to the VCS: it's just a patch like any other. That means we can do fun things, like re-ordering or reverting patches, even in the presence of conflicting merges. More on that in the next post.

Deleting lines

It's time to finally address something I put off way back at the beginning of the post: the system I described was based on patches that can't delete lines, and we obviously need to allow deletions in any practical system. Unfortunately, the paper doesn't help here: it claims that you can incorporate deletion into the system I described without really changing anything, but there's a bug in the paper. Specifically, if you tweak the definitions to allow deletion then the category of graggles turns out not to be closed under pushouts any more. Here's an example where the merge algorithm in the paper turns out not to be perfect:

(Since this post has dragged on long enough, I'll leave it as an exercise to figure out what the problem is.)

Ghost lines

Fortunately, there's a trick to emulate line deletion in our original patch system. I got this idea from pijul, but I'll present it in a slightly different way. The idea is to allow "ghost" lines instead of actually deleting them. That is, we mark every line in our graggle as either "live" or "ghost." Then we add one extra rule to our patches: a live line can turn into a ghost line, but not the other way around. We'll draw ghost lines in gray, and arrows pointing to ghost lines will be dashed. Here's a patch that deletes the "shoes" line.

The last remaining piece is to extend the perfect merge algorithm to cover our new graggles with ghost lines.
This turns out to be easy; here's the new algorithm:

1. Write down side-by-side the two graggles to be merged.
2. For every pair of lines with a common parent, "collapse" them into a single line, *and if one of them was a ghost, make the collapsed line a ghost*.

The bit in italics is the only new part, and it barely adds any extra complexity.

Conclusion

I showed you (in great detail) a mathy way of thinking about patches in a VCS, although I haven't shown a whole lot of motivation for it yet. At the very least, though, next time someone starts droning on about "patch theory," you'll have some idea what they're talking about.

In the next post, I'll talk about pijul, a VCS that is loosely based around the algorithms I described in this post. There you'll get to see some (toy) examples where pijul's solid mathematical underpinnings help it to avoid corner cases that trip up some more established VCSes.

Acknowledgement

I'd like to thank Pierre-Étienne Meunier for his comments and corrections on a draft of this post. Of course, any errors that remain are my own responsibility.

1: An earlier version of this post called them "digles" (for directed graph file), but a couple years later I decided that "graggles" sounds a bit better. Plus, if you mispronounce it a little, it fits in with pijul's whole bird theme.
# Document Title

All right, thanks Dean, thanks everyone. This is my first software conference ever, so I'm a little impressed. I'm going to talk about the project I've been working on for about a year now. It's called Pijul, from the name of a Spanish bird. We're trying to do sane version control, and this talk is mostly about what "sane" means and what version control should be like.

It all started because I was working with Florent Becker, my co-author on this project, and we were trying to convince our co-authors on an academic paper to use version control instead of just emails and reserving files for a week at a time. Florent happens to be one of the core Darcs developers, which is a rather oldish version control system, so we couldn't really convince him to use anything else. So we tried to convince our co-authors to start installing Darcs on Windows, and then setting up SSH keys and pushing patches and pulling stuff, and that didn't really work out so well. Also, it turns out Darcs has performance problems and Git has simplicity problems, so it wasn't really easy to do anything about this.

And that's pretty much the situation with version control today: most people, most non-hackers, don't even use it. Even among hackers it's not used as widely as you'd think; most people use extremely basic version control. Even world experts in computer science still use file locking and emails, and that makes us lose a significant amount of data and/or time. The situation with programmers isn't much better, because as soon as distributed version control systems were invented, they were so complicated and hard to use that businesses even started to resent them, and used them in a centralized way in order to master the beast.

But functional programming is finally convincing people that they can really tackle all the fields of computer science that were previously regarded as elite fields, just like Ashley showed us for operating systems. We're trying to replace C++ with Rust; recently JavaScript has been getting replaced by Elm; package managers and Linux distributions are getting replaced by Nix (not widely accepted so far). And this talk is about adding another functional tool to that collection. It's called Pijul. We set out to replace Git; maybe that goal is as ambitious as trying to replace C++, or maybe more, but anyway.

Before starting, I just want to give you some basic concepts about what I think functional programming is. Most people have their own definition; feel free to agree or not with this one. What I like in functional languages like Rust is that there are static types, so we can reason about code easily. There's fine control of immutability: we're not mutating some state all the time, we're thinking about transforms and functions. Most operations are atomic, which is what most people assume when they use any software at all, and it's something that's not really happening in most tools. And we also like memory safety: we don't want to corrupt our backend or our storage or our files; we don't want to lose data just because we're using types in the wrong way.

So how does that apply to version control? Why would it be beneficial? Well, because the way we use version control today is mostly about states.
We're talking about commits. If we use Git, we're talking about commits, and commits are stateful: a commit is basically a hash of the whole state of the repository. There may be something like a patch underneath, to make it easier to store, but that's just an optimization. When you commit something, you advance the head of the repository by one small notch, and the hash of the whole new state of the repository is the name of your new commit.

That's quite different from patches. Patches are all about transforms: when you apply a patch, you basically apply it to a file. That's what most people think Git is about when they first discover it, but then they realize that's not the case.

And this can have major benefits. For instance, in any system that uses three-way merge, such as Git, Mercurial, Subversion, CVS and most others, you don't have that really cool property that we call associativity in algebra. That property is the following. Say you have two people, Alice and Bob, who don't have really good communication, right, they're a couple. Alice writes the red patch here, and Bob first writes a blue patch and then a green patch. Depending on the order in which you merge things, depending on your merge strategy, you might get different results with a three-way merge. If the patches are exactly the same, and I merge the blue patch first and then the green patch, I might sometimes get a different result in Git than if I do it the other way. Like in the image at the top: if I merge the two branches I may get one result, and if I merge the exact same patches, and they are not conflicting, they're dealing with really different parts of the file, sometimes Bob's patches might get merged into parts of the file he has never seen. When you consider applying tools like that to highly sensitive code, such as cryptography libraries, it's kind of scary. But it's what everyone does.

What's even worse is that you can't actually tell when this hits you. If you're a Git user and this happened to you, there's no way to tell: Git says "yeah, I've merged, it's fine, it works," and you don't notice it. Maybe it even compiles. There are real-world examples of this; it's not just an implementation bug that can be worked around, it's a fundamental problem in the three-way merge algorithm.

All right, so we're not the first ones to believe this is wrong. There was another version control system I mentioned in my intro, called Darcs (it's still called Darcs, by the way). Their main principle is that you say two patches are fine with each other, don't conflict, if they commute, which simply means that you can apply them in either order. Well, they almost commute; it's not exactly commuting in the algebraic sense, because sometimes you have to change the line numbers. For instance, if Bob adds a line containing just "fn main" at line 1, and Alice adds a line at line 10 saying "println hello world", when you merge them, you might have to apply Alice's line after Bob's at line 11 instead of 10. But that's kind of commuting anyway.
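A rough sketch of that line-number dance in Rust (my own illustration, not Darcs's or pijul's actual code): two single-line insertions commute as long as we shift the later offset past the earlier one.

/// A toy patch that inserts one line at a (0-based) position.
#[derive(Clone, Debug, PartialEq)]
struct Insert {
    at: usize,
    line: String,
}

/// Commute p-then-q into q'-then-p', preserving the combined effect.
/// Offsets shift by one when the other patch lands above them.
fn commute(p: &Insert, q: &Insert) -> Option<(Insert, Insert)> {
    if q.at > p.at {
        // q was positioned after p's insertion; undo that shift.
        Some((Insert { at: q.at - 1, line: q.line.clone() },
              Insert { at: p.at, line: p.line.clone() }))
    } else if q.at < p.at {
        // q lands strictly above p, so p must shift down by one.
        Some((q.clone(), Insert { at: p.at + 1, line: p.line.clone() }))
    } else {
        None // same position: the order is genuinely ambiguous
    }
}

fn apply(file: &mut Vec<String>, p: &Insert) {
    file.insert(p.at, p.line.clone());
}

fn main() {
    // Bob inserts at line 1, Alice at line 10 (0-based: 0 and 9).
    let bob = Insert { at: 0, line: "fn main".into() };
    let alice = Insert { at: 9, line: "println hello world".into() };
    let mut a = vec![String::from("x"); 12];
    let mut b = a.clone();
    apply(&mut a, &bob);
    apply(&mut a, &alice); // Alice applied second: her offset is 9 here...
    let (alice2, bob2) = commute(&bob, &alice).unwrap();
    apply(&mut b, &alice2); // ...but only 8 when she goes first.
    apply(&mut b, &bob2);
    assert_eq!(a, b); // both orders give the same file
}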
whatthat's kind of commuting anyway sothat's our that's the situation withdarks when it commutes it's finewhat if it doesn't commute which happenssometimes well it's an under commandedpart the algorithm so no one reallyknows what it does flow home myco-author on people is one of the coredarks developers and he told me like Iwas pretty confident in darks beforethey told me at some point yeahwe have no absolutely no clue what itdoes it seems to work most of the timeit's like it's highlights conflicts butwe don't really know other than that sothat praises obviously issues aboutcorrectness so the situation with darksnot much much better than we were kidswell at least highlights complex andwarns the user about complexbut then what it does no one knowsanother problem is slightly biggerthat's the main reason why darks was inbut I've been done even though it camebefore before get is that it sometimesexponentially slow so exponentially slowin the size of history number of patchesin their position and so that somepeople to device the like merge overthat we can work through where you liketry to synchronize all the patches youmade during the week and instead ofsynchronizing like while you're workingon it you're just like waiting forFriday night to arrive start to mergelike booster patches and well hopefullyby Monday morning when you come back toofficewell the patches are merged butsometimes not so this is this is no thismakes it kind of like not reallyacceptable and this is also one of thereasons why we could not convince ourcolleagues to out to use it because theytried to use itnaively thought yeah well patches wellwe cannot understand what they are solet's try to push about bunch of patchesrepository and well they were surprisedto have to wait for like half an hourmerge so it's not usually like that likewhen we write we're developing peoplewere using darks to develop it and it'slike we've never had we've never everruns you that's brought that problem butthat's also because we're like we knowhow it works we know what the drawbacksare we know when the exponentially isslow merges happen so but that's notacceptable like that it's it's not likeI wouldn't recommend it to anyone andthat's where people comes in so we tryto after our day trying to convince ourcolleagues to use this we just like grabgrab a beer and started discussing aboutlike the here if patchy is what whichcould be like what what good way wouldbe like you have a cool patch algebra wecould use and that would be simple andeasy to use and that's where we'restarting to learning about categorytheory so we don't exactly startedlearning learning it back then becauseit's a kind of it's now it's not theeasiest theory on earth on earth butwell so whether it is what it isbasically it's a general theory oftransformationsof things so clearly theorists like tosee it as like the general theory ofeverything the there are numerousattempts to rewrite all mathematics in acategorical language and I think it'spretty successful except that no one canunderstand this so it's already wellwritten we have all mathematics in therebut no one knows what it does all rightso but it's pretty good for us thoughbecause it's a so we're trying to talkabout changes and files and this is atheory of changes of on things so oneparticularly cool concept so this youcan you see all like I I should likethese things in darks these arecommutative diagrams and categorytheories try like drawing them basicallyall day long if several colleaguesworking mates they have their whiteboardthrough 
that these diagrams sometimes in3d sometimes it like we are interleavedarrows and no one knows but what theyare doing because they're yeah they'retalking about transformations on thingsall right so in particular when verycool concepts of category theory is theconcept of push out so push have is adepreciated of two patches is like forin our particular case the push out oftwo patches would be a file such that nomatter what you do after these twopatches that yield the common Stateslike a common file so if at least in Bobwrite some stuff and then later they canfind some patches to agree in a commonfilewell the via star here is a we call itthe push out if you can reach the caryou can reach any common state state inthe future from that star that clearalright so anything you can do to reachcommon stage you can also reach it fromthe Prashad so that means the push outis kind of a minimal common state thatcan be like that you that you reallyneed to reach leg and it's really watchwhat we want in a version control systemwe really want push out we want minimumcommon states that any furtherdevelopments any further work in therepository can also be reached from thatcommon States that's what a mergereally is and it's it's it's it's so therun problem is that it's not the case atall categories have push outs have all -shouts like we see that like thetranslation of that in English is it'snot clear than any editing any - of anycouple of editing operations on a filewill always be merge about likesometimes we have conflicts we all knowlike most programmers know bits are inconflicts so it's not the case thattries have piles and patches have old -shouts but what's cool but categorytheory is that there's a solution youjust need to dig into them like prettybig books of like abstract diagrams thatI don't fit in but then you canultimately find something called thefree conservative code completion of acategory and that's a way toartificially add all push outs into thecategory so that means when you when youhave a category of files and passesbetween files the free conservative codecompletion is a construction that willautomatically give you a category like ageneralized generalization of files sothat's like that generalization willhave all push outs so in people if wetranslate it into like files and patchesfor instance if Bob adds the linecountry like saying just print a lineBob and at least at the lines in justprint a line at least when they mergethat in people what they get is just agraph of lines where the two lines areadded they're not comparable if not yetsaid what like how they should compareand how they they should relate to eachother in the file you may be at least iswrite anybody's right maybe you're bothright maybe they're both right but in adifferent orderyou don't mean oh that's a conflict andso what that theory or the categorytheory gives you here as ageneralization of files its which iswell in most cases a little more complexthat gets more the the benefits for athree lines file or maybe not obviousbut when files get really large it'sit's really great to have that soundtheory that backs you up and right sothat's that's what people is aboutthat's that's how it works so I want tosay before moving on that this is quitedifferent from see our LEDs sociology'sare conflict-free replicated data typesand people is more like conflicttolerant replicated data types so we'renot resolving all the conflicts all thetime we're just rather than that we'rejust like accepting conflicts as part ofthe system like the data 
structure canhandle like it can it can representconflicts doesn't have to resolve themall the time and so that's what makes itfast that's what makes people reallyfast because you can apply many patchesthey are conflicting but that's alrightand then once you are done applying abunch of patches you're you can justlike detect conflicts and I'll put thefiles and that's it so what what wouldthe situation be with theologies if wewere trying to apply your leans to thisproblem of like merging patches well intheory oddities you need to always orderso the way it gets real complex is byordering all operation determinedeterministically so it finds whateversolution like whatever deterministicmerge algorithm they can find to likejust order order things all theiroperations in an arbitrary butdeterministic order like for instance wecan have a rule same lease Elisa'spatches always can always come first andwhen there are two conflicting patchesfrom at least I'll just take like takethem in alphabetical order for instanceand so you know in our case that wouldlike lead to the following file like twolines one saying print a line at leastone same printer and Bob and the userwould barely see anything they would belike yeah that's that's it the merge hassucceeded but that doesn't that isn'treally right because that like it theuser should that be we at least bewarned that there's a conflict and theyshould do something about it all rightso the end result of that it's well thefury is a slightly more complicated inwhat I've explained but not much morethe result of that is that we developeda sound theory of fashionssound the giraffe patches it has thefollowingvery cool properties so bear with me fora moment what I'm right like sayingthese like bad words so they're the thefury is like the or algebra iscommutativeit means you can like if pet cheesedon't depend on each other you can applythem in any order so that's that's whatyou would expect from that cheese rightit's associative so we resolved theinitial the initial problem and startedthis talk with so you can you canbasically like no matter how you mergepatches if the merge is right likethere's only one solution to merge andyou're not you're like you're notgetting into trouble like by you'renever like do you're never doing whatget does which is like merging thingsfrom Bob in parts of the fight is neverseen that's what associativity is aboutand also one very cool property is thatall past is about semantic inverse whichmeans when you're when you've read in apatch you can you can derive in otherpatch permits that has the like theopposite effects and then like push itto other predators you cancel previouspatches the thing is you can since youcan it's it's all commits allcommutative so you can basically computean inverse for pets you boost like 1,000patches ago and it all just works and itit does even better than just workingit's also pretty fast so merging andapplying patches it's basically the sameoperation in our systemwell it's possibly the last complicatedslide I've written like there I'll moveon to like easier stuff after that soour complexity for people we knowcomplexity here our complexity is likelinear inside of the patch and the guywith make in the size of history so it'slike it doesn't like you can have anarbitrarily big history it's not itdoesn't really matter like you can youcan apply the new patch your new patch Pafter an arbitrary alerts arbitrarilylarge number of patches it doesn'tmatter it always works the same likealmost and this is this is actually 
thatwas surprising to us but it's actuallybetter than three-way merge which alsoadds in a like square factor of the sizeof the filethat's actually really also observedthat in real world cases where we'retrying to benchmark like in early stagesof our development we were trying toburnish mark it against like really fastcompetitors such as yet and as file sizeincrease we not we actually noticed thedifference in performance to the pointthat people who was actually faster andget when merging really large patches onreally large price which I'm not reallysure is a real-world case but anyway andso that brings me to the last part ofthis talk so why what made Russ a cooltool to work with what did we like andrust and how it helped us build thisthis cool new system so one thing iswe're working on our algorithmsmathematical objects and that means weneed types we need to be able to reasonabout our code and that's very importantto us like we couldn't really like wewant to develop a sound theory ofpatches and we couldn't like theimplement it and then rely on like ourintuition to build correct CC Curcio C++code that would be like yeah maybe thetheory is correct then what about theimplementation so because we have typesand rust that makes it easy to do so butwe also want to be fast like as I saidlike the complexity theory stuff tellsus that we have the we have thepotential to become faster than gatesand we really want to exploit thatpotential and we really want to to havelike to be as fast as we can and rustthat I was also asked to do that becausewe can add like we can use a like fastback hands we can add like roll pointersto like the parameter memory like I'mMaps and what nowthat's really cool and that wasn't thecase in the early prototypes that werereally not in other languages but alsomore maybe more importantly so westarted doing that because we couldn'tconvince Windows users to install darkcinder machines and and that's and andour goal was to be as inclusive aspossible was to like green versioncontrol to everyone but we cannot reallydo that ifcan only tell you a small portion ofcomputer users which are like expertLinux users and so what we really lovedin rust is that we can write clients andservers for real word protocols so I hadto write some of that myself but it wasactually quite pleasant right but alsoyeah there was a there's a Lord unit anHTTP stack working out pretty well and Iwrote the message library alright and sowe finally got Windows support and soI'd like to thank the rust developersfor that that's that's really awesomethanks ok so as part of like what neededto be Bradon because rust pretty younglanguage there aren't that manylibraries so there are I just wanted toconclude with two really hard thingslike two really two things that I reallydid didn't put didn't think I would haveto write when I when I started itsprojects so when as a as a project thatI call sonically I'll just finish wordfor dictionary there was infinite timewell security has a transactional onthis b3 with like many others like LM DBif you know an MV like many otherdatabase backends but it's particularlyis that it has a fork operation thattrends in logarithmic time like a fastfor corporationso for cooperation means you can clonethe database in like log log in time andthen have two copies of the databasebehave as two different databases so Ineeded that to implement branches andstill work in progress because it'sreally hard to do I'm coming back to itin a minuteand then there's a another bacterialagent just like crossed 
fingers like Icrossed it's called fresh it's nestedcurrent and server library they wroteit's it's been made like it's reallyentirely in rust and it's been it'sgotten rid of all unsafe blocks twoweeks ago thanks to our brian smithwho's notokay so the trickiest parts thetrickiest thing that I had to do inwhere this is sanic area so why was ittricky well because rust always wants tofree everything before the programcloses and when you're writing adatabase back-end it doesn't reallysound right you won't like something toremain on disk after the program closesso that was really hard so you have todo like manual memory management you'relike--you're yourself and that's notreally when you're used to a functionalprogramming high scale or cameras likecoming back to manual memory managementisn't free Pleasants so how does it workso I'm going to explain anyway how rusthelped us do that so how does it workit's just not going to be very technicalhere but like most database and Giantsare like sorry some cool database isenjoying that I liked are based on betrees so B trees are basically liketrees made of blocks in each block yoursthere are like ordered elements there'sno word list of elements there's aconstant number of elements and betweenthese elements you have pointers toother other blocks to children blocksand it's so insertion happens itbelieves and then while the blocks platewhen they're gonna get too big butthat's well that's how B trees workthere's a good B tree library in thestandard brush library so now that'scool but we can actually use it to likeit doesn't allocate inside the files soyeah it's allocates the program's memoryso we have to write a like a differentlike a new library for B trees thatwould work in files and so the mainthing I've wish I knew when I startedwriting that as about iterators so waitI'm coming back to it so when whensometimes in B trees you have to mergeblocks when they're when they get tolike well when they get to like underfool we have to merge them and somethingI wasn't doing in the beginning becauselike since since I was doing manualmemory management I figured I could likelike the roothabits came back and I was like okaylet's do manual like manual stuff rollpointers and manual things and that'snot really the right solution Rusbecause you can use like cool rightthings anyway even if you have to rollpointers so main thing I've learnedwhile doing this is iterators and how touse them so merging pages like mergingblocks like this can be done in thefollowing wayso if right you're basically right ineach other like two iterators one foreach page there will like give you yieldlike all elements all successiveelements in the two pages and you canthen chain them and add other elementsin the middle that's exactly what youwant to do with your merchant pages andwell deletions and B trees are usuallypretty tricky and that allows you to ourget tested really easily and so the mainreason why I've why I really like thisis that when you're prototyping usuallyyour your head is kind of in a messystate and you really don't know whatyou're going to so most of the linesyou're writing will be deleted in theend and you really don't know how toproceed what to do and so this kind ofhurry concise and and short sorts ofstatements allow you to uh to get thatphase and get like production ready codemuch much faster because the prototypestart working much much faster anotherthing is that I've learned is that wellthen I was like oh yeahso we can have saved I've tried like wecan have cool attractions 
anyway so thenthere has Carol reflexes came back and Iwas like oh yeah let's do recursionlet's put in record recursion everywhereand one thing I've learned is that it'snot always so in Russ you can't you canwrite very concise code but therecursive way is not as surely the mostconcise you can do so in sanik area forinstance sometimes we need so it's onthese cried so there's a every time youload something from this from this kitlike cost you a lot so you want to avoidlike do you avoid doing that at all costlike and any time you can avoid loadingyour page it's a map right so anytimeyou can avoid loading your page to theprogram's memory you won't really wantto avoid itand so in order to do that the solutionsthat will do things lazily so we'reinstead of deleting something or addinglike I think new element to the sabitriwhat we're doing is just like we'redoing it lazily that means we're we'rejust saying well next time you want todo something you have to copy too youhave to copy the page and updateeverything but just at the time of doingit not right now because we we don'treally know what we're going throughthis page maybe we will merge it maybewe'll drop it and it would be a hugewaste of time to copy it and then editit or or copy it and then merge it thatwould be like we would have to copy thepage twice instead of just once so inorder to do that we need to have likewrite actually something that looks likea recursive function well we just haveto the that function which we haveaccess to several consecutive elementsin the programs call stack and that'snot really easy when you're writing alike writing it's really recursively sothe way it's done actually at the momentit's there's a fake call stack it's justbasically an array and the fact that Btrees are well balanced means that thereare not actually they're not actuallygoing to eat up the world memory of thecomputer so you know that the the thedeath is not going to be larger than 64so you can allocate an array on thestack and and write like how does falloff a stack pointer so you're basicallysimulating the program stack and so thatthese are the two main things I wish Iknew when I started writing somethingyeah so just why do thank youespecially you're a spell trustorganizers this conference is awesomewe're really enjoying us and if you havequestions I'm really happy you'reanswering Thanks[Applause]
# Document Title

Yes, can you hear me okay? We are on now. Do I have a... Jeremy, do I have a clicker? Do I have to go up here? I have my own clicker, can I plug it in? Yes, this is an SQLite clicker module; it is not just your average clicker. Yeah, no, I can't talk without moving my arms, right, I can't do that. I'm already live? Unfortunately, I think we only have a single USB port on this machine. Oh, it's just slow to respond.

Thanks, everybody, for coming. My name is Richard Hipp. This is a talk on Git and why you shouldn't be using it in its current form. Really, this is going to be a talk about what we can do to make Git better. Here's a complete copy of the slides in the original OpenOffice format, there, if you want to download them. (I'm getting a little bit of feedback; do I need to be somewhere else, do I need to stand on the stage? Maybe it's buzzing. Don't worry about the buzzing. Okay.)

So this talk is about Git. Now, just to be up front with you, I am the creator and premier developer of a competing version control system, but I'm not here to push my system. They're both about ten years old, and Git has clearly won mind share. Everybody who is using Git, raise your hand. Raise your hand if you want to be using it. Raise your hand if you are using it but wish you were not. Yes, okay. So it is the software that everybody seems to love to hate, and I'm going to talk a little bit about what some of its problems are and what we can do to fix them. As I said, I wrote a competing system, and I'm not pushing that today, but my decade of experience writing and maintaining that system informs my criticism of Git.

Before we get going, I have a collection of quotes about Git. I love to collect these; if you see any, let me know. I brought along a few of my favorites.

This is a recent one, from Benjamin: "It is so amazingly simple to use that Apress, a single publisher, needs three different books on how to use it. It is so simple that Atlassian and GitHub both felt the need to write their own online tutorials to try to clarify the main Git tutorial on the actual Git website. It's so transparent that developers routinely tell me that the easiest way to learn Git is to start with the file formats and work up to the commands."

I love this one; this was Jonathan Hartley: "It's simplest to think of the state of your Git repository as a point in high-dimensional code space, in which branches are represented as n-dimensional membranes, mapping the spatial loci of successive commits onto the projected manifold of each cloned repository." And if you understand what that means, you should probably be using Git.

This is from Nick Farina, co-founder of Meridian (this is a different Meridian from the people who are right outside that door; a different company), and he wrote: "Git is not a Prius. Git is a Model T. Its plumbing and wiring stick out all over the place. You have to be a mechanic to operate it successfully or you'll be stuck on the side of the road when it breaks down. And it will break down." Emphasis as in the original. This was in an article really pushing Git, saying it's the greatest thing in the world, but he's really up front about its limitations.

I've got a ton of these, but my favorite is the next one. This is from a guy named T. Stain on reddit: "Klingon code warriors embrace Git. We enjoy arbitrary conflicts. Git is not for the weak and feeble. Today is a good day to code." So, you know, you're all Git users, you're laughing at this, and the reason you're laughing is because you know it's true.

So here are my top 10 needed enhancements for Git. I've got a longer list; I tried to limit it to 10, and I tried to order them in what I think is the order of importance. I'm going to start with the first one up here: show the descendants of a check-in. This is a showstopper for me. Because Git does not do this, I can't use Git; I have to use a different system. What do I mean by that?

Think back to how Git is implemented; apparently, in order to use Git successfully, you kind of have to know the low-level data structures. You've got this linked list of commit objects. There are four different types of objects in the file format (commit objects, tree objects, blob objects and tag objects); we're only dealing with commit objects here. Each one has a complete SHA-1 hash label; I have shortened the labels to a single hex digit just for readability. So the first check-in was commit number 9, and that blob goes in there, that's great. Then there were some additional check-ins after that, E and F; there was a fork there, and the F branch went to D and the E branch went up to C. Then there was a merge operation at B, and then A is the latest; A is HEAD.

Each one of these commit objects has a pointer to its parents, so it forms a graph just like this. And the Git documentation does show the graphs going this way, with the arrows pointing backwards in time, which is kind of counterintuitive. I mean, yes, this is the way you need to implement it under the covers, definitely, but users ought to see the graphs with the arrows going forward in time. The thing is, if you're sitting at E and you wonder what comes next, there's nothing here to tell you. You can follow the arrow and find its parents, but you can't find its children. And there's no way to fix that: after somebody commits C, you can't modify E to add a list of children, because if you were to modify E, that would change its hash; it would become a different check-in, a different commit. So there's no way to do that, and this is a big deal: for that reason, if you look at any of the user interfaces, it's very, very difficult to find the descendants of a check-in.

How do we solve this? I propose solving it by having a shadow table off to the side. You keep the same internal data structure, because that's the way you need to do this, but you could make a table in a relational database that keeps track of the parent-child relationship. I've got a little simple table here: it's got the parent hash, the child hash, and then "rank" is a field that says whether this is the merge parent or the primary parent. So let's look at how this goes.
So you're all Git users, and you're laughing at this, and the reason you're laughing is because you know it's true. So here are my top ten things, the top ten enhancements needed for Git. I've got a longer list; I tried to limit it to ten, and I tried to order them in order of importance.

I'm going to start with the first one up here: show the descendants of a check-in. This is a showstopper for me. Because Git does not do this well, I can't use Git; I have to use a different system. What do I mean by that? If you think back to how Git is implemented (and apparently, in order to use Git successfully, you kind of have to know the low-level data structures), you've got this linked list of commit objects. Now, there are four different types of objects in the file format: commit objects, tree objects, blob objects, and tag objects. We're only dealing with commit objects here, and each one has a complete SHA-1 hash label. I have shortened the labels on these to a single hex digit just for readability. So the first check-in was commit number 9, and that blob goes in there, that's great. Then there were some additional check-ins after that, which were E and F; there was a fork, a branch, there, and the F branch went to D and the E branch went up to C, and then there was a merge operation at B, and then A is the latest; A is HEAD. Each one of these commit objects has a pointer to its parents, so it forms a graph, just like this. And in the Git documentation it does show the graphs going this way, so the arrows are pointing backwards in time, which is kind of counterintuitive. I mean, yes, this is the way you need to implement it under the covers, definitely, but users ought to see the graphs going forward, with the arrows going forward in time.

But the thing is, if you're sitting at E and you wonder what comes next, there's nothing here to tell you. You can follow the arrow and find its parents, but you can't find its children. And there's no way, after somebody commits C, to go in and modify E to add a list of children, because if you were to modify E, that would change its hash: it would become a different check-in, a different commit. So there's no way to do that, and for that reason, if you look at any of the user interfaces, it's very, very difficult to find the descendants of a check-in.

How do we solve this? I propose solving it by having a shadow table off to the side. You keep the same internal data structure, because that's the way you need to do this, but you make a table in a relational database that keeps track of the parent-child relationship. So I've got a little simple table here, and it's got the parent hash, the child hash, and then rank, which is a field that says whether this is the merge parent or the primary parent. Let's look at how this goes. We see that A has a parent which is B on the first entry here: B is the parent, A is the child, and the rank is zero because B is the primary parent; it's not a merge. B is a child of two different check-ins, both C and D: C is its primary parent and D is its merge parent. And you can have a three-way merge too, and you number it the same way. You can see how this table very succinctly represents the graph. In fact, this is not a primary data structure: you could build this table very quickly just by looking at the git log.
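Here is a minimal sketch, in SQLite, of what that shadow table and the child lookup might look like. The table and column names are hypothetical, chosen to match the description in the talk:

    -- Hypothetical "lineage" shadow table mirroring the commit graph.
    -- It holds no new information and can be rebuilt from the git log.
    CREATE TABLE lineage(
      parent TEXT NOT NULL,    -- SHA-1 hash of the parent commit
      child  TEXT NOT NULL,    -- SHA-1 hash of the child commit
      rank   INTEGER NOT NULL  -- 0 = primary parent, 1+ = merge parents
    );
    CREATE INDEX lineage_parent ON lineage(parent);

    -- The query Git's own structures cannot answer directly:
    -- the children of a given check-in.
    SELECT child FROM lineage WHERE parent = :hash;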
Once you have this table, it becomes very simple to do a query to get the children or the parents of a particular check-in, and if you have appropriate indices on the table, that query becomes very, very fast. This lets you find the descendants, and of course you can do more complicated things with it. For example, usually what you want to do is: you've got a check-in and you want to know what came afterward; you want to find the 50 most recent descendants, in time, of a particular check-in, to see what was happening. Somebody checks in a bug, you find out about it two years later, okay, what was the follow-on to this? For this example I've added another column to the table, called mtime, which is just a timestamp, and then, using a simple recursive common table expression, I can immediately get the 50 most recent descendants of that check-in. Now, this is just a single SQL query. It's not a common thing; it uses a common table expression, and if you don't know what that is, there's actually a talk by me right after lunch where I will explain how this works. But the point is, a simple query gives you this. You could do the same thing by looking through the git log and doing lots of complete table scans; it would be a lot of code, and it would be slow. And I note that after ten years of intensive use, with lots of user interfaces, nobody does it, so this information is just not available to the people who want it.

You could do all sorts of other things if you had this table. For example, you could find all of the check-ins that occurred during some interval of time, and that's just a select on the table with mtime between two values. I do this kind of thing all the time, because I keep separate components of a project in separate repositories: the SQLite source code is in one repository, the documentation is in another, some of the test cases are in the original source repository, but I have several other repositories that contain additional test cases. So I get an email or a phone call from a client that says, we've got a problem with this version of SQLite that is three years old, and we go back and bisect to a particular check-in, and we wonder: why did we make this change? We want to see what was happening in the other related repositories at the same time, so we can remember what we were thinking three years ago. I don't know about you, but I cannot remember what I was thinking two years ago this week. Do you even know where you were two years ago this week? I don't. (The answer is that we were probably here two years ago this week.) But by doing a query like this, I can get a complete listing of what was happening in all of the relevant repositories: oh yes, that's when we were working on such-and-such a feature, and now I see why we made this change. This happens on a daily basis for me, and it's not easy to do with the current Git structure.

What are the thirty closest check-ins to a particular point in time? Same kind of thing: we've got a change that we're investigating, we know this change introduced a bug, and we want to know what was happening around that point in time, not necessarily on that same branch; maybe on parallel branches. What were we doing at the same time? This is a very important thing when you're tracking bugs, and a system like this allows you to do it.
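Sketches of the three queries just described, again using the hypothetical lineage table from above plus its mtime timestamp column; these are my reconstructions of what the slides showed, not the originals:

    -- 1. The 50 most recent descendants of check-in :hash,
    --    walked with a recursive common table expression.
    WITH RECURSIVE descendants(hash, mtime) AS (
      SELECT child, mtime FROM lineage WHERE parent = :hash
      UNION
      SELECT l.child, l.mtime
        FROM lineage AS l, descendants AS d
       WHERE l.parent = d.hash
    )
    SELECT hash FROM descendants ORDER BY mtime DESC LIMIT 50;

    -- 2. All check-ins that occurred during an interval of time.
    SELECT child FROM lineage WHERE mtime BETWEEN :t1 AND :t2;

    -- 3. The 30 check-ins closest to time :t: a couple of selects
    --    with a union, ordered by time difference, limited to 30.
    SELECT * FROM (SELECT child, mtime - :t AS delta FROM lineage
                    WHERE mtime >= :t ORDER BY mtime LIMIT 30)
    UNION ALL
    SELECT * FROM (SELECT child, :t - mtime AS delta FROM lineage
                    WHERE mtime < :t ORDER BY mtime DESC LIMIT 30)
    ORDER BY delta LIMIT 30;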
I forgot to mention, when I was showing you the original chart, that this lineage table is not really a primary data structure, in the sense that it's not holding any new information. All the information in this table is in the original Git objects, and if the table were out of date for some reason, some software bug or something, you could just delete the whole thing, rescan the git log, and rebuild the table any time you want. My version control system has the same kind of situation, and there's a rebuild command: you just type rebuild and it rebuilds this table from primary sources.

So here's the example of finding the 30 nearest check-ins in time: it's just a couple of selects with a union, ordered by time difference and limited to the first 30, and that's very fast. And you can do things like this: I imported the complete Git repository for Git itself into a different version control system and ran this kind of query on it, just to find out what the oldest commits in Git itself were. These were the first five commits to Git, and I was amused by the very first one, where Linus checked in the very first code to Git and described it, in his own words, as: the initial revision of "git", the information manager from hell. You can actually see this in the Git repo; it's right in here at the bottom, that's the very first check-in. Notice that in this particular rendition the arrows point forward in time rather than backward, which I personally find more intuitive. You can get the same information by doing git log and piping it through tail, to see just the last few entries. (Yes, there's also git log --reverse, which prints them in reverse order; I did not know about that one. I did time it: this took three milliseconds, and git log piped to tail took the better part of a second, so this is faster.) Okay, so that's the big complaint I have: I can't go back and explore the history. It's just this linked list with the arrows going the opposite direction from the way I normally want to go.

The next big problem I have with Git is that it has an overly complex mental model. In particular, when you're working in Git you need to keep in your mind five different snapshots, five different commits. You need to remember what's in your working directory, the files that you're editing right now. You need to remember what is in the index, or staging area. You need to be mindful of your local head, the branch that you're working on. You need to be aware of the local copy of the remote head, that is, your copy of what's going to be on the server. And then you also need to be aware of what is actually the remote head. And there are commands in Git to move information between all five of these things. Really, if you're a developer, you only need to be concerned with two of these, the first one and the last: your working directory, and what's actually on the server, what everybody else sees. All this other stuff in the middle, B, C, and D, is just complication. It forces you to keep in your mind two and a half times more information than you really need. Every one of us has a finite number of brain cycles; you can only think about so much at a time, and my view is that the version control system should get out of your way and use as few brain cycles as possible, so that you can devote as many brain cycles as possible to whatever project you're working on.
Having to keep in mind B, C, and D just seems to be stealing cycles from your normal thinking activity. So one of the first things that I think really ought to go (and of course these things should still be available in the rare cases where they're actually needed) is the staging area. I talk to a lot of people about this, and if you have any views I'd really like you to share them. Some people are fanatical that the Git index is a great thing, and when I ask them why, the usual answer I get is: well, it allows you to do a partial commit. Every other version control system in the world allows you to do a partial commit too, and they don't have a staging area, so I'm not sure why that's the advantage.

Then there's the fact that your commits, the local head, and the remote head don't automatically stay in sync. There may be cases where that would be a desirable thing, but those are the exceptions, not the rule. Usually, when you do a commit, you keep it on your machine, but you'd also like it to immediately go out to the server so that everybody else can see it too. Now, sometimes you're off network and that doesn't work, and so you have to be aware of these things, but that's the exception, not the rule. The usual case is that you want it to go immediately and automatically. Yes, some people say that's not their usual case, but here's the experience I have: when I was originally doing my distributed version control system, it worked the same way, where you had to explicitly push as a separate step. We developed some experience with that, and we eventually found out that it works a whole lot better to push automatically: every time you commit, it automatically pushes, and this really solves a lot of problems. In fact, there were some users on the mailing list of my system recently; sometimes you can get into a race condition where two people commit at the same time, or nearly the same time, and there's a race to see who pushes first, and it automatically creates a branch. They were upset that they weren't getting feedback that two people were committing at the same time. So it goes beyond just automatic: people want not only automatic push, but automatic notification that other people have committed as well. That is what people really want, right?

Number three: Git doesn't really store your branch history. In the Git world, a branch is just a symbolic name for the most recent commit on the end of one of these commit chains, and as you commit new things on there, that pointer moves. So Git doesn't really remember the name of the branch where a commit was originally made. You can kind of use some inference and figure it out, and some of the tools will show you this, but it's not really first-class branch history. When you're doing analysis, and people come back to you and you need to look at what was happening two or three years ago, you often want to know: what branch was this check-in originally committed on? What was the original name of the branch? What are all the historical branches that we've had in this project, what were their starting and ending dates, and how were they finally resolved: were they merged, are they still active, were they abandoned? List all the historical branches. I want to do a bisect only over this particular branch. Bisect is very important to
us. So what happens is: somebody wants to implement a new feature, and they do a sequence of check-ins on a branch, and then they merge that branch onto the trunk. But later on, when we're doing a bisect, we don't want to bisect into all those little incremental changes where they were adding the feature; we want just the one case where they added it. But because there's no branch history in Git, there's no way to keep up with this. You can make inferences based on the commit logs, but there's no permanent record of the name of the branch where things were committed. I didn't think this was very important, but I asked for feedback from the user community, and a lot of people said this was their number one complaint with Git: it forgets the names of my branches.

Number four: multiple check-outs from the same repository. Right now with Git, the working area is part of the repository; you can only have one working area per repository. If you're working on something, you've got your project all taken apart, and an email or phone call comes in that requires you to go back and look at something historical, well, you can stash your work, but that's kind of bad, because even with the stash you're going to lose context. You can clone your repository to a new repository and look at the historical version in the clone, but that's just a workaround, and it's unsatisfying. It would be so much nicer if you had one repository and could have multiple working directories sitting on different checkouts from that one repository. People who have worked in both systems tell me repeatedly that this is very important to them: multiple checkouts from the same repository.

Next: sliced and cloned checkouts; sliced checkouts and clones, excuse me. A slice is kind of a feature that you had with Subversion and CVS, for when you've got a massively wide project, like NetBSD. If you ever look at their repository, they have the entire user space, with 60,000 different files, all in one repository. It's massive, and most people don't want all 60,000 files; they're only working on one subdirectory. So a slice checkout means: I want to check out or clone something, but I don't want the entire repository, just this one subdirectory. Wouldn't it be great if you could do that? There's really no technical reason why you can't; it's just that it isn't supported.

Yes: what is a shallow clone? A shallow clone is where you clone a repository but you don't get all of its history. Yes, that's another thing that's nice to have, and that is a new feature; it's just been in the last year or two. The other thing a lot of people request, and Git does support this now, is: you've got a project that goes back ten years; if I want to access this project, why do I have to get ten years of history? Can I get by with just loading two months of history and save bandwidth? That's a shallow clone, and Git has that. But we're also asking for a slice clone: a shallow clone is slicing the history this way, and a slice is cutting it the other way. It would be nice to be able to do both a slice and a shallow clone.

Yes, question? The comment is that on some projects the directories move around and things change, and so slicing doesn't work as well there. So: don't slice on that project. But on some projects, like NetBSD, the directory structure stays the same for like 25 years, and they want to do this sort of
thing. Yes, question? The point was raised that the Git solution, and I'm going to go beyond Git and just say the distributed version control solution, because they all have this problem, including mine, is to have separate repositories for each one of the little components that you might want to load separately. But this is kind of a workaround, isn't it? It means you have to predict in advance which directories are going to be of interest as a separate piece, and historically we've not been really good at predicting that. It would be much better, and the software could in theory do it, to be able to clone or check out a slice.

Next: checkouts and commits against a remote repository. Right now, in order to work in Git, you have to make a clone, and it has to be on your local machine; everything has to be local. And I'm asking: why is that? Well, I know the technical reasons why, but from a user's perspective it seems like an unnecessary burden. Now, if you're an active developer, yes, you do want your local copy, and if you're going to be working off network you definitely want a local copy, and that should probably be the default. But if I'm just browsing GitHub somewhere and I see some interesting project and I want to look at the source code, why do I have to download the entire 15-year history just to look at the latest version? Why can't I just check out directly from GitHub without cloning? That seems like it would be really easy to do. For that matter, if I make a change, why can't I commit it back over the network? There are advantages to having it local if you're doing a lot of work on a project, but if it's just an occasional thing, why can't I commit over the network?

Okay, the next one is a busybox version of Git. Who knows what busybox is? Busybox is, of course, that single program that has all of the standard UNIX command-line utilities built in. Now, this is not a perfect analogy, because busybox also has limitations; it doesn't do the full thing. But right now, when you install Git, it installs, what is it, 134 different programs, because each command is implemented by a different executable, and all of these little programs get put in a special directory somewhere, and then the one git program looks at the arguments and decides which one to run. It's got a huge number of dependencies; it's this big pile of stuff. And a lot of people tell me that they really want a version control system that's just one executable: you download git.exe, or just git if you're on Linux, you put it on your PATH, and it works. Nothing's installed; you don't need apt-get. You can put it in a chroot jail. If you want to upgrade, you just overwrite the old binary with the new one. If you want to uninstall it, you just delete the binary. Very simple. Whereas with Git, you really need something like apt-get just to manage it. A big pile of programs like that is great for development work and rapid prototyping, but for a mature product that's ten years old, that everybody's using, you'd think there would be some better packaging.

And then you can just download it really quickly. I hear from a lot of people who work in companies where, when they're going on a trip, they have to check out a laptop; they don't have their own laptops. You check out a
laptop, and it comes pre-configured, and it doesn't have the version control system you want, so they have to go and install Git. Wouldn't it be better to have a single binary they could just drop on the machine?

Next: everything comes over HTTP or HTTPS. My wife is a faculty member at UNCC, the local university, and I go over on campus a lot, and over there they have guest Wi-Fi, Niner Guest. This man right here is probably in charge of it, and yes, I am grateful for the free Wi-Fi access. But, like so much of the world, they confuse the Internet and the World Wide Web; they think they are the same thing. That means Niner Guest only allows you to use TCP ports 80 and 443; those are the only two options. So you cannot secure-shell back into your server, and furthermore you can't run any other protocol that doesn't run over port 80 and doesn't look like HTTP. And it's not just UNCC that does this; a lot of places do. I hear from a lot of people that they use my alternative version control system precisely because it uses plain HTTP: we use it because it's the only one that will penetrate our corporate firewall. There's nothing about Git that couldn't be finagled to work over HTTP; it's just that they don't do it. So I think that is something that would greatly improve its usability.

Next, I think there needs to be a git all command. This is a thing we did, where all means it works on all of your repositories; it keeps track of all of your repositories. As I said, I have dozens of repositories open on my desktop at any particular point in time, and I lose track of them; I can't remember them all. So I'm working all day on all these different projects in different repositories, I get to the end of the day, and I'd like to be able to say git all status, and have it go around, find all my repositories, and do a status on each one, to show me what I forgot to commit.

The way it would do this, of course: there's already a file in your home directory that keeps track of things. Isn't it called .git? Well, no: the file in your home directory that keeps track of your username and all that stuff is .gitconfig, okay. So there's already that file, and every time you run a git command, it consults that file; it reads it. And people complain: you can't keep track of all the repositories, because you can freely move them around; you can just do a mv and move one to a different place. That's fine: .gitconfig keeps track of the last known position of each one. So every time you run a git command, it says: okay, this repository I'm working in, I just read .gitconfig, is it listed? Usually it will be; if it's not, add it. And then, when you run a git all command, it goes down the list of possible git repositories and checks each one to see if it still is one, because you might have moved it away. So this is easy to implement, and it's not 100%, but in practice it works well enough. So if I'm working on my desktop (of course, I don't use Git, I use a different system, but if I were using Git) and I'm getting ready to go on the road, I can do git all push, and it pushes everything out to the server. Then I go over to my laptop and do git all pull to make sure that everything is synced, and then I can go off network on my laptop, and I don't have to worry that I forgot about one of
my critical projects, one of my critical repos. It's a very important thing.

And finally... no, that's not finally, there's a bonus item: git serve. Does anybody here use Mercurial? You know about the serve command? Do you use hg serve? Okay, and the comment from the audience was: this is exactly why we use Mercurial rather than Git, because it has a serve command. What this does is it kicks up a web server. Although, you know, Mercurial doesn't really go far enough, in my view; let me tell you. If you do hg serve, it starts up a little web server, and then you can point your web browser at it, and you get lots of really useful information about your repository. But the one I implement for my system goes one step further: when you type fossil ui, it starts up the server and it also causes your favorite web browser to pop up on that page. So where Mercurial is a two-step process, you start the server and then type the URL into the web browser, mine does both in one step. But the point is, there's this very rich environment there. I know there are lots of tools out there for giving you a web interface to your Git repository, but they're a separate install; they usually require that you also have Apache; there are lots of requirements and a big setup. So people only do it on a server somewhere, where they've taken the time to set it up. But wouldn't it be really cool if every time you had a repository, you automatically had a server? You could just say git serve, and immediately your web browser pops up with all this graphical historical information that you can just click around in. If you had that, and you used it for a couple of weeks, I promise you would never believe that you ever got along without it. It is a very amazing thing.

I'm probably blazing through these slides way faster than... how much more time do I have? Ninety minutes? Plenty of time for questions and answers. But I do have one bonus feature. This is a thing that I never personally needed, but I hear from a lot of people that they would really like to have it: advisory locks. What do I mean by this? This is coming from the game development community. When you're developing with ASCII text files... let me take you back to some of the older version control systems; if you're younger than me, you may not remember some of these. With things like SCCS and RCS, the way they worked is that when you did a check-out, all your files came out read-only; you couldn't edit them. If you wanted to change a file, you had to do a special check-out for editing, which locked the file, so that only one person could have a check-out for editing at a time. That way you would never get a conflict of any kind, because only one person could edit a file at a time. Of course, the big downfall of that approach was that somebody would check something out for editing and then immediately leave for a two-week vacation, and we'd have to go running around finding an administrator to unlock it, and so forth. So CVS came along, and it gave you the ability to just edit without a check-out for editing, and that was the coolest feature in the world; it was just amazing. I know it's very popular these days for people to bad-mouth CVS, and I recognize that CVS has limitations and is an older technology, but those of us who had to use what came before CVS will never speak ill of CVS.
So now we have all this really cool merging stuff, so that when two people make simultaneous changes, they get merged together. That works great for text files. It does not work for JPEGs; it does not work for MPEGs; it does not work for the binary resources that are a big part of game development, for example, but also of other things. And so a lot of people would love the ability to put an advisory lock in the central repository: I'm editing this JPEG. Then, if somebody else wants to edit that JPEG, they get a warning. Now, it's an advisory lock, so if they start editing and then go on vacation, you don't have to go run up an administrator to fix it, but it still helps you coordinate the editing of binary resources.

So we've had a progression of open-source version control systems. In the old days there were SCCS and RCS, and then there were CVS and Subversion, which were huge, huge innovations. And then Git came along. It was really based on this thing called Monotone, which I think really pioneered the idea of distributed version control systems, but Git was the one that was successful. But that's been ten years, and other than adding a few features around the edges, such as shallow clones, Git really hasn't advanced in ten years; it hasn't done much that's new. And so my question is: what is going to come next? I've outlined some ideas here about the direction I think version control needs to go. I'm hopeful that some of you might be interested in going out and hacking on it and implementing some of these ideas; maybe somebody who's watching the video will see this. If you have other ideas, criticisms, or complaints, if you think that I'm completely off base, I really do want to hear from you. Again, I don't use Git on a daily basis; I use a different system that I wrote myself, and so I can be somewhat out of touch. If you think I'm completely off base, I really do want to understand your point of view, so please give me feedback. That is the extent of my talk, and I will be happy to take questions, comments, and criticisms at this point.

So, you've got a question in the back? No? No, I was going to say: we've already established that everybody here is using Git, is that correct? Who's also using Subversion? Raise your hand if you are a current or former Subversion user. Okay. CVS: current CVS users? Nobody's still using CVS? Former CVS users? Mercurial: who's using Mercurial? Something different: call out what you've got. Darcs, and RCS. Okay, fine. Yes? Okay, well, you know, even when I have just one file, I'll set up a repository for that one file, and then I'll also set up a clone on a server somewhere, and my system, every time I do a check-in, pushes it, and that's my backup. A different one: Perforce. How do you like Perforce? It gets the job done, okay. Another one, a very obscure one called Fossil that I might not have heard of? Yeah, okay, great.

So: are you happy with Git? I see a lot of heads going this way; you're handy with Git, you like it. Okay, so the comment was: his problem with Git is not Git itself, it's other people. He invested a lot of time to learn all of the obscure commands, to learn the tree structure, so now he can get around in Git pretty well, and then other people come along after him and make a mess of the repository. Right. Okay. Do you
think that's fair, though? A lot of people use version control other than programmers; data scientists, for example. A lot of people who are not programmers need to use version control. I wish that more scientists would use version control; I wish that climate scientists would use version control. Okay, there you go. But Git is hard to use: you have to spend a lot of time learning all these obscure commands to move between the five different things that you have to keep in mind. One of the funny quotes that I didn't have a slide for was: in order to effectively use Git, you have to have the man page tattooed on your arm. And, for the video, the audience is saying: yes, you do. Oh, and: we wish that Congress would use version control; we have applause from the audience.

Yes, question in the back? All right, the comment from the audience was that he looked at the online documentation for, say, git push, which is a very common command, not something obscure, and the documentation makes no sense. Yep: "push refs to remote repository." What does that mean, really? This goes back to my first point, that the best way to learn Git is to start with the data structures and then work up to the commands. And you're right: if you're not a programmer, if you're not hardcore, if you're not a Klingon warrior, you shouldn't have to learn this stuff. That's the point.

Yes? All right, I'm going to try to summarize those remarks: he teaches robotics to high school students who are new to programming, and Git is so complex that it's just beyond the ability of a newbie to learn; it's a barrier to entry. You need to be a kernel hacker to really understand it. But on the other hand, we need to be teaching new programmers version control as a core skill, and they take one look at it, turn around, and change their major to history.

Yes, in the back? Okay, I'll just go ahead and say it: I develop Fossil. So, Git and Fossil, right. The comment was that Git and Fossil, and also Mercurial and Monotone, are commit-oriented, whereas Darcs is patch-oriented, and can I comment on that? What we should do is get together over lunch, because I have never really understood Darcs; I tried, but it just wasn't making sense to me. If you look at a graph, the patches are just the arcs between the nodes, right? So Darcs is really focused on the arcs, whereas Git is focused on the nodes. Is there a difference here that I'm missing? All right, so the point was made that Darcs can answer questions that Git cannot answer, and to summarize the remarks: the idea of keeping track of patches works better for some people's way of thinking than keeping track of commits. That may well be the case, and I'm not opposed to it. See, I'm not here to tear down Git; I'm trying to make it better. And really, if it gets good enough, I'll convert all of my Fossil repositories over to Git and just start using that, but right now it's not anything close to where I need it to be. My point is to improve it for everybody.

Any other comments or remarks or questions? Yes, oh yes, there's the famous Git man page generator. I should have made a link to that; see me afterward, because I want to put that in every version of this talk. Also: those of you using Git, who's ever had a detached head? Yeah. I just want to get back to my first point: if you had this relational database keeping track of things, a detached head becomes an impossibility. There are no more detached heads; it completely solves the detached-head problem. I meant to mention that earlier and I just forgot. (In addition to fixing Git, we need to come up with a new aid for presenters, so that we can have reminders up here on the screen.) And the Git master here tells me that he does detached heads on purpose. So one person says he likes detached heads because he does them on purpose, and then somebody else says detached heads are all fun and games until you do one without meaning to, and Git will eventually garbage-collect your detached commits, so your detached heads will go away. But with an approach like this, detached heads just appear: in the little graphs that you get with your graphical interface, there's a nice graph of your history, and detached heads just appear there, and you can click on them and see what they're about. It's not some mystery that you have to go dig up out of a log; you don't have to remember that they're there; it just shows you, and you never lose them.
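With a lineage table like the one sketched earlier, every head, detached or not, falls out of a single query; this is my illustration, not something from the talk:

    -- Heads are the leaves of the graph: commits that never
    -- appear as anyone's parent.
    SELECT child FROM lineage
     WHERE child NOT IN (SELECT parent FROM lineage);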
Any other comments or questions? If you want to know more about the alternative, you can meet me in the hall; we'll be happy to give you demonstrations and a sales talk. Thank you for coming, and enjoy the conference. If the video is still going: I'm getting reports from the audience that the man page generator for Git is very funny.
Thank you... you haven't even heard what I have to say yet. Okay. So, since we're a nice cozy audience: the material I've prepared is introductory, mostly. The idea is to give a flavor of what Darcs is about, why it exists, and how it works, at a fairly high level. But I do know that at least a third of the audience is quite familiar with Darcs, and I don't know about the rest of the audience, so I'm happy to dive into more detail in questions and so forth if people would like that. When I get to the demo, for example, if everybody feels that they know exactly what Darcs does, I can skip it, and instead you can ask difficult questions about some of the other slides. So I guess: does anybody not know what Darcs is? Has anybody here not used Darcs? Okay, so at least some introductory material is worthwhile.

So what is Darcs? I guess from the point of view of a Haskell programmer, these days you probably only come across Darcs occasionally. Five years ago most Haskell projects were hosted in Darcs, and now, roughly speaking, Git has won, and there are only a few Haskell projects that are still using Darcs. So what is it, exactly? Well, like Git and Mercurial and so forth, it is a distributed version control system, and historically it was one of the very early ones to be around, after tla (what's now Arch) and some ancient stuff in that mold. In other words, it's distributed; it's not like Subversion and CVS, it doesn't have one central server, or at least you're not forced to have one central server. But I think the concept of distributed version control is now pretty familiar to people. Unlike other distributed systems, in fact unlike any other version control system, the first-class objects in Darcs are patches. In most version control systems, what the version control system really cares about is storing trees: you have your source tree, you change it a bit, and it stores another copy of your source tree, probably with a lot of sharing and optimization so that it's not storing an entire copy each time, but as far as the version control system is concerned, its job is just to give you back a source tree at a particular revision and keep track of all that for you. Darcs tries, in its fundamental model, to use patches, the changes between trees, rather than the actual trees themselves, and I'll go into more about that later.

So why bother? And by the way, I'm very happy to have heckling from Git users who think that Git is the be-all and end-all; we can have a nice argument. Why bother keeping Darcs going? We could just tell everybody who's using Darcs: you might as well stop, you might as well switch to GitHub right now. Those of us who are still working on it think that this idea of first-class patches actually is important. It gives you a kind of power, and a different kind of model for working with version control, that really does give you something that Git and Mercurial and so forth don't. The other aspect of this is that, in practice, most projects are hosted in Git nowadays, and we're working on a bridge to Git, so that
people will be able to host their projects in Git while some people use Darcs, or have their central host be in Darcs and allow people to submit Git patches. That's kind of in progress.

[Question: you say Git gives the impression of being based on snapshots; in what way?] Well, what you see in the Git user interface is commits, right? Those are specific states of a repository. If you bring up the Git user interface, you see a tree, or really a graph, of revisions, and each of those revisions corresponds to a specific source tree. Whereas in Darcs, when you look at the changes in a repository, each of those changes really does correspond to the changes from the previous state; that's how Darcs stores them, and that's how Darcs lets you manipulate them. I'm sure you know what I really mean here, but that would be my spiel for that question.

Okay, so what does it really mean to have first-class patches? Well, part of this is really just a philosophical difference. I mean, Haskell has first-class functions; so does C, right? But it's fairly clear which you'd prefer to use if you really want to write functional programs. It's the same kind of thing with Darcs: if you look at Git or Mercurial or whatever through the right glasses, you could think of the diff to the previous revision as being the first-class object in that system, but really, it doesn't make it easy to work with it that way. Darcs does its very best to make it as easy as possible to treat patches as the real unit of work in a repository.

So what does that actually mean in practice? One of the really key benefits that you get from using Darcs is cherry-picking changes: it's very easy, it's encouraged by the user interface, and it's very cheap. And once you've cherry-picked a change, and you later decide you want the changes that you skipped over when you did the cherry-picking, you pull those back into your repository and you've got exactly the same thing as you would have had if you had pulled them in the normal order. I'll show you that a bit more in my demo. Whereas if you did that in Git, you'd end up with two different repositories, and you'd have to merge the two to get back to the same state. The other thing about Darcs is that merges are deterministic: any time you do a merge in Darcs, it gives you exactly the same result, wherever you are in the world, whatever version of Darcs you're using. The behavior of the merge depends only on what Darcs chose to record at the time you actually saved your patches.
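The property underneath both of those claims, cherry-pick order-independence and deterministic merging, is patch commutation from Darcs's patch theory. Stated loosely, in my notation rather than the speaker's: two sequential patches $p$ and $q$ can be commuted,

    $$ p\,q \;\longleftrightarrow\; q'\,p' $$

where $q'$ and $p'$ represent the same changes adjusted to apply in the opposite order, and both orders produce the identical tree. Cherry-picking $q$ without $p$ means applying the commuted $q'$; pulling $p$ later reproduces exactly the state that applying $p$ then $q$ would have given.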
From the point of view of a user of Darcs, we like to think that we have a simple, intuitive user model: if you just start using Darcs without knowing how, we hope you'll quickly get to grips with what's going on behind the scenes enough to understand how to use it, without really needing to understand the guts too much. I think that's a criticism a lot of people make of Git: that it's very hard to use it well without expert-level knowledge of a large portion of how Git works and how you're expected to interact with it. So, by contrast again, Darcs tries to have a simple command set. Another philosophical point about Darcs is that we try to make commands interactive by default: when there are options about what a command could do, rather than making you pass flags to get exactly the behavior you want, Darcs tries to ask you questions. That's not entirely the case, but it will ask you, for example, how much of this change you would like to record.

So what doesn't Darcs have? Well, a whole load, really. The features people tend to find missing most are these. Multiple-head repositories: in Darcs you have to have one directory per checkout, basically per branch, so if you want to branch and have two different things in flight and switch between them, you need two different directories in your file system. Hopefully we'll fix that at some point, but it's not trivial, and some people argue that we shouldn't do it, though I think most of the Darcs developers have concluded that we should. Darcs makes a mess of conflict handling if you have complicated conflicts: simple conflicts aren't particularly problematic, but when you start having conflicts with conflicts and that kind of thing, both the user interface and the internals of Darcs tend to get a little bit upset. In practice people don't run into this very often; Darcs works fine with small to medium-sized repositories, but it doesn't work brilliantly with really huge ones. Related to that, it's not all that fast. It doesn't have unique revision identifiers for a particular state of the repository, which really annoys some people, because with things like Git, or even Subversion, you can just say: this number represents an exact state of the repository; you give it to somebody else, and if they've got enough state, they can reproduce exactly what you had. And really, these days, what Darcs is missing is GitHub, and of course it doesn't have the whole set of tooling, or in fact a large user community, around it. Well, it never did have a large user community, but it's shrunk a bit.

[Question: what about merge complexity?] Okay, if there are no conflicts involved: a merge in Darcs is asymptotically more complex than a merge in Git if you've got a very large history leading up to the merge. If you've branched, and you've made 100 changes on each side, and you've got, say, a one-file repository, then in Darcs the merge scales with the product of the number of changes on each side, whereas in Git it scales with the size of the tree, roughly. So Darcs might in principle be faster with just a single change; well, probably it wouldn't be. [Do you mean with respect to conflicts, or just merging in general?] Right, so yes, it is quadratic: it's the product. If you've got two branches, with 100 changes unique to each side, then 100 times 100 is the amount of work Darcs has to do to merge those changes.
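In rough asymptotic terms (my notation, not the speaker's): a clean Darcs merge of branches with $p$ and $q$ unique patches costs on the order of $O(p \cdot q)$ commutation steps, while Git's three-way merge cost is driven by tree size, roughly $O(|T|)$. Hence the 100-by-100 example costs about 10,000 steps, regardless of how big the tree is.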
I don't think that's such a big problem in practice, though. Yes, if you get conflicts, then that is a real problem, but if it's a clean merge... I mean, it is a significant point that Darcs does these merges one step at a time, but personally I don't think you run into it that often with at least medium-sized repositories.

Okay, so I'll dive into a demo now. There are some people who haven't seen Darcs, so I guess I'll go into it properly. I'm going to start by making a directory to hold my repository. Can everyone read this okay? I think I checked from the back earlier, so it should be all right. So I'm going to keep my repository in this directory, and the first thing I do is darcs init, just to get the basic metadata in place and make this an empty repository. Then I'm going to copy in a simple source file that I prepared earlier for the purpose. Just to show you what's in that file for now: it's a very basic Haskell program that I shall make a few changes to.

Okay, and the next thing I need to do is tell Darcs that this file exists in my repository and is something it should track: darcs add. And in Darcs, the command that's the equivalent of what might be called commit in many other version control systems is called record, as in: record a patch. So I'm going to run that command... and it scrolled confusingly, because I've zoomed in too much; now I can't find my mouse pointer, sorry. Is it going to be okay if I make the font a bit smaller, so I can actually see the full width? Is that still readable for everybody? Okay.

All right, so Darcs is just going to ask me about what I've done; this is what I was saying about it being interactive before. And what I've done is: I've added the file, and I've put some contents in the file. So I'm just going to record this initial patch; this is just the initial version. Then Darcs asks me some stuff: it confirms that I want to record exactly what I said I wanted to record, asks me what the name of the patch is going to be, and, if I wanted, it would bring up an editor and let me write a longer comment than a single line. Yes, question? [The first change was adding the file?] Yes, and the second was adding the content, that's right. [So you only need to do the add once per file?] That's right, yes; it's exactly the same as tracking a file in any other system. I could also have done this by saying darcs record -l, which says look for adds; it would have gone off and looked at every untracked file in the repository and let me add those. It's just a trade-off between whether you want to have untracked files to worry about or not.

Okay, so I'm not going to add a long comment. Now I'm going to edit that file. I'm going to make a couple of changes: firstly, I'm going to add one more thing that this program is going to do, and secondly, I'm going to change one of the messages that it's already printing out. Then I'm going to come back and record those changes. Darcs accepts unique prefixes of any command,
so I can use rec as a synonym for record, which I personally find quite useful.

Blast. Okay, sorry, my demo's just gone a bit wrong... actually, no, this is fine, I can still do this. What I was intending at this point, when I ran record, was that Darcs would offer me those two changes as separate changes to the repository, but it's decided that it all lives in one part of the diff, so it's showing me the two changes I made at once. And since I actually want to show you how it treats them as separate changes, that's a bit inconvenient. [There is still going to be a second change?] Yes, but it's not going to make semantic sense when I start making other changes to the patch. So what I'm going to do, actually, is bring up something I wasn't intending to show you, which is the interactive editor for the changes in a patch that you're recording. What I wanted Darcs to do was treat the addition of the exclamation mark as one change, and the "say goodbye" as a different change. So what I'm going to do here is edit this intermediate state that Darcs has presented me with, to say: okay, the first change I want you to show me is the addition of the exclamation mark, and the second change you can show me is the addition of "goodbye". Oh, actually, sorry; I've used K to go back, because what I really wanted was to record the addition of the text first and then the addition of the message, for reasons I'll explain in a minute. So I'll say no to this one, and... blast.

All right, fine. I'm going to start again so this actually works out properly. So now I'm jumping around in what I was intending to do. Usually this kind of problem, where Darcs wants to show you changes that are stuck together in ways you didn't quite expect, is less likely when you're working with bigger repositories, because the changes you make are more likely to be widely spread, either inside a single file or between different files. What's happened here is that I've tried to make a very small demo, and as a result the diffs it's asking me about are not quite the ones I wanted. So what I'm going to do is go back and put some lines in so that it definitely gives me the diff I want; just to cheat a bit, really, to make this demo work at least reasonably quickly. I'm using a command called darcs revert now (that's what rev is short for), which offers me the changes that I've made locally in my working copy and asks: do you want to throw these changes away? Again, it's offering them to me one at a time, but in this case I want to get rid of all of them.

Okay, so now I'm going to make those same changes again (sorry, this is supposed to be Haskell... sorry), and hopefully this time it will offer me those changes independently, if I can type, which apparently I can't. Excuse me. Oh yes, thanks; I wasn't actually going to run the program, so it
Oh yes, thanks; I wasn't actually going to run the program, so it doesn't really matter, but I like people to think that I write valid Haskell.

Okay, so that's one of the changes I could record, the addition of the exclamation mark, and the addition of the goodbye message is the other change I can record. Clearly I could record any subset of these changes that I chose to.

Q: Why is it different this time? Now the line between the changes has some content; in the previous case it was an empty line.

The only reason it's different is a heuristic in diff. This is jumping ahead a little, but when you do a three-way merge in any version control system, the system runs diff between the base of your merge and the two options, and then it tries to merge those changes; and at the point where it runs the diff, there are usually multiple valid choices of diff. What happened here is that there were two blank lines in play: in the old file there was one blank line between "say hello" and main, and in the new file there is a blank line between "say hello" and "say goodbye" and another blank line between "say goodbye" and main. How diff chose to line up the blank line in the old version with one in the new version is what affected what was presented to me. So what I did was change one of those blank lines to be something a bit more than just a blank line, forcing diff not to line those two up. The reason this works is that when I added the first change I didn't copy that content; I put a blank line in, so the blank line is now distinguishable from the content that was there before. It's really just diff getting a little bit confused, a heuristic that diff always applies. All Darcs is doing at this point is making a diff and presenting me with the individual pieces of that diff.

Q: Doesn't that make the number of change sets too large? If you're editing every other line, that's nine separate changes.

Darcs will offer you all nine changes individually if you've made changes that are separated in the file. But this interactive prompt has a bunch of keys you can use so that you don't have to say yes or no to every single change. The standard one is "f": if you know you want everything you've done in that file, you press "f" to record all further changes to this file. There's also "a" to record everything it's going to offer; so if, for example, you've made one change that you didn't want to record, you can say no at the beginning and then "a" for everything else.
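For readers following along, the alignment ambiguity described in that answer is easy to reproduce with any LCS-based diff. The sketch below is illustrative only and assumes the Diff package from Hackage; the file contents are invented stand-ins for the demo's demo.hs.

```haskell
-- Reproducing the blank-line alignment ambiguity: the old file has one
-- blank line, the new file has two, and the diff algorithm must choose
-- which blank line in the new file "is" the old one. Either choice is
-- a valid minimal diff; a heuristic picks one, and that choice is what
-- determines how `darcs record` splits the hunks it offers you.
import Data.Algorithm.Diff (getDiff)

oldFile, newFile :: [String]
oldFile = ["sayHello", "", "main"]
newFile = ["sayHello", "", "sayGoodbye", "", "main"]

main :: IO ()
main = mapM_ print (getDiff oldFile newFile)
```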
Okay, so right now I'm going to record the addition of "say goodbye" and the call to "say goodbye" together, so that those two live together as an atomic change set, an atomic patch.

Q: Once you've recorded those two together, are they stuck together forever?

Yes, pretty much.

Q: What happens to the original file when you do a record?

The file in your working directory doesn't get touched at all. All recording does is change the current version-controlled state. There's always an implicit diff between what you've done in your working directory and the state of the very last revision in the repository; that diff has now got smaller, but the actual file hasn't changed. If, on the other hand, I had used revert, which is about changing the state of your working copy, then I could have reverted some of these changes and not others, and the file would have been put back into a state a little bit closer to the recorded copy.

Okay, so I've recorded a change called "say goodbye" which has that change in it, and now I'm going to record another patch that incorporates the last change I made and nothing else. So now, apart from the initial setup of the repository, I've recorded two different changes: firstly I've added "say goodbye", and secondly I've changed the text of hello.

Now, I've just remembered that I forgot to clone this repository at the time I intended to, so I'm going to do it now anyway. I'm going to clone the repository I made into another one; "darcs get" is the standard command for doing that, and you can use it with remote repositories as well. Because I forgot to do this before I recorded those changes, I'm going to delete a couple of changes from the new repository using the obliterate command.

Now I'll get back to what I actually intended to do at this point, which is to show the pull command, which is about moving patches between repositories. This is where I'm going to show you how cherry-picking works in Darcs, in a simple, intuitive way. If I pull from repository one, Darcs offers me the changes that are in that repository but not in my present one, in the order they exist in that repository.

Q: You only got the comment there. Can you see the code?

Yes: "x" will give me a summary of a patch, and "v" will show me the code, though with a little bit of metadata that would probably be more user-friendly to have elided.

Okay, at this point I don't want to say goodbye yet; I just want the hello change, but that was the second change I made. So I'm going to say no to the first patch, and it's still going to offer me the hello change, and I'll pull that. There we are.
If I have a look at the contents of demo.hs now, it's got the change to hello, but it hasn't got the change to goodbye, even though that combination was never one of the recorded states in the previous repository.

Q: If I disconnect now, is the change I skipped available locally?

No, it's not available locally. If you wanted that, what you would need to do is clone the entire repository into a different directory and then start pulling from that. In fact, that's the way I typically use Darcs: I work on the train a lot, so I pull from the upstream repository to somewhere local and then work from that.

Okay, so I've got this change, which was one of the two changes I made, but not in the order that I made it. Now I'm going to go back to the other repository and use a command called darcs replace. Darcs replace is a search-and-replace operation on your repository. Looking at demo.hs, I've decided I want to rename the printMessage function to something different, so I'm going to run "darcs replace printMessage outputMessage demo.hs".

Firstly, I'd like to emphasize that this is a pretty dumb search and replace. It doesn't take any account of language semantics or anything like that, so if I had a variable called printMessage somewhere else in scope, it would happily rename that as well.

Okay, so now I've done this replace. It has changed my file for me, but it hasn't actually recorded a patch saying "this is a change I made", so I'll just do that. Note that the change is actually expressed as this replacement operation, and yes, I want to record it. Looking at demo.hs again, there were a couple of occurrences of printMessage that are now outputMessage, changed by this replace operation.

Now, if I go back to repository two, it doesn't actually have all those call sites for printMessage, because I didn't pull the change where I added the call to "say goodbye". But if I pull from repository one, I can still ignore the "say goodbye" change and pull just the printMessage change. And what has it done to my file? Well, it hasn't pulled "say goodbye", so there's no change there, but it has renamed printMessage to outputMessage correctly.

Finally, if I pull again from repository one and bring back the "say goodbye" change (I could also have shown you that patch), it has pulled back the "say goodbye" change, which was originally recorded as an addition of a call to printMessage; but now, in the context it has just been pulled into, it's an addition of outputMessage, because that's what makes your program make sense. Now, that's only true because the replace was a safe thing to do on my repository, but nonetheless it has done the right thing.

The other point about this is that this repository is now pretty much identical to the first one. There are no other changes I can pull from the other one, and if I ask Darcs whether there are any differences between these two repositories, there won't really be any.

Darcs changes is just a list of the patches you have in your repository, in reverse order. As you can see, in this repository they're in this order, and in the other repository you've got them in the order that I originally recorded them.
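To make the semantics of that demo concrete, here is a minimal sketch of a token-replace patch. The types and names are invented for illustration; Darcs's real implementation differs, but the two essential properties shown here, whole-token matching and trivial invertibility, are the ones the talk relies on.

```haskell
-- A dumb, whole-token substitution, invertible by swapping the tokens.
import Data.Char (isAlphaNum)

data TokenReplace = TokenReplace { fromTok :: String, toTok :: String }

-- Split a line into word and non-word runs so that only complete
-- tokens are replaced (so a token like "printMessage2" is left alone).
tokens :: String -> [String]
tokens [] = []
tokens s@(c:_) = run : tokens rest
  where
    (run, rest) = span (\x -> isTok x == isTok c) s
    isTok x = isAlphaNum x || x == '_'

applyReplace :: TokenReplace -> String -> String
applyReplace (TokenReplace o n) = concatMap swap . tokens
  where swap t | t == o    = n
               | otherwise = t

invertReplace :: TokenReplace -> TokenReplace
invertReplace (TokenReplace o n) = TokenReplace n o

main :: IO ()
main = do
  let r = TokenReplace "printMessage" "outputMessage"
      ln = "main = printMessage msg"
  putStrLn (applyReplace r ln)                                -- renames the token
  putStrLn (applyReplace (invertReplace r) (applyReplace r ln)) -- round-trips
```

Whole-token matching is why an unrelated identifier would survive the rename untouched, and invertibility is what lets a replace patch participate in the commuting machinery discussed later in the talk.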
Q: Suppose I have a change which has printMessage inside, and I want printMessage to stay, and then I put that change on top of the already-replaced repository...

So, if I now start editing after this rename operation and put a new printMessage in, that will be safe, because it comes after the replace. But if you have a patch from a repository that hasn't seen the replace, a patch that was originally recorded with printMessage in it, and you put it on top of your already-replaced repository, then yes, it's going to say outputMessage. I think that's what was expected to happen, because from the other repository's point of view the replace was operating on both printMessages.

Q: So the patch, which explicitly had printMessage inside, now has outputMessage as a result?

Yes, and that's the point of what Darcs has done here: it has changed the contents of that patch because of the order in which it has put them. The patch itself has been changed by the fact that we've reordered it; it now says "add outputMessage" and doesn't have exactly the same text in it any more. That's exactly the point of having first-class patches: Darcs will change the representation of the patch in order to preserve the meaning of the patch, and the meaning of the repository that we had.

That's the point of using replace: you're kind of asking Darcs to do this. Looking at it from another point of view: say I've been developing a program with printMessage in it, and then I decide I'm going to rename printMessage to outputMessage, and in parallel to that, somebody else decides they're going to add a call to printMessage.

Q: I just don't like it...

Fair enough. But if pulling the "say goodbye" patch had resulted in a printMessage call at the end of repository two, that would have been totally incorrect: it wouldn't have been a functioning program. If you want this property about cherry-picking and so forth, where you can take patches out of order and then pull the other ones back later, it has to do that. I guess that's another way of justifying what I did, if you don't like the first one.

Q: Suppose I then add a call to outputMessage myself, and later remove the replace...

Right: if I now added a call to outputMessage in my repository, and then I unpulled the replace, removed the replace patch, it would change that call to printMessage, and that would be correct. But the name-capture point, which I thought maybe you were asking about, is that if I had, say, a local variable called printMessage or outputMessage, depending on the state in which I added it, then the replace patch is going to do the wrong thing with it.
For example, if somebody else on another branch has decided to write a function that internally uses a printMessage variable of its own, Darcs won't let you pull the two together silently: part of the Darcs user interface is that it detects things that don't make sense like that.

Q: And is replace global?

It's file by file. And there is no patch type for moving code between places in files; it's something we would like to add, but we haven't done it yet.

Okay, so I've already shown you revert; actually, let me show you a bit more of it. I'll edit demo.hs a little bit more and add some garbage in two places. So, I've just added a couple of lines, and now I want to get rid of those lines, and as I showed you before, Darcs offers those independently. So I can revert just the "wibble" bit, and then I'll be left with "wobble" in the file, and then I can revert that as well, and it's gone. Darcs also has an inverse of revert: Darcs tries to make most operations undoable (there are cases where that's not really feasible), and for example it has this "unrevert", which restores the last revert but not the one before that, so I can't get "wibble" back any more.

Okay, I'm going to change the file this time and keep the changes. I'll change this text, and record another change with that in it. Oops, I forgot to get rid of that other edit; never mind, I'll do that in a minute.

Uh, sorry, my laptop has frozen. I think it's frozen solid; this is what you get for running Windows. I'm going to have to reboot it... I think it may actually have hardware problems of some kind. All I can say is that my laptop does lock up a lot, and I do run Darcs on it a lot, but I think it's a hardware problem; I'm not entirely certain. I'm sorry about this. What I was about to do was record that patch, and then I was going to show you a command called amend-record, which is for editing a patch you've already recorded to add some extra changes to it. I'll do that next.

Q (while we wait): Is there a long-term plan to extend this to syntax trees?

Yes, but that's particularly difficult. The problem with extending it to know about syntax trees is that you have to fix your parser: you have to decide that this one parser that exists right now is the one we're actually going to use to parse the code. So if the language changes a bit, you end up having to version things. That's probably a reasonable way down the line. And you still have to keep the old parser around so that you can read old patches; you can record new patches with a new parser, but then you have to figure out how you merge between patches that were recorded with different parsers.

Q: You need the parser at recording time, when you're creating your patches...

Yes, you need the parser at the recording stage, but you also need the meaning of that tree, the pretty-printer, everywhere: you need to be able to invert what you recorded, to actually show the user what the real source would be.

Q: And if a patch just treats the code literally, as text?
Then you won't get that nice behavior I showed you: Darcs will simply refuse to let you cherry-pick around such a patch.

Okay, let me record this change, as I was trying to do; not that one... exactly. And I'm going to revert that other change that I forgot to revert before. Now I'm going to make another change to this file. Okay, I won't bother naming it carefully; that'll do. So now, this time, instead of recording a separate change for the edit I've just made, I want to actually incorporate it into this other patch, because I think the two belong together for some reason (in this case they don't really, but let's pretend). The command I can use for that is darcs amend-record. That goes back and says: okay, here's the patch you just recorded; would you like to change it? I ran it with a flag saying "edit", allowing me to edit the message as well. So: shall I amend this patch? Yes. Here's the change I've just made to my repository; shall I add this change to the patch that's already recorded? Yes. And it gives me a chance to edit the message, so I'll note the addition.

That's useful if you're working with draft patches as you go, and you keep wanting to add changes to them. Eventually you're done with a patch, and then you say, okay, I'm going to send this off to other people, and at that point you should stop amending it.

Q: What if you make a change, pull it into the other repository, then amend the change and pull it again?

It will look like a different change, and it will be a conflict. You then have to go around unpulling the other one.

Okay, let me just go back to my slides for a second. Does that work? Yes. So, I've shown you a bunch of the basic command set of Darcs and roughly what the model is. A new command that we're adding to Darcs in the next release, which will come out in probably the next few months, is a rebase command. Rebase is a concept that exists in systems like git, and it hasn't existed in Darcs before. In part that's because many of the things you'd want to use rebase for in git aren't really necessary in Darcs: you don't have all these merge commits with Darcs, because whenever you do a merge you always get the same result, so Darcs doesn't have to present you with a merge commit; and you don't need rebase to do cherry-picking, which is another thing it's used for in git. But nonetheless there are cases where you do want to do a rebase: you want to change a whole sequence of patches in your history, so that they're more up to date or so forth. So we're adding a rebase command.

The way the rebase command works is that it puts your repository into a temporary state where you're doing some work on some patches while some other patches have been suspended, and then you can bring those patches back into your repository with an unsuspend command. I'll just show you that now. I'm going to throw away that change I just recorded and do the same thing again, but with a slight twist to it.
Never mind the typo. I'm going to make a second change on top of that: I also want this output to go to standard error. So here's my second change, "use standard error". Now I want to do the amend that I did before: I want to put that extra call back in, and I'd like to change the patch I recorded before, the one where I changed the message. I don't want to edit the standard-error patch, so I say no to that. But then Darcs isn't going to let me do an amend-record on the "update message" patch, which is the one I actually want to change, because I've made a patch on top of it that depends on it. There's no way those two patches can be logically untangled from each other: Darcs has made a line-by-line diff, and the diff for the standard-error change lives directly on top of the diff where I changed "says" to "wants to say". So I'm out of luck if I want to amend it, and that's where rebase comes in.

Rebase will say: okay, well, this standard-error patch is in the way; let's push it out of the way for a bit. So that has disappeared from our repository and is now suspended. If I look at the changes in the repository, you can't see it any more, but Darcs will keep reminding you of the fact that it's still around and that you can get it back. Okay, so now I can amend this patch. Okay, great, so that's done; now I want my suspended patch back.

Now, how is Darcs going to deal with that? Because we haven't magically got around the fact that the second change I made, inserting the standard error, really does depend on that first change. Okay, so it's going to let me unsuspend it; it's going to offer it to me; there we are. Hmm, now, I think there's a bug in the development version I just used, because it should have told me that it actually had a conflict... oh, right, sorry, no: I got this demo a bit wrong; that was actually supposed to happen, and it was just not quite what I intended to do with the demo. What happened here is that when I changed that patch, I didn't change anything about the line that was actually the problem. If I had tried to do anything to the output-message text, then Darcs really would not have been able to figure out how to unsuspend that patch back into my repository. But because all I did was add the extra call to that change, that was fine, and it was able to do it. However, because of some technicalities in the way the internals of Darcs work, it has still brought this patch back with a different identity, so it will actually clash with other patches in other repositories, even though it looks like exactly the same patch.

Now let me show you what would have happened if I'd done what I intended to do. I'll suspend that patch again, and this time I'm going to edit demo.hs to also say something different on that line (I can't make up my mind which editor I want to use). Okay, I'd like it to say that instead; and now I can change this "update message" patch so that it's a patch that does this. And now, if I unsuspend the suspended patch, it tells me that things have gone a bit wrong: it can't quite unsuspend this patch and have it be the same patch locally, and it has
also inserted some conflict markers into my repository, Darcs's own unique style of doing conflicts, to show me the local changes that have gone wrong.

Q: So it marks up the repository?

Yes; I mean, it marks up your working copy like most systems would. It lets you do the operation, but says there's a conflict, and this is what the conflict is.

Q: The original change that was the cause of the conflict is still in your repository?

Yes, the original one that was the cause of the conflict is still there, and it's fine; and this one is in a slightly conflicted state that I'm now going to fix. Now, I haven't shown you conflicts in the normal case of merging; in that case the resolution really is another patch on top. When you're using rebase, the normal thing to do is to fix your conflict and then amend the fix into the patch you just unsuspended. I won't bother doing it here, because I've already spent quite a while on this demo that I didn't intend to. But what this conflict is telling us is: the first part, before the equals signs, is the base of the conflict, the original state; and then there are two different changes that are conflicting with each other, one of which inserts the standard-error call, and the other of which changes "wants to say" to "would like to say". Our job is to see that, make the change ourselves, and then edit the patch we just unsuspended to bring all of that into line.

Okay, let me go back to my talk; I've done that bit of the demo. This is just a quick slide to show you the main set of commands... oh, sorry, wrong way around; thank you, that would be good.

(In response to a suggestion:) Yes, in fact, another thing we could do that would mean you wouldn't get a conflict there is patches that apply to the characters of a line, rather than line-by-line patches, and that's probably more within reach as well.

Okay, so, just to give you a very brief overview of the sets of commands you can use with Darcs and how they all fit together. One interesting thing is the specific kinds of patch types up here: I already showed you replace, and add, and move. Moving a file around is also a kind of operation that merges nicely with other patches, as does adding files and so forth. Then there are the standard remote operations: get and its mirror put, pull and push. Obliterate removes a patch; it used to be called unpull, and it still has that alias.

There is a staging area, called pending, which we try our absolute best to hide from you, unlike git. Pending is used for things like replace: something that can't be represented directly in the files of the repository. If you've made a semantic statement about a change to your repository, one that can't just be inferred from the files, then we put that into pending; otherwise, you just record what's in your working directory.

Q: Doesn't an index make sense? The way you assemble changes into a story is exactly what you're doing when you record changes.

Yeah; I think the mixture of pending being hidden, plus the fact that amend-record is designed to be fairly user-friendly, are two things that make it unnecessary to
have an index, an equivalent of git's staging area.

Okay, and there are also some commands for querying a repository; but for basic operation with a Darcs repository you don't need to know all that many commands, I hope I'm getting that across.

Okay, so let's go into a bit more technical detail about what's inside a Darcs repository. Darcs patches are semantic descriptions of what should happen to the previous tree to give you the new tree. An example of a Darcs patch is: remove the text X that starts at line three in a particular file, and put text Y there instead. Other kinds of Darcs patch would be: add file A, or rename file C, or replace token X with token Y in a particular file.

An important property of Darcs patches, from the point of view of the internals of Darcs, is that the patches are invertible: anything that you can record, Darcs has to know how to undo as well, so that it can get back to the old tree. So, for example, it actually is important that the patch says "remove X at line three"; you can't just say "remove one line", because the internals of Darcs record the fact that the old version was X.

Q: So it won't be able to use a wildcard?

That's correct, yes. Though that's not totally out of reach: if you had a replace based on a regular expression that was invertible in some way, it would be technically possible to implement. We just haven't done it, and it would probably be quite confusing.

I'm not going to go into the details of why invertibility is important, but getting this reversible cherry-picking operation requires that you're able to manipulate patches in that kind of way, and inverting them is one of the necessary properties.

Another point about how Darcs patches work in general: in most version control systems, you're committed to the series of changes you made, in the particular order you made them, whereas Darcs lets you pull out arbitrary changes from the middle if there's no textual conflict. That means, of course, that you can pull out changes that really don't make sense to pull out. For example, if in one patch you add a function, and in another patch you add a call site for that function, and then you try to cherry-pick the call site on its own, Darcs will probably let you do it, provided the two weren't right next to each other in the file, but you're not going to have a meaningful repository. That's power it gives you, and you have to decide how to use it.

I've already alluded to some of this: merges in Darcs are deterministic. Every time you merge two particular Darcs patches, anywhere, you get exactly the same result, and that means Darcs doesn't create merge commits when you do a merge. If you get a conflict, then you have to record a new patch to resolve the conflict, and then you've got a rough equivalent of a merge commit; but in general there is no explicit marker in Darcs that says "a merge happened here".
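A minimal sketch of the invertibility requirement just described, with invented constructors rather than Darcs's real data types: every patch carries enough information to be undone, which is why a hunk must store the old lines as well as the new ones.

```haskell
-- Hypothetical patch type illustrating invertibility. "Remove one line
-- at 3" would not be invertible; "replace these old lines with these
-- new lines at 3" is, because the old text travels with the patch.
data Patch
  = AddFile FilePath
  | RmFile FilePath
  | Hunk FilePath Int [String] [String]  -- file, start line, old lines, new lines
  | Replace FilePath String String       -- file, old token, new token
  deriving Show

invert :: Patch -> Patch
invert (AddFile f)      = RmFile f
invert (RmFile f)       = AddFile f
invert (Hunk f l os ns) = Hunk f l ns os
invert (Replace f o n)  = Replace f n o
```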
And merging is associative: say you've got three repositories, and you decide to merge two of them together and then merge the result with the third one. You will always get the same result as if you had merged the second two as your first operation and then merged that with the first repository. That's part of the intuitive user model we're trying to get with Darcs: you don't really have to think about the order in which things happen. The Darcs user interface tries to make that implicit, so that you can expect it to happen without having to think about it, and it's important that it actually is true, so that you don't get nasty surprises that way.

Okay, so, going a bit more into the internals of how this all works. On the left I've got the kind of diagram that represents what happens when you have a merge operation. We've got two different repositories, with patch A and patch B in them respectively; that's the difference between them. The base of the merge is the bottom-left corner, and in the two different repositories we end up with two different states of the repository. The point of a merge operation, in any version control system, is to figure out what the question mark, the merged state, should be, and then do the appropriate thing with it.

In Darcs, supposing we were pulling B into a repository that already had A, we would compute an alternative version of B, call it B': the same effect as B, but in a different context, because it now lives after A. Depending on what A and B were, we might actually have to change B into something slightly different for it to still make sense.

Internally, Darcs has a concept called commuting, which is changing the order of two patches, and what I'm trying to get across with this diagram is that commuting really is just the inverse of the merge operation. With a merge, you have two patches that start at the same place, and you want to figure out how to combine them. The alternative way of looking at it is that you have two patches that are already in the same repository, in sequence, and you want to unpick that: you want to get back to two patches that are totally separate from each other. If you already know A and B', then what's B?

And that's the fundamental part of cherry-picking. Once you can swap the order of patches, then if we've got a repository with A and B', and we don't want A, we just want B', what we have to do is commute A and B' to turn them into B and A', and then we can throw away A', and we've got a repository that just has the change B in it. And when we do the merge again, because of all the stuff that Darcs does behind the scenes, it's going to give you exactly the same result, B', again. So that's a flavor of what's behind what I showed you.

Q: Could there be many ways of factoring? Given A', there could be multiple...

In the merge diagram, and in any of these diagrams, the answer has to be unique; that's one of the things Darcs ensures. If you were adding a new patch type to Darcs, then when you define what its commute is, you'd have to do it in such a way that it goes back to B' when you merge it again, otherwise you'd break Darcs.
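The merge/commute relationship the speaker describes can be written down schematically. This is a sketch under stated assumptions, not Darcs's code: patches are left abstract, `commute` is assumed as a primitive that may fail, and the two derived operations follow the construction in the talk (merge computes B' by commuting the inverse of A with B; cherry-picking commutes the pair back apart and discards the adapted A).

```haskell
-- `commute` reorders an adjacent pair of patches, adapting both, and
-- fails (Nothing) if they can't be separated.
type Commute p = (p, p) -> Maybe (p, p)

class Invertible p where
  invert :: p -> p

-- Merge: A and B both start from the same base, so the sequence
-- (invert A ; B) is well-formed. Commuting it yields (B' ; adapted
-- invert A); B' is B re-expressed to apply after A.
merge :: Invertible p => Commute p -> p -> p -> Maybe p
merge commute a b = do
  (b', _) <- commute (invert a, b)
  pure b'

-- Cherry-picking B' back out of the sequence (A ; B'): commute to
-- (B ; A'), then drop A'. Re-merging B with A must reproduce B'
-- exactly; that is the uniqueness requirement on any new patch type.
cherryPick :: Commute p -> p -> p -> Maybe p
cherryPick commute a b' = do
  (b, _) <- commute (a, b')
  pure b
```

Note how `cherryPick` is literally `merge` run backwards through the same primitive, which is the point of the diagram: one `commute` definition per pair of patch types yields both operations.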
Q: Will a repository store B, or B'?

It will store them in the particular order that's useful for that repository; it's per repository, exactly. So, conceptually, you can think of a Darcs repository as being a set of patches, in most cases: no matter what order you brought the patches in, the end state of your repository will be the same. But they are actually stored in a particular order that reflects how you got them, and when you run things like darcs changes, as we saw, that does show up.

There is user demand around that kind of thing: you quite often want to see which changes are in your repository but not in a remote repository, and it would be useful to reorder the patches in your repository so that just your local patches are at the end. We don't have a great command set for doing that; we do have one command that does some of it for you. Using rebase for it would be overkill, because rebase would change the identity of your patches, so you really wouldn't want to use it for that; and there's a command called darcs optimize reorder, which will reorder up to the last tag in the repository, but it's not general.

So there's this distinction in Darcs, in terms of both the mental model and the internals: patches have an identity, which is the thing they retain as you pull them around between repositories; but, as we saw, the representation of a patch can change as you pull it around, due to either merging or cherry-picking, depending on which way around you're pulling it.

So there is this kind of theory hiding inside Darcs, and hiding very well, because none of the people who really work on Darcs have managed to make a really formal theory of what's happening with these patches. That's a real shame, because it ought to be possible; just nobody's managed it. But I've been alluding to various things that you have to get right, otherwise you'll screw up the behavior of Darcs, and we intuitively know that these things have to behave this way. So there are some laws that we believe are true of all the Darcs patch types we have, and that would have to be true of any new ones we added. An important one, for example: if you commute two patches, and then you swap them back again, you must get back to exactly the same result. And also, if you invert patches and start commuting the inverses, you get the same result as if you had left them uninverted to begin with. I'll just put that up; I don't want to explain it in detail.
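Those two laws are easy to state as QuickCheck-style properties. The patch type `P` below is a placeholder with stub definitions, loudly an assumption: a real instantiation would supply actual patches, `commute`, `invert`, and an Arbitrary instance.

```haskell
import Test.QuickCheck (Property, property, (===))

-- Placeholder patch type and operations; the stubs stand in for a
-- real patch implementation with Eq/Show/Arbitrary instances.
data P = P deriving (Eq, Show)

commute :: (P, P) -> Maybe (P, P)
commute = undefined

invert :: P -> P
invert = undefined

-- Law 1: commuting is an involution. If (A, B) commutes to (B', A'),
-- then commuting (B', A') must give back exactly (A, B).
prop_commuteInvolution :: (P, P) -> Property
prop_commuteInvolution ab = case commute ab of
  Nothing  -> property True   -- pairs that don't commute are exempt
  Just ba' -> commute ba' === Just ab

-- Law 2: commuting inverses mirrors commuting the originals:
-- (A, B) ~> (B', A')  implies  (inv B, inv A) ~> (inv A', inv B').
prop_invertCommute :: (P, P) -> Property
prop_invertCommute (a, b) = case commute (a, b) of
  Nothing       -> property True
  Just (b', a') -> commute (invert b, invert a)
                     === Just (invert a', invert b')
```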
One more law (I've used quite a bit of time, so I'll go over this very quickly): I said that merges have to be associative, and the thing that underlies that, in terms of what your patches have to do, is this. If you start commuting patches around, say you have your patches in the order A, B, C to begin with, and you start swapping pairs of those patches around, and you keep doing that, then you're following a path around the cube in this picture. There's no absolute guarantee from any external source that you'll actually get back to exactly the same result when you've got to the end of that path, because the commutes you've done in each case are independent of each other. But that's another law that we expect to be true: you will get back to the same result. And if you think of that cube as a kind of merge, it's about guaranteeing that the top right-hand corner of the cube will always be the same, no matter what merges you did.

Okay, so I'll finish up fairly briefly by talking a little bit about the internals of Darcs, and one particular way we've used Haskell to good effect in those internals.

One thing that I hope is apparent from what I've already said about Darcs is that when the code of Darcs is working with patches, it's going to spend a lot of time moving patches around between different contexts: deciding that, okay, this patch used to have these patches before it, and now I need to cherry-pick it, or commute it, or merge it, so that it's got some other patches leading up to it. A lot of the time when you do that, the representation of the patch doesn't change: for example, if you've got two patches to two different files, the order they're in makes no difference to their representation. But if you've got two patches to the same file, then depending on the order they're in, you'll have to change the line offsets of the patch that's further down the file. So it's quite easy in the Darcs source to start using patches in the wrong context, and if you do that, you might not notice quickly, because a lot of the things you're testing with won't actually trigger the problem.

So here's an example of some code you might want to write inside Darcs. Suppose we want to commute two patches, where one side is the sequential composition of two others: we've got patch C, and we've got A and B composed in sequence (that's the intention of the bracketing), and I want to get to C' followed by the adapted composition of A' and B' as the result. Take an imaginary patch type which is either a null patch or two patches stuck together; obviously there would be some more constructors for the real patch kinds. You might write a function "commute" with a case for commuting this composite (A B) as one group, with C as the other. How are you going to do that? Typical Haskell: you do this kind of operation by decomposing it into its sub-pieces: commute one of the two components with the next thing, and then commute the results to get everything into the right order. That's one reasonably plausible way of writing it, and it happens to be totally wrong. If you just look at the order of things, it makes sense to commute B and C first, and then commute the result with A. But what I've written here commutes A and C first, and A and C aren't next to each other originally, so it really doesn't make sense to do that.
But there's nothing in this code that's going to stop me from doing it, because GHC doesn't know anything about the contexts of these patches. So one thing we introduced in Darcs quite some time ago is a way of actually telling GHC, at least to some extent: these are the contexts of a patch; don't let me screw up.

So, instead of defining patches plainly, I've introduced a couple of type variables here, and these type variables are phantom types: they have no reflection in reality, in terms of actual values. It wouldn't even make sense for me to say "Patch Int Char" or something like that; that's not what they're about. They're an abstract representation of the context that the patch lives in. And it means that if I write certain low-level operations carefully, and don't screw up the contexts in those low-level operations, then in the high-level operations that I build on top of them, the type checker will actually stop me from getting it wrong.

The key point of this is really that the sequential composition operator now says the contexts must match up. A "Patch a b" is a patch that notionally goes from an initial context a to a final context b. The sequential composition of two patches goes from an initial context a, via an intermediate context b, to a final context c: so the composition is a patch from a to c, and it's made up of two individual patches, one that starts at a and finishes at b, and one that starts at b and finishes at c.

To actually work with this kind of thing, you find yourself having to redefine a whole load of basic types. What used to be just a tuple now has to be a witness tuple, with these extra witness types in it, that says: okay, this is a pair of patches that I'm going to want to commute. That's needed to be able to write the type of commute: commute now talks about the fact that we've got a pair of patches that goes from a to c in total, and I want another pair of patches that goes from a to c in total but is in the opposite order internally.

So I can translate that incorrect code from the previous slide into code using these phantom types, and GHC will complain. Now, the downside of using these witnesses is that the GHC complaint takes a little bit of getting used to. With some practice, you get used to understanding what it's trying to tell you here, which is that there was some other variable b1 that would have had to exist for this to make sense, and it doesn't really work. In particular, you can see from the "expected versus actual" mismatch, b1 being different from b, that you've got the intermediate context of this commute wrong. It certainly makes the Darcs code harder to get into for newbies; it also stops newbies screwing up the Darcs code; so it's a two-edged sword.

These contexts can't do everything: you do have to get the low-level operations right for the stuff around them to make sense. And in some cases you actually have to insert assertions that say: I've thought about this; I can't convince the type checker that this is correct, but I've thought about it, and it is correct, so trust me, please.

Okay, so that was just a brief overview of all of that.
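A reconstruction of the idea in miniature; the names and constructors here are illustrative, and Darcs's real witness machinery in its source tree is more elaborate. The phantom variables wX and wY never hold values: they only name the contexts a patch starts and ends in, and the composition type hides the shared intermediate context existentially.

```haskell
{-# LANGUAGE GADTs, TypeOperators #-}

-- Contexts are phantom: wX and wY are never inhabited by values.
data Patch wX wY where
  Identity :: Patch wX wX
  -- real constructors (hunks, file adds, replaces) would go here

-- Sequential composition: the end context of the first patch must be
-- the start context of the second; the shared context wZ is hidden.
data (p :> q) wX wY where
  (:>) :: p wX wZ -> q wZ wY -> (p :> q) wX wY

-- Commute preserves the overall endpoints while swapping the order
-- internally. The buggy grouping from the slide (commuting A with C
-- when they aren't adjacent) no longer type-checks, because the
-- intermediate contexts fail to line up and GHC reports the mismatch.
commute :: (Patch :> Patch) wX wY -> Maybe ((Patch :> Patch) wX wY)
commute (Identity :> q) = Just (q :> Identity)
commute _               = Nothing  -- falls through for non-commuting cases
```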
So what are we going to do with Darcs next? Well, 2.10, which is the upcoming release and will come out in probably the next few months, will have rebase in it, and it will also have a fast annotate. If you've used Darcs before, you'll know that the darcs annotate command is one of the things that really is very slow. What we've done to make that a lot better is introduce what's called a patch index: a lookup from the lines in a file to the patches that touch each line, so Darcs can very quickly find out which patches affected a specific file.

Q: So annotate depends on the order of patches in your repository?

Yes, exactly: annotate will give you a different output depending on what order those patches are in your repository. There are plenty of commands that depend on that: changes, just the list of changes; annotate; the way patches are presented to you interactively; all those things will be different depending on the order of your patches.

We're also developing a bridge to git, but it's not really ready for prime time yet, so we want to get that ready. We don't have a great hosting story: we do have a standard host for Darcs repositories that you can put online and so forth, but it's nowhere near as fully featured as GitHub, so we'd like to improve that. Most of interacting with Darcs is very text-based; in fact, all of it is. The Darcs command line is just a text tool, and there are no standard graphical tools for it. What we'd like to do is make something web-based, the same code base as the hosting code that we have, that brings up a web browser locally and allows you to interact with your repository like that; but that's still in progress. We'd love to have multi-head repositories; that's a bit of implementation effort. Conflict handling is a mess for a couple of different reasons, one of which is that we don't actually have a good algorithm for figuring out deeply nested conflicts and making them behave properly, and we hope to improve that in the future.

And more patch types: that's kind of the point of Darcs, really. What Darcs lets you do at the moment is nice and useful, and I still personally find it a lot more usable than git, but we could make Darcs really nice to use by having more patch types that let you do things like moving a chunk of text from one place to another, or indenting a whole set of lines by a certain amount, and have those things merge nicely with other kinds of patches, where that makes sense.
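Returning briefly to the patch index mentioned at the top of this section: here is a sketch of the idea with invented types. Darcs's actual on-disk index is more involved, but the essential shape is a two-level lookup, from file and then line, to the patches that touched it.

```haskell
import qualified Data.Map.Strict as M

newtype PatchId = PatchId String deriving (Eq, Show)

-- file -> (line number -> patches that touched that line)
type PatchIndex = M.Map FilePath (M.Map Int [PatchId])

-- Annotate one line with a pair of map lookups, instead of replaying
-- every patch in the repository to see which ones touched the file.
annotateLine :: PatchIndex -> FilePath -> Int -> [PatchId]
annotateLine idx file ln =
  maybe [] (M.findWithDefault [] ln) (M.lookup file idx)

-- All patches that touched a file at all, for file-scoped queries.
patchesForFile :: PatchIndex -> FilePath -> [PatchId]
patchesForFile idx file =
  maybe [] (concat . M.elems) (M.lookup file idx)
```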
Q: Doesn't the cost of adding a new patch type grow with the number of patch types you already have?

Yes, you're right: basically, that quadratic explosion will happen, but there are some mitigating factors, so it's not quite that bad. There are some pairs of patch types that would never make sense to merge with each other. For example, if you had a semantic patch type that treats a source file as a parse tree and does things with that, then a patch that inserts lines in the middle of that file is never going to merge with it sensibly; I think we'd just say that's a conflict. The other thing is that patches that affect the internal contents of a file are independent of patches that move things around in the file system, and so on. But that is part of why it's difficult to add new patch types.

Okay, so that was all I had to say, really. If you want to know more about Darcs, it's got a homepage and a hosting site, and if any of you are students and interested in doing some work on Darcs, we are participating in Google Summer of Code via haskell.org, so please apply. [Applause]

Q: On moving to syntax trees: other than the complexity of the compiler, conceptually how hard is that?

I don't know what the merge operations on the basic patch types for trees would be. Let's say we've parsed a source file and we've got a tree, and we need to define changes to that tree: it's not completely obvious to me what those changes should be, what the actual patch types on trees would be. And given that it's not obvious, you're also going to need a good degree of certainty that it's actually correct in all cases and so forth, so you'd probably want to bring in a theorem prover or something like that to check it out. That's quite hard; some work.

Q: Your demo was on a single file...

So, there are patch types that affect an entire directory at a time: you can rename a directory, and all the files within it will move, and that will merge and commute nicely with, say, adding a new file to that directory on an independent branch.

Q: And two files? If you edited two files, is that two separate patches?

Well, you can stick them together into the same named patch; you wouldn't have to record them as separate patches. If you make a change across your source tree, the interactive record will ask you about all those changes to all the files at once, and you can select some of them to be stuck together into one patch. And yes, that will be an atomic change set, which brings us to the next question.

Q: How small should a patch be?

I think it varies quite a bit: anything from the quick typo fix, which is worth recording as a separate patch, to maybe a few hours' work. The Darcs repository itself has about 10,000 changes in it. Here's a very simple one that I made that was just fixing a test; this one looks quite a bit bigger, though I'm not sure exactly what it was about. You want to group together work that belongs together logically, because otherwise there's not much point in having atomic change sets. And you want to stop people cherry-picking too much as well: you want to encourage them to cherry-pick things that make sense, and not things that don't.

Q: How complex is the code that does the transformation when patches commute?

Not very; it's basically a set of rules. The most interesting cases come in when you're, say, merging or commuting two hunk patches, individual line changes, to the same file: then you have to make sure it's safe to do, and you have to adjust the offsets appropriately. Similarly, if you're doing that with file renames and so forth, you need to make sure that you change the patches that touch that file. But the code for doing that is itself only a few pages, that sort of scale.
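A minimal version of the hunk/hunk rule described in that answer, with invented types (Darcs's real code handles more cases): commuting two hunks to the same file means shifting one hunk's line offset by the other's net line change, and refusing when the hunks overlap.

```haskell
data Hunk = Hunk { hLine :: Int, hOld :: [String], hNew :: [String] }
  deriving (Eq, Show)

-- `a` is applied first, then `b` (whose line number is therefore in
-- post-`a` coordinates). Returns (b', a') with b' applying first, or
-- Nothing if the hunks can't be separated.
commuteHunks :: Hunk -> Hunk -> Maybe (Hunk, Hunk)
commuteHunks a b
  | bEnd < hLine a =                           -- b lies entirely above a:
      Just (b, a { hLine = hLine a + netB })   -- a shifts by b's net change
  | aEnd < hLine b =                           -- b lies entirely below a:
      Just (b { hLine = hLine b - netA }, a)   -- undo a's shift on b
  | otherwise = Nothing                        -- overlapping or adjacent
  where
    netA = length (hNew a) - length (hOld a)
    netB = length (hNew b) - length (hOld b)
    aEnd = hLine a + length (hNew a)
    bEnd = hLine b + length (hOld b)
```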
Thank you. [Applause]
# Document Title
Bzr Init: A Bazaar Tutorial
Duck Rowing, by Fred McCann
Posted On December 26, 2013

Around seven years ago I made the leap from Subversion to distributed version control. The problem I was trying to solve was simple: I used to commute, which meant 2-3 hours of my days were spent on a train. To pass the time, I would learn new programming languages or frameworks. Eventually I became very productive in spite of the fact that a train is a hard place to work and battery technology was very poor.

I'd been a Subversion user, but there was no way I could work offline on the train. While this wasn't the only reason I considered switching to distributed version control, it was the straw that broke the camel's back.

Why Bazaar?

Bazaar is perhaps the least popular option of the big three open source DVCS: Git, Mercurial, and Bazaar. At this point, I think you can make the case that Git is by far the most commonly used option. In fact, when I was experimenting on the train, Git was the first tool I tried. After a few weeks with Git, it was clear things weren't working out.

I still use Git, though not as often as other systems. Because of Github and Git's popularity, it's hard not to use Git. I also use Mercurial almost every day to collaborate on some projects that have chosen Mercurial. And obviously I'm a Bazaar user: the majority of the projects I work on are tracked by Bazaar.

I'm advocating for Bazaar, or that you should at least learn more about it. Unlike the infamous Why Git Is Better Than X site, I'm going to do my best to stay away from focusing on minor differences that don't matter, or blatantly inflammatory misinformation. A lot of DVCS advocacy focuses on the benefits of DVCS in general and attributes these benefits to one tool. In reality, Git, Mercurial, and Bazaar are all completely distributed systems, and all share the same basic benefits of DVCS.

In that spirit, I'd like to get a few things out of the way:

- Developers successfully use Git, Mercurial, and Bazaar on projects both large and small.
- Git, Mercurial, and Bazaar are basically feature equivalent.
- Git, Mercurial, and Bazaar are all fast.

What I will focus on is how much friction a tool introduces, that is, how easy it is to get work done, and the philosophies and design decisions which have an impact on how the tools are used. What makes Bazaar the best choice is that it's by far the easiest tool to use, but more important, it gets the DVCS model right.

An Ideal Model of Distributed Version Control

I assume anyone reading this likely already uses a DVCS, or is at least aware of the benefits. In a nutshell, when you have a number of people both collaborating and working independently on the same codebase, you essentially have a number of concurrent branches, or lines of development. A DVCS tool makes creating and merging these branches easy, something that was a problem with Subversion or CVS. This is the core aspect of DVCS from which all the benefits flow.

The most visible consequence of this is that sharing code is divorced from saving snapshots. This is what solved my working-offline problem, but in general it gives people a lot more control over how they can model their workflows with their tools. The most important aspect of a DVCS is how well it models and tracks branches, and the ease with which we can fork and merge them.

Before examining why I think Bazaar gets the DVCS model correct, I'm going to present an ideal view of a project history.
What I'm presenting here should be uncontroversial, and I'm going to present it in tool-neutral language. Assume the following scenario:

1. (trunk a) Steve creates an initial state as the beginning of a "trunk" branch
2. (trunk b) Steve implements feature 0 in a mirror of trunk and pushes upstream
3. (branch1 a) Duane branches into branch1 and commits work
4. (branch1 b) Duane commits work in branch1
5. (trunk c) Duane lands his work into trunk by merging branch1 and pushing upstream
6. (branch3 a) Pete branches into branch3 and commits
7. (branch2 a) Jerry branches into branch2 and commits
8. (branch3 b) Pete commits work in branch3
9. (branch2 b) Jerry commits work in branch2
10. (branch2 c) Jerry commits work in branch2
11. (branch3 c) Pete commits work in branch3
12. (trunk d) Jerry lands feature 2 into trunk by merging branch2 and pushing upstream
13. (trunk e) Pete lands feature 3 into trunk by merging branch3 and pushing upstream

We have four branches: Trunk, Branch 1, Branch 2, and Branch 3. A branch is a line of development. We could also think of it as a thread of history.

[Figure: "Ideal" project history diagram]

The dots on the diagram represent snapshots. The snapshots represent the state of our project as recorded in history. The ordered set of all snapshots is:

Project = {A, B, 1A, 1B, C, 2A, 2B, 2C, D, 3A, 3B, 3C, E}

This is the complete set of snapshots. You could call this ordering the project history across all branches. Note that this is not ordered by the time the work was done, but rather by the time work arrived in the trunk. Ordering either by merge time or actual time could be valid, but this ordering is more beneficial for understanding the project.

Also, snapshots have parents. The parent of a snapshot is the set of immediately preceding snapshots in the graph. For example, the parent of snapshot 3B is snapshot 3A. The parent of snapshot D is {C, 2C}. A snapshot that has two parents represents the point at which two branches were merged.

We can further define the snapshots that compose the branches:

Trunk = {A, B, C, D, E}
Branch 1 = {1A, 1B}
Branch 2 = {2A, 2B, 2C}
Branch 3 = {3A, 3B, 3C}

Furthermore, we can define the concept of a head. A head, generally, is the most recent snapshot in a branch, after which new snapshots are appended.

Trunk head = E
Branch 1 head = 1B
Branch 2 head = 2C
Branch 3 head = 3C

With this ideal view of history, there are a number of questions we can answer, such as:

What is the history of a branch? The history of branch X is the set of snapshots that compose the branch.

What is the common ancestor of two branches (i.e. when did two branches diverge)? This can be answered by traversing the graphs of the branches, starting at the head, then working backwards through the graph to find a common ancestor.

When did two branches merge? To answer this, we find a snapshot that has two parents, one from each branch.

In which branch was a snapshot added? This is a question we'd ask when we want to determine which branch, or line of development, introduced a change. We can determine this because a snapshot is a member of only one branch's set of snapshots.

Like I said, nothing here is controversial. The qualities I'm describing are what make it possible to make sense of our project's history.
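The model is concrete enough to execute. Below is a small sketch of it (the type names are mine, not from any of the tools discussed): snapshots with parent sets form a directed acyclic graph, and the divergence question reduces to intersecting ancestor sets.

```haskell
import qualified Data.Map.Strict as M
import qualified Data.Set as S

newtype SnapId = SnapId String deriving (Eq, Ord, Show)

-- Each snapshot maps to its parents: one parent normally,
-- two for a merge snapshot (e.g. D has parents C and 2C).
type History = M.Map SnapId [SnapId]

-- All ancestors of a snapshot (including itself), walking parent links.
ancestors :: History -> SnapId -> S.Set SnapId
ancestors h s0 = go (S.singleton s0) [s0]
  where
    go seen []       = seen
    go seen (x : xs) =
      let ps = filter (`S.notMember` seen) (M.findWithDefault [] x h)
      in  go (foldr S.insert seen ps) (ps ++ xs)

-- "When did two branches diverge?": intersect the ancestor sets of
-- the two branch heads.
commonAncestors :: History -> SnapId -> SnapId -> S.Set SnapId
commonAncestors h a b = ancestors h a `S.intersection` ancestors h b
```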
If we made the assumption that branches were feature branches, a view of the trunk would look like this:

```
E [Merge] Implement Feature 3
D [Merge] Implement Feature 2
C [Merge] Implement Feature 1
B Implement Feature 0
A Initial State
```

The view of the trunk is the "big picture" of the project. The merge badges are an indicator that there is more history available in the associated feature branch's history. For example, Branch 2 might look like:

```
2C Completion of Feature 2
2B Partial Completion of Feature 2
2A Start work on Feature 2
```

The branches don't have to be features; they could represent individual developers, teams, etc. The distinction is that branches are lines of development with histories, and we want to see different levels of detail in the history depending on what branch we're examining.

## Git

Git is by far the most popular of the big three open source DVCS. It is also an order of magnitude harder to learn and to use than either Mercurial or Bazaar. I've expounded on this in my post on the Jenkins incident and one on Cocoapods.

At this point, I don't think it's inflammatory to point out that Git is difficult to use. A simple Google search will turn up enough detail on this point that I won't address it here. Git's interface is a very leaky abstraction, so anyone who's going to be very successful using Git will eventually have to learn about Git's implementation details to make sense of it.

This introduces a lot of cognitive overhead. More plainly, a user of Git has to expend a significant amount of focus and attention on using Git that would otherwise be spent on their actual work.

This overhead is inherent in a lot of Git commands, as the metaphors are fuzzy and a lot of commands have bad defaults.

While Git's interface is more complicated than is necessary, it certainly isn't a large enough concern to stop people from successfully using it, or other UIs built on top of it.

The core problem with Git is that it gets the branching model wrong.

The first problem with Git is that it uses colocated branches by default. A Git repository has a single working directory for all branches, and you have to switch between them using the git checkout command.

This is just one source of cognitive overhead. You only have a single working tree at a time, so you lack cues as to which branch you're operating in. It also makes it difficult to work in more than one branch at a time, as you have to manage intermediate changes in your working directory manually with git stash before you can perform the switch.

A better default would be to have a directory per branch. In this simpler scenario, there's never a concern about collisions between changes in the working directory, and you can take advantage of all the cues from working in separate directories, as well as of every tool that's been developed over the years that understands files and directories.

Colocated branches essentially make directories modal, which requires more of your attention.

Of course, you could make the case that colocated branches take up less space, but considering how cheap storage is, this doesn't impact most projects, which is why it's an odd default behavior.

Another case you could make is that you can share intermediate products, such as compiled object files. While this seems like a good rationale, a better, simpler solution would be to use a shared directory for build products.
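To make that friction concrete, here is a minimal sketch of the colocated-branch dance; the branch and file names are hypothetical:

```sh
# Working on branch2 with uncommitted changes...
git checkout branch2
echo "work in progress" >> feature2.txt

# ...an urgent fix is needed on master. Park the changes first:
git stash
git checkout master
# ... make and commit the fix ...

# Come back and restore the parked changes:
git checkout branch2
git stash pop
```

With a directory per branch, none of this bookkeeping exists: you just cd to the other branch's directory.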
While colocated branches aren't a deal breaker, there is a more fundamental flaw in Git's branch model. As I described in the ideal DVCS model, a branch is a line of development or a thread of history. Git in fact does not track this history. In Git, a branch is merely the head pointer, not the history.

In Git, branch 3 = 3C, not {3A, 3B, 3C}.

What we would consider the history only exists in the reflog, which will eventually be garbage collected.

By default, when we send our work up to a remote repository, not only is the reflog information not sent, the head pointer is also discarded, so any notion of a branch is lost.

This means that Git does not match the ideal model I've put forth, and is not able to answer certain questions, such as: in which branch was a snapshot added?

Let's take the example in the ideal model. We'll assume each branch is a feature, and that a separate developer is working on each in their local repository. We'll also assume that our "trunk" branch is the master branch on some remote repository.

If we use the default behavior in Git, and make all the snapshots as defined in the ideal model in the same chronological order, this is what our graph looks like:

[Figure: the history graph Git records for this scenario]

```
$ git log
commit 314397941825fb0df325601519694102f3e4a25b
Merge: 56faf98 a12bd9e
Author: Pete

    E Implement Feature 3

commit 56faf98989a059a6c13315695c17704668b98bda
Author: Jerry

    2C Complete Feature 2

commit a12bd9ea51e781cdc37cd6bce8a3966f2b5ee952
Author: Pete

    3C Complete Feature 3

commit 4e36f12f1a7dd01aa5887944bc984c316167f4a9
Author: Jerry

    2B Partial Completion of Feature 2

commit 2a3543e74c32a7cdca7e9805545f8e7cef5ca717
Author: Pete

    3B Partial Completion of Feature 3

commit 5b72cf35d4648ac37face270ee2de944ac2d5710
Author: Jerry

    2A Start work on Feature 2

commit 842dbf12c8c69df9b4386c5f862e0d338bde3e01
Author: Pete

    3A Start work on Feature 3

commit b6ca5293b79e3c37341392faac031af281d25205
Author: Duane

    1B Complete Feature 1

commit 684d006bfb4b9579e7ad81efdbe9145efba0e4eb
Author: Duane

    1A Start work on Feature 1

commit ad79f582156dafacbfc9e2ffe1e1408c21766578
Author: Steve

    B Implement Feature 0

commit 268de6f5563dd3d9683d307e9ab0be382eafc278
Author: Steve

    A Initial State
```

Git has completely mangled our branch history.

This is not a bug; this is a fundamental design decision in Git. For lack of a better term, I'd differentiate Git from other DVCS systems by saying it's patch oriented. The goal of Git is to arrive at some patch that describes the next state in a project; tracking history is viewed as less important than tracking the patch or the content.

This means we don't have a good way of figuring out what's going on with our conceptual trunk branch. Only one branch was recorded (incorrectly), and everything is shoved into master in chronological order. We also don't have a good way to see the history of any particular branch, nor which branch a branch was forked from, nor when it was merged back in.

If you've ever had to figure out how some change entered the system or whether some branch is cleared to be released, you'll know that this matters.

Git's broken branching model means the burden of correctly recording or interpreting history is thrust upon the user.

The safest workaround is to do all branch merging with the --no-ff option. This step will correct the object graph, since it will memorialize all merges. The default Git graph loses snapshots {C, D}, but preventing fast-forward merges will fix that.
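A minimal sketch of that workaround, with branch and message names following the running example:

```sh
# On the trunk, refuse a fast-forward so the merge itself is recorded:
git checkout master
git merge --no-ff branch2 -m "D [Merge] Implement Feature 2"
```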
The second problem is that Git doesn't capture enough information in each snapshot to remember the branch. To get around this, we have to manually insert that information, in-band, in the commit message.

So, when a developer commits work, they'll manually tag the commit with branch metadata. Then we can at least use git log's grep option to see work clearly:

```
$ git log --grep="\[Trunk\]"
commit 5d155c1d81b9c2803f2f2de890298ca442597535
Merge: 6a4ba95 0ed512b
Author: Pete

    E [Trunk][Merge] Implement Feature 3

commit 6a4ba950c807d9cb8fe55236e8d787b4fd4a663a
Merge: f355800 36ef31b
Author: Jerry

    D [Trunk][Merge] Implement Feature 2

commit f3558008a174e56735d95432b5d27cf0a26db030
Merge: df9f0df 3bdf920
Author: Duane

    C [Trunk][Merge] Implement Feature 1

commit df9f0df24faf0de5e8653626b9eb8c780086fc28
Author: Steve

    B [Trunk] Implement Feature 0

commit 67ba4b110a8cb45ba4f5a4fc72d97ddffafd7db0
Author: Steve

    A [Trunk] Initial State

$ git log --grep="\[Branch1\]"
commit 4b327621647987c6e8c34d844068e48dab82a6ab
Author: Duane

    1B [Branch1] Complete Feature 1

commit 469a850c458179914f4a79c804b778e2d3f1bfbe
Author: Duane

    1A [Branch1] Start work on Feature 1
```

This is sort of a clunky and error-prone approach, but tagging commit messages is a common way to compensate for Git's lack of branch support.

There is a more drastic and dangerous approach, which is to use a Git rebase workflow, potentially with squashing commits.

In this scenario, rather than adding more information to commits, you actively rewrite history to store even less data than Git captures by default. In the rebase workflow, rather than getting the mangled Git graph produced by the normal Git process, we get a straight line:

[Figure: the rebased history, a single straight line of commits]

It's ironic that the rebase workflow, a reaction to Git not storing enough information to model or make sense of project history, is to erase even more information. While the straight-line graph is easier to read, it's completely unable to answer the questions laid out in the ideal model.

This is what a Git user might refer to as a "clean" history. Your average Mercurial or Bazaar user is confused by this, because both of the other tools correctly model branches, and neither one would consider the butchered object graph "clean". The rebase workflow conflates readable logs with "clean" histories.

Rebasing is significantly worse than the --no-ff workaround, as you can squash commits that might actually matter. Another problem with rebasing, beyond that you might accidentally select the wrong commits, is that it prevents collaboration.

In order to use a rebase workflow, you have to adopt byzantine notions of "private" and "public" history. If you rebase away (i.e. discard) commits that have been shared with others, it leads to complicated cascading rebases. So, if I'm working on a feature and I intend to use the rebase workflow, I'm not going to be able to make ad-hoc merges with colleagues, because no one can count on my snapshots sticking around. Basically, sharing my work becomes difficult. One of the benefits of DVCS is that collaborating should be easy.

It's very hard to recommend Git. It's the most difficult tool to use, and its broken branch model gives rise to Git-specific solutions to Git-inflicted problems that Git users may think are best practices rather than kludges.

## Mercurial

Mercurial is much easier to use than Git. This is not to say that there aren't some UI concerns. The biggest gripe I have with Mercurial is that it separates pulling changes into a repository from merging changes, which can lead to a pending merge situation where you have multiple heads on the same branch because they have not yet been merged.
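Concretely, the two-step dance looks something like this (a minimal sketch; the repository path is hypothetical):

```sh
hg pull ../colleague-repo   # changesets arrive, but nothing is merged yet
hg heads                    # shows two heads on the same branch: a pending merge
hg merge                    # reconcile the two heads
hg commit -m "merge"
```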
Normally this isn't a problem, but if you're careless, you can end up with three or four heads in a repository before catching it.

I've never understood whether the manual multiple-heads pull/merge pattern was considered a feature in Mercurial, or whether it was an implementation detail leaking through. It certainly feels like the latter, just as Git's "feature" of exposing the index to users does.

That said, the gulf between Mercurial usability and Git usability is akin to the distance between New York City and Los Angeles. While Mercurial's UI is more complicated than Bazaar's, that difference is more like the distance between New York City and Boston, so I won't belabor the point.

Mercurial has two popular models for branching (though I believe there are technically four). The first and most common model is to branch by cloning. You can think of this as a euphemism for not branching at all.

In Mercurial, all work happens in a branch, but by default, all work happens in the same branch, called the "default" branch. The primary advantage branching by cloning has is that it's easier than the alternatives. There are fewer commands to know and call, and there's a working directory per branch. However, if we apply all of the commits in the same chronological order as I described in the ideal model example, here's what our changeset graph looks like:

[Figure: the history graph Mercurial records with the clone-as-branch model]

```
$ hg log
changeset:   10:d21a419a293e
tag:         tip
parent:      9:53d61d1605f9
parent:      6:7129c383aa3b
user:        Pete
summary:     E Implement Feature 3

changeset:   9:53d61d1605f9
user:        Pete
summary:     3C Complete Feature 3

changeset:   8:fb340c4eb013
user:        Pete
summary:     3B Partial Completion of Feature 3

changeset:   7:2a35bc51a28d
parent:      3:e9292c14c2b2
user:        Pete
summary:     3A Start work on Feature 3

changeset:   6:7129c383aa3b
user:        Jerry
summary:     2C Complete Feature 2

changeset:   5:b13281f66a25
user:        Jerry
summary:     2B Partial Completion of Feature 2

changeset:   4:187abbd1b3c4
user:        Jerry
summary:     2A Start work on Feature 2

changeset:   3:e9292c14c2b2
user:        Duane
summary:     1B Complete Feature 1

changeset:   2:a5e8ccb38d38
user:        Duane
summary:     1A Start work on Feature 1

changeset:   1:b60a08bf46c7
user:        Steve
summary:     B Implement Feature 0

changeset:   0:747376f7cfb9
user:        Steve
summary:     A Initial State
```

Mercurial has completely mangled our branch history. Well, to be fair, this is what happens when we don't actually use branches.

Mercurial fared a bit better than Git on the logging front: at least the order of changesets reflects the order in which they were merged, though I've seen projects where, either because of the complexity of the merges or perhaps because of an older version of Mercurial, the log looked nearly identical to Git's.

Unlike Git, Mercurial has no equivalent to the --no-ff option for merging. This means that with this clone-as-branch model, we can't get snapshots {C, D}.

While Git discards branch histories, Mercurial remembers them.
So, if we use the named branches workflow, we have a different story:

[Figure: the history graph Mercurial records with named branches]

```
$ hg log
changeset:   15:5495a79dbe2b
tag:         tip
parent:      10:ab972541c15e
parent:      14:f524ef255a5d
user:        Pete
summary:     E Implement Feature 3

changeset:   14:f524ef255a5d
branch:      branch3
user:        Pete
summary:     3D Close Branch3

changeset:   13:90ea24b0f0e1
branch:      branch3
user:        Pete
summary:     3C Complete Feature 3

changeset:   12:82bef28d0849
branch:      branch3
user:        Pete
summary:     3B Partial Completion of Feature 3

changeset:   11:d26ae37169a3
branch:      branch3
parent:      5:7f7c35f66937
user:        Pete
summary:     3A Start work on Feature 3

changeset:   10:ab972541c15e
parent:      5:7f7c35f66937
parent:      9:4282cb9c4a23
user:        Jerry
summary:     D Implement Feature 2

changeset:   9:4282cb9c4a23
branch:      branch2
user:        Jerry
summary:     2D Close Branch2

changeset:   8:0d7edbb59c8d
branch:      branch2
user:        Jerry
summary:     2C Complete Feature 2

changeset:   7:30bec7ee5bd2
branch:      branch2
user:        Jerry
summary:     2B Partial Completion of Feature 2

changeset:   6:bd7eb7ed40a4
branch:      branch2
user:        Jerry
summary:     2A Start work on Feature 2

changeset:   5:7f7c35f66937
parent:      1:635e85109055
parent:      4:52a27ea04f94
user:        Duane
summary:     C Implement Feature 1

changeset:   4:52a27ea04f94
branch:      branch1
user:        Duane
summary:     1C Close Branch1

changeset:   3:ceb303533965
branch:      branch1
user:        Duane
summary:     1B Complete Feature 1

changeset:   2:a6f29e9917eb
branch:      branch1
user:        Duane
summary:     1A Start work on Feature 1

changeset:   1:635e85109055
user:        Steve
summary:     B Implement Feature 0

changeset:   0:918345ee8664
user:        Steve
summary:     A Initial State
```

The hg log command shows us something that looks very similar to the previous log, with a few changes. First off, all of the merge changesets are present. Second, every changeset is associated with a branch (default is assumed if not listed). The last thing to notice is that we have an extra changeset per branch to "close" the branch.

Also, we can now easily show the history of a branch:

```
$ hg log -b default
changeset:   15:5495a79dbe2b
tag:         tip
parent:      10:ab972541c15e
parent:      14:f524ef255a5d
user:        Pete
summary:     E Implement Feature 3

changeset:   10:ab972541c15e
parent:      5:7f7c35f66937
parent:      9:4282cb9c4a23
user:        Jerry
summary:     D Implement Feature 2

changeset:   5:7f7c35f66937
parent:      1:635e85109055
parent:      4:52a27ea04f94
user:        Duane
summary:     C Implement Feature 1

changeset:   1:635e85109055
user:        Steve
summary:     B Implement Feature 0

changeset:   0:918345ee8664
user:        Steve
summary:     A Initial State

$ hg log -b branch1
changeset:   4:52a27ea04f94
branch:      branch1
user:        Duane
summary:     1C Close Branch1

changeset:   3:ceb303533965
branch:      branch1
user:        Duane
summary:     1B Complete Feature 1

changeset:   2:a6f29e9917eb
branch:      branch1
user:        Duane
summary:     1A Start work on Feature 1
```

When we take advantage of Mercurial's branching support, you can see how the Git notion of "cleaning" history is ridiculous.

There are some costs to using the named branches workflow. The first one is that there's more manual work to set up and tear down branches. We ended up with additional commits to "close" branches, which is a bit of bookkeeping Mercurial pushes off on the user. It's possible, with planning, to make the final commit in a branch and close it at the same time, but in practice you may not actually land a branch in the trunk until after a review or a QA process, so you may not know it's time to close the branch until that happens.
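For reference, the bookkeeping for one feature branch looks roughly like this (a minimal sketch using the running example's names):

```sh
hg branch branch2                               # start a named branch
hg commit -m "2A Start work on Feature 2"
# ... more commits ...
hg commit --close-branch -m "2D Close Branch2"  # the extra bookkeeping changeset
hg update default                               # back to the trunk
hg merge branch2
hg commit -m "D Implement Feature 2"
```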
Another problem with Mercurial's implementation is that it uses colocated branches with a single working directory by default. The additional cognitive overhead imposed by this choice is, I suspect, the reason more Mercurial users use the clone-as-a-branch approach.

Also, using named branches means your repository will have multiple heads most of the time (one per branch). For intermediate and advanced Mercurial users who have a good mental model of what's going on, this is no big deal. But beginning Mercurial users are often trained to see multiple heads as a sign of a pending merge, and a situation to be rectified.

In general, the named branch approach, while fully supported and completely functional, feels like an afterthought. Calling the hg branch command to create a new branch issues a warning, as though it were advising you against using branches:

```
$ hg branch branch1
marked working directory as branch branch1
(branches are permanent and global, did you want a bookmark?)
```

Just for clarification: you do not want a bookmark.

Similarly, when pushing work to an upstream repository, you have to tell Mercurial that yes, you intended to push a new branch:

```
hg push --new-branch
```

If I had to sum up Mercurial, I'd call it repository oriented. Unlike Git, it fully tracks history, but it treats the repository as the most prominent metaphor in the system, at the expense of the branch. Key Mercurial commands such as clone, pull, push, and log operate on repositories, while it has a separate branch command to operate on branches.

If you're willing to put up with a bit of extra work and some odd warnings from Mercurial, the named branches workflow meets the ideal model I've presented. Even if you choose the clone-as-a-branch model, you'd still have a lot less cognitive overhead than using Git, so while I find it hard to recommend Git, if you're already a Mercurial user, I'd urge you to consider using named branches.

## Bazaar

Bazaar is the simplest of the three systems to use. Common use of Bazaar is much like the clone-as-a-branch approach of Mercurial, without exposing the user directly to head pointers or multiple heads. As I said in the Mercurial section, Git is an order of magnitude harder to use than either Mercurial or Bazaar. Bazaar is unquestionably simpler from a UI perspective than Mercurial, especially compared to the named branches workflow, but that difference is nothing compared to the gulf between Git and the others in usability.

Also, Bazaar by default uses a directory per branch, so each branch gets its own working directory, meaning you don't have to use kludges like stashing working directory changes when switching branches. Switching branches is as easy as changing directories. Bazaar does support a method for colocated branches, but it is not the default and it's rarely needed. As I pointed out in the Git section, there are easy ways to tackle the problems that colocated branches are supposed to solve.
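A minimal sketch of the directory-per-branch workflow (the paths are hypothetical):

```sh
bzr init-repo project            # a shared repository to hold related branches
cd project
bzr init trunk                   # the trunk is itself just a branch
bzr branch trunk branch1         # fork a feature branch into its own directory
cd branch1
# ... edit files, then ...
bzr commit -m "1A Start work on Feature 1"
cd ../trunk
bzr merge ../branch1             # bring the feature into the trunk
bzr commit -m "C Implement Feature 1"
```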
Let's repeat the experiment with the normal Bazaar workflow and see how it compares to the ideal model:

[Figure: the history graph Bazaar records, matching the ideal model]

Bazaar, by default, exactly matches the ideal model. Now, I know my more cynical readers will assume that this is because I picked Bazaar's model as the "ideal" model, but they would be incorrect. Bazaar is not the first DVCS I used, nor did my ideal model derive from Bazaar. The ideal model is what I think should happen when branching and merging. As I said earlier, I don't think the model I laid out is controversial. I use Bazaar because it meets the model, not the other way around.

In fact, I think a lot of users get confused by Git and Mercurial's branch-by-clone workflow precisely because the history graph that they record does not resemble the user's mental model.

Let's take a look at the log:

```
$ bzr log
------------------------------------------------------------
revno: 5 [merge]
committer: Pete
branch nick: trunk
message:
  E Implement Feature 3
------------------------------------------------------------
revno: 4 [merge]
committer: Jerry
branch nick: trunk
message:
  D Implement Feature 2
------------------------------------------------------------
revno: 3 [merge]
committer: Duane
branch nick: trunk
message:
  C Implement Feature 1
------------------------------------------------------------
revno: 2
committer: Steve
branch nick: trunk
message:
  B Implement Feature 0
------------------------------------------------------------
revno: 1
committer: Steve
branch nick: trunk
message:
  A Initial State
------------------------------------------------------------
Use --include-merged or -n0 to see merged revisions.
```

This is very different from the logs presented by either Mercurial or Git. Each of the logs I've shown was taken from the upstream "authoritative" repository. Unlike the others, bzr log does not operate on a repository; it operates on a branch. So by default, we see the history of the project from the perspective of the trunk branch.

Another distinction of Bazaar's logs is that they are nested. To show the complete log, we can set the nesting level to 0, which will show an infinite level of nesting:

```
$ bzr log -n0
------------------------------------------------------------
revno: 5 [merge]
committer: Pete
branch nick: trunk
message:
  E Implement Feature 3
    ------------------------------------------------------------
    revno: 3.2.3
    committer: Pete
    branch nick: branch3
    message:
      3C Complete Feature 3
    ------------------------------------------------------------
    revno: 3.2.2
    committer: Pete
    branch nick: branch3
    message:
      3B Partial Completion of Feature 3
    ------------------------------------------------------------
    revno: 3.2.1
    committer: Pete
    branch nick: branch3
    message:
      3A Start work on Feature 3
------------------------------------------------------------
revno: 4 [merge]
committer: Jerry
branch nick: trunk
message:
  D Implement Feature 2
    ------------------------------------------------------------
    revno: 3.1.3
    committer: Jerry
    branch nick: branch2
    message:
      2C Complete Feature 2
    ------------------------------------------------------------
    revno: 3.1.2
    committer: Jerry
    branch nick: branch2
    message:
      2B Partial Completion of Feature 2
    ------------------------------------------------------------
    revno: 3.1.1
    committer: Jerry
    branch nick: branch2
    message:
      2A Start work on Feature 2
------------------------------------------------------------
revno: 3 [merge]
committer: Duane
branch nick: trunk
message:
  C Implement Feature 1
    ------------------------------------------------------------
    revno: 2.1.2
    committer: Duane
    branch nick: branch1
    message:
      1B Complete Feature 1
    ------------------------------------------------------------
    revno: 2.1.1
    committer: Duane
    branch nick: branch1
    message:
      1A Start work on Feature 1
------------------------------------------------------------
revno: 2
committer: Steve
branch nick: trunk
message:
  B Implement Feature 0
------------------------------------------------------------
revno: 1
committer: Steve
branch nick: trunk
message:
  A Initial State
```

Nested logs are a simple but incredibly useful innovation. We don't really need a graphical tool to visualize history.
Also, it's clear that the Git notion of "cleaning" history is equally as ludicrous as it was when looking at Mercurial's logs, perhaps even more so.

There is a real advantage here compared to Mercurial's named branches workflow. For lack of a better term, Bazaar is branch oriented. Both git init and hg init create new repositories. However, bzr init creates a new branch. Both git clone and hg clone create copies of repositories. The equivalent command in Bazaar, bzr branch, forks a new branch. Both git log and hg log examine the history of the repository. The bzr log command shows branch history.

It's a very subtle change in point of view. Bazaar elevates branches to the primary metaphor, relegating the concept of repository to more of a background player role.

The result is that the basic Bazaar workflow is to always branch and to always accurately track branch history. It accomplishes this with a cognitive overhead that is comparable to working with a centralized system like Subversion, or with Mercurial's clone-as-a-branch method.

Bazaar does have some downsides. The most significant one is that it has the smallest user base of the big three open source DVCS. This means it will be a little harder to find answers to your questions in blog postings and Stack Exchange answers. Also, commercial hosting companies like Bitbucket aren't going to offer Bazaar support. Nor is it as actively developed as Mercurial and Git.

However, Bazaar's positives so strongly outweigh the negatives that you'd be crazy not to consider Bazaar.

## Branching is the Secret Sauce in DVCS

What makes DVCS work is the ability to handle branches from creation to merging, and to do it in the least intrusive way possible so we can get our work done.

Bazaar gets branching right, by default. It's also the easiest system to learn and use. You don't have to expend nearly as much focus and attention on Bazaar as on Mercurial and Git, which means you have more attention to devote to your actual work.

I did not go into detail showing the commands used to create the graphs and logs presented in this post because, rather than getting bogged down in the syntax and the details, I want to make the big picture clear. I leave the commands as an exercise for motivated readers, so they can prove to themselves that their favorite tool just might not be recording history as they visualize it.

## Bzr Init: A Bazaar Tutorial

At this point, you might be wondering where the actual tutorial is. First I wanted to convince you why Bazaar is a good choice, if not the best choice.

The actual tutorial is much longer than this rant, so I gave it its own site:

Bzr Init: A Bazaar Tutorial

This tutorial is not meant to be exhaustive; it exists to show you how easy Bazaar is and how it can be easily adapted to your workflow. It's inspired by (let's say ripped off from) Joel Spolsky's Mercurial tutorial, HgInit. HgInit is not only one of the best Mercurial tutorials, it's one of the best DVCS tutorials. I'd recommend you read it even if you're not a Mercurial user.

Hopefully you'll get the same experience from the Bzr Init tutorial. In any case, if I've piqued your interest, read on!
# A Categorical Theory of Patches

arXiv:1311.3903v1 [cs.LO] 13 Nov 2013

Samuel Mimram, Cinzia Di Giusto
CEA, LIST (this work was partially supported by the French project ANR-11-INSE-0007 REVER)

## Abstract

When working with distant collaborators on the same documents, one often uses a version control system, which is a program tracking the history of files and helping to import modifications brought by others as patches. The implementation of such a system requires handling lots of situations depending on the operations performed by users on files, and it is thus difficult to ensure that all the corner cases have been correctly addressed. Here, instead of verifying the implementation of such a system, we adopt a complementary approach: we introduce a theoretical model, which is defined abstractly by the universal property that it should satisfy, and work out a concrete description of it. We begin by defining a category of files and patches, where the operation of merging the effect of two coinitial patches is defined by pushout. Since two patches can be incompatible, such a pushout does not necessarily exist in the category, which raises the question of which is the correct category to represent and manipulate files in a conflicting state. We provide an answer by investigating the free completion of the category of files under finite colimits, and give an explicit description of this category: its objects are finite sets labeled by lines equipped with a transitive relation, and its morphisms are partial functions respecting labeling and relations.

## 1 Introduction

It is common nowadays, when working with distant collaborators on the same files (multiple authors writing an article together, for instance), to use a program which will track the history of files and handle the operation of importing modifications of other participants. These programs, called version control systems (vcs for short), like git or Darcs, implement two main operations. When a user is happy with the changes it has brought to the files, it can record those changes in a patch (a file coding the differences between the current version and the last recorded version) and commit them to a server, called a repository. The user can also update its current version of the file by importing new patches added by other users to the repository and applying the corresponding modifications to the files. One of the main difficulties to address here is that there is no global notion of "time": patches are only partially ordered. For instance, consider a repository with one file A and two users u1 and u2. Suppose that u1 modifies file A into B by committing a patch f, which is then imported by u2, and then u1 and u2 concurrently modify the file B into C (resp. D) by committing a patch g (resp. h). The evolution of the file is depicted on the left and the partial ordering of patches in the middle:

[Figure: on the left, the evolution of the file, with f : A → B, g : B → C and h : B → D; in the middle, the partial order on patches, with f below both g and h; on the right, the completed square, with the residuals h/g : C → E and g/h : D → E.]

Now, suppose that u2 imports the patch g or that u1 imports the patch h. Clearly, the file resulting from the merging of the two patches should be the same in both cases, call it E. One way to compute this file is to say that there should be a patch h/g, the residual of h after g, which transforms C into E and has the "same effect" as h once g has been applied, and similarly there should be a patch g/h transforming D into E. Thus, after each user has imported changes from the other, the evolution of the file is as pictured on the right above.
In this article, we introduce a category L whose objects are files and whose morphisms are patches. Since residuals should be computed in the most general way, we formally define them as the arrows of pushout cocones, i.e. the square in the figure on the right should be a pushout.

However, as expected, not every pair of coinitial morphisms has a pushout in the category L: this reflects the fact that two patches can be conflicting (for instance if two users modify the same line of a file). Representing and handling such conflicts in a coherent way is one of the most difficult parts of implementing a vcs (as witnessed for instance by the various proposals for Darcs: mergers, conflictors, graphictors, etc. [10]). In order to have a representation for all conflicting files, we investigate the free completion of the category L under all pushouts, this category being denoted P, which corresponds to adding all conflicting files to the category, in the most general way possible. This category can easily be shown to exist for general abstract reasons, and one of the main contributions of this work is to provide an explicit description by applying the theory of presheaves. This approach paves the way towards the implementation of a vcs whose correctness is deduced from universal categorical properties.

Related work. The Darcs community has investigated a formalization of patches based on commutation properties [10]. Operational transformations tackle essentially the same issues by axiomatizing the notion of residual patches [9]. In both cases, the fact that residuals should form a pushout cocone is never explicitly stated, except in informal sentences saying that "g/f should have the same effect as g once f has been applied". We should also mention another interesting approach to the problem using inverse semigroups in [4]. Finally, Houston has proposed a category with pushouts, similar to ours, in order to model conflicting files [3], see Section 6.

Plan of the paper. We begin by defining a category L of files and patches in Section 2. Then, in Section 3, we abstractly define the category P of conflicting files obtained by free finite cocompletion. Section 4 provides a concrete description of the construction in the simpler case where patches can only insert lines. We give some concrete examples in Section 5 and adapt the framework to the general case in Section 6. We conclude in Section 7.

## 2 Categories of files and patches

In this section, we investigate a model for a simplified vcs: it handles only one file, and the only allowed operations are insertion and deletion of lines (modification of a line can be encoded by a deletion followed by an insertion). We suppose fixed a set L = {a, b, ...} of lines (typically words over an alphabet of characters). A file A is a finite sequence of lines, which will be seen as a function A : [n] → L for some number of lines n ∈ N, where the set [n] = {0, 1, ..., n − 1} indexes the lines of the file. For instance, a file A with three lines such that A(0) = a, A(1) = b and A(2) = c models the file abc. Given a ∈ L, we sometimes simply write a for the file A : [1] → L such that A(0) = a. A morphism between two files A : [m] → L and B : [n] → L is an injective increasing partial function f : [m] → [n] such that ∀i ∈ [m], B ∘ f(i) = A(i) whenever f(i) is defined. Such a morphism is called a patch.

Definition 1.
The category L has files as objects and patches as morphisms.

Notice that the category L is strictly monoidal with [m] ⊗ [n] = [m + n] and, for every file A : [m] → L and B : [n] → L, (A ⊗ B)(i) = A(i) if i < m and (A ⊗ B)(i) = B(i − m) otherwise, the unit being the empty file I : [0] → L, and the tensor being defined on morphisms in the obvious way. The following proposition shows that patches are generated by the operations of inserting and deleting a line:

Proposition 2. The category L is the free monoidal category containing L as objects and containing, for every line a ∈ L, morphisms η_a : I → a (insertion of a line a) and ε_a : a → I (deletion of a line a) such that ε_a ∘ η_a = id_I (deleting an inserted line amounts to doing nothing).

Example 3. The patch transforming the file abc into dadeb, by deleting the line c and inserting the lines labeled by d and e, is modeled by the partial function f : [3] → [5] such that f(0) = 1 and f(1) = 4, with f(2) undefined.

[Figure: the lines of abc mapped into dadeb; a is sent to the second line, b to the fifth, and c has no image.]

The deleted line is the one on which f is not defined, and the inserted lines are those which are not in the image of f. In other words, f keeps track of the unchanged lines.

In order to increase readability, we shall consider the particular case where L is reduced to a single element. In this unlabeled case, the objects of L can be identified with integers (the labeling function is trivial), and Proposition 2 can be adapted to achieve the following description of the category, see also [6].

Proposition 4. If L is reduced to a singleton, the category L is the free category whose objects are integers and whose morphisms are generated by s^n_i : n → n + 1 and d^n_i : n + 1 → n for every n ∈ N and i ∈ [n + 1] (respectively corresponding to insertion and deletion of a line at the i-th position), subject to the relations

    s^{n+1}_i s^n_j = s^{n+1}_{j+1} s^n_i        d^n_i s^n_i = id_n        d^n_i d^{n+1}_j = d^n_j d^{n+1}_{i+1}        (1)

whenever 0 ≤ i ≤ j < n.

We will also consider the subcategory L+ of L, with the same objects, and total injective increasing functions as morphisms. This category models patches where the only possible operation is the insertion of lines: Proposition 2 can be adapted to show that L+ is the free monoidal category containing morphisms η_a : I → a and, in the unlabeled case, Proposition 4 can be similarly adapted to show that it is the free category generated by morphisms s^n_i : n → n + 1 satisfying s^{n+1}_i s^n_j = s^{n+1}_{j+1} s^n_i with 0 ≤ i ≤ j < n.
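As a quick sanity check on the first of these relations (a worked instance of ours, not an example from the paper), take n = 2, i = 0, j = 1 and the file ab:

```latex
% Insert a line x at position 1, then a line y at position 0:
ab \xrightarrow{\;s^2_1\;} axb \xrightarrow{\;s^3_0\;} yaxb
% Insert y at position 0 first, then x at position 2:
ab \xrightarrow{\;s^2_0\;} yab \xrightarrow{\;s^3_2\;} yaxb
% Both composites agree: s^3_0 \, s^2_1 = s^3_2 \, s^2_0, an instance of
% s^{n+1}_i s^n_j = s^{n+1}_{j+1} s^n_i with i = 0 \le j = 1 < n = 2.
```

The relation just says that two insertions commute once the later index is shifted to account for the earlier insertion.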
## 3 Towards a category of conflicting files

Suppose that A is a file which is edited by two users, respectively applying patches f1 : A → A1 and f2 : A → A2 to the file. For instance,

    accb  ←f1−  ab  −f2→  abcd        (2)

Now, each of the two users imports the modification from the other one. The resulting file, after the import, should be the smallest file containing both modifications on the original file: accbcd. It is thus natural to state that it should be a pushout of the diagram (2). Now, it can be noticed that not every diagram in L has a pushout. For instance, the diagram

    acb  ←f1−  ab  −f2→  adb        (3)

does not admit a pushout in L. In this case, the two patches f1 and f2 are said to be conflicting.

In order to represent the state of files after applying two conflicting patches, we investigate the definition of a category P which is obtained by completing the category L under all pushouts. Since this completion should also contain an initial object (i.e. the empty file), we are actually defining the category P as the free completion of L under finite colimits: recall that a category is finitely cocomplete (has all finite colimits) if and only if it has an initial object and is closed under pushouts [6]. Intuitively, this category is obtained by adding files whose lines are not linearly ordered, but only partially ordered, such as the file on the left below, shown next to the corresponding textual notation in git:

      a              a
     / \             <<<<<<< HEAD
    c   d            c
     \ /             =======
      b      (4)     d
                     >>>>>>> 5c55...
                     b

This file would intuitively model the pushout of the diagram (3) if it existed, indicating that the user has to choose between c and d for the second line. Notice the similarities with the textual notation in git on the right. The name of the category L reflects the fact that its objects are files whose lines are linearly ordered, whereas the objects of P can be thought of as files whose lines are only partially ordered. More formally, the category is defined as follows.

Definition 5. The category P is the free finite conservative cocompletion of L: it is (up to equivalence of categories) the unique finitely cocomplete category together with an embedding functor y : L → P preserving finite colimits, such that for every finitely cocomplete category C and functor F : L → C preserving finite colimits, there exists, up to unique isomorphism, a unique functor F̃ : P → C preserving finite colimits and satisfying F̃ ∘ y = F (the evident triangle of functors commutes).

Above, the term conservative refers to the fact that we preserve colimits which already exist in L (we will only consider such completions here). The "standard" way to characterize the category P, which always exists, is to use the following folklore theorem, often attributed to Kelly [5, 1]:

Theorem 6. The conservative cocompletion of the category L is equivalent to the full subcategory of L̂ whose objects are presheaves which preserve finite limits, i.e. the image of a limit in L^op (or equivalently a colimit in L) is a limit in Set (and limiting cones are transported to limiting cones). The finite conservative cocompletion P can be obtained by further restricting to presheaves which are finite colimits of representables.

Example 7. The category FinSet of finite sets and functions is the conservative cocompletion of the terminal category 1.

We recall that the category L̂ of presheaves over L is the category of functors L^op → Set and natural transformations between them. The Yoneda functor y : L → L̂, defined on objects n ∈ L by yn = L(−, n) and on morphisms by postcomposition, provides a full and faithful embedding of L into the corresponding presheaf category, and can be shown to corestrict to a functor y : L → P [1]. A presheaf of the form yn for some n ∈ L is called representable.

Extracting a concrete description of the category P from the above proposition is a challenging task, because we a priori need to characterize firstly all diagrams admitting a colimit in L, and secondly all presheaves in L̂ which preserve those diagrams. This paper introduces a general methodology to build such a category.
In particular, perhaps a bit surprisingly, it turns out that we have to "allow cycles" in the objects of the category P, which will be described as the category whose objects are finite sets labeled by lines together with a transitive relation, and whose morphisms are partial functions respecting labels and relations.

## 4 A cocompletion of files and insertions of lines

In order to make our presentation clearer, we shall begin our investigation of the category P in a simpler case, which will be generalized in Section 6: we compute the free finite cocompletion of the category L+ (patches can only insert lines) in the case where the set of labels is a singleton. To further lighten notation, in this section we simply write L for this category.

We sometimes characterize the objects in L as finite colimits of objects in a subcategory G of L. This category G is the full subcategory of L whose objects are 1 and 2: it is the free category on the graph with two vertices and two parallel edges from 1 to 2, the two arrows being s^1_0 and s^1_1. The category Ĝ of presheaves over G is the category of graphs: a presheaf P ∈ Ĝ is a graph with P(1) as vertices and P(2) as edges, the functions P(s^1_1) and P(s^1_0) associating to an edge its source and target respectively, and morphisms correspond to usual morphisms of graphs. We denote by x ↠ y a path going from a vertex x to a vertex y in such a graph. The inclusion functor I : G → L induces, by precomposition, a functor I* : L̂ → Ĝ. The image of a presheaf in L̂ under this functor is called its underlying graph. By well-known results about presheaf categories, this functor admits a right adjoint I_* : Ĝ → L̂: given a graph G ∈ Ĝ, its image under the right adjoint is the presheaf G_* ∈ L̂ such that for every n ∈ N, G_*(n + 1) is the set of paths of length n in the graph G, with the expected source maps, and G_*(0) is reduced to one element.

Recall that every functor F : C → D induces a nerve functor N_F : D → Ĉ defined on an object A ∈ D by N_F(A) = D(F−, A) [7]. Here, we will consider the nerve N_I : L → Ĝ associated to the inclusion functor I : G → L. An easy computation shows that the image N_I(n) of n ∈ L is a graph with n vertices, so that its objects are isomorphic to [n], and there is an arrow i → j for every i, j ∈ [n] such that i < j. For instance, N_I(3) is the graph on vertices 0, 1, 2 with edges 0 → 1, 1 → 2 and 0 → 2, and N_I(4) is the graph on vertices 0, 1, 2, 3 with an edge i → j for every i < j.

It is, therefore, easy to check that this embedding is full and faithful, i.e. morphisms in L correspond to natural transformations in Ĝ. Moreover, since N_I(1) is the graph reduced to a vertex and N_I(2) is the graph reduced to two vertices and one arrow between them, every graph can be obtained as a finite colimit of the graphs N_I(1) and N_I(2) by "gluing arrows along vertices". For instance, the initial graph N_I(0) is the colimit of the empty diagram, and the graph N_I(3) is the colimit of a diagram made of three copies of N_I(2) (one for each edge) glued along three copies of N_I(1) (one for each vertex) via the morphisms N_I(s^1_0) and N_I(s^1_1). Notice that the object 3 is the colimit of the corresponding diagram in L, and this is generally true for all objects of L; moreover, this diagram is described by the functor El(N_I(3)) −π→ L.
The notation El(P) refers to the category of elements of a presheaf P ∈ Ĉ, whose objects are pairs (A, p) with A ∈ C and p ∈ P(A), and whose morphisms f : (A, p) → (B, q) are morphisms f : A → B in C such that P(f)(q) = p; π is the first projection functor. The functor I : G → L is thus a dense functor in the sense of Definition 9 below, see [7] for details.

Proposition 8. Given a functor F : C → D, with D cocomplete, the associated nerve N_F : D → Ĉ admits a left adjoint R_F : Ĉ → D called the realization along F. This functor is defined on objects P ∈ Ĉ by

    R_F(P) = colim( El(P) −π→ C −F→ D )

Proof. Given a presheaf P ∈ Ĉ and an object D ∈ D, it can be checked directly that morphisms P → N_F D in Ĉ are in bijection with cocones from the diagram El(P) −π→ C −F→ D to D, which in turn are in bijection with morphisms R_F(P) → D in D, see [7].

Definition 9. A functor F : C → D is dense if it satisfies one of the two equivalent conditions:

(i) the associated nerve functor N_F : D → Ĉ is full and faithful,

(ii) every object of D is canonically a colimit of objects in C: for every D ∈ D,

    D ≅ colim( El(N_F D) −π→ C −F→ D )        (5)

Since the functor I is dense, every object of L is a finite colimit of objects in G, and G does not have any non-trivial colimit. One could expect the free conservative finite cocompletion of L to be the free finite cocompletion P of G. We will see that this is not the case, because the image in L of a non-trivial diagram in G might still have a colimit. By Theorem 6, the category P is the full subcategory of L̂ of presheaves preserving limits, which we now describe explicitly. This category will turn out to be equivalent to a full subcategory of Ĝ (Theorem 15). We should first remark that those presheaves satisfy the following properties:

Proposition 10. Given a presheaf P ∈ L̂ which is an object of P,

1. the underlying graph of P is finite,

2. for each non-empty path x ↠ y there exists exactly one edge x → y (in particular there is at most one edge between two vertices),

3. P(n + 1) is the set of paths of length n in the underlying graph of P, and P(0) is reduced to one element.

Proof. We suppose given a presheaf P ∈ P; it preserves limits by Theorem 6. The square with vertex 1 at the bottom, two copies of 2 in the middle (reached by s^1_0 and s^1_1) and 3 at the top (reached by s^2_2 and s^2_0), i.e. the square s^2_2 ∘ s^1_0 = s^2_0 ∘ s^1_1, is a pushout in L, or equivalently the dual diagram is a pullback in L^op. Therefore, writing D for the diagram 2 ←s^1_0− 1 −s^1_1→ 2 in L, a presheaf P ∈ P should satisfy P((colim D)^op) ≅ lim P(D^op), i.e. the above pushout diagram in L should be transported by P into a pullback diagram in Set. This condition can be summarized by saying that P should satisfy the isomorphism P(3) ≅ P(2) ×_{P(1)} P(2) (and this isomorphism should respect the obvious source and target maps, given by the fact that the functor P should send a limiting cone to a limiting cone). From this fact, one can deduce that the elements α of P(3) are in bijection with the paths x → y → z of length 2 in the underlying graph of P, going from x = P(s^2_2 s^1_1)(α) to z = P(s^2_0 s^1_0)(α). In particular, this implies that for any path α = x → y → z of length 2 in the underlying graph of P, there exists an edge x → z, which is P(s^2_1)(α). More generally, given any integer n > 1, the object n + 1 is the colimit in L of the zigzag diagram

    1 −s^1_1→ 2 ←s^1_0− 1 −s^1_1→ 2 ←s^1_0− ... −s^1_1→ 2 ←s^1_0− 1        (6)
with n + 1 occurrences of the object 1 and n occurrences of the object 2. Therefore, for every n ∈ N, P(n + 1) is isomorphic to the set of paths of length n in the underlying graph. Moreover, since the diagram (7), obtained from (6) by adding an extra object 2 receiving an arrow s^1_1 from the first copy of 1 and an arrow s^1_0 from the last copy of 1, also admits the object n + 1 as colimit, P(n + 1) should also be isomorphic to the set of pairs consisting of a path of length n together with an edge between its endpoints x and y, i.e. for every non-empty path x ↠ y there exists exactly one edge x → y. Also, since the object 0 is initial in L, it is the colimit of the empty diagram. The set P(0) should thus be the terminal set, i.e. reduced to one element. Finally, since I is dense, P should be a finite colimit of the representables N_I(1) and N_I(2); the set P(1) is necessarily finite, as well as the set P(2), since there is at most one edge between two vertices.

Conversely, we wish to show that the conditions mentioned in the above proposition exactly characterize the presheaves in P among those in L̂. In order to prove so, by Theorem 6, we have to show that presheaves P satisfying these conditions preserve finite limits in L, i.e. that for every finite diagram D : J → L admitting a colimit we have P(colim D) ≅ lim(P ∘ D^op). It seems quite difficult to characterize the diagrams admitting a colimit in L; however, the following lemma shows that it is enough to check diagrams "generated" by a graph which admits a colimit.

Lemma 11. A presheaf P ∈ L̂ preserves finite limits if and only if it sends the colimits of diagrams of the form

    El(G) −π_G→ G −I→ L        (8)

to limits in Set, where G ∈ Ĝ is a finite graph such that the above diagram admits a colimit. Such a diagram in L is said to be generated by the graph G.

Proof. In order to check that a presheaf P ∈ L̂ preserves finite limits, we have to check that it sends colimits of finite diagrams in L which admit a colimit to limits in Set, and therefore we have to characterize diagrams which admit colimits in L. Suppose given a diagram K : J → L. Since I is dense, every object of L is a colimit of a diagram involving only the objects 1 and 2 (see Definition 9). We can therefore suppose that this is the case in the diagram K. Finally, it can be shown that the diagram K admits the same colimits as a diagram containing only s^1_0 and s^1_1 as arrows (these are the only non-trivial arrows in L whose source and target are 1 or 2), in which every object 2 is the target of exactly one arrow s^1_0 and one arrow s^1_1. For instance, a diagram involving the objects 1, 2 and 3 can be replaced in this way by a zigzag of 1's and 2's; the simplified diagram is generated by a graph such as 0 → 1 → 2 → 3 with an additional edge 1 → 3. Any such diagram K is obtained by gluing a finite number of diagrams of the form 1 −s^1_1→ 2 ←s^1_0− 1 along objects 1, and is therefore of the form El(G) −π→ G −I→ L for some finite graph G ∈ Ĝ: the objects of G are the objects 1 in K, the edges of G are the objects 2 in K, and the source and target of an edge 2 are respectively given by the sources of the corresponding arrows s^1_1 and s^1_0 admitting it as target. For instance, the simplified diagram just mentioned is generated by the graph with the extra edge.
The fact that every diagram is generated by a presheaf (is a discrete fibration) also follows more abstractly and generally from the construction of the comprehensive factorization system on Cat [8, 11].

Among diagrams generated by graphs, those admitting a colimit can be characterized using the following proposition:

Lemma 12. Given a graph G ∈ Ĝ, the associated diagram (8) admits a colimit in L if and only if there exists n ∈ L and a morphism f : G → N_I n in L̂ such that every morphism g : G → N_I m in L̂, with m ∈ L, factorizes uniquely through N_I n, i.e. g = h ∘ f for a unique morphism h : N_I n → N_I m.

Proof. Follows from the existence of a partially defined left adjoint to N_I, in the sense of [8], given by the fact that I is dense (see Definition 9).

We finally arrive at the following concrete characterization of diagrams admitting colimits:

Lemma 13. A finite graph G ∈ Ĝ induces a diagram (8) in L which admits a colimit if and only if it is "tree-shaped", i.e. it is

1. acyclic: for any vertex x, the only path x ↠ x is the empty path,

2. connected: for any pair of vertices x and y there exists a path x ↠ y or a path y ↠ x.

Proof. Given an object n ∈ L, recall that N_I n is the graph whose objects are elements of [n], with an arrow i → j if and only if i < j. Given a finite graph G, morphisms f : G → N_I n are therefore in bijection with functions f : V_G → [n], where V_G denotes the set of vertices of G, such that f(x) < f(y) whenever there exists an edge x → y (or equivalently, there exists a non-empty path x ↠ y).

Consider a finite graph G ∈ Ĝ: by Lemma 12, it induces a diagram (8) admitting a colimit if there is a universal arrow f : G → N_I n with n ∈ L. From this it follows that the graph is acyclic: otherwise, we would have a non-empty path x ↠ x for some vertex x, which would imply f(x) < f(x). Similarly, suppose that G is a graph with vertices x and y such that there is no path x ↠ y or y ↠ x, and that there is a universal morphism f : G → N_I n for some n ∈ L. Suppose that f(x) ≤ f(y) (the case where f(y) ≤ f(x) is similar). We can define a morphism g : G → N_I(n + 1) by g(z) = f(z) + 1 if there is a path x ↠ z, g(y) = f(x), and g(z) = f(z) otherwise. This morphism is easily checked to be well-defined. Since we always have f(x) ≤ f(y) and g(x) > g(y), there is no morphism h : N_I n → N_I(n + 1) such that h ∘ f = g.

Conversely, given a finite acyclic connected graph G, the relation ≤ defined on vertices by x ≤ y whenever there exists a path x ↠ y is a total order. Writing n for the number of vertices in G, the function f : G → N_I n, which to a vertex associates the number of vertices strictly below it wrt ≤, is universal in the sense of Lemma 12.

Proposition 14. The free conservative finite cocompletion P of L is equivalent to the full subcategory of L̂ whose objects are presheaves P satisfying the conditions of Proposition 10.

Proof. By Lemma 11, the category P is equivalent to the full subcategory of L̂ whose objects are presheaves preserving limits of diagrams of the form (8) generated by some graph G ∈ Ĝ which admits a colimit, i.e. by Lemma 13 the finite graphs which are acyclic and connected. We write G_n for the graph with [n] as vertices and edges i → (i + 1) for 0 ≤ i < n − 1. It can be shown that any acyclic and connected finite graph can be obtained from the graph G_n, for some n ∈ N, by iteratively adding an edge x → y for some vertices x and y such that there exists a non-empty path x ↠ y. Namely, suppose given an acyclic and connected finite graph G.
The relation ≤ on its vertices, defined by x ≤ y whenever there exists a path x ↠ y, is a total order, and therefore the graph G contains G_n, where n is the number of vertices of G. An edge in G which is not in G_n is necessarily of the form x → y with x ≤ y, otherwise G would not be acyclic. Since, by Proposition 10 (see (7)), the diagram generated by a graph of the form of (7) (a path together with an added edge from the source to the target of a non-empty path) is preserved by presheaves in P, it is enough to show that presheaves in P preserve diagrams generated by the graphs G_n. This follows again by Proposition 10, see (6).

One can notice that a presheaf P ∈ P is characterized by its underlying graph, since P(0) is reduced to one element and P(n + 1) is the set of paths of length n in this underlying graph: P ≅ I_*(I*P). We can therefore simplify the description of the cocompletion of L as follows:

Theorem 15. The free conservative finite cocompletion P of L is equivalent to the full subcategory of the category Ĝ of graphs whose objects are finite graphs such that for every non-empty path x ↠ y there exists exactly one edge x → y. Equivalently, it can be described as the category whose objects are finite sets equipped with a transitive relation <, and whose morphisms are functions respecting the relations.

In this category, pushouts can be explicitly described as follows:

Proposition 16. With the last description above, the pushout of a diagram (B, <_B) ←f− (A, <_A) −g→ (C, <_C) in P is B ⊎ C / ∼, with B ∋ b ∼ c ∈ C whenever there exists a ∈ A with f(a) = b and g(a) = c, equipped with the transitive closure of the relation inherited from <_B and <_C.

Lines with labels. The construction can be extended to the labeled case (i.e. L is not necessarily a singleton). The forgetful functor L̂ → Set sending a presheaf P to the set P(1) admits a right adjoint ! : Set → L̂. Given n ∈ N*, the elements of !L(n) are words u of length n over L, with !L(s^{n−1}_i)(u) being the word obtained from u by removing the i-th letter. The free conservative finite cocompletion P of L is the slice category L/!L, whose objects are pairs (P, ℓ) consisting of a finite presheaf P ∈ L̂ together with a labeling morphism ℓ : P → !L of presheaves. Alternatively, the description of Theorem 15 can be straightforwardly adapted by labeling the elements of the objects by elements of L (labels should be preserved by morphisms), thus justifying the use of labels for the vertices in the following examples.

## 5 Examples

In this section, we give some examples of merging (i.e. pushout) of patches.

Example 17. Suppose that starting from a file ab, one user inserts a line a′ at the beginning and c in the middle, while another one inserts a line d in the middle. After merging the two patches, the resulting file is the pushout of

    a′acb  ←f1−  ab  −f2→  adb

which is the partially ordered file with lines a′, then a, then c and d in conflict (with no order between them), then b.

Example 18. Write G1 for the graph with one vertex and no edges, and G2 for the graph with two vertices and one edge between them. We write s, t : G1 → G2 for the two morphisms in P. Since P is finitely cocomplete, there is a coproduct G1 + G1, which gives, by universal property applied to the cocone formed by s and t, an arrow seq : G1 + G1 → G2 that we call the sequentialization morphism. This morphism corresponds to the following patch: given two possibilities for a line, a user can decide to turn them into two consecutive lines. We also write seq′ : G1 + G1 → G2 for the morphism obtained similarly by exchanging s and t in the above cocone.
Example 18. Write G_1 for the graph with one vertex and no edges, and G_2 for the graph with two vertices and one edge between them. We write s, t : G_1 → G_2 for the two morphisms in P. Since P is finitely cocomplete, there is a coproduct G_1 + G_1 which gives, by the universal property, an arrow seq : G_1 + G_1 → G_2:

[diagram omitted: the cocone s, t : G_1 → G_2 factoring through G_1 + G_1 via seq]

that we call the sequentialization morphism. This morphism corresponds to the following patch: given two possibilities for a line, a user can decide to turn them into two consecutive lines. We also write seq′ : G_1 + G_1 → G_2 for the morphism obtained similarly by exchanging s and t in the above cocone. Now, the pushout of G_2 ←seq− G_1 + G_1 −seq′→ G_2 is the graph with two vertices and one edge in each direction between them, which illustrates how cyclic graphs appear in P during the cocompletion of L.

Example 19. With the notations of the previous example, by taking the coproduct of two copies of id_{G_1} : G_1 → G_1, there is a universal morphism merge : G_1 + G_1 → G_1, which illustrates how two independent lines can be merged by a patch (in order to resolve conflicts).

[diagram omitted: the cocone of two copies of the identity of G_1 factoring through G_1 + G_1 via merge]

6 Handling deletions of lines

All the steps performed in the previous sections in order to compute the free conservative finite cocompletion of the category L+ can be adapted in order to compute the cocompletion P of the category L as introduced in Definition 1, thus adding support for deletion of lines in patches. In particular, the generalization of the description given by Theorem 15 turns out to be as follows.

Theorem 20. The free conservative finite cocompletion P of the category L is the category whose objects are triples (A, <, ℓ), where A is a finite set of lines, < is a transitive relation on A and ℓ : A → L associates a label to each line, and whose morphisms f : (A, <_A, ℓ_A) → (B, <_B, ℓ_B) are partial functions f : A → B such that for all a, a′ ∈ A both admitting an image under f, we have ℓ_B(f(a)) = ℓ_A(a), and a <_A a′ implies f(a) <_B f(a′).

Similarly, pushouts in this category can be computed as described in Proposition 16, generalized in the obvious way to partial functions.

Example 21. Suppose that, starting from a file abc, one user inserts a line d after a and the other one deletes the line b. The merging of the two patches (in P′) is the pushout of

    adbc ←f1− abc −f2→ ac

which is adc, i.e. the file adc. Notice that the morphism f2 is partial: b has no image.
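The previous sketch extends to partial patches under one natural reading of "generalized in the obvious way" (this reading, the representation, and the names are our assumptions, not the authors' construction): a line deleted on one side is deleted in the result. This reproduces Example 21.

    # Illustrative sketch of the pushout with partial patches: a patch is
    # now a dict that simply omits the deleted lines.

    def pushout_partial(A, B, C, f, g):
        elems_A, _ = A
        elems_B, rel_B = B
        elems_C, rel_C = C
        rep = {("B", b): ("B", b) for b in elems_B}
        rep.update({("C", c): ("C", c) for c in elems_C})
        deleted = set()
        for a in elems_A:
            if a in f and a in g:
                rep[("C", g[a])] = ("B", f[a])   # glue f(a) ~ g(a)
            elif a in f:
                deleted.add(("B", f[a]))         # deleted on the C side
            elif a in g:
                deleted.add(("C", g[a]))         # deleted on the B side
        rel = {(rep[("B", x)], rep[("B", y)]) for x, y in rel_B}
        rel |= {(rep[("C", x)], rep[("C", y)]) for x, y in rel_C}
        changed = True                            # transitive closure, as before
        while changed:
            new = {(x, w) for x, y in rel for z, w in rel if y == z} - rel
            changed = bool(new)
            rel |= new
        elems = {e for e in rep.values() if e not in deleted}
        rel = {(x, y) for x, y in rel if x in elems and y in elems}
        return elems, rel

    # Example 21: from abc, one user inserts d after a, the other deletes b.
    A = ({"a", "b", "c"}, {("a", "b"), ("b", "c"), ("a", "c")})
    B = ({"a", "d", "b", "c"}, {("a", "d"), ("a", "b"), ("a", "c"),
                                ("d", "b"), ("d", "c"), ("b", "c")})
    C = ({"a", "c"}, {("a", "c")})
    f1 = {"a": "a", "b": "b", "c": "c"}          # total
    f2 = {"a": "a", "c": "c"}                    # partial: b has no image
    elems, rel = pushout_partial(A, B, C, f1, f2)
    # Three lines a, d, c with a < d < c, i.e. the file adc.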
Interestingly, a category very similar to the one we have described in Theorem 20 was independently proposed by Houston [3], based on a construction performed in [2] for modeling asynchronous processes. This category is not equivalent to ours because its morphisms are reversed partial functions: it is thus not the most general model (in the sense of being the free finite cocompletion). As a simplified explanation for this, consider the category FinSet, which is the finite cocompletion of 1. This category is finitely complete (in addition to cocomplete), thus FinSet^op is finitely cocomplete and 1 embeds fully and faithfully in it. However, FinSet^op is not the finite cocompletion of 1. Another way to see this is that this category does not contain the "merging" morphism of Example 19, but it contains a dual morphism "duplicating" lines.

7 Concluding remarks and future works

In this paper, we have detailed how we could derive from universal constructions a category which suitably models files resulting from conflicting modifications. It is finitely cocomplete, thus the merging of any modifications of a file is well-defined.

We believe that the interest of our methodology lies in the fact that it adapts easily to base categories L more complicated than the two investigated here: in future works, we should explain how to extend the model in order to cope with multiple files (which can be moved, deleted, etc.) and different file types (containing text, or more structured data such as XML trees). Also, the structure of repositories (partially ordered sets of patches) is naturally modeled by event structures labeled by morphisms in P, which will be detailed in future works, as well as how to model usual operations on repositories: cherry-picking (importing only one patch from another repository), using branches, removing a patch, etc. It would also be interesting to explore axiomatically the addition of inverses for patches, following other works hinted at in the introduction.

Once the theoretical setting is clearly established, we plan to investigate algorithmic issues (in particular, how to efficiently represent and manipulate the conflicting files, which are objects in P). This should eventually serve as a basis for the implementation of a theoretically sound and complete distributed version control system (with no unhandled corner cases, as in most current implementations of VCSs).

Acknowledgments. The authors would like to thank P.-A. Melliès, E. Haucourt, T. Heindel, T. Hirschowitz and the anonymous reviewers for their enlightening comments and suggestions.

References

[1] J. Adámek and J. Rosický. Locally Presentable and Accessible Categories, volume 189. Cambridge Univ. Press, 1994.
[2] R. Cockett and D. Spooner. Categories for synchrony and asynchrony. Electronic Notes in Theoretical Computer Science, 1:66–90, 1995.
[3] R. Houston. On editing text. http://bosker.wordpress.com/2012/05/10/on-editing-text.
[4] J. Jacobson. A formalization of darcs patch theory using inverse semigroups. Technical report, CAM report 09-83, UCLA, 2009.
[5] M. Kelly. Basic Concepts of Enriched Category Theory, volume 64. Cambridge Univ. Press, 1982.
[6] S. Mac Lane. Categories for the Working Mathematician, volume 5 of Graduate Texts in Mathematics. Springer Verlag, 1971.
[7] S. Mac Lane and I. Moerdijk. Sheaves in Geometry and Logic: A First Introduction to Topos Theory. Springer, 1992.
[8] R. Paré. Connected components and colimits. Journal of Pure and Applied Algebra, 3(1):21–42, 1973.
[9] M. Ressel, D. Nitsche-Ruhland, and R. Gunzenhäuser. An integrating, transformation-oriented approach to concurrency control and undo in group editors. In Proceedings of the 1996 ACM Conference on Computer Supported Cooperative Work, pages 288–297. ACM, 1996.
[10] D. Roundy et al. The Darcs Theory. http://darcs.net/Theory.
[11] R. Street and R. Walters. The comprehensive factorization of a functor. Bull. Amer. Math. Soc., 79(2):936–941, 1973.

A A geometric interpretation of presheaves on L+

Since presheaf categories are sometimes a bit difficult to grasp, we recall here the geometric interpretation that can be given for presheaves in L̂+. We forget about the labels of lines and, for simplicity, suppose that the empty file is not allowed (the objects are strictly positive integers); in this section, we denote this category by L. The same reasoning can be performed on the usual category L+, and even L, but the geometric explanation is a bit more involved to describe. In this case, the presheaves in L̂ can easily be described in geometric terms: the objects P of L̂ are presimplicial sets. Recall from Proposition 4 that the category L is the free category whose objects are the strictly positive integers, containing for all integers n ∈ N* and i ∈ [n+1] morphisms s^n_i : n → n+1, subject to the relations

    s^{n+1}_i s^n_j = s^{n+1}_{j+1} s^n_i    whenever 0 ≤ i ≤ j < n.
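These relations can be checked concretely on the word presheaf !L described earlier, where !L(s^{n−1}_i) deletes the i-th letter of a word: since presheaves act contravariantly, the relation above becomes an equation between double deletions. A small sketch of ours (the function names are assumptions):

    # Checking the presimplicial relation on the word presheaf !L, whose
    # face maps delete one letter of a word.

    def d(u, i):
        """The action of s_i on words: delete the i-th letter of u."""
        return u[:i] + u[i + 1:]

    # s^{n+1}_i s^n_j = s^{n+1}_{j+1} s^n_i (for i <= j) acts contravariantly,
    # giving d(d(u, i), j) == d(d(u, j + 1), i) on a word u.
    u = "abcde"
    assert all(d(d(u, i), j) == d(d(u, j + 1), i)
               for i in range(len(u) - 1)
               for j in range(i, len(u) - 1))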
Writing y : L → L̂ for the Yoneda embedding, a representable presheaf y(n+1) ∈ L̂ can be pictured geometrically as an n-simplex: a 0-simplex is a point, a 1-simplex is a segment, a 2-simplex is a (filled) triangle, a 3-simplex is a (filled) tetrahedron, etc.:

[figure omitted: y(1), y(2), y(3) and y(4) drawn as a point, a segment, a triangle and a tetrahedron]

Notice that the n-simplex has n+1 faces, which are (n−1)-dimensional simplices; these are given by the images under the action of the s^n_i, with i ∈ [n+1], of the unique element of y(n+1)(n+1): the i-th face of an n-simplex is the (n−1)-simplex obtained by removing the i-th vertex from the simplex. More generally, a finite presheaf P ∈ L̂ (a finite presimplicial set) is a finite colimit of representables: every such presheaf can be pictured as a gluing of simplices. For instance, the half-filled square below corresponds to the presimplicial set P with P(1) = {a, b, c, d}, P(2) = {f, g, h, i, j}, P(3) = {α}, with faces P(s^1_1)(f) = a, P(s^1_0)(f) = b, etc.

[figure omitted: a square with vertices a, b, c, d, edges f, g, h, i and a diagonal j, with exactly one of the two triangles filled by α]

Similarly, in the labeled case, a labeled presheaf (P, ℓ) ∈ L̂/!L can be pictured as a presimplicial set whose vertices (0-simplices) are labeled by elements of L. The word labeling a higher-dimensional simplex can then be deduced by concatenating the labels of the vertices it has as iterated faces. For instance, an edge (a 1-simplex) whose source is labeled by a and whose target is labeled by b is necessarily labeled by the word ab, etc.

[figure omitted: a triangle whose vertices are labeled a, b, c, whose edges are labeled ab, bc, ac, and whose 2-cell is labeled abc]

More generally, presheaves in L̂+ can be pictured as augmented presimplicial sets, and presheaves in L̂ as augmented simplicial sets; a description of those can for instance be found in Hatcher's book Algebraic Topology.

B Proofs of classical propositions

In this section, we briefly recall proofs of well-known propositions, as our proofs rely on a fine understanding of those. We refer the reader to [7] for further details.

Proposition 8. Given a functor F : C → D, with D cocomplete, the associated nerve N_F : D → Ĉ admits a left adjoint R_F : Ĉ → D, called the realization along F. This functor is defined on objects P ∈ Ĉ by

    R_F(P) = colim( El(P) −π→ C −F→ D )

Proof. In order to show the adjunction, we have to construct a natural family of isomorphisms D(R_F(P), D) ≅ Ĉ(P, N_F D), indexed by a presheaf P ∈ Ĉ and an object D ∈ D. A natural transformation θ ∈ Ĉ(P, N_F D) is a family of functions (θ_C : P(C) → D(F C, D))_{C∈C} such that, for every morphism f : C′ → C in C, the naturality square commutes, i.e. θ_{C′} ∘ P(f) = D(F f, D) ∘ θ_C. It can also be seen as a family (θ_C(p) : F C → D)_{(C,p)∈El(P)} of morphisms in D such that θ_C(p) ∘ F f = θ_{C′}(P(f)(p)) for every morphism f : C′ → C in C. This thus defines a cocone from F π_P : El(P) → D to D, and such cocones are in bijection with morphisms R_F(P) → D by the definition of R_F(P) as a colimit: we have shown D(R_F(P), D) ≅ Ĉ(P, N_F(D)), from which we conclude.

The equivalence between the two conditions of Definition 9 can be shown as follows.

Proposition 22. Given a functor F : C → D, the two following conditions are equivalent:

(i) the associated nerve functor N_F : D → Ĉ is full and faithful;
(ii) every object of D is canonically a colimit of objects in C: for every D ∈ D,

    D ≅ colim( El(N_F D) −π→ C −F→ D )

Proof.
In the case where D is cocomplete, the nerve functor N_F : D → Ĉ admits R_F : Ĉ → D as left adjoint, and the equivalence amounts to showing that the right adjoint N_F is full and faithful if and only if the counit of the adjunction is an isomorphism, which is a classical theorem [6, Theorem IV.3.1]. The construction can be adapted to the general case, where D is not necessarily cocomplete, by considering colim(El(−) −π→ C −F→ D) : Ĉ → D as a partially defined left adjoint (see [8]) and generalizing the theorem.

C Proofs of the construction of the finite cocompletion

Lemma 11. A presheaf P ∈ L̂ preserves finite limits if and only if it sends the colimits of diagrams of the form

    El(G) −π_G→ G −I→ L

to limits in Set, where G ∈ Ĝ is a finite graph such that the above diagram admits a colimit. Such a diagram in L is said to be generated by the graph G.

Proof. In order to check that a presheaf P ∈ L̂ preserves finite limits, we have to check that it sends colimits of finite diagrams in L which admit a colimit to limits in Set, and therefore we have to characterize the diagrams which admit colimits in L. The number of diagrams to check can be reduced by using the fact that limits commute with limits [6]. For instance, the inclusion functor I : G → L is dense, which implies that every object n ∈ L is canonically a colimit of the objects 1 and 2, by the formula n ≅ colim(El(N_I n) −π→ G −I→ L), see Definition 9. Thus, given a finite diagram K : J → L, we can replace any object n different from 1 and 2 occurring in the diagram by the corresponding diagram El(N_I n) −π→ G −I→ L, thus obtaining a new diagram K′ : J′ → L which admits the same colimit as K. This shows that P will preserve finite limits if and only if it preserves limits of finite diagrams in L in which the only occurring objects are 1 and 2. Since the only non-trivial arrows in L between the objects 1 and 2 are s^1_0, s^1_1 : 1 → 2, and removing an identity arrow in a diagram does not change its colimit, the diagram K can thus be assimilated to a bipartite graph with vertices labeled by 1 or 2 and edges labeled by s^1_0 or s^1_1, all edges going from vertices 1 to vertices 2.

We can also reduce the number of diagrams to check by remarking that some pairs of diagrams are "equivalent", in the sense that their images under P have the same limit, independently of P. For instance, consider a diagram in which an object 2 is the target of two arrows labeled by s^1_0. The diagram obtained by identifying the two arrows, along with the objects 1 at their sources, can easily be checked to be equivalent, by constructing a bijection between the cocones of the first and the cocones of the second.

[diagrams omitted: the two equivalent bipartite diagrams]

More precisely, if we write K′ : J′ → L and K : J → L for the two diagrams and J : J′ → J for the obvious functor (so that K′ = K ∘ J), the canonical arrow colim(K ∘ J) → colim(K) is an isomorphism, i.e. the functor J is final. The same reasoning of course also holds with s^1_1 instead of s^1_0. We can therefore restrict ourselves to considering diagrams in which each object 2 is the target of at most one arrow s^1_0 and of at most one arrow s^1_1. Conversely, if an object 2 is the target of no arrow s^1_0, then we can add a new object 1 and a new arrow s^1_0 from this object to the object 2 and obtain an equivalent diagram:

[diagrams omitted: the two equivalent bipartite diagrams]
The same reasoning holds with s^1_1 instead of s^1_0, and we can therefore restrict ourselves to diagrams in which every object 2 is the target of exactly one arrow s^1_0 and one arrow s^1_1.

Any such diagram K is obtained by gluing a finite number of diagrams of the form

    1 −s^1_1→ 2 ←s^1_0− 1

along objects 1, and is therefore of the form El(G) −π→ G −I→ L for some finite graph G ∈ Ĝ: the objects of G are the objects 1 of K, the edges of G are the objects 2 of K, and the source and target of an edge 2 are respectively given by the sources of the arrows s^1_1 and s^1_0 admitting it as target.

[figure omitted: an example of such a bipartite diagram on the left, and the finite graph generating it on the right]

Lemma 12. Given a graph G ∈ Ĝ, the associated diagram (8) admits a colimit in L if and only if there exist n ∈ L and a morphism f : G → N_I n in L̂ such that every morphism g : G → N_I m in L̂, with m ∈ L, factorizes uniquely through N_I n as g : G −f→ N_I n −→ N_I m.

Proof. We have seen in the proof of Proposition 8 that morphisms in L̂(G, N_I n) are in bijection with cocones in L from El(G) −π_G→ G −I→ L to n; moreover, given a morphism h : n → m in L, the function L̂(G, N_I n) → L̂(G, N_I m) induced by post-composition with N_I h is easily checked to correspond to the usual notion of morphism between n-cocones and m-cocones induced by h (every morphism N_I n → N_I m is of this form since N_I is full and faithful). We can finally conclude using the universal property defining colimiting cocones.

D Proofs for deletions of lines

In this section, we detail proofs of properties mentioned in Section 6.

D.1 Sets and partial functions

Before considering the conservative finite cocompletion of the category L, as introduced in Definition 1, it is enlightening to study the category PSet of sets and partial functions. A partial function f : A → B can always be seen

1. as a total function f : A → B ⊎ {⊥_B}, where ⊥_B is a fresh element wrt B, with f(a) = ⊥_B meaning that the partial function is undefined on a,
2. alternatively, as a total function f : A ⊎ {⊥_A} → B ⊎ {⊥_B} such that f(⊥_A) = ⊥_B.

This thus suggests considering the following category:

Definition 23. The category pSet of pointed sets has pairs (A, a), where A is a set and a ∈ A, as objects, and morphisms f : (A, a) → (B, b) are (total) functions f : A → B such that f(a) = b.

Point (2) of the preceding discussion can be summarized by saying that a partial function can be seen as a pointed function, and conversely:

Proposition 24. The category PSet of sets and partial functions is equivalent to the category pSet of pointed sets.

It is easily shown that the forgetful functor U : pSet → Set, sending a pointed set (A, a) to the underlying set A, admits a left adjoint F : Set → pSet, defined on objects by F A = (A ⊎ {⊥_A}, ⊥_A). This adjunction induces a monad T = U F on Set, from which point (1) can be formalized:

Proposition 25. The category PSet is equivalent to the Kleisli category Set_T associated to the monad T : Set → Set.
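In programming terms, T is essentially the Maybe/exception monad, and Propositions 24 and 25 say that partial functions are its Kleisli arrows. A small illustrative Python sketch of ours:

    # Partial functions A -> B as total functions A -> B + {bottom},
    # composed as Kleisli arrows for the Maybe/exception monad.
    BOTTOM = None  # the added base point

    def kleisli(g, f):
        """Compose partial functions represented with an explicit BOTTOM."""
        return lambda a: BOTTOM if f(a) is BOTTOM else g(f(a))

    half = lambda n: n // 2 if n % 2 == 0 else BOTTOM   # undefined on odd numbers
    pred = lambda n: n - 1 if n > 0 else BOTTOM         # undefined at 0

    h = kleisli(pred, half)
    print([h(n) for n in (4, 2, 3)])   # [1, 0, None]: undefinedness propagates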
Finally, it turns out that the category pSet of pointed sets might have been discovered from PSet using "presheaf thinking", as follows. We write G for the full subcategory of PSet containing two objects, the empty set 0 = ∅ and a set 1 = {∗} with only one element, and two non-trivial arrows ⋆ : 0 → 1 and ⊥ : 1 → 0 (the nowhere-defined function), such that ⊥ ∘ ⋆ = id_0. We write I : G → PSet for the inclusion functor, and consider the associated nerve functor N_I : PSet → Ĝ. Given a set A, the presheaf N_I A ∈ Ĝ is such that:

• N_I A(0) = PSet(I0, A) ≅ {⋆}: the only morphism 0 → A in PSet is noted ⋆,
• N_I A(1) = PSet(I1, A) ≅ A ⊎ {⊥_A}: a morphism 1 → A is characterized by the image of ∗ ∈ 1, which is either an element of A or undefined,
• N_I A(⋆) : N_I A(1) → N_I A(0) is the constant function whose image is ⋆,
• N_I A(⊥) : N_I A(0) → N_I A(1) is the function sending ⋆ to ⊥_A.

Moreover, given A, B ∈ PSet, a natural transformation from N_I A to N_I B is a pair of functions f : A ⊎ {⊥_A} → B ⊎ {⊥_B} and g : {⋆} → {⋆} making the two naturality squares (for ⋆ and ⊥) commute. Since {⋆} is the terminal set, such a natural transformation is characterized by a function f : A ⊎ {⊥_A} → B ⊎ {⊥_B} such that f(⊥_A) = ⊥_B. The functor N_I : PSet → Ĝ is thus full and faithful (i.e. I is dense), and its image is equivalent to pSet.

D.2 A cocompletion of L

The situation with regard to the category L is very similar. We follow the plan of Section 4 and first investigate the unlabeled case: L is the category with integers as objects and partial injective increasing functions f : [m] → [n] as morphisms f : m → n.

We write G for the full subcategory of L whose objects are 0, 1 and 2. This is the free category on the graph with objects 0, 1, 2 and morphisms s^0_0 : 0 → 1, d^0_0 : 1 → 0, s^1_0, s^1_1 : 1 → 2 and d^1_0, d^1_1 : 2 → 1, subject to the relations

    s^1_0 s^0_0 = s^1_1 s^0_0    d^0_0 s^0_0 = id_0    d^1_0 s^1_0 = id_1    d^1_1 s^1_1 = id_1    d^0_0 d^1_0 = d^0_0 d^1_1    (9)

(see Proposition 4). We write I : G → L for the embedding and consider the associated nerve functor N_I : L → Ĝ. Suppose given an object n ∈ L; the associated presheaf N_I n can be described as follows. Its sets are

• N_I n(0) = L(I0, n) ≅ {⋆},
• N_I n(1) = L(I1, n) ≅ [n] ⊎ {⊥},
• N_I n(2) = L(I2, n) ≅ {(i, j) ∈ [n] × [n] | i < j} ⊎ {(⊥, i) | i ∈ [n]} ⊎ {(i, ⊥) | i ∈ [n]} ⊎ {(⊥, ⊥)}: a partial function f : 2 → n is characterized by the pair of images (f(0), f(1)) of 0, 1 ∈ [2], where ⊥ means undefined,

and its morphisms are

• N_I n(s^0_0) : N_I n(1) → N_I n(0), the constant function whose image is ⋆,
• N_I n(d^0_0) : N_I n(0) → N_I n(1), the function whose image is ⊥,
• N_I n(s^1_0) : N_I n(2) → N_I n(1), the second projection,
• N_I n(s^1_1) : N_I n(2) → N_I n(1), the first projection,
• N_I n(d^1_0) : N_I n(1) → N_I n(2), sending i ∈ [n] ⊎ {⊥} to (⊥, i),
• N_I n(d^1_1) : N_I n(1) → N_I n(2), sending i ∈ [n] ⊎ {⊥} to (i, ⊥).

Such a presheaf can be pictured as a graph with N_I n(1) as set of vertices and N_I n(2) as set of edges, the source and target being respectively given by the functions N_I n(s^1_1) and N_I n(s^1_0):

[figure omitted: the graph on the vertices ⊥, 0, 1, 2, …, n−1]

Its vertices are the elements of [n] ⊎ {⊥} and its edges are of the form

• i → j with i, j ∈ [n] such that i < j,
• i → ⊥ for i ∈ [n],
• ⊥ → i for i ∈ [n],
• ⊥ → ⊥.

Morphisms between such presheaves are usual graph morphisms which preserve the vertex ⊥. We are thus naturally led to define the following categories of pointed graphs and of graphs with partial functions. We recall that a graph G = (V, s, t, E) consists of a set V of vertices, a set E of edges and two functions s, t : E → V associating to each edge its source and its target respectively.

Definition 26. We define the category pGraph of pointed graphs as the category whose objects are pairs (G, x), with G = (V, E) a graph and x ∈ V a vertex such that for every vertex there is exactly one edge from and to the distinguished vertex x, and whose morphisms f : G → G′ are usual graph morphisms, i.e. pairs (f_V, f_E) of functions f_V : V_G → V_{G′} and f_E : E_G → E_{G′} such that for every edge e ∈ E_G, f_V(s(e)) = s(f_E(e)) and f_V(t(e)) = t(f_E(e)), which moreover preserve the distinguished vertex.

Definition 27.
We define the category PGraph of graphs and partial morphisms as the category whose objects are graphs and whose morphisms f : G → G′ are pairs (f_V, f_E) of partial functions f_V : V_G → V_{G′} and f_E : E_G → E_{G′} such that

• for every edge e ∈ E_G such that f_E(e) is defined, f_V(s(e)) and f_V(t(e)) are both defined and satisfy f_V(s(e)) = s(f_E(e)) and f_V(t(e)) = t(f_E(e)),
• for every edge e ∈ E_G such that f_V(s(e)) and f_V(t(e)) are both defined, f_E(e) is also defined.

More briefly: a morphism is defined on an edge if and only if it is defined on its source and on its target.

Similarly to the previous section, a partial morphism of graphs can be seen as a pointed morphism of graphs, and conversely:

Proposition 28. The categories pGraph and PGraph are equivalent.

Now, notice that the category L is isomorphic to the full subcategory of PGraph whose objects are the graphs whose set of vertices is [n] for some n ∈ N, with an edge i → j precisely when i < j. Also notice that the full subcategory of pGraph whose objects are the graphs N_I n (with ⊥ as distinguished vertex), with n ∈ N, is isomorphic to the full subcategory of Ĝ whose objects are the N_I n with n ∈ N. And finally, the two categories are equivalent via the equivalence of Proposition 28. From this, we immediately deduce that the functor N_I : L → Ĝ is full and faithful, i.e.

Proposition 29. The functor I : G → L is dense.

We can now follow Section 4 step by step, adapting each proposition as necessary. The conditions satisfied by presheaves in P introduced in Proposition 10 are still valid in our new case:

Proposition 30. Given a presheaf P ∈ L̂ which is an object of P,

1. the underlying graph of P is finite,
2. for each non-empty path x ↠ y there exists exactly one edge x → y,
3. P(n+1) is the set of paths of length n in the underlying graph of P, and P(0) is reduced to one element.

Proof. The diagrams of the form (6) and (7) used in the proof of Proposition 10 still admit the same colimit n+1 with the new definition of L, and 0 is still initial. It can be checked that the limit of the image under a presheaf P ∈ L̂ of a diagram (6) is still the set of paths of length n in the underlying graph of P.

Lemma 11 is also still valid:

Lemma 31. A presheaf P ∈ L̂ preserves finite limits if and only if it sends the colimits of diagrams of the form

    El(G) −π_G→ G −I→ L

to limits in Set, where G ∈ Ĝ is a finite pointed graph such that the above diagram admits a colimit. Such a diagram in L is said to be generated by the pointed graph G.

Proof. The proof of Lemma 11 was done "by hand", but we mentioned a more abstract alternative proof. In the present case, a similar direct proof could be given but would be really tedious, so we provide the abstract one. In order to illustrate why we have to do so, one can consider the categories of elements associated to the presheaves representable by 0 and 1, which are clearly much bigger than in the case of Section 4:

[diagrams omitted: El(N_I 0) and El(N_I 1), subject to relations which follow directly from (9)]

Before going on with the proof, we need to introduce a few notions. A functor F : C → D is called final if for every category E and diagram G : D → E the canonical morphism colim(G ∘ F) → colim(G) is an isomorphism [6]: restricting a diagram along F does not change its colimit.
Alternatively, these functors can be characterized as the functors F such that, for every object D ∈ D, the comma category D/F is non-empty and connected (there is a zig-zag of morphisms between any two objects). A functor F : C → D is called a discrete fibration if for any object C ∈ C and morphism g : D → F C in D there exists a unique morphism f : C′ → C in C such that F f = g, called the lifting of g. To any such discrete fibration one can associate a presheaf P ∈ D̂, defined on any D ∈ D by P D = F⁻¹(D) = {C ∈ C | F C = D} and on morphisms g : D′ → D as the function P g which to C ∈ P D associates the source of the lifting of g with codomain C. Conversely, any presheaf P ∈ D̂ induces a discrete fibration El(P) −π→ D, and these two operations induce an equivalence of categories between the category D̂ and the category of discrete fibrations over D. It was shown by Paré, Street and Walters [8, 11] that any functor F : C → D factorizes as a final functor J : C → E followed by a discrete fibration K : E → D, and this factorization is essentially unique: this is called the comprehensive factorization of a functor. More explicitly, the functor K can be defined as follows. The inclusion functor Set → Cat, which sends a set to the corresponding discrete category, admits a left adjoint Π₀ : Cat → Set, sending a category to its set of connected components (its set of objects quotiented by the relation identifying two objects linked by a zig-zag of morphisms). The discrete fibration part K above can be defined as El(P) −π→ D, where P ∈ D̂ is the presheaf defined by P = Π₀(−/F). In this precise sense, every diagram F in D is "equivalent" to one which is "generated" by a presheaf P on D (we adopted this informal terminology in the article in order to avoid having to introduce too many categorical notions).

In our case, we can thus restrict to diagrams in L generated by presheaves on L. Finally, since I : G → L is dense, we can further restrict to diagrams generated by presheaves on G, by interchange of colimits.

Lemma 13 applies almost as in Section 4: since the morphism f : G → N_I n (seen as a partial function between graphs) has to satisfy the universal property of Lemma 12, by choosing for every vertex x of G a partial function g_x : G → N_I m which is defined on x (such a partial function always exists), it can be shown that the function f has to be total. The rest of the proof can be kept unchanged. Similarly, Proposition 14 applies with its proof unchanged.

Finally, we have:

Theorem 32. The free conservative finite cocompletion P of L is equivalent to the full subcategory of L̂ whose objects are presheaves P satisfying the conditions of Proposition 30. Since its objects P satisfy I_* I^*(P) ≅ P, it can equivalently be characterized as the full subcategory of Ĝ whose objects P are

1. finite,
2. transitive: for each non-empty path x ↠ y there exists exactly one edge x → y,
3. pointed: P(0) is reduced to one element.

From this characterization (which can easily be extended to the labeled case), along with the correspondence between pointed graphs and graphs with partial functions (Proposition 28), the category is shown to be equivalent to the category described in Theorem 20: the relation < is defined on the vertices x, y of a graph G by x < y whenever there exists a path x ↠ y.

As in the case of the previous section, the forgetful functor pGraph → Graph admits a left adjoint, thus inducing a monad on Graph.
The category pGraph is equivalent to the Kleisli category associated to this monad, which is closely related to the exception monad as discussed in [3].

E Modeling repositories

We briefly detail here the modeling of repositories evoked in Section 7. As explained in the introduction, repositories can be modeled as partially ordered sets of patches, i.e. of morphisms in L. Since some of them can be incompatible, it is natural to model repositories as particular labeled event structures.

Definition 33. An event structure (E, ≤, #) consists of a set E of events, a partial order relation ≤ on E and an incompatibility (conflict) relation # on events. We require that

1. for any event e, the downward closure of {e} is finite, and
2. given e₁, e′₁ and e₂ such that e₁ ≤ e′₁ and e₁ # e₂, we have e′₁ # e₂.

Two events e₁ and e₂ are compatible when they are not incompatible, and independent when they are compatible and neither e₁ ≤ e₂ nor e₂ ≤ e₁. A configuration x is a finite downward-closed set of pairwise compatible events. An event e₂ is a successor of an event e₁ when e₁ ≤ e₂ and there is no event in between. Given an event e, we write ↓e for the configuration, called the cause of e, obtained as the downward closure of {e} with e removed. A morphism of event structures f : (E, ≤, #) → (E′, ≤′, #′) is an injective function f : E → E′ such that the image of a configuration is a configuration. We write ES for the category of event structures.

To every event structure E, we can associate a trace graph T(E) whose vertices are the configurations and whose edges are of the form x −e→ x ⊎ {e}, where x is a configuration such that e ∉ x and x ⊎ {e} is a configuration. A trace is a path x ↠ y in this graph. Notice that any two paths x ↠ y (with the same source and target) have the same length. Moreover, given two configurations x and y such that x ⊆ y, there necessarily exists a path x ↠ y. It can be shown that this operation provides a faithful embedding T : ES → Graph from the category of event structures to the category of graphs, which admits a right adjoint.

Example 34. An event structure with five events is pictured on the left (arrows represent causal dependencies and ∼ incompatibilities); the associated trace graph is pictured on the right.

[figures omitted: the event structure with events a, b, c, c′, d, where b, c and c′ depend on a, d depends on b and c, and c ∼ c′; its trace graph, whose vertices are the eight configurations ∅, {a}, {a,b}, {a,c}, {a,c′}, {a,b,c}, {a,b,c′} and {a,b,c,d}]

Definition 35. A categorical event structure (E, λ) in a category C with an initial object consists of an event structure E equipped with a labeling functor λ : (T E)* → C, where (T E)* is the free category generated by the graph T E, such that λ∅ is the initial object of C and the image under λ of every square

[diagram omitted: the square formed by x −e₁→ y₁ −e₂→ z and x −e₂→ y₂ −e₁→ z]

in T E is a pushout in C.

The following proposition shows that a categorical event structure is characterized by a suitable labeling of the events of E by morphisms of C.

Proposition 36. The functor λ is characterized, up to isomorphism, by the images of the transitions ↓e −e→ ↓e ⊎ {e}.

We can now define a repository to be simply a finite categorical event structure (E, ≤, #, λ : T(E) → L). Such a repository extends to a categorical event structure (E, ≤, #₀, I ∘ λ : T(E) → P), where #₀ is the empty conflict relation. The state S of such an event structure is the file obtained as the image S = I ∘ λ(E) of the maximal configuration: this is the file that the user is currently editing, given his repository.
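As a sanity check on Definition 33 and Example 34, here is a small illustrative Python sketch of ours (the encoding is an assumption: causes are given explicitly and the inherited conflict pair is listed by hand) enumerating the configurations and trace-graph edges of the five-event structure above.

    # Computing the trace graph T(E) of the event structure of Example 34.
    from itertools import combinations

    events = {"a", "b", "c", "c'", "d"}
    below = {"a": set(), "b": {"a"}, "c": {"a"}, "c'": {"a"},
             "d": {"a", "b", "c"}}          # strict causes of each event
    conflict = {frozenset({"c", "c'"}), frozenset({"c'", "d"})}  # inherited too

    def is_configuration(x):
        downward = all(below[e] <= x for e in x)
        compatible = all(frozenset(p) not in conflict for p in combinations(x, 2))
        return downward and compatible

    configs = [set(x) for r in range(len(events) + 1)
               for x in combinations(sorted(events), r) if is_configuration(set(x))]
    edges = [(x, e, x | {e}) for x in configs for e in events - x
             if is_configuration(x | {e})]
    print(len(configs), len(edges))   # 8 configurations, as in Example 34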
Usual operations on repositories can be modeled in this context; for instance, importing the patches of another repository is obtained by a pushout construction (the category of repositories is finitely cocomplete).
# The most confusing git terminology

n.b. This blog post dates from 2012, so some of it may be out of date now.

To add my usual disclaimer to the start of these blog posts, I should say that I love git; I think it's a beautiful and elegant system, and it saves me huge amounts of time in my daily work. However, I think it's a fair criticism of the system that its terminology is very confusing for newcomers, and in particular those who have come from using CVS or Subversion.

This is a personal list of some of my "favourite" points of confusion, which I've seen arise time and time again, both in real life and when answering questions on Stack Overflow. To be fair to all the excellent people who have contributed to git's development, in most cases it's clear that they are well aware that these terms can be problematic, and are trying to improve the situation subject to compatibility constraints. The problems that seem most bizarre are those that reuse CVS and Subversion terms for completely different concepts – I speculate a bit about that at the bottom.

"update"

If you've used Subversion or CVS, you're probably used to "update" being a command that goes to the remote repository and incorporates changes from the remote version into your local copy – this is (very broadly) analogous to "git pull". So, when you see the following error message when using git:

    foo.c: needs update

… you might imagine that this means you need to run "git pull". However, that's wrong. In fact, what "needs update" means is approximately: "there are local modifications to this file, which you should probably commit or stash".

"track" and "tracking"

The word "track" is used in git in three senses that I'm aware of. This ambiguity is particularly nasty, because the latter two collide at a point in learning the system where newcomers to git are likely to be baffled anyway. Fortunately, this seems to have been recognized by git's developers (see below).

1. "track" as in "untracked files"

To say that a file is tracked in the repository appears to mean that it is either present in the index or exists in the commit pointed to by HEAD. You see this usage most often in the output of "git status", where it will list "untracked files":

    # On branch master
    # Untracked files:
    #   (use "git add <file>..." to include in what will be committed)
    #
    #       .classpath

This sense is relatively intuitive, I think – it was only after complaining for a while about the next two senses of "track" that I even remembered that there was also this one :)

2. "track" as in "remote-tracking branch"

As a bit of background, you can think of a remote-tracking branch as a local cache of the state of a branch in a remote repository. The most commonly seen example is origin/master, or, to name that ref in full, refs/remotes/origin/master. Such branches are usually updated by git fetch (and thus also potentially by git pull). They are also updated by a successful push to the branch in the remote repository that they correspond to. You can merge from them, examine their history, etc., but you can't work directly on them.

The sense of "track" in the phrase "remote-tracking branch" is indicating that the remote-tracking branch is tracking the state of the branch in the remote repository the last time that remote-tracking branch was updated. So, you might say that refs/remotes/origin/master is tracking the state of the branch master in origin.

The "tracking" here is defined by the refspec in the config variable remote.<remote-name>.fetch and the URL in the config variable remote.<remote-name>.url.
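For example, you can inspect that configuration directly (the remote URL below is a made-up placeholder; the refspec shown is git's default):

    $ git config remote.origin.url
    git@example.com:project.git
    $ git config remote.origin.fetch
    +refs/heads/*:refs/remotes/origin/*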
3. "track" as in "git branch --track foo origin/bar" and "Branch foo set up to track remote branch bar from origin"

Again, if you want to do some work on a branch from a remote repository, but want to keep your work separate from everything else in your repository, you'll typically use a command like the following (or one of its many "Do What I Mean" equivalents):

    git checkout --track -b foo origin/bar

… which will result in the following messages:

    Branch foo set up to track remote branch bar from origin
    Switched to a new branch 'foo'

The sense of "track" both in the command and the output is distinct from the previous sense – it means that config options have been set that associate your new local branch with another branch in the remote repository. The documentation sometimes refers to this relationship as making bar in origin "upstream" of foo. This "upstream" association is very useful, in fact: it enables nice features like being able to just type git pull while you're on branch foo in order to fetch from origin and then merge from origin/bar. It's also how you get helpful messages about the state of your branch relative to the remote-tracking branch, like "Your branch foo is 24 commits ahead of origin/bar and can be fast-forwarded".

The tracking here is defined by the config variables branch.<branch-name>.remote and branch.<branch-name>.merge.

"tracking" summary

Fortunately, the third sense of "tracking" seems to be being carefully deprecated – for example, one of the possible options for push.default used to be tracking, but this is now deprecated in favour of the option name upstream. The commit message for 53c403116 says:

    push.default: Rename 'tracking' to 'upstream'

    Users are sometimes confused with two different types of "tracking"
    behavior in Git: "remote-tracking" branches (e.g. refs/remotes/*/*)
    versus the merge/rebase relationship between a local branch and its
    @{upstream} (controlled by branch.foo.remote and branch.foo.merge
    config settings).

    When the push.default is set to 'tracking', it specifies that a branch
    should be pushed to its @{upstream} branch. In other words, setting
    push.default to 'tracking' applies only to the latter of the above two
    types of "tracking" behavior.

    In order to make this more understandable to the user, we rename the
    push.default == 'tracking' option to push.default == 'upstream'.

    push.default == 'tracking' is left as a deprecated synonym for
    'upstream'.

"commit"

In CVS and Subversion, "commit" means to send your changes to the remote repository. In git, the action of committing (with "git commit") is entirely local; the closest equivalent of "cvs commit" is "git push". In addition, the word "commit" in git is used as both a verb and a noun (although frankly I've never found this confusing myself – when you commit, you create a commit).

"checkout"

In CVS and Subversion, "checkout" creates a new local copy of the source code that is linked to that repository. The closest command in git is "git clone". However, in git, "git checkout" is used for something completely distinct. In fact, it has two largely distinct modes of operation:

1. To switch HEAD to point to a new branch or commit, in the usage git checkout <branch>. If <branch> is genuinely a local branch, this will switch to that branch (i.e. HEAD will point to the ref name) or, if it otherwise resolves to a commit, will detach HEAD and point it directly to the commit's object name.

2. To replace a file or multiple files in the working copy and the index with their content from a particular commit or the index. This is seen in the usages git checkout -- <paths> (update from the index) and git checkout <tree-ish> -- <paths> (where <tree-ish> is typically a commit).

(git checkout is also frequently used with -b, to create a new branch, but that's really a sub-case of usage 1.)

In my ideal world, these two modes of operation would have different verbs, and neither of them would be "checkout".

Update in 2023: This has now happened – you can now use "git switch" for the former cases and "git restore" for the latter. This is a very welcome change :)
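For instance (reusing the hypothetical branch names from above), the two newer commands split the two modes of operation cleanly:

    $ git switch foo                             # switch to the branch 'foo'
    $ git switch -c new-branch                   # create a branch and switch to it
    $ git restore -- foo.c                       # restore foo.c from the index
    $ git restore --source=origin/bar -- foo.c   # restore foo.c from a commit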
"HEAD" and "head"

There are usually many "heads" (lower-case) in a git repository – the tip of each branch is a head. However, there is only one HEAD (upper-case), which is a symbolic ref that points to the current branch or commit.

"fetch" and "pull"

I wasn't aware of this until Roy Badami pointed it out, but it seems that git and Mercurial have opposite meanings for "fetch" and "pull" – see the top two lines in this table of git / hg equivalences. I think it's understandable that, since git's and Mercurial's development were more or less concurrent, such unfortunate clashes in terminology might occur.

"push" and "pull"

"git pull" is not the opposite of "git push"; the closest there is to an opposite of "git push" is "git fetch".

"hash", "SHA1", "SHA1sum", "object name" and "object identifier"

These terms are often used synonymously to mean the 40-character hexadecimal strings that uniquely identify objects in git. "Object name" seems to be the most official, but the least used in general. Referring to an object name as a SHA1sum is potentially confusing, since the object name for a blob is not the same as the SHA1sum of the file.

"remote branch"

This term is only used occasionally in the git documentation, but it's one that I would always try to avoid, because it tends to be unclear whether you mean "a branch in a remote repository" or "a remote-tracking branch". Whenever a git beginner uses this phrase, I think it's worth clarifying this, since it can avoid later confusion.

"index", "staging area" and "cache"

As nouns, these are all synonyms, which all exist for historical reasons. Personally, I like "staging area" the best, since it seems to be the easiest concept to understand for git beginners, but the other two are used more commonly in the documentation.

When used as command options, --index and --cached have distinct and consistent meanings, as explained by Junio C. Hamano in this useful blog post.
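A quick way to see the noun sense in action: `git diff` compares the working tree with the index, while `git diff --cached` (or its synonym `--staged`) compares the index with HEAD. For example:

    $ git add foo.c
    $ git diff            # empty: the change now lives in the index
    $ git diff --cached   # shows the staged change relative to HEAD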
Why are there so many of these points of confusion?

I would speculate that the most significant effect that contributed to these terminology confusions is that git was being actively used by an enthusiastic community from very early in its development, which means that early names for concepts have tended to persist for the sake of compatibility and consistency. That doesn't necessarily account for the many conflicts with CVS / Subversion usage, however.

To be fair to git, thinking up verbs for particular commands in any software is tough, and there have been enough version control systems written that to completely avoid clashes would lead to some convoluted choices. However, it's hard to see git's use of CVS / Subversion terminology for completely different concepts as anything but perverse. Linus has made it very clear many times that he hated CVS, and joked that a design principle for git was WWCVSND (What Would CVS Never Do); I'm sympathetic to that, as I'm sure most are, especially after having switched to the DVCS mindset. However, could that attitude have extended to deliberately disregarding concerns about terminology that might make it actively harder for people to migrate to git from CVS / Subversion? I don't know nearly enough about the early development of git to know. However, it wouldn't have been tough to find better choices for commit, checkout and update in each of their various senses.

Posted 2012-05-07 in git by mark

21 responses to "The most confusing git terminology"

Dmitriy Matrosov, 2012-06-09:

Hi, Mark. Thanks for your excellent explanation of "tracking" branches! This is exactly what I searched for. Also, I think there is a typo: you wrote "The documentation sometimes refers to this relationship as making foo in origin "upstream" of bar.", but it should be "bar in origin upstream of foo".

mark, 2012-06-11:

Hi Dmitriy: thanks for your kind comment and the correction. I've edited the post to fix that now.

harold, 2012-10-05:

Really good article – a lot of these terms are thrown about in git documentation/tutorials without much explanation and can certainly trip you up.

ahmet, 2012-11-26:

"Why are there so many of these points of confusion?" Probably because Linus is a dictator and he probably didn't give a shit about others' opinions and past works for the terminology.

Bri, 2014-08-14:

Right! He forgot to include the input of someone who knows an inkling about user-friendliness. Imagine how much easier GIT SWITCH would be to remember than GIT CHECKOUT. The whole system was made unnecessarily complicated just by having awfully named functions.

Alberto Fonseca, 2018-03-22:

Good point. Another reason is probably the agile paradigm that tells us that only "working code" is important, so developers happily cast aside anything even remotely resembling logical, structured, conceptual work, which works only based on a precisely defined set of domain vocabulary (which is exactly what we're missing here).

Jennifer, 2013-02-05:

I find the term upstream confusing. You mention one usage in this article. However, GitHub help recommends "To keep track of the original repo [you forked from], you need to add another remote named upstream." So when someone says to push upstream, it's ambiguous. I'm not sure if this usage is typical of Git in general or just GitHub, but it sure left me confused for a while.

Ian, 2013-11-21:

""remote branch" … it tends to be unclear whether you mean "a branch in a remote repository" or "a remote-tracking branch". Whenever a git beginner uses this phrase, I think it's worth clarifying this, since it can avoid later confusion."

Doh! That's exactly what I'm trying to find out, but then you don't actually give us the answer!

admin, 2013-11-21:

Hi Ian, that's the thing – you often can't tell without context. Where did you see "remote branch" used, or in what context? ("A branch in a remote repository" is the more accurate interpretation, I think, but people who don't understand remote-tracking branches might use the term differently.)

Oleh, 2014-05-28:

Thanks for the helpful git articles.
I'm a git noob, and the more different sources I read, the closer my understanding converges on what's really going on. It's just a shame that numerous git commands have ended up being labels where one must simply memorize what they really do – might as well have called them things like "abc" and "xyz" instead of terms that initially lead one astray.

If you're motivated to write another one, here's a suggestion. Maybe you could review all the places that git stores stuff – working repository, index (aka cache or staging area), remote repository, local copy of a remote repository (remote-tracking branch?), stash (and stash versions), etc. – and list the commands that transfer data in various ways among these places. I've had to gradually formulate and revise these ideas in my own mind as I've been reading about git, and if I had an up-front explanation of this at the outset, it would've saved me lots of time.

mark, 2014-06-03:

Thanks for the kind words and your suggestion, Oleh. I'll certainly consider it – there are quite a few good diagrammatic representations of the commands that change what's stored in the index, working tree, etc., but maybe there's something more to do. (You can find some good examples with an image search for git index diagram.)

bryan chance, 2017-12-06:

Oleh, that would be a great write-up. I think those are the rest of the most confusing git terminology. :p I'm new to git as well, and I thought I was just stupid because I couldn't get a handle on it. This is 2017 and yes, git terminologies still baffle users.

Savanna, 2018-11-13:

I've been using git for years and I'm only just now diving into how it actually works and what the commands actually mean, since I'm working at a place that has a rather complicated branching structure… in the past I simply worked on master, since I was the only or one of few developers. git pull and git push were usually enough. Now I find myself often totally confused and realized I don't actually know what the heck I'm doing with git, so I'm reading all these articles and watching videos. It's been too long coming! Haha

Alistair, 2014-11-12:

"The sense of "track" in the phrase "remote-tracking branch" is indicating that the remote-tracking branch is tracking the state of the branch in the remote repository the last time that remote-tracking branch was updated."

In the light of your paragraph about "update", maybe "fetched" would be a better word than "updated"?

"3. "track" as in "git branch --track foo origin/bar""

I don't understand how senses 2 and 3 are different. Specifically, sense 2 seems to be the special case of 3 where "foo" and "bar" coincide. I.e. the "upstream" of a remote-tracking branch is the branch of the same name on "origin". Is that right?

"commit … as a noun"

I agree. This terminology is fine, and not really confusing at all. The closest English word would be "commitment", I think, but in the context it is good to have a slightly different word.

""fetch" and "pull" […] "push" and "pull""

This juxtaposition is hilarious.
In the former you are very diplomatic in suggesting that git's version is an arbitrary choice, made by historical accident, but then in the latter you immediately make it look like exactly the opposite choice would be better.

admin, 2014-12-30:

"In the light of your paragraph about "update", maybe "fetched" would be a better word than "updated"?"

I see what you mean, but I think it would probably be more confusing to use "fetched"; the colloquial sense of update is what I mean, and the "needs update" sense in git is surprising and wouldn't make sense in this context anyway.

"I don't understand how senses 2 and 3 are different."

In sense 2 ("remote-tracking branch") the tracking can only be about this special type of branch tracking the branch in a remote repository; very often there's no corresponding "local" branch (as opposed to a remote-tracking branch). In sense 3 ("--track") there must be a local branch, and the tracking is describing an association between the local branch and one in the remote repository (and usually to a corresponding remote-tracking branch).

Philip A, 2015-04-01:

Thanks so much for this really useful and informative blog post, which I have only recently discovered. I think focusing in on the confusing aspects of Git is a great way to improve one's general understanding. Particularly helpful are your explanations of 'remote-tracking branches' compared with 'branches which track branches on a remote'. One very minor thing which confused me was: "Your branch foo is 24 commits ahead of origin/bar and can be fast-forwarded". Shouldn't it be 'behind' instead of 'ahead'? Apologies if I have misunderstood. Anyway, once again … really good post :-)

Matthew Astley, 2016-02-17:

Thanks – good list, and several of us here misuse the jargon.

When describing the state of the work tree: clean, up-to-date and unmodified are easily mixed up.

Going by the command name, clean should mean no untracked files (aka cruft) in the work tree.

Unmodified should mean no changes to tracked files, but the command to do this is reset – another naming clash?

Up-to-date is trickier to define. There may be copies elsewhere containing commits you can't yet see. Is the HEAD (local branch) up-to-date with one remote at a point in time when the fetch ran? Should all local branches be up-to-date?

Mike Weilgart, 2016-04-16:

This is an excellent article; thank you!

Particularly helpful was the explanation of the three senses in which the word "track" may be used. The light went on when I realized that there are actually *three* things going on when you deal with a remote, not just two:

1. The branch on the remote repository;
2. The remote-tracking branch in your local repository;
3. The local branch associated with the remote-tracking branch.

This clarified for me how I should go about instructing students in this aspect of Git; thank you!

Simon Bagley, 2016-06-02:

Thank you for the very informative article. I am a complete novice with Git, and your article will help me understand the many very confusing names of commands, options, etc. It would be useful if you defined the terms 'refs' and 'refspec' before you used them.

Doug Kimzey, 2019-09-06:

I could not agree more. I have been using git for about 3 months now.
Git is not a full-blown language but a source control tool. The ambiguity of the git syntax contributes greatly to the confusion experienced by developers facing git for the first time. The concepts are not difficult, but the ambiguity of the git syntax constantly forces you to reach for a cheat sheet: the commands do not clearly describe their corresponding actions. Source control is a crucial part of any software project, and it is not an area that should be managed by confusing and ambiguous terminology and syntax. Confusion puts work at risk and adds time to projects. Commands that concisely describe the actions taken on source control are very important.

Thousands of developers use git. Millions of Americans use IRS tax forms. This does not mean that millions of Americans find IRS tax forms clear and easy to understand. Git terminology unnecessarily costs time and effort in translation.

Doug Kimzey, 2023-04-11:

This is very well said. If the git syntax were concise:

- There would be far fewer books on Amazon.
- The number of upvotes on git questions on Stack Overflow would be much smaller (some of these are literally in the thousands).
- There would be fewer cheat sheets and diagrams.
- The investment in time and effort would be much less.

The git documentation is poor because one vague command is described in terms of several ambiguous and confusing commands. Grammar is not the only measure of the quality of documentation.

I am adopting some of the rules in the Simplified Technical English standard (ASD-STE100) for the naming of commands used in console applications and utilities. The goals of this standard are to:

- Reduce ambiguity.
- Improve the clarity of technical text.
- Make user manuals more comprehensible for non-native speakers of English.
- Create better conditions for both human and machine translation.

These goals should be carried over to command syntax.
Thanks for having me up here. I am Matthew McCullough, a frequent speaker on Git, and when flying around I'm even willing to help people through the aisles, fasten their seat belts, and know what they need to do in case of an emergency. I have an octocat named after my daughter. But the trouble that we're talking about here today takes two forms, and it has a name. That trouble is named Git.

I apologize for having proliferated it so much at this point, because when thinking about it in two forms, from the positive and the negative side, the happy people say it has distributed capabilities: I can work when the network is offline; it's robust; it saves to disk; I can independently version things; I don't need any connectivity at all to make this function. At the surface that sounds awesome, but I remember back to the time when I had to drive in, go to a cubicle, and sit down at that specific terminal to check in code, and there's something nice about those fluorescent lights and that wonderful elevator music that played while I was there.

And then the Git proponents say, well, it's fault-tolerant, and it's network-enabled in the sense that you can easily work when it's online or off. It's like a mobile device: it caches your results. And this simply means that we can work whether or not we have the network. But I find this, in this hard economic time that we have today, to be a reason to stop buying hardware, which is sick, when this economy needs your help the most.

But when you think about the positive side, they then come back and say it's low-maintenance: you don't really have to do much with it; every five thousand objects it automatically garbage-collects, so in that sense it's beautiful, it's hands-off, it seems like it just basically takes care of itself. But what that really means is that it's taking away the opportunity of my friend Terence, who hopes someday to graduate with an SVN admin job. What's he going to do when he graduates and there's no more SVN to maintain? I don't know. Think about Terence.

They also claim it's blazing fast: you can commit three thousand objects, watch this, over my tethered device; it goes up into the cloud, and seven seconds later it's beautifully committed and visible on GitHub in the user interface. Can you do that with your version control system? Isn't that fantastic? But you know what? I remember the days when we could talk for 30 minutes with ClearCase about who was going to win Survivor, and I enjoyed that bonding with guys like Tad and Ryan. I don't have it anymore.

And in fact then they come back and say it's still fine, because it's open source and free, and it's all this liberty that we have at a conference like Open Source, embodied in a tool that cross-cuts all the possible languages that we might talk about at a conference like this. But you know what? If you don't fund software development, if you don't give them your heart in dollars, you might not have ClearCase in a few years. And can you imagine what it would be like to live in a world without ClearCase at your disposal? Do you want to live in that world?

They say it's disk-conservative: it only takes a small amount of hard disk to store every version ever, of every branch that you've ever committed, for anyone on the team, with any tag. But you know what? In my box is an amazing hard disk, a solid-state flash hard disk, and another one in another box with perpendicular bits, and that R&D didn't come from not using disks like wild. So think about what your next disk will be, and whether you're ruining that chance with Git.

Then, I used to commit a lot of bugs, and here's one of the things from the proponents of Git:
they'd say you could search for bugs and find where the regression happened, exactly what commit brought that into existence. Isn't that wonderful? You can isolate and find what you did wrong. But I remember the days when my friends Tad and Ryan would help support me with 150-comment commits to help drown that bug, so when that engineer wanted to go find it, the hope of that was zero, just like I wanted it to be.

But then they still keep coming back and saying it's like cats and dogs living together, the compatibility is wonderful: we can talk to ClearCase and to Subversion and Perforce with these back-and-forth conversion and round-tripping utilities, Subversion being the special mark there. But you know what? This is a bridge for them to be using tools covertly that are not approved by the IT organization, and that leads to all kinds of other havoc in every frame. Soon we'll be using libraries you don't even understand what they are.

They claim it's free as in beer, as their last point: it helps in hard times with keeping a minimal budget, really allowing us to have a great tool at a small cost that doesn't need maintenance, and that helps us all. But I bet you, if you keep on this track, one day somebody's even going to take it to the edge, saying they're going to have a free operating system, and my word, I can't even think what chaos that'll bring to the table. Do you want that? Do you want a free operating system? I don't think so.

So I ask you, even though I've taught Git for this long: reconsider. Salvage it for the sake of the people in college today. Stop using Git. You're ruining their lives, their hopes, their dreams, their future, and it needs to end now.

Thank you.
# Document Title

badmerge -- abstract version

See also the concrete example with real C code where darcs merge does the right thing and the others do the wrong thing. See also an attempted counter-example where darcs merge allegedly does the wrong thing, along with my argument that this is not actually a counter-example.

```
  a
 / \
b1  c1
|
b2
```

Now the task is to merge b2 and c1. As far as I know, all revision control tools except for darcs, Codeville, SCCS, and perhaps BitKeeper treat this as a 3-merge of (a, b2, c1).

The git or svn 3-merge algorithm and the GNU diff3 algorithm apply the changed line from c1 to the wrong place in b2, resulting in an apparently clean but actually wrong result.

merged by SVN v1.2.0, GNU diff3 v3.8.1, or git v1.5.6.3

Whereas darcs applies the changed line from c1 to the right place in b2, resulting in a clean merge with the correct result.

merged by darcs v1

The difference between what svn does and what darcs does here is, contrary to popular belief, not that darcs makes a better or luckier guess as to where the line from c1 should go, but that darcs uses information that svn does not use -- namely the information contained in b1 -- to learn that the location has moved and precisely to where it has moved.

Any algorithm which merges c1 with b2 by examining (a, b2, c1) but without reference to the information in b1 is doomed to do worse than darcs does in this sort of case. (Where a silent but incorrect merge, such as SVN currently produces, is worse than raising it to the user as a conflict, which is worse than merging it correctly.)

It is important to understand that darcs merge performs better in this example not because it is "luckier" or "more complicated" than 3-merge (indeed, darcs merge is actually a much simpler algorithm than 3-merge for this example) but because darcs merge takes advantage of more information -- information which is already present in other systems such as SVN but which is not used in the current SVN merge operation.

code listings

listing a, abstract

```
A
B
C
D
E
```

listing diff from a to b1, abstract

```
@@ -1,3 +1,6 @@
+G
+G
+G
 A
 B
 C
```

listing b1, abstract

```
G
G
G
A
B
C
D
E
```

listing diff from b1 to b2, abstract

```
@@ -1,3 +1,8 @@
+A
+B
+C
+D
+E
 G
 G
 G
```

listing b2, abstract

```
A
B
C
D
E
G
G
G
A
B
C
D
E
```

listing diff from a to c1, abstract

```
@@ -1,5 +1,5 @@
 A
 B
-C
+X
 D
 E
```

listing c1, abstract

```
A
B
X
D
E
```

listing 3-way merged, abstract

```
A
B
X
D
E
G
G
G
A
B
C
D
E
```

listing darcs merged, abstract

```
A
B
C
D
E
G
G
G
A
B
X
D
E
```

Thanks to Nathaniel J. Smith for help with this example.

Thanks to Ken Schalk for finding bugs in this web page.

zooko
Last modified: Sat Jan 10 15:04:11 MST 2009
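The two outcomes above are easy to reproduce. The Python sketch below is not darcs' actual patch algebra; the hunk representation and the `apply_hunk` and `commute_past_insert` helpers are invented here purely for illustration. It shows a position-blind application of c1's hunk to b2 silently succeeding at the wrong line, while commuting the hunk through the a→b1 and b1→b2 insertions first lands it on the right one.

```python
# A minimal sketch of why the intermediate patches matter.  The hunk
# format and both helper functions are invented for this illustration;
# darcs' real patch theory is considerably more general.

a  = list("ABCDE")              # common ancestor
b1 = list("GGGABCDE")           # a  -> b1: insert G,G,G at line 0
b2 = list("ABCDEGGGABCDE")      # b1 -> b2: insert A..E at line 0
c1 = list("ABXDE")              # a  -> c1: replace line 2, C -> X

hunk = (2, "C", "X")            # c1's change, recorded against `a`

def apply_hunk(text, hunk):
    """Replace one line, checking the old text is really there."""
    i, old, new = hunk
    assert text[i] == old, f"context mismatch at line {i}"
    return text[:i] + [new] + text[i + 1:]

assert apply_hunk(a, hunk) == c1   # sanity check against the ancestor

# A position-blind merge applies the hunk to b2 at the position it had
# in `a`.  Line 2 of b2 happens to hold a "C" as well (the freshly
# inserted copy), so the merge is silently wrong:
print("".join(apply_hunk(b2, hunk)))          # ABXDEGGGABCDE

def commute_past_insert(hunk, at, count):
    """Shift a hunk's position past an insertion of `count` lines."""
    i, old, new = hunk
    return (i + count, old, new) if at <= i else hunk

# Commuting the hunk through a->b1 and then b1->b2 updates its position
# to wherever the changed line has actually moved:
hunk = commute_past_insert(hunk, at=0, count=3)   # now line 5 in b1
assert b1[hunk[0]] == "C"
hunk = commute_past_insert(hunk, at=0, count=5)   # now line 10 in b2
print("".join(apply_hunk(b2, hunk)))          # ABCDEGGGABXDE
```

The sketch makes the article's point concrete: the corrected position comes entirely from the intermediate patches, i.e. from the information in b1 that a 3-merge over (a, b2, c1) never consults.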
- https://www.boringcactus.com/2021/02/22/can-we-please-move-past-git.html- https://jneem.github.io/merging/- https://jneem.github.io/pijul/- https://jneem.github.io/cycles/- https://jneem.github.io/ids/- https://jneem.github.io/pseudo/- https://pod.link/1602572955/episode/8070d95c8c25fd131709f03f6495589a- https://nibblestew.blogspot.com/2020/12/some-things-potential-git-replacement.html- https://v5.chriskrycho.com/essays/jj-init/- https://tildes.net/~comp/aqk/what_other_version_control_systems_do_people_use_other_than_git- https://arxiv.org/abs/1311.3903- https://stackoverflow.blog/2023/05/23/for-those-who-just-dont-git-it-ep-573/- https://foojay.io/today/foojay-podcast-26/- https://www.youtube.com/watch?v=7MpdZkGj5AI- https://www.youtube.com/watch?v=M4KktA_jbOE- https://www.youtube.com/watch?v=o0ooKVikV3c- https://www.youtube.com/watch?v=lW0gxMbyLEM- https://www.youtube.com/watch?v=bx_LGilOuE4- https://jvns.ca/blog/2023/12/31/2023--year-in-review/#a-lot-of-blog-posts-about-git- https://stackoverflow.blog/2023/01/09/beyond-git-the-other-version-control-systems-developers-use/- https://katafrakt.me/2017/05/27/beyond-git/- https://www.fossil-scm.org/home/doc/trunk/www/fossil-v-git.wiki- https://www.youtube.com/watch?v=WKVX7xq58kA- https://www.youtube.com/watch?v=ghtpJnrdgbo- https://www.youtube.com/watch?v=jPSOxVjK8A0- https://www.youtube.com/watch?v=584sAUbHU1o- https://www.youtube.com/watch?v=o4PFDKIc2fs- https://www.youtube.com/watch?v=2XQz-x6wAWk- https://www.youtube.com/watch?v=7d3Qnr5F9zs- https://jesseduffield.com/Lazygit-5-Years-On/- https://www.forrestthewoods.com/blog/dependencies-belong-in-version-control/- https://dev.to/yonkeltron/is-it-time-to-look-past-git-ah4- https://longair.net/blog/2012/05/07/the-most-confusing-git-terminology- https://mitchellh.com/writing/github-changesets- https://www.youtube.com/watch?v=RPmeZH8sOs8- https://matklad.github.io/2023/10/23/unified-vs-split-diff.html- https://blog.waleedkhan.name/bringing-revsets-to-git/- https://blog.waleedkhan.name/git-ui-features/- https://blog.waleedkhan.name/in-memory-rebases/- https://engineering.fb.com/2022/11/15/open-source/sapling-source-control-scalable/- http://darcs.net/RosettaStone- https://en.wikipedia.org/wiki/Comparison_of_version-control_software- http://www.cheat-sheets.org/#Bazaar- http://www.cheat-sheets.org/#Git- http://www.cheat-sheets.org/#CVS- https://meldmerge.org/- https://duckrowing.com/2013/12/26/bzr-init-a-bazaar-tutorial/- https://beza1e1.tuxen.de/monorepo_vcs.html- https://www.inkandswitch.com/local-first/- https://mattweidner.com/2023/09/26/crdt-survey-1.html- https://inria.hal.science/hal-02983557/document- https://mattweidner.com/2022/02/10/collaborative-data-design.html- https://crdt.tech/papers.html- https://inria.hal.science/hal-00738680/PDF/RR-8083.pdf- https://lewiscampbell.tech/sync.html- https://initialcommit.com/blog/pijul-creator- https://tahoe-lafs.org/~zooko/badmerge/simple.html- https://ohshitgit.com/- https://nest.pijul.com/tae/pijul-for-git-users- https://www.youtube.com/watch?v=ahyF8e9qKBc- https://github.com/git-game/git-game