fetsorn/pijul-spec: resources/20221018

My name is Martin von Zweigbergk. I expected my speaker notes to be here somewhere, but yeah. I work on source control at Google, and I'm going to talk about a project I've been working on for almost three years. It's a Git-compatible VCS called Jujutsu. In case you're wondering, the name has nothing to do with Jujutsu Kaisen, the anime. It's called Jujutsu just because the binary is called jj, and the binary is called jj because it's easy to type. Oh, okay, now there it is.

So here's an overview of the presentation. First I'm going to give you some background about me and about the history of source control at Google. Then I'm going to go through the workflows and architecture of JJ, and at the end I'll explain what's next for the open source project and for our integration at Google.

So, background about me. After graduating, I worked for Ericsson for about seven years. While there, I think it's fair to say I drove the migration to Git from ClearCase, and I cleaned up some Git rebase scripts in my spare time. Then I joined Google and worked on a compensation app, and for the last eight years I've worked on Fig, which is a project to integrate Mercurial as a client for our in-house monorepo.

For context, let me tell you a bit about the history of version control at Google. A long time ago we supposedly started with CVS, and then we switched to Perforce. After a while Perforce wasn't able to handle the repository anymore because it got too large, so we wrote our own VCS called Piper. But the working copy was still too big, so we created a virtual file system called CitC on top of Piper, and that's what almost every user at Google uses now. People were still missing the DVCS workflows that they were used to from outside Google, so we added Mercurial on top of that, and as I said, that's what I've been working on for the last eight years. Also, in case you didn't know, our monorepo at Google is extremely large and has pretty much all the source code at Google. You can watch the talk by Rachel Potvin from @Scale if you're curious.

Generally, people really like Fig, but there are still some major problems we're having. Probably the biggest one is performance. That's partly because of Python, since Python is slow, and partly because of eager data structures that don't scale to the size of the repo. Another problem is consistency: we're seeing write races because Mercurial is not designed for distributed storage, so we get corruption when we store it on top of our distributed file system. Another pain point has been integration, because we're calling the Mercurial CLI and parsing the output, which is not fun.

A few years ago we started talking about what we would want from a next-gen VCS. One of the ideas that came up was to automatically make a commit from every save in your editor. I got really excited by that idea and started JJ to experiment with it. I worked on it as my 20% project for about two years, and this spring we decided to invest more, so now it's my 100% project. Next.

You may be wondering why I didn't decide to just add these features to Git. As I said, I wanted to experiment with a different UX, and I think that would have ended up being a completely separate set of commands inside Git. That would be really ugly, and it wouldn't and shouldn't get accepted upstream. I also wanted to be able to integrate into Google's ecosystem, and with Fig we had already decided against using Git there because of the problems with integrating it with our ecosystem at Google. One of the problems is that there are multiple implementations of Git that read from the file system, so we would have to add any new features in at least two or three places.

Okay, so let's take a look at the workflows. The first feature is anonymous branches, which is something I copied from Mercurial. Instead of Git's scary detached-HEAD state, JJ keeps track of all your commits without you having to name them. That means they do show up in the log output and they will not be garbage collected. It may seem like you would get a very cluttered log output very quickly, but whenever you rewrite a commit, for example if you amend it or rebase it, the old version gets hidden. You can also manually hide commits with jj abandon.

One of the first things you'll notice if you start using JJ is that the working copy is an actual commit that gets automatically committed. It shows up in the log output with an @ symbol. Whenever you make any changes in the working copy and then run any jj command, those changes get automatically amended into that commit. You can use the jj checkout command, and that actually creates a new commit on top of the commit you specify, to store your working-copy changes. If you instead want to resume editing an existing commit, you can use jj edit.
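The snapshot-on-every-command behavior described above can be sketched as a toy model. This is only an illustration under stated assumptions, not JJ's implementation: `Commit`, `ToyRepo`, `snapshot`, `run`, and `checkout` are hypothetical names, and a plain dict stands in for both the files on disk and a commit's tree.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    parent: "Commit | None"
    files: dict              # path -> contents, a toy stand-in for a tree
    description: str = ""


@dataclass
class ToyRepo:
    working_copy: Commit                         # the "@" commit
    disk: dict = field(default_factory=dict)     # toy stand-in for files on disk

    def snapshot(self):
        """Amend the working-copy commit so that it matches what is on disk."""
        if self.disk != self.working_copy.files:
            self.working_copy = Commit(
                self.working_copy.parent, dict(self.disk), self.working_copy.description
            )

    def run(self, command):
        """Every command first snapshots the working copy, then does its own work."""
        self.snapshot()
        return command(self)


def checkout(target):
    """Toy checkout: start a new commit on top of `target` for future edits."""
    def command(repo):
        repo.working_copy = Commit(parent=target, files=dict(target.files))
        repo.disk = dict(target.files)
        return repo.working_copy
    return command


root = Commit(parent=None, files={})
repo = ToyRepo(working_copy=Commit(parent=root, files={}))
repo.disk["a.txt"] = "hello"                     # edit a file, no explicit commit
edited = repo.run(lambda r: r.working_copy)      # any command triggers the snapshot
on_top = repo.run(checkout(edited))              # new working-copy commit on top
```

In this sketch the edit to `a.txt` ends up in the working-copy commit without any explicit commit step, which is the property the talk relies on for "the working copy is never dirty".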

This has some very interesting consequences. One important one is that the working copy is never dirty, so if you check out a different commit or rebase or something, it will never tell you that you have unstaged changes. You get automatic backup, because every command you run creates another automatic backup for you. It makes stash unnecessary, because your working-copy commit is effectively a stash. It also makes commit unnecessary, because, well, it's already committed: whenever you run a command, your working copy is committed. If you want to set the description, the commit message, you can run jj describe to do that at any time.

We get a more consistent CLI, because the @ commit, the working-copy commit, behaves just like any other commit, so there are no special flags to act on the working copy. For example, jj restore can restore files between any two commits. It defaults to restoring from the parent of the working copy and into the working copy, just like git restore does, I think, but you can pass any two commits. Also, uncommitted changes stay in place, so they don't move around with you like they do with git checkout.

Another one of JJ's distinguishing features is first-class conflicts. If you look at the screenshot there, we merge the working copy, @, with some other commit from another branch, and that succeeds even though there are conflicts. We can see in the log output afterwards that there are conflicts in the working copy. As you see, the working-copy commit there, with the @, is a merge commit with conflicts. These conflicts are recorded in the commit in a structured way, so they're not just conflict markers stored in a file.
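One way to picture conflicts stored "in a structured way" is as a value that records which content was removed and which contents were added, instead of marker text in a file. The sketch below is a toy under that assumption: `Conflict`, `merge_values`, and `merge_trees` are hypothetical names, and trees are plain dicts of path to content.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conflict:
    """Toy structured conflict: the base that was removed and the sides that were added."""
    removes: tuple
    adds: tuple


def merge_values(base, left, right):
    """Three-way merge of one file's content; returns content or a Conflict."""
    if left == right:
        return left          # both sides made the same change
    if left == base:
        return right         # only the right side changed
    if right == base:
        return left          # only the left side changed
    return Conflict(removes=(base,), adds=(left, right))


def merge_trees(base, left, right):
    """Merge toy trees. It never fails; conflicts simply become values in the result."""
    merged = {}
    for path in sorted(set(base) | set(left) | set(right)):
        value = merge_values(base.get(path), left.get(path), right.get(path))
        if value is not None:    # None means the file is absent
            merged[path] = value
    return merged


base = {"f": "a\n", "g": "x\n"}
left = {"f": "b\n", "g": "x\n"}
right = {"f": "c\n", "g": "y\n"}
merged = merge_trees(base, left, right)
```

Because the merge returns a tree that may contain `Conflict` values rather than raising an error, a commit can be written with the conflict in it and resolved later, which matches the behavior shown in the screenshot.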

This design leads to even more magic. Maybe an obvious one is that you can delay resolving conflicts until you're ready, until you feel like it. In this case, when we had this conflict, we could just check out another commit and deal with the conflicts later if we wanted to. You can also collaborate on conflict resolutions: you can resolve some of the conflicts in these files and leave the rest for your coworker, for example. Maybe less obvious is that a rebase never fails. We saw here that merge doesn't fail, and the same thing is true for rebase and all the other similar commands, which makes continue and abort for rebase and cherry-pick and all those similar commands unnecessary.

We also get a more consistent conflict resolution flow. You don't need to remember which command created your conflicts. You just check out the commit with the conflict, resolve the conflicts in the working copy, and then squash that conflict resolution into the parent that has the conflict. Or, as in the screenshot here, we were already editing that commit, so we just resolve the conflict in the working copy and it's gone from the log output. So now, in the second screenshot, the conflict has been resolved in the merge commit there, in the working copy. That working copy resolves the conflict, right? We're going to come back to that in the next slide.

A very important feature, which took me a year to realize was possible, is that because rebase always succeeds, we can always rebase all descendants. So if you check out a commit in the middle of a stack and amend in some changes, JJ will always rebase all the descendants on top, so you will not have to do that yourself. It also moves the branches over and so on.
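The automatic rebase of descendants can be illustrated with a toy commit graph. This is a sketch under stated assumptions, not JJ's algorithm: commits are dicts holding a parent id and an opaque change label, `new_commit` and `amend` are hypothetical helpers, and the real tool would also re-apply each change onto the new parent's tree (possibly recording conflicts), which is not modeled here.

```python
import itertools

_ids = itertools.count(1)


def new_commit(commits, parent, change):
    """Store a toy commit (parent id plus a change label) and return its new id."""
    commit_id = f"c{next(_ids)}"
    commits[commit_id] = {"parent": parent, "change": change}
    return commit_id


def amend(commits, branches, old_id, new_change):
    """Rewrite `old_id`, then rebase every descendant onto the rewritten commit."""
    rewritten = {old_id: new_commit(commits, commits[old_id]["parent"], new_change)}
    # Insertion order is creation order here, so parents are visited before children.
    for commit_id, commit in list(commits.items()):
        if commit["parent"] in rewritten:
            rewritten[commit_id] = new_commit(
                commits, rewritten[commit["parent"]], commit["change"]
            )
    # Move branches that pointed at rewritten commits over to the new versions.
    for name, target in list(branches.items()):
        branches[name] = rewritten.get(target, target)
    return rewritten


commits = {}
a = new_commit(commits, None, "A")
b = new_commit(commits, a, "B")
c = new_commit(commits, b, "C")
branches = {"main": c}
rewritten = amend(commits, branches, a, "A (amended)")
```

The old commits stay in `commits`, which loosely mirrors how rewritten commits become hidden rather than deleted, and the `main` branch follows the rebased tip.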

We can actually even rebase conflicts and conflict resolutions. I don't have time to explain how that works, but it does.

In the top screenshot here, which is a continuation from the previous slide, we rebased one side of the merge, conflicting change 2 there and its descendants, onto conflicting change 1. Of course we get the same conflict as we had in the merge commit before, and that stays in the non-conflicting change. But then in the working copy, because we had the conflict resolution in the working copy, which was a merge commit before, that resolution gets rebased over and it stays in the working copy, so we don't have a conflict in the working copy here.

Then, in the second screenshot, we ran jj move --to with that commit, the first commit with the conflict. That command is a bit like the autosquash that someone talked about earlier today, but much more generic: you can move a change from any commit into any other commit. In this case we're moving the changes from the working copy, which is the default when there's no --from, into the first commit with the conflict. Then the conflict is gone from the entire stack, and the working copy in this case becomes empty.

We can also rebase merges correctly, by rebasing the diff compared to the auto-merged parents onto the new auto-merged parents. That's actually how JJ treats all diffs or changes in commits, not just when rebasing. So when you diff a merge commit, it's compared to the auto-merged parents, just like the remerge-diff that I think Elijah talked about before.

The last feature I was going to talk about is what I call the operation log. You can think of it as the entire repo being under source control, and by the entire repo I mean refs and anonymous heads and the working copy's position in each worktree. It's kind of like Git's reflogs, but across all the refs at once, and there's some extra metadata as well.

In the top screenshot there you can see the username and hostname and time of the operation. As you can imagine, having these snapshots of the repository at different points in time lets us go back and look at the repository from a previous snapshot. In the middle screenshot there, we run jj status at a particular operation, and that shows the state just before we set the description on the commit. So we see that the working-copy commit has no description, but it has a modified file, because it's from that point in time. You can run this with any command, not just status; you can run log or diff or anything. Of course this is very useful when you're trying to figure out how a repository got into a certain state, especially if it's not your own repository and you don't remember what happened.

The last screenshot there shows how we restored the entire repo to an earlier state, before the file was even added in this case. That's the second operation from the bottom, when we had just created the repository, so the working copy is empty and there's no description.

Restoring back to an earlier state like that is very useful when you have made a mistake, but JJ actually even lets you undo just a single operation without undoing all the later operations. That works kind of like git revert does, but instead of acting on the files in a commit, it acts on the refs in an operation. The screenshots here show how we undo an operation, the second-to-last operation there, which abandoned a commit. We undo that operation and the commit becomes visible again.
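The comparison with git revert suggests a three-way merge over refs: apply the reverse of one operation on top of the current state. The sketch below assumes a repository view can be modeled as a dict from ref name to commit id; `merge_ref` and `undo` are hypothetical names and the conflict tuple is only a placeholder, so this is an illustration of the idea rather than JJ's code.

```python
def merge_ref(current, after_op, before_op):
    """Three-way merge of one ref: apply the reverse of the operation to `current`."""
    if after_op == before_op or current == before_op:
        return current            # the operation did not touch it, or it is already undone
    if current == after_op:
        return before_op          # untouched since the operation: simply revert it
    return ("conflict", current, after_op, before_op)   # toy conflict marker


def undo(current_view, op_view, parent_view):
    """Undo one operation on top of `current_view`, keeping all later operations."""
    result = {}
    for name in sorted(set(current_view) | set(op_view) | set(parent_view)):
        value = merge_ref(current_view.get(name), op_view.get(name), parent_view.get(name))
        if value is not None:     # None means the ref does not exist
            result[name] = value
    return result


view_1 = {"main": "c1", "feature": "c2"}   # state before the abandon
view_2 = {"main": "c1"}                    # operation 2 abandoned the commit behind "feature"
view_3 = {"main": "c3"}                    # operation 3 later moved "main"
restored = undo(view_3, op_view=view_2, parent_view=view_1)
```

Here undoing operation 2 brings `feature` back while the later move of `main` is kept, which is the "undo a single operation without undoing all the later operations" behavior from the talk.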

The operation log also gives us safe concurrency. Even if you have run commands on a repository on different machines and synced them via Dropbox, for example, you can get conflicts, of course, but you will not lose data or get corruption. In those cases, if you sync two operations via Dropbox, you'll see a merge in the operation log output.

We also get a simpler programming model, because transactions never fail. When you start a command, the repository is loaded at the latest operation, and any changes that happen concurrently will not be seen by that command. Then, when the command is done, it commits that transaction as a new operation, which gets recorded and can't fail, just like writing a commit can't fail. This is actually why I developed the operation log: for the simpler programming model. The undo and time travel stuff is just an accident.
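A toy model may help show why recording an operation never fails: each operation only names the operation it started from, so two concurrent commands simply produce two heads that get merged later. Everything here is an assumption made for illustration: `ToyOpLog`, `record`, and `load` are hypothetical names, and ids are sequential strings, whereas a real store shared between machines would need ids that cannot collide.

```python
class ToyOpLog:
    def __init__(self):
        self.ops = {"op0": {"parents": [], "description": "initialize repo"}}
        self.heads = {"op0"}

    def record(self, loaded_at, description):
        """Commit a transaction that started from `loaded_at`; appending never fails."""
        op_id = f"op{len(self.ops)}"
        self.ops[op_id] = {"parents": list(loaded_at), "description": description}
        self.heads -= set(loaded_at)
        self.heads.add(op_id)
        return op_id

    def load(self):
        """Return the operation to start from, merging concurrent heads if needed."""
        if len(self.heads) > 1:
            self.record(sorted(self.heads), "merge concurrent operations")
        (head,) = self.heads
        return head


log = ToyOpLog()
start = log.load()
first = log.record([start], "describe commit")
second = log.record([start], "abandon commit")   # ran concurrently from the same start
merged = log.load()                              # the next command sees a merge operation
```

Neither `record` call had to check for or reject the other, and the divergence only shows up as a merge operation the next time the log is loaded.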

Okay, so that's all about features and workflows. Let's take a look at the architecture. JJ is written in Rust. It's written as a library, to be easy to reuse. The CLI is currently the only user of that library, but the goal is to be able to use it in a server or a CLI or a GUI or an IDE, for example. I also hope that by making it a library we reduce the temptation to rewrite it and end up with the same problem as Git.

Because I wanted JJ to be able to integrate with the ecosystem at Google, I tried to make it easy to replace different modules. For example, it comes with two different commit storage backends by default. One is the Git backend, which stores commits as Git commits, and the other one is just a very simple custom one. But it should be possible to write one that stores commits in the cloud, for example.
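The idea of replaceable commit storage can be sketched as an interface with interchangeable implementations. JJ itself is written in Rust and its real backend interface is not shown in the talk, so the names below (`Backend`, `InMemoryBackend`, `write_commit`, `read_commit`, `describe`) are hypothetical and only illustrate the design.

```python
import hashlib
import json
from typing import Protocol


class Backend(Protocol):
    """Hypothetical commit-storage interface; the method names are illustrative only."""

    def write_commit(self, commit: dict) -> str: ...

    def read_commit(self, commit_id: str) -> dict: ...


class InMemoryBackend:
    """A very simple custom backend that keeps serialized commits in a dict."""

    def __init__(self):
        self._commits = {}

    def write_commit(self, commit: dict) -> str:
        data = json.dumps(commit, sort_keys=True).encode()
        commit_id = hashlib.sha256(data).hexdigest()   # content-addressed id
        self._commits[commit_id] = data
        return commit_id

    def read_commit(self, commit_id: str) -> dict:
        return json.loads(self._commits[commit_id])


def describe(backend: Backend, commit_id: str, description: str) -> str:
    """Code written against the interface works with any backend implementation."""
    commit = backend.read_commit(commit_id)
    return backend.write_commit({**commit, "description": description})


backend = InMemoryBackend()
first = backend.write_commit({"parents": [], "description": ""})
second = describe(backend, first, "initial commit")
```

A Git-backed or cloud-backed store would implement the same two methods, and `describe` would not need to change.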

Same thing with the working copy. It's of course stored on local disk normally, but the goal is to be able to replace that with an implementation that writes the working copy to a VFS, for example, or that integrates with a smarter VFS by not actually writing all the files.

Also, to be able to use it at Google, it needs to be scalable to very, very large repositories, which means we can't have any operations that need all the ancestors of a commit, for example. We achieve that mostly by laziness, by not fetching objects that we don't need, and by being careful when designing algorithms so that they don't scale with the size of the repository or the size of a tree unless necessary.
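The laziness described above can be illustrated with a generator that walks ancestors and fetches each commit only when the caller actually asks for more. This is a minimal sketch, not JJ's code: `ancestors` and `fetch_commit` are hypothetical names, and the dict-based store stands in for a backend that might live on a server.

```python
from collections import deque
from itertools import islice


def ancestors(fetch_commit, start_id):
    """Lazily yield `start_id` and its ancestors, fetching each commit only when reached."""
    seen = {start_id}
    queue = deque([start_id])
    while queue:
        commit_id = queue.popleft()
        yield commit_id
        # Runs only if the caller asks for more, so unneeded commits are never fetched.
        for parent_id in fetch_commit(commit_id)["parents"]:
            if parent_id not in seen:
                seen.add(parent_id)
                queue.append(parent_id)


# A linear history of 1000 toy commits: c999 -> c998 -> ... -> c0.
store = {f"c{i}": {"parents": [f"c{i - 1}"] if i else []} for i in range(1000)}
fetched = []


def fetch_commit(commit_id):
    fetched.append(commit_id)     # record every (potentially remote) fetch
    return store[commit_id]


recent = list(islice(ancestors(fetch_commit, "c999"), 3))
```

Asking for the three most recent commits touches only a handful of entries in `store`, so the cost depends on how much history is requested rather than on the size of the repository.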

Another important design decision was to perform operations in the repository first, not in the working copy. All operations just write commits and update references to point to those commits, and only afterwards, after the transaction commits, is the working copy updated to match the new state, if the working copy was even changed by the operation. It helps here that the working copy is a commit, because even when you're running a command that updates the working copy, it works the same way: you just commit the transaction that modifies the working-copy commit, and then update the working copy afterwards.

We get a simpler programming model, because we always create commits, whatever commit we need, and we don't need to worry about what's in the working copy at that time. It's the same thing Elijah was talking about, I think, with merge-ort. This also makes it much faster, by not touching the working copy unnecessarily. And it helps with the laziness, because you don't need to fetch objects from the backend, which might be on a server, just in order to update the working copy and then update it back later, for example.

As I said, JJ is Git-compatible, and you can start using it independently of others who use the same Git project. This compatibility happens at a few different levels. At the lowest level there's the commit storage. As I said, there's one commit storage backend that stores commits in a backing Git repository. That reads commits from the backing Git repository's object store and converts them into the in-memory representation that JJ expects. There's also functionality for importing Git refs into JJ and exporting refs to Git, and for interacting with Git remotes. Of course, these commands only work when the backend is the Git backend, not with the custom backend.

Of course, since I worked on Mercurial for many, many years, there are many good things I wanted to replicate, such as the simple, clean UX with, for example, its revset language for selecting revisions, and the simpler workflows without the staging area. So I copied those. Mercurial's history rewriting support is also pretty good, with hg split and fold, for example, for splitting and squashing commits, and hg rebase, which rebases a whole tree of commits, not just a linear branch, or a single branch at least. So I copied those things as well.

Then, Mercurial is very customizable, mostly thanks to being written in Python. Since JJ is written in Rust, we can't just copy that; you can't use monkey patching in the same way. I hope the modular design I talked about earlier at least helps there, but we still have a long way to go to be as customizable as Mercurial.

Okay, so let's take a look at our plans for the open source project and our integrations at Google. Remember, this has been my 20% project for most of its lifetime, so there's a lot of features and functionality missing. For example, GC and repacking are not implemented, so we need to implement that for both commits and operations. For backends that lazily fetch objects over the network, we need to do some batch prefetching and caching to make that perform okay. There is no support for copies or renames, so we have to decide if we want to track them like Mercurial or BitKeeper does, for example, or if we want to detect them like Git does. Happy to hear suggestions.

And just lots of other features are missing. For this particular one I actually got a pull request a few days ago, but still, something like git blame, for example, is not there. We need to make it easier to replace different components, for example adding a custom command or adding a custom backend while using the API, without having to patch the source code. And yeah, language bindings: again, we want to avoid rewrites, so I hope people can use the API instead, and we want to make that as easy as possible in different languages.

Security hardening. And the custom backend, the non-Git backend, has very little support; there is no support for push and pull, for example. That would need to be done at some point, but the Git backend is just what I would recommend to everyone for many, many years ahead anyway. Of course, contributions are very welcome. The URL for the project is there. We don't want this to be just Google running the project.

As I said, this is now my full-time project at Google, and the goal of that project is to improve the internal development ecosystem. There are two main parts of this project. One is to replace Mercurial with JJ internally. The other part is to move the commit storage and repository storage out of the file system and into the cloud. We're hoping to create a single huge commit graph with all the commits from all Google developers in just one graph. By having the repositories available in the cloud, you should be able to access them from anywhere, and a cloud IDE, for example, or a review tool should be able to access your repository and modify it.

This diagram here shows how we're planning to accomplish that. Most of this is the same as on a previous slide; I've added some new systems and integrations in red. One of the first things we're going to have to do is to replace the commit graph, which is called the commit index in the diagram, because it currently assumes that all ancestors are available, which of course doesn't scale to the size of our monorepo. So we're going to have to rewrite that to make it lazy. I'm probably going to use something called segmented changelog, which our friends at Meta developed for their Mercurial-based VCS.

Then we're going to add new backends for the internal cloud-based storage of commits and operation logs. Well, operation logs are basically repositories. We will add a custom working copy implementation for our VFS. We're hoping to not use sparse checkouts; instead, with our VFS we can just tell the VFS which commit to check out, basically, and a few files on top. We'll probably add a bunch of commands for integrating with our existing review tools, like starting a review or merging into mainline and stuff like that. Then we're going to add a server on top of this that will be used by the cloud IDE and the review tool.

Yeah, that was all I had, so thanks for listening. If you have any questions, you can find a link to the Discord chat, for example, from the project's GitHub repository page, and feel free to email me at martinvonz@google.com. Thank you.

[Applause]