Thank you. I work on way too many projects at the same time. Right now I'm working on Coturnix, which is a way to share renewable electricity with your neighbors efficiently, using mathematics to do that. And I've been working for a few years now on Pijul, a version control system also based on mathematics, which is what I'm going to talk about today.
I have way too many things to tell you. I'll start by talking about what version control is, with a brief recap of where it comes from, how it started, and where we're at now. Then I'll talk about our solution and the principles behind it, and then about the implementation of that version control system, including one of the fastest database backends in the world, which I was forced to write in order to implement Pijul correctly. And finally I'll have some surprise announcements to make about the hosting platform where you can host repositories.
So first, version control is actually very simple, and it's not specific to coders: it's when one or more co-authors edit trees of documents concurrently. One key feature of version control, compared to things like Google Docs for example, is the ability to do asynchronous edits. This sounds like it should be easier, because going asynchronous gives more flexibility to your users, but it's actually the opposite: when you allow co-authors to choose when they want to sync, or merge their changes, things become much more complicated. The main reason is that edits may conflict, and conflicts happen in human work. I'm not claiming to have a universal solution to all the conflicts humans may have; I'm merely trying to help them model their conflicts in a proper way. Yet another feature of version control we might like is the ability to review a project's history: to tell when a feature or a bug was introduced, who introduced it, and sometimes get an indication of how to fix it.
Many of you here, or many people I talk to about this project, think that version control is a solved problem, because of tools like git, Mercurial, SVN, CVS; people sometimes mention Fossil or Perforce. We have a huge collection of tools, and they're probably considered one of the greatest achievements of our industry. And yet nobody outside our field uses them. These days we have silverware made by NASA with materials that can go to Mars, but our greatest achievements cannot be used even by editors of legal documents, or parliaments; even the video game industry doesn't use them. And it's not because they're too young: they've been around for quite a number of decades now.
Lately there's been a trend of doing distributed version control. That's cool, there's no central server, except that the tools are unusable if you don't use a central server, and actually something worse than a central server: a global central server, universal to all projects. And our current tools require strong work discipline and planning; you have to plan things in advance. The picture on the right is a typical example of a "workflow considered useful", and I personally don't understand it.
But yeah, onboarding diverse people, people from outside our field, without, for example, formal training in computer science: show them this tool and they'd be like, are you crazy? And that's just a small part of what I'm about to say, because we have workflows that engineers in any other industry in the world would just laugh at, if they knew about them. So my claim is that by using these tools, we are wasting significant human work time at a global scale. I don't know how many millions of engineer-hours are wasted every year on fixing that rebase that didn't work, or re-fixing that conflict again, or wait, re-re-fixing it; there's actually a command in git called rerere.
Some improvements have been proposed using mathematics, Darcs for example, but unfortunately they don't really scale well. Just a note before I go any further: Pijul is open source, and I'm not considering non-open version control systems, because git is good enough. It's an okay system; it's phenomenal in some ways. If you're going for a commercial system, you'd better be at least as good as that, and I don't know of any system that achieves it anyway.
So, our demands for a version control system. We want associative merges. Associativity is a mathematical term, and here it means that taking changes A and B together should be the same as taking A followed by B. That sounds absolutely trivial, and I'll give you a picture in a minute that makes it even clearer. Next, we want commutative merges: the property that if A and B are produced independently, the order in which you apply them doesn't matter. You have Alice and Bob working together: if Alice pulls Bob's changes, it should result in the same thing as if Bob pulls Alice's changes.
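A compact way to write the two properties, with ⊔ standing for merging changes into a common state (the notation is mine, not from the slides):

```latex
% Associativity: merging with B and then with C
% is the same as merging with (B merged with C) at once
(A \sqcup B) \sqcup C = A \sqcup (B \sqcup C)
% Commutativity: independent changes can be applied in either order
A \sqcup B = B \sqcup A
```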
Then we want branches. Well, or maybe not: we have branches in Pijul, but they are less fundamental than in git, precisely so that we don't end up with the kind of "workflow considered useful" I showed on the previous slide. And obviously we want low algorithmic complexity, and ideally fast implementations, and here we actually have both.
More about these two properties, starting with associative merges. This one is really easy: you have Alice producing a commit A, and Bob producing his commits in parallel. Alice wants to first review Bob's first commit and merge it, then review Bob's second commit and merge it, and this should do the same thing as merging both commits at once. I'm not reordering commits here, I'm just merging. And in git, or SVN, or Mercurial, or CVS, or RCS, or any system based on three-way merge, this is not the case.
Such a simple property isn't even satisfied by three-way merge. Here's a counter-example to associativity in git. You start with a document with only two lines, A and B. Alice, following the top path, starts by introducing a G, and then later adds another commit with two new lines above that G, an A and a B. Bob, in parallel, just adds an X between the original A and B. If you try that scenario in git today, you will see that Bob's new line gets merged into Alice's new lines. That's a giant line reshuffling, and git just does it silently; this isn't even a conflict. The reason is that three-way merge runs a heuristic to optimize some metric, and it turns out there may sometimes be several solutions; it just doesn't know about them, it picks one and says: okay, done.
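Concretely, the successive file states look like this (reconstructed from the description above):

```
start      Alice 1    Alice 2    Bob        git's merge    expected
A          A          A          A          A              A
B          B          B          X          B              X
           G          A          B          A              B
                      B                     X              A
                      G                     B              B
                                            G              G
```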
I don't know about you, but if I were running high-security applications, or writing security-related code, this would absolutely terrify me. The fact that your tool can silently reshuffle your lines, even if it doesn't happen often, is just super scary. It also means that the code you review is not the code that gets merged, so you should review your pull requests before the merge and after the merge: that's double the work. Well, you should test them after the merge anyway, but you won't be as careful as in a review, and tests don't catch all bugs.
Now, commutative merges. That's a slightly less trivial thing, because all the tools other than Darcs and Pijul explicitly prevent it. Commutativity of merges means exactly what's in this diagram: you have Alice producing a commit, or a change, A, and Bob producing a change B, and they want to be able to pull each other's changes, and it should just happen transparently, without anything special going on. The order should matter in one sense, because your local repository history is obviously something super important; but it shouldn't matter in the sense that there is no global way to order things that happened in parallel, and that should be reflected in how the tool handles parallelism.
All right, so why would we even want that, beyond academic curiosity? The tools we're currently using are never commutative; they explicitly prevent it. So why would we want this? Well, one reason is that you might want to unapply changes. For example, you pull a change, or you push a change into production because you tested it thoroughly and it seemed to work; then you push more changes; and after a while you realize that your initial change was wrong, and you want to unapply it quickly, without having to change the entire sequence of patches that came afterwards. You might want to be able to do that, and if you disallow commutation, you can't. Commutativity allows you to move that buggy change to the top of the list and unapply it simply.

Then you might want to do cherry-picking. Cherry-picking is: oh, my colleague produced a nice bug fix while working on a feature; I want that bug fix, but the feature is not quite ready. How do I do that without changing the change's entire identity, and solving conflicts, and re-solving them, and re-re-solving them?

Another reason is partial clones: I have a giant monorepo and I want to pull just the patches related to a tiny sub-project. That's the way we handle monorepos in Pijul: you don't need submodules, you don't need hacks, you don't need LFS, you don't need any of that. It just works, and it's just a standard situation.
Okay, so how do we do that? Well, first we have to change perspective, zoom out, and look at what we're doing fundamentally: what is it that we do when we work? Let's talk about states, or snapshots, and changes, also called patches. All our tools today store states, snapshots, and they only compute changes when needed; for example, three-way merge computes changes, and changes between the lists of changes. But what if we did the opposite? What if we changed perspective and started considering changes as the first-class citizens? Why would we want to do that? Well, my claim is, and it's not backed by anything, but my claim is that when we work, what we fundamentally produce is not a new version of the world, or a new version of a project: what we produce is changes to a project. That seems to match the way we work, or the way we think about work, more closely, so we can probably get some benefits out of it. And what if we did a hybrid system where we store both? That's actually what we do.
All right, so this has been looked at before. I'll give you two examples of ideas in that space that some of you may already know about. The first one is operational transforms; it's the idea behind Google Docs, for example. In operational transforms, you have transforms, or changes, on an initial state, here a document with three letters, ABC. When two transforms come in concurrently, you change them so that they can be applied in a sequence. For example here, on the path going downwards, we're inserting an X at the very beginning of the file, that's change T1; and on the path to the right we're doing T2, which deletes the letter C. What happens when you combine these two changes? If you follow the path on the top, you first delete the C, and then, since T1 was at the beginning of the file, you don't need to do anything, because the deletion changed the end of the file; nothing special going on there. On the other path, going first downwards and then to the right, you first insert something, and that shifts the index of your deletion: instead of deleting the character at position 2, you're now deleting the character at position 3. Darcs, for example, does this: it changes its patches as it goes.

And Darcs actually does something really clever to detect conflicts; I don't have time to get into the details, but what they're doing there is really cool. Unfortunately, this technique leads to a quadratic explosion of cases, because if you have n different types of changes, you have n(n-1)/2 different cases to consider. When you're just doing insertions and deletions, that's easy; anything more complicated than that becomes a nightmare to implement, and "a nightmare to implement" is actually a quote from Google engineers who tried to implement it for Google Docs. So it actually is a nightmare.
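To make the T1/T2 example concrete, here is a minimal sketch of such a transform in Rust, with just the two operation types from the slide; every new operation type adds a new row of cases, which is where the quadratic blow-up comes from (my illustration, not Darcs or Google Docs code):

```rust
// Two operation types on a text buffer: insert or delete one character.
#[derive(Clone, Copy, Debug)]
enum Op {
    Insert { pos: usize, ch: char },
    Delete { pos: usize },
}

/// Rewrite `op` so it can be applied *after* `concurrent`,
/// instead of in parallel with it. One rule per pair of op types.
fn transform(op: Op, concurrent: Op) -> Op {
    use Op::*;
    match (op, concurrent) {
        // A concurrent insertion earlier in the file shifts us right.
        (Delete { pos }, Insert { pos: p, .. }) if p <= pos => Delete { pos: pos + 1 },
        (Insert { pos, ch }, Insert { pos: p, .. }) if p <= pos => Insert { pos: pos + 1, ch },
        // A concurrent deletion earlier in the file shifts us left.
        (Delete { pos }, Delete { pos: p }) if p < pos => Delete { pos: pos - 1 },
        (Insert { pos, ch }, Delete { pos: p }) if p < pos => Insert { pos: pos - 1, ch },
        // The concurrent edit happened later in the file: nothing to do.
        _ => op,
    }
}

fn main() {
    // "abc": T1 inserts 'x' at position 0, T2 deletes 'c' at index 2.
    let t1 = Op::Insert { pos: 0, ch: 'x' };
    let t2 = Op::Delete { pos: 2 };
    // Applied after T1, the deletion must target index 3 instead of 2.
    println!("{:?}", transform(t2, t1)); // Delete { pos: 3 }
}
```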
All right, another approach that some of you may have heard about is CRDTs, conflict-free replicated data types. The general principle is to design a structure and its operations at the same time, together, so that all operations have the properties we want: they are associative, commutative. The natural, easiest examples you might come across when learning about CRDTs are increment-only counters, where the only operation on the counter is to increment it, and insert-only or append-only sets. These are easy. Now what happens when you want to do deletions? Then you get into the more subtle examples of CRDTs, and you start needing tombstones and Lamport clocks and all these things from distributed programming.
programming and so I've done the natural
the subtle now let's move on to the
useless if you consider a full git
repository that's a crdt so what are we
even doing here
um well the thing is why my claim is why
I claim this is useless is because
saying git repository the crdt just
means that you can clone it and you can
design a protocol to clone it and and
that's just it
um now if you just if you consider a
head which is the thing we're interested
in which is the current state of your
repository then that's not a serology
that's like absolutely not one uh simply
because as I said uh concurrent changes
don't compute
So, that was a really brief review of the literature. Now let's move on to our solution, Pijul. This all started because we were looking at conflicts: the easy cases, where you can just merge and everything goes right, are not super interesting. What happens when you look at conflicts? That's where we need a good tool the most, because conflicts are confusing, and you want to be able to talk about the fundamental thing behind the conflict, "we disagree on something", and not about how your tool models the conflict. The exact definition depends on the tool; different tools have different definitions of what a conflict is. For example, one commonly accepted definition is Alice and Bob writing to the same file at the same place; that's obviously a conflict, there's no way to order their changes. Another example is Alice renaming a file from F to G while Bob, in parallel, renames it to H; that's also a conflict, and again it depends on the tool. Yet another example, which very few systems handle, and Pijul doesn't handle it either, is Alice renaming a function f while Bob adds a call to f. That's extremely tricky; Darcs tries to do it, but unfortunately it's undecidable to tell whether Bob actually added a call to f or did something else. That's one of the reasons we don't handle this; there are many other reasons, but that one is good enough for me.
Okay, so how did our reflection on conflicts help us shape a new tool? We were inspired by a paper by Samuel Mimram and Cinzia Di Giusto about using category theory to solve that problem. Category theory is a very general theory in mathematics that lets you model many different kinds of proofs in a particular 2D framework, with points and arrows between the points; that's most of what we have in category theory. It's very simple and very abstract at the same time.

What we want is that for any two patches f and g produced from an initial state X, so f leads to Y and g leads to Z, there is a unique state P such that anything we do in the future, any state Q we can reach after both f and g, so anything Alice and Bob could do to reach a common state in the future, can be reached by first merging into a minimal common state P, and then going from P to Q. In other words, for any two patches, you can start by finding a minimal common state, and then do something to reach any other future common state. I realize I'm going a bit fast on this slide, but category theorists have a tool to handle that: they say that P, if it exists, which implies its uniqueness, is called the pushout of f and g.
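In standard notation, the universal property reads like this (a textbook rendering of what the slide describes):

```latex
% f : X -> Y and g : X -> Z are two patches made in parallel.
% P is the pushout of (f, g) when the square commutes:
\begin{array}{ccc}
X & \xrightarrow{\ f\ } & Y \\
{\scriptstyle g}\downarrow\ & & \downarrow \\
Z & \longrightarrow & P
\end{array}
% and P is universal: for every Q with maps Y -> Q and Z -> Q agreeing
% on X, there is exactly one map u : P -> Q making everything commute.
\forall Q,\ \exists!\, u : P \to Q
```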
So why is this important? Because, as you can imagine, it's not that simple: pushouts don't always exist. And that is strictly equivalent to saying that sometimes there are conflicts between our edits. So how do we deal with that? Well, category theory gives you a new question to look at: the question now becomes how to generalize the representation of states, the X, Y, Z, P, Q, so that all pairs of changes f and g have a pushout.
Well, the solution, the minimal extension of files that can model conflicts, is directed graphs, where vertices are bytes, or byte intervals, and edges represent the union of all known orders between bytes. I know that probably sounds a little abstract, but I'll give you a few examples. For example, let's see how we deal with insertions: let's add some bytes to an existing file. First, some details: vertices in Pijul are labeled by a change number, the change that introduced the vertex, and an interval within that change; edges are labeled by the change that introduced them. So for example, here we're starting with just one vertex, C0:[0,n[, the first n bytes of change C0, and we're trying to insert m bytes between positions i-1 and i of that vertex. What we do is start by splitting the vertex, so we get two vertices, C0:[0,i[ and C0:[i,n[, and now we insert a new vertex between these two halves of the split. Super easy. And now we can tell from that graph that our file has three blocks: the first i bytes of C0, followed by the first m bytes of C1, and then some bytes in C0, bytes i to n.
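As a sketch, the data model described so far could look like this in Rust (my simplification; Pijul's real types differ):

```rust
// Vertices are byte intervals within the change that introduced them;
// edges carry the change that introduced them plus a liveness flag.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct Vertex {
    change: u64, // the change that introduced these bytes
    start: u64,  // interval [start, end) within that change
    end: u64,
}

#[derive(Clone, Copy, Debug)]
struct Edge {
    from: Vertex,
    to: Vertex,
    introduced_by: u64, // the change that added or last relabeled this edge
    deleted: bool,      // the "dashed" label used for deletions (see below)
}

// Inserting m bytes of a new change C1 between positions i-1 and i of
// C0:[0,n): split C0:[0,n) into C0:[0,i) and C0:[i,n), then link
// C0:[0,i) -> C1:[0,m) -> C0:[i,n).
fn split(v: Vertex, i: u64) -> (Vertex, Vertex) {
    (
        Vertex { end: v.start + i, ..v },
        Vertex { start: v.start + i, ..v },
    )
}
```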
Okay, that was easy enough. So now, how do we delete bytes? Well, a good thing about version control is that we need to keep the history of the entire repository anyway, so it doesn't cost more to just keep the deleted parts. That's what we do here. Starting from the graph we obtained on the last slide, I'm now deleting a contiguous interval of bytes: bytes j to i from C0, and 0 to k from C1; so, starting from j, I'm deleting i-j+k bytes. The way I do it is exactly the same as for insertions: I start by splitting the relevant vertices at the relevant positions, and then the way to mark bytes as deleted is just to modify the labels of the edges. Here I'm marking my edges as deleted by turning them into dashed lines.

And that's all we need, that's it. Well, Pijul is a bit more complicated than that, but fundamentally these are the two constructs we need; there's a lot of stuff built on top of that, but at the very base it's just that: adding a vertex in a context, the parents and children of the vertex, and changing edge labels.
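Continuing the sketch above: deletion never removes anything from the graph, it only relabels the incoming edges of the dead interval (again my simplification, not Pijul code):

```rust
// Mark a (previously split) vertex as deleted: flip the label on its
// incoming edges. The bytes stay in history; only their liveness changes.
fn mark_deleted(edges: &mut [Edge], dead: Vertex, deleting_change: u64) {
    for e in edges.iter_mut().filter(|e| e.to == dead) {
        e.deleted = true;                  // the dashed edge in the pictures
        e.introduced_by = deleting_change; // record which change deleted it
    }
}
```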
So how does that handle conflicts? I won't dive too deep into that, for reasons of time, but I'll state the definition of conflicts and stop there. First, before getting into conflicts, liveness: alive vertices, by definition, are vertices whose incoming edges are all alive; dead vertices are vertices whose incoming edges are all dead; and all the other vertices, those with both alive and dead edges pointing to them, are called zombies. Now I'm ready to state my definition of conflicts: a graph has no conflicts if and only if it has no zombies and all its alive vertices are totally ordered. That's my definition of a conflict, and it actually matches what you'd expect: a conflict-free graph is just a sequence of bytes that can be ordered unambiguously, where each byte can be said to be either alive or dead, but not both at the same time.
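In terms of the sketch above, the classification reads like this (a direct transcription of the definition, not Pijul code):

```rust
#[derive(PartialEq, Debug)]
enum Status { Alive, Dead, Zombie }

// A vertex is alive if all its incoming edges are alive, dead if they are
// all deleted, and a zombie otherwise. A graph is conflict-free iff it
// has no zombies and its alive vertices are totally ordered.
fn status(v: Vertex, edges: &[Edge]) -> Status {
    let mut alive = false;
    let mut dead = false;
    for e in edges.iter().filter(|e| e.to == v) {
        if e.deleted { dead = true } else { alive = true }
    }
    match (alive, dead) {
        (true, true) => Status::Zombie,
        (false, true) => Status::Dead,
        _ => Status::Alive, // includes the root, which has no incoming edges
    }
}
```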
And there's an extension of that to files and directories and so on, but it's significantly more involved, so I won't talk about it.

Okay, some concluding remarks on that part. Changes: I said I wanted them to be commutative, and I can get that using this framework. They're not completely commutative, in the sense that changes are partially ordered by their dependencies on other changes. Each change explicitly encodes a number of dependencies that are required in order to apply it: for example, you cannot write to a file before introducing that file, and you cannot delete a line that doesn't exist yet. Those are the basic dependencies.
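Given the graph sketch, the minimal dependencies fall out almost for free, something like this hypothetical helper (not Pijul's actual API):

```rust
use std::collections::BTreeSet;

// A change that edits or deletes some vertices must depend on the changes
// that introduced those vertices: that is the partial order on changes.
fn min_dependencies(touched: &[Vertex]) -> BTreeSet<u64> {
    touched.iter().map(|v| v.change).collect()
}
```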
So now, cherry-picking: there isn't even a cherry-pick command in Pijul, because cherry-picking is the same as applying a patch; we don't need to do anything special. And there's no rerere. git rerere is about having to solve the same conflict several times: I don't know how many of you have used that command, but the idea, and I think it's somewhat automated now, is that once you've solved a conflict, you record the conflict resolution, and then git will maybe replay it sometime in the future, if it works; and it doesn't always work. In Pijul, conflicts are just the normal case, they're solved by changes, and changes can be cherry-picked: if you've solved a conflict in one context, you don't need to solve it again in another context.
As for partial clones and monorepos, I already mentioned them: they're easy to implement, as long as repository-wide patches are avoided. For example, a global reformatting, a patch that reformats all of your repository at once; I don't know why you'd want to do that, but if you do, you obviously introduce unwanted dependencies between changes. If you want to do a global reformatting, one thing you can do is make one reformatting patch per sub-project, and then you can keep going.
For large files: one thing I haven't really talked about in detail is how we handle them. Patches actually have two parts: one part is the description of what they do, inserting some bytes, deleting some bytes; the other part is the actual bytes that get inserted or deleted. The way we handle large files is by splitting patches into the description of what they do, the operational part, and the actual contents; and the operational part can be exponentially smaller than the contents. For example, if you work at a video game company and one of your artists has produced ten versions of a two-gigabyte asset during the day, you don't need to download all ten versions: you only need to download the bytes that end up still alive at the end of the day. That lets you handle large files easily: you still need to download some contents, but much less than downloading all the versions.
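A sketch of that split, with hypothetical types (the names and fields are mine, for illustration):

```rust
// A change splits into a small operational part and the raw contents.
// For partial or shallow downloads, ship only the operations plus a hash
// of the contents; fetch the bytes later, and only the ones still alive.
struct Change {
    ops: Vec<GraphOp>, // vertex splits, new vertices, edge relabelings
    contents: Vec<u8>, // the actual inserted bytes (possibly huge)
}

struct ChangeHeader {
    ops: Vec<GraphOp>,
    contents_hash: [u8; 32], // enough to verify the bytes when they arrive
    contents_len: u64,
}

enum GraphOp {
    NewVertex { change: u64, start: u64, end: u64 },
    RelabelEdge { from: u64, to: u64, deleted: bool },
}
```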
All right, let's move on to some implementation tricks and some things I like. There's a lot to say about the implementation, but I'm just going to tell you about some things I like and some things I'm proud of in the implementation of this system. The main challenge was working with large graphs on disk. Obviously, when you're using any data structure more complicated than plain files, the question arises of how to store it on disk, so that you don't have to load the entire thing each time; that would make the cost proportional to the size of the history, and that's just unacceptable. We want it to be logarithmic in the size of the history, and that's what we achieve. Since we can't load the entire graph each time, we have to keep it on disk and manipulate it from there, and the trick is to store it in a key-value store, with vertices mapping to their surrounding edges.
Another thing we absolutely want is transactions, and passive crash safety. The goal with Pijul is to be much more intuitive than all the existing tools; my goal is to introduce it to lawyers, artists, maybe Lego builders, or Sonic Pi composers, these kinds of people. And these people cannot tolerate anything but passive crash safety: you cannot possibly ask them to run some repair operation on a log after the machine has been unplugged, or after a crash. So we absolutely want that. And the next feature we want is branches; they're not as useful as in git, but we still want them, so we want an efficiently forkable store: we want to be able to take a database and just clone it without copying a single byte.
In order to solve these problems, I've written a library called Sanakirja; it's a Finnish word meaning "dictionary". It's an on-disk transactional key-value store, but it's not just that: it's more like a general-purpose file block allocator. It allocates blocks in the file in a transactional way, so if you unplug the machine at any time, and I really do mean any time, your allocations will go away and the memory will be automatically freed.
It gets its crash safety from referential transparency and copy-on-write tricks: it never modifies the previous version, it just creates a new version. And that comes at almost no cost, because when you're working with files on disk you already need to read them, and reading is such an expensive operation that a few copies, and I do at most one copy, each time you read a block from a file cost essentially nothing: it's negligible compared to the cost of reading the block. It's forkable in O(log n), and log n here is an approximation: the absolute worst case is logarithmic in the total number of keys and values.
And it's written in Rust, which might make some of you feel that it's probably safe to use, and so on; but it actually has a super tricky API, because it's way too generic, and it's actually super hard to use. Anyone who wants to use Sanakirja usually has to write a layer on top of it, in order to get the normal safety guarantees we expect from a Rust library. And it uses a generic underlying storage layer: I can store stuff in an mmapped file, but I can also do my reads and writes individually and manually, or use io_uring, the fancy new I/O system in Linux, or do the other thing I'll talk about in the rest of this talk.
Now, just a really brief description of how I manage crash safety, using multiple B-trees and multiple roots. B-trees are these magical data structures that always stay balanced without you having to do anything special. A B-tree is a search tree with more than one element in each node; in Sanakirja, for example, nodes are limited to the size of one memory page, or one disk sector, so four kilobytes, and I store as many keys and values as I can in these blocks. For the sake of this example, I've limited my block size to just two elements, to keep the picture simple.

So, say I want to insert a five. First I decide where I want to insert it, routing from the top: I know I need to insert it between the three and the seven, because five is between three and seven, so I go down to that child. Now I know I need to insert the five between the four and the six, but that node is already full, because I told you the limit is two elements, so this causes a split: I get two leaves, 4 and 6, and I wasn't able to insert the 5 in either of them, so I have to insert it into the parent, between the three and the seven. But that node is full too, it's already at maximum capacity, so I need to split it as well, and this is what I get. This is magical because it's a super simple way of doing insertions that keeps the tree balanced: the only way the depth can increase is by splitting the root, and that automatically gives you the guarantee that all paths have the same length. I really love that idea; it's one of the oldest data structures, but it's still really cool, and it's very suitable for storing stuff on disk.
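Here is the splitting rule from the example in miniature: keys only, a capacity of two like the slides, and the middle key pushed up to the parent (illustration only; Sanakirja works on 4 KB pages, not Vecs):

```rust
// A node holds at most B keys. Inserting into a full node splits it and
// returns the middle key plus the new right sibling, to be inserted into
// the parent; if the parent is full too, the split propagates upwards,
// and splitting the root is the only way the tree ever gets deeper.
const B: usize = 2;

fn insert_leaf(keys: &mut Vec<u64>, k: u64) -> Option<(u64, Vec<u64>)> {
    let pos = keys.binary_search(&k).unwrap_or_else(|p| p);
    keys.insert(pos, k);
    if keys.len() <= B {
        return None; // it fit, nothing to propagate
    }
    // Overflow: right half becomes a new leaf, middle key moves up.
    let mid = keys.len() / 2;
    let right = keys.split_off(mid + 1);
    let up = keys.pop().unwrap();
    Some((up, right))
}

fn main() {
    let mut leaf = vec![4, 6];
    // Inserting 5 into the full leaf [4, 6] splits it into [4] and [6],
    // and pushes 5 up to the parent (the [3, 7] node in the example).
    assert_eq!(insert_leaf(&mut leaf, 5), Some((5, vec![6])));
    assert_eq!(leaf, vec![4]);
}
```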
Now a bit about crash safety, how we use all that to keep our data safe. The way we do it is by having a number of sectors at the beginning of the file, each pointing to one copy of the entire database. For example, here, the first page points to the old version of my B-tree, this version here; and on the next one I'm building the new version by modifying some stuff. For the new version, I don't have to copy everything: I copy just the bits I edit, and everything that hasn't been modified is shared with the previous version.

That's all we do. So what happens when you unplug the machine, at any time, really? Well, the pages at the beginning of the file will not get written, so nothing will happen: the allocations will get back to what they were before you started the transaction.
And the commit of a transaction actually happens when we change the first eight bytes of the file. Hard drives usually guarantee that you can write a full sector; they have a little battery inside that keeps going long enough to write at least one full sector. Although, often they tell you it's "best effort", so there's no actual guarantee that they do it; they guarantee it, but with no actual guarantee. I don't really know what that means, but what I do know is that writing eight bytes should be okay: if they do best effort on 4096 bytes, then they can certainly manage eight bytes, with high probability.
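The protocol, reduced to a sketch (my reconstruction of the description, not Sanakirja's actual code): write the copy-on-write pages first, flush, and only then publish the transaction by overwriting those first eight bytes:

```rust
use std::fs::File;
use std::io::{Seek, SeekFrom, Write};

// Write all new copy-on-write pages, make them durable, then switch the
// root pointer at offset 0. If power is lost before the final 8-byte
// write, the file still points at the old, intact tree.
fn commit(db: &mut File, new_pages: &[(u64, [u8; 4096])], new_root: u64) -> std::io::Result<()> {
    for (offset, page) in new_pages {
        db.seek(SeekFrom::Start(*offset))?;
        db.write_all(&page[..])?;
    }
    db.sync_all()?; // everything durable before the root is switched
    db.seek(SeekFrom::Start(0))?;
    db.write_all(&new_root.to_le_bytes())?; // the 8-byte atomic "commit"
    db.sync_all()
}
```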
Another feature of this system is that writers don't block readers, because the old versions are still available: if you start a read-only transaction while you're writing something, you can still read the old version. That's really cool as well; it's not super useful in Pijul itself, well, unless you start running Pijul in the cloud, as I'll show in a minute. And while this sounds super fancy, with lots of redundancy, crash safety, copy-on-write, and it should be super expensive, it's actually the fastest key-value store I've tested.
These two curves show how long it takes to retrieve things from, and insert things into, my B-trees. This is not specific to Pijul, and it's not particularly optimized for Pijul; the only Pijul-related thing is that I'm not implementing long values yet, just because I've never needed them. I'm comparing four different implementations of key-value stores. The slowest one is a Rust library called sled. sled is super slow here, but it's also really cool: it uses state-of-the-art technology to do lock-free transactions on the database, so you can have a giant computer with thousands of cores, or more realistically hundreds of cores, and your transactions won't block each other while still having ACID guarantees. That's super cool, but it's still a research prototype, and for the kind of single-core stuff I'm doing, it's not super relevant. The green line is the fastest C library, LMDB; it's battle-tested and all that, and it's claimed in many places to be the fastest possible. Then there's Sanakirja, the system I've just introduced. And the orange line is a benchmark of something that cannot be achieved: it's the B-tree implementation from the Rust standard library, which doesn't store anything on disk. If you're storing stuff on disk, it will obviously take more time; the reason I added it is just to see how close we get. And we're not paying a lot, well, this is an SSD obviously, but we're not paying a lot, because we're minimizing the number of reads and writes to the disk. So that's it; the put benchmark shows a similar performance, and removing sled we can see it more clearly: this is about twice as fast as the fastest C equivalent.
Okay. And this was actually unexpected: performance was never the goal. The goal was just to be able to fork. Initially I contacted the author of LMDB to get him to introduce a fork primitive, but it was apparently impossible with that design, so I had to write my own.
All right, so now some announcements about a hosting platform. We all like working together, but we don't like setting up servers, so how do we collaborate and share repositories? One way to do it in Pijul, and that's been the case since the beginning, is self-hosted repositories over SSH, but that's not always convenient: you have to set up a machine in order to work together. So I wanted to build a hosting platform, and I actually built it: the first version was released quite a while ago, in 2016. It's written entirely in Rust, just like Pijul, with PostgreSQL to deal with user accounts, discussions, text, and so on.
It ran for a while on a single machine; it went through all the iterations of the Rust asynchronous ecosystem, so that's a lot of refactoring and rewrites, and it's never been really stable. But the worst time for stability was definitely OVH's Strasbourg data center fire, in March 2021. I saw a slide yesterday, in one of the talks, where someone talked about your server being on fire, but I don't think they really meant it; here I really do mean it: there was an actual fire in the actual data center. The machines were down for two weeks, and because this was an experimental prototype, it had no real backups, replication, or anything of the kind in place. So during those two weeks, I took advantage of that little break in my work to rebuild a replicated setup, using the fact that Pijul is itself a CRDT, so it's easy to replicate, and using the Raft protocol to replicate Postgres. At the time it was also convenient because my two largest contributors, the two largest contributors to Pijul, were using the Southern Cross cable, if you know what that means: they were communicating with the server in Strasbourg by first going from New Zealand and Australia to San Francisco, then across the US, across the Atlantic Ocean, and across France to Strasbourg. So they had absolutely unbearable latencies.
So this was cool and convenient, because I was finally able to give them a proper server with short latencies, short response times. It's been working okay for a little over two years now, but the problem is that, at the moment, this is a personal project, totally unfunded, so the machines are really small, and I'm using Postgres in ways that aren't really intended: the core of my database is actually Sanakirja and Pijul, it's not stored in Postgres, so these two databases need to communicate, and they need to be located close to each other. The replicas are not just backups, they're backups and caches at the same time. The consequence is that when the machines are under too high a load, Postgres takes a little more time to answer, the Raft layer interprets that as a total failure, and it triggers a switchover of the leader of the cluster. That would be okay if it just meant some downtime, but the consequence is actually way worse than downtime: it's data loss. Having small machines is fine; I don't mind if some of my users are using my experimental system and it crashes sometimes, or it's down for a little while, that doesn't really matter. But when they start losing data, that's a problem. So I decided to rewrite it.
Because I was working with Cloudflare Workers and function-as-a-service in other projects, my renewable energy projects, I started thinking about how we could use Pijul there. Really quickly: function-as-a-service differs from traditional architectures, where you have a big process overhead, or a virtual machine overhead, for each little piece of server you're running. Instead of doing that, you share the machine, and share a single giant JavaScript runtime between lots of different processes, or functions, even from other users: on each machine, Cloudflare uses one giant runtime shared by all its customers. That's really cool, because you can answer from all of Cloudflare's 250 data centers, and it gives you optimal latency.
It's also very easy to write. This is the minimal example taken from their documentation, where you're just answering "Hello worker" to a request.
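Their documented JavaScript example fits in a few lines; since the Pijul parts are in Rust, the equivalent with the workers-rs crate looks roughly like this (my transcription of the idea; check the crate's docs for the exact current API):

```rust
use worker::*;

// The whole "server": one function invoked per request, no process to
// manage, running inside the shared runtime described above.
#[event(fetch)]
pub async fn main(_req: Request, _env: Env, _ctx: Context) -> Result<Response> {
    Response::ok("Hello worker!")
}
```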
Now the question becomes: can we run, or at least simulate, a Pijul repository in a purely function-as-a-service framework? The storage options are fairly limited in function-as-a-service: you don't have access to a hard drive, you don't even have an actual machine. So how do you do that? Or at least, how do you pretend to be a full-fledged Pijul repository when in fact you're just some replicated, eventually consistent key-value store in the cloud? That's the main challenge; it's completely at odds with my hypotheses when I first wrote Sanakirja and Pijul: there is no hard drive at all.
The solution is to compile Sanakirja to WASM, because you can run WASM on Cloudflare Workers, and to store pseudo memory pages in their storage engine: instead of using disk sectors, I'm using keys and values in their storage engine.
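Because Sanakirja's storage layer is generic, as mentioned earlier, that swap amounts to implementing something like this hypothetical trait over a KV namespace instead of a file (a sketch of the idea, not the library's real interface):

```rust
// Pages are no longer 4096-byte sectors of a file but values in a
// replicated key-value store, keyed by page number.
trait PageStore {
    fn load(&self, page: u64) -> Vec<u8>;        // e.g. a KV "get"
    fn store(&mut self, page: u64, data: &[u8]); // e.g. a KV "put"
}
```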
The main problem now becomes eventual consistency. I'm solving that problem using the multiple roots I talked about earlier: I keep the older roots, because the changes I'm making to my key-value store may not have propagated to all data centers yet, so I keep the old roots around while they haven't propagated. Cloudflare guarantees, for example, a one-minute propagation time, so that's what I use to decide how long to keep my older branches, in order to avoid stepping on each other's toes. And we don't need a full Pijul there: checking dependencies and maintaining a list of patches is enough.
Okay, some technical details, as almost my conclusion. This service uses TypeScript for the web parts, for the UI, and Rust compiled to WASM for the Pijul parts. It can be self-hosted, although I've never tested that, using Cloudflare's runtime, which they've released as an open source project. It's open source, AGPL-licensed, and it will be released progressively, because there's a lot of stuff to release that's currently just experimental and prototypal. And it's starting today: I opened it just before the beginning of this talk. So now you can connect to nest.pijul.org and start creating accounts. There's no documentation, things may crash, there are probably lots of bugs, but all this will come in the next few days or weeks.
Okay, so as a conclusion: this is a new open source version control system, based on proper algorithms rather than the collections of hacks we've had for some time. It's scalable to monorepos and large files, and it's potentially usable by non-coders. The farthest stretch I've seen in discussions around this project is using it as a tool to help parliaments do their jobs. Parliaments are giant version control systems operated manually by highly qualified, highly paid lawyers, who are paid to check the logical consistency of the law, but actually spend a significant share of their time editing Word documents to apply changes voted by members of parliament. They're doing manual version control, and they're wasting lots of time on it. I've collaborated with the French parliament, which would have been a good test case, except we weren't actually using our parliament at the time: the cabinet was passing its bills as it wished. So it's like the test mode of an API.
It could be usable by artists; I've talked to lawyers as well; maybe by Sonic Pi composers, we had a really cool discussion about that last night; and why not by Lego builders wanting to build larger projects. The hosting service is available starting today, as I said.
And one more conclusion, a personal one: I have a tendency to work on way too many things at the same time, and it never works well, until it does. For example, working on electricity sharing at the same time as version control helped me see how these fit together, and share ideas across projects. To conclude, I would like to acknowledge some of my co-authors and contributors: Florent Becker, for all the discussions, inspiration, and early contributions; tankf33der, the most patient tester I've ever seen, who is still there after many years, patiently checking all my bugs, so a huge thanks to him; and Rohan Hart and Chris Bailey. Rohan Hart and Angus Finch are actually the two folks using the Southern Cross cable, and they've contributed really cool stuff to Pijul, while Chris Bailey has bridged the gap between lawyers, legal people, and what I'm doing. All right, thanks for your attention.