Thank you. I work on way too many projects at the same time. Right now I'm working on Coturnix, which is a way to share renewable electricity with your neighbors efficiently, using mathematics to do that. And I've been working for a few years now on Pijul, a version control system also based on mathematics, which is what I'm going to talk about today.
I have way too many things to tell you. I'll start by talking about what version control is, with a brief recap of where it comes from, how it started, and where we're at now. Then I'll talk about our solution and the principles behind it, and then about the implementation of that version control system, including one of the fastest database backends in the world, which I was forced to write in order to implement Pijul correctly. And finally I'll have some surprise announcements to make about the hosting platform where you can host repositories.
So first, version control is actually very simple, and it's not specific to coders: it's when one or more co-authors edit trees of documents concurrently. One key feature of version control, compared to things like Google Docs for example, is the ability to do asynchronous edits. This sounds like it should be easier, because going asynchronous gives more flexibility to your users, but it's actually the opposite: when you allow co-authors to choose when they want to sync, or merge their changes, things become much more complicated. The main reason is that edits may conflict, and conflicts happen in human work. I'm not claiming to have a universal solution to all the conflicts humans may have; I'm merely trying to help them model their conflicts in a proper way. Yet another feature of version control we might like is the ability to review a project's history: to tell when a feature or a bug was introduced, who introduced it, and sometimes get an indication of how to fix it.
Many of you here, or many people I talk to about this project, think that version control is a solved problem, because of tools like git, Mercurial, SVN, CVS; people sometimes mention Fossil or Perforce. We have a huge collection of tools, and they're probably considered one of the greatest achievements of our industry. And yet nobody outside our field uses them. These days we have silverware made by NASA with materials that can go to Mars, but our greatest achievements cannot be used even by editors of legal documents, or parliaments; even the video game industry doesn't use them. And it's not because they're too young: they've been around for quite a number of decades now.
Lately there's been a trend of doing distributed version control. That's cool, there's no central server, except that the tools are unusable if you don't use a central server, and actually something worse than a central server: a global central server, universal to all projects. And our current tools require strong work discipline and planning; you have to plan things in advance. The picture on the right is a typical example of a "workflow considered useful", and I personally don't understand it.
But yeah, onboarding diverse people, people from outside our field, without, for example, formal training in computer science: show them this tool and they'd be like, are you crazy? And that's just a small part of what I'm about to say, because we have workflows that engineers in any other industry in the world would just laugh at, if they knew about them. So my claim is that by using these tools, we are wasting significant human work time at a global scale. I don't know how many millions of engineer-hours are wasted every year on fixing that rebase that didn't work, or re-fixing that conflict again, or wait, re-re-fixing it; there's actually a command in git called rerere.
Some improvements have been proposed using mathematics, Darcs for example, but unfortunately they don't really scale well. Just a note before I go any further: Pijul is open source, and I'm not considering non-open version control systems, because git is good enough. It's an okay system; it's phenomenal in some ways. If you're going for a commercial system, you'd better be at least as good as that, and I don't know of any system that achieves it anyway.
So, our demands for a version control system. We want associative merges. Associativity is a mathematical term, and here it means that taking changes A and B together should be the same as taking A followed by B. That sounds absolutely trivial, and I'll give you a picture in a minute that makes it even clearer. Next, we want commutative merges: the property that if A and B are produced independently, the order in which you apply them doesn't matter. You have Alice and Bob working together: if Alice pulls Bob's changes, it should result in the same thing as if Bob pulls Alice's changes.
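A compact way to write the two properties, with ⊔ standing for merging changes into a common state (the notation is mine, not from the slides):

```latex
% Associativity: merging with B and then with C
% is the same as merging with (B merged with C) at once
(A \sqcup B) \sqcup C = A \sqcup (B \sqcup C)
% Commutativity: independent changes can be applied in either order
A \sqcup B = B \sqcup A
```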
Then we want branches. Well, or maybe not: we have branches in Pijul, but they are less fundamental than in git, precisely so that we don't end up with the kind of "workflow considered useful" I showed on the previous slide. And obviously we want low algorithmic complexity, and ideally fast implementations, and here we actually have both.
More about these two properties, starting with associative merges. This one is really easy: you have Alice producing a commit A, and Bob producing his commits in parallel. Alice wants to first review Bob's first commit and merge it, then review Bob's second commit and merge it, and this should do the same thing as merging both commits at once. I'm not reordering commits here, I'm just merging. And in git, or SVN, or Mercurial, or CVS, or RCS, or any system based on three-way merge, this is not the case.
Such a simple property isn't even satisfied by three-way merge. Here's a counter-example to associativity in git. You start with a document with only two lines, A and B. Alice, following the top path, starts by introducing a G, and then later adds another commit with two new lines above that G, an A and a B. Bob, in parallel, just adds an X between the original A and B. If you try that scenario in git today, you will see that Bob's new line gets merged into Alice's new lines. That's a giant line reshuffling, and git just does it silently; this isn't even a conflict. The reason is that three-way merge runs a heuristic to optimize some metric, and it turns out there may sometimes be several solutions; it just doesn't know about them, it picks one and says: okay, done.
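Concretely, the successive file states look like this (reconstructed from the description above):

```
start      Alice 1    Alice 2    Bob        git's merge    expected
A          A          A          A          A              A
B          B          B          X          B              X
           G          A          B          A              B
                      B                     X              A
                      G                     B              B
                                            G              G
```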
I don't know about you, but if I were running high-security applications, or writing security-related code, this would absolutely terrify me. The fact that your tool can silently reshuffle your lines, even if it doesn't happen often, is just super scary. It also means that the code you review is not the code that gets merged, so you should review your pull requests before the merge and after the merge: that's double the work. Well, you should test them after the merge anyway, but you won't be as careful as in a review, and tests don't catch all bugs.
Now, commutative merges. That's a slightly less trivial thing, because all the tools other than Darcs and Pijul explicitly prevent it. Commutativity of merges means exactly what's in this diagram: you have Alice producing a commit, or a change, A, and Bob producing a change B, and they want to be able to pull each other's changes, and it should just happen transparently, without anything special going on. The order should matter in one sense, because your local repository history is obviously something super important; but it shouldn't matter in the sense that there is no global way to order things that happened in parallel, and that should be reflected in how the tool handles parallelism.
All right, so why would we even want that, beyond academic curiosity? The tools we're currently using are never commutative; they explicitly prevent it. So why would we want this? Well, one reason is that you might want to unapply changes. For example, you pull a change, or you push a change into production because you tested it thoroughly and it seemed to work; then you push more changes; and after a while you realize that your initial change was wrong, and you want to unapply it quickly, without having to change the entire sequence of patches that came afterwards. You might want to be able to do that, and if you disallow commutation, you can't. Commutativity allows you to move that buggy change to the top of the list and unapply it simply.

Then you might want to do cherry-picking. Cherry-picking is: oh, my colleague produced a nice bug fix while working on a feature; I want that bug fix, but the feature is not quite ready. How do I do that without changing the change's entire identity, and solving conflicts, and re-solving them, and re-re-solving them?

Another reason is partial clones: I have a giant monorepo and I want to pull just the patches related to a tiny sub-project. That's the way we handle monorepos in Pijul: you don't need submodules, you don't need hacks, you don't need LFS, you don't need any of that. It just works, and it's just a standard situation.
Okay, so how do we do that? Well, first we have to change perspective, zoom out, and look at what we're doing fundamentally: what is it that we do when we work? Let's talk about states, or snapshots, and changes, also called patches. All our tools today store states, snapshots, and they only compute changes when needed; for example, three-way merge computes changes, and changes between the lists of changes. But what if we did the opposite? What if we changed perspective and started considering changes as the first-class citizens? Why would we want to do that? Well, my claim is, and it's not backed by anything, but my claim is that when we work, what we fundamentally produce is not a new version of the world, or a new version of a project: what we produce is changes to a project. That seems to match the way we work, or the way we think about work, more closely, so we can probably get some benefits out of it. And what if we did a hybrid system where we store both? That's actually what we do.
All right, so this has been looked at before. I'll give you two examples of ideas in that space that some of you may already know about. The first one is operational transforms; it's the idea behind Google Docs, for example. In operational transforms, you have transforms, or changes, on an initial state, here a document with three letters, ABC. When two transforms come in concurrently, you change them so that they can be applied in a sequence. For example here, on the path going downwards, we're inserting an X at the very beginning of the file, that's change T1; and on the path to the right we're doing T2, which deletes the letter C. What happens when you combine these two changes? If you follow the path on the top, you first delete the C, and then, since T1 was at the beginning of the file, you don't need to do anything, because the deletion changed the end of the file; nothing special going on there. On the other path, going first downwards and then to the right, you first insert something, and that shifts the index of your deletion: instead of deleting the character at position 2, you're now deleting the character at position 3. Darcs, for example, does this: it changes its patches as it goes.

And Darcs actually does something really clever to detect conflicts; I don't have time to get into the details, but what they're doing there is really cool. Unfortunately, this technique leads to a quadratic explosion of cases, because if you have n different types of changes, you have n(n-1)/2 different cases to consider. When you're just doing insertions and deletions, that's easy; anything more complicated than that becomes a nightmare to implement, and "a nightmare to implement" is actually a quote from Google engineers who tried to implement it for Google Docs. So it actually is a nightmare.
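To make the T1/T2 example concrete, here is a minimal sketch of such a transform in Rust, with just the two operation types from the slide; every new operation type adds a new row of cases, which is where the quadratic blow-up comes from (my illustration, not Darcs or Google Docs code):

```rust
// Two operation types on a text buffer: insert or delete one character.
#[derive(Clone, Copy, Debug)]
enum Op {
    Insert { pos: usize, ch: char },
    Delete { pos: usize },
}

/// Rewrite `op` so it can be applied *after* `concurrent`,
/// instead of in parallel with it. One rule per pair of op types.
fn transform(op: Op, concurrent: Op) -> Op {
    use Op::*;
    match (op, concurrent) {
        // A concurrent insertion earlier in the file shifts us right.
        (Delete { pos }, Insert { pos: p, .. }) if p <= pos => Delete { pos: pos + 1 },
        (Insert { pos, ch }, Insert { pos: p, .. }) if p <= pos => Insert { pos: pos + 1, ch },
        // A concurrent deletion earlier in the file shifts us left.
        (Delete { pos }, Delete { pos: p }) if p < pos => Delete { pos: pos - 1 },
        (Insert { pos, ch }, Delete { pos: p }) if p < pos => Insert { pos: pos - 1, ch },
        // The concurrent edit happened later in the file: nothing to do.
        _ => op,
    }
}

fn main() {
    // "abc": T1 inserts 'x' at position 0, T2 deletes 'c' at index 2.
    let t1 = Op::Insert { pos: 0, ch: 'x' };
    let t2 = Op::Delete { pos: 2 };
    // Applied after T1, the deletion must target index 3 instead of 2.
    println!("{:?}", transform(t2, t1)); // Delete { pos: 3 }
}
```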
All right, another approach that some of you may have heard about is CRDTs, conflict-free replicated data types. The general principle is to design a structure and its operations at the same time, together, so that all operations have the properties we want: they are associative, commutative. The natural, easiest examples you might come across when learning about CRDTs are increment-only counters, where the only operation on the counter is to increment it, and insert-only or append-only sets. These are easy. Now what happens when you want to do deletions? Then you get into the more subtle examples of CRDTs, and you start needing tombstones and Lamport clocks and all these things from distributed programming.
programming and so I've done the natural
the subtle now let's move on to the
useless if you consider a full git
repository that's a crdt so what are we
even doing here
um well the thing is why my claim is why
I claim this is useless is because
saying git repository the crdt just
means that you can clone it and you can
design a protocol to clone it and and
that's just it
um now if you just if you consider a
head which is the thing we're interested
in which is the current state of your
repository then that's not a serology
that's like absolutely not one uh simply
because as I said uh concurrent changes
don't compute
So, that was a really brief review of the literature. Now let's move on to our solution, Pijul. This all started because we were looking at conflicts: the easy cases, where you can just merge and everything goes right, are not super interesting. What happens when you look at conflicts? That's where we need a good tool the most, because conflicts are confusing, and you want to be able to talk about the fundamental thing behind the conflict, "we disagree on something", and not about how your tool models the conflict. The exact definition depends on the tool; different tools have different definitions of what a conflict is. For example, one commonly accepted definition is Alice and Bob writing to the same file at the same place; that's obviously a conflict, there's no way to order their changes. Another example is Alice renaming a file from F to G while Bob, in parallel, renames it to H; that's also a conflict, and again it depends on the tool. Yet another example, which very few systems handle, and Pijul doesn't handle it either, is Alice renaming a function f while Bob adds a call to f. That's extremely tricky; Darcs tries to do it, but unfortunately it's undecidable to tell whether Bob actually added a call to f or did something else. That's one of the reasons we don't handle this; there are many other reasons, but that one is good enough for me.
Okay, so how did our reflection on conflicts help us shape a new tool? We were inspired by a paper by Samuel Mimram and Cinzia Di Giusto about using category theory to solve that problem. Category theory is a very general theory in mathematics that lets you model many different kinds of proofs in a particular 2D framework, with points and arrows between the points; that's most of what we have in category theory. It's very simple and very abstract at the same time.

What we want is that for any two patches f and g produced from an initial state X, so f leads to Y and g leads to Z, there is a unique state P such that anything we do in the future, any state Q we can reach after both f and g, so anything Alice and Bob could do to reach a common state in the future, can be reached by first merging into a minimal common state P, and then going from P to Q. In other words, for any two patches, you can start by finding a minimal common state, and then do something to reach any other future common state. I realize I'm going a bit fast on this slide, but category theorists have a tool to handle that: they say that P, if it exists, which implies its uniqueness, is called the pushout of f and g.
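In standard notation, the universal property reads like this (a textbook rendering of what the slide describes):

```latex
% f : X -> Y and g : X -> Z are two patches made in parallel.
% P is the pushout of (f, g) when the square commutes:
\begin{array}{ccc}
X & \xrightarrow{\ f\ } & Y \\
{\scriptstyle g}\downarrow\ & & \downarrow \\
Z & \longrightarrow & P
\end{array}
% and P is universal: for every Q with maps Y -> Q and Z -> Q agreeing
% on X, there is exactly one map u : P -> Q making everything commute.
\forall Q,\ \exists!\, u : P \to Q
```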
So why is this important? Because, as you can imagine, it's not that simple: pushouts don't always exist. And that is strictly equivalent to saying that sometimes there are conflicts between our edits. So how do we deal with that? Well, category theory gives you a new question to look at: the question now becomes how to generalize the representation of states, the X, Y, Z, P, Q, so that all pairs of changes f and g have a pushout.
Well, the solution, the minimal extension of files that can model conflicts, is directed graphs, where vertices are bytes, or byte intervals, and edges represent the union of all known orders between bytes. I know that probably sounds a little abstract, but I'll give you a few examples. For example, let's see how we deal with insertions: let's add some bytes to an existing file. First, some details: vertices in Pijul are labeled by a change number, the change that introduced the vertex, and an interval within that change; edges are labeled by the change that introduced them. So for example, here we're starting with just one vertex, C0:[0,n[, the first n bytes of change C0, and we're trying to insert m bytes between positions i-1 and i of that vertex. What we do is start by splitting the vertex, so we get two vertices, C0:[0,i[ and C0:[i,n[, and now we insert a new vertex between these two halves of the split. Super easy. And now we can tell from that graph that our file has three blocks: the first i bytes of C0, followed by the first m bytes of C1, and then some bytes in C0, bytes i to n.
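As a sketch, the data model described so far could look like this in Rust (my simplification; Pijul's real types differ):

```rust
// Vertices are byte intervals within the change that introduced them;
// edges carry the change that introduced them plus a liveness flag.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct Vertex {
    change: u64, // the change that introduced these bytes
    start: u64,  // interval [start, end) within that change
    end: u64,
}

#[derive(Clone, Copy, Debug)]
struct Edge {
    from: Vertex,
    to: Vertex,
    introduced_by: u64, // the change that added or last relabeled this edge
    deleted: bool,      // the "dashed" label used for deletions (see below)
}

// Inserting m bytes of a new change C1 between positions i-1 and i of
// C0:[0,n): split C0:[0,n) into C0:[0,i) and C0:[i,n), then link
// C0:[0,i) -> C1:[0,m) -> C0:[i,n).
fn split(v: Vertex, i: u64) -> (Vertex, Vertex) {
    (
        Vertex { end: v.start + i, ..v },
        Vertex { start: v.start + i, ..v },
    )
}
```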
Okay, that was easy enough. So now, how do we delete bytes? Well, a good thing about version control is that we need to keep the history of the entire repository anyway, so it doesn't cost more to just keep the deleted parts. That's what we do here. Starting from the graph we obtained on the last slide, I'm now deleting a contiguous interval of bytes: bytes j to i from C0, and 0 to k from C1; so, starting from j, I'm deleting i-j+k bytes. The way I do it is exactly the same as for insertions: I start by splitting the relevant vertices at the relevant positions, and then the way to mark bytes as deleted is just to modify the labels of the edges. Here I'm marking my edges as deleted by turning them into dashed lines.

And that's all we need, that's it. Well, Pijul is a bit more complicated than that, but fundamentally these are the two constructs we need; there's a lot of stuff built on top of that, but at the very base it's just that: adding a vertex in a context, the parents and children of the vertex, and changing edge labels.
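Continuing the sketch above: deletion never removes anything from the graph, it only relabels the incoming edges of the dead interval (again my simplification, not Pijul code):

```rust
// Mark a (previously split) vertex as deleted: flip the label on its
// incoming edges. The bytes stay in history; only their liveness changes.
fn mark_deleted(edges: &mut [Edge], dead: Vertex, deleting_change: u64) {
    for e in edges.iter_mut().filter(|e| e.to == dead) {
        e.deleted = true;                  // the dashed edge in the pictures
        e.introduced_by = deleting_change; // record which change deleted it
    }
}
```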
So how does that handle conflicts? I won't dive too deep into that, for reasons of time, but I'll state the definition of conflicts and stop there. First, before getting into conflicts, liveness: alive vertices, by definition, are vertices whose incoming edges are all alive; dead vertices are vertices whose incoming edges are all dead; and all the other vertices, those with both alive and dead edges pointing to them, are called zombies. Now I'm ready to state my definition of conflicts: a graph has no conflicts if and only if it has no zombies and all its alive vertices are totally ordered. That's my definition of a conflict, and it actually matches what you'd expect: a conflict-free graph is just a sequence of bytes that can be ordered unambiguously, where each byte can be said to be either alive or dead, but not both at the same time.
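In terms of the sketch above, the classification reads like this (a direct transcription of the definition, not Pijul code):

```rust
#[derive(PartialEq, Debug)]
enum Status { Alive, Dead, Zombie }

// A vertex is alive if all its incoming edges are alive, dead if they are
// all deleted, and a zombie otherwise. A graph is conflict-free iff it
// has no zombies and its alive vertices are totally ordered.
fn status(v: Vertex, edges: &[Edge]) -> Status {
    let mut alive = false;
    let mut dead = false;
    for e in edges.iter().filter(|e| e.to == v) {
        if e.deleted { dead = true } else { alive = true }
    }
    match (alive, dead) {
        (true, true) => Status::Zombie,
        (false, true) => Status::Dead,
        _ => Status::Alive, // includes the root, which has no incoming edges
    }
}
```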
And there's an extension of that to files and directories and so on, but it's significantly more involved, so I won't talk about it.

Okay, some concluding remarks on that part. Changes: I said I wanted them to be commutative, and I can get that using this framework. They're not completely commutative, in the sense that changes are partially ordered by their dependencies on other changes. Each change explicitly encodes a number of dependencies that are required in order to apply it: for example, you cannot write to a file before introducing that file, and you cannot delete a line that doesn't exist yet. Those are the basic dependencies.
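Given the graph sketch, the minimal dependencies fall out almost for free, something like this hypothetical helper (not Pijul's actual API):

```rust
use std::collections::BTreeSet;

// A change that edits or deletes some vertices must depend on the changes
// that introduced those vertices: that is the partial order on changes.
fn min_dependencies(touched: &[Vertex]) -> BTreeSet<u64> {
    touched.iter().map(|v| v.change).collect()
}
```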
So now, cherry-picking: there isn't even a cherry-pick command in Pijul, because cherry-picking is the same as applying a patch; we don't need to do anything special. And there's no rerere. git rerere is about having to solve the same conflict several times: I don't know how many of you have used that command, but the idea, and I think it's somewhat automated now, is that once you've solved a conflict, you record the conflict resolution, and then git will maybe replay it sometime in the future, if it works; and it doesn't always work. In Pijul, conflicts are just the normal case, they're solved by changes, and changes can be cherry-picked: if you've solved a conflict in one context, you don't need to solve it again in another context.
As for partial clones and monorepos, I already mentioned them: they're easy to implement, as long as repository-wide patches are avoided. For example, a global reformatting, a patch that reformats all of your repository at once; I don't know why you'd want to do that, but if you do, you obviously introduce unwanted dependencies between changes. If you want to do a global reformatting, one thing you can do is make one reformatting patch per sub-project, and then you can keep going.
For large files: one thing I haven't really talked about in detail is how we handle them. Patches actually have two parts: one part is the description of what they do, inserting some bytes, deleting some bytes; the other part is the actual bytes that get inserted or deleted. The way we handle large files is by splitting patches into the description of what they do, the operational part, and the actual contents; and the operational part can be exponentially smaller than the contents. For example, if you work at a video game company and one of your artists has produced ten versions of a two-gigabyte asset during the day, you don't need to download all ten versions: you only need to download the bytes that end up still alive at the end of the day. That lets you handle large files easily: you still need to download some contents, but much less than downloading all the versions.
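A sketch of that split, with hypothetical types (the names and fields are mine, for illustration):

```rust
// A change splits into a small operational part and the raw contents.
// For partial or shallow downloads, ship only the operations plus a hash
// of the contents; fetch the bytes later, and only the ones still alive.
struct Change {
    ops: Vec<GraphOp>, // vertex splits, new vertices, edge relabelings
    contents: Vec<u8>, // the actual inserted bytes (possibly huge)
}

struct ChangeHeader {
    ops: Vec<GraphOp>,
    contents_hash: [u8; 32], // enough to verify the bytes when they arrive
    contents_len: u64,
}

enum GraphOp {
    NewVertex { change: u64, start: u64, end: u64 },
    RelabelEdge { from: u64, to: u64, deleted: bool },
}
```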
All right, let's move on to some implementation tricks and some things I like. There's a lot to say about the implementation, but I'm just going to tell you about some things I like and some things I'm proud of in the implementation of this system. The main challenge was working with large graphs on disk. Obviously, when you're using any data structure more complicated than plain files, the question arises of how to store it on disk, so that you don't have to load the entire thing each time; that would make the cost proportional to the size of the history, and that's just unacceptable. We want it to be logarithmic in the size of the history, and that's what we achieve. Since we can't load the entire graph each time, we have to keep it on disk and manipulate it from there, and the trick is to store it in a key-value store, with vertices mapping to their surrounding edges.
Another thing we absolutely want is transactions, and passive crash safety. The goal with Pijul is to be much more intuitive than all the existing tools; my goal is to introduce it to lawyers, artists, maybe Lego builders, or Sonic Pi composers, these kinds of people. And these people cannot tolerate anything but passive crash safety: you cannot possibly ask them to run some repair operation on a log after the machine has been unplugged, or after a crash. So we absolutely want that. And the next feature we want is branches; they're not as useful as in git, but we still want them, so we want an efficiently forkable store: we want to be able to take a database and just clone it without copying a single byte.
In order to solve these problems, I've written a library called Sanakirja; it's a Finnish word meaning "dictionary". It's an on-disk transactional key-value store, but it's not just that: it's more like a general-purpose file block allocator. It allocates blocks in the file in a transactional way, so if you unplug the machine at any time, and I really do mean any time, your allocations will go away and the memory will be automatically freed.
It gets its crash safety from referential transparency and copy-on-write tricks: it never modifies the previous version, it just creates a new version. And that comes at almost no cost, because when you're working with files on disk you already need to read them, and reading is such an expensive operation that a few copies, and I do at most one copy, each time you read a block from a file cost essentially nothing: it's negligible compared to the cost of reading the block. It's forkable in O(log n), and log n here is an approximation: the absolute worst case is logarithmic in the total number of keys and values.
And it's written in Rust, which might make some of you feel that it's probably safe to use, and so on; but it actually has a super tricky API, because it's way too generic, and it's actually super hard to use. Anyone who wants to use Sanakirja usually has to write a layer on top of it, in order to get the normal safety guarantees we expect from a Rust library. And it uses a generic underlying storage layer: I can store stuff in an mmapped file, but I can also do my reads and writes individually and manually, or use io_uring, the fancy new I/O system in Linux, or do the other thing I'll talk about in the rest of this talk.
Now, just a really brief description of how I manage crash safety, using multiple B-trees and multiple roots. B-trees are these magical data structures that always stay balanced without you having to do anything special. A B-tree is a search tree with more than one element in each node; in Sanakirja, for example, nodes are limited to the size of one memory page, or one disk sector, so four kilobytes, and I store as many keys and values as I can in these blocks. For the sake of this example, I've limited my block size to just two elements, to keep the picture simple.

So, say I want to insert a five. First I decide where I want to insert it, routing from the top: I know I need to insert it between the three and the seven, because five is between three and seven, so I go down to that child. Now I know I need to insert the five between the four and the six, but that node is already full, because I told you the limit is two elements, so this causes a split: I get two leaves, 4 and 6, and I wasn't able to insert the 5 in either of them, so I have to insert it into the parent, between the three and the seven. But that node is full too, it's already at maximum capacity, so I need to split it as well, and this is what I get. This is magical because it's a super simple way of doing insertions that keeps the tree balanced: the only way the depth can increase is by splitting the root, and that automatically gives you the guarantee that all paths have the same length. I really love that idea; it's one of the oldest data structures, but it's still really cool, and it's very suitable for storing stuff on disk.
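Here is the splitting rule from the example in miniature: keys only, a capacity of two like the slides, and the middle key pushed up to the parent (illustration only; Sanakirja works on 4 KB pages, not Vecs):

```rust
// A node holds at most B keys. Inserting into a full node splits it and
// returns the middle key plus the new right sibling, to be inserted into
// the parent; if the parent is full too, the split propagates upwards,
// and splitting the root is the only way the tree ever gets deeper.
const B: usize = 2;

fn insert_leaf(keys: &mut Vec<u64>, k: u64) -> Option<(u64, Vec<u64>)> {
    let pos = keys.binary_search(&k).unwrap_or_else(|p| p);
    keys.insert(pos, k);
    if keys.len() <= B {
        return None; // it fit, nothing to propagate
    }
    // Overflow: right half becomes a new leaf, middle key moves up.
    let mid = keys.len() / 2;
    let right = keys.split_off(mid + 1);
    let up = keys.pop().unwrap();
    Some((up, right))
}

fn main() {
    let mut leaf = vec![4, 6];
    // Inserting 5 into the full leaf [4, 6] splits it into [4] and [6],
    // and pushes 5 up to the parent (the [3, 7] node in the example).
    assert_eq!(insert_leaf(&mut leaf, 5), Some((5, vec![6])));
    assert_eq!(leaf, vec![4]);
}
```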
Now a bit about crash safety, how we use all that to keep our data safe. The way we do it is by having a number of sectors at the beginning of the file, each pointing to one copy of the entire database. For example, here, the first page points to the old version of my B-tree, this version here; and on the next one I'm building the new version by modifying some stuff. For the new version, I don't have to copy everything: I copy just the bits I edit, and everything that hasn't been modified is shared with the previous version.

That's all we do. So what happens when you unplug the machine, at any time, really? Well, the pages at the beginning of the file will not get written, so nothing will happen: the allocations will get back to what they were before you started the transaction.
And the commit of a transaction actually happens when we change the first eight bytes of the file. Hard drives usually guarantee that you can write a full sector; they have a little battery inside that keeps going long enough to write at least one full sector. Although, often they tell you it's "best effort", so there's no actual guarantee that they do it; they guarantee it, but with no actual guarantee. I don't really know what that means, but what I do know is that writing eight bytes should be okay: if they do best effort on 4096 bytes, then they can certainly manage eight bytes, with high probability.
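The protocol, reduced to a sketch (my reconstruction of the description, not Sanakirja's actual code): write the copy-on-write pages first, flush, and only then publish the transaction by overwriting those first eight bytes:

```rust
use std::fs::File;
use std::io::{Seek, SeekFrom, Write};

// Write all new copy-on-write pages, make them durable, then switch the
// root pointer at offset 0. If power is lost before the final 8-byte
// write, the file still points at the old, intact tree.
fn commit(db: &mut File, new_pages: &[(u64, [u8; 4096])], new_root: u64) -> std::io::Result<()> {
    for (offset, page) in new_pages {
        db.seek(SeekFrom::Start(*offset))?;
        db.write_all(&page[..])?;
    }
    db.sync_all()?; // everything durable before the root is switched
    db.seek(SeekFrom::Start(0))?;
    db.write_all(&new_root.to_le_bytes())?; // the 8-byte atomic "commit"
    db.sync_all()
}
```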
Another feature of this system is that writers don't block readers, because the old versions are still available: if you start a read-only transaction while you're writing something, you can still read the old version. That's really cool as well; it's not super useful in Pijul itself, well, unless you start running Pijul in the cloud, as I'll show in a minute. And while this sounds super fancy, with lots of redundancy, crash safety, copy-on-write, and it should be super expensive, it's actually the fastest key-value store I've tested.
These two curves show how long it takes to retrieve things from, and insert things into, my B-trees. This is not specific to Pijul, and it's not particularly optimized for Pijul; the only Pijul-related thing is that I'm not implementing long values yet, just because I've never needed them. I'm comparing four different implementations of key-value stores. The slowest one is a Rust library called sled. sled is super slow here, but it's also really cool: it uses state-of-the-art technology to do lock-free transactions on the database, so you can have a giant computer with thousands of cores, or more realistically hundreds of cores, and your transactions won't block each other while still having ACID guarantees. That's super cool, but it's still a research prototype, and for the kind of single-core stuff I'm doing, it's not super relevant. The green line is the fastest C library, LMDB; it's battle-tested and all that, and it's claimed in many places to be the fastest possible. Then there's Sanakirja, the system I've just introduced. And the orange line is a benchmark of something that cannot be achieved: it's the B-tree implementation from the Rust standard library, which doesn't store anything on disk. If you're storing stuff on disk, it will obviously take more time; the reason I added it is just to see how close we get. And we're not paying a lot, well, this is an SSD obviously, but we're not paying a lot, because we're minimizing the number of reads and writes to the disk. So that's it; the put benchmark shows a similar performance, and removing sled we can see it more clearly: this is about twice as fast as the fastest C equivalent.
Okay. And this was actually unexpected: performance was never the goal. The goal was just to be able to fork. Initially I contacted the author of LMDB to get him to introduce a fork primitive, but it was apparently impossible with that design, so I had to write my own.
All right, so now some announcements about a hosting platform. We all like working together, but we don't like setting up servers, so how do we collaborate and share repositories? One way to do it in Pijul, and that's been the case since the beginning, is self-hosted repositories over SSH, but that's not always convenient: you have to set up a machine in order to work together. So I wanted to build a hosting platform, and I actually built it: the first version was released quite a while ago, in 2016. It's written entirely in Rust, just like Pijul, with PostgreSQL to deal with user accounts, discussions, text, and so on.
It ran for a while on a single machine; it went through all the iterations of the Rust asynchronous ecosystem, so that's a lot of refactoring and rewrites, and it's never been really stable. But the worst time for stability was definitely OVH's Strasbourg data center fire, in March 2021. I saw a slide yesterday, in one of the talks, where someone talked about your server being on fire, but I don't think they really meant it; here I really do mean it: there was an actual fire in the actual data center. The machines were down for two weeks, and because this was an experimental prototype, it had no real backups, replication, or anything of the kind in place. So during those two weeks, I took advantage of that little break in my work to rebuild a replicated setup, using the fact that Pijul is itself a CRDT, so it's easy to replicate, and using the Raft protocol to replicate Postgres. At the time it was also convenient because my two largest contributors, the two largest contributors to Pijul, were using the Southern Cross cable, if you know what that means: they were communicating with the server in Strasbourg by first going from New Zealand and Australia to San Francisco, then across the US, across the Atlantic Ocean, and across France to Strasbourg. So they had absolutely unbearable latencies.
So this was cool and convenient, because I was finally able to give them a proper server with short latencies, short response times. It's been working okay for a little over two years now, but the problem is that, at the moment, this is a personal project, totally unfunded, so the machines are really small, and I'm using Postgres in ways that aren't really intended: the core of my database is actually Sanakirja and Pijul, it's not stored in Postgres, so these two databases need to communicate, and they need to be located close to each other. The replicas are not just backups, they're backups and caches at the same time. The consequence is that when the machines are under too high a load, Postgres takes a little more time to answer, the Raft layer interprets that as a total failure, and it triggers a switchover of the leader of the cluster. That would be okay if it just meant some downtime, but the consequence is actually way worse than downtime: it's data loss. Having small machines is fine; I don't mind if some of my users are using my experimental system and it crashes sometimes, or it's down for a little while, that doesn't really matter. But when they start losing data, that's a problem. So I decided to rewrite it.
Because I was working with Cloudflare Workers and function-as-a-service in other projects, my renewable energy projects, I started thinking about how we could use Pijul there. Really quickly: function-as-a-service differs from traditional architectures, where you have a big process overhead, or a virtual machine overhead, for each little piece of server you're running. Instead of doing that, you share the machine, and share a single giant JavaScript runtime between lots of different processes, or functions, even from other users: on each machine, Cloudflare uses one giant runtime shared by all its customers. That's really cool, because you can answer from all of Cloudflare's 250 data centers, and it gives you optimal latency.
It's also very easy to write. This is the minimal example taken from their documentation, where you're just answering "Hello worker" to a request.
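Their documented JavaScript example fits in a few lines; since the Pijul parts are in Rust, the equivalent with the workers-rs crate looks roughly like this (my transcription of the idea; check the crate's docs for the exact current API):

```rust
use worker::*;

// The whole "server": one function invoked per request, no process to
// manage, running inside the shared runtime described above.
#[event(fetch)]
pub async fn main(_req: Request, _env: Env, _ctx: Context) -> Result<Response> {
    Response::ok("Hello worker!")
}
```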
Now the question becomes: can we run, or at least simulate, a Pijul repository in a purely function-as-a-service framework? The storage options are fairly limited in function-as-a-service: you don't have access to a hard drive, you don't even have an actual machine. So how do you do that? Or at least, how do you pretend to be a full-fledged Pijul repository when in fact you're just some replicated, eventually consistent key-value store in the cloud? That's the main challenge; it's completely at odds with my hypotheses when I first wrote Sanakirja and Pijul: there is no hard drive at all.
The solution is to compile Sanakirja to WASM, because you can run WASM on Cloudflare Workers, and to store pseudo memory pages in their storage engine: instead of using disk sectors, I'm using keys and values in their storage engine.
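Because Sanakirja's storage layer is generic, as mentioned earlier, that swap amounts to implementing something like this hypothetical trait over a KV namespace instead of a file (a sketch of the idea, not the library's real interface):

```rust
// Pages are no longer 4096-byte sectors of a file but values in a
// replicated key-value store, keyed by page number.
trait PageStore {
    fn load(&self, page: u64) -> Vec<u8>;        // e.g. a KV "get"
    fn store(&mut self, page: u64, data: &[u8]); // e.g. a KV "put"
}
```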
The main problem now becomes eventual consistency. I'm solving that problem using the multiple roots I talked about earlier: I keep the older roots, because the changes I'm making to my key-value store may not have propagated to all data centers yet, so I keep the old roots around while they haven't propagated. Cloudflare guarantees, for example, a one-minute propagation time, so that's what I use to decide how long to keep my older branches, in order to avoid stepping on each other's toes. And we don't need a full Pijul there: checking dependencies and maintaining a list of patches is enough.
Okay, some technical details, as almost my conclusion. This service uses TypeScript for the web parts, for the UI, and Rust compiled to WASM for the Pijul parts. It can be self-hosted, although I've never tested that, using Cloudflare's runtime, which they've released as an open source project. It's open source, AGPL-licensed, and it will be released progressively, because there's a lot of stuff to release that's currently just experimental and prototypal. And it's starting today: I opened it just before the beginning of this talk. So now you can connect to nest.pijul.org and start creating accounts. There's no documentation, things may crash, there are probably lots of bugs, but all this will come in the next few days or weeks.
Okay, so as a conclusion: this is a new open source version control system, based on proper algorithms rather than the collections of hacks we've had for some time. It's scalable to monorepos and large files, and it's potentially usable by non-coders. The farthest stretch I've seen in discussions around this project is using it as a tool to help parliaments do their jobs. Parliaments are giant version control systems operated manually by highly qualified, highly paid lawyers, who are paid to check the logical consistency of the law, but actually spend a significant share of their time editing Word documents to apply changes voted by members of parliament. They're doing manual version control, and they're wasting lots of time on it. I've collaborated with the French parliament, which would have been a good test case, except we weren't actually using our parliament at the time: the cabinet was passing its bills as it wished. So it's like the test mode of an API.
It could be usable by artists; I've talked to lawyers as well; maybe by Sonic Pi composers, we had a really cool discussion about that last night; and why not by Lego builders wanting to build larger projects. The hosting service is available starting today, as I said.
And one more conclusion, a personal one: I have a tendency to work on way too many things at the same time, and it never works well, until it does. For example, working on electricity sharing at the same time as version control helped me see how these fit together, and share ideas across projects. To conclude, I would like to acknowledge some of my co-authors and contributors: Florent Becker, for all the discussions, inspiration, and early contributions; tankf33der, the most patient tester I've ever seen, who is still there after many years, patiently checking all my bugs, so a huge thanks to him; and Rohan Hart and Chris Bailey. Rohan Hart and Angus Finch are actually the two folks using the Southern Cross cable, and they've contributed really cool stuff to Pijul, while Chris Bailey has bridged the gap between lawyers, legal people, and what I'm doing. All right, thanks for your attention.