fetsorn/pijul-spec: resources/20230523_just_dont_git

[00:00.000 --> 00:13.760]  Hello everybody, welcome back to the Stack Overflow podcast, a place to talk all things
[00:13.760 --> 00:16.000]  software and technology.
[00:16.000 --> 00:21.280]  I'm your host Ben Popper, world's worst coder, joined as I often am by my colleague and collaborator
[00:21.280 --> 00:25.760]  Ryan Thor Donovan, editor of our blog and our newsletter.
[00:25.760 --> 00:29.800]  Ryan, you helped to set up today's episode, so refresh me here.
[00:29.800 --> 00:32.040]  Who's our guest and what are we going to be chit-chatting about?
[00:32.040 --> 00:35.480]  So it's Pierre-Étienne Meunier.
[00:35.480 --> 00:42.960]  He reached out because of this version control article I wrote, and he's creator, I believe,
[00:42.960 --> 00:49.720]  of Pijul, which was briefly mentioned in the article, didn't hear much about it, but
[00:49.720 --> 00:53.640]  somebody was interested in that it used patch algebra.
[00:53.640 --> 00:54.640]  Cool.
[00:54.640 --> 00:57.920]  This was in the article about you're not using Git, but why?
[00:57.960 --> 00:59.760]  Is this where we're getting back to?
[00:59.760 --> 01:03.840]  Yeah, it's the version controls that people use other than Git.
[01:03.840 --> 01:04.840]  Okay, gotcha.
[01:04.840 --> 01:08.080]  Well, then without further ado, Pierre, welcome to the program.
[01:08.080 --> 01:10.080]  Well, thanks for having me.
[01:10.080 --> 01:17.440]  Yeah, I did read the article and I was super interested in all the options and alternatives.
[01:17.440 --> 01:24.040]  The fact that Pijul was just briefly mentioned was like, okay, yeah, they probably haven't
[01:24.040 --> 01:25.840]  heard much of it.
[01:26.800 --> 01:31.320]  This is based on a previous version control system called Darcs.
[01:31.320 --> 01:36.560]  Some of the ideas come from Darcs, but at some point we were using Darcs, like me and
[01:36.560 --> 01:42.040]  a colleague, Florent Becker, were using Darcs to write a paper about something completely
[01:42.040 --> 01:47.360]  unrelated, like the tilings and geometry in computer science.
[01:47.360 --> 01:53.920]  After work, we went out for a beer and started discussing version control, and Florent told
[01:53.960 --> 02:01.400]  me that, well, as one of the last remaining maintainers of Darcs, he could tell me that
[02:01.400 --> 02:06.800]  the algorithm was not proper, it wasn't an actual algorithm, there was a bunch of holes
[02:06.800 --> 02:14.440]  in it, and this explains why it was so slow in dealing with conflicts.
[02:14.440 --> 02:18.280]  And so we started chatting about, oh, look, we're computer scientists, and so our job
[02:18.280 --> 02:23.160]  is to design algorithms and study their complexity, so, well, this is a job for us.
[02:23.160 --> 02:28.440]  We're also working on a model of distributed computing, so this was like, okay, this is
[02:28.440 --> 02:32.760]  exactly the kind of stuff we should be interested in.
[02:32.760 --> 02:39.200]  This is one of our chances to have an impact on something, and so there we started working
[02:39.200 --> 02:45.160]  on some bibliography first, we found some arguments about using category theory to solve
[02:45.160 --> 02:51.040]  the problem, and then we started working on that and writing code and code and more code
[02:51.120 --> 02:57.120]  and debugging, and it turned out to be a much bigger project than we first imagined.
[02:57.120 --> 03:02.120]  And so when I saw Ryan mention Darcs, and yeah, well, there's this new thing coming
[03:02.120 --> 03:08.120]  out, maybe someday, but it doesn't look finished yet and something, I reached out and was like,
[03:08.120 --> 03:13.240]  oh, yeah, but, well, you're interested in version control, and probably there's something
[03:13.240 --> 03:15.320]  we could chat about together.
[03:15.320 --> 03:19.280]  Yeah, you sit down for a beer, you start talking version control, and things always go a bit
[03:19.280 --> 03:22.240]  farther than you expected, I think that sounds about right.
[03:22.240 --> 03:26.000]  Pierre, can you give people just a super short synopsis?
[03:26.000 --> 03:29.600]  We started out talking about you already as an engineer and stuff like that, but how did
[03:29.600 --> 03:30.960]  you get into this field?
[03:30.960 --> 03:34.760]  What got you started down this path to the point where you're sitting around a bar coming
[03:34.760 --> 03:36.360]  up with your own version control system?
[03:36.360 --> 03:39.720]  How did you get educated in this world and enter into software development?
[03:39.720 --> 03:46.560]  I'm not educated at all, really not, I don't know anything about software engineering,
[03:46.560 --> 03:55.120]  I'm not an engineer myself, so I started coding when I was a bit young, I guess, on an old
[03:55.120 --> 04:00.800]  computer that my uncle left when he went out for some trouble.
[04:00.800 --> 04:08.360]  I think I was like 12 back then, and then I've been coding on and off, and then started
[04:08.360 --> 04:14.720]  studying mathematics and physics, got into logic, theorem proving, that kind of stuff,
[04:14.960 --> 04:21.720]  and where I'm from, there's one thing that is studied in France, which I don't think is studied
[04:21.720 --> 04:25.040]  anywhere else in the world, and that's the OCaml language.
[04:25.040 --> 04:32.200]  The OCaml language is when you grow up as a Frenchman and go to university to study
[04:32.200 --> 04:38.200]  general science, like the first two years of basic mathematics, science, physics, and
[04:38.200 --> 04:44.200]  all that, you get told that, well, this OCaml thing, that's really something French
[04:44.240 --> 04:48.440]  and it's really something we should be proud of, because there's this legend of computing,
[04:48.440 --> 04:52.760]  Xavier Leroy, he did everything there, and he was the first to show the world that you
[04:52.760 --> 04:58.040]  can design a functional programming language that's at the same time really fast.
[04:58.040 --> 05:03.400]  But then I was interested in that and wanted to study computer science, did a PhD in computer
[05:03.400 --> 05:11.000]  science, like theoretical computer science, but then I ended up working on theoretical
[05:11.000 --> 05:15.560]  computer science. What people call here fundamental computer science doesn't mean it's
[05:15.560 --> 05:21.000]  particularly important or useful, it just means it's like the basic things, the foundational,
[05:21.000 --> 05:27.000]  like probably foundational is a better name for that. So yeah, that's how I got started.
[05:27.000 --> 05:32.920]  So, you know, in researching the article, I started off way back in the day on Visual
[05:32.920 --> 05:39.480]  SourceSafe, and it seemed like there were natural developments, but Pijul interested
[05:39.480 --> 05:43.960]  me because it uses patch algebra, right? Can you talk about what that is?
[05:44.600 --> 05:51.720]  Yeah. So patch algebra is a completely different way of talking about version control,
[05:52.360 --> 05:57.560]  of thinking about version control. Why? Well, because instead of controlling versions,
[05:57.560 --> 06:02.360]  you're actually controlling changes. That's completely different. It's actually the dual
[06:02.360 --> 06:09.160]  of versions is changes. So instead of just saying, well, this version came after that version,
[06:09.560 --> 06:18.600]  which is something that's about CVS, RCS, SVN, Git, Mercurial, Fossil, and whatnot, like all
[06:18.600 --> 06:24.760]  this family of systems. So they keep controlling versions, they keep insisting on the version
[06:24.760 --> 06:31.800]  and the snapshots that come one after the other. In contrast to that, most of the research about
[06:32.360 --> 06:37.240]  parallel computing, about distributed data structures, focuses on changes.
[06:38.040 --> 06:43.800]  So a change is, for example, well, I introduced the line here and deleted the line there. I
[06:43.800 --> 06:49.880]  renamed the file from X to Z. I deleted that file, for example. I introduced a new file.
[06:51.000 --> 06:57.240]  I solved the conflict. That's also super important. And in contrast to talking only
[06:57.240 --> 07:05.000]  about snapshots and versions, this gives you much higher flexibility because all systems that deal
[07:05.000 --> 07:12.760]  with versions actually show versions or commits or snapshots as changes. If you look at a commit
[07:12.760 --> 07:17.720]  on GitHub, for example, you will never see the actual commit, as in you will never see the actual
[07:17.720 --> 07:23.720]  full entire version. What GitHub will show you when you ask about the commit is what it changed
[07:23.720 --> 07:29.000]  compared to its parents. So actually, there's a fundamental mismatch in the way people think
[07:29.000 --> 07:35.160]  about version control when they use Git. They think, well, everything they see is changes or
[07:35.160 --> 07:40.520]  differences. And then everything they need to reason about when they actually use the tool
[07:40.520 --> 07:48.040]  is versions. So how can you reason about that? Well, we found ways around it, right? We have all
[07:48.040 --> 07:55.640]  these workflows and Git gurus that will tell you what you should and should not do and all that.
[07:55.640 --> 08:01.240]  You have good practices and all these things. But fundamentally, what these good practices aim at
[08:01.240 --> 08:06.600]  is getting around this fundamental mismatch of having to think about
[08:06.600 --> 08:13.800]  something you never see. So what patches and change algebra gives you is that now you can reason about
[08:13.800 --> 08:19.240]  things. So you can say, well, these two patches are independent, so I can reorder them in history.
[08:19.880 --> 08:27.560]  This sounds like a completely useless and unimportant operation, but it's not. What that
[08:27.560 --> 08:34.280]  means is you can actually, for example, you can take a bug fix from a remote and
[08:34.280 --> 08:39.880]  cherry pick it into your branch without any consequences. You will just cherry pick the
[08:39.880 --> 08:45.080]  bug fix and that's it. And it will just work. You won't have to worry about having to merge that
[08:45.080 --> 08:50.120]  branch in the future. You won't have to worry about any of that. And if that bug fix turns out
[08:50.120 --> 08:56.280]  to be bad and turns out to be inefficient, for example, and you've continued working,
[08:56.280 --> 09:01.720]  well, you can still go back and remove just that bug fix without touching any of your further work.
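
The reordering argument above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Pijul's actual data structure: the `Patch` and `Repo` names are hypothetical, a repository is reduced to a set of applied patches, and each patch only lists the patches it depends on. Two patches with no dependency between them can be applied in either order, and a patch can be removed later as long as nothing else depends on it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patch:
    name: str
    deps: frozenset = frozenset()  # names of patches this one builds on


class Repo:
    """Toy repository: a set of applied patches, not a chain of snapshots."""

    def __init__(self):
        self.applied = {}  # patch name -> Patch

    def apply(self, patch):
        missing = patch.deps - set(self.applied)
        if missing:
            raise ValueError(f"{patch.name} needs {sorted(missing)} first")
        self.applied[patch.name] = patch

    def unrecord(self, name):
        # A patch can only be removed if no applied patch depends on it.
        dependents = sorted(p.name for p in self.applied.values() if name in p.deps)
        if dependents:
            raise ValueError(f"{name} is still needed by {dependents}")
        del self.applied[name]

    def state(self):
        return frozenset(self.applied)


base = Patch("base")
bugfix = Patch("bugfix", frozenset({"base"}))      # cherry-picked from a remote
my_work = Patch("my_work", frozenset({"base"}))    # local work, independent of bugfix
more_work = Patch("more_work", frozenset({"my_work"}))

# Independent patches commute: both orders end in the same state.
a, b = Repo(), Repo()
for p in (base, bugfix, my_work):
    a.apply(p)
for p in (base, my_work, bugfix):
    b.apply(p)
assert a.state() == b.state()

# Keep working, then drop only the bug fix; the later work is untouched.
a.apply(more_work)
a.unrecord("bugfix")
assert a.state() == frozenset({"base", "my_work", "more_work"})
```

In this sketch the state is just the set of patch names, which is what makes the order irrelevant; a real system also has to define what each patch does to file contents.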
[09:02.360 --> 09:07.800]  So this gives you this flexibility that people actually want to reason about. So when you're
[09:07.800 --> 09:13.640]  using Git, you're constantly rebasing and merging and cherry picking. And there's also all these
[09:13.640 --> 09:19.400]  commands to deal with conflicts, which Git doesn't really model. There's no conflict in
[09:19.400 --> 09:25.320]  commits. Conflicts are just failures to merge and they're never stored in commits. They're
[09:25.320 --> 09:30.600]  just stored in the working copy. And so when you fix a conflict, Git doesn't know about it.
[09:30.600 --> 09:35.400]  It just knows that, oh, here's the fixed version. So this means that if you have to fix the same
[09:35.400 --> 09:40.120]  conflict again in the future, well, Git doesn't know about it. It just knows that, well, there
[09:40.120 --> 09:44.920]  was this conflict or there was these two versions that the user tried to merge. And then there was
[09:44.920 --> 09:49.000]  this version with the conflicts fixed, but it doesn't know how you fixed the conflict.
[09:49.000 --> 09:54.920]  So your conflicts might reappear and you might have to solve them again, or you might even have
[09:54.920 --> 09:59.800]  conflicts that just appear out of the blue. And then you don't know what these conflicts are about
[09:59.800 --> 10:05.240]  and you still have to solve them. And in contrast to that, having this ability to reorder your
[10:05.240 --> 10:11.560]  changes gives you a possibility to just remove one side of the conflict without touching the other
[10:12.200 --> 10:18.360]  or model precisely what happens when you change things. It also forces you to look at all the
[10:18.360 --> 10:23.880]  cases. When you look at all the cases of a merge, you're like, okay, what are all the cases of a
[10:23.880 --> 10:28.920]  conflict? Well, for example, if two people introduce a file with the same name in parallel, it's a
[10:28.920 --> 10:35.880]  conflict. If I change a function's name and if Alice changes a function's name and Bob at the
[10:35.880 --> 10:40.280]  same time in parallel calls that function, what should happen? Is that a conflict? Well,
[10:40.280 --> 10:47.960]  Pijul actually doesn't model that, but it does model a large number of cases of conflicts. And
[10:47.960 --> 10:54.440]  so this is much easier. It will probably save a lot of expensive engineering time.
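
One of the cases just listed, two people adding a file with the same name in parallel, can be sketched as follows. This is an illustrative Python model only: the function names and the `(author, filename)` representation are assumptions made for the example, not Pijul's format. The point it tries to show is that a conflict can be a recorded state rather than a failed merge, and that resolving it is one more change.

```python
from collections import defaultdict


def merge_adds(*change_sets):
    """Union parallel 'add file' changes; each change is (author, filename)."""
    by_name = defaultdict(set)
    for changes in change_sets:
        for author, filename in changes:
            by_name[filename].add(author)
    return dict(by_name)


def conflicts(state):
    """A name is in conflict when more than one author introduced it."""
    return {name: sorted(authors) for name, authors in state.items() if len(authors) > 1}


def resolve(state, filename, keep):
    """A resolution is itself a change: keep one side of the conflict."""
    if keep not in state[filename]:
        raise ValueError(f"{keep} did not add {filename}")
    resolved = dict(state)
    resolved[filename] = {keep}
    return resolved


alice = [("alice", "README"), ("alice", "main.rs")]
bob = [("bob", "README")]

state = merge_adds(alice, bob)
assert state == merge_adds(bob, alice)            # merge order does not matter
assert conflicts(state) == {"README": ["alice", "bob"]}
assert conflicts(resolve(state, "README", keep="alice")) == {}
```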
[10:54.600 --> 11:02.520]  Listen to season two of Crossing the Enterprise Chasm, hosted by WorkOS founder Michael
[11:02.520 --> 11:07.560]  Grinich. Learn how top startups move up market and start selling to enterprises with features
[11:07.560 --> 11:13.880]  like single sign-on, directory sync, audit logs, and more. Visit workos.com slash podcast,
[11:13.880 --> 11:16.360]  make your app enterprise ready today.
[11:16.360 --> 11:25.560]  Yeah, in my experience, the merge conflicts are very manual. So it takes a lot of time to actually
[11:25.560 --> 11:31.400]  resolve them. Does Pijul and the patch algebra, does that help reduce the manual load?
[11:31.400 --> 11:36.920]  Yeah, absolutely. So first of all, you have much less conflicts. Why? Well, because all these
[11:36.920 --> 11:41.960]  artificial conflicts that Git just invents out of nothing, just because you didn't follow the
[11:41.960 --> 11:47.160]  good practices, for example, or you have long lived branches for some reason, because your
[11:47.160 --> 11:54.680]  job requirements need that. So you won't have all these conflicts. So there is a lot less
[11:54.680 --> 12:00.760]  manual work to do first, because there is less problems to fix. And then when you're in the
[12:00.760 --> 12:07.960]  process of solving the conflicts, what happens in Pijul is that we keep track in the data structure
[12:07.960 --> 12:15.080]  used to merge the patches. We keep track of who introduced which bytes. It's down to the byte level.
[12:15.080 --> 12:21.880]  It's still super efficient, but we know exactly who introduced which bytes in which patch.
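
A minimal sketch of byte-level attribution, in Python. It assumes a deliberately naive representation, one `(byte, patch name)` pair per live byte, chosen for readability; the speaker says the real structure is efficient, so this should not be read as how Pijul stores it. `TrackedFile` and its methods are hypothetical names.

```python
from itertools import groupby


class TrackedFile:
    """Toy file where every live byte remembers the patch that introduced it."""

    def __init__(self):
        self.cells = []  # one (byte value, patch name) pair per live byte

    def insert(self, patch, offset, data):
        self.cells[offset:offset] = [(byte, patch) for byte in data]

    def delete(self, offset, length):
        del self.cells[offset:offset + length]

    def content(self):
        return bytes(byte for byte, _ in self.cells)

    def origin(self, offset):
        """Name of the patch that introduced the byte at `offset`."""
        return self.cells[offset][1]

    def spans(self):
        """Consecutive runs of bytes grouped by the patch they came from."""
        return [
            (patch, bytes(byte for byte, _ in group))
            for patch, group in groupby(self.cells, key=lambda cell: cell[1])
        ]


f = TrackedFile()
f.insert("alice-1", 0, b"hello world")
f.insert("bob-1", 5, b",")

assert f.content() == b"hello, world"
assert f.origin(5) == "bob-1"
assert f.spans() == [("alice-1", b"hello"), ("bob-1", b","), ("alice-1", b" world")]
```

With this kind of attribution available, a tool showing a conflict can label each side with the patch it came from instead of showing two anonymous blocks of text.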
[12:21.880 --> 12:27.640]  We can tell, okay, this byte comes from that patch. And so this is a really useful tool to,
[12:27.640 --> 12:31.880]  you know, if you want to solve conflicts, because you can, while you're solving a conflict,
[12:31.880 --> 12:38.120]  you can know exactly what the sides of the conflict are, and this helps to solve them much,
[12:38.120 --> 12:44.040]  well, in my experience at least, helps to solve conflicts much, much more easily. So I think this is
[12:44.040 --> 12:51.320]  going to save a lot of time. I was going back in our history of podcasts we recorded, and I
[12:51.320 --> 12:56.840]  remember now we sat down with Arthur Breitman, who is also educated in France to talk about Tezos
[12:56.840 --> 13:03.800]  and the blockchain, and why he loves OCaml. So you're right. For every child who was educated
[13:03.800 --> 13:07.720]  in that, something interesting came out of it. A source of national pride and some interesting
[13:07.720 --> 13:12.760]  ideas about functional programming. Well, the initial version of Rust was also written in OCaml.
[13:12.760 --> 13:18.760]  See? Today I learned. So one of the interesting splits I found in the version control article was
[13:19.320 --> 13:26.680]  between folks who deal with mostly code and their source control, and places like
[13:26.680 --> 13:33.160]  video game companies that have large binaries. Does patch algebra apply to the binary files as well?
[13:34.040 --> 13:39.240]  Absolutely. Because when you're describing changes, when you're describing what happens
[13:39.240 --> 13:47.000]  in the change, you might say things like, oh, Alice today introduced that file. She added the
[13:47.000 --> 13:55.400]  file to the repository, and the file is like two gigabytes in size. And so there's the actual two
[13:55.400 --> 14:00.760]  gigabytes, which Git might store, for example. Well, you'd better use LFS if you do that.
[14:00.760 --> 14:07.400]  But in a classic version control, you might just add the file to SVN, for example. You might just
[14:07.400 --> 14:13.480]  upload the file, and that's it. When you're describing changes, so you can try to do that
[14:13.480 --> 14:18.840]  in Darcs, but I don't recommend it for performance reasons. But in Pijul, what you'll tell actually,
[14:18.840 --> 14:24.440]  you'll be like, okay, here's a change. Alice introduced two gigabytes. That's what I just
[14:24.440 --> 14:30.680]  said is very short. And it's just like one file, and the information is really tiny. It's just like
[14:30.680 --> 14:34.840]  logarithmic in the actual two gigabytes. And then there's the two gigabytes themselves.
[14:35.640 --> 14:41.240]  And the thing is, using patches, you can separate the contents of the patches from what the patch
[14:41.240 --> 14:46.360]  did. So by modeling the actual operation, you can be like, okay, I can apply this patch without
[14:46.360 --> 14:52.120]  knowing what's in the file. I can just say that I added two gigabytes without telling you what the
[14:52.120 --> 14:59.880]  two gigabytes are. So this sounds like, okay, how can this be useful? Well, if Alice goes on and
[14:59.880 --> 15:07.000]  writes multiple versions of the two gigabyte file, she might just go on and do that, upload a few
[15:07.000 --> 15:11.880]  versions. And then when you want to know what the contents of the file are, you don't have to download
[15:11.880 --> 15:15.960]  everything, you just have to download: well, Alice added two gigabytes here, then she modified the
[15:15.960 --> 15:22.040]  file, added another gigabyte, then she compressed stuff and did something. And then like there's
[15:22.040 --> 15:28.040]  another three gigabytes patch. But then you don't have to download any of that, just have to download
[15:28.040 --> 15:36.600]  the information that Alice did some stuff. And then after the fact that you applied all of Alice's
[15:36.600 --> 15:41.400]  changes, you can just say, okay, here are the two, like the remaining parts of the file that are
[15:41.400 --> 15:47.400]  still alive after all these patches are those bytes. And now I just have to download
[15:47.400 --> 15:53.560]  those bytes. So maybe I'll just end up downloading one full version of the file or two gigabytes,
[15:53.560 --> 15:57.640]  but I won't download the entire history, going through all the versions one by one.
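
The idea of applying patches without their contents can be sketched like this. The Python below is a simplified model under explicit assumptions: operations are plain tuples, `alive_spans`, `materialize`, and `fetch_range` are hypothetical names, and small byte strings stand in for the multi-gigabyte blobs. It replays only the metadata (which patch added how many bytes where, and what was deleted), works out which byte ranges are still alive, and only then fetches those ranges.

```python
def _split(spans, offset):
    """Ensure a span boundary exists at `offset`; return the index that starts there."""
    pos = 0
    for i, (patch, start, end) in enumerate(spans):
        if offset == pos:
            return i
        size = end - start
        if offset < pos + size:
            cut = start + (offset - pos)
            spans[i:i + 1] = [(patch, start, cut), (patch, cut, end)]
            return i + 1
        pos += size
    if offset == pos:
        return len(spans)
    raise IndexError(f"offset {offset} is past the end of the file")


def alive_spans(ops):
    """Replay metadata only. Each span is (patch, start, end) into that patch's blob."""
    spans = []
    for op in ops:
        if op[0] == "add":
            _, patch, offset, length = op
            spans.insert(_split(spans, offset), (patch, 0, length))
        elif op[0] == "delete":
            _, offset, length = op
            i = _split(spans, offset)
            j = _split(spans, offset + length)
            del spans[i:j]
        else:
            raise ValueError(f"unknown operation {op[0]!r}")
    return spans


def materialize(spans, fetch_range):
    """Fetch only the byte ranges that survived and concatenate them."""
    return b"".join(fetch_range(patch, start, end) for patch, start, end in spans)


# Tiny blobs standing in for large binary contents stored next to the patches.
blobs = {"A1": b"aaaa", "A2": b"bbbb", "A3": b"cc"}
ops = [
    ("add", "A1", 0, 4),   # Alice adds the file
    ("add", "A2", 4, 4),   # she appends more data
    ("delete", 0, 6),      # she removes most of the old data
    ("add", "A3", 2, 2),   # she appends a final chunk
]

downloaded = []


def fetch_range(patch, start, end):
    downloaded.append(end - start)
    return blobs[patch][start:end]


spans = alive_spans(ops)
assert spans == [("A2", 2, 4), ("A3", 0, 2)]
assert materialize(spans, fetch_range) == b"bbcc"
assert sum(downloaded) == 4  # versus 10 bytes across the whole history
```

In this toy run only 4 of the 10 bytes ever added are fetched, because the rest were deleted by later patches and never need to leave the server.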
[15:58.600 --> 16:04.920]  So like, I believe I've never tested that at scale on an actual video game project. But I
[16:04.920 --> 16:11.160]  believe that this has the potential to save a lot of bandwidth and make things a lot easier
[16:11.160 --> 16:15.640]  for video game studios. And actually, I have a project going on with the authors of
[16:16.440 --> 16:23.720]  Godot, the open source video game studio, like a video game editor. So we'll see what
[16:23.720 --> 16:31.000]  comes out of that. But we're totally aligned in what we want to do. We're like fully open source.
[16:31.080 --> 16:36.360]  So there's something exciting and new going on in the video game industry.
[16:37.320 --> 16:40.280]  I think Godot is really bringing in a lot of fresh air.
[16:40.840 --> 16:47.960]  Yeah, I mean, the fully open source projects are very popular. If you have a bug you want to fix,
[16:47.960 --> 16:53.720]  you can just go fix it. And it's been fascinating to see sort of the race going on these days between
[16:53.720 --> 16:59.560]  the closed corporate world that's developing cutting edge AI, and all of the places like
[16:59.560 --> 17:04.440]  Hugging Face and Stable Diffusion and you know, others that are trying to keep pace with them in
[17:04.440 --> 17:08.520]  an open source way and with a kind of a community of contributors. So very cool.
[17:08.520 --> 17:15.320]  Yeah. So we were talking about it doesn't create versions. It seems to me that the version system
[17:15.320 --> 17:21.720]  is sort of a legacy from when we actually burned disks or released binaries to download and install.
[17:22.280 --> 17:28.760]  With your background in distributed computing, do you think that this can be a better way to
[17:29.560 --> 17:32.520]  update and maintain all the distributed systems we have now?
[17:33.080 --> 17:41.960]  Yeah, I hope so at least. So one really cool example I can give is NixOS. So NixOS is not really
[17:41.960 --> 17:47.480]  a Linux distribution. It's actually a language with a massive standard library containing a lot of
[17:47.480 --> 17:54.360]  packages. And you can use this language to build your own system. So that's the promise of NixOS.
[17:55.000 --> 18:01.800]  And so while doing so, for example, if you're maintaining machines in the cloud, you probably
[18:01.800 --> 18:09.400]  want to build an image and use like one custom version of what's called nixpkgs, the standard
[18:09.400 --> 18:16.040]  library of NixOS. And so you want to customize this in one way or another and then release some
[18:16.040 --> 18:23.480]  of your patches to the official central repository for nixpkgs, but then keep some of the others
[18:23.480 --> 18:29.080]  for yourself. So you want multiple of these. In Git, you would do that by having lots of different
[18:29.080 --> 18:37.160]  branches or feature branches, which you can push to the central repository. Then you would work on
[18:38.360 --> 18:45.320]  another branch, which would be the merge of all those patches plus the changes that occur in the
[18:45.320 --> 18:49.960]  nixpkgs central repository. Then this quickly becomes a nightmare to maintain because you have
[18:49.960 --> 18:55.000]  to keep rebasing your changes on top of each other and on top of what happens in nixpkgs.
[18:55.000 --> 19:01.800]  Then when your changes get merged in nixpkgs, you get conflicts. And so you have to go back to
[19:01.800 --> 19:10.040]  some old commits, which might not even exist. So I believe that maintaining multiple versions or
[19:10.040 --> 19:17.560]  multiple fixes at the same time to one tool can be much, much, much easier using tools like Pijul.
[19:17.640 --> 19:24.040]  So there's one announcement I can make. This is something I've been working on for a while.
[19:24.680 --> 19:30.840]  So Pijul has its own sort of like GitHub thing for Pijul, which is called The Nest.
[19:31.480 --> 19:37.960]  And so far it's been not super successful, neither commercially nor, I should say,
[19:38.680 --> 19:45.240]  industrially because it doesn't scale very well. It's been through a data center fire,
[19:45.240 --> 19:52.680]  if you remember, two years ago in Strasbourg in the OVH fire. And so it's using a replicated
[19:52.680 --> 19:57.800]  architecture, but it's not very satisfactory. It's written in Rust. It operates in three different
[19:57.800 --> 20:03.240]  data centers, but it's not easy to maintain. So I've been working on a new serverless infrastructure
[20:03.240 --> 20:09.240]  for that, Function as a Service. So Function as a Service providers don't give you an actual disk
[20:09.240 --> 20:16.040]  on which you could run Pijul, but I've been able to fake Pijul repositories using Cloudflare's
[20:16.040 --> 20:24.360]  KV, for example. And so this gives infinite scalability and an excellent reliability. So
[20:25.160 --> 20:32.040]  I'm working on that. My prototype is very close to being ready. So I hope I'll be able to release
[20:32.040 --> 20:36.120]  that in a few days, or in the worst case, a few weeks.
[20:40.680 --> 20:44.280]  All right, everybody. It is that time of the show. We want to shout out someone who came
[20:44.280 --> 20:48.280]  on and helped save a little knowledge from the dustbin of history, answered a question.
[20:49.400 --> 20:59.640]  Today, the lifeboat was awarded to Rachit, R-A-C-H-I-T, passing objects between fragments.
[20:59.640 --> 21:02.360]  That's the question. It's not really a phrase. It's a question. They're using the built-in
[21:02.360 --> 21:06.280]  navigation drawer. They've got fragment menus, and they want to communicate over those fragments,
[21:06.280 --> 21:11.640]  passing data from one to another. This is about Android fragments. So if you ever wanted to pass
[21:11.640 --> 21:16.440]  objects between Android fragments, get that data moving around, we have an answer for you. And
[21:16.440 --> 21:20.040]  thanks to Rachit and congrats on your lifeboat badge. Appreciate you sharing some knowledge on
[21:20.040 --> 21:25.880]  Stack Overflow. All right, everybody. Thanks for listening. I am Ben Popper, Director of Content
[21:25.880 --> 21:30.280]  here at Stack Overflow. You can always find me on Twitter, @benpopper. You can always reach us
[21:30.280 --> 21:34.680]  with questions or suggestions, podcast@stackoverflow.com. If you like the show,
[21:34.680 --> 21:41.960]  leave us a rating and a review. It really helps. And thanks for listening. I'm Ryan Donovan. I edit
[21:41.960 --> 21:47.560]  the blog here at Stack Overflow. It's located at stackoverflow.blog. And if you want to reach out
[21:47.560 --> 21:53.720]  to me, you can find me on Twitter @RThorDonovan. Well, I'm Pierre-Étienne Meunier,
[21:54.440 --> 22:01.000]  and you can browse pijul.com or .org if you want to know more about this project
[22:01.640 --> 22:06.920]  and send a message to pe@pijul.org. All right, everybody. Thanks for listening,
[22:06.920 --> 22:18.280]  and we will talk to you soon.