#359 Default support for malleable identifiers

Opened by yaahc on February 25, 2021 UI
yaahc on February 25, 2021

Problem Statement and Background

This issue is a follow up to the following blogpost and twitter thread:

To quote the relevant section of the blog post

And just for kicks, a couple of extra features that nobody has but everybody should:

  • contributor names & emails aren’t immutable - trans people exist, and git filter-graph makes it about as difficult to change my name as the state of Colorado did

As far as I know all version control systems available today immutably record contributor names as part of their history, where rewriting of history is necessary to remove references to old names. This harms trans people who frequently change their name as part of their transition process. These old names are commonly referred to as “deadnames” and can be a source of emotional distress.

Proposal

Pijul should default to making it easy to update all possibly gendered identifiers, including the Full Name, pseudo identifier (name), and email address. This needs to be enabled by default, rather than a possible configuration if you think ahead, because usually people who change their name don’t realize they’re going to do so eventually. I’m hesitant to make suggestions on how to implement this though because I am currently unfamiliar with pijul.

Possible Concerns

Minimizing impact on first time setup

From @nuempe:

Ok. We need a convenient system to customise that. One of the difficulties is that we want people to make their first patch as fast as possible. This VCS is meant to be extremely user-friendly.

Stolen Identity

I expect that people will be worried that these changes will make it easy for people to change the names associated with contributions from other contributors, and I don’t want to replace a problem that hurts one group of people with another problem that still just hurts a different group of people. I can think of at least a few ways to defend against this that don’t require recording people’s deadnames for eternity.

Include a history of identifier updates

The point here is to not record what the old name was, but when it was changed and that it was changed. This will help prove that an identity hasn’t been stolen, but doesn’t necessarily help prove that one has been, who stole it, or what it originally was.

Permanently record a hash of the original identifiers

This could be combined with the previous solution. This would require the person who claims their identifiers were modified to reveal their original identifiers, but this shouldn’t be a problem in the general case, since most people this happens to wouldn’t be trans and wouldn’t be trying to erase records of their old name. This may still cause problems if someone steals the identity of a trans person who has already previously updated their identifiers, since they may need to reveal the original name to convince people that the full string of identifiers refers to them.

Use some sort of cryptographic signature as the fundamental identifier

I’m sure there is probably some clever application of cryptography that could be used here to ensure that only the person who originally made a set of commits can update their name, or that would at least make it so you can tell when someone else updated the name for them.

ambersz on February 25, 2021

Some additional motivations for and possible issues with names and e-mail changes:

  • Pseudonym changes are common– no reason necessary

  • Email change due to switching email services would otherwise make it impossible to contact the person responsible for a change

  • Organization membership/re-org:

    • Transitioning a personal project into an organization also involves email changes– I work on my own projects with my personal email, but if I need to transition that project into a professional setting, I would prefer to update the email associated with all my changes in that project.
    • In an organization/company setting, if a contributor leaves, it could be useful to change the e-mail address for old changes to the new person responsible for knowing these changes. (If I’m reading the code and I need help understanding why a change was made or business reasons for the code in this area, who should I contact? If the person who originally made the changes has changed roles or left the organization, the ability to update the email or add a new contact email would be helpful)
  • Inconsistent change attribution per device/service: Based on my experiences with git/github, making changes through a third-party UI is attributed differently because it doesn’t have the same config as your local device. Having a consistent identity across devices and services could simplify the process to update your name and email address.

Use some sort of cryptographic signature as the fundamental identifier

I’m a fan of this idea. If all change records are signed by default, attribution identities can be associated with the cryptographic identity. Additionally, changes to a mapping from cryptographic identity to attribution identity would be independent of other changes to the repository. That would allow displaying up-to-date attribution identity even when checking out old tags/channels.

Additional QOL possibilities:

  • My understanding of pijul’s change system is that a change with no dependencies would be valid on any repository. If there was some way of configuring identity at the ~/.config/pijul/config.toml level, and recording those as changes, it could allow propagating that one change to all repositories that are on the local device, simplifying the identity creation/update process.

  • In pijul, the contents of a change are independent of the byte-range that the change applies to. This could enable an identity change to indicate that the old full name/name/email should be discarded, completely removing deadnames from the history of the repo.

zseri on February 25, 2021

As I’ve already had this problem multiple times with git, I would greatly appreciate a proper solution. (I’ve changed both the primary name I put into commits and my e-mail address multiple times, and it is not nice to have a basically “outdated” history, even on private repos. A complete rewrite is impractical there, as it changes the identities of changes and makes updating every clone of the repo a hassle, e.g. needing to resort to a mix of rebasing and git reset.) The ideas presented in the previous two posts are really good.

Use some sort of cryptographic signature as the fundamental identifier

I’m also a fan of this idea. The cryptographic signature should of course not be tightly tied to a fixed name. An example of this would be SSH keys, a non-example would be classic SSL certificates and GPG keys.

Include a history of identifier updates.

That is a good idea. It would be a good idea to have support for signing these updates. Notes above also apply.

edited: This could be done using a separate directory in the tree which associates KEYID (filename) -> name, email, etc. This separate directory should be created as part of pijul init. This simple approach has the advantage that it might more-or-less simply “work”. But it has an obvious, already noted disadvantage: it would preserve outdated information, and could be harmful (e.g. deadnames). It would probably be a good idea to include only hashes in the repo as-is, and have another mechanism to save the current up-to-date per-contributor information in cleartext.

My understanding of pijul’s change system is that a change with no dependencies would be valid on any repository. If there was some way of configuring identity at the ~/.config/pijul/config.toml level, and recording those as changes, it could allow propagating that one change to all repositories that are on the local device, simplifying the identity creation/update process.

I think there should be a subcommand of pijul which just registers these changes, if there are any. This could be put into a hook, which does this on every record. I’m unsure if these changes (as above, changes in the “keyid-hash-change-log”) should be dependencies of the changes which the user makes afterwards (i.e. after changing this personal (contact) information). This might be just a matter of personal taste, but I don’t like things which implicitly mutate the state of repositories, so the behavior should be controllable per repo, preferably via a subcommand or via the .pijul/config file.

pmeunier added tag UI on February 26, 2021
zseri on February 26, 2021

It should be noted that this definitely affects the format of changes and the format of data stored on disk, and is thus not a UI-only aspect.

Ralith on February 26, 2021

The traditional problem with cryptographic identities is that they must be carefully managed, whereas with git-style metadata you can just set your configuration however you like. SSH keys are an interesting model here because, unlike traditional cryptographic identities, it’s typical for a single user to have many identities. An identity -> metadata mapping table should hence probably be a many-to-many relationship, so an arbitrary collection of cryptographic IDs can all be considered aliases for the same arbitrary (mutable) collection of metadata.
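
A minimal sketch of such an alias table, using only the Rust standard library. All names here (`IdentityTable`, `Metadata`, `ProfileId`, `KeyId`) are hypothetical and not taken from libpijul. It assumes that many keys point at one profile, and that each profile holds a mutable collection of names and emails:

```rust
use std::collections::HashMap;

/// Hypothetical opaque key fingerprint.
type KeyId = Vec<u8>;
/// Hypothetical handle for one contributor profile.
type ProfileId = u32;

/// The mutable, human-readable side of an identity.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Metadata {
    pub names: Vec<String>,
    pub emails: Vec<String>,
}

/// Many keys may alias one profile; a profile holds many names and emails.
#[derive(Default)]
pub struct IdentityTable {
    key_to_profile: HashMap<KeyId, ProfileId>,
    profiles: HashMap<ProfileId, Metadata>,
    next_id: ProfileId,
}

impl IdentityTable {
    /// Registers a new profile and returns its handle.
    pub fn add_profile(&mut self, meta: Metadata) -> ProfileId {
        let id = self.next_id;
        self.next_id += 1;
        self.profiles.insert(id, meta);
        id
    }

    /// Declares `key` to be an alias for `profile`.
    pub fn link_key(&mut self, key: KeyId, profile: ProfileId) {
        self.key_to_profile.insert(key, profile);
    }

    /// Replaces the metadata of a profile; returns false if it does not exist.
    pub fn update(&mut self, profile: ProfileId, meta: Metadata) -> bool {
        match self.profiles.get_mut(&profile) {
            Some(slot) => {
                *slot = meta;
                true
            }
            None => false,
        }
    }

    /// Resolves a key to the current metadata of the profile it aliases.
    pub fn lookup(&self, key: &[u8]) -> Option<&Metadata> {
        let profile = self.key_to_profile.get(key)?;
        self.profiles.get(profile)
    }
}
```

Because every key resolves through the profile, a single `update` changes what is displayed for all changes signed by any of the aliased keys.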

zseri on February 26, 2021

optional (e.g. necessary to think about it, but not a blocker): SSH keys are probably a really good initial idea for managing this, because they are really easy to use when compared to GPG keys and SSL certs. It should be noted that it should be possible to post some kind of revocation to the repo, such that old changes can still be verified, but newer changes which are signed with the revoked key are considered invalid or untrustworthy (until signed by a valid key). edit: this should be verified on the server end, as the client isn’t trustworthy in that case. Revocation is part of what makes GPG and SSL more complex, and could also be done completely out-of-band.
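
The revocation rule above could be sketched roughly as follows. This is only an illustration under stated assumptions: `RevocationList` and `Trust` are hypothetical names, signature verification itself is left out, and `received_at` is assumed to be the time at which the server received the change (not a client-supplied timestamp, since the client isn’t trusted here):

```rust
use std::collections::HashMap;

/// Hypothetical 32-byte public key used as the identifier.
type KeyId = [u8; 32];

#[derive(Debug, PartialEq, Eq)]
pub enum Trust {
    /// The key was not revoked when the change was received.
    Trusted,
    /// The change arrived at or after the revocation of its key.
    Revoked,
}

#[derive(Default)]
pub struct RevocationList {
    revoked_at: HashMap<KeyId, u64>,
}

impl RevocationList {
    /// Records a revocation; if a key is revoked twice, the earliest time wins.
    pub fn revoke(&mut self, key: KeyId, at: u64) {
        self.revoked_at
            .entry(key)
            .and_modify(|t| *t = (*t).min(at))
            .or_insert(at);
    }

    /// Old changes stay trusted; changes received after revocation do not.
    pub fn check(&self, key: &KeyId, received_at: u64) -> Trust {
        match self.revoked_at.get(key) {
            Some(&t) if received_at >= t => Trust::Revoked,
            _ => Trust::Trusted,
        }
    }
}
```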

yaahc on March 2, 2021

For what it’s worth, I would be happy to contribute whatever implementation we decide upon for this, I’ll just need a clear description for how this should be implemented since I’m completely unfamiliar with the codebase. I’m interested in making sure this feature gets implemented and don’t want it to sit around waiting for someone to make time for it.

zseri on March 2, 2021

It is also a priority that this should be ready and merged before the 1.0.0 release, as it’ll get harder to change the on-disk format later on, and I suppose an on-disk format change would be needed to avoid hacky workarounds or edge cases. I think it might be a good idea to start at libpijul/src/change.rs, and change the definition of Author to something which fits (but I’m not completely sure what might fit, …) maybe:

enum Author {
    Ed25519 {
        // values taken from the `ring::signature::Ed25519*` implementation
        pubkey: [u8; 32],
        // I think the signature should be over the content of a change, so that it does not change when other things change
        // added benefit of basing that upon signatures: the authorship can be easily verified.
        // but this is not a requirement, thus e.g. older signature algorithms could be kept,
        // while just stating that the signature was not verified, if the algo is not yet, or not anymore, supported.
        sign: [u8; 64],
    },
    // ... other signature algos or possibly even blanket impls which allow "unstructured pseudo-identifiers",
    // similar to `name` now, but it should not be the default. I'm unsure if such an escape hatch might be useful.
    // a better idea would be to just keep this structure extensible.
    // (perhaps by storing it in a structure which can deal with arbitrarily long byte sequences and at 255 different "bases" (e.g. algos))
}

After that, check which parts break using cargo test. (I really hope that no other parts have implicit assumptions about that structure, but we can’t be sure until we test it.) Then there should be a new root in the pristine, I suppose (based upon my extremely limited knowledge of the current architecture), which maps algorithm (the Author enum variant) + pubkey to (name, email, etc.).

boringcactus on March 24, 2021

I think one solution is

  • changes don’t record author name/email but instead a key (or key fingerprint, or UUID, or some other non-human-readable identifier) so that the only questions answerable by looking at change author info are “were these written by the same person” - immutability here doesn’t cause problems, I think
  • repository metadata includes a list of “as of this timestamp, this is my name, this is my email” messages (“nametags” i think is a good way to refer to them) signed by keys attached to changes - human-readable logs can trust those signatures & just check “what name & email are associated with this key”
  • metadata sync (on both server & client side):
    • “does the peer have any newer nametags than the ones i currently have?”
    • “do the signatures validate based on the keys?”
    • “sweet, i will remove the outdated nametag and use this new one instead”
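
The metadata sync step could look roughly like the sketch below. Everything here is hypothetical (`Nametag`, `merge_nametags`, the field names), and the `verify` closure stands in for a real signature check against the key, which this sketch does not implement:

```rust
use std::collections::HashMap;

/// Hypothetical opaque key fingerprint recorded in changes.
type KeyId = Vec<u8>;

/// "As of this timestamp, this is my name, this is my email", signed by `key`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Nametag {
    pub key: KeyId,
    pub timestamp: u64,
    pub name: String,
    pub email: String,
    pub signature: Vec<u8>,
}

/// Merges nametags offered by a peer into the local set.
/// A nametag replaces the local one only if its signature validates
/// and it is strictly newer than what we already hold for that key.
pub fn merge_nametags<F>(
    local: &mut HashMap<KeyId, Nametag>,
    incoming: Vec<Nametag>,
    verify: F,
) where
    F: Fn(&Nametag) -> bool,
{
    for tag in incoming {
        // "do the signatures validate based on the keys?"
        if !verify(&tag) {
            continue;
        }
        // "does the peer have any newer nametags than the ones i currently have?"
        let is_newer = local
            .get(&tag.key)
            .map_or(true, |current| tag.timestamp > current.timestamp);
        if is_newer {
            // The outdated nametag is dropped, not archived.
            local.insert(tag.key.clone(), tag);
        }
    }
}
```

Since the old nametag is overwritten rather than kept, the previous name does not survive in the metadata once every peer has synced.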

benefits:

  • in principle, only the person who actually made the changes can update the nametag attached to those changes
  • maintains decentralizedness
  • extends semi-easily to also include change signing (which is a Git feature that i think some people actually use? maybe?)

drawbacks:

  • probably involves rerolling the repository format again (although with sufficient overthinking you could repurpose the existing format and just build this on top)
  • might make changes larger than they otherwise would be

pmeunier on April 24, 2021

The Nest is starting to be operational again after the fire, so I’m going back to fixing the dozens of bugs which have accumulated while I was working on the replication/HA strategy.

I’d like to reach a decision on this, I’m convinced that this is an important feature to have. Since I don’t want to change the patch format, I looked at the details of the current format.

In libpijul/src/change.rs, we have the following struct, serialized to bincode:

#[derive(Debug, Serialize, Deserialize, Clone, PartialEq, Eq)]
pub struct Author {
    pub name: String,
    #[serde(default)]
    pub full_name: Option<String>,
    #[serde(default)]
    pub email: Option<String>,
}

Whatever we decide here could be implemented as a toml/json inside the name field, and the other two fields would be deprecated (I believe bincode would just add two 64-bit zeros there).

pmeunier on April 24, 2021

@boringcactus: I like your proposed solution, but I disagree with your drawbacks: the problem is not so much about changing the repository format, as it is about deciding how we propagate identity changes across repositories.

In organisations, the mapping could be done with things like LDAP (or other databases), whereas distributed projects could use a special “Contributors” file (possibly inside the .pijul repository).

The idea is that since we have a distributed text store, we could use it to store this.

pmeunier on May 12, 2021

Because of/Thanks to @rohan’s work on #179, it seems we’ll have to change the change format once again. This isn’t a major issue, since the current change format isn’t lossy, so converting will be easy.

This means that we have a little bit more freedom on this discussion. My current idea for implementing malleable identifiers is to use a key fingerprint as the identity, and then use an optional central server to store identities, in order to allow users who lose their keys to ask for a new key to be linked to their identity. The goal of using a central server is to avoid the issue of having to ask the owners of repositories an author contributed to, to update the author’s identities (what if someone blocks an update and keeps serving the old one? what if someone sends two different identity updates in parallel?).

I’ve also explored solutions based on Pijul, for example using a separate channel for identities, and changes on that, but it suffers from two problems:

  • Repository owners still have to accept patches to the “identity channel”. If we make the process automatic, one could imagine “identity update bombing attacks”, where you keep generating new identities on massive servers in order to saturate a repository with these patches.
  • Pijul files are meant to model text, not efficiently-searchable datastructures such as balanced binary trees. Therefore, for large projects with thousands of contributors (like Rust), storing everything as text seems rather inefficient (actually, the scale of Rust is probably fine, but Pijul also has applications other than code, this could become very inefficient in some of these applications).

In the new format, I’d like all changes to be signed, and I would also like to eliminate password authentication from nest.pijul.com entirely, replacing it with a mixture of (1) SSH keys and 2FA and (2) OAuth (at least temporarily).

When you change identity, you just sign your new identity with your own private key and send it to the server. Nobody needs to trust the server for that, since the signature on your identity can be checked by all clients. The only place where you have to trust the server is that keys expire, and you can only renew them by proving your identity to the server (via 2FA or other means).
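
A rough sketch of that flow, with only the standard library. The types and functions (`KeyCertificate`, `SignedIdentity`, `accept_identity`, `renew`) are hypothetical, the `verify` closure is a placeholder for checking the identity’s signature against the key, and `proved_identity` stands for whatever out-of-band proof (2FA or other means) the server accepts:

```rust
/// Hypothetical server-issued record stating until when a key is valid.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct KeyCertificate {
    pub key: Vec<u8>,
    pub expires_at: u64,
}

/// A new identity, signed by the contributor's own private key.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SignedIdentity {
    pub key: Vec<u8>,
    pub name: String,
    pub email: String,
    pub signature: Vec<u8>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Rejection {
    UnknownKey,
    ExpiredKey,
    BadSignature,
}

/// Any client can run this check: the server is trusted only for `expires_at`.
pub fn accept_identity<F>(
    certs: &[KeyCertificate],
    update: &SignedIdentity,
    now: u64,
    verify: F,
) -> Result<(), Rejection>
where
    F: Fn(&SignedIdentity) -> bool,
{
    let cert = certs
        .iter()
        .find(|c| c.key == update.key)
        .ok_or(Rejection::UnknownKey)?;
    if now >= cert.expires_at {
        return Err(Rejection::ExpiredKey);
    }
    if !verify(update) {
        return Err(Rejection::BadSignature);
    }
    Ok(())
}

/// Server side: a key is renewed only after the owner proves their identity.
pub fn renew(cert: &mut KeyCertificate, proved_identity: bool, new_expiry: u64) -> bool {
    if proved_identity {
        cert.expires_at = new_expiry;
    }
    proved_identity
}
```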

Clients would have to trust the server’s signature on other contributors’ public keys (like in PKI).

Comments welcome!

Ralith on May 12, 2021

I’m not a cryptographer, but I like the sound of that a lot. Much less hacky than embedding structured data in text fields, and decouples most of the complexity from the repository structure, which should make iteration a lot easier if needed.

My one concern is that this could significantly increase complexity for both users and for people hosting their own repositories independently from the nest. Pijul needs low barriers to entry if it’s to stand any chance of adoption. There must exist an authentication flow that doesn’t rely on a private key stored by the user, but rather leans on OAuth as the ultimate source of truth, though perhaps that was already the plan.

pmeunier on May 12, 2021

I certainly don’t want to make people depend on the Nest. However, I see the Nest as a way to make it easier to start using Pijul, for a number of reasons: this is a bit like in Rust, where you can certainly host your own crate servers, and compile things without cargo, but things are much easier with crates.io. In order to make it as beginner-friendly as possible, here are a few ideas to make it easy:

  • Thrussh includes a cross-platform agent, which could be used to generate keys automatically and store them encrypted, all that automatically (including on Windows).
  • I have relatively concrete plans for “Pijul for non-code text”, where I’ll have to write a Pijul-compatible text editor in a browser anyway, so that could be a start.
  • I like the explicitly non-federated design of Signal, I believe it removes a lot of PGP’s complexity, and significantly mitigates technical debt (technical debt and crypto don’t go along very well, as PGP shows).

Ralith on May 12, 2021

To be clear, I was discussing the usability issues of forcing the end user to store a private key on their own, regardless of repo host. No amount of built-in tools in the client will help if someone loses an irrecoverable key.

I like the explicitly non-federated design of Signal

Architectural support for self-hosting is a hard requirement for me. I don’t think Pijul can be competitive without it. Not that a great experience in the first-party nest isn’t important as well.

pmeunier on May 12, 2021

Architectural support for self-hosting is a hard requirement for me.

For me too, and I actually have Pijul repositories self-hosted over SSH, on private servers (without any Nest at all).

This doesn’t mean these services have to be federated: federations introduce a whole new layer of complexity, as exemplified by PGP, where that complexity has become unmanageable over the years, and hasn’t ever really been useful anyway (how many fingerprints have you verified in person?).

Ralith on May 12, 2021

Oh, yeah, deeming federated identity out of scope seems totally reasonable. Regardless, I just don’t want to see a situation where end users have to deal with key management for the sake of the identity system if they don’t want to.