initial writeup

Jan Christian Grünhage 2018-06-09 12:49:22 +00:00
parent eab0949754
commit 036f9b3425
1 changed files with 55 additions and 2 deletions

@@ -1,3 +1,56 @@
# git-on-matrix
# Federating git (using Matrix)
## Introduction
### What is currently out there
Even though git is inherently free (as in freedom) and decentralised, the tools around it that make collaboration pleasant are usually either proprietary (see GitHub and Bitbucket) or, if they are free, still centralised (see GitLab, Gogs and Gitea).
thought experiment on git on matrix
Since Microsoft bought GitHub, there has suddenly been a movement away from GitHub towards the free alternatives, but those are still centralised entities subject to the usual network effect problem. One of them (most likely GitLab) will probably gain more and more traction, and we'll be back in the bad situation of having a central entity that has been trusted with all the code.
There have also been discussions on how to federate the different (currently centralised) services using an extended version of ActivityPub (see https://github.com/git-federation/gitpub), and some people have built a GitHub clone on ZeroNet called GitCenter.
### Why Matrix and not GitPub or GitCenter
While GitPub seems interesting, they've decided to hold their discussions on a private mailing list that you need to apply for access to, instead of putting things out in public. Because of that, I can't make an educated guess about whether I'd be fine with the protocol itself. The benefit of their approach is that they are trying to add federation to existing products, which means that GitLab and Gitea, with their nice user experience, might be able to federate with each other at some point.
GitCenter is also a very nice project, but currently, short of encrypting your repository, there is no way to have private repositories. Also, giving different people on a repo different levels of access doesn't seem possible to me, considering how ZeroNet works. The difference here is that they didn't take an existing thing and bolt decentralisation onto it; they took a decentralised network and put a git app on it.
My proposal for git on Matrix follows the second path: take a decentralised network (Matrix) and put git on it.
## How does git collaboration map to Matrix
### Issues and PRs
Issue tracking and pull requests map quite nicely to Matrix. The root of the project is a Matrix room. Issues are Matrix rooms linked from the project room's state. Pull requests are issues that have a pull instruction in their state (which means that an issue can be converted into a PR if that's what people want).
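To make that mapping a bit more concrete, here is a minimal Python sketch. It models room state as a dict keyed by `(event type, state key)`, which is how Matrix state is addressed. The event types `org.example.git.issue` and `org.example.git.pull` and the content fields are made up for illustration; nothing here is specified anywhere.

```python
# Room state modelled as {(event_type, state_key): content}.
RoomState = dict[tuple[str, str], dict]

# Hypothetical event types, invented for this sketch.
ISSUE_LINK = "org.example.git.issue"        # in the project room, state_key = issue room ID
PULL_INSTRUCTION = "org.example.git.pull"   # in the issue room, state_key = ""


def linked_issues(project_state: RoomState) -> list[str]:
    """Return the room IDs of all issues linked from the project room's state."""
    return sorted(key for (etype, key) in project_state if etype == ISSUE_LINK)


def is_pull_request(issue_state: RoomState) -> bool:
    """An issue counts as a PR once it carries a non-empty pull instruction."""
    return bool(issue_state.get((PULL_INSTRUCTION, "")))


def convert_to_pull_request(issue_state: RoomState, source: str, ref: str) -> RoomState:
    """Return a copy of the issue state with a pull instruction added."""
    new_state = dict(issue_state)
    new_state[(PULL_INSTRUCTION, "")] = {"source": source, "ref": ref}
    return new_state
```

Converting an issue into a PR is then a single state change in the issue room, and the project room does not need to be touched at all.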
### git repositories
What seems to map a little less perfectly to Matrix is git itself, although IMO the mapping is actually quite good. To explain why, we need a slightly deeper look into how git works.
People commonly think that git is a tree of commits with the file tree of each commit somehow attached to it. That is kinda wrong though. What git actually is, is a list of references and an object store.
The references are files that point from something like `refs/heads/master` to a certain git object. The git objects themselves are divided into multiple types:
- commit objects: those contain information about a commit (parents, tree, author, etc)
- tree objects: these represent a folder, which can contain more tree objects and also blob objects
- blob objects: they are basically just files
- tag objects: those contain annotations for a given object. While tags are usually used for annotating a commit (when tagging a release, for example), you can tag trees or blobs (or even tags!) too. A lot of software is going to be confused when you tag anything other than commits though. See https://git-scm.com/docs/git-fast-export#_limitations for an example.
In git, those objects are stored in a simple file based object store, and git addresses those by their SHA1 hash. An object with the hash `ea1301063d14878381daef03cb2aee4935cfffa9` will end up in `.git/objects/ea/1301063d14878381daef03cb2aee4935cfffa9`.
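As a rough illustration of that addressing scheme, the following Python sketch computes the ID of an object and the path of its loose object file. Note that git hashes a small header (`<type> <size>` followed by a NUL byte) together with the content, not the content alone, and that the file on disk is additionally zlib-compressed. The function names are made up for this example.

```python
import hashlib


def git_object_id(obj_type: str, payload: bytes) -> str:
    """SHA1 over '<type> <size>\\0' + payload, which is how git names objects."""
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


def loose_object_path(object_id: str) -> str:
    """First two hex digits become the directory, the remaining 38 the file name."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


# The ID of a blob can be cross-checked against `git hash-object <file>`.
empty_blob = git_object_id("blob", b"")
print(empty_blob, loose_object_path(empty_blob))
```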
Mapping that to Matrix seems rather easy: put the git objects into Matrix's object store, the media repository, and put the references into a Matrix event.
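The reference half of that mapping could look roughly like the sketch below. The event type `org.example.git.refs` and the shape of the content are assumptions made purely for illustration; the only thing taken from git itself is that a reference maps a name like `refs/heads/master` to a 40 character SHA1.

```python
import re
from typing import Optional

REFS_EVENT_TYPE = "org.example.git.refs"  # hypothetical event type
_SHA1 = re.compile(r"[0-9a-f]{40}")


def build_refs_event(refs: dict[str, str]) -> dict:
    """Build a state event that carries all references of a repository."""
    for name, object_id in refs.items():
        if not name.startswith("refs/"):
            raise ValueError(f"not a reference name: {name!r}")
        if not _SHA1.fullmatch(object_id):
            raise ValueError(f"not a SHA1 object ID: {object_id!r}")
    return {"type": REFS_EVENT_TYPE, "state_key": "", "content": {"refs": dict(refs)}}


def resolve_ref(event: dict, name: str) -> Optional[str]:
    """Look up the object ID a reference points to, or None if it is unknown."""
    return event["content"]["refs"].get(name)
```

A push would then upload the new objects first and replace this event afterwards, so that a reference never points at an object that has not been uploaded yet.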
#### Challenges:
1. git remote helpers: the hard part here is mostly writing a git remote helper that uses git's weird remote helper protocol to push to a matrix room. The matrix side of this seems very easy. It would also be possible to not write a remote helper and read / write the .git folder ourselves. That is (afaict) what GitCenter does and people seem fine with it. It prevents us from doing something like `git clone mx:<insert_uri>` though, which is probably what we want. The main problem here is git docs, because git is not very good at documenting how exactly remote helpers work.
2. Finding the mxc URLs of git objects: what I kinda skipped over above is that the Matrix media repository currently has some limitations that make it hard to use for what we want. What we get from git is "fetch me that object", containing its SHA1 hash. Since the media repository assigns the URL on upload and doesn't let us choose it, we can't construct the mxc URL from the information git gives us. That means (unless the Matrix media repository API changes to help here) we are stuck storing a mapping from all SHA1 hashes to mxc URLs somewhere.
   1. One solution would be to have it in the room state, with the SHA1 being the state key and the mxc URL being the content. The problem with this is that the larger the state, the slower the participating servers become, so we definitely don't want people to push large repositories with lots of objects. Pushing the linux kernel into a room, for example, would completely crush everything, because that would add nearly 7 million state events.
   2. Another way (this is more of a workaround than a solution) would be to put a file in the media repository that contains these mappings, as well as an array of include links to other such files. That way, when new changes are pushed, only the new objects land in a new file, and the previous mapping is just a reference in the mapping file of the current push. This doesn't feel very matrix-y though.
   3. The third possibility, which seems to be the endgame to me, is to add hash addressed content to the media repository API. Ask the media repository for something like `ea1301063d14878381daef03cb2aee4935cfffa9` and give it a room and a primary server. It can then ask the primary server (the one that belongs to the person who sent the hash in some event), and if that doesn't have it, it can ask other servers from that room. This is useful for the current main use case of IM too, see 4.2.
3. Performance/Scaling: Cloning a repository out of a Matrix room scales linearly with the number of objects in the repository. For the linux kernel, that means making nearly 7 million HTTP requests to the media repository. Cloning tiny repositories is not a problem, but anything larger will become slow. There are multiple ways to tackle this:
   1. Put an optional component next to the homeserver: Pulling and pushing can become a lot faster if you put a service next to the homeserver that clones the repos, and then do a regular https clone from that service. We lose some of the nice decentralisation feel, but since that is just a transparent proxy that caches a lot, that would be okay.
   2. More changes to the media API: Allow people to request multiple files in one request. This is still a much less efficient transport than the proxy service, but again it is also useful for other things outside of git.
4. Losing servers: When a server that was previously used for pushing to the repo goes offline, we kinda lose access to some of the git objects, because new servers will try to access mxc://server/fileid, which will of course fail. Again, we have multiple solutions here:
   1. When you push, upload all git objects that your server does not store to your server. This is kinda painful on the client side, and only really possible as an add-on to 2.1 or 2.2.
   2. The third solution to the second problem (2.3) actually solves this too and gives free deduplication. Hash addressed content FTW!
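
The hash addressed lookup from 2.3, including the lost server case from 4.2, could be sketched in Python as follows. No such API exists in the media repository today, so everything here is hypothetical: servers are simulated as plain dicts, an offline server is simply missing from the `online` dict, and the stored bytes are assumed to be the uncompressed `<type> <size>\0<content>` form so that their SHA1 equals the git object ID.

```python
import hashlib
from typing import Optional

# A simulated media store: SHA1 hex digest -> stored object bytes.
MediaStore = dict[str, bytes]


def fetch_by_hash(
    object_id: str,
    primary: str,
    room_servers: list[str],
    online: dict[str, MediaStore],
) -> Optional[bytes]:
    """Ask the primary server first, then every other server in the room."""
    candidates = [primary] + [name for name in room_servers if name != primary]
    for name in candidates:
        store = online.get(name)
        if store is None:
            continue  # server is offline, try the next one
        data = store.get(object_id)
        # The address is the hash, so any server's answer can be verified.
        if data is not None and hashlib.sha1(data).hexdigest() == object_id:
            return data
    return None
```

Because the address is derived from the content, it doesn't matter which server answers, and identical objects uploaded by different people end up under the same address, which is where the deduplication comes from.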