thought experiment on git on matrix


Jan Christian Grünhage cd89e7ec53 fix gitpub section		2018-07-24 21:54:41 +00:00

Federating git (using Matrix)

Introduction

What is currently out there

Even though git is inherently free (as in freedom) and decentralised, the tools that make collaborating with git pleasant are usually either proprietary (see GitHub and Bitbucket) or free but still centralised (see GitLab, Gogs and Gitea).

Since Microsoft bought GitHub, there has been a sudden movement away from GitHub towards the free alternatives, but these are still centralised entities subject to the usual network effect. One of them (most likely GitLab) will probably gain more and more traction, and we'll be back in the bad situation of having a single central entity that has been trusted with all the code.

There have also been discussions about federating the different (currently centralised) services using an extended version of ActivityPub (see https://github.com/git-federation/gitpub), and some people have built a GitHub clone on ZeroNet called GitCenter.

Why Matrix and not GitPub or GitCenter

The benefit of GitPub is that it tries to bolt federation onto existing products, which means that GitLab and Gitea, with their nice user experience, might be able to federate with each other at some point. This could turn out very nicely, but it is more specialised software. We would end up with "your federated git server", "your federated social media" and so on, whereas Matrix is more of a federated communication framework to build things on. This git project would be just a git client, like Riot.im is just an IM client and journal is just a blogging client.

GitCenter is also a very nice project, but currently, short of encrypting your repository, there is no way to have private repositories. Also, giving different people different levels of access to a repo doesn't seem possible to me, considering how ZeroNet works. The difference here is that they didn't take an existing thing and bolt decentralisation onto it; they took a decentralised network and put a git app on it.

My proposal for git on Matrix follows the second path: take a decentralised network (Matrix) and put git on it. Matrix already provides access control and other goodies, gives us a DAG containing all pushes, and in general seems well suited to this task.

How does git collaboration map to Matrix

Issues and PRs

Issue tracking and pull requests map quite nicely to Matrix. The root of the project is a Matrix room. Issues are Matrix rooms linked from the project room's state. Pull requests are issues that have a pull instruction in their state (which means that an issue can be converted to a PR if that's what people want).
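
As a rough illustration of the mapping above, the room state could look something like the following. The event types (`org.example.git.issue`, `org.example.git.pull`), the field names and the room IDs are all made up for this sketch; none of them is an existing Matrix or git convention.

```python
# Hypothetical state events; event types and field names are illustrative only.

def issue_link(issue_room_id, title):
    """State event in the project room that links to an issue room."""
    return {
        "type": "org.example.git.issue",   # assumed event type
        "state_key": issue_room_id,        # one state event per linked issue
        "content": {"title": title},
    }

def pull_instruction(source_room_id, source_ref, target_ref):
    """State event in an issue room; its presence turns the issue into a PR."""
    return {
        "type": "org.example.git.pull",    # assumed event type
        "state_key": "",
        "content": {
            "source_room": source_room_id,
            "source_ref": source_ref,
            "target_ref": target_ref,
        },
    }

def is_pull_request(issue_room_state):
    """An issue counts as a PR as soon as it carries a pull instruction."""
    return any(ev["type"] == "org.example.git.pull" for ev in issue_room_state)

# Converting an issue to a PR is just adding one state event to its room.
issue_state = []
was_pr = is_pull_request(issue_state)
issue_state.append(
    pull_instruction("!fork:example.org", "refs/heads/fix", "refs/heads/master")
)
now_pr = is_pull_request(issue_state)
```

Keying the issue links by room ID means each issue gets its own state event, so adding or removing one issue does not touch the others.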

git repositories

Previously this section described how storing git objects in the Matrix media repository would be a good fit, but thinking about it more made it clear that this wouldn't scale well enough. Because you would need an HTTP GET for every single object, large repositories with millions of objects (the Linux kernel has close to 7 million) would take forever to clone. Even being very generous and assuming that an HTTP GET takes one millisecond and that parsing the objects takes no time at all (which it would), cloning the kernel would take close to 2 hours. Apart from that, this approach would not use any deltas and would store a full copy of every version of every file, which would be bad for repositories with large, frequently changing files. So, on to other plans.
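
The back-of-the-envelope estimate above works out as follows. The object count and the 1 ms per request are the rough figures from the text, not measurements.

```python
objects = 7_000_000      # approximate object count of the Linux kernel, from the text
ms_per_get = 1           # the generous assumption of 1 ms per HTTP GET

total_seconds = objects * ms_per_get / 1000
total_hours = total_seconds / 3600

print(f"{total_seconds:.0f} s is about {total_hours:.2f} h")
```

At a more realistic few tens of milliseconds per request, the same calculation lands in the range of days rather than hours.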

We still need to store refs and objects somewhere. I currently see two ways to do that; both are less nice architecturally, but a lot faster:

Packfiles: git itself usually doesn't send loose objects around, and in a lot of cases it doesn't store loose objects either. It packs them up in packfiles, which store objects in compressed form and, for files with similar name, size and content, store only deltas. While I haven't found out how, I'm sure it's possible to tell git to pack up the objects that differ between what we're trying to push and what is already on the server. A push would then upload just that packfile and store the new refs. The refs could be stored as proposed earlier, in state events, as can be seen in #4, with the addition of storing some MXC URLs for the packfiles there too.
Bundles: as an alternative to the "closely coupled with git" packfiles approach above, we could couple to git less tightly and use git bundle to create files that contain the objects added between refs (or since the beginning of the repository) as well as the refs themselves. This way the remote helper could just download the bundle files, clone them into a hidden directory somewhere and then clone/push/pull from there. This is cheating a bit, but could end up being the easiest. It also doesn't require us to store refs ourselves, as they are part of the bundles. For addressing the bundles, we face the same decision as for storing refs in the packfiles approach, as described in #4: either we have one state event per branch, or one big one with more fields. Both are possible and both have drawbacks. With an event per branch, we might store duplicate data (because we can't do deltas across packfiles, afaik), but we can merge simultaneous pushes to different branches.
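
For what it's worth, git does seem to offer building blocks for both approaches. The sketch below only builds the command lines, assuming `git pack-objects --revs --stdout` (which reads the wanted and unwanted commits from stdin) for the packfile approach and `git bundle create` with a revision range for the bundle approach. The function names are made up for this example, and uploading the result to Matrix is left out entirely.

```python
import subprocess

def pack_command(new_sha, old_sha=None):
    """Return (argv, stdin text) for packing the objects reachable from
    new_sha but not from old_sha. With old_sha=None everything is packed."""
    argv = ["git", "pack-objects", "--revs", "--stdout"]
    revs = new_sha + "\n"
    if old_sha is not None:
        revs += "^" + old_sha + "\n"   # "^sha" excludes what the server already has
    return argv, revs

def bundle_command(path, ref, old_sha=None):
    """Return argv for a bundle that records ref and the objects since old_sha."""
    rev = ref if old_sha is None else f"{old_sha}..{ref}"
    return ["git", "bundle", "create", path, rev]

def run(argv, stdin_text=""):
    """Run a command in the current repository and return its stdout as bytes."""
    result = subprocess.run(
        argv, input=stdin_text.encode(), stdout=subprocess.PIPE, check=True
    )
    return result.stdout
```

Inside a repository, `run(*pack_command(new, old))` would return the packfile bytes for one push, and `run(bundle_command("push.bundle", "refs/heads/master", old))` would write an incremental bundle. The bundle takes a ref name rather than a bare commit ID because bundles record the refs they contain.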

Challenges:

git remote helpers: the hard part here is mostly writing a git remote helper that uses git's weird remote helper protocol to push to a Matrix room. The Matrix side of this seems very easy. It would also be possible to not write a remote helper and to read/write the .git folder ourselves. That is (afaict) what GitCenter does, and people seem fine with it. It prevents us from doing something like git clone mx:<insert_uri> though, which is probably what we want. The main problem here is the git docs, which are not very good at explaining how exactly remote helpers work.
Losing servers: when a server that was previously used for pushing the repo goes offline, we more or less lose access to some of the files, because new servers will try to access mxc://server/fileid, which will of course fail. This problem is relevant for all of Matrix, so it is not necessarily something we need to fix here and now. We can work around it by allowing people to upload files to their own server as a backup; this is something the remote helper could do when pushing.
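
To make the remote helper challenge a bit more concrete, here is a minimal sketch of the line-based command loop, based on my reading of the gitremote-helpers documentation. It only answers `capabilities` and `list`; the `fetch` and `push` batches, which are the actual hard part, are left out. The helper name `git-remote-mx` and the idea of taking the refs from room state are assumptions of this proposal, and as far as I can tell git only picks such a helper for URLs of the form `mx::<address>` or `mx://<address>`.

```python
import sys

def respond(command, refs):
    """Return the lines a helper writes in reply to one command line.

    refs maps ref names to object IDs; in a real helper they would come
    from the Matrix room state (an assumption of this proposal).
    """
    if command == "capabilities":
        # Advertise what we support; a blank line ends the list.
        return ["fetch", "push", ""]
    if command in ("list", "list for-push"):
        # One "<object id> <ref name>" line per ref, then a blank line.
        return [f"{sha} {name}" for name, sha in sorted(refs.items())] + [""]
    raise NotImplementedError(command)  # fetch/push batches are not sketched here

def main(refs):
    """Command loop; git runs the helper as: git-remote-mx <remote name> <url>."""
    for line in sys.stdin:
        command = line.rstrip("\n")
        if not command:          # a blank line ends the command stream
            break
        for out in respond(command, refs):
            print(out)
        sys.stdout.flush()       # git waits for our reply before continuing
```

Keeping `respond` free of I/O makes the protocol handling testable without a running git; an executable named `git-remote-mx` on the `PATH` would simply call `main` with the refs it has read from the room.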