definitely move away from loose git objects in the matrix media repository

Jan Christian Grünhage 2018-07-24 19:40:51 +00:00
parent 7ae1cbffa0
commit cac4ae16f5
1 changed file with 6 additions and 31 deletions


@ -19,39 +19,14 @@ My proposal with git on Matrix would follow the second path, take a decentral ne
### Issues and PRs
Issue tracking and pull requests map quite nicely to matrix. The root of the project is a matrix room. Issues are matrix rooms linked from the project room's state. Pull requests are issues that have a certain pull instruction in their state (which means that an issue can be converted to a PR if that's what people want).
### git repositories
What seems to map a little less perfectly to matrix is git itself, although IMO the mapping is actually quite good. To explain why, we need a slightly deeper look at how git works.
Previously this section contained something about how git objects in the matrix media repository would be a good fit, but thinking about this more made it clear that this wouldn't scale well enough: you would need to do an http GET for every single object, so large repositories with millions of objects (the linux kernel is close to 7 million) would take ***forever*** to clone. Being very generous and assuming an http GET takes one millisecond and parsing the objects takes no time at all (which it would), cloning the kernel would take close to 2 hours. Apart from that, it would also not use any deltas at all and would store full copies of every version of every file, which would be bad for repositories with large, frequently changing files. So, onto other plans.
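Just to make that back-of-envelope estimate explicit (both numbers are rough assumptions, the real thing would be slower):

```python
# one http GET per git object, at a very optimistic 1 ms per GET
objects = 7_000_000        # roughly the object count of linux.git
seconds = objects * 0.001  # parsing the objects is assumed to be free
print(seconds / 3600)      # ~1.94 hours, and that's the best case
```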
Commonly, people think that git is a tree of commits, with the file tree of each commit somehow attached to it. That is kinda wrong though. What git actually is, is a list of references and an object store.
The references are files that point from something like `refs/heads/master` to a certain git object. The git objects themselves are divided into multiple types:
- commit objects: those contain information about a commit (parents, tree, author, etc)
- tree objects: these represent a folder, which can contain more tree objects and also blob objects
- blob objects: they are basically just files
- tag objects: those contain annotations for a given object. While tags are usually used for annotating a commit (when tagging a release, for example), you can tag trees or blobs (or even tags!) too. A lot of software is going to be confused when you tag things other than commits though. See https://git-scm.com/docs/git-fast-export#_limitations for an example.
In git, those objects are stored in a simple file-based object store, and git addresses them by their SHA1 hash. An object with the hash `ea1301063d14878381daef03cb2aee4935cfffa9` will end up in `.git/objects/ea/1301063d14878381daef03cb2aee4935cfffa9`.
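To illustrate, a little python sketch of how git addresses a blob (the real store also zlib-compresses the object before writing it to disk):

```python
import hashlib

# git hashes "<type> <size>\0<content>", not just the content itself
data = b"hello world\n"
store = b"blob %d\x00%s" % (len(data), data)
sha1 = hashlib.sha1(store).hexdigest()

# the first two hex chars become the directory, the rest the file name
print(sha1)                                   # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(f".git/objects/{sha1[:2]}/{sha1[2:]}")  # .git/objects/3b/18e512...
```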
When trying to map that to matrix, the naive approach seems rather easy: put the git objects into matrix's object store, the media repository, and put the references into a matrix event. As explained above though, that doesn't scale, so we still need to store refs and objects some other way. There are currently two possibilities in my mind for how to do that; both are less nice architecture-wise, but a lot faster:
- **Packfiles:** git itself usually doesn't send around loose objects, and in a lot of cases it doesn't store loose objects either. It packs them up into packfiles, which store compressed objects only, and for files with similar name, size and content it only stores deltas. While I haven't found out ***how*** yet, I'm sure it's possible to tell git to pack up the object diff between what we're trying to push and what's already on the server, upload that packfile while pushing, and then store the new refs. The refs could be stored like it was proposed earlier, in state events, as can be seen in #4, with the addition of storing some mxc urls for packfiles in there too.
- **Bundles:** as an alternative to the "closely coupled with git" packfiles approach above, we could also couple to git less closely and use git bundle to create files that contain object diffs between refs (or from the beginning of the repository), as well as the refs themselves. This way, the remote helper could just download the bundle files, clone them into a hidden directory somewhere and then clone/push/pull from there (see the sketch right after this list). This is a bit of a cheat, but could end up being the easiest. It also doesn't require us to store refs ourselves, as they are part of the bundles. For addressing the bundles, we face the same decision as for storing refs in the packfiles approach, described in #4: either we have a state event per branch, or one big one with more fields. Both are possible and both have drawbacks. With an event per branch, we might store duplicate data (because we can't do deltas across packfiles afaik), but we can merge simultaneous pushes to different branches.
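To sketch what the remote helper would actually run in the bundles approach (shelling out to git; uploading and downloading the bundles via the media repository is left out, and `last_pushed` is a placeholder for whatever the room state records):

```python
import subprocess

def git(*args):
    subprocess.run(["git", *args], check=True)

# initial push: bundle up everything reachable from all refs
git("bundle", "create", "repo.bundle", "--all")

# incremental push: only the objects new since the last recorded push
last_pushed = "HEAD~10"  # placeholder: really the sha1 stored in the room state
git("bundle", "create", "incremental.bundle", f"{last_pushed}..master")

# on the receiving side, a downloaded bundle acts like a read-only remote
git("clone", "repo.bundle", "checkout-dir")   # initial clone
git("fetch", "incremental.bundle", "master")  # run inside the clone to catch up
```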
#### Challenges:
1. git remote helpers: the hard part here is mostly writing a git remote helper that uses git's weird remote helper protocol to push to a matrix room; the matrix side of this seems very easy. It would also be possible to not write a remote helper and to read/write the .git folder ourselves. That is (afaict) what GitCenter does, and people seem fine with it. It prevents us from doing something like `git clone mx:<insert_uri>` though, which is probably what we want. The main problem here is the git docs, because git is not very good at documenting how exactly remote helpers work (a skeleton of such a helper follows after this list).
2. Finding mxc URLs of git objects: what I kinda skipped over above is that the matrix media repository currently has some limitations that make it hard to use for what we want. What we get from git is "fetch me that object", containing its SHA1 hash. Since the media repository hands us a URL instead of letting us pick one, we can't construct the mxc URL from the information git gives us. That means (unless the matrix media repository API changes to help here) we are stuck storing a mapping from all SHA1 hashes to mxc URLs somewhere.
   1. One solution would be to have it in the room state, with the SHA1 being the state key and the mxc URL being the content. The problem with this is that we definitely don't want people to push large repositories with lots of objects then, because the larger the state, the slower the participating servers become. Pushing for example the linux kernel into a room would completely crush everything, because that would add nearly 7 million state events.
   2. Another way (this is more of a workaround than a solution) would be to put a file in the media repository that contains these mappings, as well as an array of include links to other such files. That way, when new changes are pushed, only the new objects land in a file, and the mappings from before are just a reference in the mapping file of the current push (an illustration follows after this list). This doesn't feel very matrix-y though.
   3. The third possibility, which seems like the endgame to me, is to add hash-addressed content to the media repository API: ask the media repository for something like `ea1301063d14878381daef03cb2aee4935cfffa9` and give it a room and a primary server. It can then ask the primary server (the one that belongs to the person who sent the hash in some event), and if that doesn't have it, it can ask other servers from that room. This is of use to the current main usecase of IM too, see 4.2.
3. Performance/Scaling: cloning a repository out of a matrix room scales linearly with the number of objects in the repository. For the linux kernel, that means making nearly 7 million http requests to the media repository. Cloning tiny repositories is not a problem, but anything larger will become slow. There are multiple ways to tackle this:
   1. Put an optional component next to the homeserver: pulling and pushing can become a lot faster if you put a service next to the homeserver that clones the repos, and then do a regular https clone from that service. We lose some of the nice decentralisation feel, but since that is just a transparent proxy that caches a lot, that would be okay.
   2. More changes to the media API: allow people to request multiple files in one request. This is still a much less efficient transport than the proxy service, but it is again also useful for things outside of git.
4. Losing servers: when a server that was used for pushing the repo goes offline, we kinda lose access to some of the git objects, because new servers will try to access mxc://server/fileid, which will of course fail. This problem is relevant for all of matrix, so it's not necessarily something we need to fix here and now. Still, we have multiple solutions here:
   1. When you push, upload all git objects that your own server doesn't store yet to it as a backup; this is something the remote helper could do when pushing. It is kinda painful on the client side, and only really possible as an add-on to 2.1 or 2.2.
   2. The third solution to the second problem (2.3) actually solves this too and gives free deduplication. Hash-addressed content FTW!
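Speaking of challenge 1, the rough shape is this: git looks for an executable called `git-remote-<scheme>` on the PATH and talks a line-based protocol to it over stdin/stdout, so `git clone mx:<insert_uri>` would pick up a `git-remote-mx` automatically. A minimal python skeleton (the `matrix_*` functions are placeholders for the actual matrix logic, and the push side is omitted):

```python
#!/usr/bin/env python3
# hypothetical `git-remote-mx` helper; git runs it as
# `git-remote-mx <remote-name> <url>` and speaks to it on stdin/stdout
import sys

def matrix_list_refs(url):
    # placeholder: read the refs out of the project room's state events
    return {"refs/heads/master": "ea1301063d14878381daef03cb2aee4935cfffa9"}

def matrix_fetch_objects(url, sha1):
    # placeholder: fetch the bundles/packfiles containing this object from
    # the media repository and unpack them into the local object store
    pass

def main():
    url = sys.argv[2]
    for line in sys.stdin:
        cmd = line.strip()
        if cmd == "capabilities":
            print("fetch\npush\n")  # supported commands, blank line terminated
        elif cmd in ("list", "list for-push"):
            for name, sha1 in matrix_list_refs(url).items():
                print(f"{sha1} {name}")
            print()
        elif cmd.startswith("fetch "):
            _, sha1, _name = cmd.split(" ", 2)
            matrix_fetch_objects(url, sha1)
            print()  # blank line once the requested objects are available
        elif cmd == "":
            break
        sys.stdout.flush()

if __name__ == "__main__":
    main()
```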
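And for challenge 2, what one of the mapping files from workaround 2.2 could look like (the field names and mxc URLs are made up for the illustration):

```python
# one mapping file per push; earlier mappings are reachable via the includes
mapping = {
    "includes": [
        "mxc://example.org/mappingOfThePreviousPush",
    ],
    "objects": {
        "ea1301063d14878381daef03cb2aee4935cfffa9": "mxc://example.org/someMediaId",
    },
}
```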