Understanding Git

Back in 2008 or 2009, when my company was still using SVN, I said, “Hey, let’s see how this Git thing works out.” Even with my small team sharing patches via email or manually running git-daemon on development machines, it worked very well. No one complained; people actually liked the more decoupled workflow.

To make things official, we went to the IT department and asked them to set up a “central” Git repository for us to work with. At that time the VCS landscape was still very much in flux and their response was “You should use Bazaar because it’s written in Python!”[1]. Instead, we managed to get a small budget for a GitHub account and went cloud-first before it was a thing.

The point is that there has been, and still is, a lot of bullshit in version control advocacy. You get regular whining about how Git is difficult and hard to understand and different from what people are used to. This may very well be true, but it’s not Git’s fault. It’s because the rest of the VCSs do things in contrived ways that you have to learn to work with. This mindset that there’s “a way of doing things” that you should follow is the problem. You immediately look for ways to replicate what you did before, and then you get discouraged when things are different.

I say that Git is the most logical and easy-to-understand version control system. Let me tell you why.

First of all, if you “just” want to know how to add or commit files and be done with it, you are going to have a bad time. Instead, you should learn how Git works internally, and then all the possible operations you can do with it will become obvious. I know this sounds hard, but it’s really not.

Handling files

When you add a file to a Git repository, its contents are stored as a blob, referenced by the SHA1 hash of those contents (plus a short header). The specifics of how similar blobs are stored internally are irrelevant. Git does some clever things to deduplicate data by calculating deltas, but to an external observer, there are just some blobs with unique identifiers.
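To make the identifier less magical, here is a minimal sketch of how a blob’s SHA1 can be reproduced outside of Git. The function name `blob_id` is my own; the only thing taken from Git is the layout being hashed, a `blob <size>` header, a NUL byte, and then the raw contents.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Return the SHA1 identifier Git assigns to a blob with these contents."""
    # Git hashes a short header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # The file name plays no part in the identifier, only the contents do.
    print(blob_id(b"hello\n"))
```

If you have Git at hand, the result should match what `git hash-object` reports for a file with the same contents.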

File contents by themselves are pretty useless. You need to keep some sort of directory of files, and Git manages this with tree objects. A tree is nothing more than a list of SHA1 identifiers and their corresponding names (along with a file mode for each entry). If you calculate the SHA1 hash of such a list, you get the identifier for the tree. Trees can store not only files, but also other trees. Therefore, by recursively parsing tree objects, you can reconstruct the entire file and directory hierarchy from a single SHA1 of the root tree object.
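A rough sketch of how such a list turns into an identifier, assuming a tree that holds only plain files. The helper name `tree_id` and the tuple layout are mine; the serialized form (mode, space, name, NUL, then the 20 raw bytes of the SHA1, all behind a `tree <size>` header) follows Git’s object format. Real Git sorts sub-tree names slightly differently, which this sketch ignores.

```python
import hashlib


def tree_id(entries: list) -> str:
    """Hash a tree given a list of (mode, name, hex_sha1) tuples."""
    body = b""
    # Entries are stored sorted by name, so the input order does not matter.
    for mode, name, hex_sha1 in sorted(entries, key=lambda e: e[1].encode()):
        body += mode.encode() + b" " + name.encode() + b"\0"
        body += bytes.fromhex(hex_sha1)  # identifiers are stored as raw bytes
    header = b"tree " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


if __name__ == "__main__":
    empty_blob = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
    # "100644" is the mode Git uses for a regular, non-executable file.
    print(tree_id([("100644", "readme", empty_blob)]))
```

Renaming an entry or pointing it at a different blob changes the tree’s identifier, which in turn changes the identifier of every tree above it.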

An example of a simple file system represented as Git objects

The image above shows three blobs with the following SHA1 identifiers: 1b62, 2c11 and 3fa0. It is important to understand that blobs are immutable. You can only create new blobs and they get a new SHA1.

There are also two trees in the image that store named references to either blobs or other trees. It’s worth noting that the about file in the 4c5e tree and the readme file in the 5bb2 tree reference the same blob. Looks like someone copied the file and renamed it! But what happens if you change the about file? Will the readme change too? No, blobs are immutable. A new blob will be created, and only the about entry will be updated to point to the new contents.
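The copy-then-edit scenario can be played out with a toy object store. This is only a model: `objects`, `store_blob` and the file contents are made up for illustration, and trees are plain dictionaries rather than real tree objects.

```python
import hashlib

objects = {}  # identifier -> contents; entries are added, never changed


def store_blob(content: bytes) -> str:
    """Store contents under their SHA1 and return the identifier."""
    oid = hashlib.sha1(b"blob %d\0" % len(content) + content).hexdigest()
    objects.setdefault(oid, content)
    return oid


original = store_blob(b"All about this project.\n")
tree_4c5e = {"about": original}   # two names in two trees,
tree_5bb2 = {"readme": original}  # both pointing at one blob

# Editing "about" creates a new blob and a new tree entry.
edited = store_blob(b"All about this project, now updated.\n")
new_tree = dict(tree_4c5e, about=edited)
```

After the edit, `tree_5bb2["readme"]` still points at the original blob, and the old contents are still sitting in `objects` for any older commit that needs them.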

Commits

A commit is just another kind of SHA1-tagged object in Git. It contains a SHA1 reference to a root tree, a SHA1 reference to a parent commit, the message explaining why the commit was made, the author, and the date the commit was made.

Note that the commit contains only the root tree reference. The diff you see when looking at the commit is just the result of calculating what has changed between two trees, the one in the commit and the one in its parent.

Typically, you will have commits that have only one parent. This will give you a nice, linear graph of the changes in the repository. The uninteresting special case is the commit that has no parents: it’s just an initial commit. However, when a commit has two (or more) parents, things get interesting. Such a commit is a merge, and your commit graph splits into multiple paths at that point. The nice thing about doing things this way is that you don’t have to do any magic handling, or store any extra metadata, to indicate that a merge has taken place.

Given the data contained in a commit, you can reconstruct the entire history of the repository, along with the changes made to the files, from a single SHA1 identifier. Because the SHA1 hash depends on the contents of the commit, including the identifier of its parent, the entire repository history is tamper-evident. If someone manipulated it, you would get a completely different SHA1 at the point of change, and in all subsequent commits referencing that commit.
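Here is a small sketch of that chain reaction. `commit_id` follows the textual layout of a Git commit object (a `tree` line, `parent` lines, `author` and `committer` lines, a blank line, then the message). The author string, the messages and the `build_history` helper are placeholders of mine, and every commit points at the empty tree to keep things short.

```python
import hashlib

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
WHO = "A U Thor <author@example.com> 1230768000 +0000"  # hypothetical author


def commit_id(tree: str, parents: list, message: str) -> str:
    """Hash a commit: headers, a blank line, then the message."""
    lines = ["tree " + tree]
    lines += ["parent " + p for p in parents]
    lines += ["author " + WHO, "committer " + WHO, "", message]
    body = "\n".join(lines).encode() + b"\n"
    header = b"commit " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def build_history(messages: list) -> list:
    """Create a linear chain of commits and return their identifiers."""
    ids, parents = [], []
    for message in messages:
        cid = commit_id(EMPTY_TREE, parents, message)
        ids.append(cid)
        parents = [cid]  # the next commit names this one as its parent
    return ids


original = build_history(["first", "second", "third"])
tampered = build_history(["first", "second (edited)", "third"])
```

The first identifier is the same in both histories, but the second and third differ, even though the third commit’s own message and tree were left alone.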

Branches

A branch in Git is just a reference to a specific commit by its SHA1. It is nothing more than a small file with a SHA1 in it. Let’s say we have a branch called master. That’s a reference to a commit. Working backwards from that reference, we can get both the current state of the files in the repository and the full commit history.

To keep track of which branch you are working on, Git uses a special reference file called HEAD. For example, the HEAD file might point to the master branch, and any commit you make will go to that master branch. It may be worth following how making a commit works in such a scenario.

  1. A new commit object is created, with a reference to the root tree containing the desired state of the files in the repository, and a parent commit reference pointing to the current HEAD.
  2. The branch pointed to by the HEAD is updated to point to the newly created commit object.

It really is that simple. All you have to do is update one or two references.
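The two steps can be modelled in a few lines. This is a toy: the `Repo` class is invented for illustration, and the commit identifier is a stand-in hash rather than Git’s real commit format. The one realistic detail is the contents of HEAD, which for a checked-out branch is the text `ref: refs/heads/master`.

```python
import hashlib


class Repo:
    def __init__(self):
        self.commits = {}                        # id -> (tree, parent, message)
        self.refs = {"refs/heads/master": None}  # branch name -> commit id
        self.head = "ref: refs/heads/master"     # HEAD names a branch

    def current_branch(self) -> str:
        return self.head[len("ref: "):]

    def commit(self, tree: str, message: str) -> str:
        branch = self.current_branch()
        parent = self.refs[branch]
        # Step 1: create the commit object, pointing at the tree and the parent.
        cid = hashlib.sha1(repr((tree, parent, message)).encode()).hexdigest()
        self.commits[cid] = (tree, parent, message)
        # Step 2: move the branch that HEAD points to.
        self.refs[branch] = cid
        return cid


repo = Repo()
first = repo.commit("tree-1", "Initial commit")
second = repo.commit("tree-2", "Fix a typo")
```

Note that HEAD itself never changes here; only the branch it names moves forward.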

Now, perhaps you want to see if some bug in your code was there two weeks ago? Let’s look at what happens when you do a checkout on a random commit. The fact is, the only thing that has changed is that the HEAD now points directly to a SHA1 instead of a named branch. You can still do everything you could before, but there is no branch to update. Git will warn you that you are working on a “detached HEAD” in this situation. And you need to be careful, because when you make some commits, the only reference to them will be the HEAD file. Those changes will be “lost”[2] when you switch to another branch or commit. But not to worry, all you have to do to preserve them for posterity is to create a new branch at your current HEAD. It’s just one little file that points to the SHA1 of a commit.

Be aware that Git will periodically perform a cleanup of the repository. Any objects (commits, blobs, trees) that are no longer reachable from any reference and are older than a pre-configured time period will be removed at this time.
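What “lost” means can be shown with a simple reachability walk. This is a simplified model with made-up commit names: it only looks at commits, and it ignores the reflog and the grace period that real Git applies before deleting anything.

```python
def reachable(commits: dict, tips: list) -> set:
    """Return every commit id reachable from the given starting points."""
    seen, stack = set(), list(tips)
    while stack:
        cid = stack.pop()
        if cid in seen:
            continue
        seen.add(cid)
        stack.extend(commits[cid])  # follow the parent references
    return seen


# commit id -> list of parent ids
commits = {"c1": [], "c2": ["c1"], "x1": ["c2"], "x2": ["x1"]}
branches = {"master": "c2"}

# Detached HEAD at x2: the two new commits hang off HEAD alone.
while_detached = reachable(commits, list(branches.values()) + ["x2"])

# After switching back to master, nothing points at x1 or x2 any more.
orphaned = set(commits) - reachable(commits, list(branches.values()))

# Creating a branch before switching keeps them alive.
branches["bugfix"] = "x2"
rescued = set(commits) - reachable(commits, list(branches.values()))
```

Here `orphaned` ends up holding `x1` and `x2`, while `rescued` is empty.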

The index

There may be times when you do not want to commit your entire working directory. For example, you may have fixed two unrelated bugs, and now want to commit those two fixes separately. Or you may currently depend on some debugging code that you don’t want to commit.

To make it easy for you to split up your changes, Git uses a staging area called the index. This is where you prepare what will be your next commit. Committing simply takes the index and makes it the root tree of a new commit.

Git gives you some pretty powerful utilities that allow you to interactively choose what to stage. You can even manually edit the diffs if you don’t want to change the actual files in the working directory.
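As a mental model, you can think of the index as a mapping from paths to blob identifiers that you fill in piece by piece. The sketch below uses a flat dictionary and invented file names. The real index also carries file modes and other metadata, and it starts out matching the last commit rather than empty.

```python
import hashlib


def blob_id(content: bytes) -> str:
    return hashlib.sha1(b"blob %d\0" % len(content) + content).hexdigest()


def stage(index: dict, working_dir: dict, path: str) -> None:
    """Record the current contents of one file in the index."""
    index[path] = blob_id(working_dir[path])


def commit(index: dict) -> dict:
    """A commit's root tree is a snapshot of the index at that moment."""
    return dict(index)


# Hypothetical working directory: two unrelated fixes plus debugging code.
working_dir = {
    "parser.py": b"fix for bug 1\n",
    "network.py": b"fix for bug 2\n",
    "debug.py": b"print('got here')\n",
}
index = {}

stage(index, working_dir, "parser.py")
first = commit(index)    # only the first fix

stage(index, working_dir, "network.py")
second = commit(index)   # both fixes, still no debugging code
```

The debugging file sits in the working directory the whole time, but since it is never staged, it never ends up in a commit.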

Remotes

To collaborate with other people, you need to somehow get their changes and send them your own. There are some interesting workflows that rely on sending email, but typically you will just use remotes. For example, when you clone a repository, it is registered as an origin remote. Later you can add as many additional remotes as you need.

What you are actually fetching from the remote repository are branch heads, not the commits. Of course you need to have the actual commits on disk to be able to reference them, so immediately after that Git calculates which commits and other objects are missing and proceeds to download them. The remote branches are stored as origin/master for example, and are protected by Git to prevent you from doing silly things like adding some local commits to such a branch.
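The “which commits are missing” part boils down to a graph walk. The sketch below is a simplification with made-up commit names; the actual negotiation between Git and the remote is more involved, but the idea of walking back from the remote branch head until you hit something you already have is the same.

```python
def missing_commits(remote_commits: dict, remote_head: str, local_ids: set) -> list:
    """Walk back from the remote head, stopping at commits we already have."""
    missing, seen, stack = [], set(), [remote_head]
    while stack:
        cid = stack.pop()
        if cid in local_ids or cid in seen:
            continue
        seen.add(cid)
        missing.append(cid)
        stack.extend(remote_commits[cid])  # keep following parents
    return missing


# commit id -> list of parent ids, as known by the remote
remote_commits = {"D": [], "E": ["D"], "F": ["E"], "G": ["F"]}
local_ids = {"D", "E"}

to_download = missing_commits(remote_commits, "G", local_ids)

# Once the objects are on disk, the remote-tracking branch can be moved.
remote_tracking = {"origin/master": "G"}
```

In this example only `G` and `F` need to be downloaded; `E` and `D` are already present locally, so the walk stops there.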

Your local branches can track remote branches. Without this option, when changes are made to the remote repository, the local branches will not be updated. Also, when you want to push your changes, Git will not know which remote branch to update. Note that tracking remotes just makes your life easier, but it is not necessary for Git to work. You can do it all manually.

Cooperation

Pulling remote changes is easy if you have not made any changes to your repository. Your local branch label just needs to follow the remote branch label in this case (this is called fast-forwarding). However, if you have made some new commits, you will need to somehow make the two development branches (local and remote) coherent again. This can be done either as a merge or as a rebase. Either way, you will need to resolve any potential conflicts which may arise.

For example, your repository might look like this:

      A---B---C master
     /
D---E---F---G origin/master

I don’t think there’s any need to explain what a merge is. What’s important is that the two development paths remain visible in the repository, and are merged into a new commit, as shown below.

      A---B---C 
     /         \
D---E---F---G---H master

Rebasing, however, is a different beast. You figuratively detach your commit history at the point where it diverges from the remote history, and then reattach it at the new branch head. The end result is that you now have a linear history, but your original commits are no longer part of it, and new ones have been created in their place. This is because a commit’s SHA1 covers its parent reference, so a commit with a different parent cannot keep the old identifier.

              A'--B'--C' master
             /
D---E---F---G origin/master
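To see why A, B and C have to become A', B' and C', here is a toy replay of the graphs above. The identifier is a stand-in hash of “parent plus change”, not a real Git commit, and the replay skips the part where Git re-applies each diff and may stop on conflicts.

```python
import hashlib


def commit_id(parent, change: str) -> str:
    """Stand-in identifier: depends on the parent and on the change itself."""
    return hashlib.sha1((str(parent) + "\n" + change).encode()).hexdigest()


def build(parent, changes: list) -> list:
    """Stack the given changes on top of `parent`, returning the new ids."""
    ids = []
    for change in changes:
        parent = commit_id(parent, change)
        ids.append(parent)
    return ids


trunk = build(None, ["D", "E", "F", "G"])  # origin/master
e, g = trunk[1], trunk[3]

local = build(e, ["A", "B", "C"])          # master before the rebase
rebased = build(g, ["A", "B", "C"])        # the same changes replayed on G
```

The changes are identical, yet none of the identifiers in `rebased` match those in `local`, because the very first replayed commit already has a different parent.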

When you rebase, you are effectively rewriting the repository’s history. This is even more obvious when you do an interactive rebase, where you are free to manipulate commits as you wish. Be careful to rewrite only what you have done locally and not yet shared with anyone else. Otherwise, someone might be using the old version, and fixing such divergent paths is never fun.

By default, a push to a remote repository is only accepted if it would be a fast-forward of the remote branch.
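Whether something is a fast-forward is just a question of ancestry: the branch head being replaced must be reachable from the new one. A minimal sketch using the commit graphs from this section (the function names are mine):

```python
def is_ancestor(commits: dict, ancestor: str, descendant: str) -> bool:
    """True if `ancestor` can be reached by following parents from `descendant`."""
    seen, stack = set(), [descendant]
    while stack:
        cid = stack.pop()
        if cid == ancestor:
            return True
        if cid in seen:
            continue
        seen.add(cid)
        stack.extend(commits[cid])
    return False


def can_fast_forward(commits: dict, old_head: str, new_head: str) -> bool:
    return is_ancestor(commits, old_head, new_head)


# commit id -> list of parent ids
commits = {
    "D": [], "E": ["D"], "F": ["E"], "G": ["F"],   # origin/master
    "A": ["E"], "B": ["A"], "C": ["B"],            # local work
    "H": ["G", "C"],                               # merge commit
    "A'": ["G"], "B'": ["A'"], "C'": ["B'"],       # rebased work
}
```

With the remote at G, pushing C is refused because G is not among its ancestors, while pushing either the merge commit H or the rebased C' is a fast-forward.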

Binary files

You might say that Git is weak at handling binary files. Rightly so, because all the clever delta calculations for text files just don’t work when you’re dealing with, say, compressed image files. Does this mean that large binaries and Git don’t mix? Of course not.

The solution is to store binaries outside of the Git repository. Git cleverly supports this through the use of filters. The repository can be configured so that certain file types (say, all png files) must be passed to an external program when the file is added to the index, or when Git writes it back to your working directory. The filter program can completely transform the file, for example by storing its contents in a special cache directory, then calculating its SHA1 hash and outputting that small hash for further use in Git. Filters should be reversible, so that when Git passes the hash to the filter to undo the transformation, the original image can be retrieved from the cache directory and then saved to disk.

This may sound a bit convoluted, but the gist is that you can apply a reversible transformation to your large binaries, resulting in Git only storing small reference files instead.
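A bare-bones sketch of such a filter pair might look like this. The function names mirror the “clean” and “smudge” terms Git uses for the two filter directions, but the stub format (just the hex hash and a newline) and the cache layout are assumptions of mine, not what Git LFS or git fat actually write.

```python
import hashlib
import os
import tempfile


def clean(data: bytes, cache_dir: str) -> bytes:
    """Run when a file is staged: stash the real bytes, hand Git a small stub."""
    digest = hashlib.sha1(data).hexdigest()
    with open(os.path.join(cache_dir, digest), "wb") as fh:
        fh.write(data)
    return digest.encode("ascii") + b"\n"


def smudge(stub: bytes, cache_dir: str) -> bytes:
    """Run on checkout: turn the stub back into the original contents."""
    digest = stub.decode("ascii").strip()
    with open(os.path.join(cache_dir, digest), "rb") as fh:
        return fh.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as cache:
        image = bytes(range(256)) * 1000  # stand-in for a large binary
        stub = clean(image, cache)
        print(len(image), "bytes in,", len(stub), "bytes stored in Git")
        assert smudge(stub, cache) == image
```

In a real setup these would be two small programs that read from standard input and write to standard output, hooked up through a `filter` entry in `.gitattributes` and the matching `filter.<name>.clean` and `filter.<name>.smudge` settings. You would also need some way to get the cache directory to other people, which is the part the actual tools spend most of their effort on.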

The most popular solution for this is Git LFS, but I cannot say that I had much luck trying it out some time ago. It seemed to be poorly designed, which caused it to fail at most simple operations. Maybe it’s better now, I don’t know.

What I decided to use instead was git fat, but it wasn’t perfect either. Managing tens of thousands of binary files with a Python script that keeps spawning new processes is not a good idea on Windows. My solution was to write a native reimplementation of the Python utility, one that would be able to integrate directly with Git’s internals[3]. The result is git lard.

Closing words

I hope this little bit of theory will help you get more comfortable with Git. While I left out all the practical stuff, you will now have a much easier time trying to understand what you are actually doing.

You might also be interested in this presentation on Git by Randal Schwartz, which goes into a bit more detail, including the practical parts.


  1. Seriously.

  2. You can go back to them with a git reflog, a paper note with the SHA1 written down, or by looking at the full commit tree in any GUI.

  3. Why don’t you use a library like libgit2, you might ask. I didn’t, because nothing matches the speed of raw Git. We’re talking orders of magnitude.