Git - How it Works | Reza Nejatali

Git is a Distributed Version Control System (DVCS) created by Linux kernel creator Linus Torvalds.

Let’s dive deep into the internals of Git so that we can understand how it works.

It is expected that you have at least a basic general understanding of Git before reading this so that we can keep things as short and concise as possible. I know most of you are already well-versed in Git, but for those who are not, here’s a short introduction written by GitHub. Read that first and then come back; it should be fairly quick.

Let’s start by initializing a new repo in a directory called test:

$ git init test

Now let’s navigate inside of our new test directory and see what’s going on:

$ cd test
$ ls -a
.	..	.git

Notice that Git created its own subdirectory called .git.

Let’s peek inside the .git subdirectory:

$ ls -a .git
.		HEAD		description	info		refs
..		config		hooks		objects

You’ll notice it contains a few files, namely, HEAD, config, and description.

The HEAD file points to the branch you’ve currently checked out, the config file contains your project’s configuration options, and the description file is used by a program called GitWeb that provides a web frontend to Git repositories. There is also another file called index that hasn’t been created yet; it’s where your staging area information is stored.

You’ll also notice a few subdirectories, namely, hooks, info, objects, and refs.

The hooks directory contains your hook scripts, the info directory contains an exclude file that lists patterns you want Git to ignore in this repository only (unlike .gitignore, it is not shared with others), and the refs directory stores your references, such as branches and tags.

The objects directory is at the core of our repo; it stores all of our content in a simple key-value data store.

Git has four main object types: blobs, trees, commits, and tags.

Blobs

Let’s create our first object - a binary object we call a blob (Binary Large OBject).

Before we start, let’s make sure our objects directory is empty:

$ find .git/objects -type f

If no files were found, great, that’s what we want; otherwise, make sure you start with a fresh repo so you can follow along.

Let’s now create a file called guestlist.txt with some content and manually store it in our object database:

$ echo 'bob' > guestlist.txt
$ git hash-object -w guestlist.txt
696fb6baa5ce30099c89066294e5973ee42a1899

The command git-hash-object is a plumbing command¹ that takes some data (here, the contents of a file) and computes the SHA-1 hash of the object Git would create from it. The -w option actually writes that object into our object database.

You’ll notice the output of the command is a 40-character SHA-1 hash, 696fb6baa5ce30099c89066294e5973ee42a1899, which Git computes over the file’s content plus a short header.

Note that Git does not care about the name of the file when creating its hash - it only looks at its content. So if instead of guestlist.txt we used blacklist.txt, our hash would have been the exact same. Try it for yourself without the -w option!
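To see where the hash comes from, here is a minimal Python sketch. Python is not part of the original walkthrough, and the function name git_blob_hash is made up for illustration. It assumes the default SHA-1 object format, where Git hashes a short header (the word blob, the content length in bytes, and a NUL byte) followed by the content itself. The file name never enters the calculation:

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the SHA-1 that Git assigns to a blob with this content."""
    # Header: object type, a space, the content length in bytes, then a NUL byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


# `echo 'bob'` writes the three letters plus a trailing newline.
print(git_blob_hash(b"bob\n"))
# An empty file has a well-known blob hash.
print(git_blob_hash(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

If the assumption about the header holds, the first line printed should match the output of git hash-object above.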

If we now inspect our objects directory again, we should see it stored:

$ find .git/objects -type f
.git/objects/69/6fb6baa5ce30099c89066294e5973ee42a1899

Notice that Git has created a new subdirectory using the first two characters of our hash, 69, and then created a file inside of that subdirectory with filename set to the last 38 characters of our hash, 6fb6baa5ce30099c89066294e5973ee42a1899. In essence what Git has done is stored our object as a key-value pair in its database; the key being the SHA-1 hash and the value being the newly created blob.
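The storage step can be sketched in Python as well. This is an illustration under stated assumptions, not Git's actual implementation: it assumes loose objects are the header plus content, compressed with zlib, and the helper name write_loose_object is hypothetical. It writes into a throwaway directory rather than a real .git/objects:

```python
import hashlib
import os
import tempfile
import zlib


def write_loose_object(objects_dir: str, obj_type: str, content: bytes) -> str:
    """Store an object using the two-level layout described above; return its hash."""
    store = f"{obj_type} {len(content)}".encode("ascii") + b"\x00" + content
    digest = hashlib.sha1(store).hexdigest()
    # First two hex characters name the subdirectory, the other 38 name the file.
    subdir = os.path.join(objects_dir, digest[:2])
    os.makedirs(subdir, exist_ok=True)
    with open(os.path.join(subdir, digest[2:]), "wb") as f:
        f.write(zlib.compress(store))
    return digest


# Demo in a temporary directory standing in for .git/objects.
with tempfile.TemporaryDirectory() as objects_dir:
    key = write_loose_object(objects_dir, "blob", b"bob\n")
    path = os.path.join(objects_dir, key[:2], key[2:])
    print(os.path.relpath(path, objects_dir))
    with open(path, "rb") as f:
        print(zlib.decompress(f.read()))  # b'blob 4\x00bob\n'
```

The hash is taken over the uncompressed bytes, so the compression settings do not affect the key.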

We can verify that this object is indeed a blob by running the following command:

$ git cat-file -t 696fb6baa5ce30099c89066294e5973ee42a1899
blob

To prove it’s been stored properly, we can even try deleting our file and retrieving it from our object database:

$ rm guestlist.txt
$ git cat-file -p 696fb6baa5ce30099c89066294e5973ee42a1899 > guestlist.txt
$ cat guestlist.txt
bob

The command git-cat-file is used for inspecting Git objects. The -t option prints the object’s type, and the -p option pretty-prints its content.
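Conceptually, both options read the same stored bytes and report different parts of them. The following Python sketch shows the idea, assuming the "type, space, size, NUL, content" layout; parse_object is a hypothetical helper, and the compressed bytes are built in memory as a stand-in for a file under .git/objects:

```python
import zlib


def parse_object(raw: bytes) -> tuple[str, bytes]:
    """Split a decompressed object into (type, content)."""
    header, _, content = raw.partition(b"\x00")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type, content


# Stand-in for the bytes read from the object file of our guestlist blob.
compressed = zlib.compress(b"blob 4\x00bob\n")
obj_type, content = parse_object(zlib.decompress(compressed))
print(obj_type)                  # blob (roughly what `cat-file -t` reports)
print(content.decode(), end="")  # bob  (roughly what `cat-file -p` shows for a blob)
```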

Trees

Tree objects are Git’s way of storing the directory state of our index (i.e. staging area).

To make this more interesting, let’s add a few more things to our working directory:

$ echo 'this is new' > README.md
$ mkdir lists
$ mv guestlist.txt lists/.

So we’ve created a new file called README.md in the top-level directory. We then created a subdirectory called lists and moved our guestlist.txt file into it.

We’re now going to record the state of our index using tree objects.

If this all works as expected, we should end up with two trees and two blobs. The first tree will represent the top-level directory of the index, and it will contain the blob of our README.md file and another tree for our lists subdirectory. The lists tree should then contain the blob of our guestlist.txt file.

To start, let’s add everything to our index; we’ll be using plumbing commands to break this down into steps:

$ git update-index --add README.md lists/*
$ git write-tree
a9f2af46d3db09b6889a03313c5b89f5500aae3d

The command git-update-index adds files to the index. The --add option is there because our files are completely new (new files are ignored by default). The command git-write-tree then creates a tree using the current state of our index.

Notice that Git outputted our newly-created tree’s SHA-1 hash - a9f2af46d3db09b6889a03313c5b89f5500aae3d.

Just for sanity’s sake, let’s make sure it’s indeed a tree object:

$ git cat-file -t a9f2af46d3db09b6889a03313c5b89f5500aae3d
tree

Good, we’re not insane! Now let’s inspect it:

$ git cat-file -p a9f2af46d3db09b6889a03313c5b89f5500aae3d
100644 blob fff92cfdca952aad393fe93d1fc52995bac0b276	README.md
040000 tree fa367486f17fbd97cb9922f56cebe5d3df8768a8	lists

So as expected, our top-level tree has two entries - a blob of our README.md file and another tree for our lists subdirectory. The value 100644 to the left of the blob tells us this is a “normal” file². The value 040000 to the left of the tree tells us this is a directory.
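The mode values are octal numbers that combine a file type with permission bits. As a small illustration (describe_mode is a made-up helper, and Python's stat module is used here only as a convenient decoder), they can be rendered in the familiar ls -l style:

```python
import stat


def describe_mode(mode_text: str) -> str:
    """Render an octal mode string such as '100644' in `ls -l` style."""
    return stat.filemode(int(mode_text, 8))


for mode_text in ("100644", "100755", "120000", "040000"):
    print(mode_text, describe_mode(mode_text))
# 100644 -rw-r--r--
# 100755 -rwxr-xr-x
# 120000 l---------
# 040000 d---------
```

The leading digits carry the type (regular file, symbolic link, directory), and the trailing three digits carry the permissions.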

Let’s now inspect the new lists tree:

$ git cat-file -p fa367486f17fbd97cb9922f56cebe5d3df8768a8
100644 blob 696fb6baa5ce30099c89066294e5973ee42a1899	guestlist.txt

The lists tree has an entry for our guestlist.txt blob, just as expected!
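Tree objects are hashed the same way as blobs, just with a different header and a binary body. Here is a rough Python sketch, assuming each entry is encoded as mode, space, name, NUL, followed by the entry's hash as 20 raw bytes. The helper names tree_entry and git_tree_hash are hypothetical, and the sketch expects entries to be passed in Git's sort order already:

```python
import hashlib


def tree_entry(mode: str, name: str, hex_hash: str) -> bytes:
    """Encode one entry: '<mode> <name>' + NUL + the hash as 20 raw bytes."""
    return (mode.encode("ascii") + b" " + name.encode("utf-8")
            + b"\x00" + bytes.fromhex(hex_hash))


def git_tree_hash(entries: list[bytes]) -> str:
    """Hash a tree body made of already-sorted, already-encoded entries."""
    body = b"".join(entries)
    header = b"tree " + str(len(body)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


# The lists tree holds a single blob entry (hash taken from the output above).
lists_tree = git_tree_hash([
    tree_entry("100644", "guestlist.txt", "696fb6baa5ce30099c89066294e5973ee42a1899"),
])
print(lists_tree)

# Inside a tree, a directory's mode is stored as 40000; cat-file pads it to 040000.
top_tree = git_tree_hash([
    tree_entry("100644", "README.md", "fff92cfdca952aad393fe93d1fc52995bac0b276"),
    tree_entry("40000", "lists", lists_tree),
])
print(top_tree)
```

If these assumptions hold, the two printed values should match the hashes of the lists tree and the top-level tree shown above. Because the top-level tree embeds the hash of the lists tree, any change inside lists changes both.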

Here’s a visual representation of our object database:

[Figure: trees]

As you can see, all the content in our object database is stored as either trees or blobs.

Commits

Commit objects attach information like author, date, and a message to tree objects. In other words, they help us keep track of changes to our index over time.

Let’s create a commit object for our top-level tree:

$ echo 'Initial commit' | git commit-tree a9f2af46d3
5eac38ace7561353430551bf9daff2bd26fd8bac

The command git-commit-tree creates a new commit object based on the provided tree object. It normally takes the tree’s SHA-1 hash and the previous/parent commit’s hash as parameters. Because this is an initial (root) commit, there are no parents.

Notice we piped in a commit message - Initial commit.

You may also have noticed we omitted most of our tree’s SHA-1 hash. Git is smart enough to know that we don’t always need all 40 characters to uniquely identify an object³.
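The idea behind abbreviated hashes is a simple prefix lookup: a prefix is usable as long as exactly one object starts with it. A minimal Python sketch follows (resolve_prefix is a hypothetical name; real Git also enforces a minimum prefix length, which this sketch ignores):

```python
def resolve_prefix(prefix: str, known_hashes: list[str]) -> str:
    """Return the single full hash starting with prefix, or raise if none or several match."""
    matches = [h for h in known_hashes if h.startswith(prefix.lower())]
    if len(matches) != 1:
        raise LookupError(f"{prefix!r} matches {len(matches)} objects")
    return matches[0]


# The four objects created so far in this walkthrough.
objects = [
    "696fb6baa5ce30099c89066294e5973ee42a1899",
    "fff92cfdca952aad393fe93d1fc52995bac0b276",
    "fa367486f17fbd97cb9922f56cebe5d3df8768a8",
    "a9f2af46d3db09b6889a03313c5b89f5500aae3d",
]
print(resolve_prefix("a9f2af46d3", objects))  # a9f2af46d3db09b6889a03313c5b89f5500aae3d
```

With these four objects, the single character f would be ambiguous, since two hashes start with it.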

In response, Git outputs a SHA-1 hash associated with our new object.

Note that your hash will differ because your author and date data are different.

Let’s now inspect our new commit object (be sure to replace the hash below accordingly):

$ git cat-file -p 5eac38ace7
tree a9f2af46d3db09b6889a03313c5b89f5500aae3d
author Reza Nejatali <reza.n@me.com> 1615054555 -0800
committer Reza Nejatali <reza.n@me.com> 1615054555 -0800

Initial commit

The commit object has stored our top-level tree, the author/committer and timestamp data, and our commit message. If we had a parent commit, it would have been stored here as well.
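A commit is hashed like any other object: a header followed by a body, where the body is essentially the text that cat-file -p printed. The Python sketch below assumes a plain, unsigned commit with that exact layout. git_commit_hash is a hypothetical helper, and the author line reuses the values from the output above:

```python
import hashlib


def git_commit_hash(tree: str, author: str, committer: str, message: str,
                    parents: tuple[str, ...] = ()) -> str:
    """Hash a commit whose body mirrors the `cat-file -p` layout."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]  # a root commit has none
    lines += [f"author {author}", f"committer {committer}", "", message]
    body = ("\n".join(lines) + "\n").encode("utf-8")
    header = b"commit " + str(len(body)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


who = "Reza Nejatali <reza.n@me.com> 1615054555 -0800"
root = git_commit_hash("a9f2af46d3db09b6889a03313c5b89f5500aae3d", who, who, "Initial commit")
print(root)
```

Changing any field, even the timestamp by a single second, produces a different hash, which is why your commit hash will not match the one shown here.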

Now let’s try adding a commit with a parent.

We’ll start by making a change to our working directory and adding it to our index. We’ll then record the state of the index into a new tree object. Finally, we’ll use that new tree object to create a new commit.

Add the name mike to our guestlist.txt file:

$ echo 'mike' >> lists/guestlist.txt

Update the index:

$ git update-index --add lists/guestlist.txt

Write the state of the index into a new tree object:

$ git write-tree
d396af9f9f0bb49ea2dab86327744dc24d18dc33

Now create a new commit object:

$ echo 'Add mike to guestlist' | git commit-tree d396af9f9f -p 5eac38ace7
c45b5aa23956fedbb75337b9039e2c7d0125a16f

Notice we included a parent commit this time - the hash of our previous commit.

Inspecting the new commit object looks like this:

$ git cat-file -p c45b5aa239
tree d396af9f9f0bb49ea2dab86327744dc24d18dc33
parent 5eac38ace7561353430551bf9daff2bd26fd8bac
author Reza Nejatali <reza.n@me.com> 1615133151 -0800
committer Reza Nejatali <reza.n@me.com> 1615133151 -0800

Add mike to guestlist

Let’s take a look at our log:

$ git log --stat --oneline c45b5aa239
c45b5aa Add mike to guestlist
 lists/guestlist.txt | 1 +
 1 file changed, 1 insertion(+)
5eac38a Initial commit
 README.md           | 1 +
 lists/guestlist.txt | 1 +
 2 files changed, 2 insertions(+)

The git-log command shows us a history of commits. By default it starts from the commit that HEAD resolves to, but since our current branch (master) does not exist yet and therefore points to none of our commits, we specify our last commit object instead.

The --stat option instructs Git to show us a diffstat of our changes while the --oneline option instructs Git to make the output as compact as possible⁴.
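At its core, producing this history means following parent pointers from the commit we named back to the root. Here is a simplified Python sketch over an in-memory dictionary (a made-up stand-in for the object database). It assumes every commit has at most one parent, whereas real Git also handles merges and orders the output:

```python
# Hypothetical in-memory view of our two commits: hash -> (parent hash or None, message).
commits = {
    "c45b5aa23956fedbb75337b9039e2c7d0125a16f":
        ("5eac38ace7561353430551bf9daff2bd26fd8bac", "Add mike to guestlist"),
    "5eac38ace7561353430551bf9daff2bd26fd8bac":
        (None, "Initial commit"),
}


def walk_history(start: str) -> list[str]:
    """Follow parent pointers from start back to the root commit."""
    lines = []
    current = start
    while current is not None:
        parent, message = commits[current]
        lines.append(f"{current[:7]} {message}")
        current = parent
    return lines


print("\n".join(walk_history("c45b5aa23956fedbb75337b9039e2c7d0125a16f")))
# c45b5aa Add mike to guestlist
# 5eac38a Initial commit
```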

To summarize, we’ve created blobs of our files and added them into our index, we’ve created trees to record snapshots of our index, and we’ve created commits to track these snapshots over time.

Let’s inspect our objects directory again; we should see all of the objects I just mentioned:

$ find .git/objects -type f
.git/objects/69/6fb6baa5ce30099c89066294e5973ee42a1899 # blob (guestlist.txt v1)
.git/objects/b5/b00bf46743b0240efec87fcb99440c138e8902 # tree (lists v2)
.git/objects/d7/f503f4b3269f2216d46effd684d18d7966026f # blob (guestlist.txt v2)
.git/objects/5e/ac38ace7561353430551bf9daff2bd26fd8bac # commit 1
.git/objects/d3/96af9f9f0bb49ea2dab86327744dc24d18dc33 # tree (top-level v2)
.git/objects/a9/f2af46d3db09b6889a03313c5b89f5500aae3d # tree (top-level v1)
.git/objects/c4/5b5aa23956fedbb75337b9039e2c7d0125a16f # commit 2
.git/objects/fa/367486f17fbd97cb9922f56cebe5d3df8768a8 # tree (lists v1)
.git/objects/ff/f92cfdca952aad393fe93d1fc52995bac0b276 # blob (README.md)

The objects directory contains all of the blob, tree, and commit objects we expected. Together they track all of our content thus far.

Here’s an updated visual representation of our object database:

[Figure: commits]

References

References are a more generic way to describe user-friendly names that point to commit objects. Conceptually, the tags we described in the previous section were just references that stayed constant.

Creating a reference is just a matter of storing the SHA-1 hash of a commit object in a file, named after the reference, inside the .git/refs directory.

Let’s take a closer look at the .git/refs directory:

$ ls -a .git/refs
.	..	heads	tags

Notice the two subdirectories heads and tags. We’re already familiar with the tags subdirectory because it contains references to the tags v1 and v2 we created in the previous section.

The heads subdirectory contains references to the latest commit in a line of work. Another name for this kind of reference is a branch.

Let’s create a master branch.

The simplest way to do this would be to create a file named master containing the hash of our commit in the .git/refs/heads directory⁷:

$ echo 'c45b5aa23956fedbb75337b9039e2c7d0125a16f' > .git/refs/heads/master
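The same step can be sketched in Python to make the "a reference is just a file" idea explicit. The helpers write_ref and read_ref are hypothetical, the demo runs against a temporary directory standing in for .git, and it only covers references stored as individual files (Git can also keep them in a packed-refs file):

```python
import os
import tempfile


def write_ref(git_dir: str, ref: str, commit_hash: str) -> None:
    """Point a reference such as refs/heads/master at a commit."""
    path = os.path.join(git_dir, *ref.split("/"))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="ascii") as f:
        f.write(commit_hash + "\n")


def read_ref(git_dir: str, ref: str) -> str:
    """Return the hash stored in a reference file."""
    with open(os.path.join(git_dir, *ref.split("/")), encoding="ascii") as f:
        return f.read().strip()


# Demo against a throwaway directory, not a real repository.
with tempfile.TemporaryDirectory() as git_dir:
    write_ref(git_dir, "refs/heads/master", "c45b5aa23956fedbb75337b9039e2c7d0125a16f")
    print(read_ref(git_dir, "refs/heads/master"))  # c45b5aa23956fedbb75337b9039e2c7d0125a16f
```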

We can now conveniently use the master reference anytime we want to refer to this commit. For example, using the git-log command:

$ git log --oneline master
c45b5aa (HEAD -> master, tag: v2) Add mike to guestlist
5eac38a (tag: v1, test) Initial commit

As opposed to using the SHA-1 hash of the commit directly, which does the same thing:

$ git log --oneline c45b5aa239
c45b5aa (HEAD -> master, tag: v2) Add mike to guestlist
5eac38a (tag: v1, test) Initial commit

Let’s also create a branch called test and have it point to our first commit:

$ echo '5eac38ace7561353430551bf9daff2bd26fd8bac' > .git/refs/heads/test

Verify that test points to our first commit:

$ git log --oneline test
5eac38a (tag: v1, test) Initial commit

Awesome! So we’ve just created two branches using references - master and test.

Visually, this is what that looks like:

[Figure: refs]

Now, every time we make a commit, Git updates our branch reference to point to the latest commit.

Let’s see this in action. Add the name john to our guestlist.txt file:

$ echo 'john' >> lists/guestlist.txt

Make a new commit:

$ git commit -am 'Add john to guestlist'
[master 3b1d022] Add john to guestlist
 1 file changed, 1 insertion(+)

You’re probably already familiar with the command git-commit - it’s the porcelain way to create a new commit object. The combined option -am is short for -a, which instructs Git to automatically stage tracked files that have been modified, and -m, which uses the given message as the commit message.

Notice that we did not need to specify a parent commit this time. That’s because Git automatically resolves the latest commit from the reference pointed to by the HEAD file:

$ cat .git/HEAD
ref: refs/heads/master
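
Resolving HEAD therefore takes two reads: one for the HEAD file and one for the reference it names. Here is a rough Python sketch of that lookup, run against a temporary directory standing in for .git. The function resolve_head is hypothetical and follows only one level of indirection:

```python
import os
import tempfile


def resolve_head(git_dir: str) -> str:
    """Follow HEAD to the commit hash it ultimately names."""
    with open(os.path.join(git_dir, "HEAD"), encoding="ascii") as f:
        value = f.read().strip()
    # A symbolic ref looks like "ref: refs/heads/master"; follow it one level.
    if value.startswith("ref: "):
        ref_path = os.path.join(git_dir, *value[len("ref: "):].split("/"))
        with open(ref_path, encoding="ascii") as f:
            value = f.read().strip()
    return value


with tempfile.TemporaryDirectory() as git_dir:
    os.makedirs(os.path.join(git_dir, "refs", "heads"))
    with open(os.path.join(git_dir, "HEAD"), "w", encoding="ascii") as f:
        f.write("ref: refs/heads/master\n")
    with open(os.path.join(git_dir, "refs", "heads", "master"), "w", encoding="ascii") as f:
        f.write("c45b5aa23956fedbb75337b9039e2c7d0125a16f\n")
    print(resolve_head(git_dir))  # c45b5aa23956fedbb75337b9039e2c7d0125a16f
```

The hash found this way becomes the parent of the new commit.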

In response, Git outputted the first several characters of our new commit’s hash - 3b1d022. We expect the master branch to now contain the same hash.

To check, let’s peek inside .git/refs/heads/master:

$ cat .git/refs/heads/master
3b1d022eb88c6b691c0d0170564a9db0f8601743

Affirmative - the master branch contains the same hash, meaning Git has automatically updated it to point to the latest commit.

¹ The original Git developers made a distinction between “plumbing” commands and “porcelain” commands. High-level commands like git-add, git-checkout, and git-merge are called porcelain commands, while low-level commands are called plumbing commands. There is no clearly defined boundary between the two types and [as far as I can tell] the analogy is that of a toilet - most of the time we only really care about the porcelain part that we sit on when we do our business, but once in a while we may need to take a look at the plumbing situation underneath.
² You may notice these values are analogous to UNIX file modes (a file type plus octal permissions), but as far as Git blobs go, the only possible values are 100644 (normal), 100755 (executable), and 120000 (symbolic link).
³ In general, the more objects we have, the more hash characters we need to uniquely identify an object. As an example, the Linux kernel project (with its millions of objects) requires at least twelve hash characters to uniquely identify an object. But for most projects, ten characters is more than enough.
⁴ A diffstat is a histogram of insertions, deletions, and modifications on a per-file basis.
⁵ GPG or GNU Privacy Guard is the open-source variant of PGP (see my post on asymmetric-cryptography for details).
⁶ Although we pointed our annotated tag to a commit object, it can technically be pointed to any object in Git.
⁷ A safer alternative to this command is git-update-ref.
