The Insides Of Git by Konstantin Nazarov


Post on 16-Aug-2015


Why?

Because It’s Simpler Than You Think

It’s Like A Filesystem

In many ways you can just see git as a filesystem — it is content-addressable, and it has a notion of versioning.

Which Boils Down To A Simple DAG (Directed Acyclic Graph)

Let’s Start With Building Blocks

• Blobs
• Trees
• Commits

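Blobs (Files)

Blobs are where the data from files goes.

Instead of names, git puts the emphasis on the data itself, pointing to it by its SHA1 hash.

When you put a file in git, it compresses the data, stores it under objects/, and names the stored file after the SHA1 hash of the original data (prefixed with a short header, which shows up when we decompress an object later on).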
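As a minimal sketch of that naming scheme (not git's actual implementation): git hashes the content together with a short "blob <size>" header terminated by a NUL byte. The function name `blob_id` is made up for this illustration.

```python
import hashlib

def blob_id(data: bytes) -> str:
    """Return the SHA-1 object ID git would give a blob with this content."""
    # Git hashes a header ("blob <size>" plus a NUL byte) followed by the raw content.
    header = b"blob " + str(len(data)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + data).hexdigest()

# The same bytes that `echo "test" > foo/bar` writes (note the trailing newline).
print(blob_id(b"test\n"))  # 9daeafb9864cf43055ae93beb0afd6c7d144bfa4
```

The printed ID is the one git reports for foo/bar in the tree listings further on.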


Trees are used to name things. They are objects too.

Trees can point to other trees, so the result looks like a directory structure.

Two trees can reference the same data; it's not stored twice.
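A toy model may make the deduplication concrete. The sketch below is an in-memory stand-in for the object store, not how git keeps trees on disk; `store`, `put`, and the dict-based trees are hypothetical names invented here.

```python
import hashlib

store = {}  # hypothetical in-memory object store: object ID -> content

def put(data: bytes) -> str:
    """Store content under its own hash and return that hash."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\x00"
    oid = hashlib.sha1(header + data).hexdigest()
    store[oid] = data  # identical content always lands on the same key
    return oid

# Two "trees" (plain dicts here) give names to content by pointing at IDs.
tree_a = {"bar": put(b"test\n")}
tree_b = {"copy-of-bar": put(b"test\n")}

print(tree_a["bar"] == tree_b["copy-of-bar"])  # True: both names point at one object
print(len(store))                              # 1: the data is stored only once
```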


Commits represent a “snapshot” of the top-level tree

Like trees, they are objects too

Commits form history by referring to other commits
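To illustrate how history falls out of those references, here is a toy walk over parent links. The `commits` dict and the IDs `c1` to `c3` are made up; real commits are identified by SHA1 hashes and may have several parents.

```python
# Hypothetical commits: each records a tree ID and the IDs of its parent commits.
commits = {
    "c1": {"tree": "t1", "parents": []},      # root commit, no parent
    "c2": {"tree": "t2", "parents": ["c1"]},
    "c3": {"tree": "t3", "parents": ["c2"]},
}

def history(commit_id):
    """Follow first-parent links from a commit back to the root."""
    chain = []
    while commit_id is not None:
        chain.append(commit_id)
        parents = commits[commit_id]["parents"]
        commit_id = parents[0] if parents else None
    return chain

print(history("c3"))  # ['c3', 'c2', 'c1']
```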


That’s it!

Git is just a bunch of zlib-compressed text files.

I’ll show you how it looks inside

The following section is in shell, so you may try it yourself. Just remember to use your actual SHA1 values, not mine.

$ mkdir gittest
$ cd gittest

# initialize the empty git repository
$ git init

Reading/Writing Blobs


# write a simple file to the object database
$ echo 'homer' | git hash-object -w --stdin
4aa0bfa07f1680c50a1567ecc37bc3b6aa567b8f

# -w means actually write the data, not just hash it

$ find .git/objects -type f
.git/objects/4a/a0bfa07f1680c50a1567ecc37bc3b6aa567b8f

$ git cat-file -p 4aa0b
homer

$ git cat-file -t 4aa0b
blob

As I told you, blobs are just compressed data. Let’s check that.

$ python3
>>> import zlib
>>> f = open('.git/objects/4a/a0bfa07f1680c50a1567ecc37bc3b6aa567b8f', 'rb')
>>> print(zlib.decompress(f.read()))
b'blob 6\x00homer\n'
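Going the other way, the following sketch writes a blob the way `git hash-object -w` appears to, judging from the layout seen above: header plus content, hashed with SHA1, zlib-compressed, and stored in a directory named after the first two hex digits. The function names are made up, and the demo writes into a temporary directory rather than a real .git.

```python
import hashlib
import os
import tempfile
import zlib

def write_blob(objects_dir: str, data: bytes) -> str:
    """Write data as a loose blob under objects_dir and return its ID."""
    raw = b"blob " + str(len(data)).encode("ascii") + b"\x00" + data
    oid = hashlib.sha1(raw).hexdigest()
    # The first two hex digits name the directory, the remaining 38 name the file.
    path = os.path.join(objects_dir, oid[:2], oid[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(raw))
    return oid

def read_object(objects_dir: str, oid: str):
    """Return (type, content) of a loose object."""
    with open(os.path.join(objects_dir, oid[:2], oid[2:]), "rb") as f:
        raw = zlib.decompress(f.read())
    header, _, content = raw.partition(b"\x00")
    obj_type, size = header.split(b" ")
    assert int(size) == len(content)  # the header records the content length
    return obj_type.decode(), content

objects_dir = tempfile.mkdtemp()  # stand-in for .git/objects
oid = write_blob(objects_dir, b"homer\n")
print(read_object(objects_dir, oid))  # ('blob', b'homer\n')
```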

Reading/Writing Trees


# Create directory structure
$ mkdir foo
$ echo "test" > foo/bar
$ echo "test2" > baz

# This is just a way to create the tree
$ git update-index --add foo/bar baz
$ git write-tree
af6c7364afaa4488d8c6edd44306b91b20dcba93

# This is what a plain tree file looks like
# (bigfile.dat is not created by the commands above, so it presumably came
# from the author's own repository; expect only baz and foo in yours)
$ git cat-file -p af6c7
100644 blob 180cf8328022becee9aaa2577a8f84ea2b9f3827 baz
100644 blob 4200aa606ead5dd5777a0b391f085cc4f4690d04 bigfile.dat
040000 tree 701ce0a12c61f997c092d30121a256d17144766a foo

# And the child tree
$ git cat-file -p 701ce0
100644 blob 9daeafb9864cf43055ae93beb0afd6c7d144bfa4 bar

# And the data file
$ git cat-file -p 9daea
test

In general, the plain tree structure looks like this (line breaks added for readability; the real file has none):

# format:
tree [content size]\0
[mode] [file/folder name]\0[20-byte binary SHA-1 of the referenced blob or tree]
...
[mode] [file/folder name]\0[20-byte binary SHA-1 of the referenced blob or tree]
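The format above can be assembled by hand. This is a sketch under the assumption that the entries are already in the order git expects (git sorts them by name); `tree_object` and `tree_id` are names invented here.

```python
import hashlib

def tree_object(entries) -> bytes:
    """Build raw tree bytes from (mode, name, hex_sha1) tuples."""
    body = b""
    for mode, name, hex_sha1 in entries:
        # "[mode] [name]" plus a NUL byte, then the SHA-1 as 20 raw bytes
        # (not as 40 hex digits).
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\x00"
        body += bytes.fromhex(hex_sha1)
    return b"tree " + str(len(body)).encode("ascii") + b"\x00" + body

def tree_id(entries) -> str:
    """Hash the raw tree bytes to get the tree's object ID."""
    return hashlib.sha1(tree_object(entries)).hexdigest()

# The child tree from the listing above: a single file named "bar".
foo = [("100644", "bar", "9daeafb9864cf43055ae93beb0afd6c7d144bfa4")]
print(tree_object(foo)[:8])  # b'tree 31\x00'
print(tree_id(foo))
```

If the format is right, the second line should match the ID git prints for foo when you run the same steps in your own repository.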

Let’s try the same trick with Python.

Since some data is binary, I’ve done a bit of pretty-printing.

$ python3
>>> import zlib
>>> f = open('.git/objects/46/c826e9c8119915961f6acb01f6f842fb1e444a', 'rb')
>>> d = zlib.decompress(f.read())
>>> # split the "tree <size>" header from the entries at the first NUL byte
>>> head, _, tail = d.partition(b'\x00')
>>> print(head.decode())
>>> while tail:
...     pos = tail.find(b'\x00')
...     # "<mode> <name>", then the 20 raw SHA-1 bytes shown as hex
...     print(tail[:pos].decode() + " " + tail[pos+1:pos+21].hex())
...     tail = tail[pos+21:]
...

Result:
tree 100
100644 baz df6b0d2bcc76e6ec0fca20c227104a4f28bac41b
100644 bigfile.dat 4200aa606ead5dd5777a0b391f085cc4f4690d04
40000 foo 701ce0a12c61f997c092d30121a256d17144766a

Reading/Writing Commits


# Get the last tree we've created
$ git write-tree
46c826e9c8119915961f6acb01f6f842fb1e444a

# actually do the commit
$ echo '1st commit' | git commit-tree 46c82
afa322a9790619a18ec6e751469008551b3a5c77

# and read back the raw commit file
$ git cat-file -p afa32
tree 46c826e9c8119915961f6acb01f6f842fb1e444a
author Konstantin Nazarov <[email protected]> 1421934034 +0300
committer Konstantin Nazarov <[email protected]> 1421934034 +0300

1st commit

Let’s try the commit hierarchy


# change the tree
$ echo "test4" > baz
$ git update-index --add baz
$ git write-tree
fb74bbb3f99afed23612d2f03e5cd80775bd2f8a

# commit it (also specify a parent)
$ echo '2nd commit' | git commit-tree fb74b -p afa32
224dde75daa6879629304840aa1fd3a76187aaba

# see how it's changed
$ git cat-file -p 224dd
tree fb74bbb3f99afed23612d2f03e5cd80775bd2f8a
parent afa322a9790619a18ec6e751469008551b3a5c77
author Konstantin Nazarov <[email protected]> 1421934840 +0300
committer Konstantin Nazarov <[email protected]> 1421934840 +0300

2nd commit

Now let’s dump the commit with python.

Just to prove there is no magic.

$ python3
>>> import zlib
>>> f = open('.git/objects/af/a322a9790619a18ec6e751469008551b3a5c77', 'rb')
>>> d = zlib.decompress(f.read())
>>> print(d.replace(b'\x00', b'\n').decode())

Result:
commit 197
tree 46c826e9c8119915961f6acb01f6f842fb1e444a
author Konstantin Nazarov <[email protected]> 1421934034 +0300
committer Konstantin Nazarov <[email protected]> 1421934034 +0300

1st commit
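A commit can be assembled by hand in the same way. The sketch below builds the raw bytes for a commit; the author identity is a placeholder, so the resulting ID will not match any commit in this talk. `commit_object` is a hypothetical helper, not part of git.

```python
import hashlib

def commit_object(tree, message, ident, parents=()):
    """Build raw commit bytes: header, NUL, then plain-text fields and message."""
    lines = ["tree " + tree]
    lines += ["parent " + p for p in parents]
    lines += ["author " + ident, "committer " + ident]
    # A blank line separates the fields from the commit message.
    body = ("\n".join(lines) + "\n\n" + message + "\n").encode("utf-8")
    return b"commit " + str(len(body)).encode("ascii") + b"\x00" + body

# Placeholder identity, not the one from the slides.
ident = "A U Thor <author@example.com> 1421934034 +0300"
raw = commit_object("46c826e9c8119915961f6acb01f6f842fb1e444a", "1st commit", ident)

print(raw.replace(b"\x00", b"\n").decode())
print(hashlib.sha1(raw).hexdigest())
```

Passing `parents=[...]` adds the `parent` lines that link a commit to its predecessors.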

As you see, no magic. Just plain text files, hashed with SHA1 and formed into a graph.

So, what are branches then?

Just references to the top commit!

$ cat .git/refs/heads/master
6566bfcd3a111ea6a1cf594301c39c7c4b1baf3c

$ git cat-file -t 6566bf
commit
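Resolving a branch by hand is then a matter of reading a small file. A sketch, assuming the branch is stored as a loose file under refs/heads (git can also keep refs in a packed-refs file, which this ignores); `resolve_branch` is a made-up helper, and the demo builds a throwaway directory instead of touching a real repository.

```python
import os
import tempfile

def resolve_branch(git_dir: str, branch: str) -> str:
    """Return the commit ID stored in a loose branch ref."""
    with open(os.path.join(git_dir, "refs", "heads", branch)) as f:
        return f.read().strip()  # the file holds 40 hex digits and a newline

# Throwaway stand-in for a .git directory.
git_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(git_dir, "refs", "heads"))
with open(os.path.join(git_dir, "refs", "heads", "master"), "w") as f:
    f.write("6566bfcd3a111ea6a1cf594301c39c7c4b1baf3c\n")

print(resolve_branch(git_dir, "master"))  # 6566bfcd3a111ea6a1cf594301c39c7c4b1baf3c
```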

Questions?