aboutsummaryrefslogtreecommitdiff
path: root/posts/git-packfiles.org
diff options
context:
space:
mode:
Diffstat (limited to 'posts/git-packfiles.org')
-rw-r--r--posts/git-packfiles.org285
1 files changed, 285 insertions, 0 deletions
diff --git a/posts/git-packfiles.org b/posts/git-packfiles.org
new file mode 100644
index 0000000..9939df3
--- /dev/null
+++ b/posts/git-packfiles.org
@@ -0,0 +1,285 @@
+#+TITLE: Git Packfiles
+#+DESCRIPTION: Introduction to Git Packfiles
+#+TAGS: Git
+#+TAGS: Internals
+#+TAGS: Learning
+#+DATE: 2017-03-01
+#+SLUG: git-packfiles
+#+LINK: git-scm https://git-scm.com/
+#+LINK: git-in-reverse https://kennyballou/blog/2016/01/git-in-reverse
+#+LINK: udiff https://www.gnu.org/software/diffutils/manual/html_node/Unified-Format.html
+#+LINK: git-show https://www.kernel.org/pub/software/scm/git/docs/git-show.html
+#+LINK: git-diff https://www.kernel.org/pub/software/scm/git/docs/git-diff.html
+#+LINK: loose-objects-so http://stackoverflow.com/questions/5709687/what-are-the-loose-objects-that-the-git-gui-refers-to#5710039
+#+LINK: git-internal-packfiles https://git-scm.com/book/en/v2/Git-Internals-Packfiles
+#+LINK: git-verify-pack https://git-scm.com/docs/git-verify-pack
+#+LINK: pack-format-txt https://git.kernel.org/cgit/git/git.git/tree/Documentation/technical/pack-format.txt
+#+LINK: unpacking-packfiles https://codewords.recurse.com/issues/three/unpacking-git-packfiles/
+#+LINK: git-gc https://www.kernel.org/pub/software/scm/git/docs/git-gc.html
+
+#+BEGIN_PREVIEW
+Previously, in [[git-in-reverse][Git in Reverse]], we learned
+about [[https://git-scm.com/][Git]] and how it internally stores information.
+Namely, we went over the [[loose-objects-so]["loose" object]] format that Git
+uses for storage. However, in the last post, we did not discuss how Git uses
+another storage format to more compactly store files, changes, and ultimately
+objects. In this post we will discuss packfiles and how Git uses these
+primarily for using less bandwidth and, only secondarily, using less storage
+space for storing repository contents.
+#+END_PREVIEW
+
+We're only going to discuss the high-level details of packfiles, there are
+[[git-internal-packfiles][plenty]] of [[git-verify-pack][sources]] that
+[[pack-format-txt][describe]] the [[unpacking-packfiles][details]] better.
+
+** Packfiles
+
+Packfiles, like [[git-in-reverse][git objects before]], are an internal file
+set for storing objects in a more compressed format. That is, instead of
+storing /each/ version of a file in its entirety, Git can store a single
+version of the file in its entirety and maintain an internal set of objects
+which contain patches to derive the other versions. Furthermore, Git can store
+entire repository's objects into a single packfile, thus eliminating large
+numbers of small files and improving efficiency of object access.
+
+The actual files themselves are in the ~.git/objects/pack~ folder of a
+repository and there are both pack, ~.pack~, files and index, ~.idx~, files.
+
+Here is the packfile that contains this repository (as of this writing):
+
+#+BEGIN_EXAMPLE
+ ± find .git/objects/pack -type f
+ .git/objects/pack/pack-31966bc41ef450ccfecdfb5ef6cd98f7097eea38.pack
+ .git/objects/pack/pack-31966bc41ef450ccfecdfb5ef6cd98f7097eea38.idx
+#+END_EXAMPLE
+
+Notice, there are not two "packs", but two files that describe the same "pack".
+There is the ~.pack~ file itself. This is the file that contains the actual
+objects. There is also the ~.idx~ file which provides an "index" of the
+objects contained in the pack.
+
+We'll take a small moment to describe each in a little more detail.
+
+*** Packs
+
+Packfiles are relatively straight forward, there's a 12 byte header, first four
+spell "PACK", next four provide the version, "2" as of this writing, and the
+final four provide the number of objects in this pack. Following the header,
+there's a number of objects stored in a very compact but variable length
+format. Finally, there's a 20 byte trailer that is the checksum of the
+packfile's contents-- header and objects.
+
+In the header, the number of objects is encoded in a 4-byte integer, thus,
+there can only be \(2\^{32}\) or little over 4 billion objects in a packfile.
+However, this does not give an upper bound of the /size/ of the pack files
+themselves on disk. The length of each object is encoded in a variable length
+integer prefacing each object in the packfile.
+
+The format of the objects in the packfile is not as they usually exist in the
+loose format, but it will compress them /more/, usually resulting in less space
+used on disk. That is, the objects stored in the packfile may be a base,
+/undeltified/ object, or it may be a /deltified/ object.
+
+Undeltified objects are not necessarily as interesting, for one, because they
+are already [covered][3]. The deltified objects, however, are pretty
+interesting, and definitely different.
+
+The deltified objects, as the name might imply, contain the delta, or,
+preferably, the patch and the base object name to create the defined object.
+That is, Git will store inside a regular Git object a patch used to derive the
+defined object. But it only does this in the context of packfiles.
+Furthermore, the structure allows for the base object to itself be a deltified
+object, thus, making it possible to only store one version of the full file,
+but then derive all other versions from deltas or patches.
+
+While it is entirely possible to use only the packfile itself to access the
+contained objects, it's not very efficient for random access. Therefore, the
+index file is created to maintain a way to peer into the packfile efficiently.
+
+*** Indexes
+
+Packfile indexes solve the random object access efficiency problems caused by
+heavily compacting objects into a single file.
+
+Although, the contents of the index are little more complicated than the pack
+file.
+
+In version 1 of packfiles, the index does not have a header. In version 2, the
+current version, there are 8 bytes dedicated to the header: the first 4 bytes
+will always be ~255, 116, 79, 99~, because these are invalid bytes for the
+fanout table; the other 4 bytes of the header are dedicated to the version,
+currently, ~2~.
+
+Following the "$header", there is, what Git calls, a fanout table. This header
+table consists of 256 4-byte integers, each entry of the table records the
+number of objects whose first byte are less than or equal to this entry.
+
+That is, if the repository has 2 objects that start with ~00~, there will be a
+2 in the ~00~th entry of the table. Furthermore, if there are 3 objects that
+start with ~01~, the ~01~th entry will report /5/ objects. Remember, each
+entry in the table is the sum of all previous entries ("less than or equal to
+this entry"). Examining at the 256th entry would provide the total number of
+objects in the packfile.
+
+Following the fanout table is a sorted table of 20-byte SHA-1 hashes.
+
+In version 2, there is another table following the sorted hashes that consists
+of 4-byte CRC32 values of the packed object data. This table enables easier
+copying of data between packfiles. For example, this improves the efficiency
+of creating new packfiles for new objects.
+
+Next, is another table of 4-byte offset values, usually packed into 31-bits,
+larger offsets being encoded as offsets for indexes into the next table.
+
+Last table, 8-byte offset entries, this table will be empty if the packfile is
+less than 2GiB.
+
+Finally, there is a 20-byte checksum of the packfile and another 20-byte
+checksum of all of the above data.
+
+All of these tables are used to make sure Git has very quick and efficient
+access to objects in the repository.
+
+*** Plumbing
+
+Git will automatically create packfiles when synchronizing a repository (e.g.,
+pushing, pulling, cloning), but they can also be created manually with the
+[[git-gc][~git-gc~]] command. Let's assume there are some loose objects in the
+current repository.
+
+#+BEGIN_EXAMPLE
+ ± find .git/objects -type f
+ .git/objects/f2/e90bed364168fcca0893437fb569d762cdbbce
+ .git/objects/f4/2946046ed0926d5c7b34772642478390a696c9
+ .git/objects/87/713bb957eef1ed6a8d12f36b2d8b328a72b453
+ .git/objects/8c/d57af30ad9bf0f2e0640d0141eb908d276d2f1
+ .git/objects/1f/846d4278f5741d33111d28c03d29b589dabffe
+ .git/objects/be/020e47fadb8d80281259b1f886c3940dc51a19
+ .git/objects/d1/2254d273712af99e0585e7dd9dfea2106d5692
+ .git/objects/ea/41dba10b54a794284e0be009a11f0ff3716a28
+ .git/objects/98/c37b0fb33a8b2f7ac4c5d94571382071ae859c
+ .git/objects/4d/5fcadc293a348e88f777dc0920f11e7d71441c
+ .git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391
+ ± git gc
+ Counting objects: 11, done.
+ Delta compression using up to 4 threads.
+ Compressing objects: 100% (5/5), done.
+ Writing objects: 100% (11/11), done.
+ Total 11 (delta 0), reused 0 (delta 0)
+ ± find .git/objects -type f
+ .git/objects/info/packs
+ .git/objects/pack/pack-1fc05518e49da3867792b704561b68d5b00e6317.idx
+ .git/objects/pack/pack-1fc05518e49da3867792b704561b68d5b00e6317.pack
+#+END_EXAMPLE
+
+We started with 11 objects, in the loose format, we ran [[git-gc][~git-gc~]]
+and we are left with a packfile.
+
+The output of [[git-gc][~git-gc~]] tells us how many objects we packed, how
+many delta objects were used to create the pack, in this case, 0, and how many
+objects were copied from an existing pack and how many deltas from an existing
+pack, both 0 in this example.
+
+Of course, we can also examine the packfile with the
+[[git-verify-pack][~git-verify-pack~]] command:
+
+#+BEGIN_EXAMPLE
+ ± git verify-pack -v .git/objects/pack/pack-1fc05518e49da3867792b704561b68d5b00e6317.idx
+ f2e90bed364168fcca0893437fb569d762cdbbce commit 225 153 12
+ d12254d273712af99e0585e7dd9dfea2106d5692 commit 220 145 165
+ 98c37b0fb33a8b2f7ac4c5d94571382071ae859c commit 172 117 310
+ e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 blob 0 9 427
+ be020e47fadb8d80281259b1f886c3940dc51a19 blob 9 18 436
+ f42946046ed0926d5c7b34772642478390a696c9 tree 93 81 454
+ 87713bb957eef1ed6a8d12f36b2d8b328a72b453 tree 31 40 535
+ 8cd57af30ad9bf0f2e0640d0141eb908d276d2f1 tree 31 40 575
+ 1f846d4278f5741d33111d28c03d29b589dabffe tree 31 42 615
+ ea41dba10b54a794284e0be009a11f0ff3716a28 tree 62 50 657
+ 4d5fcadc293a348e88f777dc0920f11e7d71441c tree 31 42 707
+ non delta: 11 objects
+ .git/objects/pack/pack-1fc05518e49da3867792b704561b68d5b00e6317.pack: ok
+#+END_EXAMPLE
+
+#+BEGIN_QUOTE
+ It does not matter whether the ~.pack~ or ~.idx~ file are specified to the
+ [[git-verify-pack][~git-verify-pack~]] command, the output will be the same.
+ However, tab completion will prefer the ~.idx~ files.
+#+END_QUOTE
+
+This output has a lot of information to it: first, it tells us about all the
+objects in the packfile, we see our 11 original objects from before. But we
+are also given each object's type, size, size in pack, and offset into the
+packfile, respectively. For undeltified objects, these sizes won't be very
+different, but for deltified objects, these two sizes can be significantly
+different.
+
+This output also tells us the pack contains no deltified objects. Let's see
+what this would look like with deltified objects:
+
+#+BEGIN_EXAMPLE
+ ± git gc
+ Counting objects: 17, done.
+ Delta compression using up to 4 threads.
+ Compressing objects: 100% (9/9), done.
+ Writing objects: 100% (17/17), done.
+ Total 17 (delta 1), reused 10 (delta 0)
+ ± git verify-pack -v .git/objects/pack/pack-21f02890d9770ec6b5a566c3c82c03e69f530c19.idx
+ 47f24ac6ba3af12714f0dbf7219b9d854f269097 commit 219 146 12
+ 8cfd10e321ac6349132ceb93774f0a881a1b9316 commit 219 146 158
+ f2e90bed364168fcca0893437fb569d762cdbbce commit 225 153 304
+ d12254d273712af99e0585e7dd9dfea2106d5692 commit 220 145 457
+ 98c37b0fb33a8b2f7ac4c5d94571382071ae859c commit 172 117 602
+ 5716ca5987cbf97d6bb54920bea6adde242d87e6 blob 4 13 719
+ be020e47fadb8d80281259b1f886c3940dc51a19 blob 9 18 732
+ 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 blob 4 13 750
+ 3783c58c8b17ba95b2917e5f92a0395efcec9759 tree 93 100 763
+ 87713bb957eef1ed6a8d12f36b2d8b328a72b453 tree 31 40 863
+ 8cd57af30ad9bf0f2e0640d0141eb908d276d2f1 tree 31 40 903
+ 1f846d4278f5741d33111d28c03d29b589dabffe tree 31 42 943
+ 7470c9c852271284dfb0cb8f3ad9047709847e0d tree 93 101 985
+ e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 blob 0 9 1086
+ f42946046ed0926d5c7b34772642478390a696c9 tree 25 37 1095 1 7470c9c852271284dfb0cb8f3ad9047709847e0d
+ ea41dba10b54a794284e0be009a11f0ff3716a28 tree 62 50 1132
+ 4d5fcadc293a348e88f777dc0920f11e7d71441c tree 31 42 1182
+ non delta: 16 objects
+ chain length = 1: 1 object
+ .git/objects/pack/pack-21f02890d9770ec6b5a566c3c82c03e69f530c19.pack: ok
+ ± find .git/objects -type f
+ .git/objects/info/packs
+ .git/objects/pack/pack-21f02890d9770ec6b5a566c3c82c03e69f530c19.idx
+ .git/objects/pack/pack-21f02890d9770ec6b5a566c3c82c03e69f530c19.pack
+#+END_EXAMPLE
+
+Notice, we repacked the repository then listed the contents of the new pack,
+also notice the old pack is gone, but the objects that were in the old pack are
+still available in the new pack.
+
+More importantly, notice that ~f42946~ is a deltified object based on
+~7470c9c~. That is, the tree defined in ~f42946~ is derived by patching
+~7470c9c~ with the contents of the object in the packfile. This is also
+evident in the size listings, the size on disk of the loose object is 25 bytes,
+but the size in the pack is 37. The increase in size is often, unfortunately,
+due to how text compression sometimes /doesn't/ work. This is the first look
+of what Git calls "chains".
+
+Chains are a simple way to describe the length of a deltified object set. The
+longest chain in this repository is only 1. But if we examine bigger
+repositories, this number could be much higher. Git itself, for example, has a
+chain length of 46 for one object, or another 6 objects with a chain length of
+44 each.
+
+Another thing to note, unlike the loose object format, it's much more difficult
+to get to the contents of the objects in the packfile /using/ only the packfile
+without some effort. However, ~git-cat-file~ and other plumbing commands will
+still work as expected given an object name, even if the object is contained
+within a packfile.
+
+** Summary
+
+Hopefully, we now have a deeper knowledge of the compact object format Git
+uses, namely, packfiles. Remember, the motivation for these files was not
+efficiency in storage, but efficiency in network bandwidth when transferring
+objects and lookup speed when there's a large number of loose objects. Thus,
+if working in stealth mode, it can be sometimes important to run
+[[git-gc][~git-gc~]] occasionally to keep your private repository quick and
+efficient.