aboutsummaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorKenny Ballou <kballou@devnulllabs.io>2018-08-17 22:31:51 -0600
committerKenny Ballou <kballou@devnulllabs.io>2018-08-19 08:13:05 -0600
commit2de601f4352efbb27a64c29d92b582ec32defff8 (patch)
tree18946983bd5db2389050c8bd0a712d62292e66eb
parent37fe56ef72c6207b71d5defb505ee5fcbb23bb64 (diff)
downloadblog.kennyballou.com-2de601f4352efbb27a64c29d92b582ec32defff8.tar.gz
blog.kennyballou.com-2de601f4352efbb27a64c29d92b582ec32defff8.tar.xz
git-packfiles post conversion
-rw-r--r--posts/git-packfiles.org (renamed from content/blog/git-packfiles.markdown)234
1 files changed, 118 insertions, 116 deletions
diff --git a/content/blog/git-packfiles.markdown b/posts/git-packfiles.org
index e03b49c..9939df3 100644
--- a/content/blog/git-packfiles.markdown
+++ b/posts/git-packfiles.org
@@ -1,82 +1,93 @@
----
-title: "Git Packfiles"
-description: "Introduction to Git Packfiles"
-tags:
- - "Git"
- - "Internals"
- - "Learning"
-date: "2017-03-01"
-categories:
- - "Development"
-slug: "git-packfiles"
----
-
-Previously, in [Git in Reverse][3], we learned about [Git][1] and how it
-internally stores information. Namely, we went over the ["loose" object][9]
-format that Git uses for storage. However, in the last post, we did not discuss
-how Git uses another storage format to more compactly store files, changes, and
-ultimately objects. In this post we will discuss packfiles and how Git uses
-these primarily for using less bandwidth and, only secondarily, using less
-storage space for storing repository contents.
+#+TITLE: Git Packfiles
+#+DESCRIPTION: Introduction to Git Packfiles
+#+TAGS: Git
+#+TAGS: Internals
+#+TAGS: Learning
+#+DATE: 2017-03-01
+#+SLUG: git-packfiles
+#+LINK: git-scm https://git-scm.com/
+#+LINK: git-in-reverse https://kennyballou/blog/2016/01/git-in-reverse
+#+LINK: udiff https://www.gnu.org/software/diffutils/manual/html_node/Unified-Format.html
+#+LINK: git-show https://www.kernel.org/pub/software/scm/git/docs/git-show.html
+#+LINK: git-diff https://www.kernel.org/pub/software/scm/git/docs/git-diff.html
+#+LINK: loose-objects-so http://stackoverflow.com/questions/5709687/what-are-the-loose-objects-that-the-git-gui-refers-to#5710039
+#+LINK: git-internal-packfiles https://git-scm.com/book/en/v2/Git-Internals-Packfiles
+#+LINK: git-verify-pack https://git-scm.com/docs/git-verify-pack
+#+LINK: pack-format-txt https://git.kernel.org/cgit/git/git.git/tree/Documentation/technical/pack-format.txt
+#+LINK: unpacking-packfiles https://codewords.recurse.com/issues/three/unpacking-git-packfiles/
+#+LINK: git-gc https://www.kernel.org/pub/software/scm/git/docs/git-gc.html
+
+#+BEGIN_PREVIEW
+Previously, in [[git-in-reverse][Git in Reverse]], we learned
+about [[https://git-scm.com/][Git]] and how it internally stores information.
+Namely, we went over the [[loose-objects-so]["loose" object]] format that Git
+uses for storage. However, in the last post, we did not discuss how Git uses
+another storage format to more compactly store files, changes, and ultimately
+objects. In this post we will discuss packfiles and how Git uses these
+primarily for using less bandwidth and, only secondarily, using less storage
+space for storing repository contents.
+#+END_PREVIEW
We're only going to discuss the high-level details of packfiles, there are
-[plenty][2] of [sources][5] that [describe][6] the [details][4] better.
+[[git-internal-packfiles][plenty]] of [[git-verify-pack][sources]] that
+[[pack-format-txt][describe]] the [[unpacking-packfiles][details]] better.
-## Packfiles ##
+** Packfiles
-Packfiles, like [git objects before][3], are an internal file set for storing
-objects in a more compressed format. That is, instead of storing _each_ version
-of a file in its entirety, Git can store a single version of the file in its
-entirety and maintain an internal set of objects which contain patches to
-derive the other versions. Furthermore, Git can store entire repository's
-objects into a single packfile, thus eliminating large numbers of small files
-and improving efficiency of object access.
+Packfiles, like [[git-in-reverse][git objects before]], are an internal file
+set for storing objects in a more compressed format. That is, instead of
+storing /each/ version of a file in its entirety, Git can store a single
+version of the file in its entirety and maintain an internal set of objects
+which contain patches to derive the other versions. Furthermore, Git can store
+entire repository's objects into a single packfile, thus eliminating large
+numbers of small files and improving efficiency of object access.
-The actual files themselves are in the `.git/objects/pack` folder of a
-repository and there are both pack, `.pack`, files and index, `.idx`,
-files.
+The actual files themselves are in the ~.git/objects/pack~ folder of a
+repository and there are both pack, ~.pack~, files and index, ~.idx~, files.
Here is the packfile that contains this repository (as of this writing):
+#+BEGIN_EXAMPLE
± find .git/objects/pack -type f
.git/objects/pack/pack-31966bc41ef450ccfecdfb5ef6cd98f7097eea38.pack
.git/objects/pack/pack-31966bc41ef450ccfecdfb5ef6cd98f7097eea38.idx
+#+END_EXAMPLE
Notice, there are not two "packs", but two files that describe the same "pack".
-There is the `.pack` file itself. This is the file that contains the actual
-objects. There is also the `.idx` file which provides an "index" of the objects
-contained in the pack.
+There is the ~.pack~ file itself. This is the file that contains the actual
+objects. There is also the ~.idx~ file which provides an "index" of the
+objects contained in the pack.
We'll take a small moment to describe each in a little more detail.
-### Packs ###
+*** Packs
Packfiles are relatively straight forward, there's a 12 byte header, first four
spell "PACK", next four provide the version, "2" as of this writing, and the
-final four provide the number of objects in this pack. Following the header,
+final four provide the number of objects in this pack. Following the header,
there's a number of objects stored in a very compact but variable length
format. Finally, there's a 20 byte trailer that is the checksum of the
packfile's contents-- header and objects.
In the header, the number of objects is encoded in a 4-byte integer, thus,
-there can only be \\(2^{32}\\) or little over 4 billion objects in a packfile.
-However, this does not give an upper bound of the _size_ of the pack files
-themselves on disk. The length of each object is encoded in a variable length
+there can only be \(2\^{32}\) or little over 4 billion objects in a packfile.
+However, this does not give an upper bound of the /size/ of the pack files
+themselves on disk. The length of each object is encoded in a variable length
integer prefacing each object in the packfile.
The format of the objects in the packfile is not as they usually exist in the
-loose format, but it will compress them _more_, usually resulting in less space
-used on disk. That is, the objects stored in the packfile may be a base,
-_undeltified_ object, or it may be a _deltified_ object.
+loose format, but it will compress them /more/, usually resulting in less space
+used on disk. That is, the objects stored in the packfile may be a base,
+/undeltified/ object, or it may be a /deltified/ object.
Undeltified objects are not necessarily as interesting, for one, because they
-are already [covered][3]. The deltified objects, however, are pretty
+are already [covered][3]. The deltified objects, however, are pretty
interesting, and definitely different.
The deltified objects, as the name might imply, contain the delta, or,
preferably, the patch and the base object name to create the defined object.
-That is, Git will store inside a regular Git object a patch used to derive
-the defined object. But it only does this in the context of packfiles.
+That is, Git will store inside a regular Git object a patch used to derive the
+defined object. But it only does this in the context of packfiles.
Furthermore, the structure allows for the base object to itself be a deltified
object, thus, making it possible to only store one version of the full file,
but then derive all other versions from deltas or patches.
@@ -85,7 +96,7 @@ While it is entirely possible to use only the packfile itself to access the
contained objects, it's not very efficient for random access. Therefore, the
index file is created to maintain a way to peer into the packfile efficiently.
-### Indexes ###
+*** Indexes
Packfile indexes solve the random object access efficiency problems caused by
heavily compacting objects into a single file.
@@ -93,29 +104,29 @@ heavily compacting objects into a single file.
Although, the contents of the index are little more complicated than the pack
file.
-In version 1 of packfiles, the index does not have a header. In version 2,
-the current version, there are 8 bytes dedicated to the header: the first 4
-bytes will always be `255, 116, 79, 99`, because these are invalid bytes for
-the fanout table; the other 4 bytes of the header are dedicated to the version,
-currently, `2`.
+In version 1 of packfiles, the index does not have a header. In version 2, the
+current version, there are 8 bytes dedicated to the header: the first 4 bytes
+will always be ~255, 116, 79, 99~, because these are invalid bytes for the
+fanout table; the other 4 bytes of the header are dedicated to the version,
+currently, ~2~.
-Following the "$header", there is, what Git calls, a fanout table. This header
+Following the "$header", there is, what Git calls, a fanout table. This header
table consists of 256 4-byte integers, each entry of the table records the
number of objects whose first byte are less than or equal to this entry.
-That is, if the repository has 2 objects that start with `00`, there will be a
-2 in the `00`th entry of the table. Furthermore, if there are 3 objects that
-start with `01`, the `01`th entry will report _5_ objects. Remember, each entry
-in the table is the sum of all previous entries ("less than or equal to this
-entry"). Examining at the 256th entry would provide the total number of objects
-in the packfile.
+That is, if the repository has 2 objects that start with ~00~, there will be a
+2 in the ~00~th entry of the table. Furthermore, if there are 3 objects that
+start with ~01~, the ~01~th entry will report /5/ objects. Remember, each
+entry in the table is the sum of all previous entries ("less than or equal to
+this entry"). Examining at the 256th entry would provide the total number of
+objects in the packfile.
Following the fanout table is a sorted table of 20-byte SHA-1 hashes.
In version 2, there is another table following the sorted hashes that consists
-of 4-byte CRC32 values of the packed object data. This table enables easier
-copying of data between packfiles. For example, this improves the efficiency of
-creating new packfiles for new objects.
+of 4-byte CRC32 values of the packed object data. This table enables easier
+copying of data between packfiles. For example, this improves the efficiency
+of creating new packfiles for new objects.
Next, is another table of 4-byte offset values, usually packed into 31-bits,
larger offsets being encoded as offsets for indexes into the next table.
@@ -129,13 +140,14 @@ checksum of all of the above data.
All of these tables are used to make sure Git has very quick and efficient
access to objects in the repository.
-### Plumbing ###
+*** Plumbing
Git will automatically create packfiles when synchronizing a repository (e.g.,
pushing, pulling, cloning), but they can also be created manually with the
-[`git-gc`][7] command. Let's assume there are some loose objects in the current
-repository.
+[[git-gc][~git-gc~]] command. Let's assume there are some loose objects in the
+current repository.
+#+BEGIN_EXAMPLE
± find .git/objects -type f
.git/objects/f2/e90bed364168fcca0893437fb569d762cdbbce
.git/objects/f4/2946046ed0926d5c7b34772642478390a696c9
@@ -158,18 +170,20 @@ repository.
.git/objects/info/packs
.git/objects/pack/pack-1fc05518e49da3867792b704561b68d5b00e6317.idx
.git/objects/pack/pack-1fc05518e49da3867792b704561b68d5b00e6317.pack
+#+END_EXAMPLE
-We started with 11 objects, in the loose format, we ran [`git-gc`][7] and we
-are left with a packfile.
+We started with 11 objects, in the loose format, we ran [[git-gc][~git-gc~]]
+and we are left with a packfile.
-The output of [`git-gc`][7] tells us how many objects we packed, how many delta
-objects were used to create the pack, in this case, 0, and how many objects
-were copied from an existing pack and how many deltas from an existing pack,
-both 0 in this example.
+The output of [[git-gc][~git-gc~]] tells us how many objects we packed, how
+many delta objects were used to create the pack, in this case, 0, and how many
+objects were copied from an existing pack and how many deltas from an existing
+pack, both 0 in this example.
-Of course, we can also examine the packfile with the [`git-verify-pack`][8]
-command:
+Of course, we can also examine the packfile with the
+[[git-verify-pack][~git-verify-pack~]] command:
+#+BEGIN_EXAMPLE
± git verify-pack -v .git/objects/pack/pack-1fc05518e49da3867792b704561b68d5b00e6317.idx
f2e90bed364168fcca0893437fb569d762cdbbce commit 225 153 12
d12254d273712af99e0585e7dd9dfea2106d5692 commit 220 145 165
@@ -184,21 +198,25 @@ command:
4d5fcadc293a348e88f777dc0920f11e7d71441c tree 31 42 707
non delta: 11 objects
.git/objects/pack/pack-1fc05518e49da3867792b704561b68d5b00e6317.pack: ok
+#+END_EXAMPLE
-> It does not matter whether the `.pack` or `.idx` file are specified to the
-> [`git-verify-pack`][8] command, the output will be the same. However, tab
-> completion will prefer the `.idx` files.
+#+BEGIN_QUOTE
+ It does not matter whether the ~.pack~ or ~.idx~ file are specified to the
+ [[git-verify-pack][~git-verify-pack~]] command, the output will be the same.
+ However, tab completion will prefer the ~.idx~ files.
+#+END_QUOTE
This output has a lot of information to it: first, it tells us about all the
-objects in the packfile, we see our 11 original objects from before. But we are
-also given each object's type, size, size in pack, and offset into the
-packfile, respectively. For undeltified objects, these sizes won't be very
+objects in the packfile, we see our 11 original objects from before. But we
+are also given each object's type, size, size in pack, and offset into the
+packfile, respectively. For undeltified objects, these sizes won't be very
different, but for deltified objects, these two sizes can be significantly
different.
-This output also tells us the pack contains no deltified objects. Let's see
+This output also tells us the pack contains no deltified objects. Let's see
what this would look like with deltified objects:
+#+BEGIN_EXAMPLE
± git gc
Counting objects: 17, done.
Delta compression using up to 4 threads.
@@ -230,54 +248,38 @@ what this would look like with deltified objects:
.git/objects/info/packs
.git/objects/pack/pack-21f02890d9770ec6b5a566c3c82c03e69f530c19.idx
.git/objects/pack/pack-21f02890d9770ec6b5a566c3c82c03e69f530c19.pack
+#+END_EXAMPLE
Notice, we repacked the repository then listed the contents of the new pack,
also notice the old pack is gone, but the objects that were in the old pack are
still available in the new pack.
-More importantly, notice that `f42946` is a deltified object based on
-`7470c9c`. That is, the tree defined in `f42946` is derived by patching
-`7470c9c` with the contents of the object in the packfile. This is also evident
-in the size listings, the size on disk of the loose object is 25 bytes, but the
-size in the pack is 37. The increase in size is often, unfortunately, due to
-how text compression sometimes _doesn't_ work. This is the first look of what
-Git calls "chains".
-
-Chains are a simple way to describe the length of a deltified object set. The
-longest chain in this repository is only 1. But if we examine bigger
-repositories, this number could be much higher. Git itself, for example, has a
+More importantly, notice that ~f42946~ is a deltified object based on
+~7470c9c~. That is, the tree defined in ~f42946~ is derived by patching
+~7470c9c~ with the contents of the object in the packfile. This is also
+evident in the size listings, the size on disk of the loose object is 25 bytes,
+but the size in the pack is 37. The increase in size is often, unfortunately,
+due to how text compression sometimes /doesn't/ work. This is the first look
+of what Git calls "chains".
+
+Chains are a simple way to describe the length of a deltified object set. The
+longest chain in this repository is only 1. But if we examine bigger
+repositories, this number could be much higher. Git itself, for example, has a
chain length of 46 for one object, or another 6 objects with a chain length of
44 each.
Another thing to note, unlike the loose object format, it's much more difficult
-to get to the contents of the objects in the packfile _using_ only the packfile
-without some effort. However, `git-cat-file` and other plumbing commands will
+to get to the contents of the objects in the packfile /using/ only the packfile
+without some effort. However, ~git-cat-file~ and other plumbing commands will
still work as expected given an object name, even if the object is contained
within a packfile.
-## Summary ##
+** Summary
Hopefully, we now have a deeper knowledge of the compact object format Git
-uses, namely, packfiles. Remember, the motivation for these files was not
+uses, namely, packfiles. Remember, the motivation for these files was not
efficiency in storage, but efficiency in network bandwidth when transferring
-objects and lookup speed when there's a large number of loose objects. Thus, if
-working in stealth mode, it can be sometimes important to run [`git-gc`][7]
-occasionally to keep your private repository quick and efficient.
-
-[1]: https://git-scm.com/
-
-[2]: https://git-scm.com/book/en/v2/Git-Internals-Packfiles
-
-[3]: {{< relref "blog/git-in-reverse.markdown" >}}
-
-[4]: https://codewords.recurse.com/issues/three/unpacking-git-packfiles/
-
-[5]: https://git-scm.com/docs/git-verify-pack
-
-[6]: https://git.kernel.org/cgit/git/git.git/tree/Documentation/technical/pack-format.txt
-
-[7]: https://www.kernel.org/pub/software/scm/git/docs/git-gc.html
-
-[8]: https://www.kernel.org/pub/software/scm/git/docs/git-verify-pack.html
-
-[9]: http://stackoverflow.com/questions/5709687/what-are-the-loose-objects-that-the-git-gui-refers-to#5710039
+objects and lookup speed when there's a large number of loose objects. Thus,
+if working in stealth mode, it can be sometimes important to run
+[[git-gc][~git-gc~]] occasionally to keep your private repository quick and
+efficient.