path: root/fast-import.c
Commit message (Author, Date)
...
* Correct compiler warnings in fast-import. (Shawn O. Pearce, 2007-02-06)
  Junio noticed these warnings/errors in fast-import when compiling with `-Werror -ansi -pedantic`. A few changes are to reduce compiler warnings, while one (in cmd_merge) is a bug fix.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* Remove --branch-log from fast-import. (Shawn O. Pearce, 2007-02-06)
  The --branch-log option and its associated code haven't been used in several months, as it's not really very useful for debugging fast-import or a frontend. I don't plan on supporting it in this state long-term, so I'm killing it now before it gets distributed to a wider audience.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* Don't support shell-quoted refnames in fast-import. (Shawn O. Pearce, 2007-02-05)
  The current implementation of shell-style quoted refnames and SHA-1 expressions within fast-import contains a bad memory leak. We leak the unquoted strings used by the `from` and `merge` commands, maybe others. It's also just muddling up the docs.

  Since Git refnames cannot contain LF, and that is our delimiter for the end of the refname, and we accept any other character as-is, there is no reason for these strings to support quoting, except to be nice to frontends. But frontends shouldn't be expecting to use funny refs in Git, and it's just as simple to never quote them as it is to always pass them through the same quoting filter as pathnames. So frontends should never quote refs, or ref expressions.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* Reduce memory usage of fast-import. (Shawn O. Pearce, 2007-02-05)
  Some structs are allocated rather frequently, but were using integer types which were far larger than required to actually store their full value range.

  As packfiles are limited to 4 GiB we don't need more than 32 bits to store the offset of an object within that packfile, whereas an `unsigned long` on a 64 bit system is likely a 64 bit unsigned value. Saving 4 bytes per object on a 64 bit system can add up fast on any sizable import.

  As atom strings are strictly single components in a path name these are probably limited to just 255 bytes by the underlying OS. Going to that short of a string is probably too restrictive, but certainly `unsigned int` is far too large for their lengths. `unsigned short` is a reasonable limit.

  Modes within a tree really only need two bytes to store their whole value; using `unsigned int` here is vast overkill. Saving 4 bytes per file entry in an active branch can add up quickly on a project with a large number of files.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
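The per-object saving described above can be sketched as follows. This is only an illustration: `WideEntry`, `NarrowEntry`, and `bytes_saved` are hypothetical names, and the real struct in fast-import.c has more fields than the single offset modelled here.

```python
import ctypes


class WideEntry(ctypes.Structure):
    # Hypothetical "before" layout: a 64-bit offset, which is what
    # `unsigned long` typically is on a 64 bit system.
    _fields_ = [("offset", ctypes.c_uint64)]


class NarrowEntry(ctypes.Structure):
    # Hypothetical "after" layout: 32 bits cover any offset below 4 GiB.
    _fields_ = [("offset", ctypes.c_uint32)]


def bytes_saved(object_count):
    """Total bytes saved across object_count entries by narrowing the offset."""
    per_object = ctypes.sizeof(WideEntry) - ctypes.sizeof(NarrowEntry)
    return per_object * object_count
```

Under these assumptions a one-million-object import saves roughly 4 MB for this field alone.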
* Include checkpoint command in the BNF. (Shawn O. Pearce, 2007-02-05)
  This command isn't encouraged (as it's slow) but it does exist and is accepted, so it still should be covered in the BNF.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* Merge branch 'master' into sp/gfi (Shawn O. Pearce, 2007-01-30)
  git-fast-import requires use of inttypes.h, but the master branch has added it to git-compat-util differently than git-fast-import originally had used it. This merge back of master to the fast-import topic is to get (and use) inttypes.h the way master is using it.

  This is a partially evil merge to remove the call to setup_ident(), as the master branch now contains a change which makes this implicit and therefore removed the function declaration. (commit 01754769).

  Conflicts: git-compat-util.h
* Accept 'inline' file data in fast-import commit structure. (Shawn O. Pearce, 2007-01-18)
  It's very annoying to need to specify the file content ahead of a commit and use marks to connect the individual blobs to the commit's file modification entry, especially if the frontend can't/won't generate the blob SHA1s itself. Instead it would be much easier to use if we can accept the blob data at the same time as we receive each file_change line.

  Now fast-import accepts 'inline' instead of a mark idnum or blob SHA1 within the 'M' type file_change command. If an inline is detected the very next line must be a 'data n' command, supplying the file data.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
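A frontend using the 'inline' form might build its stream roughly like this. This is a minimal sketch, not taken from any real frontend: the function names are hypothetical, the `100644` mode and the committer line layout follow the fast-import input format as currently documented, and the LF emitted after each data payload is assumed to be accepted by the reader.

```python
def data_command(payload: bytes) -> bytes:
    """Exact byte count form: 'data n' LF, then n raw bytes, then an LF."""
    return b"data %d\n" % len(payload) + payload + b"\n"


def commit_with_inline_file(ref, committer, when, message, path, content):
    """Build one commit whose single file is supplied inline (all args are bytes)."""
    parts = [
        b"commit " + ref + b"\n",
        b"committer " + committer + b" " + when + b"\n",
        data_command(message),
        # 'inline' replaces the mark idnum or blob SHA1 ...
        b"M 100644 inline " + path + b"\n",
        # ... so the very next line must be the 'data n' command.
        data_command(content),
        b"\n",
    ]
    return b"".join(parts)
```

No separate blob command and no mark is needed; the file content travels with the file_change line.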
* Support delimited data regions in fast-import. (Shawn O. Pearce, 2007-01-18)
  During testing it's nice to not have to feed the length of a data chunk to the 'data' command of fast-import. Instead we would prefer to be able to establish a data chunk much like shell's << operator and use a line delimiter to denote the end of the input.

  So now if a data command is started as 'data <<EOF' we will look for a terminator line containing only the string EOF on that line. Once found, we stop the data command. Everything between the two lines is used as the data value.

  The 'data <<' syntax is slower than 'data n', as we don't know how many bytes to expect and instead must grow our buffer on the fly. It also has the problem that the frontend must use a string which will not appear on a line by itself in the input, and the data region will always end in an LF. For these reasons real import frontends are encouraged to continue to use _only_ 'data n'.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
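The two restrictions on the delimited form (the delimiter must not appear on a line by itself, and the region always ends in LF) can be seen in a small sketch. The helper names are hypothetical and the reader is a simplified model of the behaviour described above, not the C implementation.

```python
def data_delimited(payload: bytes, delim: bytes = b"EOF") -> bytes:
    """Emit payload in 'data <<DELIM' form."""
    if not payload.endswith(b"\n"):
        payload += b"\n"  # the data region always ends in an LF
    if delim in payload.split(b"\n"):
        raise ValueError("delimiter appears on a line by itself in the payload")
    return b"data <<" + delim + b"\n" + payload + delim + b"\n"


def read_delimited(text: bytes) -> bytes:
    """Return everything between the 'data <<DELIM' line and the terminator line."""
    header, _, rest = text.partition(b"\n")
    if not header.startswith(b"data <<"):
        raise ValueError("not a delimited data command")
    delim = header[len(b"data <<"):]
    out = bytearray()  # grows on the fly, since the length is not known up front
    for line in rest.splitlines(keepends=True):
        if line.rstrip(b"\n") == delim:
            return bytes(out)
        out += line
    raise ValueError("terminator line not found")
```

Note that a payload of `hello` comes back as `hello` plus LF, which is one reason the exact 'data n' form is preferred for real imports.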
* Remove unnecessary options from fast-import. (Shawn O. Pearce, 2007-01-18)
  The --objects command line option is rather unnecessary. Internally we allocate objects in 5000 unit blocks, ensuring that any sort of malloc overhead is amortized over the individual objects to almost nothing. Since most frontends don't know how many objects they will need for a given import run (and it's hard for them to predict without just doing the run) we probably won't see anyone using --objects. Further since there's really no major benefit to using the option, most frontends won't even bother supplying it even if they could estimate the number of objects. So I'm removing it.

  The --max-objects-per-pack option was probably a mistake to even have added in the first place. The packfile format is limited to 4 GiB today; given that objects need at least 3 bytes of data (and probably need even more) there's no way we are going to exceed the limit of (1<<32)-1 objects before we reach the file size limit. So I'm removing it (to slightly reduce the complexity of the code) before anyone gets any wise ideas and tries to use it.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
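The arithmetic behind dropping --max-objects-per-pack can be checked directly. The constant names below are illustrative only; the 3-byte minimum object size is the figure given in the message above.

```python
PACK_LIMIT_BYTES = 1 << 32         # 4 GiB packfile size limit
MIN_OBJECT_BYTES = 3               # smallest possible object, per the message
MAX_OBJECT_COUNT = (1 << 32) - 1   # parenthesised: shift binds looser than minus

# Most objects a single packfile could hold before hitting the size limit.
max_objects_by_size = PACK_LIMIT_BYTES // MIN_OBJECT_BYTES

# The size limit is always reached first, so the count limit is never the binding one.
size_limit_hits_first = max_objects_by_size < MAX_OBJECT_COUNT
```

The parentheses matter: written without them, `1 << 32 - 1` evaluates as `1 << 31` in both C and Python.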
* Use fixed-size integers when writing out the index in fast-import. (Shawn O. Pearce, 2007-01-18)
  Currently the pack .idx file format uses 32-bit unsigned integers for the fan-out table and the object offsets. We had previously defined these as 'unsigned int', but not every system will define that type to be a 32 bit value. To ensure maximum portability we should always use 'uint32_t'.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
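To illustrate why the width must be fixed, here is a sketch of writing a fan-out table with explicit 32-bit big-endian integers. It assumes the original (version 1) .idx layout of 256 cumulative counts keyed by the first SHA-1 byte; `fanout_table` is a hypothetical helper, not a function from Git.

```python
import struct


def fanout_table(sha1s):
    """Return 256 cumulative counts, each packed as a 4-byte big-endian integer."""
    counts = [0] * 256
    for sha in sha1s:
        counts[sha[0]] += 1        # bucket by the first byte of the SHA-1
    out = bytearray()
    total = 0
    for count in counts:
        total += count             # entry i = number of objects with first byte <= i
        out += struct.pack(">I", total)
    return bytes(out)
```

`struct.pack(">I", ...)` always emits exactly four bytes regardless of the host platform, which is the same guarantee `uint32_t` gives the C code.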
* Always use struct pack_header for pack header in fast-import. (Shawn O. Pearce, 2007-01-18)
  Previously we were using 'unsigned int' to update the hdr_entries field of the pack header after the file had been completed and was being hashed. This may not be 32 bits on all platforms. Instead we want to always use uint32_t.

  I'm actually cheating here by just using the pack_header like the rest of Git and letting the struct definition declare the correct type. Right now that field is still 'unsigned int' (wrong) but a pending change submitted by Simon 'corecode' Schubert changes it to uint32_t. After that change is merged in fast-import will do the right thing all of the time.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* Correct packfile edge output in fast-import. (Shawn O. Pearce, 2007-01-17)
  Branches are only contained by a packfile if the branch actually had its most recent commit in that packfile. So new branches are set to MAX_PACK_ID to ensure they don't cause their commit to list as part of the first packfile when it closes out if the commit was actually in existence before fast-import started.

  Also corrected the type of last_commit to be uintmax_t to prevent overflow and wraparound on very large imports. Though that is highly unlikely to occur as we're talking 4 billion commits, which no real project has right now.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* Declare no-arg functions as (void) in fast-import. (Shawn O. Pearce, 2007-01-17)
  Apparently the git convention is to declare any function which takes no arguments as taking void. I did not do this during the early fast-import development, but should have.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* Correct a few types to be unsigned in fast-import. (Shawn O. Pearce, 2007-01-17)
  The length of an atom string cannot be negative. So make it explicit and declare it as an unsigned value.

  The shift width in a mark table node also cannot be negative. I'm also moving it to after the pointer arrays to prevent any possible alignment problems on a 64 bit system.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* Corrected BNF input documentation for fast-import. (Shawn O. Pearce, 2007-01-17)
  Now that fast-import uses uintmax_t (the largest available unsigned integer type) for marks we don't want to say it's an unsigned 32 bit integer in ASCII base 10 notation. It could be much larger, especially on 64 bit systems, and especially if a frontend uses a very large number of marks (1 per file revision on a very, very large import).
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* Print out the edge commits for each packfile in fast-import. (Shawn O. Pearce, 2007-01-16)
  To help callers repack very large repositories into a series of packfiles fast-import now outputs the last commits/tags it wrote to a packfile when it prints out the packfile name. This information can be fed to pack-objects --revs to repack.

  For the first pack of an initial import this is pretty easy (just feed those SHA1s on stdin) but for subsequent packs you want to feed the subsequent pack's final SHA1s but also all prior packs' SHA1s prefixed with the negation operator. This way the prior pack's data does not get included into the subsequent pack.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
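The stdin a caller would assemble for `pack-objects --revs` can be sketched like this. `revs_input` is a hypothetical helper; it assumes the caller has already collected the edge SHA1s that fast-import printed for each packfile, and it uses `^` as the negation prefix.

```python
def revs_input(current_edges, prior_edges=()):
    """Build the text to feed on stdin when repacking one packfile.

    current_edges: edge SHA1s printed for the pack being repacked.
    prior_edges:   edge SHA1s of all earlier packs, to be excluded.
    """
    lines = list(current_edges)
    lines += ["^" + sha for sha in prior_edges]  # negate everything already packed
    return "\n".join(lines) + "\n"
```

For the first pack `prior_edges` is empty, which matches the "pretty easy" case described above.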
* Correct object_count type and stat output in fast-import. (Shawn O. Pearce, 2007-01-16)
  Since object_count is limited to 'unsigned long' (really an unsigned 32 bit integer value) by the pack file format we may as well use exactly that type here in fast-import for that counter. An earlier change by me incorrectly made it uintmax_t.

  But since object_count is a counter for the current packfile only, we don't want to output its value at the end. Instead we should sum up the individual type counters and report that total, as that will cover all of the packfiles.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* Correct max_packsize default in fast-import. (Shawn O. Pearce, 2007-01-16)
  Apparently amd64 has defined 'unsigned long' to be a 64 bit value, which means -1 was way over the 4 GiB packfile limit. Whoops.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
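The wraparound behind this bug is easy to reproduce with fixed-width types. The variable names are illustrative; `ctypes` integer types perform no overflow checking, so assigning -1 wraps the same way the C initialiser did.

```python
import ctypes

FOUR_GIB = 1 << 32

# -1 stored in a 32-bit unsigned value: the largest size a packfile can have.
as_32_bit = ctypes.c_uint32(-1).value

# -1 stored in a 64-bit unsigned value, as `unsigned long` is on amd64:
# far beyond the 4 GiB packfile limit.
as_64_bit = ctypes.c_uint64(-1).value
```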
* Remove unnecessary pack_fd global in fast-import. (Shawn O. Pearce, 2007-01-16)
  Much like the pack_sha1 the pack_fd is an unnecessary global variable; we already have the fd stored in our struct packed_git *pack_data so that the core library functions in sha1_file.c are able to look up and decompress object data that we have previously written. Keeping an extra copy of this value in our own variable is just a hold-over from earlier versions of fast-import and is now completely unnecessary.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* Ensure we close the packfile after creating it in fast-import. (Shawn O. Pearce, 2007-01-16)
  Because we are renaming the packfile into its file destination we need to be sure it's not open when the rename is called, otherwise some operating systems (e.g. Windows) may prevent the rename from occurring.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* Use .keep files in fast-import during processing. (Shawn O. Pearce, 2007-01-16)
  Because fast-import automatically updates all references (heads and tags) at the end of its run the repository is corrupt unless the objects are available in the .git/objects/pack directory prior to the refs being modified. The easiest way to ensure that is true is to move the packfile and its associated index directly into the .git/objects/pack directory as soon as we have finished output to it.

  But the only safe way to do this is to create a temporary .keep file for that pack, so we use the same tricks that index-pack uses when it's being invoked by receive-pack.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* Reuse sha1 in packed_git in fast-import. (Shawn O. Pearce, 2007-01-16)
  Rather than maintaining our own packfile level sha1 variable we can make use of the one already available in struct packed_git. It's meant for the SHA1 of the index but it can also hold the SHA1 of the packfile itself between final checksumming of the packfile and creation of the index.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* Replace redundant yread() with read_in_full() in fast-import. (Shawn O. Pearce, 2007-01-16)
  Prior to git having read_in_full() fast-import used its own private function yread to perform the header reading task. No sense in keeping that around now that read_in_full is a public, stable function.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* Use uintmax_t for marks in fast-import. (Shawn O. Pearce, 2007-01-16)
  If a frontend wants to use a mark per file revision and per commit and is doing a truly huge import (such as a 32 GiB SVN repository) we may need more than 2**32 unique mark values, especially if the frontend is unable (or unwilling) to recycle mark values.

  For mark idnums we should use the largest unsigned integer type available, hoping that will be at least 64 bits when we are compiled as a 64 bit executable. This way we may consume huge amounts of memory storing our mark table, but we'll at least be able to process the entire import without failing.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* Corrected buffer overflow during automatic checkpoint in fast-import. (Shawn O. Pearce, 2007-01-15)
  If we previously were using a delta but we needed to checkpoint the current packfile and switch to a new packfile we need to throw away the delta and compress the raw object by itself, as delta chains cannot span non-thin packfiles. Unfortunately the output buffer in this case needs to grow, as the size of the compressed object may be quite a bit larger than the size of the compressed delta.

  I've also avoided recompressing the object if we are checkpointing and we didn't use a delta. In this case the output buffer is the correct size and has already been populated with the right data; we just need to close out the current packfile and open a new one.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* Print the packfile names to stdout from fast-import. (Shawn O. Pearce, 2007-01-15)
  Caller scripts may want to know what packfiles the fast-import process just wrote out for them. This is now output to stdout, one packfile name per line, after we checkpoint each packfile.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* Implemented automatic checkpoints within fast-import. (Shawn O. Pearce, 2007-01-15)
  When the number of objects or number of bytes gets close to the limit allowed by the packfile format (or configured on the command line by our caller) we should automatically checkpoint the current packfile and start a new one before writing the object out. This does however require that we abandon the delta (if we had one) as it's not valid in a new packfile.

  I also added the simple rule that if we got a delta back but the delta itself is the same size as or larger than the uncompressed object, we ignore the delta and just store the object data. This should avoid some really bad behavior caused by our current delta strategy.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
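The two rules above can be modelled as a small decision function. This is a simplified sketch under stated assumptions: `plan_write` and its parameters are hypothetical, compression is ignored, and the exact thresholds for "close to the limit" in fast-import.c may differ.

```python
def plan_write(object_size, delta_size, pack_bytes, pack_objects,
               max_bytes, max_objects):
    """Return (checkpoint_first, use_delta) for the next object.

    delta_size is None when no delta was produced.
    """
    # Rule 2: a delta that is no smaller than the object itself is ignored.
    use_delta = delta_size is not None and delta_size < object_size
    stored_size = delta_size if use_delta else object_size

    # Rule 1: switch packfiles before writing if either limit would be crossed.
    checkpoint_first = (pack_objects + 1 > max_objects
                        or pack_bytes + stored_size > max_bytes)

    # A delta base lives in the old packfile, so the delta is abandoned.
    if checkpoint_first:
        use_delta = False
    return checkpoint_first, use_delta
```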
* Optimize index creation on large object sets in fast-import. (Shawn O. Pearce, 2007-01-15)
  When we are generating multiple packfiles at once we only need to scan the blocks of object_entry structs which contain objects for the current packfile. Because the most recent blocks are at the front of the linked list, and because all new objects going into the current file are allocated from the front of that list, we can stop scanning for objects as soon as we identify one which doesn't belong to the current packfile.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
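The early exit can be sketched with plain lists standing in for the linked blocks. `objects_for_pack` is a hypothetical helper, and it assumes entries are visited newest first both across blocks and within a block, which is a simplification of the C data structure.

```python
def objects_for_pack(blocks, pack_id):
    """Collect entries for pack_id, stopping at the first entry from an older pack.

    blocks: list of blocks, newest block first; each block is a list of
            dicts with a "pack_id" key, newest entry first (assumed order).
    """
    found = []
    for block in blocks:
        for entry in block:
            if entry["pack_id"] != pack_id:
                return found  # everything further back belongs to older packs
            found.append(entry)
    return found
```

With many closed packfiles behind the current one, the scan touches only the entries that will actually appear in the new index.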
* Don't create a final empty packfile in fast-import. (Shawn O. Pearce, 2007-01-15)
  If the last packfile is going to be empty (has 0 objects) then it shouldn't be kept after the import has terminated, as there is no point to the packfile. So rather than hashing it and making the index file, just delete the packfile.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* Implemented manual packfile switching in fast-import. (Shawn O. Pearce, 2007-01-15)
  To help importers which are dealing with massive amounts of data fast-import needs to be able to close the packfile it is currently writing to and open a new packfile for any additional data that will be received.

  A new 'checkpoint' command has been introduced which can be used by the frontend import process to force this to occur at any time. This may be useful to ensure a very long running import doesn't lose any work due to unexpected failures.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* Remove unnecessary duplicate_count in fast-import. (Shawn O. Pearce, 2007-01-15)
  There is little reason to be keeping a global duplicate_count value when we also keep it per object type. The global counter can easily be computed at the end, once all processing has completed. This saves us a couple of machine instructions in an unimportant part of code. But it looks slightly better to me to not keep two counters around.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* Restructure fast-import to support creating multiple packfiles. (Shawn O. Pearce, 2007-01-15)
  Now that we are starting to see some really large projects (such as KDE or a fork of FreeBSD) get imported into Git we're running into the upper limit on packfile object count as well as overall byte length. The KDE and FreeBSD projects are both likely to require more than 4 GiB to store their current history, which means we really need multiple packfiles to handle their content.

  This is a fairly simple restructuring of the internal code to help us support creating multiple packfiles from within fast-import. We are now adding a 5 digit incrementing suffix to the end of the basename supplied to us by the caller, permitting up to 99,999 packs to be generated in a single fast-import run.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* Misc. type cleanups within fast-import. (Shawn O. Pearce, 2007-01-15)
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* Improve reuse of sha1_file library within fast-import. (Shawn O. Pearce, 2007-01-14)
  Now that the sha1_file.c library routines use the sliding mmap routines to perform efficient access to portions of a packfile I can remove that code from fast-import.c and just invoke it. One benefit is we now have reloading support for any packfile which uses OBJ_OFS_DELTA. Another is we have significantly less code to maintain.

  This code reuse change *requires* that fast-import generate only an OBJ_OFS_DELTA format packfile, as there is absolutely no index available to perform OBJ_REF_DELTA lookup in while unpacking an object. This is probably reasonable to require as the delta offsets result in smaller packfiles and are faster to unpack, as no index searching is required. It's also only a temporary requirement as users could always repack without offsets before making the import available to older versions of Git.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* Merge branch 'master' into sp/fast-import (Shawn O. Pearce, 2007-01-14)
  I'm bringing master in early so that the OBJ_OFS_DELTA implementation is available as part of the topic. This way git-fast-import can learn about this new slightly smaller and faster packfile format, and can generate them directly rather than needing to have them be repacked with git-pack-objects.

  Due to the API changes in master during the period of development of git-fast-import, a few minor tweaks to fast-import.c are needed to produce a working merge. I've done them here as part of the merge to ensure bisection always works.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* Allow creating branches without committing in fast-import. (Shawn O. Pearce, 2007-01-14)
  Some importers may want to create a branch long before they actually commit to it, or in some cases they may never commit to the branch but they still need the ref to be created in the repository after the import is complete.

  This extends the 'reset ' command to automatically create a new branch if the supplied reference isn't already known as a branch. While I'm at it I also modified the syntax of the reset command to terminate with an empty line, like commit and tag operate. This just makes the command set more consistent.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* Support creation of merge commits in fast-import. (Shawn O. Pearce, 2007-01-14)
  Some importers are able to determine when branch merges occurred within their source data. In these cases they will want to supply the correct commits to fast-import so that a proper merge commit will exist in Git.

  This is now supported by supplying a 'merge ' command after the commit message and optional from command. A merge is not actually performed by fast-import; it's assumed that the frontend performed any sort of merging activity already and that fast-import should simply be storing its result.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
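The command ordering described above (commit message, then the optional from, then one merge line per additional parent) might be emitted like this. It is a sketch only: `merge_commit_stream` is a hypothetical helper, parents are given here as marks such as `:1`, and the LF after the data payload is assumed to be accepted by the reader.

```python
def data_command(payload: bytes) -> bytes:
    """Exact byte count form: 'data n' LF, then n raw bytes, then an LF."""
    return b"data %d\n" % len(payload) + payload + b"\n"


def merge_commit_stream(ref, committer, when, message, first_parent, merged):
    """Build a merge commit header; file changes would follow the merge lines."""
    parts = [
        b"commit " + ref + b"\n",
        b"committer " + committer + b" " + when + b"\n",
        data_command(message),
        b"from " + first_parent + b"\n",
    ]
    # One 'merge' line per extra parent; fast-import records them, it does not merge.
    parts += [b"merge " + parent + b"\n" for parent in merged]
    parts.append(b"\n")
    return b"".join(parts)
```

The frontend remains responsible for supplying the merged tree through the file_change lines that follow.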
* Fix repository corruption when using marks for modified blobs. (Shawn O. Pearce, 2007-01-14)
  Apparently we did not copy the blob SHA1 into the stack variable 'sha1' when a mark is used to refer to a prior blob. This code was not previously tested as the Mozilla CVS -> git-fast-import program always fed us full SHA1s for modified blobs and did not use the mark feature there.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* Additional fast-import tree delta corruption cleanups. (Shawn O. Pearce, 2007-01-14)
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* Correct tree corruption problems in fast-import. (Shawn O. Pearce, 2007-01-14)
  The new tree delta implementation caused blob SHA1s to be used instead of a tree SHA1 when a tree was written out. This really only appeared to happen when converting an existing file to a tree, but may have been possible in some other situations.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* Replace ywrite in fast-import with the standard write_or_die. (Shawn O. Pearce, 2007-01-14)
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* Reuse the same buffer for all commits/tags in fast-import. (Shawn O. Pearce, 2007-01-14)
  Since most commits and tag objects are around the same size and we only generate one at a time we can reuse the same buffer rather than xmalloc'ing and free'ing the buffer every time we generate a commit.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* Recycle data buffers for tree generation in fast-import. (Shawn O. Pearce, 2007-01-14)
  We only ever generate at most two tree streams at a time. Since most trees are around the same size we can simply recycle the buffers from one tree generation to the next rather than constantly xmalloc'ing and free'ing them. This should perform slightly better when handling a large number of trees as malloc has less work to do.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* Implemented tree delta compression in fast-import. (Shawn O. Pearce, 2007-01-14)
  We now store for every tree entry two modes and two sha1 values; the base (aka "version 0") and the current/new (aka "version 1"). When we generate a tree object we also regenerate the prior version object and use that as our base object for a delta. This strategy saves a significant amount of memory as we can continue to use the atom pool for file/directory names and only increases each tree entry by an additional 24 bytes of memory.

  Branches should automatically delta against their ancestor tree, unless the ancestor tree is already at the delta chain limit.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
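One plausible reading of the 24-byte figure is a second copy of the mode plus a second SHA-1 per entry. The check below assumes a 4-byte `unsigned int` mode and a 20-byte SHA-1, with no padding; the actual struct layout in fast-import.c is not shown in this log.

```python
import struct

# "=" selects standard sizes with no alignment padding:
# I = 4-byte unsigned int (mode), 20s = 20 raw bytes (SHA-1).
ONE_VERSION = struct.calcsize("=I20s")

# Keeping both "version 0" and "version 1" doubles the per-entry version data.
TWO_VERSIONS = 2 * ONE_VERSION
EXTRA_BYTES_PER_ENTRY = TWO_VERSIONS - ONE_VERSION
```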
* Converted hash memcpy/memcmp to new hashcpy/hashcmp/hashclr. (Shawn O. Pearce, 2007-01-14)
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* Don't crash fast-import if no branch log was requested. (Shawn O. Pearce, 2007-01-14)
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* Added 'reset' command to clear a branch's tree. (Shawn O. Pearce, 2007-01-14)
  Sometimes an import frontend may need to work with a temporary branch which will actually contain many different branches over the life of the import. This is especially useful when the frontend needs to create a tag from a set of file versions which are otherwise never a commit.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* Map only part of the generated pack file at any point in time. (Shawn O. Pearce, 2007-01-14)
  When generating a very large pack file (for example close to 1 GB in size) it may be impossible for the kernel to find a contiguous free range within a 32 bit address space for the mapping to be located at. This is especially problematic on large imports where there is a lot of malloc activity occurring within the same process and the malloc'd regions may straddle the previously mapped regions, thereby creating large holes in the address space.

  So instead we map only 128 MB of the pack at any given time. This will likely increase the number of times the file gets mapped (with additional system time required to update the page tables more frequently) but will allow the program to handle packs up to 4 GB in size.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
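The windowing idea can be illustrated with offset arithmetic alone. This sketch assumes windows aligned to multiples of the window size, which may not match how the C code positions its mapping; `window_for` is a hypothetical helper.

```python
WINDOW_BYTES = 128 * 1024 * 1024  # map only 128 MB of the pack at a time


def window_for(offset, file_size, window=WINDOW_BYTES):
    """Return (start, length) of the window that contains byte `offset`."""
    if not 0 <= offset < file_size:
        raise ValueError("offset outside the file")
    start = (offset // window) * window       # round down to a window boundary
    length = min(window, file_size - start)   # the last window may be shorter
    return start, length
```

Each access needs only one 128 MB contiguous range in the address space, however large the pack has grown.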
* Fixed compile error in fast-import. (Shawn O. Pearce, 2007-01-14)
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
* Fixed GPF in fast-import caused by unterminated linked list. (Shawn O. Pearce, 2007-01-14)
  fast-import was encountering a GPF when it ran out of free tree_entry objects but didn't know this was the cause because the last tree_entry wasn't terminated with a NULL pointer. The missing NULL pointer occurred when we allocated additional entries via xmalloc but didn't set the last tree_entry's "next" pointer to NULL.
  Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
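The shape of the fix can be modelled with a small free list. This is an illustration with hypothetical names; in C the memory returned by xmalloc is uninitialised, so the final `next` must be set explicitly, whereas Python would default it to `None` anyway.

```python
class TreeEntry:
    """Stand-in for a free tree_entry; only the link field is modelled."""
    __slots__ = ("next",)

    def __init__(self):
        self.next = None


def alloc_block(count):
    """Allocate `count` entries chained into a free list and return its head."""
    entries = [TreeEntry() for _ in range(count)]
    for current, following in zip(entries, entries[1:]):
        current.next = following
    entries[-1].next = None  # the terminator the fix adds: marks "out of entries"
    return entries[0]


def free_list_length(head):
    """Walk the list until the terminator and count the entries."""
    length = 0
    while head is not None:
        length += 1
        head = head.next
    return length
```

Without the terminator the walk would run off the end of the block, which is the crash the message describes.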