This page is primarily intended for Mercurial's developers.
Improving manifest compression
Ideas for reducing manifest overhead
1. Improving delta compression
One major factor in poor compression is large deltas caused by interleaving branches in history. This is partly addressed by GeneralDelta, but needs further work:
- better choice of delta parent
- tuning of window sizes
- separate read windows and decompression windows (ie "we can read 4x linearly, so long as the used hunks stay under 2x")
It is crucial that performance here be measured on a delta-by-delta basis: all generaldelta deltas must be smaller than the corresponding parentdelta. The current generaldelta implementation is known to be buggy here.
Experiments by cyanite in Feb 2013 have shown that aggressive generaldelta can increase compression by 10x or more.
2. Stem compression
Typically, the bulk of a manifest is full pathnames for files. These are generally highly redundant (and thus compress well) but could be compressed even better with the application of stem compression. Instead of a full filename, each entry consists of a leading byte indicating the number of identical bytes to copy from the previous filename, followed by the rest of the name.
So, for instance:
... tests/test-update-branches.t... tests/test-update-issue1456.t...
... <13>date-branches.t <18>issue1456.t
Experiments by mpm in Feb 2013 have shown that stem compression decreases the size of full manifest revisions by a good margin over gzip alone.
3. Shrinking individual deltas
Currently, a minimal delta entry (a single file change) has a form like this:
[4 bytes start] [4 bytes end] [4 bytes insert length] [N bytes filename] [1 null byte] [40 bytes manifest node] [1 optional flag byte] [1 newline byte]
This gives us on the order of 120 bytes for a typical minimum delta (54 bytes base + 66 byte filename).
3.1. Delta-friendly line breaks
The bdiff delta format is based on line breaks. Unfortunately, manifest entries include the filename and hash on the same line. This means a minimal delta will always include the file name, even though it hasn't changed. By putting the filename and node on separate lines, we can cut the delta size in half for typical entries, down to 53 bytes.
3.2. Binary node strings
After that, the bulk of the remaining size is the 40-byte hex representation of the manifest node. Switching this to binary gets us to 34 bytes. Because this field is fixed-width, the fact that it might now contain its own newline bytes is not a problem.
3.3. More compact delta chunks
As most deltas don't need 32 bits, we can use a more compact delta format. For instance:
[3 bytes start, high bit set] [1 byte delete length] [1 byte insert length]
This is sufficient for a small number of lines changed in files up to 8MB, which is sufficient for the vast majority of deltas. This reduces our minimal manifest delta to 26 bytes.
3.4. Don't replace newlines
Deltas currently replace whole lines. A minor modification would allow them to remove matching leading and trailing bytes from the delta itself. As just about every delta will have a match on the trailing newline, this will save us an additional byte, giving us a delta size of 25 bytes.
This is a cumulative 5x improvement over the existing minimal delta and is approaching optimal. Together with improved generaldelta and stem compression, these changes could result in 50-100x reduction in overall manifest size.
4. Implementation strategy
Generaldelta requires a new bundle format to be a win, so work on BundleFormat2 is needed. On-the-fly conversion for interop with old clients will continue to be needed.
4.2. New delta chunk format
This will need a new repository requirement flag. Interop with old clients can be done efficiently with on-the-fly conversion, but will also benefit from a new bundle format.
4.3. New manifest format
This will also need a new repository requirement flag, possibly the same one. On-the-fly conversion for old clients may be more expensive. Some of the work has already been done - see tinymanifest extension.