Differences between revisions 2 and 3
Revision 2 as of 2014-12-18 22:16:05
Size: 5210
Editor: MichaelEdgar
Comment: Add Design Highlights, Unsupported Scenarios
Revision 3 as of 2014-12-19 18:14:54
Size: 6047
Editor: MichaelEdgar
Comment: Detail potential for corruption in "Background"; add compensatory "capping" mechanism to design highlights
File Censorship Plan

DVCS users occasionally commit and publish sensitive data like passwords, private keys, and personally identifying information. "Censorship" will remove the sensitive data so future clones receive tombstone data instead.

Non-goals include removing changesets due to sensitive commit messages, removing manifests due to sensitive file names, and proactively removing sensitive file data from existing clones.

1. Introduction

As mentioned above, private data such as passwords or private keys can be unwittingly committed to source control, as can legally sensitive data such as personally identifying information (PII). While one can (and should) change passwords that are published, legal obligations may require PII to be removed from the source control system so that it is no longer shared. Data and software licenses can also require such removal after the license expires.

In a DVCS like Mercurial, hashes demonstrate historical integrity by including parent hashes along with content (see MerkleTree). One can always rewrite each piece of history going back to the introduction of the sensitive data, but doing so changes the hash of every descendant. If enough published commits are based upon a commit containing sensitive file data, rewriting history may be prohibitively expensive. For example, the expiration of data/software licenses may require several years of history to be rewritten.
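
The cost of rewriting can be seen in a minimal sketch, assuming a revlog node hash is the SHA-1 of the two sorted parent hashes followed by the revision text. The function name, variable names, and file contents are illustrative only.

```python
import hashlib

NULLID = b"\0" * 20  # stands in for a missing parent


def node_hash(text, p1=NULLID, p2=NULLID):
    """SHA-1 over the two sorted parent hashes followed by the revision text."""
    lo, hi = sorted((p1, p2))
    h = hashlib.sha1(lo)
    h.update(hi)
    h.update(text)
    return h.digest()


# A linear three-revision history of one file (contents are made up).
r0 = node_hash(b"config\n")
r1 = node_hash(b"config\npassword=hunter2\n", r0)
r2 = node_hash(b"config\n", r1)

# Rewriting r1 without the secret gives it a new hash, and r2 must then be
# rewritten as well, even though r2's own text is unchanged.
r1_clean = node_hash(b"config\n# password removed\n", r0)
r2_rewritten = node_hash(b"config\n", r1_clean)
```

Because `r2` embeds the hash of `r1`, every descendant of the rewritten revision receives a new identity, which is what makes rewriting long-published history so costly.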

2. Background

If rewriting history is unpalatable, at present the owner of the repository must manually excise the data from the file's history and accept that the hash of that file revision will be unverifiable. If this is done blindly, any file revision stored as a "delta" based on the offending file data (directly or transitively through a chain of deltas) becomes unreadable once the base content is removed. No generally-available tools exist for performing this delicate surgery.

The repository owner could generally continue committing to the heads of the repository, but attempts to view the repository at any changeset containing the sensitive file data (for example with hg update, hg diff, or hg annotate) will fail due to the hash mismatch. "hg verify" will fail for the same reason. New clones of such a tainted repository will not receive the excised data and will inherit the limitations of the original repo. Existing clones which do not have a copy of the data will behave similarly.

Existing clones of the repository which include the offending data are unaffected by modifications to the original repository's history, since there is no general means through which the original could "reach out" and remove data from all clones. These existing clones will therefore remain fully functional. They will successfully interoperate with the original except when sending or receiving new revisions of the affected file, due to the use of deltas in revision exchange.

As seen with the original data removal, deltas require agreement on a file revision's content. Depending on the scenario, revisions might successfully transfer, abort transfer due to hash mismatches, or silently corrupt the receiving repository in the worst case. This last possibility stems from a "fast-path" optimization (http://selenic.com/hg/file/a4679a74df14/mercurial/revlog.py#l1241) possible when adding exchanged deltas to revlogs, and demonstrates that Mercurial itself must provide some native support to make removing file content generally safe in practice.
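
The following sketch illustrates why a delta only makes sense when both sides agree on the base. It uses a toy delta format of (start, end, replacement) hunks and a plain SHA-1 of the text; both are simplifying assumptions standing in for Mercurial's actual delta encoding and node hash, and the file contents are made up.

```python
import hashlib


def apply_delta(base, hunks):
    """Apply (start, end, replacement) hunks, given in ascending order."""
    out, pos = [], 0
    for start, end, data in hunks:
        out.append(base[pos:start])
        out.append(data)
        pos = end
    out.append(base[pos:])
    return b"".join(out)


sender_base = b"user=alice\npassword=hunter2\n"
receiver_base = b"user=alice\n"  # the sensitive line was excised by hand

# The sender appends one line, expressed as a delta against its own base.
hunks = [(len(sender_base), len(sender_base), b"port=8080\n")]
expected = apply_delta(sender_base, hunks)
expected_hash = hashlib.sha1(expected).digest()

# The receiver applies the same delta to a base with different content.
received = apply_delta(receiver_base, hunks)
hash_matches = hashlib.sha1(received).digest() == expected_hash
```

Here `hash_matches` is false, so a receiver that checks the hash would abort the transfer. A receiver that stores the delta without rebuilding and checking the text would instead keep a revision whose content differs from what the sender meant, with no error reported.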

3. Design Highlights

Individual file revisions may be censored. When a user requests a censored revision, it is presented as an empty file, provided it can be verified. In storage, a censored file revision holds non-empty data called a tombstone: metadata subject to verification, padded to match the size of the censored data.
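
As a rough illustration of the padding idea, the sketch below builds a tombstone of exactly the same length as the data it replaces. The layout (metadata first, NUL padding after) and the name `make_tombstone` are hypothetical; this plan does not specify the on-disk format.

```python
def make_tombstone(metadata, original_size, pad=b"\0"):
    """Return metadata padded to original_size bytes (hypothetical layout)."""
    if len(metadata) > original_size:
        raise ValueError("metadata does not fit in the censored revision's size")
    return metadata + pad * (original_size - len(metadata))


censored_data = b"password=hunter2\napi_key=0123456789abcdef\n"
tombstone = make_tombstone(b"censored: legal request", len(censored_data))
```

Keeping the length identical means byte offsets computed against the original text still fall inside the tombstone.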

Users may configure a verification policy based on the expected tombstone contents; for example, a policy using a shared GPG key could verify tombstones containing GPG signatures. The default policy will be "abort", which always fails verification; another built-in policy, "ignore", will always pass verification.
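
One way to picture the two built-in policies is as predicates over the tombstone. This is only a sketch: the names `POLICIES`, `read_censored`, and `CensoredRevisionError` are hypothetical and are not taken from Mercurial.

```python
class CensoredRevisionError(Exception):
    """Raised when a tombstone fails verification (hypothetical name)."""


def policy_abort(tombstone):
    # Default policy: verification always fails.
    return False


def policy_ignore(tombstone):
    # Built-in alternative: verification always passes.
    return True


POLICIES = {"abort": policy_abort, "ignore": policy_ignore}


def read_censored(tombstone, policy="abort"):
    """Return the text presented for a censored revision, or raise."""
    if POLICIES[policy](tombstone):
        return b""  # a verified censored revision reads as an empty file
    raise CensoredRevisionError("censored revision failed verification")
```

A GPG-based policy would be a third predicate that checks a signature carried in the tombstone metadata.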

Exchange risk is largely mitigated by a new rule enforced by any Mercurial which natively supports censorship: a delta based on a censored revision must trivially replace the entire base text. A conforming delta will apply correctly regardless of whether or not the base is censored, thanks to the tombstone's padding. This rule enables censor-aware Mercurial to emit valid deltas any client can use and reject deltas that it cannot itself use.
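
The rule can be illustrated with a toy delta format of (start, end, replacement) hunks, which is an assumption made for this sketch rather than Mercurial's real encoding. A delta that replaces the whole base yields the same text whether the base is the original or a same-length tombstone, while a partial delta does not.

```python
def apply_delta(base, hunks):
    """Apply (start, end, replacement) hunks, given in ascending order."""
    out, pos = [], 0
    for start, end, data in hunks:
        out.append(base[pos:start])
        out.append(data)
        pos = end
    out.append(base[pos:])
    return b"".join(out)


def is_full_replacement(hunks, base_len):
    """True if the delta is one hunk covering the entire base text."""
    return len(hunks) == 1 and hunks[0][0] == 0 and hunks[0][1] == base_len


original = b"password=hunter2\n"
tombstone = b"censored".ljust(len(original), b"\0")  # same length as original
new_text = b"password=<see vault>\n"

conforming = [(0, len(original), new_text)]  # replaces the whole base
partial = [(9, 16, b"<see vault>")]          # reuses bytes from the base

same_either_way = apply_delta(original, conforming) == apply_delta(tombstone, conforming)
partial_on_tombstone = apply_delta(tombstone, partial)
```

The conforming delta reconstructs `new_text` from either base, so a censor-unaware client can apply it safely. The partial delta reconstructs `new_text` only from the original and produces different bytes when applied to the tombstone, which is why a censor-aware Mercurial would reject it.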

An extra safeguard is introduced to the censorship operation, to reduce the impact of the revlog "fast-path" which skips verifying exchanged deltas. When a file revision is censored and is present in any topological heads, a new blank revision of the file is added to the filelog, capping the censored file node. Then, to each head which contains the newly-censored file node, we add a cap child changeset that modifies the file to use the new blank revision. This makes a censor-unaware Mercurial clone less likely to produce "fast-path" deltas that would corrupt a third censor-unaware clone.

4. Implementation Details

5. Unsupported Exchange Scenarios

There is at least one identified exchange involving old Mercurial clients which could result in repository corruption:

  1. Original repo R has C changesets. File F has N <= C revisions. R is maintained with censor-aware Mercurial.
  2. R is cloned by an old client, creating repo Y.
  3. Repo R censors file F at the Nth revision. This adds a "capstone" revision to F, N+1, linked to a new changeset C+1.
  4. R is cloned by an old client using "hg clone -r C", creating repo Z.
  5. If Z receives changes from Y or vice-versa, they might corrupt each other's filelog for F.

6. Testing Plan

7. Future Improvements




CensorPlan (last edited 2015-06-12 08:44:30 by Pierre-YvesDavid)