File Censorship Plan

DVCS users occasionally commit and publish sensitive data like passwords, private keys, and personally identifying information. "Censorship" will remove the sensitive data so future clones receive tombstone data instead.

1. Introduction

As mentioned above, private data such as passwords or private keys can be unwittingly committed to source control, as can legally sensitive data such as personally identifying information (PII). While one can (and should) change passwords that are published, legal requirements can require PII to be removed from the source control system so it will no longer be shared. Data and software licenses can also require such removal after the license expires.

In a DVCS like Mercurial, hashes demonstrate historical integrity by including parent hashes along with content (see MerkleTree). One can always rewrite each piece of history going back to the introduction of the sensitive data, but doing so changes the hash of every rewritten commit and of all of its descendants. If enough published commits are based upon a commit containing sensitive file data, rewriting history may be prohibitively expensive. For example, the expiration of data/software licenses may require several years of history to be rewritten.
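To make the cost of rewriting concrete, here is a minimal sketch of a parent-linked hash chain. It is modeled on the sorted-parents-plus-content SHA-1 scheme that Mercurial's revlog uses, but it is heavily simplified; the names `node_hash` and `build_chain` and the sample file contents are illustrative assumptions, not Mercurial APIs.

```python
import hashlib

NULL_ID = b"\0" * 20  # stand-in for "no parent"


def node_hash(text: bytes, p1: bytes = NULL_ID, p2: bytes = NULL_ID) -> bytes:
    """Hash the content together with both parent hashes (sorted)."""
    lo, hi = sorted((p1, p2))
    return hashlib.sha1(lo + hi + text).digest()


def build_chain(texts):
    """Return the node hashes of a linear history with the given contents."""
    nodes, parent = [], NULL_ID
    for text in texts:
        parent = node_hash(text, parent)
        nodes.append(parent)
    return nodes


# Hypothetical three-revision history; the middle revision leaks a password.
texts = [b"config v1\n", b"config v2 password=hunter2\n", b"config v3\n"]
original = build_chain(texts)

# Rewriting only the middle revision's content...
scrubbed = build_chain([texts[0], b"config v2 password=REDACTED\n", texts[2]])

# ...leaves the ancestor alone but changes the hash of the rewritten
# revision and of every descendant, even though v3's content is untouched.
unchanged = [a == b for a, b in zip(original, scrubbed)]
```

In this sketch `unchanged` marks only the first revision as keeping its hash, which is why every published commit built on top of the sensitive one would have to be rewritten and republished.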

2. Background

If rewriting history is unpalatable, at present the owner of the repository must manually excise the data from the file's history and accept that the hash of that file revision will be unverifiable. Done blindly, any file revisions which are stored as a "delta" based on the offending file data (directly or transitively through a chain of deltas) will be unreadable after the base content is removed. No generally available tools exist for performing this delicate surgery.
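The delta-chain hazard can be sketched with a toy revision store. This is not Mercurial's revlog format: the tuple layout, the `(start, end, replacement)` hunk shape, and the helper names below are all assumptions made for illustration. The "careful" variant shows one plausible precaution (materialising direct dependents as full snapshots before excising), not an existing tool.

```python
class CensoredBaseError(Exception):
    """Raised when a revision (or a delta base it needs) has been excised."""


def apply_patch(base: bytes, patch) -> bytes:
    """Apply (start, end, replacement) hunks, ordered by start, to base."""
    out, pos = [], 0
    for start, end, repl in patch:
        out.append(base[pos:start])
        out.append(repl)
        pos = end
    out.append(base[pos:])
    return b"".join(out)


def read(store, rev) -> bytes:
    """Reconstruct a revision, following its delta chain if necessary."""
    kind, *payload = store[rev]
    if kind == "full":
        return payload[0]
    if kind == "censored":
        raise CensoredBaseError(f"revision {rev} was excised")
    base_rev, patch = payload
    return apply_patch(read(store, base_rev), patch)


def excise_blindly(store, rev):
    store[rev] = ("censored",)


def excise_carefully(store, rev):
    # Re-store every revision that deltas directly against `rev` as a full
    # snapshot first; transitive dependents then chain through those.
    for i, entry in enumerate(store):
        if entry[0] == "delta" and entry[1] == rev:
            store[i] = ("full", read(store, i))
    store[rev] = ("censored",)


# Hypothetical history: rev 1 adds a secret, rev 2 replaces it again.
store = [
    ("full", b"user=alice\n"),
    ("delta", 0, [(11, 11, b"key=SECRET\n")]),
    ("delta", 1, [(11, 22, b"key_file=~/.key\n")]),
]

blind = list(store)
excise_blindly(blind, 1)      # read(blind, 2) now raises CensoredBaseError

careful = list(store)
excise_carefully(careful, 1)  # read(careful, 2) still works
```

Note that revision 2 contains no sensitive data itself; it becomes unreadable in the blind case only because its delta base is gone.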

The repository owner could generally continue committing to the heads of the repository, but attempts to view the repository at any changeset containing the sensitive file data will fail due to the hash mismatch (examples: hg update, hg diff, hg annotate). "hg verify" will fail due to the hash mismatch as well.
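The hash mismatch described here can be illustrated with a simplified integrity check. This is only a sketch of the idea behind verification, not how "hg verify" is implemented; `node_hash`, `verify`, and the `(node, parent, text)` tuple layout are assumptions for the example.

```python
import hashlib

NULL_ID = b"\0" * 20  # stand-in for "no parent"


def node_hash(text: bytes, p1: bytes = NULL_ID, p2: bytes = NULL_ID) -> bytes:
    """Hash the content together with both parent hashes (sorted)."""
    lo, hi = sorted((p1, p2))
    return hashlib.sha1(lo + hi + text).digest()


def verify(revisions):
    """Return indices whose stored content no longer matches the recorded hash."""
    return [
        i
        for i, (node, parent, text) in enumerate(revisions)
        if node_hash(text, parent) != node
    ]


# Build a linear history of (recorded node, parent node, content) tuples.
revisions, parent = [], NULL_ID
for text in [b"a\n", b"a\nsecret\n", b"a\nb\n"]:
    node = node_hash(text, parent)
    revisions.append((node, parent, text))
    parent = node

before = verify(revisions)

# Excise the sensitive content but keep the recorded hash, so that
# descendants' parent pointers stay valid.
node, p1, _ = revisions[1]
revisions[1] = (node, p1, b"")

after = verify(revisions)
```

In this model only the excised revision fails the check. Its descendants still verify because their recorded parent hash is unchanged, which is consistent with the owner being able to keep committing on the heads.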

Fresh clones of such a tainted repository would not receive the sensitive data, but would inherit the limitations of the original repo. Existing clones which do not have a copy of the data will behave similarly.

Existing clones of the repository which include the offending data are unaffected by modifications to the original repository's history; there is no general means through which the original could "reach out" and remove data from all clones. These existing clones will therefore remain fully functional. They will successfully interoperate with the original except when sending or receiving new revisions of the affected file, due to the use of deltas in revision exchange.

As seen with the original data removal, deltas require agreement on a file revision's content. Depending on the scenario, revisions might successfully transfer, abort transfer due to hash mismatches, or silently corrupt the repository in the worst case.
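The three outcomes can be sketched with a toy exchange in which the sender still has the real base and the receiver holds an empty tombstone. Everything here is an assumption for illustration: the hunk format, the `receive` helper, and the use of a plain SHA-1 over the content rather than Mercurial's actual wire protocol and node hashes. The toy patcher also relies on Python's lenient slicing for out-of-range offsets, which a real patcher might reject.

```python
import hashlib


def apply_patch(base: bytes, patch) -> bytes:
    """Apply (start, end, replacement) hunks, ordered by start, to base."""
    out, pos = [], 0
    for start, end, repl in patch:
        out.append(base[pos:start])
        out.append(repl)
        pos = end
    out.append(base[pos:])
    return b"".join(out)


def receive(local_base: bytes, patch, expected_sha1: str, check: bool = True) -> bytes:
    """Apply an incoming delta to the local base, optionally checking the hash."""
    text = apply_patch(local_base, patch)
    if check and hashlib.sha1(text).hexdigest() != expected_sha1:
        raise ValueError("hash mismatch: delta base differs from sender's")
    return text


sender_base = b"key=SECRET\nport=80\n"
receiver_base = b""  # tombstone left behind after excision

new_text = b"key=SECRET\nport=8080\n"
digest = hashlib.sha1(new_text).hexdigest()

small_delta = [(16, 18, b"8080")]                 # edits two bytes of the base
whole_delta = [(0, len(sender_base), new_text)]   # replaces the entire base

# 1. Succeeds: the delta does not depend on any of the base's content.
ok = receive(receiver_base, whole_delta, digest)

# 2. Aborts: the reconstructed text does not match the advertised hash.
try:
    receive(receiver_base, small_delta, digest)
    aborted = False
except ValueError:
    aborted = True

# 3. Worst case: with no hash check, the wrong content is stored silently.
corrupted = receive(receiver_base, small_delta, digest, check=False)
```

Which outcome occurs depends on how much of the base the delta reuses and on whether the receiving side validates the result before storing it.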

TODO(adgar): Verification

3. Design Highlights

4. Implementation Details

5. Testing Plan

CategoryNewFeatures