Nested repositories
Our intention is to integrate a subset of the functionality of the ForestExtension into the core of Mercurial, while maintaining simplicity. This isn't quite a design document: it's more an exploration of the different design decisions that might make sense, and what the tradeoffs are.
This feature has been added to Mercurial 1.3 (1st July 2009). See subrepos for usage documentation.
1. Similar concepts in other systems
git
svn
Perforce
A Perforce client has a spec whose view describes how the server repository is mapped to the client workspace; arbitrary bijective mappings are possible.
2. Goals
The goal is to be able to use multiple repositories as a single, loosely coupled, unit. A "parent" has a notion of several "modules" that live under it. In at least some cases, performing a command in the parent should affect the modules.
By "loosely coupled", we mean that repositories are largely independent.
Relationships are hierarchical and one-to-many: a parent knows about its modules, but they do not know about their parent or sibling repositories.
2.1. Use cases
Here are the important needs we would like to at least consider.
- "Vendor branch": a pile of code that is almost never touched by developers, but that is needed to build a project.
- Modular development: a system composed of largely independent units that do not need to be versioned together.
- Partial views: a developer who only needs to work with two out of twelve modules should not have to download or deal with the other ten.
2.2. Terminology
Names used by sundry systems:
- vendor branch
- external
- submodule
- forest
I'm arbitrarily choosing "module".
3. Managing modules
Modules are listed explicitly, in a directory named .hgmodules in the root of the tree (suggested by BrendanCully). Each directory under .hgmodules corresponds to a module that will be present in the working directory. For example, a directory .hgmodules/foo/bar contains information about a module that will be located in foo/bar in the working directory.
The files and directories under .hgmodules are intended to be read and written by machine.
Note by RonnyPfannschmidt: why .hgmodules instead of .hg/modules? Those are for machines anyway; keep the working directory cleaner!
Answer: because content under .hg is local to the repository (cf. .hgtags vs .hg/localtags), whereas .hgmodules is versioned and shared.
For each module, its directory must contain the following files:
default: specifies the URL to clone from. This is a Mercurial extended URL, so appending #rev ensures that the given revision will be checked out.
optional: if this file is present, the module is optional, not required, for builds and the like.
- If an optional module is not present locally, this is not an error.
The repository directory structure for the .hgmodules given above looks like this:
parent-repo-dir/
.hg/
.hgmodules/foo/bar/default
.hgmodules/quux/default
<working dir content from parent-repo-dir>
foo/bar/
.hg/
<working dir of foo/bar module>
quux/
.hg/
<working dir of quux module>
The configuration files in these directories are plain text, but not intended to be edited by hand. How do we modify them?
Do we modify the add, remove, and rename commands to edit them?
Do we add a hg module command that will do some or all of the editing?
Probably the latter.
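As an illustration of the layout described above, here is a minimal sketch of how a tool might read the .hgmodules tree. The function name read_modules and the shape of the returned dictionary are assumptions made for this example, not part of any Mercurial API; only the file names default and optional, and the "#rev" URL suffix, come from the proposal.

```python
import os
from urllib.parse import urldefrag


def read_modules(root):
    """Return {module path: info} for every module listed under root/.hgmodules.

    Hypothetical helper: a directory counts as a module entry when it
    contains a "default" file, as in the layout sketched in this section.
    """
    base = os.path.join(root, ".hgmodules")
    modules = {}
    for dirpath, dirnames, filenames in os.walk(base):
        if "default" not in filenames:
            continue  # intermediate directory such as .hgmodules/foo
        with open(os.path.join(dirpath, "default")) as fh:
            # Split an optional "#rev" suffix off the clone URL.
            url, rev = urldefrag(fh.read().strip())
        path = os.path.relpath(dirpath, base).replace(os.sep, "/")
        modules[path] = {
            "url": url,
            "rev": rev or None,  # None: no "#rev" suffix was given
            "optional": "optional" in filenames,
        }
    return modules
```

With the example tree above, read_modules("parent-repo-dir") would return one entry keyed "foo/bar" and one keyed "quux". Writing the files would presumably remain the job of whichever command ends up editing them.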
3.1. Discussion
- A generic 'description = words' field per repository might be helpful
Some kind of 'forest tag' or ability to describe the state of the entire forest with one file or simple set of changeset ids. So something like the existing forest command 'hg fsnap' ('hg module --dumpstate > state'?) then being able to re-create a forest later in time with a 'hg clone -M state'? [BenoitAllard: see identify]
Do we need to protect ourselves from overlapping managed files, e.g. an outer repository managing files that are inside the inner repository directories?
DovFeldstern: I think that we are already protected: given repo1/.hg and repo1/repo2/.hg, Mercurial does not allow adding any file under repo2 to repo1. If you try, it responds with abort: path 'repo2/a' is inside repo 'repo2'.
JensWWulf: hg will let you add a file to both repos if you add it to the parent repo before adding it to the underlying repo.
RomanBarczynski: hg should allow you to add, move, or remove the file repo1/repo2/foo/bar in both repo1 and repo2 at any time, and should not ignore repo2's files in repo1 by default (e.g. repo2 contains a third-party library with your own config file that you don't want to push upstream but do want to manage under repo1).
- Two additional suggestions which should not be hard to implement, and which would allow much more flexibility --- to the point of being able to build a full-fledged Configuration Management solution over mercurial:
allow the user to specify a modules tree other than .hgmodules, using an option --modules-file / hgrc setting / environment variable.
provide some kind of include mechanism in the modules file, with a clear override scheme for which version to use of specific modules which are included in more than one file (e.g., the latest included file overrides previously included files; the including file overrides included files)
- Can a module itself have a .hgmodules tree, recursively interpreted?
- Why use a directory structure instead of a single configuration file? I see that these files are not supposed to be edited by humans, but I see several advantages of a normal configuration file:
I can (probably) edit a file much more quickly with an editor than using hg module ... commands.
- I can edit it in more advanced ways: search/replace, moving sections around.
- I can easily mail the configuration to other places. With a directory structure I have to wrap it up in a tarball/zipfile first.
If the configuration file is parsed using, say, ConfigObj then it could also preserve comments left behind by the user. (But in a directory structure one might simply ignore unknown files and so treat them as "comments" so this might not be a big difference.)
Why can we not just have Mercurial figure out that it's an existing Mercurial project when you do hg add, and simply mark the file as so, in the same way that you mark files as files, directories as directories, symbolic links as symbolic links, etc? The existing add, remove and rename commands would work fine thus negating the need for hg module add et al. Any additional metadata would need to be handled in a configuration file or, better, as versioned properties of elements in the repository. (I don't know if versioned properties exist, but it's better to have them as versioned properties as that will mean that when things are renamed, the properties will follow.)
- It should be possible to have a project that contains source to also have sub projects, i.e. the root of the tree should not be just for referencing sub projects. It doesn't sound like anything proposed here would affect that but it's not clear.
All changes to the configuration must be versioned. DovFeldstern: I'm not sure if this is what was meant or not, but IMO changes to the configuration should not automatically trigger a commit; rather, .hgmodules will be changed as required, and the changes should be committed like any other change when the user chooses to commit.
- This page needs a "Status" section describing the status of this proposed change. I.e. In design phase, in coding, proposed timeline, etc. I have a need of this kind of feature and from reading the page I have no idea as to whether it's "right around the corner" or still in the "vigorous hand waving" stage.
CVS also has the ability to manage multiple repository subdirectories from the current working directory. In fact, it doesn't even need a "CVS" directory in the "parent" directory. It simply recurses down subdirectories automatically to find */CVS/Repository and */CVS/Entries files. (Although I'm not sure how far down it will recurse; maybe only a single level.)
Has anyone discussed the idea of having changeset dependencies between modules? For example, I make changes in modules A and B, and the changes in B require the corresponding changes in A (think of A as a library that B uses).
4. Important open questions
Does it only make sense to think about modules when we have a working directory? Presumably yes, but this introduces the need to possibly have a network connection in order to clone missing modules during a hg update or similar.
- If not, where do modules live when we don't have a working directory? (It would be technically possible to separate a module's working directory from its repository, for example, though I'm not sure we want to go there.)
For now, I'm assuming that if there's no working directory, there are no modules.
Here's another sticky question without an obvious answer: By default, should commands that operate in the working directory recurse into modules?
A nice idea suggested on the MailingList is to enable or disable it via an option in the .hgrc file.
The alternative that I lean towards is to not recurse unless explicitly instructed to. Most probably, only a few commands should arguably even be aware of modules.
This model assumes that modules will usually only be read, and checked out at a fixed revision, such that automatically running status queries or updates in them makes little sense: they won't change often enough to be worth the effort. This is in line with the usual use of externals in SVN, and with CVS vendor branches.
For people who would be actively developing in multiple repositories, however, this provides poor support. If you have a better idea, let's hear it! Note that the existing config mechanism lets you add a "--modules" option to whatever commands you think need it.
If a command like "add" is run in a parent repository's working directory, and given a path to a file in a module's working directory, what should its behaviour be? The current behaviour is to complain and fail: should this remain?
What about nested nested-repositories? If I have a .hgmodules tree in one of my modules, should a command issued at the root level also recurse into those "sub-modules"? I guess so.
/root/
.hg/
.hgmodules
module1/
.hg/
module2/
.hg/
.hgmodules
module21/
.hg/
module22/
.hg/
module3/
.hg/
In the structure above, should a command issued at the root level also take module21 and module22 into account? What if only module21 is listed in the .hgmodules of module2? What if I have module22 recorded as a module of root?
5. User interface changes
5.1. The module command
We add the "module" command, for managing modules. It has several subcommands.
- "add" introduces a single new module. A local copy of the repository must already be present. Options:
- "-r": the revision to use.
- "-b": the branch to use.
- "-u": the URL to use.
Note by RonnyPfannschmidt: why not just use hg clone / hg init, since most of the commands need to be module-aware anyway?
Note by ArneBab: Why not hg module clone, as in hgsubversion? It would add the module to .hgmodules and then clone it to the specified location.
"remove" removes one or more modules. (This could be done by modifying the regular remove command...)
- "record" updates the changeset ID associated with each module. Uses the working directory's parent from each module. Aborts if any module has zero or two parents.
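The abort rule for "record" can be sketched as follows. This is only an illustration under stated assumptions: the function record, the exception RecordError, and the input format (a mapping from module path to the list of that module's working-directory parent IDs) are all hypothetical, and the sketch does not call into Mercurial itself.

```python
class RecordError(Exception):
    """Raised when a module's working directory has no single parent."""


def record(parents_by_module):
    """Map each module path to the ID of its single working-directory parent.

    Every module is checked before anything is returned, so a failure in
    one module leaves no partially recorded state behind.
    """
    recorded = {}
    for path, parents in sorted(parents_by_module.items()):
        if len(parents) != 1:
            # Zero parents: nothing checked out. Two: uncommitted merge.
            raise RecordError(
                "module %s has %d working directory parents, expected 1"
                % (path, len(parents))
            )
        recorded[path] = parents[0]
    return recorded
```

The returned mapping is what would then be written back into the module entries under .hgmodules.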
To clone optional modules, do we extend the behaviour of the built-in clone command, or add a "clone" command here (+1) ?
5.2. Changes to existing commands
5.2.1. Uniform option naming
We introduce a standard -M / --modules option for commands that need to become module-aware. The name of the option is standard: its interpretation can change, depending on the command.
Note by AdrianBuehlmann: -M is already used as --no-merges on hg log, incoming and outgoing
5.2.2. clone
If invoked with -U to avoid an update, this simply does not clone any modules.
For behaviour without -U, see "update" below.
- We do not need to special-case a local clone: that would be handled by "update", too.
5.2.3. update
- If a required module is missing, it is cloned and updated.
Before trying the URL stored in the .hgmodules/.../default file, we attempt to clone the module from a location relative to wherever we cloned the parent from.
If the default for a module test is http://hg.example.com/foo but we cloned its parent from http://otherexample.net/bar, we try to clone the module from http://otherexample.net/bar/test before trying the default location.
- If an optional child module is missing, nothing happens.
- The content of the ".hgmodules" file in the working directory is used to decide which children to clone and update.
- In other words, as with the ".hgignore" file, changes to the ".hgmodules" file do not need to be committed in order to take effect.
- Children are not inspected or updated until work in the parent is complete: this traversal is breadth-first, not depth-first.
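The lookup order for a missing module, described in the list above, might be sketched like this. The function name clone_candidates and its parameters are assumptions for illustration; the proposal only fixes the order (a location relative to the parent's clone source first, then the stored default).

```python
def clone_candidates(parent_source, module_path, default_url):
    """List the URLs to try, in order, when cloning a missing module.

    parent_source: where the parent was cloned from (may be empty).
    module_path:   the module's path relative to the parent, e.g. "test".
    default_url:   the URL stored in .hgmodules/<module_path>/default.
    """
    candidates = []
    if parent_source:
        # First choice: a location relative to the parent's clone source.
        candidates.append(parent_source.rstrip("/") + "/" + module_path)
    if default_url not in candidates:
        # Fallback: the default location recorded for the module.
        candidates.append(default_url)
    return candidates
```

For the example above, clone_candidates("http://otherexample.net/bar", "test", "http://hg.example.com/foo") yields http://otherexample.net/bar/test followed by http://hg.example.com/foo.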
Ideas that probably don't make sense:
The -M / --modules option causes each module to be updated to whatever revision is appropriate, based on the current contents of ".hgmodules".
- Requiring an explicit option makes it too easy to get the parent out of sync with its modules.
The default here should be to require an option to avoid updating the children.
5.2.4. add, remove, rename
- What should these commands do if asked to operate on a module, or a directory containing a module?
- Modify ".hgmodules" to add, remove, or rename a module? (+1)
- Print a warning advising ... something else to be done?
- Remain untouched?
5.2.5. pull
- Accepts a -M / --modules option, to pull in modules as well as this repository.
- If both --modules and --update are specified, both this repository and each module are updated.
- Not clear whether the order of execution (relative to the parent) matters.
- If one pull fails, do the others continue, or does everything come to a halt?
JesseGlick: I would expect pull -u (or fetch) with --modules to first update the parent, then inspect its updated .hgmodules to see what modules might be there that also need to be updated.
5.2.6. push
- Accepts a -M / --modules option, to push from modules as well as this repository.
Must push from all children (depth first) before the parent, otherwise remote users will not be able to pull when a push has partially completed, because ".hgmodules" may refer to revisions not yet pushed.
- If one push fails, do the others continue, or does everything come to a halt?
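The children-before-parent ordering required for push is a post-order traversal of the module tree. A minimal sketch, assuming the tree is given as a plain dictionary from each repository to its list of modules (the function push_order and this input format are hypothetical):

```python
def push_order(tree, root):
    """Return repositories in the order they should be pushed: children first."""
    order = []
    seen = set()

    def visit(repo):
        if repo in seen:
            return  # a module recorded under two parents is pushed once
        seen.add(repo)
        for child in tree.get(repo, ()):
            visit(child)
        order.append(repo)  # appended only after all of its children

    visit(root)
    return order
```

Because a parent is appended only after every one of its modules, by the time the parent is pushed, every revision its ".hgmodules" refers to has already been pushed. What to do when one push fails is left open here, as it is in the text.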
5.2.7. bundle
- Accepts a -M / --modules option to create a global bundle including the nested modules.
- Optional ones will not be included unless specifically asked.
JesseGlick: I'm not sure what bundle --modules should do, actually. The current format can only bundle changesets from one repo.
5.2.8. incoming, outgoing
- Accept -M / --modules options, to operate in modules as well as this repository.
5.2.9. tag
- Accepts a -M / --modules option, to tag in modules as well as this repository.
5.2.10. branch, branches
- Accepts a -M / --modules option to create a branch / show branches in modules as well as in the repository.
5.2.11. status
Accepts a -M / --modules option. This simply lists modules: it does not recurse into modules. What if I want the status of the files in my modules?
- We can identify modules that are present with "M" - what do we do for modules that are missing? What about optional modules?
5.2.12. identify
Accepts a -M / --modules option. This simply dumps the state of the current repository and its sub-modules (à la fsnap from the forest extension).
5.3. Questionable commands
Here are some possible behaviours for commands where it's really not clear that being module-aware makes sense at all.
5.3.1. commit
- Accepts a -M / --modules option, to commit inside modules.
- If a commit message is not explicitly provided, do we use the commit message from the parent in every module, or prompt for a new message in each?
- Probably the former.
BenoitAllard is not in favour of making the commit recursive, as each feature should stay in its own repository. A recursive commit encourages bad habits, such as creating one module per directory. A commit common to a list of repositories is, to my mind, a sign of a flaw in the design of the repositories.
We have the possibility of rolling every commit back if any commit fails, when using --modules. Do we want to do this?
JesseGlick: commit --modules would be nice (for a forest of loosely synchronized repositories) but not essential.
MarcusLindblom: add an option in .hgmodules for whether this is allowed or not, to allow any policy. Default to not allowed (unless forced)?
5.3.2. Next sticky question
If we make "commit" module-aware, why not status, diff, and all the rest?
6. Implementation
Alexander Solovyov has a proof-of-concept implementation to provide subrepositories written as an extension. To make it work, a patched version of Mercurial is needed. See details in the extension docs.
There's also an implementation of subrepos as an integrated feature of the mercurial core.
One more implementation as an extension: subrepo extension
An extension for handling external dependencies (and Mercurial subrepositories): hgdeps extension
