This page is intended for developers.
This is a proposed feature, last updated on 2011-10-26.
Motivation
Microsoft Windows systems typically treat filenames as character sequences, whereas Unix systems typically treat filenames as byte sequences. Thus, where it is natural for a filename to have a single interpretation on a Windows system, a filename on a Unix system can be interpreted in multiple ways. Consider one user on a Unix system using a locale with the ISO-8859-1 encoding who saves a file called "Þingvellir"; another user using a locale with the ISO-8859-5 encoding who then browses the directory containing this file will not see the correct name for the file.
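The ambiguity can be reproduced in a few lines of Python. The sketch below is only an illustration and not part of the proposal: it encodes the name from the example as ISO-8859-1 bytes, which is what a Unix filesystem would store, and then decodes the same bytes under other encodings. The variable names are made up for this example.

```python
# Bytes as a Unix filesystem would store them for a user in an ISO-8859-1 locale.
raw = "Þingvellir".encode("iso-8859-1")

# The saving user's locale recovers the intended name.
as_latin1 = raw.decode("iso-8859-1")

# A user in an ISO-8859-5 locale gets a different first character
# from exactly the same bytes.
as_cyrillic = raw.decode("iso-8859-5")

# The bytes are not valid UTF-8 at all, so a UTF-8 locale cannot decode them.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# ascii() keeps the output printable on any terminal encoding.
print(raw, ascii(as_latin1), ascii(as_cyrillic), valid_utf8)
```

The byte sequence never changes; only its interpretation does, which is why a Unix system can avoid deciding what the name "really" is.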
Although there may be confusion about the interpretation of the file's name, on Unix it is possible to avoid taking a position on what the name should actually be: the user who saved the file will still see the correct representation of the name (unless they change their locale, which is not completely unlikely over time). Other users can still manipulate the file, although they may see question marks or other characters in place of the proper characters in the name, and copying a name containing such placeholder characters and expecting the name to still be recognised by the operating system is likely to lead to disappointment.
However, upon wishing to transfer such an ambiguously named file to a Windows system, the issue of interpretation arises immediately since the naming of the file must reflect the actual intended name and employ character values, not just a bag of bytes.
Objectives
- A repository employing only ASCII filenames should not be affected by this mechanism.
- Only Windows will be affected by Unicode conversion of filenames. A test for sys.platform == 'win32' should guard the functionality.
- UTF-8 should be the default filename encoding for new commits.
- To support the old encoding-insensitive Mercurial repository format, a new --encoding option will be offered.
- Automatic renaming of ambiguous filenames to UTF-8-encoded filenames should be supported.
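A minimal sketch of how these objectives might combine is shown below. It is not Mercurial code: the function names is_ascii_name, conversion_applies and rename_to_utf8 are hypothetical, and the only detail taken from the proposal is the sys.platform == 'win32' guard.

```python
import sys


def is_ascii_name(name: bytes) -> bool:
    """Return True if every byte is in the ASCII range 0x00-0x7F."""
    return all(byte < 0x80 for byte in name)


def conversion_applies(name: bytes, platform: str = sys.platform) -> bool:
    """Hypothetical guard: convert only on Windows, and only non-ASCII names."""
    return platform == "win32" and not is_ascii_name(name)


def rename_to_utf8(name: bytes, old_encoding: str) -> bytes:
    """Re-encode a stored name from a known old encoding to UTF-8."""
    return name.decode(old_encoding).encode("utf-8")


legacy = "简体.txt".encode("cp936")
print(conversion_applies(b"README.txt", "win32"))  # ASCII name: left alone
print(conversion_applies(legacy, "win32"))         # non-ASCII name on Windows
print(conversion_applies(legacy, "linux"))         # other platforms: left alone
print(rename_to_utf8(legacy, "cp936") == "简体.txt".encode("utf-8"))
```

Note that rename_to_utf8 only works when the old encoding is known, which is the reason the next section asks for an explicit configuration.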
Per-repository encoding configuration needed
An explicit encoding configuration is needed for each repository, so that we don't need to guess the repository's encoding. A per-user or per-machine encoding configuration may not be a good idea, because a single user may interact with repositories from different origins, employing different locales, and a single machine may have many users, each using a selection of different locales.
So the repository’s encoding configuration should be like this:
Parameter | Example values | Description
old_repository_encoding | ascii, cp1251, cp936, cp1252, utf8 and so on | This is the encoding of the repository before conversion.
repository_encoding | utf8 or ascii | In most cases, it's utf8. However, some users don't want to use UTF-8, so it could be another encoding (as noted above), and those users can still do everything in the old way.
separator_revision | 0, 128, 312 | 0 means the repository only contains one encoding, so [0, tip] is encoded in repository_encoding. Anything else means [0, separator_revision) is encoded in old_repository_encoding, and [separator_revision, tip] is encoded in repository_encoding. This may not be a good choice; please refer to #Other choice about encoding configuration.
We set three parameters because migrating from an old repository to a new one raises a new problem: how do we check out old history? For example, how do we check out old-tag or old-branch? When we check out those revisions, we need to work in the old way.
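To make the role of the three parameters concrete, here is a small sketch of decoding stored path bytes from either side of the separator revision. The EncodingConfig class and the decode_path function are hypothetical names introduced for this illustration; only the three setting names and the range semantics come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncodingConfig:
    """Hypothetical container for the three proposed settings."""

    old_repository_encoding: str
    repository_encoding: str
    separator_revision: int = 0

    def encoding_for(self, rev: int) -> str:
        # [0, separator_revision) uses the old encoding, the rest the new one.
        # With separator_revision == 0 the first range is empty.
        if rev < self.separator_revision:
            return self.old_repository_encoding
        return self.repository_encoding


def decode_path(raw: bytes, rev: int, config: EncodingConfig) -> str:
    """Decode a stored path with the encoding in force at revision rev."""
    return raw.decode(config.encoding_for(rev))


config = EncodingConfig("cp936", "utf8", separator_revision=1920)
old_bytes = "简体.txt".encode("cp936")  # as stored before the migration
new_bytes = "简体.txt".encode("utf8")   # as stored after the migration

# Both revisions yield the same character sequence once decoded correctly.
print(decode_path(old_bytes, 100, config) == decode_path(new_bytes, 2000, config))
```

Without the separator revision, a checkout of an old tag would decode cp936 bytes as utf8 and fail or produce the wrong name.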
Other choice about encoding configuration
Convert all of the old repository's paths to a new encoding (UTF-8) and recommit them. I don't know whether such a modification would disturb anything, for example whether the hash of each ctx would change. (This is important.)
Set a ctx.extra['encoding'] property, so that when a revision is committed, the encoding information is attached to it. Then, when we work with a ctx, the following scheme applies:
The scheme applying only to Windows
if windows:
    if there is configuration about encoding:
        rev = current revision
        if rev is working directory:
            '''
            The working directory's encoding may differ from its parent's encoding.
            When this happens, we should automatically handle the rename of the
            affected (messed-up) characters.
            '''
            mode = repository_encoding, parent_encoding
        elif rev < separator_revision:
            mode = old_repository_encoding
        else:
            mode = repository_encoding
    else:
        mode = passthrough
else:
    mode = passthrough
The scheme applying to all operating systems
if there is configuration about encoding:
    rev = current revision
    if rev is working directory:
        '''
        The working directory's encoding may differ from its parent's encoding.
        When this happens, we should automatically handle the rename of the
        affected (messed-up) characters.
        '''
        mode = repository_encoding, parent_encoding
    elif rev < separator_revision:
        mode = old_repository_encoding
    else:
        mode = repository_encoding
else:
    mode = passthrough
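Both schemes can be sketched as one runnable Python function. This is an illustration under stated assumptions, not Mercurial code: select_mode, PASSTHROUGH, WORKING_DIRECTORY and the windows_only flag are hypothetical, the configuration is modelled as a plain dict, and the comparison uses the half-open range [0, separator_revision) from the parameter table for both variants.

```python
from typing import Optional, Tuple, Union

PASSTHROUGH = "passthrough"
WORKING_DIRECTORY = None  # hypothetical marker meaning "rev is the working directory"

Mode = Union[str, Tuple[str, str]]


def select_mode(
    config: Optional[dict],
    rev: Optional[int],
    parent_encoding: str,
    platform: str,
    windows_only: bool = True,
) -> Mode:
    """Pick the filename handling mode for one revision."""
    # Windows-only variant: every other platform passes bytes through untouched.
    if windows_only and platform != "win32":
        return PASSTHROUGH
    # No encoding configuration: behave exactly as before.
    if config is None:
        return PASSTHROUGH
    # Working directory: both encodings are needed to detect and handle renames.
    if rev is WORKING_DIRECTORY:
        return (config["repository_encoding"], parent_encoding)
    # Committed revisions: old encoding before the separator, new one after.
    if rev < config["separator_revision"]:
        return config["old_repository_encoding"]
    return config["repository_encoding"]


cfg = {
    "old_repository_encoding": "cp936",
    "repository_encoding": "utf8",
    "separator_revision": 1920,
}
print(select_mode(cfg, 100, "utf8", "win32"))
print(select_mode(cfg, WORKING_DIRECTORY, "cp936", "win32"))
print(select_mode(cfg, 100, "utf8", "linux", windows_only=False))
```

Passing windows_only=False gives the all-platform variant; the rest of the decision is identical.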
Upgrade
An upgrade tool also needs to be supplied, so that repositories created with old tools can be upgraded to use utf8 as the repository encoding.
hg encoding upgrade --old ascii --new utf8 --sep 1920
hg encoding upgrade --old cp936 --new utf8
hg encoding upgrade --new utf8  # The old encoding is the current locale
hg encoding upgrade  # The old encoding is the current locale and the new encoding is utf8
hg encoding --verify  # Iterate over the whole repository to verify that each path in each revision is encoded in the configured encoding
hg encoding upgrade --clear
Explanation of the options
--old  The old repository's original encoding.
--new  In most cases this is utf8. When someone wants to use Mercurial in the old way, it can be set to something else (such as ascii, to make sure that every committed filename is encoded in ASCII).
--sep  Sets the separator revision. When not specified, it is the newest revision, or 0 if the old encoding is ascii.
--clear  Regenerates the repository with all paths in all revisions converted to the new encoding (utf8). (Some people may want this.)
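The defaulting rules for these options could be resolved roughly as follows. This is a sketch, not an implementation of the proposed command: resolve_upgrade_options is a hypothetical helper, tip stands for the newest revision number, and using locale.getpreferredencoding to obtain "the current locale" encoding is an assumption.

```python
import locale
from typing import Optional, Tuple


def resolve_upgrade_options(
    tip: int,
    old: Optional[str] = None,
    new: Optional[str] = None,
    sep: Optional[int] = None,
) -> Tuple[str, str, int]:
    """Fill in the defaults described for --old, --new and --sep."""
    if old is None:
        # Assumption: "the current locale" means the preferred locale encoding.
        old = locale.getpreferredencoding(False)
    if new is None:
        new = "utf8"
    if sep is None:
        # ASCII history is already valid in the new encoding, so no split is needed.
        sep = 0 if old == "ascii" else tip
    return old, new, sep


print(resolve_upgrade_options(2500, old="ascii", new="utf8", sep=1920))
print(resolve_upgrade_options(2500, old="cp936", new="utf8"))
print(resolve_upgrade_options(2500, old="ascii"))
```

An explicitly given --sep always wins in this sketch, which matches the first example command, where --old ascii is combined with --sep 1920.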
Notes
- The filename conversion only occurs on Windows.
- The encoding criteria have to be explicitly defined in advance so that nothing needs to be guessed afterwards. After an initial upgrade/conversion, everything will be handled automatically.
- Existing repositories will not be affected at all, although the user can still run the hg encoding command to upgrade a repository.
- Newly created repositories will be set to UTF-8 by default, but we can still supply an option to create a repository in the old way.
Compatibility
Old versions of Mercurial should just continue to work as normal when a given repository is not configured for encodings. Otherwise, if the repository's encoding is the same as the locale encoding then it should also continue to work exactly as it did before. If the repository only contains ASCII paths (byte values in the range 0 to 0x7F), it should also function exactly as before. Where none of these conditions are met, a checkout may be obtained, but it may not function properly. It is envisaged that upgraded/converted repositories will not in general function with old versions of Mercurial.
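The three conditions under which an old version keeps working can be written as a single predicate. The function name old_client_unaffected is hypothetical, and comparing encodings through codecs.lookup, so that aliases such as utf8 and UTF-8 count as equal, is a choice made for this sketch rather than something the proposal specifies.

```python
import codecs
from typing import Iterable, Optional


def old_client_unaffected(
    repository_encoding: Optional[str],
    locale_encoding: str,
    paths: Iterable[bytes],
) -> bool:
    """Return True if an old client should behave exactly as before."""
    # Condition 1: the repository is not configured for encodings at all.
    if repository_encoding is None:
        return True
    # Condition 2: repository and locale encodings are the same codec.
    if codecs.lookup(repository_encoding).name == codecs.lookup(locale_encoding).name:
        return True
    # Condition 3: every path consists only of bytes in the range 0 to 0x7F.
    return all(byte < 0x80 for path in paths for byte in path)


print(old_client_unaffected(None, "cp1252", [b"\xe9.txt"]))
print(old_client_unaffected("utf8", "UTF-8", ["é.txt".encode("utf8")]))
print(old_client_unaffected("utf8", "cp1252", [b"README", b"src/main.c"]))
print(old_client_unaffected("utf8", "cp1252", ["é.txt".encode("utf8")]))
```

Only the last call falls outside all three conditions, which is the case where a checkout may be obtained but may not function properly.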
Questions and Answers
- Question
- So what happens when I create and check in a file that uses a non-ASCII file name encoded in something other than utf8 on a Unix box and you try to check it out on your Windows box?
- Answer
- First, you know which encoding you used on the Unix box. Suppose the encoding is cp936. On Windows, execute the following command:
hg encoding upgrade --old cp936 --new cp936
This only needs to be executed once, while the repository's encoding has not yet been decided. After that, the repository will work fine on Windows.
- Question
- I don't see why the repo needs an encoding (where by "repo" I mean the stuff under .hg).
- Answer (by Andrey)
- The same file name is encoded differently for different platforms. For instance 'дятел.txt' cannot be exchanged between Unix and Windows.
General test case
Commit and check out the following files without problems under Windows 2000 and later. The content of each file is set to its filename, encoded in utf8.
Chinese (Traditional).txt
简体.txt
繁体.txt
중국어 (번체).txt
Chinês (Tradicional).txt
áéíóúñ.txt 'accented characters'
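The fixture files for this test case could be generated with a short script such as the one below. It is a sketch covering only file creation and content checking, not the commit and checkout steps; the helper names are hypothetical, and it assumes that 'accented characters' is a label for the last entry rather than part of a filename.

```python
import os
import tempfile

NAMES = [
    "Chinese (Traditional).txt",
    "简体.txt",
    "繁体.txt",
    "중국어 (번체).txt",
    "Chinês (Tradicional).txt",
    "áéíóúñ.txt",
]


def write_test_files(directory: bytes) -> None:
    """Create each file with its own UTF-8 encoded name as content."""
    for name in NAMES:
        encoded = name.encode("utf-8")
        with open(os.path.join(directory, encoded), "wb") as handle:
            handle.write(encoded)


def check_test_files(directory: bytes) -> bool:
    """Return True if every file still holds its UTF-8 encoded name."""
    for name in NAMES:
        encoded = name.encode("utf-8")
        with open(os.path.join(directory, encoded), "rb") as handle:
            if handle.read() != encoded:
                return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    tmp_bytes = os.fsencode(tmp)  # bytes paths avoid locale-dependent encoding
    write_test_files(tmp_bytes)
    result = check_test_files(tmp_bytes)
print(result)
```

In a real run, check_test_files would be called on a fresh checkout of the committed files instead of on the directory that was just written.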
