<!> This page is intended for developers.

<!> This is a proposed feature, last updated on 2011-10-26.

Motivation

Microsoft Windows systems typically treat filenames as character sequences, whereas Unix systems typically treat filenames as byte sequences. Thus, where it is natural for a filename to have a single interpretation on a Windows system, a filename on a Unix system can be interpreted in multiple ways. Consider one user on a Unix system using a locale with the ISO-8859-1 encoding who saves a file called "Þingvellir"; another user whose locale uses the ISO-8859-5 encoding and who then browses the directory containing this file will not see the correct name for the file.
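As a small illustration of the ambiguity described above, the following sketch decodes the same stored bytes under the two encodings mentioned. The variable names are made up for this example; only Python's standard codecs are used.

```python
# The same bytes on disk, interpreted under two different locale encodings.
name = "Þingvellir"
raw = name.encode("iso-8859-1")         # bytes as stored by the first user

as_latin1 = raw.decode("iso-8859-1")    # what the first user sees
as_cyrillic = raw.decode("iso-8859-5")  # what the second user sees

print(raw)          # the first byte is 0xDE
print(as_latin1)    # the original name
print(as_cyrillic)  # a different name: 0xDE is a Cyrillic letter here
```

The byte sequence never changes; only its interpretation does, which is why the name has to be pinned to a declared encoding before it can be mapped to characters.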

Although there may be confusion about the interpretation of the file's name, on Unix it is possible to avoid taking a position on what the name should actually be: the user who saved the file will still see the correct representation of the name (unless they change their locale, which is not completely unlikely over time). Other users can still manipulate the file, although they may see question marks or other characters in place of the proper characters in the name. Copying a name that contains such placeholder characters and expecting the operating system to still recognise it is likely to lead to disappointment.

However, upon wishing to transfer such an ambiguously named file to a Windows system, the issue of interpretation arises immediately since the naming of the file must reflect the actual intended name and employ character values, not just a bag of bytes.

Objectives

Per-repository encoding configuration needed

A clearly encoded configuration for each repository is needed, so that we don't need to guess the repository's encoding. A per-user or per-machine encoding configuration may not be a good idea, because a single user may interact with repositories from different origins, employing different locales, and a single machine may have many users, each using a selection of different locales.

The repository's encoding configuration should therefore consist of the following settings:

old_repository_encoding

such as ascii, cp1251, cp936, cp1252, utf8 and so on

This is the encoding of the repository before conversion.

repository_encoding

such as utf8 or ascii

In most cases, it's utf8. However, some users don't want to use UTF-8, so it could be another encoding (as noted above), and those users can still do everything in the old way.

separator_revision

such as 0,128,312

This may not be a good choice; please refer to #Other choice about encoding configuration.

0 means the repository only contains one encoding, so [0, tip] is encoded in repository_encoding.

Anything else means [0, separator_revision) is encoded in old_repository_encoding, and [separator_revision, tip] is encoded in repository_encoding.
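The range rule above can be sketched as a small function. This is only an illustration of the stated intervals; the function name and its signature are assumptions, not part of any existing Mercurial API.

```python
def encoding_for_revision(rev, old_repository_encoding,
                          repository_encoding, separator_revision):
    """Return the encoding assumed for paths stored in revision `rev`.

    Hypothetical helper: follows the half-open range [0, separator_revision)
    for the old encoding, as described in the text.
    """
    if separator_revision == 0:
        # The repository contains a single encoding throughout.
        return repository_encoding
    if rev < separator_revision:
        return old_repository_encoding
    return repository_encoding


print(encoding_for_revision(127, "cp936", "utf8", 128))
print(encoding_for_revision(128, "cp936", "utf8", 128))
```

With a separator of 128, revision 127 is still read in the old encoding and revision 128 is the first one read in the new encoding.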

We set three parameters because migrating from an old repository to a new repository raises a new problem: how do we check out old history, such as old-tag or old-branch? When we check out those revisions, we need to work in the old way.

Other choice about encoding configuration

  1. Convert all of the old repository's paths to a new encoding (UTF-8) and recommit them. I don't know whether such a modification will disturb anything, such as whether the hash will change for each ctx. (This is important.)

  2. Set a ctx.extra['encoding'] property, so that when a revision is committed, the encoding information is attached to it. Then, when we work with a ctx, the following scheme applies:

         if ctx is working directory:
           ctx_encoding = repository_encoding  # such as UTF8
         elif 'encoding' in ctx.extra():
           ctx_encoding = ctx.extra()['encoding']
         else:
           ctx_encoding = old_repository_encoding
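A runnable sketch of this per-changeset scheme might look as follows. FakeContext is a hypothetical stand-in written for this example (it is not Mercurial's context class), and context_encoding is an assumed helper name; only the extra() accessor mirrors the usage shown in the scheme.

```python
class FakeContext:
    """Hypothetical stand-in for a changeset context, for illustration only."""

    def __init__(self, extra=None, is_working_directory=False):
        self._extra = dict(extra or {})
        self.is_working_directory = is_working_directory

    def extra(self):
        return self._extra


def context_encoding(ctx, repository_encoding, old_repository_encoding):
    """Pick the path encoding for `ctx` following the scheme in the text."""
    if ctx.is_working_directory:
        return repository_encoding
    # Committed revisions carry their own encoding if one was recorded;
    # otherwise they predate the scheme and use the old encoding.
    return ctx.extra().get("encoding", old_repository_encoding)


tagged = FakeContext(extra={"encoding": "utf8"})
legacy = FakeContext()
print(context_encoding(tagged, "utf8", "cp936"))
print(context_encoding(legacy, "utf8", "cp936"))
```

Compared with a single separator revision, recording the encoding per changeset avoids relying on revision numbers, which are local to a clone.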
      

The scheme applying only to Windows

   if windows:
     if there is configuration about encoding:
       rev = current revision
       if rev is working directory:
         # The working directory's encoding may differ from its parent's encoding.
         # When this happens, we should automatically handle the rename of the
         # files whose names contain the affected characters.
         mode = repository_encoding, parent_encoding
       elif rev < separator revision:
         mode = old_repository_encoding
       else:
         mode = repository_encoding
     else:
       mode = passthrough
   else:
     mode = passthrough

The scheme supporting all operating systems

   if there is configuration about encoding:
     rev = current revision
     if rev is working directory:
       # The working directory's encoding may differ from its parent's encoding.
       # When this happens, we should automatically handle the rename of the
       # files whose names contain the affected characters.
       mode = repository_encoding, parent_encoding
     elif rev < separator revision:
       mode = old_repository_encoding
     else:
       mode = repository_encoding
   else:
     mode = passthrough
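Both variants of the mode selection can be expressed as one function with a flag for the Windows-only behaviour. This is a sketch under assumptions: the configuration is modelled as a plain dict (or None when absent), the working directory is represented by rev=None, revisions before the separator use the old encoding (matching the half-open range given in the configuration section), and all names are invented for this example.

```python
PASSTHROUGH = "passthrough"


def select_mode(config, rev, parent_encoding=None,
                windows_only=False, is_windows=False):
    """Return the mode for `rev`; rev=None stands for the working directory."""
    if windows_only and not is_windows:
        # Windows-only variant: other systems keep treating names as bytes.
        return PASSTHROUGH
    if not config:
        return PASSTHROUGH
    if rev is None:
        # Working directory: both encodings are needed so that names can be
        # mapped between the parent's encoding and the repository's encoding.
        return (config["repository_encoding"], parent_encoding)
    if rev < config["separator_revision"]:
        return config["old_repository_encoding"]
    return config["repository_encoding"]


config = {
    "old_repository_encoding": "cp936",
    "repository_encoding": "utf8",
    "separator_revision": 312,
}
print(select_mode(config, 100))
print(select_mode(config, None, parent_encoding="cp936"))
print(select_mode(config, 100, windows_only=True, is_windows=False))
```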

Upgrade

An upgrade tool also needs to be supplied so that repositories created with old tools can be upgraded to use utf8 as the repository encoding.

hg encoding upgrade --old ascii --new utf8 --sep 1920
hg encoding upgrade --old cp936 --new utf8
hg encoding upgrade --new utf8 # The old encoding is taken from the current locale
hg encoding upgrade # The old encoding is taken from the current locale and the new encoding is utf8
hg encoding --verify # Iterate over the whole repository to verify that each path in each revision is encoded in the configured encoding
hg encoding upgrade --clear

Option explanations

--old The old repository's original encoding.
--new In most cases, this is utf8. When someone wants to use Mercurial in the old way, it can be set to something else (such as ascii, to make sure that every committed filename is encoded in ascii).
--sep Sets the separator revision. When not specified, it is the newest revision, or 0 if the old encoding is ascii.
--clear Regenerates the repository with every path in every revision converted to the new encoding (utf8). (Some users may want this.)
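The core of what such a tool would do to each stored path can be sketched in a few lines. The two helpers below are hypothetical and operate on plain byte strings; they do not touch a repository and are not existing Mercurial functions.

```python
def reencode_path(path, old_encoding, new_encoding="utf-8"):
    """Convert a stored path from the old encoding to the new one."""
    return path.decode(old_encoding).encode(new_encoding)


def undecodable_paths(paths, encoding):
    """Return the paths that are not valid in `encoding` (a verify-style check)."""
    bad = []
    for path in paths:
        try:
            path.decode(encoding)
        except UnicodeDecodeError:
            bad.append(path)
    return bad


old_path = "简体.txt".encode("cp936")
new_path = reencode_path(old_path, "cp936")
print(new_path.decode("utf-8"))
print(undecodable_paths([b"plain.txt", old_path], "utf-8"))
```

A successful decode only shows that the bytes are valid in the configured encoding, not that the encoding is the intended one: single-byte encodings such as ISO-8859-1 accept any byte sequence, so a verification step of this kind is most informative when the target is utf8 or ascii.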

Notes

Compatibility

Old versions of Mercurial should just continue to work as normal when a given repository is not configured for encodings. Otherwise, if the repository's encoding is the same as the locale encoding then it should also continue to work exactly as it did before. If the repository only contains ASCII paths (byte values in the range 0 to 0x7F), it should also function exactly as before. Where none of these conditions are met, a checkout may be obtained, but it may not function properly. It is envisaged that upgraded/converted repositories will not in general function with old versions of Mercurial.
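The three conditions under which an old client is expected to keep working can be written down as a predicate. This is a sketch of the rules in the paragraph above; the function names are assumptions, and the encoding comparison relies on Python's codec registry to treat aliases such as utf8 and UTF-8 as equal.

```python
import codecs


def is_ascii_path(path):
    """True if every byte of the path is in the range 0 to 0x7F."""
    return all(byte <= 0x7F for byte in path)


def old_client_expected_to_work(paths, repository_encoding=None,
                                locale_encoding=None):
    """Apply the compatibility conditions to a collection of byte paths."""
    if repository_encoding is None:
        # The repository is not configured for encodings.
        return True
    if locale_encoding is not None:
        same = (codecs.lookup(repository_encoding).name
                == codecs.lookup(locale_encoding).name)
        if same:
            return True
    # Otherwise only pure-ASCII repositories behave exactly as before.
    return all(is_ascii_path(path) for path in paths)


print(old_client_expected_to_work([b"README"], "utf8", "cp1252"))
print(old_client_expected_to_work(["é.txt".encode("utf-8")], "utf8", "cp1252"))
```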

Questions and Answers

General test case

Commit and check out the following files without problems under Windows 2000 and later. The content of each file is set to its filename, encoded in utf8.

Chinese (Traditional).txt
简体.txt
繁体.txt
중국어 (번체).txt
Chinês (Tradicional).txt
áéíóúñ.txt 'accented characters'
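The fixture files for this test case could be generated with a short script such as the one below. It is a sketch only: the helper name is invented, the filenames are passed as Unicode strings (so it assumes the platform's filesystem encoding can represent them), and committing and checking out the files is left to the test itself.

```python
import os
import tempfile

TEST_NAMES = [
    "Chinese (Traditional).txt",
    "简体.txt",
    "繁体.txt",
    "중국어 (번체).txt",
    "Chinês (Tradicional).txt",
    "áéíóúñ.txt",
]


def write_test_files(directory):
    """Create each test file, with its own UTF-8 encoded name as content."""
    for name in TEST_NAMES:
        with open(os.path.join(directory, name), "wb") as handle:
            handle.write(name.encode("utf-8"))


with tempfile.TemporaryDirectory() as workdir:
    write_test_files(workdir)
    print(len(os.listdir(workdir)), "files created")
```

Because each file's content is its own name in a known encoding, a later check can compare the name seen after checkout against the content to detect any mangling.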



UnicodeOnWindows (last edited 2011-10-26 15:03:27 by PaulBoddie)