Issue1080

Title: Problem with UTF8 filenames on OSX
Priority: bug
Status: resolved
Superseder:
Nosy List: cyanite, danchr, djc, jldiaz, kiilerix, mg, mpm, pirmin, ralf1070, semtlnori, tksoh
Assigned To:
Topics: casefolding

Created on 2008-04-07.17:01:10 by ralf1070, last changed 2011-12-30.22:15:16 by mpm.

Files
File name Uploaded Type
osxNFD.py jldiaz, 2009-04-23.18:32:06 text/x-python
osxNFD.py.patch semtlnori, 2011-09-14.06:04:04 application/octet-stream
osxNFD.pyc jldiaz, 2009-04-23.18:27:56 application/x-python-code
Messages
msg18456 (view) Author: mpm Date: 2011-12-30.22:15:16
Fixed in 2.0.1
msg17441 (view) Author: kiilerix Date: 2011-09-14.11:41:38
jldiaz / semtlnori: Please host the extension somewhere as described on
http://mercurial.selenic.com/wiki/PublishingExtensions so other users can
benefit from your extension too.
msg17435 (view) Author: semtlnori Date: 2011-09-14.06:04:04
jldiaz//

Here is my patch for your extension 'osxNFD.py' to fix an AttributeError.

Finally it works for me well! (MacOSX 10.7.1 and Mercurial 1.9.1)
msg9168 (view) Author: jldiaz Date: 2009-04-23.18:32:06
I attach now the correct file (osxNFD.py) 
In my previous message I attached the .pyc version.
msg9167 (view) Author: jldiaz Date: 2009-04-23.18:27:56
Hi,

I wrote a plugin which solves this issue for me.

I regularly use repositories both in linux and OSX, and generally I was forced
to avoid non-ascii characters in the filename, due to this issue.

Now, with this plugin installed on both sides (Linux and OSX), everything appears
to work fine. I tried to create/edit/delete files with a-acute, a-umlaut, etc. in
the filename, doing commits, pulls, updates, etc. on both platforms, and apparently
Mercurial does not get confused anymore by these characters. So I'm happy :-)

However, I'm not very confident in my programming skills. Basically I edited
win32mbcs, added a unicodedata.normalize call at some points, and deleted other
stuff until it worked. I am not very sure of what I was doing, however, and no
test was performed involving the Win32 architecture.

I attach my solution. Perhaps it can be useful to other people in my situation,
or perhaps someone could revise my code and polish (or bless) it.
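
A minimal sketch of the general idea described above (wrap a function so its string arguments are normalized before use). It is written in present-day Python 3 and is not the attached osxNFD.py; the names normalizing and lookup are hypothetical and exist only for this illustration.

```python
import functools
import unicodedata

def normalizing(func, form="NFC"):
    """Wrap func so that every str argument is normalized before the call."""
    def norm(value):
        # Only text is normalized; other argument types pass through untouched.
        return unicodedata.normalize(form, value) if isinstance(value, str) else value

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*[norm(a) for a in args],
                    **{k: norm(v) for k, v in kwargs.items()})
    return wrapper

def lookup(name):
    # Hypothetical stand-in for a path-handling function that gets wrapped.
    return name.encode("utf-8")

lookup = normalizing(lookup)
print(lookup("a\u0308"))   # the decomposed spelling is normalized before encoding
```

Which normalization form to choose (NFC or NFD) depends on which side is treated as the reference; the sketch simply takes it as a parameter.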
msg8347 (view) Author: ralf1070 Date: 2009-01-08.09:33:29
> braindamaged operating system

hard words ... :)

BTW: To avoid the problem I tried simply renaming such files to the
OSX naming convention. This makes work on systems other than OSX
awkward. The default on Linux is the more compact (precomposed) UTF8
notation, and Linux only compares file names byte-wise, so you cannot
simply type file names in the form that is the default on OSX. So from
a usability point of view, an optional file name normalization to the
system default would be a great thing to have ...

> ... so I'm not likely to tackle it

That's ok - it's _your_ time you spend on this project - you can do
with it whatever you want.

Just for the record - Mercurial is a great project. Kudos to every
contributor :)
msg8346 (view) Author: mpm Date: 2009-01-08.06:59:14
No, there's been absolutely no work on that front. I personally don't have any
braindamaged operating systems to play with so I'm not likely to tackle it.
msg8344 (view) Author: ralf1070 Date: 2009-01-07.18:51:43
For version 1.1 I saw an update note:
- Improved correctness in the face of casefolding filesystems

I had a look at whether this affects the UTF8 normalization problem
on OSX. As far as I can see, it does not. Is this the expected
result, or is there a new option I missed to activate?
msg7115 (view) Author: pirmin Date: 2008-09-13.19:13:28
I'm afraid I can't help with a solution.
I would just like to show how this issue affects my intended use of Mercurial.

Until now I have been using SVN to archive any project data in various repositories.
Some projects consist of only source files. Others also contain different kinds
of documentation files.
For several reasons I don't want to change the names of files that I got from
customers.

Because of the nice features of Mercurial, I started to convert SVN repositories
to Mercurial.
The majority of the files of course come out correctly named. Some of them
however are looking strange.

Here is an example:
- In Subversion: Update der Roadmap für die Genève-Detaillieferungen.rtf
- In Mercurial:  Update der Roadmap für die Genève-Detaillieferungen.rtf

My solution is to convert only those repositories to Mercurial that contain only
simple file names.
I hope this issue can be solved some time, so I won't need to work with
different versioning systems.

Regards
msg6512 (view) Author: mpm Date: 2008-07-14.22:33:26
The groundwork for dealing with this is now in place. It's basically analogous
to the case-insensitivity code we have for Windows.
msg6511 (view) Author: kiilerix Date: 2008-07-14.22:26:48
For reference, report and discussion of similar issue in subversion:
http://subversion.tigris.org/issues/show_bug.cgi?id=2464
http://svn.collab.net/repos/svn/trunk/notes/unicode-composition-for-filenames
msg5856 (view) Author: ralf1070 Date: 2008-04-09.11:45:35
> There IS something you can do: Create an extension.

Ok - I see your point. I will have a look at the actual win32mbcs
extension when I find some spare time. Unfortunately I will not
have time for that in the next several weeks. Comes time, comes
extension ;)

Thanks for the very nice discussion so far.
Regards Ralf
msg5854 (view) Author: kiilerix Date: 2008-04-09.11:05:34
There IS something you can do: Create an extension. Right, win32mbcs can't be
used as it is, it solves another problem, and splitunc is win32-only
functionality. But it shows how file system layer can be monkey-patched and how
you could get the functionality you ask for.

AFAIK posix/unix filenames are always byte sequences. Upper layers can encode
unicode filenames in for example UTF-8. Python handles unicode file names by
encoding them to byte sequences using the global/system encoding. I'm no posix
expert, but I assume that full posix compliance can't be built on top of HFS.

At one abstraction level a filename _is_ a sequence of bytes. That is the level
where Mercurial is. Another level interprets the bytes according to some
encoding and might consider glyph rendering. That is where you would like
Mercurial to be.

You say that Apple deals with Unicode but enforces that only normalized code
points can be used. I would say that "linux" deals with Unicode just as well
(also by using UTF-8 encoding) and allows any code points to be used. Choose one.

Easy "solution": If your policy says that team members shouldn't create casing
dependent filenames, then you could also enforce that only normalized Unicode in
UTF-8 encoding is used. You could perhaps use a trigger to enforce/guide that.
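
A minimal sketch of the kind of check such a trigger could run, in present-day Python 3. The function name unnormalized_names and the sample names are assumptions made for this illustration; wiring the check into an actual Mercurial hook is deliberately left out.

```python
import unicodedata

def unnormalized_names(names, form="NFC"):
    """Return the names that are not already in the given normalization form."""
    return [n for n in names if unicodedata.normalize(form, n) != n]

# 'å' once as a single code point and once as 'a' + combining ring above.
candidates = ["README", "\u00e5.txt", "a\u030a.txt"]
offenders = unnormalized_names(candidates)
print(offenders)   # names a policy check would reject or report
```

A team would have to agree on one form (NFC here) for the policy to be meaningful.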

BTW FWIW: Apparently we have exactly the same issue with å in Danish.
msg5853 (view) Author: ralf1070 Date: 2008-04-09.08:28:14
If it is a core design decision to not deal with system
specific file name problems I can't do anything against it.

Yesterday I read a lot about HFS. I see that they wanted to
be smart, fell short in the first run and then
got overrun by reality. I think the primary decision
"If we deal with unicode, we do it semantically correct"
was the right one. The point of view "a file name is a
sequence of bytes and I, as a system programmer, do not care
about its meaning" is handy, but lazy. Anyway - I can't do
anything about it.

The actual point is, there are different systems out there,
with different limitations. You have case agnostic systems
(at least Apple and Windows), you have at least one system
that tries to actually deal with unicode (Apple), you have
filesystems with special disallowed characters (':' on smb
comes to mind). Thank god 8.3 filesystems seem to be
gone.

From my point of view a SCMS should at least detect if it's
doing something strange - so if it creates files and they
magically disappear within operation it should be possible
to detect that. And it should be possible to detect that
files get mapped over each other - I already had the case
that one super clever member of my development team decided
to have files which only differ in case - which resulted in
really strange compilation problems on non unix platforms.

But if you are able to detect these conflicts once (if you
decide to be interested in such things), you are near to
be able to silently deal with the non conflicting cases.
Maybe I get an option then that solves my problem ;)
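
A rough sketch of such a detection step, in present-day Python 3: names are grouped by a key that approximates what a normalizing, case-insensitive file system would treat as identical. The key function folded is an assumption for illustration and is only an approximation of real file system behavior.

```python
import unicodedata
from collections import defaultdict

def folded(name):
    # Normalize first, then case-fold: two names with the same key would
    # be mapped onto each other on a normalizing, case-insensitive system.
    return unicodedata.normalize("NFC", name).casefold()

def find_collisions(names):
    """Group names that collapse to the same key; return groups with more than one member."""
    groups = defaultdict(list)
    for name in names:
        groups[folded(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]

names = ["foo", "Foo", "\u00e4.c", "a\u0308.c", "bar"]
collisions = find_collisions(names)
print(collisions)   # each group is a set of names that would clash
```

A tool could warn and bail out when the list is non-empty, and proceed silently when it is empty.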

To directly answer your questions:

> What should happen if that ('ä' variants) repo is
> used on Mac?
> And what should happen when it is used on a system
> using an 8-bit dos-ish codepage without ä?

It should warn and bail out. From my point of view
one should have to use some kind of "--force" option
to ignore such conflicts.

I tried win32mbcs - it seems to have some problems on mac.
First I get the following:
[win32mbcs] cannot activate on this platform.

When I override this I get that:
** unknown exception encountered, details follow
** report bug details to http://www.selenic.com/mercurial/bts
** or mercurial@selenic.com
** Mercurial Distributed SCM (version 1.0)
Traceback (most recent call last):
  File "/opt/local/bin/hg", line 20, in <module>
    mercurial.dispatch.run()
  File "/opt/local/lib/python2.5/site-packages/mercurial/dispatch.py", line 20, in run
    sys.exit(dispatch(sys.argv[1:]))
  File "/opt/local/lib/python2.5/site-packages/mercurial/dispatch.py", line 29, in dispatch
    return _runcatch(u, args)
  File "/opt/local/lib/python2.5/site-packages/mercurial/dispatch.py", line 45, in _runcatch
    return _dispatch(ui, args)
  File "/opt/local/lib/python2.5/site-packages/mercurial/dispatch.py", line 340, in _dispatch
    repo = hg.repository(ui, path=path)
  File "/opt/local/lib/python2.5/site-packages/mercurial/hg.py", line 65, in repository
    hook(ui, repo)
  File "/opt/local/lib/python2.5/site-packages/hgext/win32mbcs.py", line 155, in reposetup
    install()
  File "/opt/local/lib/python2.5/site-packages/hgext/win32mbcs.py", line 127, in install
    os.path.splitunc = wrap(os.path.splitunc)
AttributeError: 'module' object has no attribute 'splitunc'

Am I right when I assume python does not expect unicode
file names on a unix like OS?
msg5852 (view) Author: djc Date: 2008-04-09.07:07:13
I think trying to convert filenames from filesystemencoding to unicode, then
comparing them, might not be too bad?
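
A minimal sketch of that idea in present-day Python 3, under the assumption that the file system encoding is UTF-8: byte names are decoded, normalized, and then matched. The helper names normalized_key and match_tracked are hypothetical.

```python
import unicodedata

def normalized_key(raw, encoding="utf-8"):
    # Decode the byte string, then normalize so equivalent spellings compare equal.
    return unicodedata.normalize("NFC", raw.decode(encoding))

def match_tracked(tracked, on_disk, encoding="utf-8"):
    """Map each tracked name to the on-disk name with the same normalized form, or None."""
    disk_index = {normalized_key(name, encoding): name for name in on_disk}
    return {name: disk_index.get(normalized_key(name, encoding)) for name in tracked}

tracked = [b"\xc3\xa4"]      # name as stored in the repository
on_disk = [b"a\xcc\x88"]     # name as returned by the OS X file system
matches = match_tracked(tracked, on_disk)
print(matches)
```

The sketch does not address what should happen when two on-disk names share one normalized form, which is possible on Linux.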
msg5850 (view) Author: kiilerix Date: 2008-04-09.00:37:49
I can understand your standpoint. For some purposes some "smart" handling of
filenames could be nice. For other purposes "smart" handling of file content is
needed. (And as Matt argues: if you have "smart" handling of one then you want it
for both.)

But AFAIK it is a design decision for core Mercurial that it doesn't try to be
smart. "Smartness" has been contributed as extensions - for example
KeywordExtension and Win32Extension. And especially win32mbcs extension looks
like what you are looking for.

You assume that Mercurial supports Unicode. It doesn't. But because it is
(almost) binary clean it can handle filenames as byte sequences in (almost) any
encoding. Mercurial doesn't know and doesn't care which encoding is used.

I'm sure that someone could make an extension that does what you request. It
might even work and solve your problem.

But on Unix it is and should be possible to create a repository containing files
with both variants of ä and their corresponding uppercase variants. What
should happen if that repo is used on Mac? And should it depend on whether HFS
has been configured to be case sensitive or not? And what should happen when it
is used on a system using an 8-bit dos-ish codepage without ä?

Encodings are for presentation only. Any attempt to put semantics into them, or
into which code points look similar, is doomed to fail. IMHO ;-)
msg5843 (view) Author: ralf1070 Date: 2008-04-08.09:54:41
> I assume that 
>  python -c 'open("\xc3\xa4".decode("utf-8"), "w")'
>  ls | hexdump -C
> will show that the file system doesn't return the files it has
> been given.

Your assumption is right:
00000000  61 cc 88 0a                                       |a?..|

> Checking out a repository on a system where the file can't be stored
> in the file system is "don't"

I'm a programmer myself - so I can at least understand your view. When
programming I use english as language of choice - it makes code more
portable: I can share it with anyone anywhere, no encoding problems
raised ... anyway.

You try to interpret a filesystem as a byte heap - you put bytes in and
get bytes out. For the actual file data I'm with you - a filesystem should
never touch data. For file names I'm biased. There was no problem with this
point of view when 8 bit encodings were used - nobody had space left to
implement the same character twice in such an encoding. Ok - it was never
really clear which encoding had been used where - but yes, you could see
file names as a heap of bytes, compare them byte-wise and you were done.

Then unicode was invented and, more importantly, used. Ok, we can have
every character of nearly every language in the world in any file name we
want. But we have an overhead now - _using_ unicode also means that we must
deal with the whole complexity of it. And one part of it _is_ that there are
different byte strings with the very same meaning. In unicode it is
perfectly legal to represent e.g. the German 'umlaut a' as the character
itself or as an 'a' with a combining character. To compare strings one has
to use a normalized form of both of them.
(see: http://www.unicode.org/unicode/reports/tr15/index.html).
To interpret two different unicode representations of the same character as
different names actually _is_ wrong.
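
A small illustration of this point, written in present-day Python 3 (the thread itself uses Python 2). The helper same_name is an assumption for this sketch; unicodedata.normalize is the standard library function implementing the forms described in the report linked above.

```python
import unicodedata

def same_name(a, b, form="NFC"):
    """Compare two names after bringing both to the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

precomposed = "\u00e4"    # 'ä' as a single code point
combining = "a\u0308"     # 'a' followed by the combining diaeresis

print(precomposed == combining)            # plain code point comparison
print(len(precomposed), len(combining))    # the two spellings differ in length
print(same_name(precomposed, combining))   # comparison after normalization
```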

To make it more clear: the Linux behavior, to be able to have two different
file names which are actually the same unicode character - just in two
different representations, is much more broken than the Apple statement
"we try to use a normalized representation but sometimes we failed".

Yes, unicode is bloated. Someone should have hit the
unicode inventors with a _big_ stick. But they are done now - we have to
live with the result.

So - "system where the file can't be stored in the filesystem" - is not
actually true. The filesystem just knows about unicode and normalizes names
to not have to deal with name clashes of the special kind. In this special
case Mercurial does not do its job - it has been told to use unicode but
it does it only halfway.

> On my system bash and python doesn't handle composed unicode characters
> as some think they should:
> python -c 'print len("\x61\xcc\x88".decode("utf-8")) == 2'

It's the same on OS X. Actually I assume this may be correct because this
is an 'a' followed by the combining character for the two dots above it.

When I read the standard I can believe they may have decided that you can
have two strings with different length which are equal anyway ... *argl* ...
msg5841 (view) Author: kiilerix Date: 2008-04-07.20:53:00
(MPM answered while I was writing - but now I have written it and will post
anyway ;-)

I assume that 
 python -c 'open("\xc3\xa4".decode("utf-8"), "w")'
 ls | hexdump -C
will show that the file system doesn't return the files it has been given.
Admitted by apple at http://developer.apple.com/qa/qa2001/qa1173.html - some
call that "Utter crap" (http://ln-s.net/1mk5).

On Linux:
$ python -c 'open("\xc3\xa4".decode("utf-8"), "w")'
$ python -c 'open("\x61\xcc\x88".decode("utf-8"), "w")'
$ hg com -A -m test
$ ls .hg/store/data/
a~cc~88.i  ~c3~a4.i
$ rm -rf *
$ hg up
$ ls | hexdump -C
00000000  61 cc 88 0a c3 a4 0a                              |a......|
00000007
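
Judging only from the listing above, the store appears to write each non-ASCII byte of a file name as ~ followed by two hex digits. The hypothetical helper below, in present-day Python 3, mimics just that one aspect to make the two names easier to read; it is not Mercurial's actual encoding routine and ignores anything else the real store encoding does.

```python
def escape_non_ascii(name):
    """Escape every byte outside printable ASCII as ~xx (simplified illustration)."""
    return "".join(chr(b) if 32 <= b < 127 else "~%02x" % b for b in name)

precomposed = escape_non_ascii(b"\xc3\xa4") + ".i"    # 'ä' as one code point
decomposed = escape_non_ascii(b"a\xcc\x88") + ".i"    # 'a' + combining diaeresis
print(precomposed)
print(decomposed)
```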

Checking out a repository on a system where the file can't be stored in the file
system is "don't" - just like windows users should stay away from repos having
"con", the file ":\foo ", or both "foo" and "Foo".

A workaround could require some kind of translation layer between core Mercurial
and the file system. That could perhaps be done in a monkey-patching extension.
But what should the extension do?

(On my system bash and python don't handle composed unicode characters as some
think they should: python -c 'print len("\x61\xcc\x88".decode("utf-8")) == 2' -
that's slightly crappy too)
msg5840 (view) Author: mpm Date: 2008-04-07.20:13:24
Mercurial is perfectly consistent: it does not alter the bytes presented for
filenames at all. It's your operating system doing the mangling.

Your operating system probably thinks it's being clever and helpful by quietly
transcoding everything to Apple's modified variant D form of UTF-16 and back,
but it's instead creating a portability nightmare for you. If you've got a file
named ä.c in your repo and it's mentioned in a Makefile (or any number of
similar scenarios), your build will subtly break when moved between systems that
mangle and don't mangle.

Similarly, if you copy ä from Linux and back, you can expect to end up with two
files named ä with different encodings.
msg5839 (view) Author: ralf1070 Date: 2008-04-07.17:03:17
OSX version is 10.4.11
msg5838 (view) Author: ralf1070 Date: 2008-04-07.17:01:08
When creating filenames on OSX, Mercurial 1.0 behaves inconsistently:

The repository file:
.hg/store/data/~c3~a4.i

(which represents a german "umlaut a" as filename)

gives after "hg update" the following byte sequence:
> ls | hexdump -C
00000000  61 cc 88 0a    |a?..|

which is not totally wrong as this is another valid UTF-8 representation
of the German "umlaut a" - but at least for Mercurial it is misleading:
> hg stat
? ä

There are two valid behaviors for mercurial:
1. make sure it creates the right sequence of bytes
or
2. accept the other valid UTF-8 notation it creates
 
Regards
History
Date | User | Action | Args
2011-12-30 22:15:16 | mpm | set | status: chatting -> resolved; nosy: mpm, tksoh, kiilerix, mg, djc, ralf1070, pirmin, cyanite, danchr, jldiaz, semtlnori; messages: + msg18456
2011-09-14 11:41:38 | kiilerix | set | nosy: mpm, tksoh, kiilerix, mg, djc, ralf1070, pirmin, cyanite, danchr, jldiaz, semtlnori; messages: + msg17441
2011-09-14 06:04:04 | semtlnori | set | files: + osxNFD.py.patch; nosy: + semtlnori; messages: + msg17435
2010-03-12 21:54:17 | danchr | set | nosy: + danchr
2009-04-23 18:32:06 | jldiaz | set | files: + osxNFD.py; nosy: mpm, tksoh, kiilerix, mg, djc, ralf1070, pirmin, cyanite, jldiaz; messages: + msg9168
2009-04-23 18:30:30 | mg | set | nosy: + mg
2009-04-23 18:28:03 | jldiaz | set | files: + osxNFD.pyc; nosy: + jldiaz; messages: + msg9167
2009-01-08 09:33:32 | ralf1070 | set | nosy: mpm, tksoh, kiilerix, djc, ralf1070, pirmin, cyanite; messages: + msg8347
2009-01-08 06:59:15 | mpm | set | nosy: mpm, tksoh, kiilerix, djc, ralf1070, pirmin, cyanite; messages: + msg8346
2009-01-07 18:51:46 | ralf1070 | set | topic: + casefolding; nosy: mpm, tksoh, kiilerix, djc, ralf1070, pirmin, cyanite; messages: + msg8344
2008-09-17 19:28:16 | cyanite | set | nosy: + cyanite
2008-09-13 19:13:30 | pirmin | set | nosy: + pirmin; messages: + msg7115
2008-07-14 22:33:28 | mpm | set | nosy: mpm, tksoh, kiilerix, djc, ralf1070; messages: + msg6512
2008-07-14 22:26:51 | kiilerix | set | nosy: mpm, tksoh, kiilerix, djc, ralf1070; messages: + msg6511
2008-04-17 00:59:59 | tksoh | set | nosy: + tksoh
2008-04-09 11:45:35 | ralf1070 | set | nosy: mpm, kiilerix, djc, ralf1070; messages: + msg5856
2008-04-09 11:05:36 | kiilerix | set | nosy: mpm, kiilerix, djc, ralf1070; messages: + msg5854
2008-04-09 08:28:15 | ralf1070 | set | nosy: mpm, kiilerix, djc, ralf1070; messages: + msg5853
2008-04-09 07:07:14 | djc | set | nosy: + djc; messages: + msg5852
2008-04-09 00:37:52 | kiilerix | set | nosy: mpm, kiilerix, ralf1070; messages: + msg5850
2008-04-08 09:54:41 | ralf1070 | set | nosy: mpm, kiilerix, ralf1070; messages: + msg5843
2008-04-07 20:53:02 | kiilerix | set | nosy: + kiilerix; messages: + msg5841
2008-04-07 20:13:26 | mpm | set | nosy: + mpm; messages: + msg5840
2008-04-07 17:03:17 | ralf1070 | set | status: unread -> chatting; messages: + msg5839
2008-04-07 17:01:10 | ralf1070 | create |