Created on 2008-04-07.17:01:10 by ralf1070, last changed 2011-12-30.22:15:16 by mpm.
| File name | Uploaded | Type |
| osxNFD.py | jldiaz, 2009-04-23.18:32:06 | text/x-python |
| osxNFD.py.patch | semtlnori, 2011-09-14.06:04:04 | application/octet-stream |
| osxNFD.pyc | jldiaz, 2009-04-23.18:27:56 | application/x-python-code |
| msg18456 |
Author: mpm |
Date: 2011-12-30.22:15:16 |
|
Fixed in 2.0.1
|
| msg17441 |
Author: kiilerix |
Date: 2011-09-14.11:41:38 |
|
jldiaz / semtlnori: Please host the extension somewhere as described on
http://mercurial.selenic.com/wiki/PublishingExtensions so other users can
benefit from your extension too.
|
| msg17435 |
Author: semtlnori |
Date: 2011-09-14.06:04:04 |
|
jldiaz//
Here is my patch for your extension 'osxNFD.py' to fix an AttributeError.
It finally works well for me! (Mac OS X 10.7.1 and Mercurial 1.9.1)
|
| msg9168 |
Author: jldiaz |
Date: 2009-04-23.18:32:06 |
|
I attach now the correct file (osxNFD.py)
In my previous message I attached the .pyc version.
|
| msg9167 |
Author: jldiaz |
Date: 2009-04-23.18:27:56 |
|
Hi,
I wrote a plugin which solves this issue for me.
I regularly use repositories on both Linux and OS X, and I was generally forced
to avoid non-ASCII characters in filenames because of this issue.
Now, with this plugin installed on both sides (Linux and OS X), everything appears to
work fine. I tried creating, editing, and deleting files with a-acute, a-umlaut, etc. in the
filename, doing commits, pulls, updates, etc. on both platforms, and apparently
Mercurial no longer gets confused by these characters. So I'm happy :-)
However, I'm not very confident in my programming skills. Basically I edited
win32mbcs, added unicodedata.normalize calls at some points, and deleted other stuff
until it worked. I was not very sure of what I was doing, however. And no test was
performed involving the Win32 architecture.
I attach my solution. Perhaps it can be useful to other people in my situation,
or perhaps someone could review my code and polish (or bless) it.
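The wrapping approach described in this message can be sketched roughly as follows. This is not the attached osxNFD.py; it is a minimal illustration in Python 3, and the names `normalizing` and `fake_listdir` are hypothetical. It assumes that file names arrive as `str` and that normalizing whatever a file system function returns is enough for the caller.

```python
import unicodedata


def normalizing(func, form="NFC"):
    """Wrap func so that str results (also inside lists/tuples) are normalized."""

    def fix(value):
        if isinstance(value, str):
            return unicodedata.normalize(form, value)
        if isinstance(value, (list, tuple)):
            # Rebuild the same container type with normalized members.
            return type(value)(fix(v) for v in value)
        return value

    def wrapper(*args, **kwargs):
        return fix(func(*args, **kwargs))

    return wrapper


def fake_listdir(path):
    # Stand-in for os.listdir on a file system that hands back decomposed names.
    return ["a\u0308.txt", "plain.txt"]


listdir_nfc = normalizing(fake_listdir)
names = listdir_nfc(".")
print(names)
```

A real extension would have to wrap every function through which names enter or leave the file system layer, which is a much larger surface than this single call.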
|
| msg8347 |
Author: ralf1070 |
Date: 2009-01-08.09:33:29 |
|
> braindamaged operating system
harsh words ... :)
BTW: To avoid the problem I tried simply renaming such files to follow the
OS X naming convention. This makes work on systems other than OS X
awkward. Since the default on Linux is the more compact UTF-8 notation
and Linux only compares file names byte-wise, you are unable
to type file names the way they are stored by default on OS X. So from a usability
point of view, optional file name normalization to the system
default would be a great thing to have ...
> ... so I'm not likely to tackle it
That's OK - it's _your_ time you spend on this project - you can do
with it whatever you want.
Just for the record - Mercurial is a great project. Kudos to every
contributor :)
|
| msg8346 |
Author: mpm |
Date: 2009-01-08.06:59:14 |
|
No, there's been absolutely no work on that front. I personally don't have any
braindamaged operating systems to play with so I'm not likely to tackle it.
|
| msg8344 |
Author: ralf1070 |
Date: 2009-01-07.18:51:43 |
|
For version 1.1 I saw an update note:
- Improved correctness in the face of casefolding filesystems
I checked whether this affects the UTF-8 normalization problem
on OS X. As far as I can see, it does not. Is this the expected
result, or is there a new option I failed to activate?
|
| msg7115 |
Author: pirmin |
Date: 2008-09-13.19:13:28 |
|
I'm afraid I can't help with a solution.
I would just like to show how this issue affects my intended use of Mercurial.
Until now I have been using SVN to archive project data in various repositories.
Some projects consist only of source files. Others also contain different kinds
of documentation files.
For several reasons I don't want to change the names of files that I got from
customers.
Because of the nice features of Mercurial, I started to convert SVN repositories
to Mercurial.
The majority of the files of course come out correctly named. Some of them,
however, look strange.
Here is an example:
- In Subversion: Update der Roadmap für die Genève-Detaillieferungen.rtf
- In Mercurial: Update der Roadmap für die Genève-Detaillieferungen.rtf
My workaround is to convert to Mercurial only those repositories which contain only
simple file names.
I hope this issue can be solved some time, so that I won't need to work with
different versioning systems.
Regards
|
| msg6512 |
Author: mpm |
Date: 2008-07-14.22:33:26 |
|
The groundwork for dealing with this is now in place. It's basically analogous
to the case-insensitivity code we have for Windows.
|
| msg6511 |
Author: kiilerix |
Date: 2008-07-14.22:26:48 |
|
For reference, report and discussion of similar issue in subversion:
http://subversion.tigris.org/issues/show_bug.cgi?id=2464
http://svn.collab.net/repos/svn/trunk/notes/unicode-composition-for-filenames
|
| msg5856 |
Author: ralf1070 |
Date: 2008-04-09.11:45:35 |
|
> There IS something you can do: Create an extension.
OK - I see your point. I will have a look at the current win32mbcs
extension when I find some spare time. Unfortunately I will not
have time for that in the next several weeks. Comes time, comes
extension ;)
Thanks for the very nice discussion so far.
Regards Ralf
|
| msg5854 |
Author: kiilerix |
Date: 2008-04-09.11:05:34 |
|
There IS something you can do: Create an extension. Right, win32mbcs can't be
used as it is, it solves another problem, and splitunc is win32-only
functionality. But it shows how the file system layer can be monkey-patched and how
you could get the functionality you ask for.
AFAIK posix/unix filenames are always byte sequences. Upper layers can encode
Unicode filenames in, for example, UTF-8. Python handles Unicode file names by
encoding them to byte sequences using the global/system encoding. I'm no posix
expert, but I assume that full posix compliance can't be built on top of HFS.
At one abstraction level a filename _is_ a sequence of bytes. That is the level
where Mercurial is. Another level interprets the bytes according to some
encoding and might consider glyph rendering. That is where you would like
Mercurial to be.
You say that Apple deals with Unicode but enforces that only normalized code
points can be used. I would say that "linux" deals with Unicode just as well
(also by using UTF-8 encoding) and allows any code points to be used. Choose one.
Easy "solution": If your policy says that team members shouldn't create
case-dependent filenames, then you could also enforce that only normalized Unicode in
UTF-8 encoding is used. You could perhaps use a trigger to enforce/guide that.
BTW FWIW: Apparently we have exactly the same issue with å in Danish.
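The "easy solution" of enforcing normalized UTF-8 names could be checked with something like the sketch below. It is only an illustration in Python 3: the function name `non_normalized` is hypothetical, the choice of NFC as the required form is an assumption, and how the list of names would be obtained from a trigger or hook is deliberately left out.

```python
import unicodedata


def non_normalized(names, encoding="utf-8", form="NFC"):
    """Return the byte names that are not valid `encoding` or not in `form`."""
    bad = []
    for raw in names:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            # Not decodable at all, so it cannot be normalized UTF-8.
            bad.append(raw)
            continue
        if unicodedata.normalize(form, text) != text:
            bad.append(raw)
    return bad


offenders = non_normalized(
    [b"\xc3\xa4.txt", b"a\xcc\x88.txt", b"\xe4.txt", b"plain.txt"]
)
print(offenders)
```

A policy check like this only reports offending names; whether to reject the change or merely warn is a separate decision.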
|
| msg5853 |
Author: ralf1070 |
Date: 2008-04-09.08:28:14 |
|
If it is a core design decision not to deal with system-specific
file name problems, I can't do anything about it.
Yesterday I read a lot about HFS. I see that they wanted to
be smart, fell short on the first attempt, and then
got overrun by reality. I think the primary decision
"If we deal with Unicode, we do it semantically correctly"
was the right one. The point of view "a file name is a
sequence of bytes and I, as a system programmer, do not care
about its meaning" is handy, but lazy. Anyway - I can't do
anything about it.
The actual point is, there are different systems out there,
with different limitations. You have case-insensitive systems
(at least Apple and Windows), you have at least one system
that tries to actually deal with Unicode (Apple), you have
filesystems with special disallowed characters (':' on SMB
comes to mind). Thank god 8.3 filesystems seem to be
gone.
From my point of view an SCMS should at least detect when it's
doing something strange - so if it creates files and they
magically disappear during the operation, it should be possible
to detect that. And it should be possible to detect that
files get mapped onto each other - I already had the case
that one super clever member of my development team decided
to have files which differ only in case - which resulted in
really strange compilation problems on non-Unix platforms.
But once you are able to detect these conflicts (if you
decide to be interested in such things), you are close to
being able to silently deal with the non-conflicting cases.
Maybe I'll get an option then that solves my problem ;)
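The detection asked for here, names that would be mapped onto each other on a case-insensitive or normalizing file system, can be sketched as below. This is an illustration in Python 3, not Mercurial code; `collision_key` and `find_collisions` are hypothetical names, and the key (NFC plus `str.casefold`) is only an approximation of what any particular file system actually does.

```python
import unicodedata
from collections import defaultdict


def collision_key(name):
    """Approximate key under which a folding/normalizing file system compares names."""
    folded = unicodedata.normalize("NFC", name).casefold()
    # Normalize again, since case folding can produce non-normalized text.
    return unicodedata.normalize("NFC", folded)


def find_collisions(names):
    """Group names that share a collision key; return groups with more than one member."""
    groups = defaultdict(list)
    for name in names:
        groups[collision_key(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


names = ["foo", "Foo", "\u00e4.c", "a\u0308.c", "bar"]
collisions = find_collisions(names)
print(collisions)
```

Names that end up alone in their group are the non-conflicting cases that could be handled silently; the groups returned here are the ones that would need a warning or a "--force".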
To directly answer your questions:
> What should happen if that ('ä' variants) repo is
> used on Mac?
> And what should happen when it is used on a system
> using an 8-bit dos-ish codepage without ä?
It should warn and bail out. From my point of view
one should have to use some kind of "--force" option
to ignore such conflicts.
I tried win32mbcs - it seems to have some problems on the Mac.
First I get the following:
[win32mbcs] cannot activate on this platform.
When I override this I get this:
** unknown exception encountered, details follow
** report bug details to http://www.selenic.com/mercurial/bts
** or mercurial@selenic.com
** Mercurial Distributed SCM (version 1.0)
Traceback (most recent call last):
  File "/opt/local/bin/hg", line 20, in <module>
    mercurial.dispatch.run()
  File "/opt/local/lib/python2.5/site-packages/mercurial/dispatch.py", line 20, in run
    sys.exit(dispatch(sys.argv[1:]))
  File "/opt/local/lib/python2.5/site-packages/mercurial/dispatch.py", line 29, in dispatch
    return _runcatch(u, args)
  File "/opt/local/lib/python2.5/site-packages/mercurial/dispatch.py", line 45, in _runcatch
    return _dispatch(ui, args)
  File "/opt/local/lib/python2.5/site-packages/mercurial/dispatch.py", line 340, in _dispatch
    repo = hg.repository(ui, path=path)
  File "/opt/local/lib/python2.5/site-packages/mercurial/hg.py", line 65, in repository
    hook(ui, repo)
  File "/opt/local/lib/python2.5/site-packages/hgext/win32mbcs.py", line 155, in reposetup
    install()
  File "/opt/local/lib/python2.5/site-packages/hgext/win32mbcs.py", line 127, in install
    os.path.splitunc = wrap(os.path.splitunc)
AttributeError: 'module' object has no attribute 'splitunc'
Am I right in assuming that Python does not expect Unicode
file names on a Unix-like OS?
|
| msg5852 |
Author: djc |
Date: 2008-04-09.07:07:13 |
|
I think trying to convert filenames from filesystemencoding to unicode, then
comparing them, might not be too bad?
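The decode-then-compare idea could look roughly like this. It is a sketch in Python 3 with hypothetical names (`canonical`, `match_on_disk`); it assumes the names are UTF-8 and additionally normalizes to NFC after decoding, because decoding alone would still leave the two forms of 'ä' unequal.

```python
import unicodedata


def canonical(name, encoding="utf-8"):
    """Decode a byte file name and bring it into one normalization form."""
    return unicodedata.normalize("NFC", name.decode(encoding))


def match_on_disk(tracked, on_disk, encoding="utf-8"):
    """Map each on-disk name to the canonically equal tracked name, or None."""
    by_key = {canonical(t, encoding): t for t in tracked}
    return {d: by_key.get(canonical(d, encoding)) for d in on_disk}


tracked = [b"\xc3\xa4", b"README"]
on_disk = [b"a\xcc\x88", b"README", b"new.txt"]
result = match_on_disk(tracked, on_disk)
print(result)
```

Names that cannot be decoded with the assumed encoding raise `UnicodeDecodeError` here; a real implementation would need a policy for those.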
|
| msg5850 |
Author: kiilerix |
Date: 2008-04-09.00:37:49 |
|
I can understand your standpoint. For some purposes some "smart" handling of
filenames could be nice. For other purposes "smart" handling of file content is
needed. (And as Matt argues: if you have "smart" handling of one then you want it
for both.)
But AFAIK it is a design decision for core Mercurial that it doesn't try to be
smart. "Smartness" has been contributed as extensions - for example
KeywordExtension and Win32Extension. And especially win32mbcs extension looks
like what you are looking for.
You assume that Mercurial supports Unicode. It doesn't. But because it is
(almost) binary clean it can handle filenames as byte sequences in (almost) any
encoding. Mercurial doesn't know and doesn't care which encoding is used.
I'm sure that someone could make an extension that does what you request. It
might even work and solve your problem.
But on Unix it is and should be possible to create a repository containing files
with both variants of ä and their corresponding uppercase variants. What
should happen if that repo is used on a Mac? And should it depend on whether HFS
has been configured to be case sensitive or not? And what should happen when it
is used on a system using an 8-bit dos-ish codepage without ä?
Encodings are for presentation only. Any attempt to put semantics into them, or
into which code points look similar, is doomed to fail. IMHO ;-)
|
| msg5843 |
Author: ralf1070 |
Date: 2008-04-08.09:54:41 |
|
> I assume that
> python -c 'open("\xc3\xa4".decode("utf-8"), "w")'
> ls | hexdump -C
> will show that the file system doesn't return the files it has
> been given.
Your assumption is right:
00000000 61 cc 88 0a |a?..|
> Checking out a repository on a system where the file can't be stored
> in the file system is "don't"
I'm a programmer myself - so I can at least understand your view. When
programming I use English as my language of choice - it makes code more
portable: I can share it with anyone anywhere, with no encoding problems
... anyway.
You try to interpret a filesystem as a byte heap - you put bytes in and
get bytes out. For the actual file data I'm with you - a filesystem should
never touch data. For file names I'm biased. There was no problem with this
point of view when 8-bit encodings were used - nobody had space left to
implement the same character twice in such an encoding. OK - it was never
really clear which encoding had been used where - but yes, you could see
file names as a heap of bytes, compare them byte-wise and you were done.
Then Unicode was invented and, more importantly, used. OK, we can have
every character of nearly every language in the world in any file name we
want. But we have an overhead now - _using_ Unicode also means that we must
deal with its whole complexity. And one part of it _is_ that there are
different byte strings with the very same meaning. In Unicode it is
perfectly legal to represent e.g. the German 'umlaut a' as the character
itself or as an 'a' with a combining character. To compare strings one has
to use a normalized form of both of them
(see: http://www.unicode.org/unicode/reports/tr15/index.html).
To interpret two different Unicode representations of the same character as
different names actually _is_ wrong.
To make it clearer: the Linux behavior, being able to have two different
file names which are actually the same Unicode character - just in two
different representations - is much more broken than the Apple statement
"we try to use a normalized representation but sometimes we failed".
Yes, Unicode is bloated. Someone should have hit the
Unicode inventors with a _big_ stick. But they are done now - we have to
live with the result.
So - "system where the file can't be stored in the filesystem" - is not
actually true. The filesystem just knows about Unicode and normalizes names
so that it does not have to deal with name clashes of this special kind. In this special
case Mercurial does not do its job - it has been told to use Unicode but
it does it only halfway.
> On my system bash and python doesn't handle composed unicode characters
> as some think they should:
> python -c 'print len("\x61\xcc\x88".decode("utf-8")) == 2'
It's the same on OS X. Actually I assume this may be correct because this
is an 'a' followed by the combining character for two dots above it.
Having read the standard, I can believe they decided that you can
have two strings of different lengths which are equal anyway ... *argl* ...
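The two representations discussed in this message, and the effect of normalizing before comparing, can be reproduced with Python's standard `unicodedata` module. This is a small Python 3 illustration; the variable names are chosen here and are not from the thread.

```python
import unicodedata

# Precomposed form: U+00E4, the bytes stored in the repository (c3 a4).
composed = b"\xc3\xa4".decode("utf-8")
# Decomposed form: U+0061 U+0308, the bytes returned by OS X (61 cc 88).
decomposed = b"\x61\xcc\x88".decode("utf-8")

same_raw = composed == decomposed
lengths = (len(composed), len(decomposed))
same_nfc = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)
# Decomposing the precomposed form yields the byte sequence seen on OS X.
nfd_bytes = unicodedata.normalize("NFD", composed).encode("utf-8")

print(same_raw, lengths, same_nfc, nfd_bytes)
```

The strings differ in length and compare unequal code point by code point, yet they become equal once both are brought into the same normalization form.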
|
| msg5841 |
Author: kiilerix |
Date: 2008-04-07.20:53:00 |
|
(MPM answered while I was writing - but now I have written it and will post
anyway ;-)
I assume that
python -c 'open("\xc3\xa4".decode("utf-8"), "w")'
ls | hexdump -C
will show that the file system doesn't return the files it has been given.
Apple admits this at http://developer.apple.com/qa/qa2001/qa1173.html - some
call that "Utter crap" (http://ln-s.net/1mk5).
On Linux:
$ python -c 'open("\xc3\xa4".decode("utf-8"), "w")'
$ python -c 'open("\x61\xcc\x88".decode("utf-8"), "w")'
$ hg com -A -m test
$ ls .hg/store/data/
a~cc~88.i ~c3~a4.i
$ rm -rf *
$ hg up
$ ls | hexdump -C
00000000 61 cc 88 0a c3 a4 0a |a......|
00000007
Checking out a repository on a system where the file can't be stored in the file
system is "don't" - just like windows users should stay away from repos having
"con", the file ":\foo ", or both "foo" and "Foo".
A workaround could require some kind of translation layer between core Mercurial
and the file system. That could perhaps be done in a monkey-patching extension.
But what should the extension do?
(On my system bash and python doesn't handle composed unicode characters as some
think they should: python -c 'print len("\x61\xcc\x88".decode("utf-8")) == 2' -
that's slightly crappy too)
|
| msg5840 |
Author: mpm |
Date: 2008-04-07.20:13:24 |
|
Mercurial is perfectly consistent: it does not alter the bytes presented for
filenames at all. It's your operating system doing the mangling.
Your operating system probably thinks it's being clever and helpful by quietly
transcoding everything to Apple's modified variant D form of UTF-16 and back,
but it's instead creating a portability nightmare for you. If you've got a file
named ä.c in your repo and it's mentioned in a Makefile (or any number of
similar scenarios), your build will subtly break when moved between systems that
mangle and don't mangle.
Similarly, if you copy ä from Linux and back, you can expect to end up with two
files named ä with different encodings.
|
| msg5839 |
Author: ralf1070 |
Date: 2008-04-07.17:03:17 |
|
OSX version is 10.4.11
|
| msg5838 |
Author: ralf1070 |
Date: 2008-04-07.17:01:08 |
|
When creating filenames on OS X, Mercurial 1.0 behaves inconsistently:
The repository file:
.hg/store/data/~c3~a4.i
(which represents a German "umlaut a" as filename)
gives the following byte sequence after "hg update":
> ls | hexdump -C
00000000 61 cc 88 0a |a?..|
which is not totally wrong, as this is another valid UTF-8 representation
of the German "umlaut a" - but at least for Mercurial it is misleading:
> hg stat
? ä
There are two valid behaviors for mercurial:
1. make sure it creates the right sequence of bytes
or
2. accept the other valid UTF-8 notation it creates
Regards
|
| Date | User | Action | Args |
| 2011-12-30 22:15:16 | mpm | set | status: chatting -> resolved; nosy: mpm, tksoh, kiilerix, mg, djc, ralf1070, pirmin, cyanite, danchr, jldiaz, semtlnori; messages: + msg18456 |
| 2011-09-14 11:41:38 | kiilerix | set | nosy: mpm, tksoh, kiilerix, mg, djc, ralf1070, pirmin, cyanite, danchr, jldiaz, semtlnori; messages: + msg17441 |
| 2011-09-14 06:04:04 | semtlnori | set | files: + osxNFD.py.patch; nosy: + semtlnori; messages: + msg17435 |
| 2010-03-12 21:54:17 | danchr | set | nosy: + danchr |
| 2009-04-23 18:32:06 | jldiaz | set | files: + osxNFD.py; nosy: mpm, tksoh, kiilerix, mg, djc, ralf1070, pirmin, cyanite, jldiaz; messages: + msg9168 |
| 2009-04-23 18:30:30 | mg | set | nosy: + mg |
| 2009-04-23 18:28:03 | jldiaz | set | files: + osxNFD.pyc; nosy: + jldiaz; messages: + msg9167 |
| 2009-01-08 09:33:32 | ralf1070 | set | nosy: mpm, tksoh, kiilerix, djc, ralf1070, pirmin, cyanite; messages: + msg8347 |
| 2009-01-08 06:59:15 | mpm | set | nosy: mpm, tksoh, kiilerix, djc, ralf1070, pirmin, cyanite; messages: + msg8346 |
| 2009-01-07 18:51:46 | ralf1070 | set | topic: + casefolding; nosy: mpm, tksoh, kiilerix, djc, ralf1070, pirmin, cyanite; messages: + msg8344 |
| 2008-09-17 19:28:16 | cyanite | set | nosy: + cyanite |
| 2008-09-13 19:13:30 | pirmin | set | nosy: + pirmin; messages: + msg7115 |
| 2008-07-14 22:33:28 | mpm | set | nosy: mpm, tksoh, kiilerix, djc, ralf1070; messages: + msg6512 |
| 2008-07-14 22:26:51 | kiilerix | set | nosy: mpm, tksoh, kiilerix, djc, ralf1070; messages: + msg6511 |
| 2008-04-17 00:59:59 | tksoh | set | nosy: + tksoh |
| 2008-04-09 11:45:35 | ralf1070 | set | nosy: mpm, kiilerix, djc, ralf1070; messages: + msg5856 |
| 2008-04-09 11:05:36 | kiilerix | set | nosy: mpm, kiilerix, djc, ralf1070; messages: + msg5854 |
| 2008-04-09 08:28:15 | ralf1070 | set | nosy: mpm, kiilerix, djc, ralf1070; messages: + msg5853 |
| 2008-04-09 07:07:14 | djc | set | nosy: + djc; messages: + msg5852 |
| 2008-04-09 00:37:52 | kiilerix | set | nosy: mpm, kiilerix, ralf1070; messages: + msg5850 |
| 2008-04-08 09:54:41 | ralf1070 | set | nosy: mpm, kiilerix, ralf1070; messages: + msg5843 |
| 2008-04-07 20:53:02 | kiilerix | set | nosy: + kiilerix; messages: + msg5841 |
| 2008-04-07 20:13:26 | mpm | set | nosy: + mpm; messages: + msg5840 |
| 2008-04-07 17:03:17 | ralf1070 | set | status: unread -> chatting; messages: + msg5839 |
| 2008-04-07 17:01:10 | ralf1070 | create | |