sg16: Re: [SG16-Unicode] P1689: Encoding of filenames for interchange

From: Niall Douglas <s_sourceforge_at_[hidden]>
Date: Thu, 12 Sep 2019 12:38:10 +0100

>> If case insensitive filename matching is on (the default is yes, forced
>> by a system-wide registry flag), much travesty is done to the codepoint
>> space. Obvious stuff like Roman 'a' and 'A' are considered identical,
>> but so is 'a' and 'Á' and 'á' and many more codepoints. I saw several
>> hundred codepoints out of the 65,536 space considered identical. I guess
>> at least we know Microsoft have implemented Unicode correctly, for some
>> definition of correct.
>
> But not normalisation, right?

Maybe.

It didn't cost me much to get the test program to dump the characters
considered identical to the existing characters, and you can find that
list at
https://github.com/ned14/llfio/blob/develop/programs/illegal-codepoints/main.cpp#L5115
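
For anyone wanting to reproduce this kind of check without reading the whole program, here is a minimal sketch of one way to probe whether a filesystem treats two names as the same file. It is an illustration only and not necessarily how the linked program works; `names_collide` is a hypothetical helper name, and it assumes a writable directory on the filesystem under test.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper: create `first` inside `dir`, then ask the filesystem
// whether `second` resolves to that same file. Returns false if the probe
// file cannot be created.
bool names_collide(const fs::path& dir, const fs::path& first, const fs::path& second) {
    assert(fs::is_directory(dir));
    const fs::path created = dir / first;
    {
        std::ofstream out(created);  // creates (or truncates) the probe file
        if (!out) return false;
    }
    std::error_code ec;
    // exists() is checked first so equivalent() is never asked about a missing path.
    const bool collide = fs::exists(dir / second, ec) &&
                         fs::equivalent(created, dir / second, ec);
    fs::remove(created, ec);  // clean up the probe file
    return collide;
}
```

Calling `names_collide(dir, u"a", u"\u00C1")` on the volume in question would report whether 'a' and 'Á' are matched there; the answer depends on the filesystem and its configuration.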

From a brief scan, it *looks* like proper Unicode normalisation. I don't
see a few of the numbers though, e.g. superscript digits being
normalised to ordinary digits, so it might be incomplete.

I should point out that this normalisation is done by the NTFS driver
specifically, i.e. it varies between filesystems. Each has its own
implementation, so you get cool data-losing normalisation bugs like those
between Apple APFS and Samba. The NT object manager has an ASCII-only
case insensitive compare, or at least it did last time I checked.
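
To make "ASCII-only" concrete, here is a minimal sketch of such a compare over UTF-16 code units. This is not the NT object manager's code; `ascii_fold` and `ascii_iequal` are hypothetical names used purely for illustration.

```cpp
#include <cstddef>
#include <string_view>

// Fold only the 26 ASCII capitals; every other UTF-16 code unit is left alone.
constexpr char16_t ascii_fold(char16_t c) {
    return (c >= u'A' && c <= u'Z') ? static_cast<char16_t>(c + (u'a' - u'A')) : c;
}

// Case insensitive for A-Z only: 'a' matches 'A', but U+00C1 does not match U+00E1.
constexpr bool ascii_iequal(std::u16string_view a, std::u16string_view b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (ascii_fold(a[i]) != ascii_fold(b[i])) return false;
    }
    return true;
}
```

Under a compare like this, "File.TXT" and "file.txt" are the same name, while 'Á' and 'á' stay distinct, which is the opposite of what the NTFS matching described above does.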

Niall

Received on 2019-09-12 13:38:14