sg16: Re: [SG16-Unicode] filesystem::path

From: Zach Laine <whatwasthataddress_at_[hidden]>
Date: Thu, 22 Aug 2019 10:46:24 -0500

On Thu, Aug 22, 2019 at 9:10 AM Niall Douglas <s_sourceforge_at_[hidden]>
wrote:

> I am currently refactoring filesystem::path_view::compare() along SG16's
> feedback, and I want to check approval for the following algorithm. Be
> aware that path_view can refer to source path data with the following
> encodings:
>
> 1. byte (don't interpret, just pass through as native filesystem encoding)
> 2. char (native narrow encoding)
> 3. wchar_t (native wide encoding)
> 4. char8_t (utf8)
> 5. char16_t (utf16)
>
> The proposed comparison algorithm:
>
> 1. Splitting a path into components is done by
> codepoint-value-comparison for '/' (and
> filesystem::path::preferred_separator if not on POSIX). This is not
> Unicode aware, and could be problematic on weird native encodings.
>
> 2. Each path component in between separator-codepoint-values is
> individually compared as follows:
>
> a) If the encodings are the same, via char_traits<CharT>::compare().
>
> b) If the encodings are dissimilar, codecvt<CharT, char8_t> is used to
> convert each non-char8_t source character to utf8. The char8_t values
> are compared using char_traits<char8_t>::compare().
>

That's a problem -- the Unicode rules say that you cannot compare code
units byte-by-byte to do less-than comparisons (which they term
"collation"), or even for equality. Just for equality, you must normalize.
Consider a base character C1 and a combining mark J that together have a
precomposed code point C2 (so, concretely, say that C1 is "A", J is a
combining ring above, and C2 is the Angstrom symbol). You must not use
comparisons which result in C1J == C2 being false. Normalization gets you
there.
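
To make the C1J vs. C2 case concrete, here is a minimal sketch of what a
plain char_traits<char>::compare() sees for the three UTF-8 spellings of
that character. The helper names (bytewise_equal and friends) are made up
for illustration and are not part of path_view or any proposal; the byte
sequences are the UTF-8 encodings of "A" + U+030A, U+00C5, and U+212B.

```cpp
#include <cassert>
#include <string>

// Byte-wise equality, i.e. what char_traits<char>::compare() gives you
// when it is applied directly to UTF-8 code units. (Hypothetical helper.)
inline bool bytewise_equal(const std::string& a, const std::string& b) {
    return a.size() == b.size() &&
           std::char_traits<char>::compare(a.data(), b.data(), a.size()) == 0;
}

// C1 J: "A" followed by U+030A COMBINING RING ABOVE.
inline std::string decomposed_a_ring() { return "A\xCC\x8A"; }

// U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE (precomposed).
inline std::string precomposed_a_ring() { return "\xC3\x85"; }

// C2: U+212B ANGSTROM SIGN, canonically equivalent to U+00C5.
inline std::string angstrom_sign() { return "\xE2\x84\xAB"; }
```

All three strings are canonically equivalent, yet bytewise_equal() returns
false for every distinct pair, since they do not even have the same length
in bytes. Normalizing both sides to the same form (NFC or NFD) before
comparing would be needed to make them compare equal; that step needs
Unicode data tables and is not shown here.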

Collation is even harder. There is a default collation, and it works for a
lot of comparison operations just as you'd want. However, there are
different strength levels of collation (there's one that considers every
difference -- probably what you want) and there are tailored collations
for different languages, and even variants within a language (the ordering
of strings in the phonebook is different from the one you would use when
searching within a document, for instance).
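
As a rough illustration of why code-unit ordering is not collation, the
sketch below sorts a few UTF-8 names with plain std::string operator<,
which orders by code unit value. The function name sort_by_code_units is
hypothetical, and the sample names are just an example.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Sorts names by UTF-8 code unit value, which is what operator< on
// std::string (via char_traits<char>) does. (Hypothetical helper.)
inline std::vector<std::string> sort_by_code_units(std::vector<std::string> names) {
    std::sort(names.begin(), names.end());
    return names;
}
```

Calling sort_by_code_units({"apple", "Zebra", "\xC3\xA9" "clair"}) yields
Zebra, apple, eclair-with-acute, because 'Z' (0x5A) is less than 'a' (0x61)
and the lead byte 0xC3 is greater than any ASCII letter. The default
Unicode collation would instead put them in the order apple,
eclair-with-acute, Zebra.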

Did I mention that Unicode is hard?

I think you can just use the default collation, especially if you document
that this is what you are doing. However, on some filesystems this might
lead to the wrong answer -- FAT and NTFS and apparently now some Linux FSs
are supposed to ignore case, right?
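
To show the kind of mismatch I mean, here is a deliberately simplified
sketch: an ASCII-only case fold, which is an assumption made for
illustration and not what FAT, NTFS, or any Linux filesystem actually does
(their folding rules are filesystem-specific). The names ascii_fold and
equal_ignoring_ascii_case are made up.

```cpp
#include <cassert>
#include <string>

// ASCII-only lower-casing. Real case-insensitive filesystems use their own
// (larger, filesystem-specific) folding tables; this is only a toy.
inline std::string ascii_fold(std::string s) {
    for (char& c : s) {
        if (c >= 'A' && c <= 'Z') {
            c = static_cast<char>(c - 'A' + 'a');
        }
    }
    return s;
}

// Equality after the toy fold above. (Hypothetical helper.)
inline bool equal_ignoring_ascii_case(const std::string& a, const std::string& b) {
    return ascii_fold(a) == ascii_fold(b);
}
```

A byte-wise or default-collation comparison reports "README.TXT" and
"readme.txt" as different, while a case-ignoring filesystem would treat
them as the same entry; equal_ignoring_ascii_case() returns true for that
pair.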

In any case, the algorithm above is not Unicode-friendly.

Sorry to be the bearer of this awfulest of news. Collation sucks.

Zach

Received on 2019-08-22 17:44:21