Subject: Re: [SG16-Unicode] filesystem::path_view::compare()
From: Corentin Jabot (corentinjabot_at_[hidden])
Date: 2019-08-22 11:09:40


On Thu, 22 Aug 2019 at 17:44, Zach Laine <whatwasthataddress_at_[hidden]>
wrote:

> On Thu, Aug 22, 2019 at 9:10 AM Niall Douglas <s_sourceforge_at_[hidden]>
> wrote:
>
>> I am currently refactoring filesystem::path_view::compare() along SG16's
>> feedback, and I want to check approval for the following algorithm. Be
>> aware that path_view can refer to source path data with the following
>> encodings:
>>
>> 1. byte (don't interpret, just pass through as native filesystem encoding)
>> 2. char (native narrow encoding)
>> 3. wchar_t (native wide encoding)
>> 4. char8_t (utf8)
>> 5. char16_t (utf16)
>>
>> The proposed comparison algorithm:
>>
>> 1. Splitting a path into components is done by
>> codepoint-value-comparison for '/' (and
>> filesystem::path::preferred_separator if not on POSIX). This is not
>> Unicode aware, and could be problematic on weird native encodings.
>>
>> 2. Each path component in between separator-codepoint-values is
>> individually compared as follows:
>>
>> a) If the encodings are the same, via char_traits<CharT>::compare().
>>
>> b) If the encodings are dissimilar, codecvt<CharT, char8_t> is used to
>> convert each non-char8_t source character to utf8. The char8_t values
>> are compared using char_traits<char8_t>::compare().
>>
>
> That's a problem -- the Unicode rules say that you cannot compare bytes to
> do less-than comparisons (which they term "collation"), or even for
> equality. Just for equality, you must normalize. Consider a base character
> C1 and a combining mark J that together have a precomposed code point C2
> (so, concretely, say that C1 is "A", J is a combining ring above, and C2 is
> the Angstrom sign). You must not use a comparison in which C1J == C2 comes
> out false. Normalization gets you there.
>
> Collation is even harder. There is a default collation, and it works for
> a lot of comparison operations just as you'd want. However, there are
> different levels of collation (there's one that considers everything --
> probably what you want) and there are collations that work for different
> languages, and even variants of languages (the ordering of strings in the
> phonebook is different from the one you would use when searching within a
> document, for instance).
>
> Did I mention that Unicode is hard?
>
> I think you can just use the default collation, especially if you document
> that this is what you are doing. However, on some filesystems this might
> lead to the wrong answer -- FAT and NTFS and apparently now some Linux FSs
> are supposed to ignore case, right?
>
> In any case, the algorithm above is not Unicode-friendly.
>
> Sorry to be the bearer of this awfulest of news. Collation sucks.
>

Collation is for text; paths are not text, so I don't think we need to worry
about that!

But path comparison is also not portable, so you probably want:

OSX: normalize, then compare the sequence of code points
Linux: compare bytes
Windows: ??? needs to be case-insensitive, but afaict it doesn't match any
Unicode casing spec...
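
To make the "compare bytes" case concrete, here is a minimal sketch, assuming
UTF-8 input held in std::string_view. The names split_components and
compare_bytewise are hypothetical and not part of path_view or any proposal.
It also hints at why bytes alone would not be enough for the OSX case: the
precomposed U+00C5 and "A" followed by U+030A are different byte sequences.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper: split on '/' by code unit value only.
// Empty components (from "//" or a trailing '/') are dropped.
std::vector<std::string_view> split_components(std::string_view path) {
    std::vector<std::string_view> parts;
    std::size_t start = 0;
    while (start <= path.size()) {
        std::size_t end = path.find('/', start);
        if (end == std::string_view::npos) end = path.size();
        if (end > start) parts.push_back(path.substr(start, end - start));
        start = end + 1;
    }
    return parts;
}

// Hypothetical "Linux-style" policy: compare each component byte by byte,
// with no normalization and no case folding.
int compare_bytewise(std::string_view a, std::string_view b) {
    const auto pa = split_components(a);
    const auto pb = split_components(b);
    const std::size_t n = std::min(pa.size(), pb.size());
    for (std::size_t i = 0; i < n; ++i) {
        const int c = pa[i].compare(pb[i]);
        if (c != 0) return c;
    }
    if (pa.size() == pb.size()) return 0;
    return pa.size() < pb.size() ? -1 : 1;
}

// Usage: identical bytes compare equal; canonically equivalent but
// differently encoded names do not.
void demo() {
    assert(compare_bytewise("usr/lib", "usr/lib") == 0);
    // U+00C5 (precomposed) versus "A" + U+030A (combining ring above)
    assert(compare_bytewise("usr/\xC3\x85", "usr/A\xCC\x8A") != 0);
}
```

An OSX-style policy would have to normalize both components before this
comparison, which needs Unicode data tables and is deliberately left out here.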

Reading Niall's mail, I sort of question the whole approach...

I am not sure that "are these two path components the same" can be answered
portably, and we probably want that operation to be fast.
It might be better to offer a way to extract a u8string and let the user
apply the comparison they want on that string?
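
As a rough sketch of what that could look like from the user's side, assuming
a UTF-16 source component: utf16_to_utf8 and equal_ascii_nocase are
hypothetical names, and a std::string of UTF-8 bytes stands in for
std::u8string so that the sketch does not depend on C++20. The ASCII-only
case folding is just one example of a user-chosen comparison; it is not the
rule any real filesystem applies.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical extraction step: transcode UTF-16 to UTF-8 bytes.
// Unpaired surrogates are replaced with U+FFFD.
std::string utf16_to_utf8(std::u16string_view in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Valid surrogate pair: combine into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// One possible user-supplied comparison: equality that ignores ASCII case
// only. Non-ASCII bytes must match exactly.
bool equal_ascii_nocase(std::string_view a, std::string_view b) {
    if (a.size() != b.size()) return false;
    auto lower = [](char c) {
        return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
    };
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (lower(a[i]) != lower(b[i])) return false;
    }
    return true;
}
```

A caller would then write something like
equal_ascii_nocase(utf16_to_utf8(u"README.TXT"), "readme.txt") and take
responsibility for whether that policy matches the target filesystem.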

>
> Zach
>
