On Thu, 22 Aug 2019 at 17:44, Zach Laine <whatwasthataddress@gmail.com> wrote:
On Thu, Aug 22, 2019 at 9:10 AM Niall Douglas <s_sourceforge@nedprod.com> wrote:
I am currently refactoring filesystem::path_view::compare() along SG16's
feedback, and I want to check approval for the following algorithm. Be
aware that path_view can refer to source path data with the following
encodings:

1. byte (don't interpret, just pass through as native filesystem encoding)
2. char (native narrow encoding)
3. wchar_t (native wide encoding)
4. char8_t (utf8)
5. char16_t (utf16)

The proposed comparison algorithm:

1. Splitting a path into components is done by
codepoint-value-comparison for '/' (and
filesystem::path::preferred_separator if not on POSIX). This is not
Unicode aware, and could be problematic on weird native encodings.

2. Each path component in between separator-codepoint-values is
individually compared as follows:

  a) If the encodings are the same, via char_traits<CharT>::compare().

  b) If the encodings are dissimilar, codecvt<CharT, char8_t> is used to
convert each non-char8_t source character to utf8. The char8_t values
are compared using char_traits<char8_t>::compare().
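
For illustration only, steps 1 and 2(a) above could be sketched roughly as follows for the same-encoding narrow-char case. The names split_components and compare_paths are hypothetical, not part of path_view or any proposal; only '/' is treated as a separator, and the codecvt conversion in step 2(b) is not attempted here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: split on '/' by plain code-unit comparison (step 1).
// Not Unicode aware; preferred_separator is ignored in this sketch.
std::vector<std::string_view> split_components(std::string_view path) {
    std::vector<std::string_view> parts;
    std::size_t start = 0;
    while (true) {
        const std::size_t pos = path.find('/', start);
        if (pos == std::string_view::npos) {
            parts.push_back(path.substr(start));
            return parts;
        }
        parts.push_back(path.substr(start, pos - start));
        start = pos + 1;
    }
}

// Hypothetical helper: compare component by component using
// char_traits<char>::compare (step 2a, same encoding on both sides).
// Returns <0, 0 or >0 like a three-way comparison.
int compare_paths(std::string_view a, std::string_view b) {
    const auto ca = split_components(a);
    const auto cb = split_components(b);
    const std::size_t n = std::min(ca.size(), cb.size());
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t len = std::min(ca[i].size(), cb[i].size());
        const int r =
            std::char_traits<char>::compare(ca[i].data(), cb[i].data(), len);
        if (r != 0) return r;
        // Equal common prefix: the shorter component orders first.
        if (ca[i].size() != cb[i].size())
            return ca[i].size() < cb[i].size() ? -1 : 1;
    }
    // All shared components equal: the path with fewer components orders first.
    if (ca.size() != cb.size()) return ca.size() < cb.size() ? -1 : 1;
    return 0;
}
```

Note that a per-component comparison can order paths differently from a whole-string comparison: with this sketch "a/b" sorts before "a.b", because the first components "a" and "a.b" are compared, whereas a flat byte comparison would put "a.b" first ('.' is 0x2E, '/' is 0x2F).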

That's a problem -- the Unicode rules say that you cannot compare bytes to do less-than comparisons (which they term "collation"), or even for equality.  Just for equality, you must normalize.  Consider a base character C1 and a combining mark J that have a precomposed code point C2 (so, concretely, say that C1 is "A", J is a combining ring above, and C2 is the Angstrom symbol).  You must not use comparisons that result in C1J == C2 being false.  Normalization gets you there.
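
To make the equality problem concrete, here is a small sketch with the UTF-8 bytes written out by hand. The variable names and the bytes_equal helper are hypothetical. It only shows that a byte-wise comparison treats these canonically equivalent sequences as different; it does not normalize anything, since standard C++ has no normalization facility and a library such as ICU would be needed for that.

```cpp
#include <cassert>
#include <string>

// UTF-8 encodings of three canonically equivalent spellings.
const std::string decomposed  = "A\xCC\x8A";     // U+0041 'A' + U+030A COMBINING RING ABOVE
const std::string precomposed = "\xC3\x85";      // U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
const std::string angstrom    = "\xE2\x84\xAB";  // U+212B ANGSTROM SIGN

// Hypothetical helper: plain byte-wise equality, as char_traits::compare
// over the code units would effectively give.
bool bytes_equal(const std::string& a, const std::string& b) {
    return a == b;
}
```

With these definitions, bytes_equal(decomposed, precomposed) and bytes_equal(precomposed, angstrom) are both false, even though all three strings would compare equal after normalization to NFC or NFD.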

Collation is even harder.  There is a default collation, and it works for a lot of comparison operations just as you'd want.  However, there are different levels of collation (there's one that considers everything -- probably what you want) and there are collations that work for different languages, and even variants of languages (the ordering of strings in the phonebook is different from the one you would use when searching within a document, for instance).

Did I mention that Unicode is hard?

I think you can just use the default collation, especially if you document that this is what you are doing.  However, on some filesystems this might lead to the wrong answer -- FAT and NTFS and apparently now some Linux FSs are supposed to ignore case, right?

In any case, the algorithm above is not Unicode-friendly.

Sorry to be the bearer of this awfulest of news.  Collation sucks.


Collation is for text, and paths are not text, so I don't think we need to worry about that!

But path comparison is also not portable, so you probably want:

OSX     : normalize, then compare the sequence of codepoints
Linux   : compare bytes
Windows : ??? needs to be case-insensitive, but AFAICT it doesn't match any Unicode casing spec...

Reading Niall's mail, I sort of question the whole approach...

I am not sure that "are these two path components the same" can be answered portably - and we probably want that operation to be fast.
It might be better to offer a way to extract a u8string and let the user apply the comparison they want to that string?


 

Zach

_______________________________________________
SG16 Unicode mailing list
Unicode@isocpp.open-std.org
http://www.open-std.org/mailman/listinfo/unicode