sg16: Re: [SG16-Unicode] filesystem::path

From: Niall Douglas <s_sourceforge_at_[hidden]>
Date: Thu, 22 Aug 2019 17:08:17 +0100

> b) If the encodings are dissimilar, codecvt<CharT, char8_t> is used to
> convert each non-char8_t source character to utf8. The char8_t values
> are compared using char_traits<char8_t>::compare().
>
> That's a problem -- the Unicode rules say that you cannot compare bytes
> to do less-than comparisons (which they term "collation"), or even for
> equality. Just for equality, you must normalize. Consider the combiner
> C1 and joiner J that have a combined code point C2 (so, concretely, say
> that C1 is "A", J is a combining circle, and C2 is the Angstrom
> symbol). You must not use comparisons that result in C1J == C2
> being false. Normalization gets you there.
>
> Collation is even harder. There is a default collation, and it works
> for a lot of comparison operations just as you'd want. However, there
> are different levels of collation (there's one that considers everything
> -- probably what you want) and there are collations that work for
> different languages, and even variants of languages (the ordering of
> strings in the phonebook is different from the one you would use when
> searching within a document, for instance).
>
> Did I mention that Unicode is hard?

My proposed algorithm is based on filesystem::path::compare, I did not
innovate.

That means a comparison of lexical representations of paths, same as
basic_string<filesystem::path::value_type>::compare(). Which uses
char_traits<CharT>::compare(), same as my proposed algorithm. All the
same strengths, and weaknesses, apply in this choice.
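
As a minimal sketch of what "lexical" means here, the following compares
a decomposed and a precomposed spelling of the same visible name using
filesystem::path::compare(). The helper names (decomposed, precomposed,
lexically_equal, check_no_normalization) are hypothetical and exist only
for illustration. U+00C5 is used as the precomposed character because
the Angstrom sign canonically decomposes to it. The sketch assumes a
C++17 implementation whose path accepts char16_t sources, as the
standard requires.

```cpp
#include <cassert>
#include <filesystem>

namespace fs = std::filesystem;

// U+0041 'A' followed by U+030A COMBINING RING ABOVE (decomposed form).
inline fs::path decomposed() { return fs::path(u"A\u030A"); }

// U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE (precomposed form).
inline fs::path precomposed() { return fs::path(u"\u00C5"); }

// True when path::compare() reports equality. The comparison works on
// the native representation of each path element; no Unicode
// normalization is applied.
inline bool lexically_equal(const fs::path& a, const fs::path& b) {
    return a.compare(b) == 0;
}

// The two spellings are canonically equivalent in Unicode terms, yet
// they compare unequal because their code unit sequences differ.
inline void check_no_normalization() {
    assert(lexically_equal(decomposed(), decomposed()));
    assert(!lexically_equal(decomposed(), precomposed()));
}
```

A path_view comparison built on char_traits<CharT>::compare() would
presumably behave the same way for these two inputs, which is the "lack
of surprise" property discussed in this message.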

What the filesystem itself does, or does not do, is irrelevant here. We
are aiming for lack of surprise for users performing filesystem::path
comparisons when they then use filesystem::path_view comparisons. The
filesystem never gets involved here.

To that end, the only source of problems will be the inaccurate
conversion of path components to another encoding. However, the exact
same problem would exist for filesystem::path.

An entirely valid concern is that filesystem::path always converts to
the native filesystem encoding, and it is that which is compared,
whereas my proposed algorithm always goes via UTF-8. Thus,
filesystem::path and filesystem::path_view comparisons are potentially
not one-to-one identical on non-POSIX platforms.
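
One concrete way the two orderings can diverge, sketched under the
assumption that the native encoding is UTF-16 (as with wchar_t paths on
Windows): code unit order in UTF-16 differs from byte order in UTF-8
when a character in U+E000..U+FFFF is compared against one outside the
Basic Multilingual Plane. The function names below are hypothetical, and
the encoded forms are hard-coded so that no conversion facility is
involved. Whether this is the only source of divergence is not claimed.

```cpp
#include <cassert>
#include <string>

// Compare U+E000 against U+10000 as UTF-8 byte sequences.
// char_traits<char>::compare orders chars as unsigned char values.
inline int compare_as_utf8() {
    const std::string a = "\xEE\x80\x80";      // U+E000
    const std::string b = "\xF0\x90\x80\x80";  // U+10000
    return a.compare(b);
}

// Compare the same two code points as UTF-16 code unit sequences.
inline int compare_as_utf16() {
    const std::u16string a = u"\uE000";        // single unit 0xE000
    const std::u16string b = u"\U00010000";    // surrogates 0xD800 0xDC00
    return a.compare(b);
}

// UTF-8 orders U+E000 first (0xEE < 0xF0), while UTF-16 orders it
// second (0xE000 > 0xD800), so the two comparisons disagree in sign.
inline void check_orderings_differ() {
    assert(compare_as_utf8() < 0);
    assert(compare_as_utf16() > 0);
}
```

UTF-8 byte order matches code point order, so going via UTF-8 would give
the same answer on every platform for valid input, at the cost of
disagreeing with a native UTF-16 comparison in cases like this one.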

Would SG16 prefer that path_view instead use the native filesystem
encoding for comparisons of dissimilarly encoded path views? Then it
maps onto filesystem::path exactly, but you lose portability of
behaviour between platforms. I had assumed portability to be the more
important for edge cases, which is mainly where invalid Unicode paths
are being compared: you would probably want stable behaviour across
platforms, as invalid Unicode filesystem paths are perfectly valid, and
indeed may become common if LLFIO gets standardised.

Or am I completely missing your point here, Zach?

Niall

Received on 2019-08-22 18:08:22