sg16: [SG16-Unicode] filesystem::path

From: Niall Douglas <s_sourceforge_at_[hidden]>
Date: Thu, 22 Aug 2019 15:10:09 +0100

I am currently refactoring filesystem::path_view::compare() in line with
SG16's feedback, and I want to check whether the following algorithm is
acceptable. Be aware that path_view can refer to source path data with
the following encodings:

1. byte (don't interpret, just pass through as native filesystem encoding)
2. char (native narrow encoding)
3. wchar_t (native wide encoding)
4. char8_t (utf8)
5. char16_t (utf16)

The proposed comparison algorithm:

1. Splitting a path into components is done by
codepoint-value-comparison for '/' (and
filesystem::path::preferred_separator if not on POSIX). This is not
Unicode aware, and could be problematic on weird native encodings.

2. Each path component in between separator-codepoint-values is
individually compared as follows:

  a) If the encodings are the same, via char_traits<CharT>::compare().

  b) If the encodings are dissimilar, codecvt<CharT, char8_t> is used to
convert each non-char8_t source character to utf8. The char8_t values
are compared using char_traits<char8_t>::compare().

  c) Invalid input is handled by considering the input truncated, and
therefore less than the comparator.

  d) Two invalid inputs, if they both truncate at the same point with
the same value, are considered equal.

  e) Byte encoding, if used for both path views, is compared using memcmp().

  f) Byte encoding for one path view but not the other causes comparison
to vary depending on the platform. If on POSIX, the byte array is
compared as if utf8. If on Windows, the byte array is compared as if
utf-16. Other platforms would vary here.
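
To make steps 1 and 2a concrete, here is a minimal sketch for the case
where both inputs share one encoding. The names split_components and
compare_components are hypothetical, not taken from the path_view
reference implementation, and only '/' is treated as a separator. The
treatment of empty and trailing components is my own assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: split on the separator by plain value comparison
// of code units (step 1). Nothing here is Unicode aware. A trailing
// separator yields a final empty component.
template <class CharT>
std::vector<std::basic_string_view<CharT>> split_components(
    std::basic_string_view<CharT> path, CharT sep = CharT('/')) {
    std::vector<std::basic_string_view<CharT>> out;
    for (;;) {
        const std::size_t pos = path.find(sep);
        out.push_back(path.substr(0, pos));  // npos takes the remainder
        if (pos == std::basic_string_view<CharT>::npos) break;
        path.remove_prefix(pos + 1);
    }
    return out;
}

// Hypothetical helper: compare component by component using
// char_traits<CharT>::compare() (step 2a). Returns -1, 0 or 1.
template <class CharT>
int compare_components(std::basic_string_view<CharT> a,
                       std::basic_string_view<CharT> b) {
    using traits = std::char_traits<CharT>;
    const auto ca = split_components(a);
    const auto cb = split_components(b);
    const std::size_t n = std::min(ca.size(), cb.size());
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t len = std::min(ca[i].size(), cb[i].size());
        const int r = traits::compare(ca[i].data(), cb[i].data(), len);
        if (r != 0) return r < 0 ? -1 : 1;
        // Equal prefix: the shorter component sorts first.
        if (ca[i].size() != cb[i].size())
            return ca[i].size() < cb[i].size() ? -1 : 1;
    }
    // All shared components equal: the path with fewer components sorts first.
    if (ca.size() == cb.size()) return 0;
    return ca.size() < cb.size() ? -1 : 1;
}
```

Note that this ordering differs from a flat string comparison: "a/b"
sorts before "a.b" component-wise, because "a" is a prefix of "a.b",
whereas a flat comparison puts "a.b" first since '.' is less than '/'.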

(Note that I am aware that codecvt is deprecated in future C++, but it's
what I have to hand for the reference implementation)
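
For illustration only, steps 2b and 2c can be sketched without codecvt
by using a hand-rolled UTF-16 to UTF-8 transcoder. This is a swapped-in
technique, not what the reference implementation does. The function
names are hypothetical, and the narrow input is assumed to already be
valid UTF-8.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: transcode UTF-16 to UTF-8, stopping at the first
// unpaired surrogate so that invalid input is treated as truncated.
// `valid` is set to false if truncation happened.
std::string utf16_to_utf8_truncating(std::u16string_view in, bool& valid) {
    std::string out;
    valid = true;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {  // high surrogate
            if (i + 1 == in.size() || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF) {
                valid = false;  // no low surrogate follows
                break;
            }
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {  // stray low surrogate
            valid = false;
            break;
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: compare one component held as UTF-8 against one
// held as UTF-16 by comparing UTF-8 code units. Returns -1, 0 or 1.
int compare_utf8_with_utf16(std::string_view a, std::u16string_view b) {
    bool valid = true;
    const std::string b8 = utf16_to_utf8_truncating(b, valid);
    const int r = a.compare(b8);  // char_traits<char> compares as unsigned char
    return (r > 0) - (r < 0);
}
```

With this sketch, u"ab" followed by an unpaired surrogate and then u"c"
is seen as just "ab", so it compares less than the narrow string "abc".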

Consequences of the above design:

1. Null codepoints in path components are accepted, but the '/' codepoint
(and the filesystem::path::preferred_separator codepoint if not on POSIX)
is not, even if those are legal in filenames on that platform.

Let me be clear: this would open a source of attack upon C++ programs,
in that a carefully malformed filename would cause denial of service.

2. Path view comparison and hashing are going to be potentially very
slow. This matters a lot for std::map, std::unordered_map etc. Note that
the raw_path() escape hatch discussed earlier would efficiently work
around this, but then hash values and equality comparisons would depend
upon encoding as well as value.
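
To illustrate that last point, here is a small sketch of hashing the
stored code units directly. raw_bytes and raw_hash are hypothetical
stand-ins for what a raw_path()-style escape hatch would permit, not a
proposed API.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <string_view>

// Hypothetical helper: view the stored code units as bytes, with no
// transcoding at all.
template <class CharT>
std::string raw_bytes(std::basic_string_view<CharT> s) {
    return std::string(reinterpret_cast<const char*>(s.data()),
                       s.size() * sizeof(CharT));
}

// Hypothetical helper: hash the untranscoded bytes. This is cheap, but
// the result depends on the source encoding as well as on the text.
template <class CharT>
std::size_t raw_hash(std::basic_string_view<CharT> s) {
    return std::hash<std::string>{}(raw_bytes(s));
}
```

Here the same filename held as utf8 ("\xC3\xA9") and as utf16
(u"\u00E9") produces different byte sequences, so the two would
generally hash differently and never compare equal, even though they
name the same file.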

Thoughts?

Niall

Received on 2019-08-22 16:10:12