Date: Mon, 28 Nov 2022 23:09:08 -0500
If I recall correctly (and it's been a while, so the underlying details
are fuzzy):
Filesystems in general, and NTFS in particular, don't actually maintain any
Unicode invariants. The sequence of octets you get from the file system for
a path may not be a well-formed UTF-8 string. The only thing that will open
the same file is the same sequence of octets used to open it in the first
place, and the only way to construct a file path that corresponds to an
existing file is to use the various path dirwalk algorithms without looking
too closely. Treating a path as a UTF-8 string is at best slightly
misleading, at worst an attack vector.
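
To make that concrete, here is a rough sketch, not anything from the
standard or from a real library. The names is_well_formed_utf8 and
list_entries are hypothetical helpers I made up for illustration. The
first checks whether a run of octets happens to be well-formed UTF-8; the
second walks a directory and keeps each entry as a std::filesystem::path
so the native octets are never round-tripped through a text encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <filesystem>
#include <string_view>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper: true if `bytes` is well-formed UTF-8. Rejects stray
// or missing continuation bytes, overlong forms, surrogates, and values
// above U+10FFFF.
inline bool is_well_formed_utf8(std::string_view bytes) {
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;              // continuation or invalid lead byte
        if (n - i < len) return false;  // sequence truncated at end of input
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        // Smallest code point that legitimately needs `len` bytes.
        static constexpr unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
        assert(len >= 1 && len <= 4);
        if (cp < min_cp[len]) return false;              // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode
        i += len;
    }
    return true;
}

// Hypothetical helper: collect directory entries as fs::path objects.
// Returns whatever was gathered before any error; never throws on I/O errors.
inline std::vector<fs::path> list_entries(const fs::path& dir) {
    std::vector<fs::path> out;
    std::error_code ec;
    for (fs::directory_iterator it(dir, ec), end; !ec && it != end;
         it.increment(ec)) {
        out.push_back(it->path());  // keep the native representation intact
    }
    return out;
}
```

The idea would be to reopen a file using the fs::path you got from
list_entries, and to consult is_well_formed_utf8 only when deciding how to
display a name. On POSIX systems path::native() is a std::string, so it
can be passed to the check directly; on Windows the native form is wide, so
the check as written does not apply there.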
But it's been a few years since we had these rounds of discussions in SG16,
so CC'ing.
I recall Niall having some horror stories also, but I don't have his
address handy.
On Mon, Nov 28, 2022 at 8:45 PM Casey Carter via Lib <lib_at_[hidden]>
wrote:
> On Wed, Nov 23, 2022 at 4:32 AM Daniel Krügler via Lib <
> lib_at_[hidden]> wrote:
>
>> The filesystem::u8path function became deprecated with the adoption of
>>
>> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0482r6.html
>>
>> But it is not actually clear to me why we did this. The u8path
>> function is still useful if my original string source is a char or
>> wchar_t sequence and I *do know* that the encoding of these sequences
>> is a Unicode encoding that matches the size of the character type. The
>> deprecation note suggests that I should use std::u8string instead,
>> which costs me an additional transformation and doesn't work without
>> reinterpret_cast (assuming I start from a char sequence). And it
>> doesn't help me if my source is a wchar_t sequence.
>>
>> We have a constructor allowing us to provide a locale, but I think in
>> recent discussions it became clear that Unicode should not be
>> expressed as a locale and there exists AFAIK no portable and
>> non-deprecated way to construct a locale that denotes - for example -
>> UTF-8.
>>
>> Have I overlooked a much better existing alternative? (In this case
>> the deprecation note should be adjusted)
>>
>> Should we consider un-deprecating u8path?
>>
>
> I think we should. I've been hearing rants to this effect from Nicole
> (cc'ed so she can correct me if necessary) every week or two since she
> joined the team that develops MSVCSTL, and I can't come up with a good
> argument for the deprecation other than a vague "encourage people to use
> the new char8_t for UTF-8-encoded strings."
>
>
> _______________________________________________
> Lib mailing list
> Lib_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib
> Link to this post: http://lists.isocpp.org/lib/2022/11/24644.php
>
Received on 2022-11-29 04:09:12