Date: Tue, 29 Nov 2022 04:32:02 +0000
I'd point out that the exact same issue exists with path(u8string); we've just made life more painful for the people who do need to convert UTF-8 to paths (i.e., Windows people).
Nicole
On Nov 28, 2022, at 20:09, Steve Downey <sdowney_at_[hidden]> wrote:
If I recall correctly (it's been a while, so the underlying details are fuzzy):
Filesystems in general, and NTFS in particular, don't actually maintain any Unicode invariants. The sequence of octets you get from the file system for a path may not be a well-formed UTF-8 string. The only thing that will open the same file is the same sequence of octets used to open it in the first place, and the only way to construct a file path that corresponds to an existing file is to use the various path dirwalk algorithms without looking too closely. Treating a path as a UTF-8 string is at best slightly misleading and at worst an attack vector.
But it's been a few years since we had these rounds of discussions in SG16, so CC'ing.
I recall Niall having some horror stories also, but I don't have his address handy.
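To make the point about octet sequences concrete, here is a minimal sketch of a well-formedness check one might run before treating path bytes as UTF-8 text. The helper name is_well_formed_utf8 is hypothetical and not part of any standard library; the byte ranges follow the Unicode Standard's table of well-formed UTF-8 byte sequences. It assumes the path bytes are available as a char sequence, as on POSIX systems; on Windows the native form is wchar_t and a different check would be needed.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper (not a standard API): returns true only if every
// byte in `s` belongs to a well-formed UTF-8 sequence.
bool is_well_formed_utf8(std::string_view s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 <= 0x7F) { ++i; continue; }   // ASCII, one byte

        std::size_t len = 0;
        unsigned char lo = 0x80, hi = 0xBF;  // allowed range of the 2nd byte
        if (b0 >= 0xC2 && b0 <= 0xDF)      { len = 2; }
        else if (b0 == 0xE0)               { len = 3; lo = 0xA0; } // no overlongs
        else if (b0 >= 0xE1 && b0 <= 0xEC) { len = 3; }
        else if (b0 == 0xED)               { len = 3; hi = 0x9F; } // no surrogates
        else if (b0 == 0xEE || b0 == 0xEF) { len = 3; }
        else if (b0 == 0xF0)               { len = 4; lo = 0x90; } // no overlongs
        else if (b0 >= 0xF1 && b0 <= 0xF3) { len = 4; }
        else if (b0 == 0xF4)               { len = 4; hi = 0x8F; } // <= U+10FFFF
        else return false;                 // 0x80..0xC1 and 0xF5..0xFF never lead

        if (n - i < len) return false;     // truncated sequence

        const unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
        if (b1 < lo || b1 > hi) return false;
        for (std::size_t k = 2; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if (b < 0x80 || b > 0xBF) return false;
        }
        i += len;
    }
    return true;
}
```

A file name that fails this check can still name a real file, which is why the original bytes (or the filesystem::path object obtained from directory iteration) should be kept for reopening it, and any UTF-8 view should be used for display only.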
On Mon, Nov 28, 2022 at 8:45 PM Casey Carter via Lib <lib_at_[hidden]<mailto:lib_at_[hidden]>> wrote:
On Wed, Nov 23, 2022 at 4:32 AM Daniel Krügler via Lib <lib_at_[hidden]<mailto:lib_at_[hidden]>> wrote:
The filesystem::u8path function became deprecated with the adoption of
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0482r6.html
But it is not actually clear to me why we did this. The u8path
function is still useful if my original string source is a char or
wchar_t sequence and I *do know* that the encoding of these sequences
is a Unicode encoding that matches the size of the character type. The
deprecation note suggests that I should use std::u8string instead,
which costs me an additional transformation and doesn't work without
reinterpret_cast (assuming I start from a char sequence). And it
doesn't help me if my source is a wchar_t sequence.
We have a constructor allowing us to provide a locale, but I think in
recent discussions it became clear that Unicode should not be
expressed as a locale and there exists AFAIK no portable and
non-deprecated way to construct a locale that denotes - for example -
UTF-8.
Have I overlooked a much better existing alternative? (In this case
the deprecation note should be adjusted)
Should we consider un-deprecating u8path?
I think we should. I've been hearing rants to this effect from Nicole (cc'ed so she can correct me if necessary) every week or two since she joined the team that develops MSVCSTL, and I can't come up with a good argument for the deprecation other than a vague "encourage people to use the new char8_t for UTF-8-encoded strings."
_______________________________________________
Lib mailing list
Lib_at_[hidden]<mailto:Lib_at_[hidden].org>
Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib
Link to this post: http://lists.isocpp.org/lib/2022/11/24644.php
Received on 2022-11-29 04:32:05