ISOCPP sg16 List: Re: [EXTERNAL] Re: [isocpp-lib] Why have we deprecated filesystem::u8path?

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 29 Nov 2022 12:22:34 -0500

On 11/29/22 10:58 AM, Nicole Mazzuca wrote:
> I think it was a noble idea, but fundamentally a non-zero number of
> people use `string` as a utf-8 container, and cannot switch. It should
> not be considered the default, but it should certainly be supported
> without allocation to a completely different type (or a
> reinterpret_cast).

From the perspective of the standard, since the ordinary literal
encoding and the locale dependent execution character set (and multibyte
encoding) may be UTF-8, the standard must support use of these types
with UTF-8. But there is a subtle distinction between supporting them
when these encodings are UTF-8 vs encouraging use of these types for
UTF-8 when these encodings are something else.

When the locale encoding is UTF-8, invoking a path constructor with a
range of char is a supported, well-defined way to construct a path object
from a UTF-8 string; the behavior is the same as u8path.

The only use case for u8path is when the locale encoding is not UTF-8
and the programmer has UTF-8 data held in a range of char. I don't think
the standard should encourage that. A solution in the spirit of
Corentin's P2626 (charN_t incremental adoption: Casting pointers of UTF
character types) <https://wg21.link/p2626> offers a better approach for
projects in this situation by requiring explicit code to
convert/cast/validate the input. (Thank you Corentin, I had intended to
mention your paper in my earlier response and then forgot to do so).

Tom.

>
> Nicole
>
>> On Nov 29, 2022, at 07:45, Tom Honermann <tom_at_[hidden]> wrote:
>>
>> Sorry for the delay in responding.
>>
>> u8path was deprecated with the adoption of P0482R6
>> <https://wg21.link/p0482r6>.
>> I confirmed that I neglected to include motivation for its
>> deprecation in that paper. The closest the paper gets to such
>> motivation is in the discussion of u8path in the Motivation
>> <https://wg21.link/p0482r6#motivation>
>> section:
>>
>> To accommodate UTF-8 encoded text, the file system library
>> specifies the following factory functions. Matching factory
>> functions are not provided for other encodings.
>>
>>     template <class Source> path u8path(const Source& source);
>>     template <class InputIterator> path u8path(InputIterator first,
>>                                                InputIterator last);
>>
>> The requirement to construct path objects using one interface for
>> UTF-8 strings vs another interface for all other supported
>> encodings creates unnecessary difficulties for portable code.
>> Consider an application that uses UTF-8 as its internal encoding
>> on POSIX systems, but uses UTF-16 on Windows. Conditional
>> compilation or other abstractions must be implemented and used in
>> otherwise platform neutral code to construct path objects.
>>
>> The original motivation for deprecation was that u8path was only
>> added because the path constructor, per [fs.path.type.cvt]
>> <http://eel.is/c++draft/fs.path.type.cvt>,
>> already specified different behavior for construction via a range of
>> char; u8path therefore provided redundant functionality once char8_t
>> was added.
>>
>> I think deprecation is still justified on design grounds. The
>> standard currently associates the following encodings with char:
>>
>> 1. The /ordinary literal encoding/ ([lex.ccon.literal]
>>    <http://eel.is/c++draft/tab:lex.ccon.literal>,
>>    [lex.string.literal]
>>    <http://eel.is/c++draft/tab:lex.string.literal>)
>>    used for character and string literals.
>> 2. The /execution character set/ ([character.seq.general]p(1.2)
>>    <http://eel.is/c++draft/library#character.seq.general-1.2>)
>>    used for the locale dependent execution environment.
>> 3. The multibyte character encoding ([c.mb.wcs]
>>    <http://eel.is/c++draft/c.mb.wcs>,
>>    C: 5.2.1.1 Multibyte characters) which is effectively the
>>    encoding of the /execution character set/.
>> 4. The /native encoding/ ([fs.path.type.cvt]p1
>>    <http://eel.is/c++draft/fs.path.type.cvt#1>)
>>    used for path names.
>>
>> Though the standard doesn't require it, the intent is that these
>> encodings are all compatible. In practice, they do get out of sync;
>> the locale of the execution environment is not generally known when
>> encoding character and string literals and filesystem encoding may
>> differ from the locale dependent encoding.
>>
>> Adding an additional association with UTF-8 creates a deeper
>> division. We know that programmers have a hard time maintaining
>> encoding expectations; mojibake remains a common occurrence. From a
>> design perspective, if we endorse continued use of u8path, should we
>> also add char-based UTF-8 specific variants of std::basic_string,
>> std::char_traits, and std::ctype? It isn't clear to me that path
>> names are sufficiently special to warrant special interfaces;
>> particularly when most filesystems in use today (NTFS being a partial
>> exception) do not require a particular encoding (most just require a
>> specific value for the '/' and '\0' characters). As we seek to add
>> more Unicode features to the standard library, should we add UTF-8
>> based interfaces for char and char8_t (and unsigned char since some
>> projects use that for UTF-8)? I think the standard should avoid
>> further muddying the waters of what encoding(s) char should be
>> associated with.
>>
>> Tom.
>>
>> On 11/29/22 1:08 AM, Daniel Krügler wrote:
>>> Am Di., 29. Nov. 2022 um 05:32 Uhr schrieb Nicole Mazzuca
>>> <Nicole.Mazzuca_at_[hidden]>:
>>>> I'd point out that the exact same issue exists with path(u8string), we've just made life more painful for people who do need to convert utf-8 to paths. (i.e., Windows people).
>>>>
>>>> Nicole
>>> Thanks for all the feedback, Nicole, Steve, and Casey. I will now open
>>> an LWG issue about this.
>>>
>>> - Daniel

Received on 2022-11-29 17:22:36