ISOCPP sg16 List: Re: [EXTERNAL] Re: [isocpp-lib] Why have we deprecated filesystem::u8path?

From: Nicole Mazzuca <Nicole.Mazzuca_at_[hidden]>
Date: Tue, 29 Nov 2022 17:31:29 +0000

P2626 would be a great way to do this instead; I'm still frustrated that we deprecated a perfectly good function without a non-allocating way to replace it, though.

Nicole
________________________________
From: Tom Honermann <tom_at_[hidden]>
Sent: Tuesday, November 29, 2022 9:22 AM
To: Nicole Mazzuca <Nicole.Mazzuca_at_[hidden]>
Cc: Daniel Krügler <daniel.kruegler_at_[hidden]>; Steve Downey <sdowney_at_[hidden]>; lib_at_[hidden] <lib_at_[hidden]>; SG16 <sg16_at_[hidden]>; Casey Carter <Casey_at_[hidden]>
Subject: Re: [EXTERNAL] Re: [isocpp-lib] Why have we deprecated filesystem::u8path?

On 11/29/22 10:58 AM, Nicole Mazzuca wrote:
I think it was a noble idea, but fundamentally a non-zero number of people use `string` as a utf-8 container, and cannot switch. It should not be considered the default, but it should certainly be supported without allocation to a completely different type (or a reinterpret_cast).

From the perspective of the standard, since the ordinary literal encoding and the locale dependent execution character set (and multibyte encoding) may be UTF-8, the standard must support use of these types with UTF-8. But there is a subtle distinction between supporting them when these encodings are UTF-8 vs encouraging use of these types for UTF-8 when these encodings are something else.

When the locale encoding is UTF-8, invoking a path constructor with a range of char is a supported well-defined way to construct a path object from a UTF-8 string; the behavior is the same as u8path.

The only use case for u8path is when the locale encoding is not UTF-8 and the programmer has UTF-8 data held in a range of char. I don't think the standard should encourage that. A solution in the spirit of Corentin's P2626 (charN_t incremental adoption: Casting pointers of UTF character types)<https://wg21.link/p2626> offers a better approach for projects in this situation by requiring explicit code to convert/cast/validate the input. (Thank you Corentin, I had intended to mention your paper in my earlier response and then forgot to do so).

Tom.

Nicole

On Nov 29, 2022, at 07:45, Tom Honermann <tom_at_[hidden]> wrote:

Sorry for the delay in responding.

u8path was deprecated with the adoption of P0482R6<https://wg21.link/p0482r6>. I confirmed that I neglected to include motivation for its deprecation in that paper. The closest the paper gets to such motivation is in the discussion of u8path in the Motivation<https://wg21.link/p0482r6#motivation> section:

To accommodate UTF-8 encoded text, the file system library specifies the following factory functions. Matching factory functions are not provided for other encodings.

template <class Source>
path u8path(const Source& source);
template <class InputIterator>
path u8path(InputIterator first, InputIterator last);

The requirement to construct path objects using one interface for UTF-8 strings vs another interface for all other supported encodings creates unnecessary difficulties for portable code. Consider an application that uses UTF-8 as its internal encoding on POSIX systems, but uses UTF-16 on Windows. Conditional compilation or other abstractions must be implemented and used in otherwise platform neutral code to construct path objects.

The original motivation for deprecation was that u8path was only added because the path constructor, per [fs.path.type.cvt]<http://eel.is/c++draft/fs.path.type.cvt>, already specified different behavior for construction via a range of char; u8path therefore provided redundant functionality once char8_t was added.

I think deprecation is still justified on design grounds. The standard currently associates the following encodings with char:

  1. The ordinary literal encoding ([lex.ccon.literal]<http://eel.is/c++draft/tab:lex.ccon.literal>, [lex.string.literal]<http://eel.is/c++draft/tab:lex.string.literal>) used for character and string literals.
  2. The execution character set ([character.seq.general]p(1.2)<http://eel.is/c++draft/library#character.seq.general-1.2>) used for the locale dependent execution environment.
  3. The multibyte character encoding ([c.mb.wcs]<http://eel.is/c++draft/c.mb.wcs>, C: 5.2.1.1 Multibyte characters) which is effectively the encoding of the execution character set.
  4. The native encoding ([fs.path.type.cvt]p1<http://eel.is/c++draft/fs.path.type.cvt#1>) used for path names.

Though the standard doesn't require it, the intent is that these encodings are all compatible. In practice, they do get out of sync; the locale of the execution environment is not generally known when encoding character and string literals and filesystem encoding may differ from the locale dependent encoding.

Adding an additional association with UTF-8 creates a deeper division. We know that programmers have a hard time maintaining encoding expectations; mojibake remains a common occurrence. From a design perspective, if we endorse continued use of u8path, should we also add char-based UTF-8 specific variants of std::basic_string, std::char_traits, and std::ctype? It isn't clear to me that path names are sufficiently special to warrant special interfaces; particularly when most filesystems in use today (NTFS being a partial exception) do not require a particular encoding (most just require a specific value for the '/' and '\0' characters). As we seek to add more Unicode features to the standard library, should we add UTF-8 based interfaces for char and char8_t (and unsigned char since some projects use that for UTF-8)? I think the standard should avoid further muddying the waters of what encoding(s) char should be associated with.

Tom.

On 11/29/22 1:08 AM, Daniel Krügler wrote:

On Tue, 29 Nov 2022 at 05:32, Nicole Mazzuca <Nicole.Mazzuca_at_[hidden]> wrote:

I'd point out that the exact same issue exists with path(u8string), we've just made life more painful for people who do need to convert utf-8 to paths. (i.e., Windows people).

Nicole

Thanks for all the feedback, Nicole, Steve, and Casey. I will now open
an LWG issue about this.

- Daniel

Received on 2022-11-29 17:31:33