ISOCPP sg16 List: Re: [EXTERNAL] Re: [isocpp-lib] Why have we deprecated filesystem::u8path?

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 29 Nov 2022 10:45:51 -0500

Sorry for the delay in responding.

u8path was deprecated with the adoption of P0482R6
<https://wg21.link/p0482r6>. I confirmed that I neglected to include
motivation for its deprecation in that paper. The closest the paper gets
to such motivation is in the discussion of u8path in the Motivation
<https://wg21.link/p0482r6#motivation> section:

    To accommodate UTF-8 encoded text, the file system library specifies
    the following factory functions. Matching factory functions are not
    provided for other encodings.

        template <class Source>
          path u8path(const Source& source);
        template <class InputIterator>
          path u8path(InputIterator first, InputIterator last);

    The requirement to construct path objects using one interface for
    UTF-8 strings vs another interface for all other supported encodings
    creates unnecessary difficulties for portable code. Consider an
    application that uses UTF-8 as its internal encoding on POSIX
    systems, but uses UTF-16 on Windows. Conditional compilation or
    other abstractions must be implemented and used in otherwise
    platform neutral code to construct path objects.

The original motivation for deprecation was that u8path was only added
because the path constructor, per [fs.path.type.cvt]
<http://eel.is/c++draft/fs.path.type.cvt>, already specified different
behavior for construction from a range of char (conversion from the
native narrow encoding rather than from UTF-8); once char8_t was added
and the constructor could distinguish UTF-8 input by type, u8path
provided redundant functionality.

I think deprecation is still justified on design grounds. The standard
currently associates the following encodings with char:

1. The /ordinary literal encoding/ ([lex.ccon.literal]
    <http://eel.is/c++draft/tab:lex.ccon.literal>, [lex.string.literal]
    <http://eel.is/c++draft/tab:lex.string.literal>) used for character
    and string literals.
2. The /execution character set/ ([character.seq.general]p(1.2)
    <http://eel.is/c++draft/library#character.seq.general-1.2>) used for
    the locale-dependent execution environment.
3. The multibyte character encoding ([c.mb.wcs]
    <http://eel.is/c++draft/c.mb.wcs>, C: 5.2.1.1 Multibyte characters)
    which is effectively the encoding of the /execution character set/.
4. The /native encoding/ ([fs.path.type.cvt]p1
    <http://eel.is/c++draft/fs.path.type.cvt#1>) used for path names.

Though the standard doesn't require it, the intent is that these
encodings are all compatible. In practice, they do get out of sync: the
locale of the execution environment is not generally known when
character and string literals are encoded, and the filesystem encoding
may differ from the locale-dependent encoding.

Adding an additional association with UTF-8 creates a deeper division.
We know that programmers have a hard time maintaining encoding
expectations; mojibake remains a common occurrence. From a design
perspective, if we endorse continued use of u8path, should we also add
char-based UTF-8 specific variants of std::basic_string,
std::char_traits, and std::ctype? It isn't clear to me that path names
are sufficiently special to warrant special interfaces, particularly
when most filesystems in use today (NTFS being a partial exception) do
not require a particular encoding (most just require a specific value
for the '/' and '\0' characters). As we seek to add more Unicode
features to the standard library, should we add UTF-8 based interfaces
for char and char8_t (and unsigned char since some projects use that for
UTF-8)? I think the standard should avoid further muddying the waters of
what encoding(s) char should be associated with.

Tom.

On 11/29/22 1:08 AM, Daniel Krügler wrote:
> On Tue, 29 Nov 2022 at 05:32, Nicole Mazzuca
> <Nicole.Mazzuca_at_[hidden]> wrote:
>> I'd point out that the exact same issue exists with path(u8string); we've just made life more painful for people who do need to convert UTF-8 to paths (i.e., Windows people).
>>
>> Nicole
> Thanks for all the feedback, Nicole, Steve, and Casey. I will now open
> an LWG issue about this.
>
> - Daniel

Received on 2022-11-29 15:45:53