Date: Tue, 29 Nov 2022 10:45:51 -0500
Sorry for the delay in responding.
u8path was deprecated with the adoption of P0482R6
<https://wg21.link/p0482r6>. I confirmed that I neglected to include
motivation for its deprecation in that paper. The closest the paper gets
to such motivation is in the discussion of u8path in the Motivation
<https://wg21.link/p0482r6#motivation> section:
To accommodate UTF-8 encoded text, the file system library specifies
the following factory functions. Matching factory functions are not
provided for other encodings.
    template <class Source>
      path u8path(const Source& source);
    template <class InputIterator>
      path u8path(InputIterator first, InputIterator last);
The requirement to construct path objects using one interface for
UTF-8 strings vs another interface for all other supported encodings
creates unnecessary difficulties for portable code. Consider an
application that uses UTF-8 as its internal encoding on POSIX
systems, but uses UTF-16 on Windows. Conditional compilation or
other abstractions must be implemented and used in otherwise
platform neutral code to construct path objects.
The original motivation for deprecation was that u8path was only added
because the path constructor, per [fs.path.type.cvt]
<http://eel.is/c++draft/fs.path.type.cvt>, already specified different
behavior for construction via a range of char; u8path therefore provided
redundant functionality once char8_t was added.
I think deprecation is still justified on design grounds. The standard
currently associates the following encodings with char:
1. The /ordinary literal encoding/ ([lex.ccon.literal]
<http://eel.is/c++draft/tab:lex.ccon.literal>, [lex.string.literal]
<http://eel.is/c++draft/tab:lex.string.literal>) used for character
and string literals.
2. The /execution character set/ ([character.seq.general]p(1.2)
<http://eel.is/c++draft/library#character.seq.general-1.2>) used for
the locale-dependent execution environment.
3. The multibyte character encoding ([c.mb.wcs]
<http://eel.is/c++draft/c.mb.wcs>, C: 5.2.1.1 Multibyte characters)
which is effectively the encoding of the /execution character set/.
4. The /native encoding/ ([fs.path.type.cvt]p1
<http://eel.is/c++draft/fs.path.type.cvt#1>) used for path names.
Though the standard doesn't require it, the intent is that these
encodings are all compatible. In practice, they do get out of sync: the
locale of the execution environment is not generally known when
character and string literals are encoded, and the filesystem encoding
may differ from the locale-dependent encoding.
Adding an additional association with UTF-8 creates a deeper division.
We know that programmers have a hard time maintaining encoding
expectations; mojibake remains a common occurrence. From a design
perspective, if we endorse continued use of u8path, should we also add
char-based UTF-8 specific variants of std::basic_string,
std::char_traits, and std::ctype? It isn't clear to me that path names
are sufficiently special to warrant special interfaces, particularly
when most filesystems in use today (NTFS being a partial exception) do
not require a particular encoding (most just require specific values
for the '/' and '\0' characters). As we seek to add more Unicode
features to the standard library, should we add UTF-8 based interfaces
for char and char8_t (and unsigned char since some projects use that for
UTF-8)? I think the standard should avoid further muddying the waters of
what encoding(s) char should be associated with.
Tom.
On 11/29/22 1:08 AM, Daniel Krügler wrote:
> Am Di., 29. Nov. 2022 um 05:32 Uhr schrieb Nicole Mazzuca
> <Nicole.Mazzuca_at_[hidden]>:
>> I'd point out that the exact same issue exists with path(u8string), we've just made life more painful for people who do need to convert utf-8 to paths. (i.e., Windows people).
>>
>> Nicole
> Thanks for all the feedback, Nicole, Steve, and Casey. I will now open
> an LWG issue about this.
>
> - Daniel
Received on 2022-11-29 15:45:53