Date: Wed, 27 Aug 2025 12:00:10 -0400
On 8/27/25 10:22 AM, Jason McKesson via Std-Proposals wrote:
> On Wed, Aug 27, 2025 at 4:33 AM zxuiji via Std-Proposals
> <std-proposals_at_[hidden]> wrote:
>> Correct me if I'm wrong, but isn't the purpose of the char8/16/32_t types not to guarantee the encoding used, but to guarantee that the types are unsigned and big enough for encodings using the respective number of bits, so that string literals like u8"...", u"..." and U"..." can map to a consistent type rather than the inconsistent wchar_t? If so, then what's the issue? The types don't stop arbitrary bytes in files being read as X encoding; they only convey to the compiler that you'll be working with that many bits at a time, making it easier to process the encoding in the code.
When originally added in C11 and C++11, char16_t and char32_t character
and string literals had an implementation-defined encoding despite the
clear intent that they were added for UTF-16 and UTF-32 support. This
was changed for C++20 by P1041R4 (Make char16_t/char32_t string literals
be UTF-16/32) <https://wg21.link/p1041r4> and for C23 by N2728 (char16_t
& char32_t string literals shall be UTF-16 & UTF-32)
<https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2728.htm> and use of
UTF-16 and UTF-32 is now specified by each standard. char8_t was added
in C++20 via P0482 (char8_t: A type for UTF-8 characters and strings)
<https://wg21.link/p0482> and char8_t character and string literals have
always been specified as UTF-8.
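For example (an illustration only, not normative wording; these
static_asserts assume a C++20 or later compiler):

// U+1F600 lies outside the BMP, so it takes a surrogate pair in
// UTF-16 but a single code unit in UTF-32.
static_assert(u"\U0001F600"[0] == 0xD83D);   // high surrogate
static_assert(u"\U0001F600"[1] == 0xDE00);   // low surrogate
static_assert(U"\U0001F600"[0] == 0x1F600);  // one UTF-32 code unit
static_assert(sizeof(u8"\u00E9") == 3);      // two UTF-8 code units + NUL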
> While you *can* put whatever data you want into them, the assumption
> when using such types is that they represent valid data within that
> encoding. If you pass `std::filesystem::path`'s constructor a
> `char16_t const*`, it will assume that the string is a valid
> UTF-16-encoded string and undefined behavior will result if it is not.
That statement is a slight overgeneralization. A function may certainly
have a precondition that a char16_t string contain well-formed UTF-16
text, but it could also follow Unicode guidance and report an error or
substitute a replacement character (U+FFFD) for invalid code unit
sequences. The C++ standard currently fails to state preconditions for
well-formed encoded text, which is ... something we should fix.
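The "report or substitute" alternative could look roughly like this (a
sketch only; the function name and shape are mine, not anything in the
standard, and it assumes C++17 for std::u16string_view):

#include <cstddef>
#include <string>
#include <string_view>

// Replace each unpaired surrogate with U+FFFD REPLACEMENT CHARACTER,
// per Unicode guidance for handling ill-formed sequences, instead of
// treating well-formed input as a hard precondition.
std::u16string sanitize_utf16(std::u16string_view in) {
    std::u16string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        char16_t c = in[i];
        if (c >= 0xD800 && c <= 0xDBFF) {        // high (leading) surrogate
            if (i + 1 < in.size()
                && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                out += c;                        // well-formed pair:
                out += in[++i];                  // keep both code units
            } else {
                out += u'\xFFFD';                // unpaired high surrogate
            }
        } else if (c >= 0xDC00 && c <= 0xDFFF) {
            out += u'\xFFFD';                    // unpaired low surrogate
        } else {
            out += c;                            // BMP scalar value
        }
    }
    return out;
}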
>
> The types themselves don't "guarantee" anything, but all of the
> functions and constructs that consume or generate them *do* make such
> guarantees/requirements. `u8` literals *will* be in UTF-8 or you get a
> compile error. Functions that take `char32_t`s should be expected to
> fail if you pass an invalid codepoint. Etc.
In general, yes. It is also possible to construct UTF literals that
contain invalid code unit sequences; for example, u8'\xFF' and u8"\xFF
is not a valid code unit" are both accepted even though the code unit
0xFF never appears in well-formed UTF-8. Such literals are valid, but
passing them to some functions might violate those functions'
preconditions. The validity of such literals was clarified by P2029
(Proposed resolution for core issues 411, 1656, and 2333; numeric and
universal character escapes in character and string literals)
<https://wg21.link/p2029>.
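Concretely (a small demonstration under the same assumption as above,
i.e. a C++20 compiler):

// Both literals are accepted even though 0xFF can never appear in
// well-formed UTF-8; whether that matters depends on the callee's
// preconditions, not on the literal itself.
constexpr char8_t lone = u8'\xFF';
constexpr const char8_t* bad = u8"\xFF is not a valid code unit";
static_assert(lone == 0xFF);
static_assert(bad[0] == 0xFF);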
Tom.