On 8/27/25 10:22 AM, Jason McKesson via Std-Proposals wrote:

On Wed, Aug 27, 2025 at 4:33 AM zxuiji via Std-Proposals
<std-proposals@lists.isocpp.org> wrote:

Correct me if I'm wrong but isn't the purpose of the char8/16/32_t types not to guarantee the encoding used but that the types are unsigned and big enough for encodings using the respective number of bits so that string literals like u8"...", u"..." and U"..." can map to a consistent type rather than the inconsistent wchar_t? If so then what's the issue? The types don't stop arbitrary bytes in files being read as X encoding, only convey to the compiler that you'll be working with that many bits at a time, making it easier to process the encoding in the code.

When originally added in C11 and C++11, char16_t and char32_t character and string literals had an implementation-defined encoding despite the clear intent that they were added for UTF-16 and UTF-32 support. This was changed for C++20 by P1041R4 (Make char16_t/char32_t string literals be UTF-16/32) and for C23 by N2728 (char16_t & char32_t string literals shall be UTF-16 & UTF-32); use of UTF-16 and UTF-32 is now specified by each standard. char8_t was added in C++20 via P0482 (char8_t: A type for UTF-8 characters and strings) and char8_t character and string literals have always been specified as UTF-8.
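
As a small illustration of what that guarantee buys, here is a sketch that checks at compile time how many code units each literal form produces for U+1F600. The helper name `code_units` is made up for this example and is not a standard facility; the assertions assume a compiler that encodes these literals as UTF-8/16/32, which C++20 requires.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// U+1F600 lies outside the Basic Multilingual Plane.
static_assert(code_units(u8"\U0001F600") == 4);  // four UTF-8 code units
static_assert(code_units(u"\U0001F600") == 2);   // one UTF-16 surrogate pair
static_assert(code_units(U"\U0001F600") == 1);   // one UTF-32 code unit

// The UTF-16 literal holds the expected surrogate pair.
static_assert(u"\U0001F600"[0] == 0xD83D);  // high surrogate
static_assert(u"\U0001F600"[1] == 0xDE00);  // low surrogate
```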

While you *can* put whatever data you want into them, the assumption
when using such types is that they represent valid data within that
encoding. If you pass `std::filesystem::path`'s constructor a
`char16_t const*`, it will assume that the string is a valid
UTF-16-encoded string and undefined behavior will result if it is not.

That statement is a slight overgeneralization. A function may certainly have a precondition that a char16_t string contain well-formed UTF-16 text, but it could also follow Unicode guidance and report an error or substitute a replacement character for invalid code unit sequences. The C++ standard currently fails to state preconditions for well-formed encoded text which is ... something we should fix.
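
To make the two alternatives to undefined behavior concrete, here is a minimal sketch of each for UTF-16: one function that reports ill-formed input and one that substitutes U+FFFD for each unpaired surrogate. All function names are hypothetical and written for this message; nothing here is a standard library API or how any particular implementation of `std::filesystem::path` behaves.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

constexpr bool is_high_surrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
constexpr bool is_low_surrogate(char16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

// Option 1 (hypothetical): report an error.
// Returns true only if every surrogate is part of a high/low pair.
constexpr bool is_well_formed_utf16(std::u16string_view s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_high_surrogate(s[i])) {
            if (i + 1 == s.size() || !is_low_surrogate(s[i + 1])) return false;
            ++i;  // skip the low surrogate that completes the pair
        } else if (is_low_surrogate(s[i])) {
            return false;  // low surrogate with no preceding high surrogate
        }
    }
    return true;
}

// Option 2 (hypothetical): substitute U+FFFD for each unpaired surrogate.
inline std::u16string replace_ill_formed(std::u16string_view s) {
    std::u16string out;
    out.reserve(s.size());
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t c = s[i];
        if (is_high_surrogate(c) && i + 1 < s.size() && is_low_surrogate(s[i + 1])) {
            out += c;        // keep the well-formed pair as is
            out += s[++i];
        } else if (is_high_surrogate(c) || is_low_surrogate(c)) {
            out += u'\uFFFD';  // one replacement per unpaired surrogate
        } else {
            out += c;
        }
    }
    // Each unpaired surrogate becomes exactly one code unit, so length is preserved.
    assert(out.size() == s.size());
    return out;
}
```

Either function gives the caller defined behavior for every possible `char16_t` sequence; a function that instead states well-formedness as a precondition could use the first one in a contract check.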


The types themselves don't "guarantee" anything, but all of the
functions and constructs that consume or generate them *do* make such
guarantees/requirements. `u8` literals *will* be in UTF-8 or you get a
compile error. Functions that take `char32_t`s should be expected to
fail if you pass an invalid codepoint. Etc.

In general, yes. It is also possible to construct UTF literals that contain invalid code unit sequences. For example, u8'\xFF' and u8"\xFF is not a valid code unit". Such literals are valid, but passing them to some functions might violate preconditions. The validity of such literals was clarified by P2029 (Proposed resolution for core issues 411, 1656, and 2333; numeric and universal character escapes in character and string literals).
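
A short sketch of that point: both literals compile, and the first code unit is 0xFF, a value that can never appear in well-formed UTF-8. The helper `never_valid_in_utf8` is a hypothetical name used only for this illustration; it is not a full UTF-8 validator, it only flags the code unit values that are invalid in every position.

```cpp
// Hypothetical helper: true for code unit values that cannot occur anywhere
// in well-formed UTF-8 (0xC0, 0xC1, and 0xF5 through 0xFF).
constexpr bool never_valid_in_utf8(unsigned char u) {
    return u == 0xC0 || u == 0xC1 || u >= 0xF5;
}

// Both literals are accepted by the compiler; the numeric escape sets the
// code unit value directly and is not checked against the encoding.
static_assert(static_cast<unsigned char>(u8'\xFF') == 0xFF);
static_assert(static_cast<unsigned char>(u8"\xFF is not a valid code unit"[0]) == 0xFF);

// A function with a well-formedness precondition could reject such input.
static_assert(never_valid_in_utf8(
    static_cast<unsigned char>(u8"\xFF is not a valid code unit"[0])));
static_assert(!never_valid_in_utf8(static_cast<unsigned char>(u8"ok"[0])));
```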

Tom.