On Wed, Aug 27, 2025 at 4:33 AM zxuiji via Std-Proposals <std-proposals@lists.isocpp.org> wrote:

Correct me if I'm wrong but isn't the purpose of the char8/16/32_t types not to guarantee the encoding used but that the types are unsigned and big enough for encodings using the respective amount of bits so that string literals like u8"...", u"..." and U"..." can map to a consistent type rather than the inconsistent wchar_t? If so then what's the issue? The types don't stop arbitrary bytes in files being read as X encoding, only convey to the compiler that you'll be working with that many bytes at a time, making it easier to process the encoding in the code.

When originally added in C11 and C++11, char16_t
and char32_t character and string
literals had an implementation-defined encoding despite the clear
intent that they were added for UTF-16 and UTF-32 support. This
was changed for C++20 by P1041R4
(Make char16_t/char32_t string literals be UTF-16/32) and
for C23 by N2728
(char16_t & char32_t string literals shall be UTF-16 &
UTF-32) and use of UTF-16 and UTF-32 is now specified by
each standard. char8_t was added in C++20 via P0482 (char8_t: A type for UTF-8
characters and strings) and char8_t
character and string literals have always been specified as UTF-8.
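
As a rough illustration of what the specified encodings mean in practice, the following sketch counts the code units that each literal form produces for U+1F600, a code point outside the Basic Multilingual Plane. The helper name code_units is made up for this example; it is not a standard facility.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N - 1; }

// U+1F600 needs four UTF-8 code units, two UTF-16 code units
// (a surrogate pair), and a single UTF-32 code unit.
static_assert(code_units(u8"\U0001F600") == 4, "UTF-8");
static_assert(code_units(u"\U0001F600") == 2, "UTF-16");
static_assert(code_units(U"\U0001F600") == 1, "UTF-32");

// The UTF-16 literal holds the surrogate pair D83D DE00.
static_assert(u"\U0001F600"[0] == 0xD83D, "high surrogate");
static_assert(u"\U0001F600"[1] == 0xDE00, "low surrogate");
```

These assertions rely on the literals being encoded as UTF-8, UTF-16, and UTF-32, which is what C++20 specifies.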
While you *can* put whatever data you want into them, the assumption when using such types is that they represent valid data within that encoding. If you pass `std::filesystem::path`'s constructor a `char16_t const*`, it will assume that the string is a valid UTF-16-encoded string and undefined behavior will result if it is not.

That statement is a slight overgeneralization. A function may certainly have a precondition that a char16_t string contain well-formed UTF-16 text, but it could also follow Unicode guidance and report an error or substitute a replacement character for invalid code unit sequences. The C++ standard currently fails to state preconditions for well-formed encoded text, which is something we should fix.

The types themselves don't "guarantee" anything, but all of the functions and constructs that consume or generate them *do* make such guarantees/requirements. `u8` literals *will* be in UTF-8 or you get a compile error. Functions that take `char32_t`s should be expected to fail if you pass an invalid codepoint. Etc.
In general, yes. It is also possible to construct UTF literals
that contain invalid code unit sequences. For example, u8'\xFF' and
u8"\xFF is not a valid code unit". Such literals are valid, but
passing them to some functions might violate preconditions. The
validity of such literals was clarified by P2029 (Proposed resolution for
core issues 411, 1656, and 2333; numeric and universal character
escapes in character and string literals).
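
To make that concrete, here is a minimal sketch showing that such a literal compiles while still failing a simple well-formedness check. The names never_valid_in_utf8 and has_never_valid_byte are invented for this example, and the check is deliberately partial: it only flags bytes that can never occur in well-formed UTF-8 and is not a full validator. The template accepts char8_t arrays (C++20) as well as the char arrays that u8 literals produced before C++20.

```cpp
#include <cstddef>

// The bytes 0xC0, 0xC1, and 0xF5 through 0xFF never occur in
// well-formed UTF-8.
constexpr bool never_valid_in_utf8(unsigned char b) {
    return b == 0xC0 || b == 0xC1 || b >= 0xF5;
}

// Hypothetical helper: true if any code unit of the literal (excluding
// the terminating null) can never appear in well-formed UTF-8.
// This is a partial check, not a complete UTF-8 validator.
template <typename CharT, std::size_t N>
constexpr bool has_never_valid_byte(const CharT (&s)[N]) {
    for (std::size_t i = 0; i + 1 < N; ++i) {
        if (never_valid_in_utf8(static_cast<unsigned char>(s[i]))) {
            return true;
        }
    }
    return false;
}

// The compiler accepts the literal: one code unit plus the null.
static_assert(sizeof(u8"\xFF") == 2, "literal is accepted");
// Yet its content is not well-formed UTF-8.
static_assert(has_never_valid_byte(u8"\xFF"), "0xFF is never valid");
// A literal written with a universal character name is unaffected.
static_assert(!has_never_valid_byte(u8"caf\u00E9"), "C3 A9 is fine");
```

A function with a well-formedness precondition could run a check along these lines, or a complete one, before deciding whether to report an error or substitute a replacement character.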
Tom.