Date: Tue, 26 Aug 2025 14:32:25 -0400
On 8/23/25 10:21 PM, Thiago Macieira wrote:
> On Saturday, 23 August 2025 19:55:14 Central Daylight Time Tom Honermann
> wrote:
>> As you know, there is a long history of C++ implementations that do not
>> target ASCII-based platforms. The typical C++ developer does not use
>> these implementations nor directly write code for their respective
>> target platforms, but these implementations remain relevant to the
>> computing ecosystem and general C++ marketplace. char8_t (and char16_t
>> and char32_t) enable code that reads/writes text in UTF encodings to be
>> portable across all C++ implementations. It was never the expectation
>> that all code would be migrated to char8_t.
> Hello Tom
>
> My problem is how char8_t was introduced. It took a feature that was useful --
> the ability to create a UTF-8-encoded string literal from an arbitrarily-
> encoded source file -- and made it useless by making it incompatible with 100%
> of existing software.
The encoding of a source file has never been relevant for determining
the encoding of a string literal (UTF-8 or otherwise).
Interpreting a source file as having an encoding other than what it was
authored with may affect the code units observed in a string literal.
This is standard mojibake and is not solvable (in general) by any means.
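For illustration, here is a minimal sketch (using universal-character-names so
that the source file encoding does not even come into play) of how the code
units of a u8 literal are fixed by the standard independently of the ordinary
literal encoding:

    #include <cassert>

    int main() {
        // The code units of a u8 literal are mandated to be UTF-8 regardless
        // of the ordinary ("execution") literal encoding selected for plain
        // char literals (e.g. via MSVC's /execution-charset or GCC's
        // -fexec-charset).
        const char8_t* s = u8"\u00E9";   // U+00E9 LATIN SMALL LETTER E WITH ACUTE
        assert(s[0] == char8_t(0xC3) && s[1] == char8_t(0xA9) && s[2] == u8'\0');

        // A plain literal with the same character carries whatever the
        // ordinary literal encoding dictates; on a non-UTF-8 target it may
        // have a different length and different code unit values.
        const char* t = "\u00E9";
        (void)t;
    }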
>
> Admittedly, I don't know if the bigger problem is non-UTF8 sources or non-
> UTF8 execution environments. From my experience in the past 20 years of C++,
> it is the former, but I can't say that my experience is representative of the
> general case.
From a standardization perspective, it is the latter. The C++ standard
has the strictest requirements of any C++ project, since all targets
supported by all implementations must be considered.
>
>> The lack of proper support for the char/N/_t encodings in the C++
>> standard library is acknowledged. We are working towards improving the
>> situation, but progress has been slow. I wish I had more time/energy to
>> personally devote to such improvements.
> That's true, but while unfortunate, we can live with the issue. We have and
> have had the transcoding functions for 25 years. Right now, Qt 6 still has a
> hybrid approach where UTF-16 is represented at the same time by QChar,
> char16_t, and ushort arrays, with the latter diminishing as char16_t
> increases.
We still don't have a standardized interface to transcoding functions
that covers all of the encodings the standard has to consider:
  * the environment encoding (environment variables, command line
    arguments, stdin/stdout/stderr),
  * the current locale encoding,
  * the console encoding (for Windows),
  * the filename encoding (for Windows, kind of),
  * the ordinary literal encoding,
  * the wide literal encoding,
  * UTF-8, UTF-16, and UTF-32.
I think that is where we have the most pain at the moment.
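To make the pain concrete, here is a rough sketch of what transcoding from the
current locale encoding to UTF-16 looks like with only the facilities the
standard provides today (<cuchar>'s std::mbrtoc16); the function name is
illustrative and error handling is deliberately simplified:

    #include <cstddef>
    #include <cuchar>       // std::mbrtoc16
    #include <cwchar>       // std::mbstate_t
    #include <string>
    #include <string_view>

    // Transcode a string from the current locale's multibyte encoding to
    // UTF-16 using only what the standard library offers today. Invalid or
    // incomplete sequences just stop the conversion.
    std::u16string to_utf16(std::string_view in) {
        std::u16string out;
        std::mbstate_t state{};
        const char* p = in.data();
        std::size_t remaining = in.size();
        while (remaining != 0) {
            char16_t c16;
            std::size_t rc = std::mbrtoc16(&c16, p, remaining, &state);
            if (rc == std::size_t(-1) || rc == std::size_t(-2))
                break;                       // invalid or incomplete sequence
            if (rc == std::size_t(-3)) {     // low surrogate from the prior call
                out.push_back(c16);
                continue;
            }
            if (rc == 0)
                break;                       // converted an embedded null; stop
            out.push_back(c16);
            p += rc;
            remaining -= rc;
        }
        return out;
    }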
Can you elaborate on the increased use of char16_t that you have observed?
Even anecdotal data would be interesting. I haven't seen any data one
way or the other regarding use of char16_t.
>
>> There are a few programming models actively deployed in the C++
>> ecosystem for handling of UTF-8 today.
>>
>> * char is assumed to be some implementation-defined character
>> encoding, probably something derived from ASCII though not
>> necessarily UTF-8, but is also used for UTF-8. Programs that use
>> this model must track character encoding themselves and, in general,
>> don't do a great job of it. This is the historic programming model
>> and remains the dominant model (note that modern POSIX
>> implementations still respect locale settings that affect the choice
>> of character encoding).
> This is what we do in Qt.
>
>> * char is assumed to be some implementation-defined character
>> encoding, char8_t (or wchar_t or char16_t) is used as an internal
>> encoding with transcoding performed at program boundaries (system,
>> Win32, I/O, etc...). This is the traditional programming model for
>> Windows. char8_t provides an alternative to wchar_t with portable
>> semantics.
> If you remove any mentions of char8_t from this, it becomes the first model. In
> my experience, this is not what Windows uses because there is no char8_t API
> anywhere. In order to effectively use UTF-8 on Windows at all, you must opt
> into your third model and make char be UTF-8.
This is the model that many Windows applications have historically
used, with wchar_t in place of char8_t.
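For concreteness, a minimal (Windows-specific) sketch of that
boundary-transcoding style, with wchar_t as the internal encoding and the
conversion done where "ANSI"-encoded char data enters the program; the
function name is illustrative and error handling is reduced to the bare
minimum:

    #include <cstddef>
    #include <string>
    #include <windows.h>

    // Transcode text that arrives at a program boundary in the active "ANSI"
    // code page into the program's internal wide (UTF-16) representation.
    std::wstring from_acp(const std::string& in) {
        if (in.empty())
            return {};
        int len = MultiByteToWideChar(CP_ACP, MB_ERR_INVALID_CHARS,
                                      in.data(), static_cast<int>(in.size()),
                                      nullptr, 0);
        if (len <= 0)
            return {};
        std::wstring out(static_cast<std::size_t>(len), L'\0');
        MultiByteToWideChar(CP_ACP, MB_ERR_INVALID_CHARS,
                            in.data(), static_cast<int>(in.size()),
                            out.data(), len);
        return out;
    }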
>
>> * char is assumed to be UTF-8. This is very common on Linux and macOS.
>> Support for this has improved in the Windows ecosystem, but rough
>> corners remain. Note that compiling with MSVC's /utf-8 option is not
>> sufficient by itself to enable an assumption of UTF-8 for text held
>> in char-based storage; the environment encoding, console encoding,
>> and locale region settings also have to be considered.
> Indeed, and Microsoft's slow uptake on this is annoying, even if completely
> understandable due to the massive legacy it is dealing with.
Agreed.
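As a rough sketch of what I mean (the exact set of steps depends on the
application, so treat this as illustrative rather than exhaustive), a /utf-8
build that wants to treat char as UTF-8 at the console typically still has to
opt in at runtime:

    #include <clocale>
    #include <cstdio>
    #include <windows.h>

    int main() {
        // /utf-8 only fixes the source and literal encodings. The console and
        // the CRT locale still default to a legacy code page, so a program
        // that wants to read and write UTF-8 through char typically also opts
        // in at runtime:
        SetConsoleOutputCP(CP_UTF8);          // console output code page
        SetConsoleCP(CP_UTF8);                // console input code page
        std::setlocale(LC_ALL, ".UTF-8");     // UCRT UTF-8 locale (Windows 10 1803+)
        std::puts("hållo, wörld");            // emitted as UTF-8 with /utf-8
    }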
>
>> These all remain valid programming models and the right choice remains
>> dependent on project goals. The C++ ecosystem is not a monoculture.
> I'm sorry, but are they?
>
> Aside from the Windows 8-bit-is-ANSI case, are there any non-UTF8 environments
> where C++ is still relevant? See the discussion on the 8-bit byte from a few
> weeks ago, where some people are arguing that there are no systems nor will
> there be any systems of relevance to C++29 and later where that assumption
> fails to hold. So I have to ask: is anyone still deploying C++23 or later on a
> system where char isn't UTF-8, aside from Windows?
Jens already stated that we have participation in WG21 from people who
support EBCDIC systems. But I can add some more specifics.
IBM's z/OS is the primary EBCDIC platform that we talk about in SG16. I
understand that there are other EBCDIC platforms, but I'm not familiar
with them. There are three relevant C++ compilers for z/OS:
* IBM provides two C and C++ compilers as part of their IBM C/C++ for
z/OS <https://www.ibm.com/products/xl-cpp-compiler-zos> product.
o xlC is the historic C and C++ compiler and it has not seen
language updates since before C++11. While this compiler is
maintained, I don't expect it to be updated for any recent C++
standards.
o xlclang is a fork of LLVM/Clang that is actively developed. At
present, support is only claimed for C++17 but I don't know if
that is due to missing language or library features. IBM has
been contributing changes to LLVM/Clang at a steady rate; see
PRs here
<https://github.com/llvm/llvm-project/pulls?q=is%3Apr+in%3Atitle+z%2FOS+>.
These include some minimal support for EBCDIC; see PRs here
<https://github.com/llvm/llvm-project/pulls?q=is%3Apr+in%3Atitle+EBCDIC+is%3Aclosed>.
There is also a PR
<https://github.com/llvm/llvm-project/pull/138895> for support
of the -fexec-charset option and corresponding RFCs (here
<https://discourse.llvm.org/t/rfc-enabling-fexec-charset-support-to-llvm-and-clang-reposting/71512>
and here
<https://discourse.llvm.org/t/rfc-adding-a-charset-converter-to-the-llvm-support-library/69795>).
Unfortunately, some of these PRs have been awaiting code review
approval for a long time (it's on my todo list). I don't know
what the state of LLVM/Clang native support for z/OS currently
is, but my expectation is that LLVM/Clang will eventually have
full native support for z/OS with EBCDIC.
* Dignus provides a C and C++ compiler for z/OS called Systems/C++
<http://www.dignus.com/dcxx/whatsnew.html>. It is an LLVM-based
compiler but I don't know if it is based on Clang. Their most recent
release is from November 2024. A compiler manual is available here
<http://www.dignus.com/dcxx/syscxx.pdf> and it claims support for C++17.
>
> Though it's entirely possible that Windows being still an exception renders my
> question moot and we must support arbitrary char encodings anyway.
Indeed, there is a reason that Microsoft hasn't made UTF-8 the default.
>
> In any case, that doesn't require a char8_t type to exist. We could reduce to
> two cases: char is arbitrarily encoded (and the encoding information is kept
> out of band somewhere) or char is UTF8. That's what we've lived with for 20
> years since the transition to UTF8 began, and I don't see char8_t's existence
> helping move the needle here. From my experience, it's hurt more than helped.
I understand, and as I said, continuing to develop for those programming
models remains, and will remain, viable. char8_t solves a number of
problems within the C++ standard and provides a means for projects that
want to use the type system to enforce encoding concerns to do so. It is
not necessary, nor realistic, for all C++ projects to adopt use of char8_t.
We do have work to do to improve interoperability between char and
char8_t though. P2626 (charN_t incremental adoption: Casting pointers of
UTF character types) <https://wg21.link/p2626> discusses one such
approach. That proposal still needs work to determine what is and is not
viable for non-aliasing types. Assuming we do identify an appropriate
solution, the underlying support could also enable such interoperability
between the various floating point types discussed earlier in this email
thread.
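To illustrate the asymmetry that proposal is trying to address (this is not
the P2626 design, just a sketch of the status quo):

    #include <string>
    #include <string_view>

    void example() {
        // Direction 1: viewing char8_t data through char is well-defined
        // today because char may alias any object type.
        std::u8string u8 = u8"text";
        std::string_view as_char{reinterpret_cast<const char*>(u8.data()),
                                 u8.size()};

        // Direction 2: viewing char data as char8_t is not blessed by the
        // aliasing rules (char8_t has no aliasing exemption), so code that
        // knows its char buffer holds UTF-8 generally has to copy instead:
        std::string s = "text";
        std::u8string u8copy(s.begin(), s.end());   // copies the code units

        (void)as_char; (void)u8copy;
    }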
Tom.
Received on 2025-08-26 18:32:29