On 8/23/25 10:21 PM, Thiago Macieira wrote:

On Saturday, 23 August 2025 19:55:14 Central Daylight Time Tom Honermann 
wrote:

As you know, there is a long history of C++ implementations that do not
target ASCII-based platforms. The typical C++ developer does not use
these implementations nor directly write code for their respective
target platforms, but these implementations remain relevant to the
computing ecosystem and general C++ marketplace. char8_t (and char16_t
and char32_t) enable code that reads/writes text in UTF encodings to be
portable across all C++ implementations. It was never the expectation
that all code would be migrated to char8_t.

Hello Tom

My problem is how char8_t was introduced. It took a feature that was useful --
the ability to create a UTF-8-encoded string literal from an arbitrarily-encoded
source file -- and made it useless by making it incompatible with 100% of
existing software.

The encoding of a source file has never been relevant for determining the encoding of a string literal (UTF-8 or otherwise).

Interpreting a source file as having an encoding other than what it was authored with may affect the code units observed in a string literal. This is standard mojibake and is not solvable (in general) by any means.
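
As a minimal sketch of the two points above: the code units of a u8 literal are fixed by the language, and writing the character as a universal-character-name (here \u00E9) removes any dependence on how the source file is decoded. The helper name utf8_code_unit is hypothetical. The static_assert pair also shows the change Thiago refers to: the element type of a u8 literal is char before C++20 and char8_t from C++20 on, which is why `const char* p = u8"x";` stops compiling in C++20.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Element type of a u8 literal, with reference and cv-qualifiers removed.
using u8_literal_unit = std::decay<decltype(u8"x"[0])>::type;

#if defined(__cpp_char8_t)
// C++20 and later: u8 literals are arrays of char8_t.
static_assert(std::is_same<u8_literal_unit, char8_t>::value, "expected char8_t");
#else
// C++11 through C++17: u8 literals are arrays of char.
static_assert(std::is_same<u8_literal_unit, char>::value, "expected char");
#endif

// U+00E9 occupies two UTF-8 code units, plus the terminating null.
static_assert(sizeof(u8"\u00E9") == 3, "two code units and a terminator");

// Hypothetical helper: returns code unit i (0..2) of the literal above.
inline unsigned utf8_code_unit(std::size_t i) {
    static const auto& lit = u8"\u00E9";
    return static_cast<unsigned char>(lit[i]);
}
```

Calling utf8_code_unit(0) and utf8_code_unit(1) yields 0xC3 and 0xA9 regardless of the ordinary literal encoding in use.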


Admittedly, I don't know if the bigger problem is non-UTF8 sources or non-UTF8
execution environments. From my experience in the past 20 years of C++, it is
the former, but I can't say that my experience is representative of the general
case.

From a standardization perspective, it is the latter. The C++ standard has the strictest requirements of any C++ project since all targets supported by all implementations must be considered.

The lack of proper support for the charN_t encodings in the C++
standard library is acknowledged. We are working towards improving the
situation, but progress has been slow. I wish I had more time/energy to
personally devote to such improvements.

That's true, but while unfortunate, we can live with the issue. We have and 
have had the transcoding functions for 25 years. Right now, Qt 6 still has a 
hybrid approach where UTF-16 is represented at the same time by QChar, 
char16_t, and ushort arrays, with the latter diminishing as char16_t 
increases.

We still don't have a standardized interface to transcoding functions that covers all of the encodings the standard has to consider: the environment encoding (environment variables, command line arguments, stdin/stdout/stderr), the current locale encoding, the console encoding (for Windows), the filename encoding (for Windows, kind of), the ordinary literal encoding, the wide literal encoding, UTF-8, UTF-16, and UTF-32. I think that is where we have the most pain at the moment.

Can you elaborate regarding your observed increased use of char16_t? Even anecdotal data would be interesting. I haven't seen any data one way or the other regarding use of char16_t.

There are a few programming models actively deployed in the C++
ecosystem for handling of UTF-8 today.

  * char is assumed to be some implementation-defined character
    encoding, probably something derived from ASCII though not
    necessarily UTF-8, but is also used for UTF-8. Programs that use
    this model must track character encoding themselves and, in general,
    don't do a great job of it. This is the historic programming model
    and remains the dominant model (note that modern POSIX
    implementations still respect locale settings that affect the choice
    of character encoding).

This is what we do in Qt.

  * char is assumed to be some implementation-defined character
    encoding, char8_t (or wchar_t or char16_t) is used as an internal
    encoding with transcoding performed at program boundaries (system,
    Win32, I/O, etc...). This is the traditional programming model for
    Windows. char8_t provides an alternative to wchar_t with portable
    semantics.

If you remove any mentions of char8_t from this, it becomes the first model. In 
my experience, this is not what Windows uses because there is no char8_t API 
anywhere. In order to effectively use UTF-8 on Windows at all, you must opt 
into your third model and make char be UTF-8.

This is what many Windows applications have historically used with wchar_t substituted for char8_t.
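
A rough sketch of what "transcoding at program boundaries" looks like in this model, under a simplifying assumption: the incoming char data is already known to be UTF-8, and the internal encoding is UTF-16 held in char16_t. The function name utf8_to_utf16 is hypothetical, and a real boundary would first have to convert from whatever the environment or locale encoding actually is.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical boundary helper: decode UTF-8 held in char storage into UTF-16.
// Throws std::invalid_argument when the input is not well-formed UTF-8.
inline std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;      // code point being assembled
        char32_t min = 0;     // smallest code point allowed for this length
        std::size_t len = 0;  // sequence length in code units
        if (b0 < 0x80)                { cp = b0;        len = 1; min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min = 0x10000; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (in.size() - i < len)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char bk = static_cast<unsigned char>(in[i + k]);
            if ((bk & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (bk & 0x3F);
        }
        // Reject overlong forms, surrogates, and values beyond U+10FFFF.
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid UTF-8 code point");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // encode as a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    return out;
}
```

The same shape applies with wchar_t on Windows or char8_t as the internal type; only the target code unit type and the encoding step change.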

  * char is assumed to be UTF-8. This is very common on Linux and macOS.
    Support for this has improved in the Windows ecosystem, but rough
    corners remain. Note that compiling with MSVC's /utf-8 option is not
    sufficient by itself to enable an assumption of UTF-8 for text held
    in char-based storage; the environment encoding, console encoding,
    and locale region settings also have to be considered.

Indeed and Microsoft's slow uptake on this is annoying, even if completely 
understandable due to the massive legacy it is dealing with.

Agreed.

These all remain valid programming models and the right choice remains
dependent on project goals. The C++ ecosystem is not a monoculture.

I'm sorry, but are they?

Aside from the Windows 8-bit-is-ANSI case, are there any non-UTF8 environments 
where C++ is still relevant? See the discussion on the 8-bit byte from a few 
weeks ago, where some people are arguing that there are no systems nor will 
there be any systems of relevance to C++29 and later where that assumption 
fails to hold. So I have to ask: is anyone still deploying C++23 or later on a 
system where char isn't UTF-8, aside from Windows?

Jens already stated that we have participation in WG21 from people that support EBCDIC systems. But I can add some more specifics.

IBM's z/OS is the primary EBCDIC platform that we talk about in SG16. I understand that there are other EBCDIC platforms, but I'm not familiar with them. There are three relevant C++ compilers for z/OS.

IBM provides two C and C++ compilers as part of their IBM C/C++ for z/OS product.

  * xlC is the historic C and C++ compiler and it has not seen language updates since before C++11. While this compiler is maintained, I don't expect it to be updated for any recent C++ standards.
  * xlclang is a fork of LLVM/Clang that is actively developed. At present, support is only claimed for C++17 but I don't know if that is due to missing language or library features. IBM has been contributing changes to LLVM/Clang at a steady rate; see PRs here. These include some minimal support for EBCDIC; see PRs here. There is also a PR for support of the -fexec-charset option and corresponding RFCs (here and here). Unfortunately, some of these PRs have been awaiting code review approval for a long time (it's on my todo list). I don't know what the state of LLVM/Clang native support for z/OS currently is, but my expectation is that LLVM/Clang will eventually have full native support for z/OS with EBCDIC.

Dignus provides a C and C++ compiler for z/OS called Systems/C++. It is an LLVM-based compiler but I don't know if it is based on Clang. Their most recent release is from November, 2024. A compiler manual is available here and it claims support for C++17.
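
To make the EBCDIC concern concrete, here is a small sketch of how the ordinary literal encoding differs between targets while u8 literals do not. The function names are hypothetical, and the EBCDIC values used (0xC1 for 'A', 0x81 for 'a', 0xF0 for '0', 0x40 for space) are the ones shared by the common EBCDIC code pages; the check is a heuristic over four characters, not a full encoding test.

```cpp
#include <cassert>

// Hypothetical check: do a few ordinary literals have their ASCII values?
constexpr bool ordinary_literals_match_ascii() {
    return static_cast<unsigned char>('A') == 0x41 &&
           static_cast<unsigned char>('a') == 0x61 &&
           static_cast<unsigned char>('0') == 0x30 &&
           static_cast<unsigned char>(' ') == 0x20;
}

// Hypothetical check: do the same literals have their EBCDIC values?
constexpr bool ordinary_literals_match_ebcdic() {
    return static_cast<unsigned char>('A') == 0xC1 &&
           static_cast<unsigned char>('a') == 0x81 &&
           static_cast<unsigned char>('0') == 0xF0 &&
           static_cast<unsigned char>(' ') == 0x40;
}

// A u8 literal is UTF-8 on every target, so 'A' is always 0x41 here.
constexpr bool utf8_literal_matches_ascii() {
    return static_cast<unsigned char>(u8"A"[0]) == 0x41;
}
```

On an ASCII-based target the first function returns true and the second false; on an EBCDIC target such as z/OS the results would be expected to swap, while the third returns true on both.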


Though it's entirely possible that Windows being still an exception renders my 
question moot and we must support arbitrary char encodings anyway.

Indeed, there is a reason that Microsoft hasn't made UTF-8 the default.


In any case, that doesn't require a char8_t type to exist. We could reduce to 
two cases: char is arbitrarily encoded (and the encoding information is kept 
out of band somewhere) or char is UTF8. That's what we've lived with for 20 
years since the transition to UTF8 began, and I don't see char8_t's existence 
helping move the needle here. From my experience, it's hurt more than helped.

I understand, and as I said, continuing to develop for those programming models remains, and will remain, viable. char8_t solves a number of problems within the C++ standard and provides a means for projects that want to use the type system to enforce encoding concerns to do so. It is not necessary, nor realistic, for all C++ projects to adopt use of char8_t.

We do have work to do to improve interoperability between char and char8_t though. P2626 (charN_t incremental adoption: Casting pointers of UTF character types) discusses one such approach. That proposal still needs work to determine what is and is not viable for non-aliasing types. Assuming we do identify an appropriate solution, the underlying support could also enable such interoperability between the various floating point types discussed earlier in this email thread.
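
For illustration only, a sketch of using the type system to keep UTF-8 text apart from char text of unspecified encoding, with conversion by copying code units. This is not what P2626 proposes (that paper is about casting pointers); copying is simply the approach that is well-defined today. All names here (utf8_unit, to_utf8_units, to_char_storage, encoding_of) are hypothetical, and the enum fallback is an assumption made so the sketch also compiles before C++20.

```cpp
#include <cassert>
#include <string>
#include <vector>

#if defined(__cpp_char8_t)
using utf8_unit = char8_t;          // C++20 and later
#else
enum utf8_unit : unsigned char {};  // stand-in distinct type (assumption)
#endif

// Copy code units out of char storage that the caller asserts is UTF-8.
inline std::vector<utf8_unit> to_utf8_units(const std::string& bytes) {
    std::vector<utf8_unit> out;
    out.reserve(bytes.size());
    for (char c : bytes)
        out.push_back(static_cast<utf8_unit>(static_cast<unsigned char>(c)));
    return out;
}

// Copy UTF-8 code units back into char storage for char-based APIs.
inline std::string to_char_storage(const std::vector<utf8_unit>& units) {
    std::string out;
    out.reserve(units.size());
    for (utf8_unit u : units)
        out.push_back(static_cast<char>(static_cast<unsigned char>(u)));
    return out;
}

// Overloads can now distinguish the two kinds of text at compile time.
inline const char* encoding_of(const std::string&) { return "unspecified"; }
inline const char* encoding_of(const std::vector<utf8_unit>&) { return "UTF-8"; }
```

Passing a std::string where UTF-8 units are required is then a compile-time error instead of a run-time encoding bug, at the cost of an explicit copy at each crossing.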

Tom.