Date: Sat, 23 Aug 2025 21:21:38 -0500
On Saturday, 23 August 2025 19:55:14 Central Daylight Time Tom Honermann
wrote:
> As you know, there is a long history of C++ implementations that do not
> target ASCII-based platforms. The typical C++ developer does not use
> these implementations nor directly write code for their respective
> target platforms, but these implementations remain relevant to the
> computing ecosystem and general C++ marketplace. char8_t (and char16_t
> and char32_t) enable code that reads/writes text in UTF encodings to be
> portable across all C++ implementations. It was never the expectation
> that all code would be migrated to char8_t.
Hello Tom,
My problem is how char8_t was introduced. It took a feature that was useful --
the ability to create a UTF-8-encoded string literal from an arbitrarily-
encoded source file -- and made it useless by making it incompatible with 100%
of existing software.
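To illustrate, a minimal sketch of the break (take_utf8 is a made-up stand-in
for any existing char-based interface):

    #include <string>

    void take_utf8(const char *s);   // any pre-existing char-based API

    void demo() {
        take_utf8(u8"text");         // OK in C++17: const char[] literal
        std::string s = u8"text";    // both calls are ill-formed since
    }                                // C++20: u8"..." is const char8_t[]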
Admittedly, I don't know whether the bigger problem is non-UTF-8 sources or
non-UTF-8 execution environments. From my experience in the past 20 years of
C++, it is the former, but I can't say that my experience is representative of
the general case.
> The lack of proper support for the char/N/_t encodings in the C++
> standard library is acknowledged. We are working towards improving the
> situation, but progress has been slow. I wish I had more time/energy to
> personally devote to such improvements.
That's true, but while unfortunate, we can live with the issue. We have had
transcoding functions for 25 years. Right now, Qt 6 still has a hybrid
approach where UTF-16 is represented at the same time by QChar, char16_t, and
ushort arrays, with the ushort usage diminishing as char16_t usage increases.
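For the record, a rough sketch of that hybrid, where the same UTF-16 payload
is reachable through all three types (fromUtf16 and constData are real Qt 6
API; the casts lean on QChar being a 16-bit type, and ushort is Qt's typedef
for unsigned short):

    #include <QString>

    void demo() {
        QString s = QString::fromUtf16(u"hello");    // char16_t input
        const QChar *viaQChar = s.constData();       // QChar view
        const char16_t *viaC16 =
            reinterpret_cast<const char16_t *>(s.constData());
        const ushort *viaUShort =                    // the legacy view
            reinterpret_cast<const ushort *>(s.constData());
    }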
> There are a few programming models actively deployed in the C++
> ecosystem for handling of UTF-8 today.
>
> * char is assumed to be some implementation-defined character
> encoding, probably something derived from ASCII though not
> necessarily UTF-8, but is also used for UTF-8. Programs that use
> this model must track character encoding themselves and, in general,
> don't do a great job of it. This is the historic programming model
> and remains the dominant model (note that modern POSIX
> implementations still respect locale settings that affect the choice
> of character encoding).
This is what we do in Qt.
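That is, the encoding of char data is discovered out of band, for example
from the POSIX locale you mention. A minimal sketch (nl_langinfo is POSIX,
so this doesn't apply to Windows):

    #include <clocale>
    #include <cstdio>
    #include <langinfo.h>

    int main() {
        std::setlocale(LC_ALL, "");      // adopt the environment locale
        std::printf("char text is encoded as %s\n", nl_langinfo(CODESET));
    }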
> * char is assumed to be some implementation-defined character
> encoding, char8_t (or wchar_t or char16_t) is used as an internal
> encoding with transcoding performed at program boundaries (system,
> Win32, I/O, etc...). This is the traditional programming model for
> Windows. char8_t provides an alternative to wchar_t with portable
> semantics.
If you remove any mentions of char8_t from this, it becomes the first model. In
my experience, this is not what Windows uses because there is no char8_t API
anywhere. In order to effectively use UTF-8 on Windows at all, you must opt
into your third model and make char be UTF-8.
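To be clear, with wchar_t the model does work on Windows; here's a sketch of
the boundary transcoding (MultiByteToWideChar is real Win32 API, the wrapper
itself is mine):

    #include <windows.h>
    #include <string>

    std::wstring widen_utf8(const std::string &utf8) {
        int n = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                    int(utf8.size()), nullptr, 0);
        std::wstring out(n, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, utf8.data(), int(utf8.size()),
                            out.data(), n);
        return out;
    }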
> * char is assumed to be UTF-8. This is very common on Linux and macOS.
> Support for this has improved in the Windows ecosystem, but rough
> corners remain. Note that compiling with MSVC's /utf-8 option is not
> sufficient by itself to enable an assumption of UTF-8 for text held
> in char-based storage; the environment encoding, console encoding,
> and locale region settings also have to be considered.
Indeed, and Microsoft's slow uptake on this is annoying, even if completely
understandable given the massive legacy it is dealing with.
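As a sketch of the gap you describe: /utf-8 fixes the literal encoding at
compile time, but the console code page is a separate runtime setting (both
calls below are real Win32):

    #include <windows.h>
    #include <cstdio>

    int main() {
        SetConsoleOutputCP(CP_UTF8);   // UTF-8 console output
        SetConsoleCP(CP_UTF8);         // UTF-8 console input
        std::puts("h\xC3\xA9llo");     // now renders as héllo
    }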
> These all remain valid programming models and the right choice remains
> dependent on project goals. The C++ ecosystem is not a monoculture.
I'm sorry, but are they?
Aside from the Windows 8-bit-is-ANSI case, are there any non-UTF-8
environments where C++ is still relevant? See the discussion on the 8-bit byte
from a few weeks ago, where some people argued that there are, and will be, no
systems of relevance to C++29 and later where that assumption fails to hold.
So I have to ask: is anyone still deploying C++23 or later on a system where
char isn't UTF-8, aside from Windows?
Though it's entirely possible that Windows still being an exception renders my
question moot and we must support arbitrary char encodings anyway.
In any case, that doesn't require a char8_t type to exist. We could reduce to
two cases: char is arbitrarily encoded (and the encoding information is kept
out of band somewhere) or char is UTF-8. That's what we've lived with for the
20 years since the transition to UTF-8 began, and I don't see char8_t's
existence helping move the needle here. From my experience, it has hurt more
than helped.
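If it helps, here's the kind of thing I mean by keeping the encoding out of
band, as a purely hypothetical sketch (none of these names exist anywhere):

    #include <string>

    enum class Encoding { Utf8, Other };

    struct TaggedText {
        std::string bytes;      // char-based storage, as today
        Encoding    encoding;   // the out-of-band information
    };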
-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
  Principal Engineer - Intel Platform & System Engineering