Date: Wed, 27 Aug 2025 09:47:21 +0100
Correct me if I'm wrong, but isn't the purpose of the char8/16/32_t types
not to guarantee the encoding used, but to guarantee that the types are
unsigned and big enough for encodings using the respective number of bits,
so that string literals like u8"...", u"..." and U"..." can map to a
consistent type rather than the inconsistent wchar_t? If so, then what's
the issue? The types don't stop arbitrary bytes in files being read as X
encoding; they only convey to the compiler that you'll be working with
code units of that size, making it easier to process the encoding in the code.
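
For example, a minimal C++20 sketch of what I mean (the names here are
just illustrative):

    #include <type_traits>

    // Each UTF literal prefix maps to one fixed element type on every
    // implementation; wchar_t's width and encoding vary by platform
    // (16-bit on Windows, 32-bit on most Unix targets).
    static_assert(std::is_same_v<decltype(u8'a'), char8_t>);
    static_assert(std::is_same_v<decltype(u'a'),  char16_t>);
    static_assert(std::is_same_v<decltype(U'a'),  char32_t>);

    const char8_t*  s8  = u8"text";  // UTF-8 code units
    const char16_t* s16 = u"text";   // UTF-16 code units
    const char32_t* s32 = U"text";   // UTF-32 code units
    const wchar_t*  sw  = L"text";   // platform-dependent width and encoding
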
On Tue, 26 Aug 2025 at 19:32, Tom Honermann via Std-Proposals <
std-proposals_at_[hidden]> wrote:
> On 8/23/25 10:21 PM, Thiago Macieira wrote:
>
> On Saturday, 23 August 2025 19:55:14 Central Daylight Time Tom Honermann
> wrote:
>
> As you know, there is a long history of C++ implementations that do not
> target ASCII-based platforms. The typical C++ developer does not use
> these implementations nor directly write code for their respective
> target platforms, but these implementations remain relevant to the
> computing ecosystem and general C++ marketplace. char8_t (and char16_t
> and char32_t) enable code that reads/writes text in UTF encodings to be
> portable across all C++ implementations. It was never the expectation
> that all code would be migrated to char8_t.
>
> Hello Tom
>
> My problem is how char8_t was introduced. It took a feature that was useful --
> the ability to create a UTF-8-encoded string literal from an arbitrarily-
> encoded source file -- and made it useless by making it incompatible with 100%
> of existing software.
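>
> A minimal sketch of the kind of code this breaks (valid C++17, ill-formed
> in C++20 because the u8 literal's element type changed to char8_t):
>
>     #include <string>
>
>     const char* greeting = u8"hello";   // OK in C++17, error in C++20
>     std::string name     = u8"world";   // likewise breaks in C++20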
>
> The encoding of a source file has never been relevant for determining the
> encoding of a string literal (UTF-8 or otherwise).
>
> Interpreting a source file as having an encoding other than what it was
> authored with may affect the code units observed in a string literal. This
> is standard mojibake and is not solvable (in general) by any means.
>
> Admittedly, I don't know if the bigger problem is non-UTF8 sources or non-
> UTF8 execution environments. From my experience in the past 20 years of C++,
> it is the former, but I can't say that my experience is representative of the
> general case.
>
> From a standardization perspective, it is the latter. The C++ standard has
> the most strict requirements of any C++ project since all targets supported
> by all implementations must be considered.
>
> The lack of proper support for the charN_t encodings in the C++
> standard library is acknowledged. We are working towards improving the
> situation, but progress has been slow. I wish I had more time/energy to
> personally devote to such improvements.
>
> That's true, but while unfortunate, we can live with the issue. We have and
> have had the transcoding functions for 25 years. Right now, Qt 6 still has a
> hybrid approach where UTF-16 is represented at the same time by QChar,
> char16_t, and ushort arrays, with the latter diminishing as char16_t use
> increases.
>
> We still don't have a standardized interface to transcoding functions that
> covers all encodings the standard has to consider (the environment encoding
> (environment variables, command line arguments, stdin/stdout/stderr), the
> current locale encoding, the console encoding (for Windows), filename
> encoding (for Windows, kind of), ordinary literal encoding, wide literal
> encoding, UTF-8, UTF-16, UTF-32). I think that is where we have the most
> pain at the moment.
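>
> For comparison, what the standard library does offer today is per-code-unit
> conversion between the current locale encoding and char16_t/char32_t via
> <cuchar>; a minimal sketch (assuming a target where __STDC_UTF_16__ is
> defined, so char16_t output is UTF-16):
>
>     #include <clocale>
>     #include <cstdio>
>     #include <cuchar>
>
>     int main() {
>         std::setlocale(LC_ALL, "");   // adopt the environment's locale encoding
>         const char* in = "text";      // bytes in that locale encoding
>         std::mbstate_t state{};
>         char16_t c16;
>         // Convert the first multibyte character to a UTF-16 code unit.
>         std::size_t n = std::mbrtoc16(&c16, in, 4, &state);
>         std::printf("consumed %zu byte(s), code unit 0x%04X\n", n,
>                     (unsigned) c16);
>     }
>
> Nothing in that family addresses the console, filename, or literal
> encodings, which is the gap being described.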
>
> Can you elaborate regarding your observed increased use of char16_t? Even
> anecdotal data would be interesting. I haven't seen any data one way or the
> other regarding use of char16_t.
>
> There are a few programming models actively deployed in the C++
> ecosystem for handling of UTF-8 today.
>
> * char is assumed to be some implementation-defined character
> encoding, probably something derived from ASCII though not
> necessarily UTF-8, but is also used for UTF-8. Programs that use
> this model must track character encoding themselves and, in general,
> don't do a great job of it. This is the historic programming model
> and remains the dominant model (note that modern POSIX
> implementations still respect locale settings that affect the choice
> of character encoding).
>
> This is what we do in Qt.
>
>
> * char is assumed to be some implementation-defined character
> encoding, char8_t (or wchar_t or char16_t) is used as an internal
> encoding with transcoding performed at program boundaries (system,
> Win32, I/O, etc...). This is the traditional programming model for
> Windows. char8_t provides an alternative to wchar_t with portable
> semantics.
>
> If you remove any mentions of char8_t from this, it becomes the first model. In
> my experience, this is not what Windows uses because there is no char8_t API
> anywhere. In order to effectively use UTF-8 on Windows at all, you must opt
> into your third model and make char be UTF-8.
>
> This is what many Windows applications have historically used with wchar_t
> substituted for char8_t.
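>
> As a rough sketch of that model (wchar_t held internally, transcoding done
> at the Win32 boundary; the from_acp helper name is just illustrative):
>
>     #ifdef _WIN32
>     #include <windows.h>
>     #include <string>
>
>     // Illustrative helper: convert bytes in the active code page to the
>     // program's internal wchar_t representation at the API boundary.
>     std::wstring from_acp(const char* bytes, int len) {
>         int n = MultiByteToWideChar(CP_ACP, 0, bytes, len, nullptr, 0);
>         std::wstring out(n, L'\0');
>         MultiByteToWideChar(CP_ACP, 0, bytes, len, out.data(), n);
>         return out;
>     }
>     #endif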
>
> * char is assumed to be UTF-8. This is very common on Linux and macOS.
> Support for this has improved in the Windows ecosystem, but rough
> corners remain. Note that compiling with MSVC's /utf-8 option is not
> sufficient by itself to enable an assumption of UTF-8 for text held
> in char-based storage; the environment encoding, console encoding,
> and locale region settings also have to be considered.
>
> Indeed, and Microsoft's slow uptake on this is annoying, even if completely
> understandable due to the massive legacy it is dealing with.
>
> Agreed.
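>
> For what it's worth, the /utf-8 point is easy to demonstrate: even with
> UTF-8 literal and execution encodings, what the console shows still
> depends on the console code page. A minimal sketch:
>
>     #ifdef _WIN32
>     #include <windows.h>
>     #endif
>     #include <cstdio>
>
>     int main() {
>     #ifdef _WIN32
>         // Without this, the UTF-8 bytes below may be interpreted through
>         // the legacy console code page and show up as mojibake.
>         SetConsoleOutputCP(CP_UTF8);
>     #endif
>         std::puts("\xC3\xA9");   // "é" encoded as UTF-8 bytes
>     }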
>
> These all remain valid programming models and the right choice remains
> dependent on project goals. The C++ ecosystem is not a monoculture.
>
> I'm sorry, but are they?
>
> Aside from the Windows 8-bit-is-ANSI case, are there any non-UTF8 environments
> where C++ is still relevant? See the discussion on the 8-bit byte from a few
> weeks ago, where some people are arguing that there are no systems nor will
> there be any systems of relevance to C++29 and later where that assumption
> fails to hold. So I have to ask: is anyone still deploying C++23 or later on a
> system where char isn't UTF-8, aside from Windows?
>
> Jens already stated that we have participation in WG21 from people that
> support EBCDIC systems. But I can add some more specifics.
>
> IBM's z/OS is the primary EBCDIC platform that we talk about in SG16. I
> understand that there are other EBCDIC platforms, but I'm not familiar with
> them. There are three relevant C++ compilers for z/OS:
>
> - IBM provides two C and C++ compilers as part of their IBM C/C++ for
> z/OS <https://www.ibm.com/products/xl-cpp-compiler-zos> product.
> - xlC is the historic C and C++ compiler and it has not seen language
> updates since before C++11. While this compiler is maintained, I don't
> expect it to be updated for any recent C++ standards.
> - xlclang is a fork of LLVM/Clang that is actively developed. At
> present, support is only claimed for C++17 but I don't know if that is due
> to missing language or library features. IBM has been contributing changes
> to LLVM/Clang at a steady rate; see PRs here
> <https://github.com/llvm/llvm-project/pulls?q=is%3Apr+in%3Atitle+z%2FOS+>.
> These include some minimal support for EBCDIC; see PRs here
> <https://github.com/llvm/llvm-project/pulls?q=is%3Apr+in%3Atitle+EBCDIC+is%3Aclosed>.
> There is also a PR
> <https://github.com/llvm/llvm-project/pull/138895> for support of
> the -fexec-charset option and corresponding RFCs (here
> <https://discourse.llvm.org/t/rfc-enabling-fexec-charset-support-to-llvm-and-clang-reposting/71512>
> and here
> <https://discourse.llvm.org/t/rfc-adding-a-charset-converter-to-the-llvm-support-library/69795>).
> Unfortunately, some of these PRs have been awaiting code review approval
> for a long time (it's on my todo list). I don't know what the state of
> LLVM/Clang native support for z/OS currently is, but my expectation is that
> LLVM/Clang will eventually have full native support for z/OS with EBCDIC.
> - Dignus provides a C and C++ compiler for z/OS called Systems/C++
> <http://www.dignus.com/dcxx/whatsnew.html>. It is an LLVM-based
> compiler but I don't know if it is based on Clang. Their most recent
> release is from November 2024. A compiler manual is available here
> <http://www.dignus.com/dcxx/syscxx.pdf> and it claims support for
> C++17.
>
> Though it's entirely possible that Windows still being an exception renders my
> question moot and we must support arbitrary char encodings anyway.
>
> Indeed, there is a reason that Microsoft hasn't made UTF-8 the default.
>
> In any case, that doesn't require a char8_t type to exist. We could reduce to
> two cases: char is arbitrarily encoded (and the encoding information is kept
> out of band somewhere) or char is UTF8. That's what we've lived with for 20
> years since the transition to UTF8 began, and I don't see char8_t's existence
> helping move the needle here. From my experience, it's hurt more than helped.
>
> I understand, and as I said, continuing to develop for those programming
> models remains, and will remain, viable. char8_t solves a number of
> problems within the C++ standard and provides a means for projects that
> want to use the type system to enforce encoding concerns to do so. It is
> not necessary, nor realistic, for all C++ projects to adopt use of
> char8_t.
>
> We do have work to do to improve interoperability between char and char8_t
> though. P2626 (charN_t incremental adoption: Casting pointers of UTF
> character types) <https://wg21.link/p2626> discusses one such approach.
> That proposal still needs work to determine what is and is not viable for
> non-aliasing types. Assuming we do identify an appropriate solution, the
> underlying support could also enable such interoperability between the
> various floating point types discussed earlier in this email thread.
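>
> To illustrate the gap, a minimal sketch of what the current aliasing rules
> do and do not allow (the function is just illustrative):
>
>     #include <string>
>
>     void example() {
>         std::u8string u8 = u8"text";
>         // OK today: char may alias any object, so the UTF-8 code units
>         // can be read through a char pointer.
>         const char* bytes = reinterpret_cast<const char*>(u8.c_str());
>         (void) bytes;
>
>         std::string s = "text";
>         // Not OK in general: accessing char storage through a char8_t
>         // glvalue is undefined behaviour under the current rules.
>         // const char8_t* u = reinterpret_cast<const char8_t*>(s.c_str());
>     }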
>
> Tom.
> --
> Std-Proposals mailing list
> Std-Proposals_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
>
Received on 2025-08-27 08:33:10