Date: Sun, 31 Aug 2025 12:34:22 +0100
On Sun, 31 Aug 2025 at 11:56, Tiago Freire <tmiguelf_at_[hidden]> wrote:
> Just so I don't leave this point unanswered:
>
> >I'd add window text to that list since that's a large part of what
> software is developed for.
>
> But the C++ standard doesn't know what a "window" is, and it's not likely
> to ever dictate how to render one. So C++ will not define the encoding of
> such APIs.
>
>
> >Maybe shader code too, not too sure on that one but I assume shader code
> compilers expect utf-8
>
> The C++ standard doesn't know what a shader is either. And that would fall
> under the discretion of your shader compiler, not C++.
>
>
> A similar comment applies to the Qt APIs; they are not part of the C++
> standard.
>
> And this is what I'm getting at. This looks like a discussion about the
> C++ standard, but if you look closely at the details, it isn't; it really
> isn't.
>
>
>
>
> ------------------------------
> *From:* zxuiji <gb2985_at_[hidden]>
> *Sent:* Sunday, August 31, 2025 9:52:03 AM
> *To:* std-proposals_at_[hidden] <std-proposals_at_[hidden]>
> *Cc:* Oliver Hunt <oliver_at_[hidden]>; Tiago Freire <tmiguelf_at_[hidden]>
> *Subject:* Re: [std-proposals] char8_t aliasing and Unicode
>
> On Sun, 31 Aug 2025 at 08:47, Tiago Freire via Std-Proposals <
> std-proposals_at_[hidden]> wrote:
>
> Too much to distill here.
>
> Let's start with something simple.
> Do we agree that the interfaces we are talking about are either:
> 1. file system
> 2. terminal interaction
> and nothing else?
>
>
>
>
> ------------------------------
> *From:* Oliver Hunt <oliver_at_[hidden]>
> *Sent:* Sunday, August 31, 2025 12:17:54 AM
> *To:* std-proposals_at_[hidden] <std-proposals_at_[hidden]>
> *Cc:* Tiago Freire <tmiguelf_at_[hidden]>
> *Subject:* Re: [std-proposals] char8_t aliasing and Unicode
>
>
>
> On Aug 30, 2025, at 12:42 PM, Tiago Freire via Std-Proposals <
> std-proposals_at_[hidden]> wrote:
>
> I hope you don't mind, I'm going to migrate the topic to a new thread
> since this is diverging from floating point type aliasing, which is a
> different thing from Unicode support.
>
> I think there is an important detail that is overlooked here:
>
> This is why I advocate for char8_t over char16_t for functions.
>
>
> What char8_t or char16_t functions?
>
> As far as I know, there aren't many APIs that are even text, much less
> Unicode.
> Sure, you have file system paths and std::cout, but those are not Unicode;
> there is no "char8_t or char16_t" in this domain, even if we like to
> pretend that there is.
>
>
> Please stop saying things like this, it simply is not true for all
> platforms. It _may_ be true for platforms you are familiar with, but that
> does not mean “all platforms”.
>
> File system names and paths are utf8 in APFS, which has been the default
> filesystem for all Darwin systems for (per Wikipedia) 8 years - and the
> only one for system partitions.
>
> In addition to the macOS filesystems being case insensitive by default -
> with the *unicode* definition of case/capitalisation - all apfs
> filesystems (case sensitive or case insensitive) are normalization
> insensitive, i.e. file names and paths made of the same characters
> (grapheme clusters) are the same regardless of Unicode normalization form.
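>
> A minimal sketch of what that means in practice (assuming the working
> directory is on APFS; on a byte-exact filesystem such as ext4 the second
> open would fail):
>
>     #include <cstdio>
>     #include <fstream>
>     #include <string>
>
>     int main() {
>         std::string nfc = "caf\xC3\xA9";   // "café", precomposed U+00E9 (NFC)
>         std::string nfd = "cafe\xCC\x81";  // "café", 'e' + combining acute (NFD)
>
>         std::ofstream(nfc) << "hello\n";   // create the file via its NFC spelling
>         bool ok = static_cast<bool>(std::ifstream(nfd));  // reopen via NFD
>         std::printf("NFD open %s\n", ok ? "succeeded" : "failed");
>     }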
>
> All char* apis are assumed to be utf8, and treated as such (described
> below: all display APIs other than the old ABI-fixed foundation ones do
> not have a mechanism to handle utf16 directly).
>
> The fact that C and C++ treat these strings as binary blobs, and don’t
> provide any support for actual character iteration, is moot - they all hit
> the underlying OS APIs that treat them as utf8.
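>
> To make the "binary blob" point concrete, here is a rough sketch of the
> kind of code-point iteration the standard library doesn't give you
> (illustrative only - no validation of continuation bytes or overlong
> forms):
>
>     #include <cstdio>
>
>     // Decode one UTF-8 code point starting at s; returns bytes consumed.
>     static int decode_one(const unsigned char* s, unsigned* cp) {
>         if (s[0] < 0x80)        { *cp = s[0]; return 1; }
>         if ((s[0] >> 5) == 0x6) { *cp = (s[0] & 0x1F) << 6 | (s[1] & 0x3F); return 2; }
>         if ((s[0] >> 4) == 0xE) { *cp = (s[0] & 0x0F) << 12 | (s[1] & 0x3F) << 6
>                                        | (s[2] & 0x3F); return 3; }
>         *cp = (s[0] & 0x07) << 18 | (s[1] & 0x3F) << 12
>             | (s[2] & 0x3F) << 6 | (s[3] & 0x3F);
>         return 4;
>     }
>
>     int main() {
>         const char dog[] = "\xF0\x9F\x90\xB6";  // U+1F436 DOG FACE in utf8
>         unsigned cp;
>         int n = decode_one(reinterpret_cast<const unsigned char*>(dog), &cp);
>         std::printf("%d bytes -> U+%04X\n", n, cp);  // 4 bytes -> U+1F436
>     }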
>
> For example, swift can interoperate with C++ to some extent, and I believe
> it does understand things like std::string: if you look at a std::string
> in swift, it will be presented as a sequence of Characters - i.e. extended
> grapheme clusters (I did look up the correct name since last night :D ) -
> again, the fact that C++ treats these as a stream of bytes is moot.
>
> If you pass a std::string or char* or whatever you want to the filesystem
> APIs, they will be treated as utf-8. Using old utf16 APIs - when supported
> at all - results in re-encoding to utf8, as the filesystems don’t know
> utf16.
>
> You have some text conversion facilities - the functions to convert
> between encodings - and those are fine; the standard can deal with those
> without a problem.
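>
> For reference, a minimal sketch of one such standard facility,
> std::mbrtoc32 from <cuchar>, converting one utf8-encoded code point to a
> char32_t (it assumes a utf8 locale named "en_US.UTF-8" is installed, since
> the multibyte side of mbrtoc32 is locale-dependent):
>
>     #include <clocale>
>     #include <cstdio>
>     #include <cstring>
>     #include <cuchar>
>
>     int main() {
>         std::setlocale(LC_ALL, "en_US.UTF-8");  // pick a utf8 locale
>         const char dog[] = "\xF0\x9F\x90\xB6";  // U+1F436 as utf8 bytes
>         char32_t c = 0;
>         std::mbstate_t st{};
>         // Returns the number of bytes consumed, or (size_t)-1 on error.
>         std::size_t n = std::mbrtoc32(&c, dog, std::strlen(dog), &st);
>         std::printf("consumed %zu bytes -> U+%04X\n", n, (unsigned)c);
>     }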
>
>
> If it's not one of those 3 categories or similar (e.g. program arguments,
> environment variables, debug symbols, which don't exist in unicode),
> frankly speaking I don't want text in my APIs.
>
>
> Yes they do. I can happily write `some_program 🐶🐶🐶🐶🐶` — and the
> higher level APIs use things like [String] so the raw blobs are not exposed
> anyway — and they will happily go through the entire system and when they
> reach non-archaic interfaces those interfaces correctly treat them as
> unicode.
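>
> As a quick illustration (hypothetical program name, any utf8 system), the
> emoji argument arrives in argv as plain utf8 bytes:
>
>     #include <cstdio>
>
>     // Run as: ./dump_arg 🐶 - prints "F0 9F 90 B6", the utf8 bytes of U+1F436.
>     int main(int argc, char** argv) {
>         if (argc > 1)
>             for (const char* p = argv[1]; *p; ++p)
>                 std::printf("%02X ", (unsigned)(unsigned char)*p);
>         std::printf("\n");
>     }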
>
> I think the core issue here is that you are interpreting C and C++’s
> archaic pre-“oh there are characters beyond those relevant for EBCDIC and
> ASCII” APIs, and the subsequent addition of UCS2 - which *all* just
> incorrectly assume one code point per character - as meaning that the rest
> of the platform is oblivious to this.
>
> There are many many places in C and C++ where the language, and the
> standard libraries, just treat data as binary blobs and similar with no
> actual understanding of the meaning of the bytes, but we don’t then say
> “therefore that meaning does not exist”.
>
> Let’s consider the existing wchar_t and related APIs: they are not
> specified to be ucs2 or utf16 - that only appears to be the case on
> windows. On macOS (I can test locally) and linux (via godbolt) at least
> they’re 32-bit values, so are presumably full unicode scalars, not
> ucs2/utf16.
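>
> A quick way to check is a one-line probe (typically prints 2 / 65535 on
> windows, where wchar_t holds utf16 code units, and 4 / 2147483647 on macOS
> and linux, where one wchar_t can hold any unicode scalar value):
>
>     #include <cstdint>
>     #include <cstdio>
>     #include <cwchar>
>
>     int main() {
>         std::printf("sizeof(wchar_t) = %zu, WCHAR_MAX = %ju\n",
>                     sizeof(wchar_t), (std::uintmax_t)WCHAR_MAX);
>     }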
>
> I was going to say that while utf8 is displayed correctly in the terminal,
> utf16 comes out as garbage, but I could not work out how to make any API
> even attempt to print char16_t/utf16 strings. Happily wchar_t still
> suffices to demonstrate the (lack of) support: without manually changing
> the terminal mode (which then seems like it breaks all other printing
> unless you toggle the mode back \o/), the scalar output breaks.
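>
> For anyone who wants to reproduce that, a minimal sketch (glibc/macOS
> behaviour; the setlocale call is the "mode change" in question - without
> it, wide output of a non-ASCII scalar typically fails in the default "C"
> locale):
>
>     #include <clocale>
>     #include <cwchar>
>
>     int main() {
>         // Comment this line out and the wprintf below fails (returns a
>         // negative value) on glibc; with it, the scalar is re-encoded to
>         // the terminal's (usually utf8) encoding.
>         std::setlocale(LC_ALL, "");
>         std::wprintf(L"%lc\n", (std::wint_t)0x1F436);  // U+1F436 DOG FACE
>     }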
>
>
> I don't know of the problem of which you speak in which the standard
> should provide preference of one over the other.
> Can you be more concrete here?
>
>
> The issue that *I* am finding in this thread is that you seem to be
> advocating for migrating C++ from its current utf8 default - again, the
> fact that C++ believes codepoint==character is irrelevant - to utf16, or
> the addition of new utf16 (by default?) APIs, despite the widespread
> understanding that utf16 is always the wrong call *except* when ABI
> stability requires it.
>
> In that regard it seems analogous to arguing for EBCDIC as the standard
> representation due to widespread support and use of it by existing
> platforms, even though it was clear that everyone had moved, or was moving,
> to ASCII.
>
> Or take my comments on Swift: rather than taking that as an example of
> *another* system language being utf8 by design, you have clearly gone
> searching for the specific purpose of finding _any_ utf16 usage, no matter
> how far removed from anything the OS or user would (or could) see,
> presumably to point to such and say “see? It’s all utf16”. A lot of
> swift’s runtime was based on existing OS code, or C[++] libraries, as it
> takes time to reimplement that code in a new language, even if doing so
> has advantages over the existing implementations. If you were to have done
> this kind of code spelunking back in the days of Swift 1.0 you would quite
> possibly (I do not know - I write C++ compilers) have found lots of/all of
> the implementation of String just forwarded to NSString (or libicu or
> something), and at that point been able to say “look, it’s all utf16”
> despite that never being visible to any user of the language, swift
> _never_ exposing a utf16 interface or being intended to expose one, and
> with the knowledge that over time it would all be rewritten in swift
> purely as utf8.
>
> I think you need to be clear here in explaining exactly what it is you
> want: like I said, it sounds like you are saying C++ should be using utf16
> everywhere by default, but it could be that you are just arguing that it
> needs to maintain support for char16_t/wchar_t blobs.
>
> —Oliver
>
>
>
> --
> Std-Proposals mailing list
> Std-Proposals_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
>
>
> I'd add window text to that list since that's a large part of what
> software is developed for. Maybe shader code too; I'm not too sure on that
> one, but I assume shader compilers expect utf-8.
>
>
I mentioned them because the C/C++ standard libraries are used in
conjunction with windowing and shader APIs too often for those to be
ignored when deciding things, such as text encodings, that will frequently
be used with them. The standard library doesn't need any notion of windows
or shaders for the standard to be designed with compatibility with them in
mind.
Received on 2025-08-31 11:20:12