Date: Sat, 30 Aug 2025 15:17:35 -0700
> On Aug 30, 2025, at 12:42 PM, Tiago Freire via Std-Proposals <std-proposals_at_[hidden]> wrote:
>
> I hope you don't mind, I'm going to migrate the topic to a new thread since this is diverging from floating point type aliasing, which is a different thing from Unicode support.
>
> I think there is an important detail that is overlooked here:
>
>> This is why I advocate for char8_t over char16_t for functions.
>
> What char8_t or char16_t functions?
>
> As far as I know, there aren't many APIs that are even text, much less Unicode.
> Sure, you have file system paths and std::cout, but those are not Unicode; there are no "char8_t or char16_t" in this domain, even if we like to pretend that it is.
Please stop saying things like this; it simply is not true for all platforms. It _may_ be true for platforms you are familiar with, but that does not mean “all platforms”.
File system names, and paths, are UTF-8 in APFS, which has been the default filesystem for all Darwin systems for (per Wikipedia) eight years - and the only one for system partitions.
In addition to the macOS filesystems being case-insensitive by default (with the Unicode definition of case/capitalisation), all APFS filesystems, case-sensitive or not, are normalization-insensitive, i.e. file names and paths made of the same characters (grapheme clusters) are the same regardless of which normalization form the code points are in.
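To make that concrete, here is a minimal sketch (mine, assuming an APFS volume) of what normalization-insensitivity means for a plain char* path - the NFC and NFD spellings of the same grapheme cluster name the same file:

    #include <filesystem>
    #include <fstream>
    #include <iostream>

    int main() {
        // "café" with precomposed U+00E9 (NFC): bytes 63 61 66 c3 a9
        const char* nfc = "caf\xc3\xa9.txt";
        // "café" as 'e' + combining acute U+0301 (NFD): bytes 63 61 66 65 cc 81
        const char* nfd = "cafe\xcc\x81.txt";

        std::ofstream(nfc) << "hello\n";   // create the file via the NFC spelling

        // On APFS both spellings name the same file, so this prints 1;
        // on a byte-oriented filesystem like ext4 they are different names
        // and it prints 0.
        std::cout << std::filesystem::exists(nfd) << '\n';
    }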
All char* APIs are assumed to be UTF-8 and treated as such (as described below, all display APIs other than the old, ABI-fixed Foundation ones have no mechanism to handle UTF-16 directly).
The fact that C and C++ treat these strings as binary blobs, and don’t provide any support for actual character iteration, is moot - they all hit the underlying OS APIs, which treat them as UTF-8.
For example, Swift can interoperate with C++ to some extent, and I believe it does understand things like std::string; if you look at a std::string in Swift, it will be presented as a sequence of Characters, i.e. extended grapheme clusters (I did look up the correct name since last night :D ) - again, the fact that C++ treats these as a stream of bytes is moot.
If you pass a std::string or char* or whatever you want to the filesystem APIs, it will be treated as UTF-8. Using the old UTF-16 APIs - when they are supported at all - results in re-encoding to UTF-8, as the filesystems don’t know UTF-16.
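Concretely (a trivial sketch of mine, for a POSIX-flavoured platform like macOS or Linux), std::filesystem does no re-encoding on the C++ side either; the UTF-8 bytes you hand it are exactly the bytes the OS receives:

    #include <cassert>
    #include <filesystem>
    #include <string>

    int main() {
        // "🐶.txt" as raw UTF-8 bytes (f0 9f 90 b6) - exactly what the shell
        // hands the program and what open(2) and friends will be given.
        std::string utf8 = "\xf0\x9f\x90\xb6.txt";

        std::filesystem::path p{utf8};
        // On POSIX, path::value_type is char and no conversion is performed,
        // so the native representation is byte-for-byte what we passed in.
        assert(p.native() == utf8);
        assert(p.string() == utf8);
    }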
> You have some text conversion facilities, the functions to convert between encodings, those are fine, the standard can deal with those without a problem.
>
> If it's not one of those 3 categories or similar (ex. program arguments, environment variables, debug symbols; which don't exist in unicode), frankly speaking I don't want text in my API's.
Yes, they do. I can happily write `some_program 🐶🐶🐶🐶🐶` (and the higher-level APIs use things like [String], so the raw blobs are not exposed anyway); those arguments will happily go through the entire system, and when they reach non-archaic interfaces those interfaces correctly treat them as Unicode.
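For illustration, a toy program of mine that just dumps argv: nothing in the path from the shell to main() re-encodes anything, and the dogs arrive as raw UTF-8 bytes on these platforms.

    #include <cstdio>

    int main(int argc, char** argv) {
        for (int i = 1; i < argc; ++i) {
            // Each 🐶 shows up as the four UTF-8 bytes f0 9f 90 b6.
            for (const char* p = argv[i]; *p; ++p)
                std::printf("%02x ", static_cast<unsigned>(static_cast<unsigned char>(*p)));
            std::printf(" <- argv[%d] = %s\n", i, argv[i]);
        }
        return 0;
    }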
I think the core issue here is that you are interpreting C and C++’s archaic, pre-“oh, there are characters beyond those relevant for EBCDIC and ASCII” APIs, and the subsequent addition of UCS-2 (all of which incorrectly assume one code point per character), as meaning that the rest of the platform is oblivious to this.
There are many, many places where the C and C++ languages and standard libraries just treat data as binary blobs, with no actual understanding of the meaning of the bytes, but we don’t then say “therefore that meaning does not exist”.
Let’s consider the existing wchar_t and related APIs: they are not specified to be UCS-2 or UTF-16; that only appears to be the case on Windows. On macOS (which I can test locally) and Linux (via godbolt), at least, wchar_t is a 32-bit type, so presumably holds full Unicode scalar values, not UCS-2/UTF-16 code units.
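That is trivial to check:

    #include <cstdint>
    #include <cstdio>
    #include <cwchar>

    int main() {
        // glibc and Apple's libc: sizeof(wchar_t) == 4, WCHAR_MAX == 0x7fffffff,
        // i.e. a wchar_t can hold any Unicode scalar value.
        // MSVC: sizeof(wchar_t) == 2, WCHAR_MAX == 0xffff, i.e. UTF-16 code units.
        std::printf("sizeof(wchar_t) = %zu, WCHAR_MAX = %#jx\n",
                    sizeof(wchar_t), static_cast<std::uintmax_t>(WCHAR_MAX));
    }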
I was going to say that while UTF-8 is displayed correctly in the terminal, UTF-16 comes out as garbage, but I could not work out how to make any API even attempt to print char16_t/UTF-16 strings. Happily, wchar_t still suffices to demonstrate the (lack of) support: without manually changing the terminal mode (which then seems to break all other printing unless you toggle the mode back \o/), the scalar output breaks.
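For what it’s worth, there is no char16_t analogue of std::cout/std::wcout in the standard library, and since C++20 streaming a char16_t string into std::cout is ill-formed (the overloads are deleted), which is presumably why I could not get anything to even attempt it. The wchar_t case I was poking at looks roughly like this (assuming a UTF-8 terminal):

    #include <clocale>
    #include <iostream>
    #include <locale>

    int main() {
        // With the default "C" locale, wcout typically cannot re-encode
        // non-ASCII scalars for the terminal and the stream just fails.
        // Adopting the user's (UTF-8) locale first is the "mode change":
        std::setlocale(LC_ALL, "");
        std::wcout.imbue(std::locale(""));
        std::wcout << L"\U0001F436\U0001F436\U0001F436" << L'\n';
        // Note: once stdout has been used for wide output, mixing in narrow
        // output (std::cout, printf) is no longer reliable - hence the
        // "breaks all other printing" effect.
        return 0;
    }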
> I don't know of the problem of which you speak in which the standard should provide preference of one over the other.
> Can you be more concrete here?
The issue that I am finding in this thread is that you seem to be advocating migrating C++ from its current UTF-8 default (again, the fact that C++ believes codepoint == character is irrelevant) to UTF-16, or the addition of new UTF-16(-by-default?) APIs, despite the widespread understanding that UTF-16 is always the wrong call *except* when ABI stability requires it.
In that regard it seems analogous to arguing for EBCDIC as the standard representation due to widespread support and use of it by existing platforms, even though it was clear that everyone had moved, or was moving, to ASCII.
Or take my comments on Swift: rather than taking that as an example of *another* systems language being UTF-8 by design, you have clearly gone searching for the specific purpose of finding _any_ UTF-16 usage, no matter how far removed from anything the OS or user would (or could) see, presumably so you can point to it and say “see? It’s all UTF-16”. A lot of Swift’s runtime was based on existing OS code, or C[++] libraries, as it takes time to reimplement that code in a new language, even if doing so has advantages over the existing implementations. If you had done this kind of code spelunking back in the days of Swift 1.0 you would quite possibly (I do not know - I write C++ compilers) have found that lots of, or all of, the implementation of String just forwarded to NSString (or libicu or something), and at that point you could have said “look, it’s all UTF-16” - despite that never being visible to any user of the language, Swift _never_ exposing a UTF-16 interface, and it never being intended to be such, with the knowledge that over time it would all be rewritten in Swift purely as UTF-8.
I think you need to be clear here in explaining exactly what it is you want: like I said, it sounds like you are saying C++ should be using UTF-16 everywhere by default, but it could be that you are just arguing that it needs to maintain support for char16_t/wchar_t blobs.
—Oliver
Received on 2025-08-30 22:17:49