Re: [std-proposals] char8_t aliasing and Unicode

From: zxuiji <gb2985_at_[hidden]>
Date: Sun, 31 Aug 2025 09:06:16 +0100
On Sun, 31 Aug 2025 at 08:47, Tiago Freire via Std-Proposals <
std-proposals_at_[hidden]> wrote:

> Too much to distill here.
>
> Let's start with something simple.
> Do we agree that the interfaces we are talking about are either:
> 1. file system
> 2. terminal interaction
> and nothing else?
>
>
>
>
> ------------------------------
> *From:* Oliver Hunt <oliver_at_[hidden]>
> *Sent:* Sunday, August 31, 2025 12:17:54 AM
> *To:* std-proposals_at_[hidden] <std-proposals_at_[hidden]>
> *Cc:* Tiago Freire <tmiguelf_at_[hidden]>
> *Subject:* Re: [std-proposals] char8_t aliasing and Unicode
>
>
>
> On Aug 30, 2025, at 12:42 PM, Tiago Freire via Std-Proposals <
> std-proposals_at_[hidden]> wrote:
>
> I hope you don't mind, I'm going to migrate the topic to a new thread
> since this is diverging from floating point type aliasing, which is a
> different thing from Unicode support.
>
> I think there is an important detail that is overlooked here:
>
> This is why I advocate for char8_t over char16_t for functions.
>
>
> What char8_t or char16_t functions?
>
> As far as I know, there aren't many APIs that even deal with text, much
> less Unicode.
> Sure, you have file system paths and std::cout, but those are not
> Unicode; there is no "char8_t or char16_t" in this domain, even if we
> like to pretend that there is.
>
>
> Please stop saying things like this, it simply is not true for all
> platforms. It _may_ be true for platforms you are familiar with, but that
> does not mean “all platforms”.
>
> File system names and paths are utf8 in APFS, which has been the default
> filesystem for all Darwin systems for (per Wikipedia) 8 years - and the
> only one for system partitions.
>
> In addition to macOS filesystems being case-insensitive by default -
> with the *Unicode* definition of case/capitalisation - all APFS
> filesystems (case-sensitive or case-insensitive) are
> normalization-insensitive, i.e. file names and paths with the same
> characters (grapheme clusters) are the same regardless of which
> normalization form they are encoded in.
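>
> (A minimal sketch of what that means in practice - hypothetical file
> name, assuming a platform where char* strings are UTF-8; on APFS the
> fopen of the NFD spelling finds the file created with the NFC spelling,
> while on e.g. ext4 it does not:)
>
>     #include <cstdio>
>
>     int main() {
>         const char* nfc = "caf\xC3\xA9";  // "café", precomposed U+00E9
>         const char* nfd = "cafe\xCC\x81"; // "café", 'e' + combining U+0301
>         if (std::FILE* f = std::fopen(nfc, "w")) std::fclose(f);
>         if (std::FILE* f = std::fopen(nfd, "r")) {
>             std::puts("normalization-insensitive: same file");
>             std::fclose(f);
>         } else {
>             std::puts("normalization-sensitive: different files");
>         }
>     }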
>
> All char* APIs are assumed to be utf8 and treated as such (as described
> below, all display APIs other than the old ABI-fixed Foundation ones
> have no mechanism to handle utf16 directly).
>
> The fact that C and C++ treat these strings as binary blobs, and don’t
> provide any support for actual character iteration, is moot - they all
> hit the underlying OS APIs that treat them as utf8.
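>
> (For illustration: the iteration the languages don't provide has to be
> hand-rolled. A minimal code-point walk over well-formed UTF-8 - no
> validation - looks roughly like this, and note it only yields code
> points; grapheme clusters are a further layer on top:)
>
>     #include <cstddef>
>     #include <cstdint>
>     #include <cstdio>
>     #include <string>
>
>     int main() {
>         std::string s = "a\xC3\xA9\xF0\x9F\x90\xB6"; // "aé🐶"
>         static const unsigned char mask[] = {0x7F, 0x1F, 0x0F, 0x07};
>         for (std::size_t i = 0; i < s.size();) {
>             unsigned char b = s[i];
>             std::size_t len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
>             std::uint32_t cp = b & mask[len - 1];
>             for (std::size_t k = 1; k < len; ++k)
>                 cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
>             std::printf("U+%04X\n", static_cast<unsigned>(cp));
>             i += len;
>         }
>     }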
>
> For example, Swift can interoperate with C++ to some extent, and I
> believe it does understand things like std::string; if you look at a
> std::string in Swift, it will be presented as a sequence of Characters,
> i.e. extended grapheme clusters (I did look up the correct name since
> last night :D ) - again, the fact that C++ treats these as a stream of
> bytes is moot.
>
> If you pass a std::string or char* or whatever you want to the
> filesystem APIs, they will be treated as utf-8. Using old utf16 APIs -
> when supported at all - results in re-encoding to utf8, as the
> filesystems don’t know utf16.
>
> You have some text conversion facilities, the functions to convert between
> encodings, those are fine, the standard can deal with those without a
> problem.
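>
> (For instance the standard's own <cuchar> conversions; a sketch using
> c16rtomb, assuming the environment supplies a UTF-8 locale:)
>
>     #include <climits>
>     #include <clocale>
>     #include <cstddef>
>     #include <cstdio>
>     #include <cuchar>
>
>     int main() {
>         std::setlocale(LC_ALL, "");  // adopt the environment's locale
>         std::mbstate_t st{};
>         char buf[MB_LEN_MAX];
>         // one UTF-16 code unit (U+00E9) -> the locale's multibyte encoding
>         std::size_t n = std::c16rtomb(buf, u'\u00E9', &st);
>         if (n != static_cast<std::size_t>(-1))
>             std::fwrite(buf, 1, n, stdout);  // prints "é" under a UTF-8 locale
>         std::putchar('\n');
>     }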
>
>
> If it's not one of those 3 categories or similar (e.g. program
> arguments, environment variables, debug symbols, which don't exist in
> Unicode), frankly speaking I don't want text in my APIs.
>
>
> Yes they do. I can happily write `some_program 🐶🐶🐶🐶🐶` - and the
> higher-level APIs use things like [String], so the raw blobs are not
> exposed anyway - and they will happily go through the entire system, and
> when they reach non-archaic interfaces, those interfaces correctly treat
> them as Unicode.
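>
> (Easy to see for yourself: dump the raw bytes of argv and each 🐶
> arrives as the UTF-8 sequence F0 9F 90 B6 - a sketch, assuming a UTF-8
> platform:)
>
>     #include <cstdio>
>
>     int main(int argc, char** argv) {
>         for (int i = 1; i < argc; ++i) {
>             for (const char* p = argv[i]; *p; ++p)
>                 std::printf("%02X ", static_cast<unsigned char>(*p));
>             std::putchar('\n');
>         }
>     }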
>
> I think the core issue here is that you are taking C and C++’s archaic
> pre-“oh, there are characters beyond those relevant for EBCDIC and
> ASCII” APIs, and the subsequent addition of UCS2 - *all* of which just
> incorrectly assume one code point per character - to mean that the rest
> of the platform is oblivious to this.
>
> There are many, many places in C and C++ where the language, and the
> standard libraries, just treat data as binary blobs and the like, with
> no actual understanding of the meaning of the bytes, but we don’t then
> say “therefore that meaning does not exist”.
>
> Let’s consider the existing wchar_t and related APIs: they are not
> specified to be ucs2 or utf16 - that only appears to be the case on
> Windows. On macOS (I can test locally) and Linux (via godbolt), at
> least, they’re 32-bit values, so are presumably full Unicode scalars,
> not ucs2/utf16.
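>
> (Quick to check; on macOS and Linux this prints 4, and where the
> implementation defines __STDC_ISO_10646__, wchar_t is documented to
> hold ISO 10646 / Unicode scalar values:)
>
>     #include <cstdio>
>
>     int main() {
>         std::printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t)); // 4 here, 2 on Windows
>     #ifdef __STDC_ISO_10646__
>         std::puts("wchar_t holds ISO 10646 (Unicode) values");
>     #endif
>     }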
>
> I was going to say that while utf8 is displayed correctly in the terminal,
> utf16 comes out as garbage, but I could not work out how to make any API
> even attempt to print char16_t/utf16 strings. Happily wchar_t still
> suffices to demonstrate the (lack of) support: without manually changing
> the terminal mode (which then seems like it breaks all other printing
> unless you toggle the mode back \o/), the scalar output breaks.
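>
> (For reference, the incantation that makes wide output survive at all
> on POSIX is roughly the following - a sketch; behaviour depends on the
> active locale, and the first output fixes the stream's orientation (see
> fwide), after which mixing byte and wide output on stdout is off the
> table:)
>
>     #include <clocale>
>     #include <cwchar>
>
>     int main() {
>         // without this, the default "C" locale makes
>         // non-ASCII wide output fail
>         std::setlocale(LC_ALL, "");
>         std::wprintf(L"caf\u00E9 \U0001F436\n");
>     }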
>
>
> I don't know of the problem of which you speak in which the standard
> should provide preference of one over the other.
> Can you be more concrete here?
>
>
> The issue that *I* am finding in this thread is that you seem to be
> advocating for migrating C++ from its current utf8 default - again, the
> fact that C++ believes codepoint==character is irrelevant - to utf16,
> or for the addition of new utf16 (by default?) APIs, despite the
> widespread understanding that utf16 is always the wrong call *except*
> when ABI stability requires it.
>
> In that regard it seems analogous to arguing for EBCDIC as the standard
> representation due to widespread support and use of it by existing
> platforms, even though it was clear that everyone had moved, or was moving,
> to ASCII.
>
> Or take my comments on Swift: rather than taking that as an example of
> *another* system language being utf8 by design, you have clearly gone
> searching for the specific purpose of finding _any_ utf16 usage, no
> matter how far removed from anything the OS or user would (or could)
> see, presumably to point to such and say “see? It’s all utf16” - a lot of
> swift’s runtime was based on existing OS code, or C[++] libraries, as it
> takes time to reimplement that code in a new language, even if doing so has
> advantages over the existing implementations. If you were to have done this
> kind of code spelunking back in the days of Swift 1.0 you would quite
> possibly (I do not know - I write C++ compilers) have found lots of/all of
> the implementation of String was just forwarded to NSString (or libicu or
> something), and at that point been able to say “look it’s all utf16”
> despite that never being visible to any user of the language and swift
> _never_ exposing a utf16 interface, and never being intended to be such,
> with the knowledge that over time it would all be rewritten in swift purely
> as utf8.
>
> I think you need to be clear here in explaining exactly what it is you
> want: like I said, it sounds like you are saying C++ should be using
> utf16 everywhere by default, but it could be that you are just arguing
> that it needs to maintain support for char16_t/wchar_t blobs.
>
> —Oliver
>
>
>
> --
> Std-Proposals mailing list
> Std-Proposals_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals


I'd add window text to that list, since that's a large part of what
software is developed for. Maybe shader code too; not too sure on that
one, but I assume shader compilers expect utf-8.

Received on 2025-08-31 07:52:05