Re: [std-proposals] char8_t aliasing and Unicode

From: zxuiji <gb2985_at_[hidden]>
Date: Sun, 31 Aug 2025 13:01:45 +0100
On Sun, 31 Aug 2025 at 12:45, Tiago Freire <tmiguelf_at_[hidden]> wrote:

> > I mentioned them because the C/C++ standard libraries are used in
> > conjunction with them too often to be ignored when deciding things that
> > will often be used alongside them, such as text encodings. The library
> > doesn't need to have any notion of them for the standards to be designed
> > with compatibility with them in mind.
>
> Sure. And this can be done by:
> 1. Providing facilities for transcoding (the only thing I support; a
> minimal sketch follows this list)
> 2. Not making any encoding preferential
> 3. Avoiding having to deal with encoding wherever possible, leaving that
> to the exclusive discretion of the user's domain.
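>
> As a strawman for point 1, a minimal sketch of one direction of such a
> facility, utf8 to utf16. The name and the exception-based error strategy
> are illustrative only, not a proposal, and overlong forms and surrogate
> code points are not rejected:
>
>   #include <cstdint>
>   #include <stdexcept>
>   #include <string>
>   #include <string_view>
>
>   std::u16string transcode_utf8_to_utf16(std::u8string_view in) {
>       std::u16string out;
>       for (std::size_t i = 0; i < in.size();) {
>           std::uint8_t b = static_cast<std::uint8_t>(in[i]);
>           char32_t cp = 0;
>           std::size_t len = 0;
>           if      (b < 0x80)           { cp = b;        len = 1; }
>           else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; len = 2; }
>           else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; }
>           else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; len = 4; }
>           else throw std::invalid_argument("bad utf8 lead byte");
>           if (i + len > in.size())
>               throw std::invalid_argument("truncated utf8");
>           for (std::size_t j = 1; j < len; ++j) {
>               std::uint8_t c = static_cast<std::uint8_t>(in[i + j]);
>               if ((c & 0xC0) != 0x80)
>                   throw std::invalid_argument("bad continuation byte");
>               cp = (cp << 6) | (c & 0x3F);
>           }
>           if (cp < 0x10000) {
>               out.push_back(static_cast<char16_t>(cp));
>           } else { // astral plane: encode as a surrogate pair
>               cp -= 0x10000;
>               out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
>               out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
>           }
>           i += len;
>       }
>       return out;
>   }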
>
> ------------------------------
> *From:* zxuiji <gb2985_at_[hidden]>
> *Sent:* Sunday, August 31, 2025 1:20:12 PM
> *To:* Tiago Freire <tmiguelf_at_[hidden]>
> *Cc:* std-proposals_at_[hidden] <std-proposals_at_[hidden]>;
> Oliver Hunt <oliver_at_[hidden]>
> *Subject:* Re: [std-proposals] char8_t aliasing and Unicode
>
> On Sun, 31 Aug 2025 at 11:56, Tiago Freire <tmiguelf_at_[hidden]> wrote:
>
>> Just so I don't leave this point unanswered:
>>
>> >I'd add window text to that list since that's a large part of what
>> software is developed for.
>>
>> But the C++ standard doesn't know what a "window" is, and it's not likely
>> to ever dictate how to render one. So C++ will not define the encoding of
>> such APIs.
>>
>>
>> >Maybe shader code too; not too sure on that one, but I assume shader
>> >compilers expect utf-8
>>
>> The C++ standard doesn't know what a shader is either. And that would
>> fall under the discretion of your shader compiler, not C++.
>>
>>
>> A similar comment applies to Qt APIs; they're not part of the C++
>> standard.
>>
>> And this is what I'm getting at. This looks like a discussion about the
>> C++ standard, but if you look closely at the details, it isn't. It really
>> isn't.
>>
>>
>>
>>
>> ------------------------------
>> *From:* zxuiji <gb2985_at_[hidden]>
>> *Sent:* Sunday, August 31, 2025 9:52:03 AM
>> *To:* std-proposals_at_[hidden] <std-proposals_at_[hidden]>
>> *Cc:* Oliver Hunt <oliver_at_[hidden]>; Tiago Freire <tmiguelf_at_[hidden]>
>> *Subject:* Re: [std-proposals] char8_t aliasing and Unicode
>>
>> On Sun, 31 Aug 2025 at 08:47, Tiago Freire via Std-Proposals <
>> std-proposals_at_[hidden]> wrote:
>>
>> Too much to distill here.
>>
>> Let's start with something simple.
>> Do we agree that the interfaces we are talking about are either:
>> 1. file system
>> 2. terminal interaction
>> and nothing else?
>>
>>
>>
>>
>> ------------------------------
>> *From:* Oliver Hunt <oliver_at_[hidden]>
>> *Sent:* Sunday, August 31, 2025 12:17:54 AM
>> *To:* std-proposals_at_[hidden] <std-proposals_at_[hidden]>
>> *Cc:* Tiago Freire <tmiguelf_at_[hidden]>
>> *Subject:* Re: [std-proposals] char8_t aliasing and Unicode
>>
>>
>>
>> On Aug 30, 2025, at 12:42 PM, Tiago Freire via Std-Proposals <
>> std-proposals_at_[hidden]> wrote:
>>
>> I hope you don't mind, I'm going to migrate the topic to a new thread
>> since this is diverging from floating point type aliasing, which is a
>> different thing from Unicode support.
>>
>> I think there is an important detail that is overlooked here:
>>
>> This is why I advocate for char8_t over char16_t for functions.
>>
>>
>> What char8_t or char16_t functions?
>>
>> As far as I know, there aren't many APIs that are even text, much less
>> Unicode.
>> Sure, you have file system paths and std::cout, but those are not Unicode;
>> there are no "char8_t or char16_t" functions in this domain, even if we
>> like to pretend that there are.
>>
>>
>> Please stop saying things like this; it simply is not true for all
>> platforms. It _may_ be true for platforms you are familiar with, but that
>> does not mean “all platforms”.
>>
>> File system names and paths are utf8 in APFS, which has been the default
>> filesystem for all darwin systems for (per wikipedia) 8 years - and the
>> only one for system partitions.
>>
>> In addition to the macOS filesystems being case insensitive by default -
>> with the *unicode* definition of case/capitalisation - all apfs
>> filesystems (case sensitive or case insensitive) are normalization
>> insensitive, i.e. file names and paths with the same characters (grapheme
>> clusters) are the same regardless of normalization form.
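>>
>> A quick way to observe this from C++ (hypothetical file name; run on an
>> APFS volume, then on e.g. ext4 for contrast):
>>
>>   #include <fstream>
>>   #include <iostream>
>>
>>   int main() {
>>       // "café.txt" spelled two ways: precomposed U+00E9 (NFC) vs.
>>       // 'e' followed by combining acute U+0301 (NFD).
>>       const char* nfc = "caf\xC3\xA9.txt";
>>       const char* nfd = "cafe\xCC\x81.txt";
>>       std::ofstream(nfc) << "hello";  // create via the NFC spelling
>>       std::ifstream in(nfd);          // open via the NFD spelling
>>       // APFS: both spellings name one file; most Linux filesystems: two.
>>       std::cout << (in ? "same file\n" : "distinct names\n");
>>   }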
>>
>> All char* apis are assumed to be utf8, and treated as such (as described
>> below, all display APIs other than the old ABI-fixed foundation ones have
>> no mechanism to handle utf16 directly).
>>
>> The fact that C and C++ treat these strings as binary blobs, and don’t
>> provide any support for actual character iteration, is moot - they all hit
>> the underlying OS APIs, which treat them as utf8.
>>
>> For example, swift can interoperate with C++ to some extent, and I
>> believe it does understand things like std::string; if you look at a
>> std::string in swift, it will be presented as a sequence of Characters -
>> i.e. extended grapheme clusters (I did look up the correct name since last
>> night :D ) - and again, the fact that C++ treats these as a stream of
>> bytes is moot.
>>
>> If you pass a std::string or char* or whatever you want to the filesystem
>> APIs, they will be treated as utf-8. Using old utf16 APIs - when supported
>> at all - results in re-encoding to utf8, as the filesystems don’t know
>> utf16.
>>
>> You have some text conversion facilities - the functions to convert
>> between encodings - and those are fine; the standard can deal with those
>> without a problem.
>>
>>
>> If it's not one of those 3 categories or similar (e.g. program arguments,
>> environment variables, debug symbols, which don't exist in unicode),
>> frankly speaking I don't want text in my APIs.
>>
>>
>> Yes they do. I can happily write `some_program 🐶🐶🐶🐶🐶` — and the
>> higher level APIs use things like [String] so the raw blobs are not exposed
>> anyway — and they will happily go through the entire system and when they
>> reach non-archaic interfaces those interfaces correctly treat them as
>> unicode.
>>
>> I think the core issue here is that you are interpreting C and C++’s
>> archaic pre-“oh, there are characters beyond those relevant for EBCDIC and
>> ASCII” APIs, and the subsequent addition of UCS2 - which *all* just
>> incorrectly assume one code point per character - as meaning that the rest
>> of the platform is oblivious to this.
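>>
>> The mismatch is cheap to demonstrate - none of these counts is "number of
>> characters" (compile-time sketch, C++20):
>>
>>   #include <iterator>
>>
>>   // One user-perceived character, the dog emoji U+1F436:
>>   constexpr char8_t  dog8[]  = u8"\U0001F436"; // 4 utf8 code units
>>   constexpr char16_t dog16[] = u"\U0001F436";  // surrogate pair in utf16
>>   constexpr char32_t dog32[] = U"\U0001F436";  // 1 code point
>>   static_assert(std::size(dog8)  == 5);        // counts include the NUL
>>   static_assert(std::size(dog16) == 3);
>>   static_assert(std::size(dog32) == 2);
>>
>>   // And one grapheme cluster may span several code points:
>>   // family emoji = MAN + ZWJ + WOMAN + ZWJ + GIRL.
>>   constexpr char32_t family[] =
>>       U"\U0001F468\u200D\U0001F469\u200D\U0001F467";
>>   static_assert(std::size(family) == 6);       // 5 code points + NUL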
>>
>> There are many, many places in C and C++ where the language, and the
>> standard libraries, just treat data as binary blobs with no actual
>> understanding of the meaning of the bytes, but we don’t then say
>> “therefore that meaning does not exist”.
>>
>> Let’s consider the existing wchar_t and related APIs: they are not
>> specified to be ucs2 or utf16 - that only appears to be the case on
>> windows. On macOS (I can test locally) and linux (via godbolt), at least,
>> they’re 32-bit values, so are presumably full unicode scalars, not
>> ucs2/utf16.
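>>
>> Trivial to confirm locally:
>>
>>   #include <cstdio>
>>
>>   int main() {
>>       // Prints 2 on windows (utf16 code units), 4 on macOS/linux.
>>       std::printf("sizeof(wchar_t) == %zu\n", sizeof(wchar_t));
>>   }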
>>
>> I was going to say that while utf8 is displayed correctly in the
>> terminal, utf16 comes out as garbage, but I could not work out how to make
>> any API even attempt to print char16_t/utf16 strings. Happily, wchar_t
>> still suffices to demonstrate the (lack of) support: without manually
>> changing the terminal mode (which then seems to break all other printing
>> unless you toggle the mode back \o/), the scalar output breaks.
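>>
>> For reference, a sketch of the mode switching being alluded to (standard
>> C names, but the observable behaviour is platform- and terminal-dependent):
>>
>>   #include <clocale>
>>   #include <cstdio>
>>   #include <cwchar>
>>
>>   int main() {
>>       // Adopt the environment's (usually utf8) locale.
>>       std::setlocale(LC_ALL, "");
>>       // Switch stdout to wide orientation; from here on, byte-oriented
>>       // output on stdout is undefined until the stream is reopened -
>>       // the "breaks all other printing" effect noted above.
>>       std::fwide(stdout, 1);
>>       // Print the dog-emoji scalar (assumes 32-bit wchar_t).
>>       std::wprintf(L"%lc\n", static_cast<wchar_t>(0x1F436));
>>   }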
>>
>>
>> I don't know of the problem you speak of, in which the standard should
>> provide preference of one over the other.
>> Can you be more concrete here?
>>
>>
>> The issue that *I* am finding in this thread is that you seem to be
>> advocating for migrating C++ from its current utf8 default - again, the
>> fact that C++ believes codepoint==character is irrelevant - to utf16, or
>> the addition of new utf16 (by default?) APIs, despite the widespread
>> understanding that utf16 is always the wrong call *except* when ABI
>> stability requires it.
>>
>> In that regard it seems analogous to arguing for EBCDIC as the standard
>> representation due to widespread support and use of it by existing
>> platforms, even though it was clear that everyone had moved, or was moving,
>> to ASCII.
>>
>> Or take my comments on Swift: rather than taking that as an example of
>> *another* system language being utf8 by design, you have clearly gone
>> searching for the specific purpose of finding _any_ utf16 usage, no matter
>> how far removed from anything the OS or user would (or could) see,
>> presumably to point to such and say “see? It’s all utf16”. A lot of
>> swift’s runtime was based on existing OS code, or C[++] libraries, as it
>> takes time to reimplement that code in a new language, even if doing so
>> has advantages over the existing implementations. If you were to have done
>> this kind of code spelunking back in the days of Swift 1.0 you would quite
>> possibly (I do not know - I write C++ compilers) have found that lots
>> of/all of the implementation of String was just forwarded to NSString (or
>> libicu or something), and at that point been able to say “look, it’s all
>> utf16” despite that never being visible to any user of the language, swift
>> _never_ exposing a utf16 interface or ever being intended to expose one,
>> and with the knowledge that over time it would all be rewritten in swift
>> purely as utf8.
>>
>> I think you need to be clear here in explaining exactly what it is you
>> want: like I said, it sounds like you are saying C++ should be using
>> utf16 everywhere by default, but it could be that you are just arguing
>> that it needs to maintain support for char16_t/wchar_t blobs.
>>
>> —Oliver
>>
>>
>>
>> --
>> Std-Proposals mailing list
>> Std-Proposals_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
>>
>>
>> I'd add window text to that list, since that's a large part of what
>> software is developed for. Maybe shader code too; not too sure on that
>> one, but I assume shader compilers expect utf-8
>>
>>
> I mentioned them because the C/C++ standard libraries are used in
> conjunction with them too often to be ignored when deciding things that
> will often be used alongside them, such as text encodings. The library
> doesn't need to have any notion of them for the standards to be designed
> with compatibility with them in mind.
>

"Do not make any encoding preferential" And yet he contradicts himself by
trying to make what whas it, UCS2?, preferential

Received on 2025-08-31 11:47:33