Re: [std-proposals] char8_t aliasing and Unicode

From: zxuiji <gb2985_at_[hidden]>
Date: Sun, 31 Aug 2025 13:07:25 +0100
On Sun, 31 Aug 2025 at 12:51, Tiago Freire <tmiguelf_at_[hidden]> wrote:

> I've never advocated for UCS2. My position has always been that the
> standard should stay out of it.
>
> You must be confusing me with a different person on this thread. I'm not
> the only Tiago.
>
>
> ------------------------------
> *From:* zxuiji <gb2985_at_[hidden]>
> *Sent:* Sunday, August 31, 2025 1:47:35 PM
> *To:* Tiago Freire <tmiguelf_at_[hidden]>
> *Cc:* std-proposals_at_[hidden] <std-proposals_at_[hidden]>;
> Oliver Hunt <oliver_at_[hidden]>
> *Subject:* Re: [std-proposals] char8_t aliasing and Unicode
>
> On Sun, 31 Aug 2025 at 12:45, Tiago Freire <tmiguelf_at_[hidden]> wrote:
>
>> > I mentioned them because C/C++ standard libraries are used in
>> conjunction with them too often to be ignored when deciding things that
>> will often be used with them, such as text encodings. The library doesn't
>> need to know the notion of them for the standards to be designed with
>> compatibility with them in mind.
>>
>> Sure. And this can be done by:
>> 1. Provide facilities for transcoding (the only thing I support)
>> 2. Do not make any encoding preferential
>> 3. Avoid having to deal with encoding when at all possible; leave that to
>> the exclusive discretion of the user.
>>
>> ------------------------------
>> *From:* zxuiji <gb2985_at_[hidden]>
>> *Sent:* Sunday, August 31, 2025 1:20:12 PM
>> *To:* Tiago Freire <tmiguelf_at_[hidden]>
>> *Cc:* std-proposals_at_[hidden] <std-proposals_at_[hidden]>;
>> Oliver Hunt <oliver_at_[hidden]>
>> *Subject:* Re: [std-proposals] char8_t aliasing and Unicode
>>
>> On Sun, 31 Aug 2025 at 11:56, Tiago Freire <tmiguelf_at_[hidden]> wrote:
>>
>>> Just so I don't leave this point unanswered:
>>>
>>> >I'd add window text to that list since that's a large part of what
>>> software is developed for.
>>>
>>> But the C++ standard doesn't know what a "window" is, and it's not
>>> likely to ever dictate how to render one. So C++ will not define the
>>> encoding of such APIs.
>>>
>>>
>>> >Maybe shader code too, not too sure on that one but I assume shader
>>> code compilers expect utf-8
>>>
>>> The C++ standard doesn't know what a shader is either. And that would
>>> fall under the discretion of your shader compiler, not C++.
>>>
>>>
>>> A similar comment applies to Qt APIs: they're not part of the C++
>>> standard.
>>>
>>> And this is what I'm getting at. This looks like a discussion about the
>>> C++ standard but if you look closely at the details, it isn't, it really
>>> isn't.
>>>
>>>
>>>
>>>
>>> ------------------------------
>>> *From:* zxuiji <gb2985_at_[hidden]>
>>> *Sent:* Sunday, August 31, 2025 9:52:03 AM
>>> *To:* std-proposals_at_[hidden] <std-proposals_at_[hidden]>
>>> *Cc:* Oliver Hunt <oliver_at_[hidden]>; Tiago Freire <tmiguelf_at_[hidden]
>>> >
>>> *Subject:* Re: [std-proposals] char8_t aliasing and Unicode
>>>
>>> On Sun, 31 Aug 2025 at 08:47, Tiago Freire via Std-Proposals <
>>> std-proposals_at_[hidden]> wrote:
>>>
>>> Too much to distill here.
>>>
>>> Let's start with something simple.
>>> Do we agree that the interfaces we are talking about are either:
>>> 1. file system
>>> 2. terminal interaction
>>> and nothing else?
>>>
>>>
>>>
>>>
>>> ------------------------------
>>> *From:* Oliver Hunt <oliver_at_[hidden]>
>>> *Sent:* Sunday, August 31, 2025 12:17:54 AM
>>> *To:* std-proposals_at_[hidden] <std-proposals_at_[hidden]>
>>> *Cc:* Tiago Freire <tmiguelf_at_[hidden]>
>>> *Subject:* Re: [std-proposals] char8_t aliasing and Unicode
>>>
>>>
>>>
>>> On Aug 30, 2025, at 12:42 PM, Tiago Freire via Std-Proposals <
>>> std-proposals_at_[hidden]> wrote:
>>>
>>> I hope you don't mind, I'm going to migrate the topic to a new thread
>>> since this is diverging from floating point type aliasing, which is a
>>> different thing from Unicode support.
>>>
>>> I think there is an important detail that is overlooked here:
>>>
>>> This is why I advocate for char8_t over char16_t for functions.
>>>
>>>
>>> What char8_t or char16_t functions?
>>>
>>> As far as I know, there aren't many APIs that are even text, much less
>>> Unicode.
>>> Sure, you have file system paths and std::cout, but those are not
>>> Unicode; there is no "char8_t or char16_t" in this domain, even if we
>>> like to pretend that there is.
>>>
>>>
>>> Please stop saying things like this, it simply is not true for all
>>> platforms. It _may_ be true for platforms you are familiar with, but that
>>> does not mean “all platforms”.
>>>
>>> File system names and paths are utf8 in APFS, which has been the default
>>> filesystem for all darwin systems for (per Wikipedia) 8 years - and the
>>> only one for system partitions.
>>>
>>> In addition to the macOS filesystems being case insensitive by default -
>>> with the *unicode* definition of case/capitalisation - all apfs
>>> filesystems (case sensitive or case insensitive) are normalization
>>> insensitive, i.e. file names and paths with the same characters (grapheme
>>> clusters) are the same regardless of normalization form.
>>>
>>> All char* apis are assumed to be utf8, and treated as such (as described
>>> below, all display APIs other than the old ABI-fixed foundation ones do
>>> not have a mechanism to handle utf16 directly).
>>>
>>> The fact that C and C++ treat these strings as binary blobs, and don’t
>>> provide any support for actual character iteration, is moot - they all
>>> hit the underlying OS APIs that treat them as utf8.
>>>
>>> For example, swift can interoperate with C++ to some extent, and I
>>> believe it does understand things like std::string; if you look at a
>>> std::string in swift, it will be presented as a sequence of Characters,
>>> i.e. extended grapheme clusters (I did look up the correct name since
>>> last night :D ) - again, the fact that C++ treats these as a stream of
>>> bytes is moot.
>>>
>>> If you pass a std::string or char* or whatever you want to the
>>> filesystem APIs, it will be treated as utf-8. Using old utf16 APIs -
>>> when supported at all - results in re-encoding to utf8, as the
>>> filesystems don’t know utf16.
>>>
>>> You have some text conversion facilities, the functions to convert
>>> between encodings, those are fine, the standard can deal with those without
>>> a problem.
>>>
>>>
>>> If it's not one of those 3 categories or similar (e.g. program
>>> arguments, environment variables, debug symbols, which don't exist in
>>> unicode), frankly speaking I don't want text in my APIs.
>>>
>>>
>>> Yes they do. I can happily write `some_program 🐶🐶🐶🐶🐶` — and the
>>> higher level APIs use things like [String] so the raw blobs are not exposed
>>> anyway — and they will happily go through the entire system and when they
>>> reach non-archaic interfaces those interfaces correctly treat them as
>>> unicode.
>>>
>>> I think the core issue here is that you are interpreting C and C++’s
>>> archaic pre-“oh there are characters beyond those relevant for EBCDIC
>>> and ASCII” APIs, and the subsequent addition of UCS2 (which *all*
>>> incorrectly assume one code point per character), as meaning that the
>>> rest of the platform is oblivious to this.
>>>
>>> There are many, many places in C and C++ where the language and the
>>> standard libraries just treat data as binary blobs with no actual
>>> understanding of the meaning of the bytes, but we don’t then say
>>> “therefore that meaning does not exist”.
>>>
>>> Let’s consider the existing wchar_t and related APIs: they are not
>>> specified to be ucs2 or utf16 - that only appears to be the case on
>>> windows. On macOS (I can test locally) and linux (via godbolt), at
>>> least, they’re 32-bit values, so are presumably full unicode scalars,
>>> not ucs2/utf16.
>>>
>>> I was going to say that while utf8 is displayed correctly in the
>>> terminal, utf16 comes out as garbage, but I could not work out how to make
>>> any API even attempt to print char16_t/utf16 strings. Happily wchar_t still
>>> suffices to demonstrate the (lack of) support: without manually changing
>>> the terminal mode (which then seems like it breaks all other printing
>>> unless you toggle the mode back \o/), the scalar output breaks.
>>>
>>>
>>> I don't know of the problem of which you speak in which the standard
>>> should provide preference of one over the other.
>>> Can you be more concrete here?
>>>
>>>
>>> The issue that *I* am finding in this thread is that you seem to be
>>> advocating for migrating C++ away from its current utf8 default -
>>> again, the fact that C++ believes codepoint==character is irrelevant -
>>> to utf16, or the addition of new utf16(-by-default?) APIs, despite the
>>> widespread understanding that utf16 is always the wrong call *except*
>>> when ABI stability requires it.
>>>
>>> In that regard it seems analogous to arguing for EBCDIC as the standard
>>> representation due to widespread support and use of it by existing
>>> platforms, even though it was clear that everyone had moved, or was moving,
>>> to ASCII.
>>>
>>> Or take my comments on Swift: rather than taking that as an example of
>>> *another* system language being utf8 by design, you have clearly gone
>>> searching for the specific purpose of finding _any_ utf16 usage, no
>>> matter how far removed from anything the OS or user would (or could)
>>> see, presumably to point to such and say “see? It’s all utf16” - a lot of
>>> swift’s runtime was based on existing OS code, or C[++] libraries, as it
>>> takes time to reimplement that code in a new language, even if doing so has
>>> advantages over the existing implementations. If you were to have done this
>>> kind of code spelunking back in the days of Swift 1.0 you would quite
>>> possibly (I do not know - I write C++ compilers) have found lots of/all of
>>> the implementation of String was just forwarded to NSString (or libicu or
>>> something), and at that point been able to say “look it’s all utf16”
>>> despite that never being visible to any user of the language and swift
>>> _never_ exposing a utf16 interface, and never being intended to be such,
>>> with the knowledge that over time it would all be rewritten in swift purely
>>> as utf8.
>>>
>>> I think you need to be clear here in explaining exactly what it is you
>>> are wanting: Like I said, it sounds like you are saying C++ should be using
>>> utf16 everywhere by default, but it could be that you are just arguing that
>>> it needs to maintain support for char16_t/wchar_t blobs.
>>>
>>> —Oliver
>>>
>>>
>>>
>>> --
>>> Std-Proposals mailing list
>>> Std-Proposals_at_[hidden]
>>> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
>>>
>>>
>>> I'd add window text to that list since that's a large part of what
>>> software is developed for. Maybe shader code too, not too sure on that one
>>> but I assume shader code compilers expect utf-8
>>>
>>>
>> I mentioned them because C/C++ standard libraries are used in conjunction
>> with them too often to be ignored when deciding things that will often be
>> used with them, such as text encodings. The library doesn't need to know
>> the notion of them for the standards to be designed with compatibility with
>> them in mind.
>>
>
> "Do not make any encoding preferential" - and yet he contradicts himself
> by trying to make, what was it, UCS2?, preferential.
>
>
Ah yeah, my bad. For some reason I thought you were the OP, probably because
I have a bad habit of failing to check names 😅

Received on 2025-08-31 11:53:14