[std-proposals] char8_t aliasing and Unicode

From: Tiago Freire <tmiguelf_at_[hidden]>
Date: Sat, 30 Aug 2025 19:42:40 +0000
I hope you don't mind; I'm going to migrate this topic to a new thread, since it is diverging from floating-point type aliasing, which is a separate matter from Unicode support.

I think there is an important detail that is overlooked here:

> This is why I advocate for char8_t over char16_t for functions.

What char8_t or char16_t functions?

As far as I know, there aren't many APIs that even deal with text, much less Unicode.
Sure, you have file system paths and std::cout, but those are not Unicode; there is no "char8_t or char16_t" in this domain, even if we like to pretend there is.
You have some text conversion facilities, i.e. the functions to convert between encodings; those are fine, and the standard can deal with them without a problem.

If it's not one of those three categories or similar (e.g. program arguments, environment variables, debug symbols, which don't exist in Unicode), frankly speaking I don't want text in my APIs.

I don't know the problem you are speaking of, where the standard should give preference to one over the other.
Can you be more concrete here?


-----Original Message-----
From: Std-Proposals <std-proposals-bounces_at_lists.isocpp.org> On Behalf Of Simon Schröder via Std-Proposals
Sent: Saturday, August 30, 2025 14:38
To: std-proposals_at_[hidden].org
Cc: Simon Schröder <dr.simon.schroeder_at_[hidden]>; std-proposals_at_lists.isocpp.org
Subject: Re: [std-proposals] TBAA and extended floating-point types

I agree that currently most libraries use UTF-16 internally. However, most of them started out as UCS-2. Just a couple of days ago I read about yet another UCS-2-related Qt bug on the forum.

I see it from the view of the regular programmer: if you start out with programming you might have a lot of text files to handle. Many simple file formats store text instead of binary. Obviously, for text files it is easier to use UTF-8 instead of any other Unicode encoding because of byte ordering. Most of the internet runs on UTF-8. I have done my research and I’m on team “UTF-8 everywhere”. This is why I advocate for char8_t over char16_t for functions. We should not abolish char16_t or char32_t. However, most implementations that use UTF-16 internally predate char16_t. On Windows they use wchar_t instead to interface with Windows APIs. Writing code with Unicode in mind is already hard enough. And I feel that, because of I/O, UTF-8 is necessary and the simplest choice. The burden should be on string libraries to seamlessly convert between different encodings, not on the user. If we push towards UTF-8 consistently, maybe even Windows will rewrite its APIs in that direction in the coming decades. This is bad news for existing libraries, but it is IMHO the best solution for the community as a whole. (Let the experts, i.e. the string library implementers, deal with these problems and not every other programmer individually.)
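
To make the "burden on string libraries" point concrete, here is a rough sketch of the kind of conversion such a library would hide from the user. The names are made up, there is no error handling, and it assumes well-formed input:

  #include <cstddef>
  #include <string>
  #include <string_view>

  // Append one Unicode scalar value to a UTF-8 string.
  void append_utf8(std::u8string& out, char32_t cp)
  {
      if (cp < 0x80) {
          out.push_back(static_cast<char8_t>(cp));
      } else if (cp < 0x800) {
          out.push_back(static_cast<char8_t>(0xC0 | (cp >> 6)));
          out.push_back(static_cast<char8_t>(0x80 | (cp & 0x3F)));
      } else if (cp < 0x10000) {
          out.push_back(static_cast<char8_t>(0xE0 | (cp >> 12)));
          out.push_back(static_cast<char8_t>(0x80 | ((cp >> 6) & 0x3F)));
          out.push_back(static_cast<char8_t>(0x80 | (cp & 0x3F)));
      } else {
          out.push_back(static_cast<char8_t>(0xF0 | (cp >> 18)));
          out.push_back(static_cast<char8_t>(0x80 | ((cp >> 12) & 0x3F)));
          out.push_back(static_cast<char8_t>(0x80 | ((cp >> 6) & 0x3F)));
          out.push_back(static_cast<char8_t>(0x80 | (cp & 0x3F)));
      }
  }

  // Convert well-formed UTF-16 to UTF-8.
  std::u8string to_utf8(std::u16string_view in)
  {
      std::u8string out;
      for (std::size_t i = 0; i < in.size(); ++i) {
          char32_t cp = in[i];
          if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
              // Fold a surrogate pair into one scalar value.
              cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
              ++i;
          }
          append_utf8(out, cp);
      }
      return out;
  }

The user would only ever see the UTF-8 side of this; the other encodings stay an internal concern of the library.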

> On Aug 28, 2025, at 3:03 AM, Thiago Macieira via Std-Proposals <std-proposals_at_lists.isocpp.org> wrote:
>
> On Wednesday, 27 August 2025 00:55:04 Pacific Daylight Time David
> Brown via Std-Proposals wrote:
>>> In the real world, UTF-16 has a place and is in use for in-memory
>>> representation more frequently than UTF-8 or UTF-32.
>>
>> Seriously?
>
> Yes. I didn't pass a quality judgement above (but will below). I was
> just stating fact: UTF-16 is in use as an in-memory representation for
> Unicode far more frequently than UTF-8 or UTF-32, given that Java,
> Cocoa/CoreFoundation, ICU, Qt and the Win32 API all use it. UTF-8 is
> used a great deal but usually in the context of arbitrary 8-bit
> encodings. If you try to find software that will decode from a
> specified 8-bit encoding onto one of the UTF codecs, you'll find that it's invariably UTF-16, not 8 or 32.
>
>> UTF-16 is firmly established as the worst possible choice for
>> representation - internal or external. It has all the disadvantages
>> of UTF-8, all the disadvantages of UTF-32, and none of their benefits.
>
> You can qualify the same things as advantages: it has half the memory
> overhead of UTF-32 for most text and a much easier decoding procedure
> than UTF-8, especially if you need to go backwards.
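>
> Just to illustrate (a sketch only; the names are made up and it assumes
> well-formed input): stepping back one code point in UTF-16 needs at most
> a single low-surrogate check, while UTF-8 has to scan back over a
> variable number of continuation bytes:
>
>   #include <cstddef>
>   #include <string_view>
>
>   // Index of the code point preceding position i
>   // (i is assumed to point at a code point boundary).
>   std::size_t prev_code_point_utf16(std::u16string_view s, std::size_t i)
>   {
>       --i;                                    // back one code unit
>       if (i > 0 && s[i] >= 0xDC00 && s[i] <= 0xDFFF)
>           --i;                                // low surrogate: include its high surrogate
>       return i;
>   }
>
>   std::size_t prev_code_point_utf8(std::u8string_view s, std::size_t i)
>   {
>       do { --i; }                             // skip continuation bytes (10xxxxxx)
>       while (i > 0 && (s[i] & 0xC0) == 0x80);
>       return i;
>   }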
>
>> I can appreciate UTF-16 having some convenience on Windows as an
>> internal format, because that's close to the internal format Windows
>> has used for some subsystems and APIs. But Windows doesn't fully
>> support
>> UTF-16 properly - it supports a gradual move from UCS2 towards UTF-16
>> through the generations of Windows. If you want to use filesystem
>> APIs, you can't work with any old general UTF-16 strings - you need
>> to sanitise them for the limitations of the filename lengths and
>> characters supported.
>
> As far as I know, Windows doesn't care about surrogates in the string.
> That means it does allow improperly-encoded content, but it also allows
> the full Unicode range.
>
>> So you are going to have some kind of wrapper functions anyway - that
>> would be a fine place to put your conversion functions so that the
>> application code could use normal standard UTF-8 regardless of the
>> APIs. And AFAIK, MS is supporting steadily more UTF-8 and
>> encouraging its use rather than UCS2 or UTF-16.
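>>
>> As a sketch of what such a wrapper could look like on Windows
>> (illustrative only, no error handling, helper name made up):
>>
>>   #include <cstddef>
>>   #include <string>
>>   #include <string_view>
>>   #include <windows.h>
>>
>>   // Convert UTF-8 to the UTF-16 the "W" APIs expect.
>>   std::wstring widen(std::string_view utf8)
>>   {
>>       if (utf8.empty()) return std::wstring();
>>       int n = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
>>                                   static_cast<int>(utf8.size()), nullptr, 0);
>>       std::wstring out(static_cast<std::size_t>(n), L'\0');
>>       MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
>>                           static_cast<int>(utf8.size()), out.data(), n);
>>       return out;
>>   }
>>
>>   // Application code keeps using UTF-8 throughout.
>>   bool remove_file(std::string_view utf8_path)
>>   {
>>       return DeleteFileW(widen(utf8_path).c_str()) != 0;
>>   }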
>
> That's nowhere that I can see. All of the Win32 API is "W". There are
> a handful of UTF-8 functions out of what, 10000?
>
>> But I /am/ suggesting that most /new/ interfaces and features should
>> be char8_t and UTF-8 only. Any time you need to use something else
>> along with that new interface, you use a conversion function.
>
> If you're developing a library, you're welcome to do that. I would actually
> welcome full UTF-8 char8_t support in the C++ Standard Library.
>
> I am however saying that the C++ Standard must support char16_t as a
> first-class citizen, even ahead of char8_t if necessary, should
> development cost be an issue. The fact that <format> only supports
> char and wchar_t (among other
> problems) makes it useless for us in Qt. There's far too much legacy
> to be ignored and there's more of it using UTF-16 than there is of
> UTF-8, especially using char8_t.
>
>>> UTF-8 is used
>>> for external representation (network protocols and files).
>>
>> It is also used for a great deal of internal representation - as well
>> as strings and data in source code. On occasions when you need
>> strings split up into directly accessible code points, UTF-32 is the
>> only option, so that is sometimes used internally in code. UTF-16
>> gives you an inefficient format that is different from all external
>> data and which cannot index code points or characters directly.
>
> UTF-32 in my experience is used only as a stepping stone for iteration,
> because it simply uses too much memory. When dealing with *text* you
> also need to get away from indexing, because boundaries aren't obvious:
> you can't cut at an arbitrary codepoint and call it a day if the next one is combining.
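>
> For example (sketch only; a real check consults the Unicode character
> database rather than one hard-coded range):
>
>   // U+0300..U+036F is just one of many combining ranges.
>   bool is_combining(char32_t cp)
>   {
>       return cp >= 0x0300 && cp <= 0x036F;
>   }
>
>   // u"a\u0301" ("a" + combining acute) renders as a single "á"; cutting
>   // the string between the two codepoints silently drops the accent,
>   // even though the cut lands on a valid codepoint boundary.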
>
>> (I am not concerned with memory cost here. UTF-8 is invariably lower
>> memory for text that is big enough for the memory usage to be a
>> consideration. For shorter strings, conversion between encodings is
>> fast and the memory usage is a drop in the ocean.)
>
> That's a very Latin-centric view. For text using the Latin script, I
> guess you're going to have some 10-20% of non-US-ASCII codepoints,
> which makes the memory use of UTF-8 40-45% smaller than UTF-16, not
> 50%. As soon as you step outside of the Latin script, that's no longer
> the case: for text in Cyrillic, Greek or some other scripts, the
> memory use of UTF-8 is exactly the same as UTF-16. For CJK text, UTF-8
> is 50% more than UTF-16, requiring 3 bytes per character while UTF-16 still only requires 2.
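>
> A quick sketch to check the numbers, assuming a compiler that accepts
> these literals (string literal sizes include the terminator, hence the
> subtraction):
>
>   #include <cstdio>
>
>   int main()
>   {
>       constexpr char8_t  latin8[]  = u8"naïve café";       // accents take 2 bytes each
>       constexpr char16_t latin16[] = u"naïve café";         // 2 bytes per character
>       constexpr char8_t  cjk8[]    = u8"日本語のテキスト";   // 3 bytes per character
>       constexpr char16_t cjk16[]   = u"日本語のテキスト";    // 2 bytes per character
>
>       std::printf("latin: %zu bytes UTF-8 vs %zu bytes UTF-16\n",
>                   sizeof(latin8) - 1, sizeof(latin16) - 2);   // 12 vs 20
>       std::printf("cjk:   %zu bytes UTF-8 vs %zu bytes UTF-16\n",
>                   sizeof(cjk8) - 1, sizeof(cjk16) - 2);       // 24 vs 16
>   }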
>
>> In my suggestion, suppose the C++ standard library introduces a new
>> class "foo" that can take a string in its constructor, and also
>> provides methods for viewing the string. For the C++ <filesystem>
>> classes, the classes all need to handle input of 5 different string
>> types, and have 5 different observers to see the string. People
>> using it have 5 times the options, and Stack Overflow gets questions
>> from people wondering if they need string or u8string, or if they should use wstring or u16string.
>> You, as the implementer of a library that uses UTF-16 internally, can
>> happily use the u16string versions.
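>>
>> To be concrete, the contrast I have in mind is roughly this
>> (illustrative declarations only):
>>
>>   #include <string_view>
>>
>>   // <filesystem>-style: five input types, five observers.
>>   struct foo_today {
>>       foo_today(std::string_view);
>>       foo_today(std::wstring_view);
>>       foo_today(std::u8string_view);
>>       foo_today(std::u16string_view);
>>       foo_today(std::u32string_view);
>>       // string(), wstring(), u8string(), u16string(), u32string(), ...
>>   };
>>
>>   // What I'm suggesting: UTF-8 only.
>>   struct foo_proposed {
>>       explicit foo_proposed(std::u8string_view utf8);
>>       std::u8string_view u8view() const;
>>   };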
>
> Indeed we could.
>
> But just as FYI, Qt uses the UTF-16 path only if
> std::filesystem::path::value_type is UTF-16. Otherwise, we perform the
> conversion from UTF-16 to UTF-8 and then give that to fs::path,
> because our encoder and decoder are faster.
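>
> Roughly along these lines (a sketch, not our actual code; the conversion
> helper is assumed to exist, e.g. something like the to_utf8 sketched
> earlier in this thread):
>
>   #include <filesystem>
>   #include <string>
>   #include <string_view>
>
>   std::u8string utf16_to_utf8(std::u16string_view s);  // assumed helper
>
>   std::filesystem::path to_fs_path(std::u16string_view s)
>   {
>       using value_type = std::filesystem::path::value_type;
>       if constexpr (sizeof(value_type) == sizeof(char16_t)) {
>           // Native path encoding is already 16-bit (e.g. wchar_t on
>           // Windows): hand the UTF-16 over directly.
>           return std::filesystem::path(
>               std::basic_string<value_type>(s.begin(), s.end()));
>       } else {
>           // Otherwise convert to UTF-8 first.
>           return std::filesystem::path(utf16_to_utf8(s));
>       }
>   }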
>
>> What I would like to see is that "foo" can /only/ take UTF-8 strings.
>> That makes life much simpler for the C++ library implementer, as well
>> as the unsung heroes who document all this stuff. It also makes it
>> simpler for the C++ user. Unfortunately, the library writer in the
>> middle will now need to add wrappers or conversion functions or
>> methods when using foo. But it should be a small matter - especially
>> compared to converting between standard C++ library and
>> toolkit-specific vectors, strings, and other such beasties where the
>> toolkit made their own classes before the C++ standard libraries were available or appropriate.
>
> And that's what I am complaining about. We do not live in a green-field
> scenario and there's a lot of UTF-16 out there. And you're right, it's
> expensive to add more API and document and test, but given that
> char must be supported, I am asking that the next one on the list be
> char16_t, ahead of char8_t.
>
> If the Standard Library and SG16 decided that, going forward, they were not
> going to add new char16_t or char32_t APIs, I'd understand. At least
> we would know where we stand and whether we could rely on certain
> Standard APIs or need to pay attention to development. As it stands,
> we cannot use <format>, nor do we want to reimplement it, because we may want to use it in the future.
>
> --
> Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
> Principal Engineer - Intel Platform & System Engineering
> --
> Std-Proposals mailing list
> Std-Proposals_at_[hidden].org
> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
> <signature.asc>
--
Std-Proposals mailing list
Std-Proposals_at_[hidden]
https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals

Received on 2025-08-30 19:42:43