Re: [std-proposals] TBAA and extended floating-point types

From: David Brown <david.brown_at_[hidden]>
Date: Thu, 28 Aug 2025 14:08:32 +0200
On 28/08/2025 03:03, Thiago Macieira via Std-Proposals wrote:
> On Wednesday, 27 August 2025 00:55:04 Pacific Daylight Time David Brown via
> Std-Proposals wrote:
>>> In the real world, UTF-16 has a place and is in use for
>>> in-memory representation more frequently than UTF-8 or UTF-32.
>>
>> Seriously?
>
> Yes. I didn't pass a quality judgement above (but will below). I was just
> stating fact: UTF-16 is in use as an in-memory representation for Unicode far
> more frequently than UTF-8 or UTF-32, given that Java, Cocoa/CoreFoundation,
> ICU, Qt and the Win32 API all use it.

OK. It surprises me that this is the case even for modern code,
especially as many newer languages use UTF-8 almost everywhere, and the
*nix world has traditionally used 8-bit or 32-bit encodings rather than
16-bit. But you have a lot more direct experience than me here - my own
use of Unicode data has rarely involved anything that would care about
in-memory representation (and things like collation have been handled by
a database server rather than my own code).

> UTF-8 is used a great deal but usually
> in the context of arbitrary 8-bit encodings. If you try to find software that
> will decode from a specified 8-bit encoding onto one of the UTF codecs, you'll
> find that it's invariably UTF-16, not 8 or 32.
>

A quick Google search for "C++ library converting Latin-9 to Unicode"
gave me UTF-8-only solutions and libraries that handled UTF-8, UTF-16
and UTF-32. I did not come across any that were UTF-16-only, in my
admittedly highly unscientific and non-representative search.

>> UTF-16 is firmly established as the worst possible choice for
>> representation - internal or external. It has all the disadvantages of
>> UTF-8, all the disadvantages of UTF-32, and none of their benefits.
>
> You can qualify the same things as advantages: it has half the memory overhead
> of UTF-32 for most text and a much easier decoding procedure than UTF-8,
> especially if you need to go backwards.

How often do you actually need to decode the strings? Normally you are
passing around full strings, where smaller memory usage means faster
copying and decode speed does not matter. UTF-8 strings can be
searched, cut up, and pasted together quite happily. And when you /do/
need to decode or encode for things like collation, normalisation or
case changes, the speed for decoding UTF-16 vs. UTF-8 is unlikely to be
a major factor. (The time taken to convert back and forth between
UTF-8 and UTF-16, when input/output is UTF-8 but some languages,
libraries and APIs need UTF-16, can be a factor.)

>
>> I can appreciate UTF-16 having some convenience on Windows as an
>> internal format, because that's close to the internal format Windows has
>> used for some subsystems and APIs. But Windows doesn't fully support
>> UTF-16 properly - it supports a gradual move from UCS2 towards UTF-16
>> through the generations of Windows. If you want to use filesystem APIs,
>> you can't work with any old general UTF-16 strings - you need to
>> sanitise them for the limitations of the filename lengths and characters
>> supported.
>
> As far as I know, Windows doesn't care about surrogates in the string. That
> means it does allow improperly-encoded content, but it does allow the full
> Unicode range.
>
>> So you are going to have some kind of wrapper functions
>> anyway - that would be a fine place to put your conversion functions so
>> that the application code could use normal standard UTF-8 regardless of
>> the APIs. And AFAIK, MS is supporting steadily more UTF-8 and
>> encouraging its use rather than UCS2 or UTF-16.
>
> That's nowhere that I can see. All of the Win32 API is "W". There are a
> handful of UTF-8 functions out of what, 10000?
>

That is what I read on MS's own pages. However, it is entirely possible
that the pages I came across were biased in some way - perhaps in the
context of code for web applications. I did read information
recommending setting the code page to UTF-8 and using the 8-bit APIs -
with information about the limitations that still exist.

Of course the existing APIs will continue for a long time to come - as
close to "for ever" as you get in this game.

>> But I /am/ suggesting that most /new/ interfaces and features should be
>> char8_t and UTF-8 only. Any time you need to use something else along
>> with that new interface, you use a conversion function.
>
> If you're developing a library, you're welcome to do that. I actually welcome
> a full UTF-8 char8_t C++ Standard Library support.
>
> I am however saying that the C++ Standard must support char16_t as a first-
> class citizen, even ahead of char8_t if necessary if development cost is an
> issue.

I am suggesting that the C++ Standard should support char8_t first and
foremost, and encourage its use everywhere. It must, of course,
continue to support all existing character types (char, wchar_t,
char16_t and char32_t) - it must be possible to write code that works
with these types in an efficient and practical manner, and all existing
interfaces and library features need to remain. But I think it is fine
if /new/ library or language features are char8_t only, and the use of
other character types is a bit more inconvenient.

> The fact that <format> only supports char and wchar_t (among other
> problems) makes it useless for us in Qt. There's far too much legacy to be
> ignored and there's more of it using UTF-16 than there is of UTF-8, especially
> using char8_t.
>

Legacy is always the curse in programming. If the Unicode folks had a
crystal ball when they started, UCS2 / UTF-16 would never have existed
and it would have been UTF-8 from the start. The challenge is always
about how to go forward - the UTF-16 hole is already deep, so should we
keep digging, or should we try to climb out?

But even for fans of UTF-8, support for only "char" can be unfortunate,
because it is not guaranteed to be the same as "char8_t".
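
A minimal illustration of the mismatch (C++20):

#include <string>

// A typical existing interface that takes "char" text:
void takes_char(const std::string& s) { /* ... */ }

int main()
{
    const char8_t* p = u8"häß";   // u8"" literals are char8_t in C++20
    // takes_char(p);             // ill-formed: char8_t* does not convert to char*

    // A cast (asserting that "char" really is UTF-8 here) or a proper
    // conversion is needed at every such boundary:
    takes_char(std::string(reinterpret_cast<const char*>(p)));
}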


>>> UTF-8 is used
>>> for external representation (network protocols and files).
>>
>> It is also used for a great deal of internal representation - as well as
>> strings and data in source code. On occasions when you need strings
>> split up into directly accessible code points, UTF-32 is the only
>> option, so that is sometimes used internally in code. UTF-16 gives you
>> an inefficient format that is different from all external data and which
>> cannot index code points or characters directly.
>
> UTF-32 in my experience is used as a stepping stone for iteration only,
> because it uses just too much memory.

Yes.

It is a lot less useful than many people think - often because indexing
is a lot less useful than many people think. But if you need to deal
with code points rather than opaque text, it is a convenient way to
store and pass around decoded Unicode.
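
Something like this rough sketch is the kind of stepping stone I mean
(no validation, just to show the shape of it):

#include <string>
#include <string_view>

// Decode UTF-8 into code points for a pass that genuinely needs them
// (collation, case mapping, ...).  Validation omitted for brevity.
std::u32string decode_utf8(std::string_view in)
{
    std::u32string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        if      (lead < 0x80) { cp = lead;        len = 1; }
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }
        else                  { cp = lead & 0x07; len = 4; }
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}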

> When dealing with *text* you also need
> to get away from indexing, because boundaries aren't obvious: you can't cut at
> an arbitrary codepoint and call it a day, if the next one is combining.
>

Every time you think you have covered everything, there is something
else to complicate matters! It all seems easy until someone wants a
username with six diacriticals on the one letter...
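
To make it concrete:

#include <string>

int main()
{
    // The same user-visible "é" in two perfectly valid Unicode forms:
    std::u32string composed   = U"\u00E9";    // one code point
    std::u32string decomposed = U"e\u0301";   // 'e' + combining acute accent

    // Cutting "decomposed" after its first code point is a valid code
    // point boundary, but not a grapheme boundary - it leaves a bare 'e'
    // and strands the accent.
    std::u32string broken = decomposed.substr(0, 1);
}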

>> (I am not concerned with memory cost here. UTF-8 is invariably lower
>> memory for text that is big enough for the memory usage to be a
>> consideration. For shorter strings, conversion between encodings is
>> fast and the memory usage is a drop in the ocean.)
>
> That's a very Latin-centric view. For text using the Latin script, I guess
> you're going to have some 10-20% of non-US-ASCII codepoints, which makes the
> memory use in UTF-8 be 40-45% smaller than UTF-16, not 50%. As soon as you
> step outside of the Latin script, that's no longer the case: for text in
> Cyrillic, Greek or some other scripts, the memory use of UTF-8 is exactly the
> same as UTF-16. For CJK text, UTF-8 is 50% more than UTF-16, requiring 3 bytes
> per character while UTF-16 still only requires 2.

A great deal of real-world text of significant length, where byte count
gets important, is in some kind of markup format - html, xml, json, etc.
Almost no such documents, even for CJK, are smaller in UTF-16 than in
UTF-8. For bits of plain text, you are of course correct that UTF-8 is
similar in size for some languages and larger for others. But how often
is that the case in situations where the memory use makes a big
difference?

(If data size is really important, it is always possible to compress the
text - and you'll get roughly the same size regardless of the encoding
because the information content is the same.)

>
>> In my suggestion, suppose the C++ standard library introduces a new
>> class "foo" that can take a string in its constructor, and also provides
>> methods for viewing the string. For the C++ <filesystem> classes, the
>> classes all need to handle input of 5 different string types, and have 5
>> different observers to see the string. People using it have 5 times the
>> options, and stack overflow gets questions from people wondering if they
>> need string or u8string, or if they should use wstring or u16string.
>> You, as the implementer of a library that uses UTF-16 internally, can
>> happily use the u16string versions.
>
> Indeed we could.
>
> But just as FYI, Qt uses the UTF-16 path only if
> std::filesystem::path::value_type is UTF-16. Otherwise, we perform the
> conversion from UTF-16 to UTF-8 and then give that to fs::path, because our
> encoder and decoder are faster.
>
>> What I would like to see is that "foo" can /only/ take UTF-8 strings.
>> That makes life much simpler for the C++ library implementer, as well as
>> the unsung heroes who document all this stuff. It also makes it simpler
>> for the C++ user. Unfortunately, the library writer in the middle will
>> now need to add wrappers or conversion functions or methods when using
>> foo. But it should be a small matter - especially compared to
>> converting between standard C++ library and toolkit-specific vectors,
>> strings, and other such beasties where the toolkit made their own
>> classes before the C++ standard libraries were available or appropriate.
>
> And that's what I am complaining about. We do not live in a green field
> scenario and there's a lot of UTF-16 out there. And you're right it's
> expensive to add more API and document and test, but given that char must be
> supported, I am asking that the next one on the list be char16_t, ahead of
> char8_t.
>

Fair enough. /I/ would ask for char8_t to be first, but I appreciate
your position.

> If the Standard Library and SG16 decided that going forward it was not going
> to add new char16_t or char32_t APIs, I'd understand. At least we would know
> where we stand and whether we could rely on certain Standard APIs or need to
> pay attention to development. As it stands, we neither can use <format> nor
> want to reimplement it because we may want to use it in the future.
>

That is indeed awkward. I fully appreciate the use of char16_t, but I
am at a loss to see how support for wchar_t is helpful.

Received on 2025-08-28 12:08:40