Re: [std-proposals] TBAA and extended floating-point types

From: Thiago Macieira <thiago_at_[hidden]>
Date: Wed, 27 Aug 2025 18:03:21 -0700
On Wednesday, 27 August 2025 00:55:04 Pacific Daylight Time David Brown via
Std-Proposals wrote:
> > In the real world, UTF-16 has a place and is in use for
> > in-memory representation more frequently than UTF-8 or UTF-32.
>
> Seriously?

Yes. I didn't pass a quality judgement above (but will below). I was just
stating fact: UTF-16 is in use as an in-memory representation for Unicode far
more frequently than UTF-8 or UTF-32, given that Java, Cocoa/CoreFoundation,
ICU, Qt and the Win32 API all use it. UTF-8 is used a great deal, but usually
in the context of arbitrary 8-bit encodings. If you look for software that
decodes from a specified 8-bit encoding into one of the UTF encodings, you'll
find the target is invariably UTF-16, not UTF-8 or UTF-32.

> UTF-16 is firmly established as the worst possible choice for
> representation - internal or external. It has all the disadvantages of
> UTF-8, all the disadvantages of UTF-32, and none of their benefits.

You can qualify the same things as advantages: it has half the memory overhead
of UTF-32 for most text and a much easier decoding procedure than UTF-8,
especially if you need to go backwards.

> I can appreciate UTF-16 having some convenience on Windows as an
> internal format, because that's close to the internal format Windows has
> used for some subsystems and APIs. But Windows doesn't fully support
> UTF-16 properly - it supports a gradual move from UCS2 towards UTF-16
> through the generations of Windows. If you want to use filesystem APIs,
> you can't work with any old general UTF-16 strings - you need to
> sanitise them for the limitations of the filename lengths and characters
> supported.

As far as I know, Windows doesn't care about surrogates in the string. That
means it does allow improperly-encoded content, but it also allows the full
Unicode range.

> So you are going to have some kind of wrapper functions
> anyway - that would be a fine place to put your conversion functions so
> that the application code could use normal standard UTF-8 regardless of
> the APIs. And AFAIK, MS is supporting steadily more UTF-8 and
> encouraging its use rather than UCS2 or UTF-16.

That's happening nowhere that I can see. All of the Win32 API is "W". There
are a handful of UTF-8 functions out of, what, 10000?

> But I /am/ suggesting that most /new/ interfaces and features should be
> char8_t and UTF-8 only. Any time you need to use something else along
> with that new interface, you use a conversion function.

If you're developing a library, you're welcome to do that. I would actually
welcome full UTF-8 char8_t support in the C++ Standard Library.

I am, however, saying that the C++ Standard must support char16_t as a first-
class citizen, even ahead of char8_t if development cost is an issue. The
fact that <format> only supports char and wchar_t (among other problems)
makes it useless for us in Qt. There's far too much legacy to be ignored, and
there's more of it using UTF-16 than UTF-8, let alone char8_t.

> > UTF-8 is used
> > for external representation (network protocols and files).
>
> It is also used for a great deal of internal representation - as well as
> strings and data in source code. On occasions when you need strings
> split up into directly accessible code points, UTF-32 is the only
> option, so that is sometimes used internally in code. UTF-16 gives you
> an inefficient format that is different from all external data and which
> cannot index code points or characters directly.

In my experience, UTF-32 is used only as a stepping stone for iteration,
because it uses just too much memory. When dealing with *text* you also need
to get away from indexing, because boundaries aren't obvious: you can't cut
at an arbitrary code point and call it a day if the next one is a combining
mark.

> (I am not concerned with memory cost here. UTF-8 is invariably lower
> memory for text that is big enough for the memory usage to be a
> consideration. For shorter strings, conversion between encodings is
> fast and the memory usage is a drop in the ocean.)

That's a very Latin-centric view. For text using the Latin script, I'd guess
some 10-20% of the code points are non-US-ASCII, which makes the memory use
of UTF-8 40-45% smaller than UTF-16's, not 50%. As soon as you step outside
the Latin script, that's no longer the case: for text in Cyrillic, Greek and
some other scripts, the memory use of UTF-8 is exactly the same as UTF-16's.
For CJK text, UTF-8 is 50% larger than UTF-16, requiring 3 bytes per
character where UTF-16 still only requires 2.

> In my suggestion, suppose the C++ standard library introduces a new
> class "foo" that can take a string in its constructor, and also provides
> methods for viewing the string. For the C++ <filesystem> classes, the
> classes all need to handle input of 5 different string types, and have 5
> different observers to see the string. People using it have 5 times the
> options, and stack overflow gets questions from people wondering if they
> need string or u8string, or if they should use wstring or u16string.
> You, as the implementer of a library that uses UTF-16 internally, can
> happily use the u16string versions.

Indeed we could.

But just as FYI, Qt uses the UTF-16 path only if
std::filesystem::path::value_type is UTF-16. Otherwise, we perform the
conversion from UTF-16 to UTF-8 and then give that to fs::path, because our
encoder and decoder are faster.

> What I would like to see is that "foo" can /only/ take UTF-8 strings.
> That makes life much simpler for the C++ library implementer, as well as
> the unsung heroes who document all this stuff. It also makes it simpler
> for the C++ user. Unfortunately, the library writer in the middle will
> now need to add wrappers or conversion functions or methods when using
> foo. But it should be a small matter - especially compared to
> converting between standard C++ library and toolkit-specific vectors,
> strings, and other such beasties where the toolkit made their own
> classes before the C++ standard libraries were available or appropriate.

And that's what I am complaining about. We do not live in a greenfield
scenario, and there's a lot of UTF-16 out there. You're right that it's
expensive to add, document and test more API, but given that char must be
supported, I am asking that the next one on the list be char16_t, ahead of
char8_t.

If the Standard Library and SG16 decided that, going forward, they were not
going to add new char16_t or char32_t APIs, I'd understand. At least we would
know where we stand, and whether we can rely on certain Standard APIs or need
to track their development. As it stands, we can neither use <format> nor do
we want to reimplement it, because we may want to use it in the future.

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
  Principal Engineer - Intel Platform & System Engineering

Received on 2025-08-28 01:03:30