Re: [std-proposals] TBAA and extended floating-point types

From: Simon Schröder <dr.simon.schroeder_at_[hidden]>
Date: Sat, 30 Aug 2025 14:37:51 +0200
I agree that currently most libraries are using UTF-16 internally. However, most of them started out as UCS-2. Just a couple of days ago I read about yet another UCS-2-related Qt bug on the forum.

I see it from the perspective of the regular programmer: when you start out programming, you might have a lot of text files to handle. Many simple file formats store text instead of binary data. For text files it is obviously easier to use UTF-8 than any other Unicode encoding, because there is no byte-ordering issue. Most of the internet runs on UTF-8. I have done my research and I’m on team “UTF-8 everywhere”. This is why I advocate for char8_t over char16_t for new functions.

We should not abolish char16_t or char32_t. However, most implementations that use UTF-16 internally predate char16_t; on Windows they use wchar_t instead to interface with the Windows APIs. Writing code with Unicode in mind is already hard enough, and because of I/O I feel that UTF-8 is both necessary and the simplest choice. The burden should be on string libraries to seamlessly convert between different encodings, not on the user. If we push towards UTF-8 consistently, maybe even Windows will rewrite its APIs in that direction in the coming decades. This is bad news for existing libraries, but it is IMHO the best solution for the community as a whole. (Let the experts, i.e. the string library implementers, deal with these problems, not every other programmer individually.)
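A minimal sketch of the shape I have in mind; toUtf8() and saveDocument() are hypothetical names for illustration, not existing standard APIs:

```cpp
#include <string>
#include <string_view>

// Hypothetical: the string library owns the conversions...
std::u8string toUtf8(std::u16string_view utf16);

// ...so a new interface can stay UTF-8-only.
void saveDocument(std::u8string_view text);

void example(std::u16string_view legacyText) {
    // Legacy UTF-16 callers convert once at the boundary instead of
    // every API growing five overloads.
    saveDocument(toUtf8(legacyText));
}
```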

> On Aug 28, 2025, at 3:03 AM, Thiago Macieira via Std-Proposals <std-proposals_at_[hidden]> wrote:
>
> On Wednesday, 27 August 2025 00:55:04 Pacific Daylight Time David Brown via
> Std-Proposals wrote:
>>> In the real world, UTF-16 has a place and is in use for
>>> in-memory representation more frequently than UTF-8 or UTF-32.
>>
>> Seriously?
>
> Yes. I didn't pass a quality judgement above (but will below). I was just
> stating a fact: UTF-16 is in use as an in-memory representation for Unicode far
> more frequently than UTF-8 or UTF-32, given that Java, Cocoa/CoreFoundation,
> ICU, Qt and the Win32 API all use it. UTF-8 is used a great deal, but usually
> in the context of arbitrary 8-bit encodings. If you try to find software that
> will decode from a specified 8-bit encoding to one of the UTF encodings, you'll
> find that the target is invariably UTF-16, not UTF-8 or UTF-32.
>
>> UTF-16 is firmly established as the worst possible choice for
>> representation - internal or external. It has all the disadvantages of
>> UTF-8, all the disadvantages of UTF-32, and none of their benefits.
>
> You can qualify the same things as advantages: it has half the memory overhead
> of UTF-32 for most text and a much easier decoding procedure than UTF-8,
> especially if you need to go backwards.
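
To illustrate the point (my sketch, not code from any of the libraries mentioned): stepping backwards through UTF-16 needs a single surrogate check, while UTF-8 first has to scan back over continuation bytes before it can even start decoding.

```cpp
#include <cstddef>
#include <string_view>

// Decode the codepoint ending at index i (exclusive), moving i backwards.
// Sketch only: assumes well-formed UTF-16, no unpaired-surrogate handling.
char32_t prevUtf16(std::u16string_view s, std::size_t& i) {
    char16_t lo = s[--i];
    if (lo >= 0xDC00 && lo <= 0xDFFF) {       // low surrogate: pair it with
        char16_t hi = s[--i];                 // the preceding high surrogate
        return 0x10000 + ((char32_t(hi) - 0xD800) << 10) + (lo - 0xDC00);
    }
    return lo;
}

// UTF-8 must first scan back over up to three continuation bytes (10xxxxxx)
// just to find where the codepoint starts, then decode forwards again.
std::size_t prevUtf8Start(std::string_view s, std::size_t i) {
    do { --i; } while (i > 0 && (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80);
    return i;
}
```
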
>
>> I can appreciate UTF-16 having some convenience on Windows as an
>> internal format, because that's close to the internal format Windows has
>> used for some subsystems and APIs. But Windows doesn't fully support
>> UTF-16 properly - it supports a gradual move from UCS2 towards UTF-16
>> through the generations of Windows. If you want to use filesystem APIs,
>> you can't work with any old general UTF-16 strings - you need to
>> sanitise them for the limitations of the filename lengths and characters
>> supported.
>
> As far as I know, Windows doesn't care about surrogates in the string. That
> means it allows improperly-encoded content, but it also allows the full
> Unicode range.
>
>> So you are going to have some kind of wrapper functions
>> anyway - that would be a fine place to put your conversion functions so
>> that the application code could use normal standard UTF-8 regardless of
>> the APIs. And AFAIK, MS is supporting steadily more UTF-8 and
>> encouraging its use rather than UCS2 or UTF-16.
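
(As an aside, such a wrapper really is a small matter; a quick sketch using the long-standing Win32 conversion call, Windows-only and with error handling omitted:)

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <windows.h>

// UTF-8 in, UTF-16 (wchar_t) out, ready for the "W" APIs.
std::wstring widen(std::string_view utf8) {
    int n = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                static_cast<int>(utf8.size()), nullptr, 0);
    std::wstring out(static_cast<std::size_t>(n), L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                        static_cast<int>(utf8.size()), out.data(), n);
    return out;
}
```
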
>
> That's nowhere that I can see. All of the Win32 API is "W". There are a
> handful of UTF-8 functions out of what, 10000?
>
>> But I /am/ suggesting that most /new/ interfaces and features should be
>> char8_t and UTF-8 only. Any time you need to use something else along
>> with that new interface, you use a conversion function.
>
> If you're developing a library, you're welcome to do that. I would actually
> welcome full char8_t UTF-8 support in the C++ Standard Library.
>
> I am however saying that the C++ Standard must support char16_t as a first-
> class citizen, even ahead of char8_t if development cost is an issue. The fact
> that <format> only supports char and wchar_t (among other problems) makes it
> useless for us in Qt. There's far too much legacy to be ignored, and there's
> more of it using UTF-16 than UTF-8, especially UTF-8 via char8_t.
>
>>> UTF-8 is used
>>> for external representation (network protocols and files).
>>
>> It is also used for a great deal of internal representation - as well as
>> strings and data in source code. On occasions when you need strings
>> split up into directly accessible code points, UTF-32 is the only
>> option, so that is sometimes used internally in code. UTF-16 gives you
>> an inefficient format that is different from all external data and which
>> cannot index code points or characters directly.
>
> UTF-32 in my experience is used as a stepping stone for iteration only,
> because it just uses too much memory. When dealing with *text* you also need
> to get away from indexing, because boundaries aren't obvious: you can't cut at
> an arbitrary codepoint and call it a day if the next one is combining.
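
A concrete case of that trap:

```cpp
#include <cassert>
#include <string>

int main() {
    // "café" with the accent as a separate combining codepoint:
    // U+0063 U+0061 U+0066 U+0065 U+0301 -- five codepoints but only
    // four user-perceived characters.
    std::u32string cafe = U"cafe\u0301";
    assert(cafe.size() == 5);

    // Cutting at a codepoint boundary is still not safe: this keeps
    // every codepoint intact yet silently drops the accent.
    std::u32string cut = cafe.substr(0, 4);
    assert(cut == U"cafe");
}
```
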
>
>> (I am not concerned with memory cost here. UTF-8 is invariably lower
>> memory for text that is big enough for the memory usage to be a
>> consideration. For shorter strings, conversion between encodings is
>> fast and the memory usage is a drop in the ocean.)
>
> That's a very Latin-centric view. For text using the Latin script, I guess
> you're going to have some 10-20% non-US-ASCII codepoints, which makes the
> memory use of UTF-8 40-45% smaller than UTF-16's, not 50%. As soon as you
> step outside the Latin script, that's no longer the case: for text in
> Cyrillic, Greek or some other scripts, the memory use of UTF-8 is exactly the
> same as UTF-16's. For CJK text, UTF-8 uses 50% more memory than UTF-16,
> requiring 3 bytes per character while UTF-16 still only requires 2.
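
The byte counts are easy to verify:

```cpp
#include <cstdio>
#include <string_view>

int main() {
    std::u8string_view  cyr8  = u8"привет";    // Cyrillic: 2 bytes/codepoint in UTF-8
    std::u16string_view cyr16 =  u"привет";
    std::u8string_view  cjk8  = u8"你好世界";   // CJK: 3 bytes/codepoint in UTF-8
    std::u16string_view cjk16 =  u"你好世界";

    // Prints 12 vs 12 for Cyrillic, 12 vs 8 for CJK.
    std::printf("Cyrillic: UTF-8 %zu bytes, UTF-16 %zu bytes\n",
                cyr8.size(), cyr16.size() * sizeof(char16_t));
    std::printf("CJK:      UTF-8 %zu bytes, UTF-16 %zu bytes\n",
                cjk8.size(), cjk16.size() * sizeof(char16_t));
}
```
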
>
>> In my suggestion, suppose the C++ standard library introduces a new
>> class "foo" that can take a string in its constructor, and also provides
>> methods for viewing the string. For the C++ <filesystem> classes, the
>> classes all need to handle input of 5 different string types, and have 5
>> different observers to see the string. People using it have 5 times the
>> options, and Stack Overflow gets questions from people wondering if they
>> need string or u8string, or if they should use wstring or u16string.
>> You, as the implementer of a library that uses UTF-16 internally, can
>> happily use the u16string versions.
>
> Indeed we could.
>
> But just FYI, Qt uses the UTF-16 path only if
> std::filesystem::path::value_type is UTF-16. Otherwise, we perform the
> conversion from UTF-16 to UTF-8 and then give that to fs::path, because our
> encoder and decoder are faster.
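
Roughly this shape, I imagine (a sketch of the dispatch described, not Qt's actual code):

```cpp
#include <cstddef>
#include <filesystem>
#include <string_view>
#include <QString>

std::filesystem::path toFsPath(const QString& s) {
    if constexpr (sizeof(std::filesystem::path::value_type) == sizeof(char16_t)) {
        // Native path encoding is already UTF-16-sized (e.g. Windows):
        // pass the string through.
        return std::filesystem::path(s.toStdWString());
    } else {
        // Otherwise do the UTF-16 -> UTF-8 conversion ourselves and hand
        // the result to fs::path as char8_t data.
        const QByteArray utf8 = s.toUtf8();
        return std::filesystem::path(std::u8string_view(
            reinterpret_cast<const char8_t*>(utf8.constData()),
            static_cast<std::size_t>(utf8.size())));
    }
}
```
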
>
>> What I would like to see is that "foo" can /only/ take UTF-8 strings.
>> That makes life much simpler for the C++ library implementer, as well as
>> the unsung heroes who document all this stuff. It also makes it simpler
>> for the C++ user. Unfortunately, the library writer in the middle will
>> now need to add wrappers or conversion functions or methods when using
>> foo. But it should be a small matter - especially compared to
>> converting between standard C++ library and toolkit-specific vectors,
>> strings, and other such beasties where the toolkit made their own
>> classes before the C++ standard libraries were available or appropriate.
>
> And that's what I am complaining about. We do not live in a greenfield
> scenario, and there's a lot of UTF-16 out there. And you're right that it's
> expensive to add more API and document and test it, but given that char must
> be supported, I am asking that the next one on the list be char16_t, ahead of
> char8_t.
>
> If the Standard Library and SG16 decided that going forward it was not going
> to add new char16_t or char32_t APIs, I'd understand. At least we would know
> where we stand and whether we could rely on certain Standard APIs or need to
> pay attention to their development. As it stands, we can neither use <format>
> nor do we want to reimplement it, because we may want to use it in the future.
>
> --
> Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
> Principal Engineer - Intel Platform & System Engineering
> --
> Std-Proposals mailing list
> Std-Proposals_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals

Received on 2025-08-30 12:38:07