Date: Thu, 28 Aug 2025 06:11:03 +0000
When you work with APIs that are UTF-16, it can look as if UTF-16 is more important than the other encodings.
I personally find myself using UTF-8 more, because it has better memory properties in practice, and most of the web is UTF-8.
I even use UTF-32 occasionally, in short because of its properties. UTF-16 I wish didn't exist, but I have to deal with it.
My point is that Unicode is a terrible standard. I would say it's poorly designed, but the reality is it was barely designed at all; it's just a collection of hacks stacked on top of each other that have accumulated over the years. I loathe it, and I wish it would die and be replaced by something better.
But until then we are stuck with it: what is good for one is bad for everyone else, and no matter what you do the majority will always be unhappy.
The least the C++ standard could do is stay as far away from it as possible.
Sure, the standard should acknowledge that it exists and provide facilities for converting between the different encodings (something along the lines of the sketch below), but that's it.
Everything else is just asking for trouble and endless fighting.
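To be concrete, by "facilities" I mean nothing fancier than a well-specified version of the kind of thing everyone ends up hand-rolling anyway. A rough sketch (the name utf16_to_utf8 and the policy of replacing lone surrogates with U+FFFD are just choices I made for this example, not anything the standard specifies):

    #include <cstddef>
    #include <string>
    #include <string_view>

    // Convert UTF-16 to UTF-8; lone surrogates become U+FFFD.
    std::u8string utf16_to_utf8(std::u16string_view in)
    {
        std::u8string out;
        out.reserve(in.size() * 3);
        for (std::size_t i = 0; i < in.size(); ++i) {
            char32_t cp = in[i];
            if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()
                && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                // High surrogate followed by low surrogate: one code point.
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[++i] - 0xDC00);
            } else if (cp >= 0xD800 && cp <= 0xDFFF) {
                cp = 0xFFFD;            // lone surrogate: replace
            }
            if (cp < 0x80) {
                out += char8_t(cp);
            } else if (cp < 0x800) {
                out += char8_t(0xC0 | (cp >> 6));
                out += char8_t(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                out += char8_t(0xE0 | (cp >> 12));
                out += char8_t(0x80 | ((cp >> 6) & 0x3F));
                out += char8_t(0x80 | (cp & 0x3F));
            } else {
                out += char8_t(0xF0 | (cp >> 18));
                out += char8_t(0x80 | ((cp >> 12) & 0x3F));
                out += char8_t(0x80 | ((cp >> 6) & 0x3F));
                out += char8_t(0x80 | (cp & 0x3F));
            }
        }
        return out;
    }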
AND FYI, this has gone in a different direction and hijacked the OP's thread. You should start a new one.
-----Original Message-----
From: Std-Proposals <std-proposals-bounces_at_[hidden]rg> On Behalf Of Thiago Macieira via Std-Proposals
Sent: Thursday, August 28, 2025 03:03
To: std-proposals_at_lists.isocpp.org
Cc: Thiago Macieira <thiago_at_[hidden]>
Subject: Re: [std-proposals] TBAA and extended floating-point types
On Wednesday, 27 August 2025 00:55:04 Pacific Daylight Time David Brown via Std-Proposals wrote:
> > In the real world, UTF-16 has a place and is in use for in-memory
> > representation more frequently than UTF-8 or UTF-32.
>
> Seriously?
Yes. I didn't pass a quality judgement above (but will below). I was just stating a fact: UTF-16 is in use as an in-memory representation for Unicode far more frequently than UTF-8 or UTF-32, given that Java, Cocoa/CoreFoundation, ICU, Qt and the Win32 API all use it. UTF-8 is used a great deal, but usually in the context of arbitrary 8-bit encodings. If you try to find software that will decode from a specified 8-bit encoding onto one of the UTF codecs, you'll find that it's invariably UTF-16, not 8 or 32.
> UTF-16 is firmly established as the worst possible choice for
> representation - internal or external. It has all the disadvantages
> of UTF-8, all the disadvantages of UTF-32, and none of their benefits.
You can qualify the same things as advantages: it has half the memory overhead of UTF-32 for most text and a much easier decoding procedure than UTF-8, especially if you need to go backwards.
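To illustrate the "backwards" point: stepping back one code point in UTF-16 is a single surrogate check, whereas in UTF-8 you may have to scan back over up to three continuation bytes first. A minimal sketch (the helper name is invented for the example; it assumes pos > 0):

    #include <cstddef>
    #include <string_view>

    // Return the index of the first code unit of the code point that
    // ends just before 'pos' in a UTF-16 string.
    std::size_t previous_code_point(std::u16string_view s, std::size_t pos)
    {
        --pos;
        // A low surrogate preceded by a high surrogate is one code point.
        if (pos > 0 && s[pos] >= 0xDC00 && s[pos] <= 0xDFFF
            && s[pos - 1] >= 0xD800 && s[pos - 1] <= 0xDBFF)
            --pos;
        return pos;
    }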
> I can appreciate UTF-16 having some convenience on Windows as an
> internal format, because that's close to the internal format Windows
> has used for some subsystems and APIs. But Windows doesn't fully
> support
> UTF-16 properly - it supports a gradual move from UCS2 towards UTF-16
> through the generations of Windows. If you want to use filesystem
> APIs, you can't work with any old general UTF-16 strings - you need to
> sanitise them for the limitations of the filename lengths and
> characters supported.
As far as I know, Windows doesn't care about surrogates in the string. That means it does allow improperly-encoded content, but it also allows the full Unicode range.
> So you are going to have some kind of wrapper functions anyway - that
> would be a fine place to put your conversion functions so that the
> application code could use normal standard UTF-8 regardless of the
> APIs. And AFAIK, MS is supporting steadily more UTF-8 and encouraging
> its use rather than UCS2 or UTF-16.
That's nowhere that I can see. All of the Win32 API is "W". There are a handful of UTF-8 functions out of what, 10000?
> But I /am/ suggesting that most /new/ interfaces and features should
> be char8_t and UTF-8 only. Any time you need to use something else
> along with that new interface, you use a conversion function.
If you're developing a library, you're welcome to do that. I actually welcome a full UTF-8 char8_t C++ Standard Library support.
I am however saying that the C++ Standard must support char16_t as a first-class citizen, even ahead of char8_t if necessary, should development cost be an issue. The fact that <format> only supports char and wchar_t (among other problems) makes it useless for us in Qt. There's far too much legacy to be ignored, and there's more of it using UTF-16 than there is using UTF-8, especially UTF-8 as char8_t.
> > UTF-8 is used
> > for external representation (network protocols and files).
>
> It is also used for a great deal of internal representation - as well
> as strings and data in source code. On occasions when you need
> strings split up into directly accessible code points, UTF-32 is the
> only option, so that is sometimes used internally in code. UTF-16
> gives you an inefficient format that is different from all external
> data and which cannot index code points or characters directly.
UTF-32 in my experience is used as a stepping stone for iteration only, because it uses just too much memory. When dealing with *text* you also need to get away from indexing, because boundaries aren't obvious: you can't cut at an arbitrary code point and call it a day if the next one is a combining character.
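"Stepping stone" meaning something like the sketch below: you decode to char32_t on the fly as you walk the UTF-16 string, and never materialise a whole UTF-32 copy. (The function name is invented for the example; lone surrogates are passed through unchanged here, a real iterator would pick a policy.)

    #include <cstddef>
    #include <string_view>

    // Decode the code point starting at 'pos' and advance 'pos' past it.
    char32_t next_code_point(std::u16string_view s, std::size_t &pos)
    {
        char32_t cp = s[pos++];
        if (cp >= 0xD800 && cp <= 0xDBFF && pos < s.size()
            && s[pos] >= 0xDC00 && s[pos] <= 0xDFFF)
            cp = 0x10000 + ((cp - 0xD800) << 10) + (s[pos++] - 0xDC00);
        return cp;
    }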
> (I am not concerned with memory cost here. UTF-8 is invariably lower
> memory for text that is big enough for the memory usage to be a
> consideration. For shorter strings, conversion between encodings is
> fast and the memory usage is a drop in the ocean.)
That's a very Latin-centric view. For text using the Latin script, I'd guess you're going to have some 10-20% non-US-ASCII code points, which makes the memory use of UTF-8 40-45% smaller than UTF-16, not 50%. As soon as you step outside the Latin script, that's no longer the case: for text in Cyrillic, Greek or some other scripts, the memory use of UTF-8 is exactly the same as UTF-16's. For CJK text, UTF-8 is 50% larger than UTF-16, requiring 3 bytes per character while UTF-16 still only requires 2.
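If anyone wants to verify those numbers, here is a quick compile-time check. Sizes are in code units, i.e. bytes for char8_t and 16-bit units for char16_t; the sample strings are just examples picked for illustration:

    #include <string_view>

    constexpr std::u8string_view  ru8  = u8"привет";  // 6 Cyrillic code points
    constexpr std::u16string_view ru16 = u"привет";
    static_assert(ru8.size()  == 12);   // 12 bytes in UTF-8
    static_assert(ru16.size() == 6);    // 12 bytes in UTF-16: identical

    constexpr std::u8string_view  jp8  = u8"日本語";   // 3 CJK code points
    constexpr std::u16string_view jp16 = u"日本語";
    static_assert(jp8.size()  == 9);    // 9 bytes in UTF-8
    static_assert(jp16.size() == 3);    // 6 bytes in UTF-16: one third smaller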
> In my suggestion, suppose the C++ standard library introduces a new
> class "foo" that can take a string in its constructor, and also
> provides methods for viewing the string. For the C++ <filesystem>
> classes, the classes all need to handle input of 5 different string
> types, and have 5 different observers to see the string. People using
> it have 5 times the options, and stack overflow gets questions from
> people wondering if they need string or u8string, or if they should use wstring or u16string.
> You, as the implementer of a library that uses UTF-16 internally, can
> happily use the u16string versions.
Indeed we could.
But just FYI, Qt uses the UTF-16 path only if std::filesystem::path::value_type is UTF-16. Otherwise, we perform the conversion from UTF-16 to UTF-8 ourselves and then give that to fs::path, because our encoder and decoder are faster.
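Roughly along these lines; the function name and the converter declaration below are invented for illustration only and are not Qt's actual API:

    #include <filesystem>
    #include <string>
    #include <string_view>

    namespace fs = std::filesystem;

    // Stand-in declaration for our own UTF-16 -> UTF-8 codec
    // (hypothetical name; a free-standing sketch appears earlier in the thread).
    std::u8string ourUtf16ToUtf8(std::u16string_view s);

    fs::path toFsPath(std::u16string_view s)
    {
        if constexpr (sizeof(fs::path::value_type) == sizeof(char16_t)) {
            // Native path character is 16 bits wide (wchar_t on Windows):
            // hand the UTF-16 code units over unchanged.
            return fs::path(std::wstring(s.begin(), s.end()));
        } else {
            // Native path character is 8 bits wide (POSIX): do the
            // UTF-16 -> UTF-8 conversion ourselves rather than letting
            // fs::path transcode.
            return fs::path(ourUtf16ToUtf8(s));
        }
    }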
> What I would like to see is that "foo" can /only/ take UTF-8 strings.
> That makes life much simpler for the C++ library implementer, as well
> as the unsung heroes who document all this stuff. It also makes it
> simpler for the C++ user. Unfortunately, the library writer in the
> middle will now need to add wrappers or conversion functions or
> methods when using foo. But it should be a small matter - especially
> compared to converting between standard C++ library and
> toolkit-specific vectors, strings, and other such beasties where the
> toolkit made their own classes before the C++ standard libraries were available or appropriate.
And that's what I am complaining about. We do not live in a greenfield scenario and there's a lot of UTF-16 out there. And you're right that it's expensive to add, document and test more API, but given that char must be supported, I am asking that the next one on the list be char16_t, ahead of char8_t.
If the Standard Library and SG16 decided that, going forward, they were not going to add new char16_t or char32_t APIs, I'd understand. At least we would know where we stand and whether we can rely on certain Standard APIs or need to pay attention to their development. As it stands, we can neither use <format> nor do we want to reimplement it, because we may want to use it in the future.
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Principal Engineer - Intel Platform & System Engineering