Date: Thu, 28 Aug 2025 09:05:01 -0700
On Thursday, 28 August 2025 05:08:32 Pacific Daylight Time David Brown via Std-
Proposals wrote:
> > Yes. I didn't pass a quality judgement above (but will below). I was
> > just stating a fact: UTF-16 is in use as an in-memory representation for
> > Unicode far more frequently than UTF-8 or UTF-32, given that Java,
> > Cocoa/CoreFoundation, ICU, Qt and the Win32 API all use it.
>
> OK. It surprises me that this is the case even for modern code,
> especially as many newer languages use UTF-8 almost everywhere, and the
> *nix world has traditionally used 8-bit or 32-bit encodings rather than
> 16-bit. But you have a lot more direct experience than me here - my own
> use of Unicode data has rarely involved anything that would care about
> in-memory representation (and things like collation have been handled by
> a database server rather than my own code).
Note I am talking specifically about "Unicode content", not "arbitrary payloads
that may or may not be Unicode", such as file paths. I am not disputing that
applications use 8-bit encodings a lot and thus would encode file names as
UTF-8 if they could.
> > UTF-8 is used a great deal but usually
> > in the context of arbitrary 8-bit encodings. If you try to find software
> > that will decode from a specified 8-bit encoding onto one of the UTF
> > codecs, you'll find that it's invariably UTF-16, not 8 or 32.
>
> A quick google search for "C++ library converting Latin-9 to Unicode"
> gave me UTF-8 only solutions and libraries that handled UTF-8, UTF-16
> and UTF-32. I did not come across any that were UTF-16 only, in my
> admittedly highly unscientific and non-representative search.
I didn't mean the conversion algorithms didn't exist. All conversions
basically decode into UTF-32 and then encode into something else, so encoding
to UTF-8 is just as easy as encoding to UTF-16.
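To make that "pivot through UTF-32" shape concrete, here is a minimal sketch
(not code from any of the libraries mentioned): decode one source unit into a
char32_t, then encode that code point as UTF-8 or UTF-16. For brevity the
decoder only handles the Latin-1 subset, where the byte value equals the code
point; a real Latin-9 decoder would add a small remap table for the few
positions that differ (e.g. 0xA4 becomes U+20AC).

#include <cstdint>
#include <string>

// Decode step: one Latin-1 byte maps directly to one code point.
char32_t decode_latin1(unsigned char b) { return b; }

// Encode step, UTF-8: one to four code units per code point.
void encode_utf8(char32_t cp, std::u8string &out)
{
    if (cp < 0x80) {
        out += char8_t(cp);
    } else if (cp < 0x800) {
        out += char8_t(0xC0 | (cp >> 6));
        out += char8_t(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += char8_t(0xE0 | (cp >> 12));
        out += char8_t(0x80 | ((cp >> 6) & 0x3F));
        out += char8_t(0x80 | (cp & 0x3F));
    } else {
        out += char8_t(0xF0 | (cp >> 18));
        out += char8_t(0x80 | ((cp >> 12) & 0x3F));
        out += char8_t(0x80 | ((cp >> 6) & 0x3F));
        out += char8_t(0x80 | (cp & 0x3F));
    }
}

// Encode step, UTF-16: one code unit, or a surrogate pair above U+FFFF.
void encode_utf16(char32_t cp, std::u16string &out)
{
    if (cp < 0x10000) {
        out += char16_t(cp);
    } else {
        cp -= 0x10000;
        out += char16_t(0xD800 | (cp >> 10));
        out += char16_t(0xDC00 | (cp & 0x3FF));
    }
}

The encode step is symmetric in effort, which is why the choice of target
encoding doesn't matter for the conversion itself; it matters for what you do
with the result afterwards.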
My argument was about how much each of them is actually used. You need to look
at what happens after the conversion. As the example I gave before: try to
collate two strings or a string list. There are five ways I can think of:
strcoll(), wcscoll(), ICU's ucol_strcoll, Win32 CompareStringEx and
CoreFoundation's NSString (Qt's QCollator wraps the above). strcoll() is the
POSIX/Unix "arbitrary 8-bit" one, wcscoll() is either UTF-16 or UTF-32
depending on the OS, and the other three are UTF-16.
POSIX.1-2008 added strcoll_l(), but support for the *_l functions has been
slow.
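To illustrate the UTF-16 dependency, here's a hedged sketch of the ICU C path
named above (locale name and per-call open/close are simplifications, not a
recommendation): ucol_strcoll() takes UChar*, i.e. UTF-16 code units, so
whatever encoding the text started in, it has to be UTF-16 by the time it
reaches the collator.

#include <unicode/ucol.h>

// Compare two NUL-terminated UTF-16 strings with Norwegian Bokmål rules.
// A real program would cache the collator instead of opening it per call.
bool less_nb(const UChar *a, const UChar *b)
{
    UErrorCode status = U_ZERO_ERROR;
    UCollator *coll = ucol_open("nb_NO", &status);
    if (U_FAILURE(status))
        return false;                       // sketch: no real error handling
    UCollationResult r = ucol_strcoll(coll, a, -1, b, -1);
    ucol_close(coll);
    return r == UCOL_LESS;
}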
> How often do you actually need to decode the strings? Normally you are
> passing around full strings, where smaller memory usage means faster
> copying and decode speed does not matter. UTF-8 strings can be
> searched, cut up, and pasted together quite happily. And when you /do/
> need to decode or encode for things like collation, normalisation or
> case changes, the speed for decoding UTF-16 vs. UTF-8 is unlikely to be
> a major factor. (The time taken for converting back and forth between
> UTF-8 and UTF-16 when input/output is UTF-8 and some languages,
> libraries and APIs need UTF-8 can be a factor.)
When it comes to Unicode manipulation, which is what we're discussing here:
all the time.
I agree that if the Unicode algorithms were written anew everywhere, decoding
and operating on UTF-8 input would be as fast as on UTF-16 (it's slightly more
complex, but since it fits more data per cache line for mostly-ASCII content,
that compensates). But those library implementations are all already written
and use UTF-16. If the C++ Standard wants to support certain Unicode
operations, implementations will either need to rewrite those algorithms from
scratch or will need to deal with UTF-16.
> > That's nowhere that I can see. All of the Win32 API is "W". There are a
> > handful of UTF-8 functions out of what, 10000?
>
> That is what I read on MS's own pages. However, it is entirely possible
> that the pages I came across were biased in some way - perhaps in the
> context of code for web applications. I did read information
> recommending setting the code page to UTF-8 and using the 8-bit APIs -
> with information about the limitations that still exist.
Changing the codepage is something the "owner of main()" can do, not something
a library implementor can. So neither I nor the Standard Library vendors can
count on that.
I also note that Microsoft is still adding UTF-16-only APIs, like LCMapStringEx
<https://learn.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-lcmapstringex>: there's no A-and-W pair for this one, it's UTF-16 only.
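As a small illustration of that point, here's a sketch (the locale name and
flags are just examples, not a recommendation): there is no LCMapStringExA, so
the input has to be a wchar_t/UTF-16 buffer regardless of which code page the
process runs under.

#include <windows.h>

// Linguistic upper-casing via LCMapStringEx: both the locale name and the
// text are UTF-16; this function has no 8-bit variant.
int to_upper_nb(const wchar_t *src, wchar_t *dst, int dstSize)
{
    return LCMapStringEx(L"nb-NO",
                         LCMAP_LINGUISTIC_CASING | LCMAP_UPPERCASE,
                         src, -1,            // -1: NUL-terminated input
                         dst, dstSize,
                         nullptr, nullptr, 0);
}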
> A great deal of real-world text of significant length, where byte count
> gets important, is in some kind of markup format - html, xml, json, etc.
> Almost no documents, even for CJK, are smaller in UTF-16 compared to
> UTF-8. For bits of plain text, then of course you are correct that
> UTF-8 is similar for some languages and larger for some languages. But
> how often is that the case in situations where the memory use makes a
> big difference?
I am not disputing that UTF-8 is used for the external representation. I am
referring to the strings obtained from that external representation with the
mark-up removed, so one can perform Unicode manipulations on them. Suppose
you're reading an XML or JSON input that is a list of book titles in multiple
languages and now you want to present the Norwegian titles in alphabetical
order (note how "Å" comes after "Ø" in that alphabet).
Right now, you can't get away from going through UTF-16 at some point, because
of the APIs above. If the Standard Library implements this, it may support it
in UTF-8. If Tiago gets his way and the Standard never implements this, then
we need first-class support for char16_t in the Standard Library because we'll
need to use it with those third-party APIs.
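To make the book-title example concrete, here is a minimal sketch using Qt's
QCollator (mentioned above as a wrapper over those platform APIs); the helper
name is made up, and the point is that the UTF-8 text pulled out of the JSON
lands in QString, i.e. UTF-16, before it can be collated:

#include <QCollator>
#include <QLocale>
#include <QStringList>
#include <algorithm>
#include <string>
#include <vector>

// Sort a list of UTF-8 encoded titles using Norwegian collation rules,
// where Å sorts after Ø at the end of the alphabet.
QStringList sortNorwegianTitles(const std::vector<std::string> &utf8Titles)
{
    QStringList titles;
    for (const std::string &t : utf8Titles)
        titles << QString::fromStdString(t);          // UTF-8 -> UTF-16 QString

    QCollator collator(QLocale(QLocale::Norwegian, QLocale::Norway));
    std::sort(titles.begin(), titles.end(), collator); // QCollator::operator()
    return titles;
}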
> That is indeed awkward. I fully appreciate the use of char16_t, but I
> am at a loss to see how support for wchar_t is helpful.
From another reply from Tom, the issue might be implementability: wchar_t
support was implementable, whereas the charNN_t equivalents weren't.
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
  Principal Engineer - Intel Platform & System Engineering