Date: Wed, 27 Aug 2025 09:55:04 +0200
On 27/08/2025 02:12, Thiago Macieira via Std-Proposals wrote:
> On Tuesday, 26 August 2025 14:48:21 Pacific Daylight Time David Brown via Std-
> Proposals wrote:
>> The consensus of the modern world is UTF-8 for everything, except for
>> legacy API's that are difficult to change.
>
> And that is the big issue: all the legacy APIs. We're not talking about a
> green field scenario.
Of course. Legacy support is both a blessing and a curse for languages,
libraries and APIs - it is vital for keeping existing code and
applications working. But it is equally vital not to let unfortunate
design choices of the past limit and restrict the future.
> In the real world, UTF-16 has a place and is in use for
> in-memory representation more frequently than UTF-8 or UTF-32.
Seriously?
UTF-16 is firmly established as the worst choice for text
representation - internal or external. It has all the disadvantages of
UTF-8 (variable-width encoding, so no direct code point indexing), all
the disadvantages of UTF-32 (bulky for mostly-ASCII text, byte-order
issues, incompatibility with byte-oriented APIs), and none of their
benefits.
I can appreciate UTF-16 having some convenience on Windows as an
internal format, because it is close to the internal format Windows has
used for some subsystems and APIs. But even Windows does not fully
support UTF-16 properly - it reflects a gradual move from UCS-2 towards
UTF-16 over the generations of Windows. And if you want to use the
filesystem APIs, you cannot pass arbitrary UTF-16 strings - you need to
sanitise them for the limits on filename lengths and the characters
supported. So you are going to have some kind of wrapper functions
anyway, and those would be a fine place to put your conversion
functions, so that application code could use normal standard UTF-8
regardless of the underlying APIs. And AFAIK, Microsoft is steadily
improving its UTF-8 support and encouraging its use over UCS-2 and
UTF-16.
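To make the sanitising point concrete, here is a minimal sketch of
checking one filename component against the characters Windows forbids.
The function name is hypothetical, and a real wrapper would need to do
more, as the comment notes:

```cpp
#include <string_view>

// Hypothetical helper: check one filename component against the
// characters Windows forbids and the usual 255-unit length limit.
// A real wrapper would also reject reserved device names (CON, NUL,
// COM1, ...) and components ending in a dot or space.
inline bool valid_windows_component(std::u16string_view name) {
    if (name.empty() || name.size() > 255)
        return false;
    for (char16_t c : name) {
        if (c < 0x20)                    // control characters
            return false;
        switch (c) {
        case u'<': case u'>': case u':': case u'"':
        case u'/': case u'\\': case u'|': case u'?': case u'*':
            return false;
        }
    }
    return true;
}
```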
I do understand that changes here do not come quickly or easily, and I
am in no way suggesting char16_t or wchar_t should be deprecated (at
this stage). Obviously the implementations of all sorts of libraries
and toolkits that were started when Unicode meant UCS-2 will still use
that format in much of their internal code, and that code must be
supported for a very long time to come.
But I /am/ suggesting that most /new/ interfaces and features should be
char8_t and UTF-8 only. Any time you need to use something else along
with that new interface, you use a conversion function.
> UTF-8 is used
> for external representation (network protocols and files).
It is also used for a great deal of internal representation - as well
as for string literals and data in source code. On the occasions when
you need a string split into directly indexable code points, UTF-32 is
the only option, so that is sometimes used internally. UTF-16 gives
you an inefficient format that differs from all external data and
still cannot index code points or characters directly.
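The surrogate-pair mechanism is what blocks direct indexing. A small
sketch (the counting function is illustrative, not a standard API)
shows a single emoji occupying two UTF-16 code units but one UTF-32
unit:

```cpp
#include <cstddef>
#include <string_view>

// Illustrative helper: count code points in a UTF-16 sequence by
// counting every unit except trailing (low) surrogates. Note that
// this is a linear scan - there is no O(1) way to find the Nth
// code point in UTF-16.
inline std::size_t utf16_code_points(std::u16string_view s) {
    std::size_t n = 0;
    for (char16_t c : s)
        if (c < 0xDC00 || c > 0xDFFF)   // skip trailing surrogates
            ++n;
    return n;
}
```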
>
>> AFAIUI, all these languages, libraries and OS's that had UCS2 character
>> encodings also now support UTF-8, and generally encourage UTF-8 as the
>> main choice of character type.
>
> Internally they still operate in UTF-16 and will need to perform conversion
> to/from it to operate on UTF-8.
Sure. New parts or new versions of these libraries might be written
with UTF-8, but I am not suggesting changing existing code.
> And that includes *the* library for Unicode
> support, ICU. If the Standard proposed an API for performing collation in
> Unicode, chances are it would be implemented using ucol_strcoll[1].
>
> So there's a big difference between supporting UTF-8 and doing so with zero
> memory cost.
(I am not concerned with memory cost here. For text large enough that
memory usage matters, UTF-8 is almost always the smaller
representation. For shorter strings, conversion between encodings is
fast and the memory overhead is a drop in the ocean.)
>
> [1] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/
> ucol_8h.html#a6a7c9e0e58b825b240ccb3005951247a
>
>> Reducing the burden on C++ implementers, and - more importantly -
>> reducing the burden on C++ users and programmers, would be best served
>> by standardising on UTF-8 for all internal code use, and providing
>> conversion and recoding functions for the boundaries when the programmer
>> is interacting with other encodings.
>
> I somewhat agree on the users.
>
And it is the users that should be the focus.
> However, for implementers, that's different, because we are not in a green field
> scenario. As an implementer, I am saying my life would be easier[*] if
> char16_t were first-class supported in C++.
>
> [*] where "easier" = "can use some Standard API", as opposed to "easier" =
> "can ignore the Standard"
>
You are (if I understand correctly) an implementer caught between two
stools - you work on the implementation of libraries and toolkits that
are used by application programmers and that in turn use the C++
standard library.
In my suggestion, suppose the C++ standard library introduces a new
class "foo" that takes a string in its constructor and also provides
methods for viewing the string. Like the C++ <filesystem> classes, it
would need to handle input in five different string types and offer
five different observers for the string. People using it have five
times the options, and Stack Overflow gets questions from people
wondering whether they need string or u8string, or whether they should
use wstring or u16string.
You, as the implementer of a library that uses UTF-16 internally, can
happily use the u16string versions.
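For reference, all five constructions are standard and converge on the
same path (a quick sketch; only the helper's name is mine):

```cpp
#include <filesystem>

// Sketch: std::filesystem::path accepts five string types, and offers
// five matching observers: string(), wstring(), u8string(),
// u16string(), u32string(). The helper name is hypothetical.
inline bool all_path_constructions_agree() {
    namespace fs = std::filesystem;
    fs::path a("dir/file.txt");    // string  (narrow, native)
    fs::path b(L"dir/file.txt");   // wstring
    fs::path c(u8"dir/file.txt");  // u8string (char8_t, C++20)
    fs::path d(u"dir/file.txt");   // u16string
    fs::path e(U"dir/file.txt");   // u32string
    return a == b && a == c && a == d && a == e;
}
```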
What I would like to see is that "foo" can /only/ take UTF-8 strings.
That makes life much simpler for the C++ library implementer, as well as
the unsung heroes who document all this stuff. It also makes it simpler
for the C++ user. Unfortunately, the library writer in the middle will
now need to add wrappers or conversion functions or methods when using
foo. But it should be a small matter - especially compared to
converting between standard C++ library and toolkit-specific vectors,
strings, and other such beasties where the toolkit made their own
classes before the C++ standard libraries were available or appropriate.
It is too late for C++ to have a simple, clean character and string
solution with pure UTF-8 that would be ideal for users and implementers
alike. We are not starting from scratch. But it is not too late to
decide that that is the best direction, and gradually let other
encodings fade into the background of legacy along with C-style string
functions and memory handling.
Received on 2025-08-27 07:55:11