Date: Tue, 26 Aug 2025 17:12:22 -0700
On Tuesday, 26 August 2025 14:48:21 Pacific Daylight Time David Brown via Std-Proposals wrote:
> The consensus of the modern world is UTF-8 for everything, except for
> legacy API's that are difficult to change.
And that is the big issue: all the legacy APIs. We're not talking about a
green field scenario. In the real world, UTF-16 has a place and is in use for
in-memory representation more frequently than UTF-8 or UTF-32. UTF-8 is used
for external representation (network protocols and files).
> AFAIUI, all these languages, libraries and OS's that had UCS2 character
> encodings also now support UTF-8, and generally encourage UTF-8 as the
> main choice of character type.
Internally they still operate in UTF-16 and will need to perform conversion
to/from it to operate on UTF-8. And that includes *the* library for Unicode
support, ICU. If the Standard proposed an API for performing collation in
Unicode, chances are it would be implemented using ucol_strcoll[1].
So there's a big difference between supporting UTF-8 and doing so with zero
memory cost.
[1] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ucol_8h.html#a6a7c9e0e58b825b240ccb3005951247a
> Reducing the burden on C++ implementers, and - more importantly -
> reducing the burden on C++ users and programmers, would be best served
> by standardising on UTF-8 for all internal code use, and providing
> conversion and recoding functions for the boundaries when the programmer
> is interacting with other encodings.
I somewhat agree, as far as users are concerned.
However, for implementers, it's different, because we are not in a green field
scenario. As an implementer, I am saying my life would be easier[*] if
char16_t had first-class support in C++.
[*] where "easier" = "can use some Standard API", as opposed to "easier" =
"can ignore the Standard"
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
  Principal Engineer - Intel Platform & System Engineering