Date: Tue, 26 Aug 2025 23:48:21 +0200
On 26/08/2025 22:39, Thiago Macieira via Std-Proposals wrote:
> On Tuesday, 26 August 2025 11:40:38 Pacific Daylight Time Tom Honermann wrote:
>> Part of the current struggle is deciding to continue doing what we did
>> for std::filesystem and provide interfaces for all of char (in current
>> locale encoding or environment locale encoding or literal encoding or
>> ...), wchar_t, char8_t, char16_t, and char32_t or to reduce the
>> encodings exposed in such interfaces to just char8_t and maybe char16_t
>> to reduce the burden on C++ implementors.
>
> I'm arguing that for anything *text* it should be "char16_t, char, maybe
> char8_t and implementation-defined anything else" (thus allowing Windows
> environments to provide wchar_t support, as it's the only environment where
> wchar_t is realistically used, through a simple reinterpret_cast to char16_t).
>
> The big example here is <format>: it currently supports char and wchar_t. That
> renders it mostly useless for Qt. To support it, we need to roll our own of
> almost everything, like for example <chrono> formatting. It's not worth it, so
> we are not adopting <format>.
>
The consensus of the modern world is UTF-8 for everything, except for
legacy APIs that are difficult to change. Many systems (Windows NT,
Java, JavaScript, Python, Qt, and probably more) jumped to UCS-2 when
Unicode was new - a sensible decision at the time, but unfortunately a
bad choice in the long term. To me, the only sane choices of character
types are plain char (for 7-bit ASCII - good enough for many types of
code), char8_t for UTF-8, and char32_t for UTF-32 on the rare occasions
when you need to expand text into individual Unicode code points.
Everything else, including non-Unicode encodings, should be considered
legacy: you still need to support it for handling old or rare
documents, and for interacting with non-UTF-8 APIs.

Things like <format> or the C++ filesystem functions should be char8_t
only - strictly UTF-8. A dozen different character formats - fully
defined or implementation-defined - do a disservice to the programmer.

AFAIUI, all these languages, libraries and OSes that had UCS-2 character
encodings now also support UTF-8, and generally encourage UTF-8 as the
main choice of encoding.

Reducing the burden on C++ implementers, and - more importantly -
reducing the burden on C++ users and programmers, would be best served
by standardising on UTF-8 for all internal code use, and providing
conversion and recoding functions for the boundaries when the programmer
is interacting with other encodings.
That is, of course, just my own opinion.
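
To make the "UTF-8 inside, convert at the boundary" idea concrete, here is
a rough sketch of what such recoding helpers might look like. It is only
an illustration, not anything proposed or standardised: the names
utf8_unit, utf8_view, decode_one, utf8_to_utf32 and utf8_to_utf16 are all
made up for this example, and the choice to throw on malformed input
(rather than substitute U+FFFD) is just one possible policy.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>

// Assumption: use char8_t where the compiler provides it (C++20),
// and fall back to plain char otherwise. Hypothetical alias names.
#if defined(__cpp_char8_t)
using utf8_unit = char8_t;
#else
using utf8_unit = char;
#endif
using utf8_view = std::basic_string_view<utf8_unit>;

// Decode the code point that starts at in[i] and advance i past it.
// Throws std::invalid_argument on malformed UTF-8.
inline char32_t decode_one(utf8_view in, std::size_t& i) {
    const auto byte = [&](std::size_t k) {
        return static_cast<unsigned char>(in[k]);
    };
    const unsigned char lead = byte(i);
    std::size_t len = 0;
    char32_t cp = 0;
    char32_t min = 0;  // smallest value allowed for this length
    if (lead < 0x80)                { len = 1; cp = lead;        min = 0; }
    else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; min = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min = 0x10000; }
    else throw std::invalid_argument("invalid UTF-8 lead byte");

    if (in.size() - i < len)
        throw std::invalid_argument("truncated UTF-8 sequence");
    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char c = byte(i + k);
        if ((c & 0xC0) != 0x80)
            throw std::invalid_argument("invalid UTF-8 continuation byte");
        cp = (cp << 6) | (c & 0x3F);
    }
    // Reject overlong forms, surrogates, and values above U+10FFFF.
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("overlong or out-of-range UTF-8 sequence");
    i += len;
    return cp;
}

// Expand UTF-8 into UTF-32 code points (the "rare occasions" case).
inline std::u32string utf8_to_utf32(utf8_view in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size();)
        out.push_back(decode_one(in, i));
    return out;
}

// Recode UTF-8 to UTF-16 for a boundary with a UTF-16 API.
inline std::u16string utf8_to_utf16(utf8_view in) {
    std::u16string out;
    for (std::size_t i = 0; i < in.size();) {
        const char32_t cp = decode_one(in, i);
        assert(cp <= 0x10FFFF);  // guaranteed by decode_one
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Split into a high/low surrogate pair.
            const char32_t v = cp - 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
        }
    }
    return out;
}
```

With something like this, code would keep its strings as UTF-8 and call
utf8_to_utf16(utf8_view(text)) only at the point where it hands them to
a UTF-16 interface.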
Received on 2025-08-26 21:48:24