Date: Wed, 2 Apr 2025 17:43:13 -0400
On 4/1/25 4:02 PM, Thiago Macieira via Std-Proposals wrote:
> On Tuesday, 1 April 2025 13:42:27 Mountain Daylight Time Tymi via Std-
> Proposals wrote:
>> Sure, but we are talking about std::wprint and not std::u16print and I
>> assumed they wanted to say "format_string should support wide strings".
>> Maybe I'm wrong in this, but yes, I'm not defending char16_t in any case
>> here
> I am.
>
> char16_t is far more useful than wchar_t because it's the same everywhere.
> wchar_t is only useful on Windows; everywhere else it's legacy dead weight.
> Other OSes may not have low-level UTF-16 API, but they do have mid- and high-
> level ones, including Apple's Cocoa and Carbon APIs, JDK interfaces (including
> Android), and ICU. And then there's of course Qt, which uses UTF-16
> extensively and exclusively, so UTF-16 is highly relevant for C++.
Agreed.
> That's why I am saying <format> should support char16_t before any other
> character type. This solves the formatting on Windows by the simple expediency
> of reinterpret_cast, while the formatting on all other platforms supported by
> mainstream Standard Libraries is a well-known UTF16-to-UTF8 conversion.
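(For concreteness, the Windows expedient mentioned above amounts to
something like the sketch below. It is only an illustration: it assumes
a platform where wchar_t is a 16-bit UTF-16 type, which Windows
provides but the standard does not guarantee, and the cast is the usual
pragmatic shortcut rather than aliasing-clean C++.)

    #include <string>
    #include <string_view>

    // Illustration only: where wchar_t and char16_t are both 16-bit UTF-16
    // code units (as on Windows), a char16_t buffer can be handed to wide
    // APIs through a cast. The static_assert documents that assumption.
    std::wstring to_windows_wide(std::u16string_view s)
    {
        static_assert(sizeof(wchar_t) == sizeof(char16_t),
                      "shortcut only applies where wchar_t is 16 bits");
        return std::wstring(reinterpret_cast<const wchar_t*>(s.data()), s.size());
    }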
The <format> header already supports char and wchar_t. This thread is
specifically about a wide version of the std::print() family of
functions; I believe those are the only declarations from <format> and
<print> for which a wchar_t version is not provided.
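To make the current asymmetry concrete, here is a minimal sketch
(assuming a conforming C++23 implementation of <format> and <print>):

    #include <format>
    #include <print>
    #include <string>

    int main()
    {
        std::wstring ws = std::format(L"{} and {}", 1, 2); // OK: wide overloads exist in <format>
        std::print("{} and {}\n", 1, 2);                   // OK: narrow overload exists in <print>
        // std::print(L"{} and {}\n", 1, 2);               // ill-formed today: no wchar_t overload
        (void)ws;
    }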
The challenge with adding support for the char/N/_t types is that
std::format() and std::print() rely on locale support for some
functionality, and char/N/_t support hasn't been added to the
std::locale facets yet.
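A small example of where the locale dependency shows up: the "L" format
specifier takes digit grouping from std::numpunct, and the standard
only requires that facet for char and wchar_t, so a charN_t format
string would have nothing to consult. (Sketch only; std::locale("")
assumes the environment names a usable locale.)

    #include <format>
    #include <locale>

    int main()
    {
        std::locale loc("");                           // assumes a usable environment locale
        auto s  = std::format(loc, "{:L}", 1234567);   // uses std::numpunct<char>
        auto ws = std::format(loc, L"{:L}", 1234567);  // uses std::numpunct<wchar_t>
        // A hypothetical std::format(loc, u8"{:L}", 1234567) would need a
        // std::numpunct<char8_t>-like facet, which does not exist today.
        (void)s; (void)ws;
    }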
The issues involved in improving wchar_t support and those involved in
char/N/_t support are disjoint and do not compete with each
other. Time availability in WG21 also isn't a constraint (at least not
now, not for C++29). The problem is a lack of proposals. I would love to
have some to schedule in SG16! Earlier papers get priority (most of the
time).
> The way I see it, there are only two useful character types: char and
> char16_t. (char8_t came too late to be useful, so it's as useful as char32_t
> for all I care)
I think it is unlikely that char8_t arriving earlier would have changed
where we are now. There is a long history of unfortunate timing. IBM
chose EBCDIC for the System/360 because, despite being involved in ASCII
standardization, they didn't have time to produce ASCII hardware before
going to market. Unicode arrived too late to prevent the proliferation
of other character sets and encoding schemes like DBCS, ISO/IEC 2022,
shift-state encodings, etc. Microsoft adopted Unicode too early and
got stuck with UCS-2 and a 16-bit wchar_t, thus necessitating a move to
UTF-16 when the Unicode character set expanded to 21 bits. UTF-8 arrived
too late to prevent UCS-2 from being adopted in the first place (and it
might not have mattered anyway). char8_t couldn't have arrived earlier
than UTF-8 and the mess we have today was already well established then.
The only way out of the mess is for the industry to settle on a
direction. I personally think that the only solution that works across
the entire ecosystem is an approach that uses char8_t and/or char16_t
for text within component boundaries with transcoding as necessary at
program boundaries. Many programmers would like to see char be made
synonymous with UTF-8. That would be a good outcome if we could get
there, but I remain skeptical that it is achievable across the entire
ecosystem; EBCDIC and legacy character sets continue to be too important
to ignore, particularly within the C++ standard itself.
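As a rough sketch of what "charN_t inside, transcode at the boundary"
can look like with facilities that exist today (c8rtomb was added to
<cuchar> for C++20, though implementation support is still uneven, so
treat this as illustrative rather than portable in practice):

    #include <climits>
    #include <clocale>
    #include <cstddef>
    #include <cstdio>
    #include <cuchar>
    #include <string>
    #include <string_view>

    // Component-internal text stays UTF-8 (char8_t); conversion to the
    // narrow execution/locale encoding happens only at the program boundary.
    std::string to_narrow(std::u8string_view utf8)
    {
        std::string out;
        std::mbstate_t state{};
        char buf[MB_LEN_MAX];
        for (char8_t c8 : utf8) {
            std::size_t n = std::c8rtomb(buf, c8, &state);
            if (n == static_cast<std::size_t>(-1))
                break;                // encoding error; real code would report it
            out.append(buf, n);       // n is 0 for non-final code units
        }
        return out;
    }

    int main()
    {
        std::setlocale(LC_ALL, "");               // adopt the environment's narrow encoding
        std::u8string greeting = u8"h\u00e9llo";  // internal text is UTF-8
        std::puts(to_narrow(greeting).c_str());   // transcode at the boundary
    }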
Tom.