Date: Sat, 11 Oct 2025 16:25:32 -0400
Catching up on some old messages I meant to respond to...
On 8/30/25 6:59 PM, Oliver Hunt via Std-Proposals wrote:
>
>> On Aug 30, 2025, at 1:58 PM, Thiago Macieira<thiago_at_[hidden]> wrote:
>>
>> On Saturday, 30 August 2025 13:46:59 Pacific Daylight Time Oliver Hunt wrote:
>>> I’ll prod folk again, but I’m not sure I understand why you seem so
>>> absolutely adamant that everyone does or should use utf16 internally when
>>> multiple people have said this is not true, and pointed to every API you
>>> reference correctly as “this API was introduced when ucs2 was thought to be
>>> sufficient, and then got utf16 bolted on after the fact, at different
>>> rates”.
>> I'm not adamant on this any more. I think based on what you said that Swift
>> reimplemented the support for the Unicode Database. I just can't find it,
>> because I don't know how to navigate the source code. I've found where it
>> iterates over the UTF-8 string and returns UTF-32 code units/points, but not
>> where it looks up the collation value such that U+00E9 is less than U+0069.
>>
>> The problem is of course that this means they've duplicated the access to the
>> Unicode Database, instead of using the OS. Then again, if Swift is cross-
>> platform to other OSes, it kind of has to if it doesn't want to depend on ICU.
> On Apple platforms Swift foundation libraries _is_ part of the OS.
>
> That said, it would seem - though I don’t know the details of icu, etc - entirely
> plausible for the swift foundation libraries to directly include the icu tables, or
> reference them by symbols. But again I _really_ don’t know: C++ compiler guy,
> not swift.
>
>>> What you seem to be arguing is old ABI fixed APIs that were extended to
>>> support utf16, so despite the many problems of utf16 vs utf8, and the wide
>>> spread adoption of utf8 everywhere other than places that are stuck with
>>> utf16 due to aforementioned ABI constraints, all new systems languages
>>> being built on utf8 strings, we should make new APIs built around utf16 so
>>> we can continue to be required to maintain an encoding that is (what the
>>> domain experts have told me) is bad on every metric.
>> I'm arguing that because we have such a widespread use of UTF-16 in C and C++,
>> we need first-class UTF-16 support in the C++ Standard. I don't care about
>> other languages, because I'm not writing code for them. But the underlying
>> infrastructure for UTF-16 for C and C++ seems to be there.
>>
>> So instead of talking about Rust or Swift, let's ask what libc++ would use to
>> implement collation.
> I’m saying that we don’t have widespread use of utf16 in C and C++. C and C++
> do not have _any_ awareness of unicode, strings are blobs, and code points are
> equivalent to characters. The only platform in which C/C++ have even ucs2
> support seems to be windows - on linux, macOS, and I would guess the other
> unix like systems wchar_t is 32bit, e.g. a unicode scalar, not ucs2 or a utf16
> code point.
That isn't quite correct. Both C and C++ are aware of the Unicode
encodings and correctly encode UTF-8, UTF-16, and UTF-32 literals,
including when the wide literal encoding is UTF-16 (now supported in
C++23 thanks to P2460R2 (Relax requirements on wchar_t to match
existing practices) <https://wg21.link/p2460r2>, and to be supported in
C if N3366 (Restartable Functions for Efficient Character Conversions,
r13) <https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3366.htm> or
one of its successors is accepted). std::format() and its friends have
behavior defined by the Unicode Standard when the associated literal
encoding is a Unicode encoding; search for "Unicode" and "UTF" in
[format] <https://eel.is/c++draft/format>.
I acknowledge that isn't much support, of course.
> If C++ _were_ to add support for unicode, based on my understanding of the C++
> spec process, it would not be a matter of pointing to a system library or libicu,
> it would be each C++ edition referencing a _specific_ unicode release that would
> need to be embedded in the standard library.
The C++ standard already contains normative references to the Unicode
Standard. See [intro.refs] <https://eel.is/c++draft/intro.refs> and
[intro.compliance.general]p10
<https://eel.is/c++draft/intro.compliance.general#10>. At present, we
normatively refer to Unicode 15.1 with an explicit allowance for
implementors to use a newer version.
>
> This used to be a problem with web standard specifications, as it put standards in
> a position that would require the same text rendering differently in a webpage
> than the rest of the OS. So browsers ignored it, and the specifications removed
> those issues. C++ has very different constraints, however, which means
> the requirement for specific version specifications is perhaps more reasonable.
C++ isn't concerned with rendering behavior (at present, and unlikely to
be any time soon). The existing normative reference and allowance for
use of a later Unicode Standard version is backed by stability
guarantees provided by the Unicode Standard to ensure compatibility.
>
>>> “AI” is just predictive text generation regurgitating existing content, so
>>> of course it will produce answers that are most like the above. The
>>> majority of the posts it regurgitates written about stuff like this are
>>> from _decades_ of objc + foundation. AI doesn’t magically know anything, it
>>> literally just regurgitates the work of others, periodically adding errors.
>>> There is no reason to use it in a technical forum.
>> Which is why I almost always ignore the AI and go straight for the sources,
>> because until it is 99% reliable or more, it's useless. But in this case,
>> since I can't pass the judgement either on the accuracy of the sources, the AI
>> answer suffices. It seemed plausible that, if you needed the exact same sorting
>> as Finder, you'd use the same function that Finder uses, not one that may be
>> slightly different due to a reimplementation, however correct it may be.
> As above: swift is part of the platform, the swift standard libraries are system
> libraries. Much of what you are thinking is swift libraries/implementation vs
> system implementation (or even “old” system functions) are part of the swift
> standard libraries, not something separate.
>
> It is a fundamental misunderstanding to think that swift behavior can diverge
> from “platform” behavior on our platforms: it fundamentally is the platform,
> and as such the swift implementations of operations do not, and cannot,
> change the ABI or observable behavior. It is also incorrect to think “I am calling
> the system version of this, not the swift one”, and to think that those are necessarily
> not the same implementation, or that the implementation of those functions
> is not coming from the swift standard library.
>
> I am unclear on why you seem so adamant about adding new utf16 APIs when
> no one who works with unicode believes that utf16 is a good answer to any
> problem, most vendors of APIs that use utf16 regret those APIs, most language
> standards that specified the use of utf16 as their string representation regret it,
> new languages and APIs are in terms of utf8, unless there are specific platform
> reasons that _require_ un-abstracted utf16/char16_t interfaces.
There is a large amount of C and C++ source code written for UTF-16 that
won't be rewritten for UTF-8 any time soon. That strikes me as more than
enough of a reason to support both UTF-8 and UTF-16.
Tom.
>
> This is especially true for C/C++ (as opposed to languages like Java, JavaScript,
> etc) where utf16 is not the standard non-8bit character encoding.
>
> —Oliver
>
Received on 2025-10-11 20:25:37
