On Aug 30, 2025, at 1:58 PM, Thiago Macieira <thiago@macieira.org> wrote:
On Saturday, 30 August 2025 13:46:59 Pacific Daylight Time Oliver Hunt wrote:
I’ll prod folk again, but I’m not sure I understand why you seem so absolutely adamant that everything does or should use utf16 internally when multiple people have said this is not true, and pointed to every API you reference correctly as “this API was introduced when ucs2 was thought to be sufficient, and then got utf16 bolted on after the fact and at different rates”.
I'm not adamant on this any more. I think, based on what you said, that Swift reimplemented the support for the Unicode Database. I just can't find it, because I don't know how to navigate the source code. I've found where it iterates over the UTF-8 string and returns UTF-32 code units/points, but not where it looks up the collation value such that U+00E9 is less than U+0069. The problem is of course that this means they've duplicated the access to the Unicode Database, instead of using the OS. Then again, if Swift is cross-platform to other OSes, it kind of has to if it doesn't want to depend on ICU.
On Apple platforms the Swift foundation libraries _are_ part of the OS. That said, it would seem - though I don’t know the details of icu, etc - entirely plausible for the swift foundation libraries to directly include the icu tables, or reference them by symbols.
But again I _really_ don’t know: C++ compiler guy, not swift.
What you seem to be arguing is old ABI-fixed APIs that were extended to support utf16, so despite the many problems of utf16 vs utf8, and the widespread adoption of utf8 everywhere other than places that are stuck with utf16 due to aforementioned ABI constraints, all new systems languages being built on utf8 strings, we should make new APIs built around utf16 so we can continue to be required to maintain an encoding that (from what the domain experts have told me) is bad on every metric.
I'm arguing that because we have such a widespread use of UTF-16 in C and C++, we need first-class UTF-16 support in the C++ Standard. I don't care about other languages, because I'm not writing code for them. But the underlying infrastructure for UTF-16 for C and C++ seems to be there. So instead of talking about Rust or Swift, let's ask what libc++ would use to implement collation.
I’m saying that we don’t have widespread use of utf16 in C and C++. C and C++ do not have _any_ awareness of unicode, strings are blobs, and code points are equivalent to characters. The only platform in which C/C++ have even ucs2 support seems to be windows - on linux, macOS, and I would guess the other Unix-like systems wchar_t is 32bit, i.e. a unicode scalar, not ucs2 or a utf16 code unit.
That isn't quite correct. Both C and C++ are aware of the Unicode encodings and correctly encode UTF-8, UTF-16, and UTF-32 literals, including when the wide literal encoding is UTF-16 (which is now supported in C++23 thanks to P2460R2 (Relax requirements on wchar_t to match existing practices) and will be in C if N3366 (Restartable Functions for Efficient Character Conversions, r13) or one of its successors is accepted). std::format() and its friends have behavior defined by the Unicode Standard when the associated literal encoding is a Unicode encoding. Search for "Unicode" and "UTF" in [format].
I acknowledge that isn't much support of course.
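To make the point about literals concrete, here is a minimal sketch (assuming a C++17 or later compiler) that checks at compile time how many code units each literal prefix produces for the same code point. The helper name `code_units` and the constant `wide_units` are made up for this illustration and are not part of any standard API.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// not counting the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N - 1; }

// U+00E9 lies inside the Basic Multilingual Plane.
static_assert(code_units(u8"\u00E9") == 2);  // UTF-8: 0xC3 0xA9
static_assert(code_units(u"\u00E9") == 1);   // UTF-16: one code unit
static_assert(code_units(U"\u00E9") == 1);   // UTF-32: one code unit

// U+1F600 lies outside the Basic Multilingual Plane.
static_assert(code_units(u8"\U0001F600") == 4);  // UTF-8: four bytes
static_assert(code_units(u"\U0001F600") == 2);   // UTF-16: surrogate pair
static_assert(code_units(U"\U0001F600") == 1);   // UTF-32: one code unit

// The compiler emits the surrogate pair itself.
static_assert(u"\U0001F600"[0] == 0xD83D && u"\U0001F600"[1] == 0xDE00);

// The wide literal depends on the platform: typically 2 code units where
// wchar_t is 16 bits (Windows) and 1 where it is 32 bits (Linux, macOS).
constexpr std::size_t wide_units = code_units(L"\U0001F600");
```

If this translation unit compiles, every assertion above holds for that implementation. The value of `wide_units` is deliberately left unasserted, since it is the one part that varies between platforms.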
The C++ standard already contains normative references to the Unicode Standard. See [intro.refs] and [intro.compliance.general]p10. At present, we normatively refer to Unicode 15.1 with an explicit allowance for implementors to use a newer version.
If C++ _were_ to add support for unicode, based on my understanding of the C++ spec process, it would not be a matter of pointing to a system library or libicu, it would be each C++ edition referencing a _specific_ unicode release that would need to be embedded in the standard library.
C++ isn't concerned with rendering behavior (at present, and unlikely to be any time soon). The existing normative reference and allowance for use of a later Unicode Standard version is backed by stability guarantees provided by the Unicode Standard to ensure compatibility.
This used to be a problem with web standard specifications, as it put standards in a position that would require the same text rendering differently in a webpage than the rest of the OS. So browsers ignored it, and the specifications removed those issues. C++ has very different constraints however, which means the requirement for specific version specifications is perhaps more reasonable.
“AI” is just predictive text generation regurgitating existing content, so of course it will produce answers that are most like the above. The majority of the posts it regurgitates written about stuff like this are from _decades_ of objc + foundation. AI doesn’t magically know anything, it literally just regurgitates the work of others, periodically adding errors. There is no reason to use it in a technical forum.
Which is why I almost always ignore the AI and go straight for the sources, because until it is 99% reliable or more, it's useless. But in this case, since I can't pass judgement on the accuracy of the sources either, the AI answer suffices. It seemed plausible that, if you needed the exact same sorting as Finder, you'd use the same function that Finder uses, not one that may be slightly different due to a reimplementation, however correct it may be.
As above: swift is part of the platform, the swift standard libraries are system libraries. Much of what you are thinking is swift libraries/implementation vs system implementation (or even “old” system functions) is part of the swift standard libraries, not something separate. It is a fundamental misunderstanding to think that swift behavior can diverge from “platform” behavior on our platforms: it fundamentally is the platform, and as such the swift implementations of operations do not, and cannot, change the ABI or observable behavior. It is also incorrect to think “I am calling the system version of this not the swift one”, and think that those are necessarily not the same implementation, or that the implementation of those functions is not coming from the swift standard library.
I am unclear on why you seem so adamant about adding new utf16 APIs when no one who works with unicode believes that utf16 is a good answer to any problem, most vendors of APIs that use utf16 regret those APIs, most language standards that specified the use of utf16 as their string representation regret it, and new languages and APIs are in terms of utf8, unless there are specific platform reasons that _require_ un-abstracted utf16/char16_t interfaces.
There is a large amount of C and C++ source code written for UTF-16 that won't be rewritten for UTF-8 any time soon. That strikes me as more than enough of a reason to support both UTF-8 and UTF-16.
Tom.
This is especially true for C/C++ (as opposed to languages like Java, JavaScript, etc) where utf16 is not the standard non-8bit character encoding. —Oliver