Catching up on some old messages I meant to respond to...

On 8/30/25 6:59 PM, Oliver Hunt via Std-Proposals wrote:

On Aug 30, 2025, at 1:58 PM, Thiago Macieira <thiago@macieira.org> wrote:

On Saturday, 30 August 2025 13:46:59 Pacific Daylight Time Oliver Hunt wrote:

I’ll prod folk again, but I’m not sure I understand why you seem so
absolutely adamant that everyone does or should use utf16 internally when
multiple people have said this is not true, and pointed to every API you
reference correctly as “this API was introduced when ucs2 was thought to be
sufficient, and then got utf16 bolted on after the fact and at different
rates”.

I'm not adamant on this any more. I think based on what you said that Swift 
reimplemented the support for the Unicode Database. I just can't find it, 
because I don't know how to navigate the source code. I've found where it 
iterates over the UTF-8 string and returns UTF-32 code units/points, but not 
where it looks up the collation value such that U+00E9 is less than U+0069.
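
To make the ordering point above concrete, here is a minimal sketch in standard C++17 with no collation library involved. The helper name code_point_less is made up for illustration. A plain comparison of UTF-32 strings orders by code point value, which puts U+00E9 after U+0069; getting U+00E9 to sort before U+0069 needs a lookup of collation weights, which this sketch deliberately does not attempt.

```cpp
#include <string_view>

// By code point value, U+00E9 (e with acute) is greater than U+0069 ('i').
static_assert(U'\u00E9' > U'i');

// Hypothetical helper: orders two UTF-32 strings by code point value only.
// This is NOT collation; it consults no Unicode data at all.
constexpr bool code_point_less(std::u32string_view a, std::u32string_view b) {
    return a < b;
}

// Code point order puts "i" first, the opposite of the collation order
// discussed above.
static_assert(code_point_less(U"i", U"\u00E9"));
static_assert(!code_point_less(U"\u00E9", U"i"));
```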

The problem is of course that this means they've duplicated the access to the 
Unicode Database, instead of using the OS. Then again, if Swift is cross-
platform to other OSes, it kind of has to if it doesn't want to depend on ICU.

On Apple platforms the Swift foundation libraries _are_ part of the OS.

That said, it would seem - though I don’t know the details of icu, etc - entirely
plausible for the swift foundation libraries to directly include the icu tables, or
reference them by symbols. But again I _really_ don’t know: C++ compiler guy,
not swift.

What you seem to be arguing is that because old ABI-fixed APIs were extended to
support utf16, then despite the many problems of utf16 vs utf8, the widespread
adoption of utf8 everywhere other than places that are stuck with
utf16 due to the aforementioned ABI constraints, and all new systems languages
being built on utf8 strings, we should make new APIs built around utf16 so
we can continue to be required to maintain an encoding that (from what the
domain experts have told me) is bad on every metric.

I'm arguing that because we have such a widespread use of UTF-16 in C and C++, 
we need first-class UTF-16 support in the C++ Standard. I don't care about 
other languages, because I'm not writing code for them. But the underlying 
infrastructure for UTF-16 for C and C++ seems to be there.

So instead of talking about Rust or Swift, let's ask what libc++ would use to 
implement collation.

I’m saying that we don’t have widespread use of utf16 in C and C++. C and C++
do not have _any_ awareness of unicode, strings are blobs, and code points are
equivalent to characters. The only platform in which C/C++ have even ucs2
support seems to be windows - on linux, macOS, and I would guess the other
unix like systems wchar_t is 32bit, e.g. a unicode scalar, not ucs2 or a utf16
code point.

That isn't quite correct. Both C and C++ are aware of the Unicode encodings and correctly encode UTF-8, UTF-16, and UTF-32 literals, including when the wide literal encoding is UTF-16 (which is now supported in C++23 thanks to P2460R2 (Relax requirements on wchar_t to match existing practices) and will be in C if N3366 (Restartable Functions for Efficient Character Conversions, r13) or one of its successors is accepted). std::format() and its friends have behavior defined by the Unicode Standard when the associated literal encoding is a Unicode encoding. Search for "Unicode" and "UTF" in [format].

I acknowledge that isn't much support of course.
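
As a small illustration of the literal encodings mentioned above, here is a sketch that counts the code units the compiler produces for one character outside the BMP (U+1F600). The helper name code_units is made up for illustration. The wide-literal result is shown without an assertion because it depends on the implementation's wide literal encoding.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// U+1F600 lies outside the Basic Multilingual Plane.
static_assert(code_units(u8"\U0001F600") == 4);  // UTF-8: four code units
static_assert(code_units(u"\U0001F600") == 2);   // UTF-16: a surrogate pair
static_assert(code_units(U"\U0001F600") == 1);   // UTF-32: one code unit

// The UTF-16 literal holds the expected high and low surrogates.
static_assert(u"\U0001F600"[0] == 0xD83D);
static_assert(u"\U0001F600"[1] == 0xDC00 + 0x200);  // 0xDE00

// Implementation-dependent: typically 2 where the wide literal encoding is
// UTF-16 (16-bit wchar_t) and 1 where it is UTF-32 (32-bit wchar_t).
constexpr std::size_t wide_units = code_units(L"\U0001F600");
```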

If C++ _were_ to add support for unicode, based on my understanding of the C++
spec process, it would not be a matter of pointing to a system library or libicu,
it would be each C++ edition referencing a _specific_ unicode release that would
need to be embedded in the standard library.

The C++ standard already contains normative references to the Unicode Standard. See [intro.refs] and [intro.compliance.general]p10. At present, we normatively refer to Unicode 15.1 with an explicit allowance for implementors to use a newer version.


This used to be a problem with web standard specifications, as it put standards in
a position that would require the same text rendering differently in a webpage
than the rest of the OS. So browsers ignored it, and the specifications removed
those issues. C++ has very different constraints however, which means
the requirement for specific version specifications is perhaps more reasonable.

C++ isn't concerned with rendering behavior (at present, and unlikely to be any time soon). The existing normative reference and allowance for use of a later Unicode Standard version is backed by stability guarantees provided by the Unicode Standard to ensure compatibility.

“AI” is just predictive text generation regurgitating existing content, so
of course it will produce answers that are most like the above. The
majority of the posts it regurgitates written about stuff like this are
from _decades_ of objc + foundation. AI doesn’t magically know anything, it
literally just regurgitates the work of others, periodically adding errors.
There is no reason to use it in a technical forum.

Which is why I almost always ignore the AI and go straight for the sources, 
because until it is 99% reliable or more, it's useless. But in this case, 
since I can't pass judgement on the accuracy of the sources either, the AI 
answer suffices. It seemed plausible that, if you needed the exact same sorting 
as Finder, you'd use the same function that Finder uses, not one that may be 
slightly different due to a reimplementation, however correct it may be.

As above: swift is part of the platform, the swift standard libraries are system
libraries. Much of what you are thinking of as swift libraries/implementation vs
system implementation (or even “old” system functions) is part of the swift
standard libraries, not something separate.

It is a fundamental misunderstanding to think that swift behavior can diverge
from “platform” behavior on our platforms: it fundamentally is the platform,
and as such the swift implementations of operations do not, and cannot,
change the ABI or observable behavior. It is also incorrect to think “I am calling
the system version of this, not the swift one”, and to assume that those are
necessarily not the same implementation, or that the implementation of those
functions is not coming from the swift standard library.

I am unclear on why you seem so adamant about adding new utf16 APIs when
no one who works with unicode believes that utf16 is a good answer to any
problem, most vendors of APIs that use utf16 regret those APIs, most language
standards that specified the use of utf16 as their string representation regret it,
new languages and APIs are in terms of utf8, unless there are specific platform
reasons that _require_ un-abstracted utf16/char16_t interfaces.

There is a large amount of C and C++ source code written for UTF-16 that won't be rewritten for UTF-8 any time soon. That strikes me as more than enough of a reason to support both UTF-8 and UTF-16.

Tom.


This is especially true for C/C++ (as opposed to languages like Java, JavaScript,
etc) where utf16 is not the standard non-8bit character encoding.

—Oliver