Date: Sat, 30 Aug 2025 00:19:57 -0700
> On Aug 29, 2025, at 10:43 PM, Jason McKesson via Std-Proposals <std-proposals_at_[hidden]> wrote:
>
> On Fri, Aug 29, 2025 at 9:28 PM Oliver Hunt via Std-Proposals
> <std-proposals_at_[hidden]> wrote:
>>
>>
>>
>> On Aug 26, 2025, at 6:30 PM, Thiago Macieira <thiago_at_[hidden]> wrote:
>>
>> Would you be able to find out how Swift implements collation? I don't know
>> where to begin the search. For Rust, it appears to be ICU4X [1] which is a
>> full reimplementation of ICU4C in Rust.
>>
>>
>> What do you mean by collation here? (What is the context?)
>
> Collation is a complex Unicode operation for doing sorting of strings
> (https://www.unicode.org/reports/tr10/).
>
> The question being asked is whether Swift's collation support does
> this operation natively on UTF-8 strings or if it internally converts
> them to UTF-16 and then does collation.
It is entirely UTF-8: they normalize to NFC first and then perform a lexical comparison on the normalized UTF-8 strings. I am not sure exactly what triggers the slow path (non-ASCII, possibly? Possible grapheme clusters? The ASCII check is obvious, but it seems plausible that comparisons can be faster if everything in the string is a single scalar [1]). The slow path starts at (Apache License v2.0 with Runtime Library Exception): https://github.com/swiftlang/swift/blob/main/stdlib/public/core/StringComparison.swift#L127
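To make the observable behaviour concrete, here is a minimal Swift sketch. The string literals and variable names are my own illustration and are not taken from the stdlib source; it only shows what the comparison operators report, not how StringComparison.swift implements them.

```swift
// Illustrative values only; these literals are not from the Swift stdlib source.
let precomposed = "caf\u{E9}"    // "é" as the single scalar U+00E9
let decomposed = "cafe\u{301}"   // "e" followed by U+0301 COMBINING ACUTE ACCENT

// The stored UTF-8 differs (5 code units vs 6) ...
let sameBytes = Array(precomposed.utf8) == Array(decomposed.utf8)
// ... but String equality is defined on canonical equivalence.
let sameString = precomposed == decomposed

// Ordering follows the normalized code units, not a locale-aware collation:
// "z" (0x7A) sorts before "é" (0xC3 0xA9 in UTF-8), and "Z" before "a".
let zBeforeEAcute = "z" < "\u{E9}"
let upperBeforeLower = "Z" < "a"

print(sameBytes, sameString, zBeforeEAcute, upperBeforeLower)
// prints "false true true true"
```

Note that a TR10-style collation would normally sort "é" before "z", so what the standard library does here is a canonical-equivalence-aware binary ordering rather than UCA collation. As far as I know, locale-aware ordering on Apple platforms goes through Foundation instead.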
It is also important to be aware that the Character type in Swift represents a complete extended grapheme cluster, not a byte, code unit, or scalar. E.g. “👨👩👧👦”.count is 1 (for the single family emoji joined with zero-width joiners), and iterating across the string will only see one character. Accessing anything below a grapheme cluster requires UTF8View, UTF16View, or UnicodeScalarView (someString.{utf8,utf16,unicodeScalars}), which let you enumerate the code units or the scalars themselves. These views work by enumerating the scalars in the UTF-8 string and then, if needed, returning the relevant code units for that scalar.
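A short Swift sketch of the different views, assuming the family emoji is the ZWJ sequence man + woman + girl + boy. I build it from explicit scalars so the joiners do not get lost in transit; the variable names are just for illustration.

```swift
// Built from explicit scalars so the zero-width joiners (U+200D) are visible.
let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"

let characters = family.count                 // extended grapheme clusters
let scalars = family.unicodeScalars.count     // 4 emoji + 3 joiners
let utf16Units = family.utf16.count           // 4 surrogate pairs + 3 joiners
let utf8Units = family.utf8.count             // 4 * 4 bytes + 3 * 3 bytes

print(characters, scalars, utf16Units, utf8Units)
// prints "1 7 11 25"
```

So the same String reports four different lengths depending on which view you ask, and only the default one is in terms of Character.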
Basically, Swift’s String type is _always_ UTF-8 *except* for NSString bridging, where it is bridging to an Objective-C object, and it seems like it may just leave the UTF-8 vs UTF-16 comparisons as a problem for Foundation to deal with.
No UTF-16 APIs have been added in many years unless there is some unavoidable reason (old API compatibility), and even then, where possible, such APIs are written in terms of the relevant types like NS- or CF-String, not char16_t buffers.
>
>> Separate from this (which I’m trying to find an answer to), you’ve made comments about everything using utf16/ucs2 referencing apple platforms. We consider every utf16/ucs2 api to be legacy and all new APIs for years at this point (including the filesystems) are expected to be utf8. UTF16 is not something we have any interest in adding anywhere new.
>>
>> (I did just look at the thread title again, and I’m still confused about how this went from floating point aliasing to utf16)
>
> It's because `char8_t` can alias with `char`, which led to a
> discussion on the validity of using `char8_t`, etc.
Right, but now we’re dealing with Unicode, which seems a step further towards insanity :D
—Oliver
[1] Almost all of my Unicode experience is in browser engines, where I had to deal with people using regexps to parse image data loaded into strings. I never had the benefit of even pretending I was dealing with remotely valid Unicode - WTF-8 exists for a reason :D
> --
> Std-Proposals mailing list
> Std-Proposals_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
Received on 2025-08-30 07:20:17