sg16: Re: [SG16-Unicode] SG16 Unicode related questions for Swift and WebKit representatives

From: Michael Ilseman <milseman_at_[hidden]>
Date: Wed, 01 Aug 2018 12:04:03 -0700

Hello, I am the current maintainer of Swift’s String, and can speak to my thoughts on the status quo and future directions. Dave, who is on this thread, is much more familiar with the history behind this and can likely provide deeper insight into the reasoning.

> On Jul 23, 2018, at 7:39 PM, Tom Honermann <tom_at_[hidden]> wrote:
>
> SG16 is seeking input from Swift and WebKit representatives to help inform our work towards enhancing support for Unicode in the C++ standard. In particular, we recognize the significant amount of effort that went into the design of the Swift String type and would like to better understand the motivations that contributed to its current design and any pressures that might encourage further evolution or refinement; especially for any concerns that would be deemed significant enough to warrant backward incompatible changes.
> Though most of these questions specifically mention Swift, that is an artifact of our being more familiar with Swift than the internal workings of WebKit. Many of these questions would be applicable to any string type designed to support Unicode. We are therefore also interested in hearing about the string types used by WebKit, the motivations that guided their design, and the trade-offs that have been made. Of particular interest would be the results of design decisions that contrast with the design of Swift's String type.
> Thank you in advance for any time and expertise you are willing and able to share with us.
>> The Swift string manifesto is about 1 1/2 years old. What have you learned since writing it? What would you change? What have you changed?

We haven’t really diverged from that manifesto. Some things are still in progress, minor details were tweaked, but the core arguments are still relevant.

>>
>> Swift strings are extended grapheme cluster (EGC) based. What have been the best and worst consequences of this choice?

I’ll use “grapheme” casually to mean EGC. Swift’s Character type represents a grapheme cluster, and Unicode.Scalar represents a Unicode scalar value (a non-surrogate code point).

Cocoa APIs are UTF-16 code unit oriented, and thus there’s always caution (via documentation) about making sure such indices align to grapheme boundaries. This is a frequent source of bugs, especially as part of internationalization. By making Swift strings be grapheme-based by default, developers first reach for the correct APIs.

Another good consequence is that people picking up Swift and playing with strings, e.g. in a REPL or Playground, see Swift’s notion of characters align with what is displayed. This includes complex multi-component emoji such as family emoji (👨‍👨‍👧‍👧), which is a single Character composed of 7 Unicode.Scalars.
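As a minimal sketch of the point above, the family emoji can be spelled out scalar by scalar (four person emoji joined by three ZERO WIDTH JOINERs). The variable name is illustrative, and the grapheme count assumes a Swift runtime whose grapheme breaking understands emoji ZWJ sequences.

```swift
// Family emoji written as explicit scalars so the ZWJ sequence is unambiguous:
// MAN, ZWJ, MAN, ZWJ, GIRL, ZWJ, GIRL.
let family = "\u{1F468}\u{200D}\u{1F468}\u{200D}\u{1F467}\u{200D}\u{1F467}"

// String is a collection of Characters (extended grapheme clusters).
print(family.count)                 // 1

// The same text seen as Unicode scalar values.
print(family.unicodeScalars.count)  // 7
```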

This does have downsides. What is and is not a grapheme cluster changes with each version of Unicode, and thus grapheme breaking is inherently a run-time concern and can’t be checked at compile time. Another is that while code units can be random-access, graphemes cannot, which is confusing to developers used to UTF-16 code unit access mostly working (that is, until their users use non-BMP scalars or emoji). Furthermore, few existing specifications are phrased in terms of grapheme clusters, so something like a validator wouldn’t want to run on grapheme-segmented text, but at a lower abstraction level.

Also, graphemes can be funky. A string containing only U+0301 (COMBINING ACUTE ACCENT) has one grapheme, but that grapheme modifies the prior grapheme upon concatenation. Such degenerate graphemes violate algebraic reasoning in these corner cases. Unicode defines properties and most operations on scalars or code points, and very little on top of graphemes.
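A small sketch of the degenerate case described above, using illustrative variable names: the counts of the two pieces do not add up to the count of their concatenation.

```swift
let base = "e"
let accent = "\u{301}"   // COMBINING ACUTE ACCENT on its own

// Each piece is one grapheme when measured separately.
print(base.count)               // 1
print(accent.count)             // 1

// After concatenation the accent attaches to the preceding "e",
// so the result is a single grapheme rather than two.
let combined = base + accent
print(combined.count)           // 1
```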

>> When porting code unit or code point based code to Swift strings (e.g., when rewriting Objective-C code, or rewriting Swift code to use String instead of NSString), has profiling revealed performance regressions due to the switch to EGC based processing? If so, what action was taken to correct it?

We have many fast-paths in grapheme-breaking to identify common situations surrounding single-scalar graphemes. If a developer wants to work with Unicode at a lower level, String provides a UTF8View, a UTF16View, and a UnicodeScalarView. Those views lazily transcode/decode upon access.
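For illustration, the three views can be compared on one short string. This is only a sketch; the sample string and variable names are assumptions made for the example.

```swift
// "café" spelled with a combining accent: c, a, f, e, U+0301.
let word = "cafe\u{301}"

// Default view: Characters (graphemes).
print(word.count)                            // 4

// Lower-level views over the same contents.
let utf8Units = Array(word.utf8)             // UTF-8 code units
let utf16Units = Array(word.utf16)           // UTF-16 code units
let scalarValues = word.unicodeScalars.map { $0.value }

print(utf8Units.count)                       // 6
print(utf16Units.count)                      // 5
print(scalarValues.count)                    // 5
```

The counts differ because U+0301 takes two UTF-8 code units but one UTF-16 code unit, and it merges with the preceding "e" at the grapheme level.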

There are also performance concerns and annoyances when working with ICU, but this is an implementation detail. If you’re interested in using ICU, we can discuss further what has worked best for us.

>>
>> Swift strings do not enforce storage in any particular Unicode normalization form. Was consideration given to forcing storage in a particular form such as FCC or NFC?

Swift strings now sort with NFC (currently in UTF-16 code unit order, but this is likely to change to Unicode scalar value order). We didn’t find FCC significantly more compelling in practice. Since NFC is far more frequent in the wild (why waste space if you don’t have to), strings are likely to already be in NFC. We have fast-paths to detect, on the fly, sections of strings that are already normalized (e.g. all ASCII, all < U+0300, NFC_QC=yes, etc.). We lazily normalize portions of a string during comparison when needed.

As far as enforcing on creation, no. We do want to add an option to perform a linear scan to set a performance flag, perhaps at creation, so that comparison can take the memcmp-like fast-path.

>> Swift strings support comparison via normalization. Has use of canonical string equality been a performance issue? Or been a source of surprise to programmers?

This was a big performance issue on Linux, where we used to do UCA+DUCET based comparisons. We switched to lexicographical order of NFC-normalized UTF-16 code units (future: scalar values), and saw a very significant speed up there. The remaining performance work revolves around checking and tracking whether a string is known to already be in a normal form, so we can just memcmp.
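To make canonical equality concrete, here is a minimal sketch comparing a precomposed and a decomposed spelling of the same text. The variable names are illustrative.

```swift
let precomposed = "caf\u{E9}"     // é as a single scalar, U+00E9
let decomposed = "cafe\u{301}"    // e followed by U+0301

// String equality is canonical equivalence, not a comparison of stored units.
print(precomposed == decomposed)                          // true

// The underlying scalar sequences are still different.
print(precomposed.unicodeScalars.elementsEqual(decomposed.unicodeScalars))  // false

// Hashing agrees with ==, so both spellings collapse to one Set element.
let unique = Set([precomposed, decomposed])
print(unique.count)                                       // 1
```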

>> Swift strings are not locale sensitive. Was any consideration given to creation of a distinct locale sensitive string type?

This is still up for debate and hasn’t been settled yet, but we think it makes a lot of sense. If an array of strings is sorted, we certainly don’t want a locale-change to violate programmer invariants. A distinct type from string could avoid a lot of common errors here, including forgetting to localize before presenting to a user as part of a UI.

>> Swift strings provide a count property as required to satisfy the Collection protocol. How often do programmers use count (the number of EGCs in the string) inappropriately?

I’m not sure what would constitute inappropriate usage here. We do not currently provide access to the underlying stored code units, though this is a frequent request and we likely will in the future. I haven’t seen anyone baking in the assumption that count is the same for String and across all of String’s views (UTF-8, UTF-16, Unicode scalars).

I mentioned degenerate graphemes breaking algebraic properties of the Collection protocol, but this hasn’t been a huge issue in practice so far.

>>
>> Swift strings support several memory unsafe initializers and methods. How frequently are these used incorrectly?

Many of these initializers come from NSString originally, and developers migrating correct code to Swift maintain that correctness. Rust has a similar situation, though they do validation at creation-time and from_utf8_unchecked() voids memory-safety if the contents are invalid.

>> The Swift manifesto discussed three approaches to handling substrings and Swift 4 changed from "same type, shared storage" to "different type, shared storage". Any regrets?

Having two types can be a bit of a pain, but we still think it was the right thing to do. This is consistent with Swift treating slices as a distinct type from the base collection.
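A brief sketch of what "different type, shared storage" looks like in use. The names are illustrative; the explicit String(...) conversion is where a slice becomes an independent string.

```swift
let greeting = "Hello, world"

// Slicing a String yields a Substring, not a String.
let head = greeting.prefix(5)
print(type(of: head) == Substring.self)   // true

// Converting back to String is explicit and produces an independent value.
let owned = String(head)
print(owned)                              // Hello
```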

>>
>> How often do you find programmers doing work at the EGC level that would be better performed at the code unit or code point level?

Often, if a developer has strict requirements, they know what they’re doing enough to operate at one of those lower levels.

Not being able to random-access graphemes in a string is a common source of frustration and confusion amongst new users.
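The following sketch shows the pattern that tends to surprise newcomers: there is no integer subscript on String, so reaching the n-th grapheme means walking an index from the start. The sample string and names are assumptions for the example.

```swift
// "naïve" spelled with a combining diaeresis after the "i".
let word = "nai\u{308}ve"

// word[2] does not compile: String is not subscriptable by Int.
// Instead, advance a String.Index, which walks grapheme boundaries.
let thirdIndex = word.index(word.startIndex, offsetBy: 2)
let third = word[thirdIndex]

print(word.count)                  // 5
print(third.unicodeScalars.count)  // 2, since the Character spans "i" + U+0308
```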

>> Likewise, how often do you find programmers working with unicodeScalars, utf8, or utf16 views to do work better performed at the EGC level? For what reasons does this occur? Perhaps to work around differences in EGC boundaries across Unicode versions or the underlying version of ICU in use?

This was very prevalent in Swift’s early days. String wasn’t a collection of graphemes by default prior to Swift 4, so without guidance many developers wrote code against the unicode scalars view. We also didn’t have any fast-paths for common-case situations back then, which further encouraged them to use one of the other views.

This is still done sometimes for performance-sensitive usage, or someone wanting to handle Unicode themselves. However, as mentioned previously, we don’t (yet) provide direct access to the actual storage.

We haven’t seen much desire for reconciling behavior across Unicode versions. This may be due to Swift being primarily an application-level programming language for devices which only have one version of Unicode that’s relevant (the current one).

>> Has consideration been given to exposing Unicode character database properties? CharacterSet exposes some of these properties, but have more been requested?

Yes, this was recently added to the language: https://github.com/apple/swift-evolution/blob/master/proposals/0211-unicode-scalar-properties.md. We surface much of the UCD via ICU.
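As a rough sketch of the API from that proposal, assuming a toolchain that ships it (Swift 5 or later), a few properties can be queried through Unicode.Scalar.properties. The scalars chosen here are only examples.

```swift
let acute: Unicode.Scalar = "\u{301}"
let props = acute.properties

// The character name from the UCD (optional, since some scalars have none).
print(props.name ?? "<unnamed>")

// General category and canonical combining class of the combining accent.
print(props.generalCategory == .nonspacingMark)
print(props.canonicalCombiningClass.rawValue)

// Boolean properties are exposed as plain Bools.
let letterA: Unicode.Scalar = "a"
print(letterA.properties.isAlphabetic)
```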

>> How firmly is the Swift string implementation tied to ICU? If the C++ standard library were to add suitable Unicode support, what would motivate reimplementing Swift strings on top of it?

Swift’s tie to ICU is less firm than it used to be. We use ICU for the following:

1. Grapheme breaking
2. Normalization
3. Accessing UCD properties
4. Case conversion

None of these is too tightly entwined with string; they’re cordoned-off as a couple of shims called on fallback slow-paths.

If the C++ standard library provided these operations, sufficiently up-to-date with Unicode version and comparable to or better than ICU in performance, we would be willing to switch. A big pain in interacting with ICU is their limited support for UTF-8. Some users would like to use a “lighter-weight” Swift and are unhappy at having to link against ICU, as it’s fairly large, and it can complicate security audits.

>> Do Swift programmers tend to prefer string interpolation or string formatting functions?

Users tend to prefer string interpolation. However, Swift currently does not have much in the way of formatting control in interpolations, and this is something we’re currently working on.
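For comparison, here is a small sketch of the two styles. It assumes Foundation is available for String(format:); the variable names are illustrative.

```swift
import Foundation

let ratio = 2.0 / 3.0

// Interpolation: concise, but uses the default description of the value.
let interpolated = "ratio: \(ratio)"

// Format string via Foundation: explicit control over precision.
let formatted = String(format: "ratio: %.2f", ratio)

print(interpolated)
print(formatted)     // ratio: 0.67
```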

>> What enhancements would you most like to see in C++ to improve Unicode support?

Swift’s string is perhaps geared as a higher-level construct than what you may want for C++, and Swift has Cocoa-interoperability concerns where everything is UTF-16. Rust might provide a closer model to what you’re looking for:

- Strings are a sequence of (valid) UTF-8 code units
  - Validation is done on creation
  - Invalid contents (e.g. Windows file paths) can be handled via something like WTF-8, which is not intended for interchange
- String provides bidirectional iterators for:
  - Transcoded and/or normalized code units
  - Unicode scalar values (their “character” type)
  - Grapheme clusters

> These questions were culled from various internal SG16 discussions. Special thanks to JeanHeyd Meneide, Mark Zeren, and Thiago Macieira for their contributions to crafting this list.
>
> Tom.

Received on 2018-08-01 21:22:23