sg16: Re: [SG16-Unicode] SG16 Unicode related questions for Swift and WebKit representatives

From: Tom Honermann <tom_at_[hidden]>
Date: Fri, 3 Aug 2018 01:26:39 -0400

Thank you Michael and Dave! I appreciate the time and detail. All of
your answers look to confirm our expectations, so I interpret this as a
good sign we're thinking about the right things.

I added a few inline comments/clarifications below.

We had tentatively planned to meet Wednesday of next week, but it turns
out that two of our core SG16 members are going to be on vacation so, at
a minimum, I'd like to postpone. I'm also feeling pretty content with
the responses that we got from you and I think it would suffice for us
to just follow up with any remaining thoughts via email. While I'd love
for any of you to attend one (or more) of our meetings (any time), I
want to be sensitive to productive use of your time. So, how about we
play it by ear for now?

On 08/02/2018 05:18 PM, Dave Abrahams wrote:
>
>
>> On Aug 1, 2018, at 12:04 PM, Michael Ilseman <milseman_at_[hidden]
>> <mailto:milseman_at_[hidden]>> wrote:
>>
>> Hello, I am the current maintainer of Swift’s String, and can speak
>> to my thoughts on the status quo and future directions. Dave, who is
>> on this thread, is much more familiar with the history behind this
>> and can likely provide deeper insight into the reasoning.
>
> Michael has done very well here; I only have a few things to add.
>
>>
>>> On Jul 23, 2018, at 7:39 PM, Tom Honermann <tom_at_[hidden]
>>> <mailto:tom_at_[hidden]>> wrote:
>>>
>>> SG16 is seeking input from Swift and WebKit representatives to help
>>> inform our work towards enhancing support for Unicode in the C++
>>> standard. In particular, we recognize the significant amount of
>>> effort that went into the design of the Swift String type and would
>>> like to better understand the motivations that contributed to its
>>> current design and any pressures that might encourage further
>>> evolution or refinement; especially for any concerns that would be
>>> deemed significant enough to warrant backward incompatible changes.
>>> Though most of these questions specifically mention Swift, that is
>>> an artifact of our being more familiar with Swift than the internal
>>> workings of WebKit. Many of these questions would be applicable to
>>> any string type designed to support Unicode. We are therefore also
>>> interested in hearing about the string types used by WebKit, the
>>> motivations that guided their design, and the trade-offs that have
>>> been made. Of particular interest would be the results of design
>>> decisions that contrast with the design of Swift's String type.
>>> Thank you in advance for any time and expertise you are willing and
>>> able to share with us.
>>>> The Swift string manifesto is about 1 1/2 years old. What have you
>>>> learned since writing it? What would you change? What have you
>>>> changed?
>>
>> We haven’t really diverged from that manifesto. Some things are still
>> in progress, minor details were tweaked, but the core arguments are
>> still relevant.
>>
>>>>
>>>> Swift strings are extended grapheme cluster (EGC) based. What have
>>>> been the best and worst consequences of this choice?
>>
>> I’ll use “grapheme” casually to mean EGC. Swift’s Character type
>> represents a grapheme cluster, Unicode.Scalar represents a Unicode
>> scalar value (non-surrogate code point).
>>
>> Cocoa APIs are UTF-16 code unit oriented, and thus there’s always
>> caution (via documentation) about making sure such indices align to
>> grapheme boundaries. This is a frequent source of bugs, especially as
>> part of internationalization. By making Swift strings be
>> grapheme-based by default, developers first reach for the correct APIs.
>>
>> Another good consequence is that people picking up Swift and playing
>> with string, e.g. in a repl or Playground, see Swift’s notion of
>> characters align with what is displayed. This includes complex
>> multi-component emoji such as family emoji (👨‍👨‍👧‍👧), which is a
>> single Character composed of 7 Unicode.Scalars.
>>
>> This does have downsides. What is and is not a grapheme cluster
>> changes with each version of Unicode, and thus grapheme breaking is
>> inherently a run-time concern and can’t be checked at compile time.
>> Another is that while code units can be random-access, graphemes
>> cannot, which is confusing to developers used to UTF-16 code unit
>> access mostly working (until their users use non-BMP scalars or
>> emoji, that is).
>
> I'd say the biggest downside is that there are users who simply refuse
> to accept what we consider to be the fundamental non-random-access
> character of any efficient string representation. They are upset that
> they can't index a string directly with an integer, and can't be
> talked out of it. I still think we made the right decision in this
> regard; you'd have the same problem if your strings were
> unicode-scalar-based.

Are there common scenarios where programmers tend to be frustrated by
lack of random access? Perhaps most often when they are working with
inputs known to be ASCII only? Or is this mostly an education issue and
these programmers are having a difficult time accepting that they've
spent most of their career thus far writing bugs? :)
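
To make the code unit versus code point hazard concrete, here is a
minimal C++ sketch. It is only an illustration: decode_at and
nth_code_point are hypothetical helper names, not part of any proposal,
and no validation of ill-formed input is attempted. Indexing by UTF-16
code unit is O(1) but can land in the middle of a surrogate pair, while
finding the n-th code point needs a linear walk.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Decode the code point starting at code unit index i of a UTF-16 string.
// `len` receives the number of code units consumed (1 or 2).
// An unpaired surrogate is passed through unchanged (no validation here).
char32_t decode_at(std::u16string_view s, std::size_t i, std::size_t& len) {
    assert(i < s.size());
    const char32_t hi = s[i];
    if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < s.size()) {
        const char32_t lo = s[i + 1];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            len = 2;
            return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
        }
    }
    len = 1;
    return hi;
}

// Hypothetical helper: the n-th code point (0-based), found by a linear walk.
// Returns std::nullopt when the string has fewer than n + 1 code points.
std::optional<char32_t> nth_code_point(std::u16string_view s, std::size_t n) {
    std::size_t len = 0;
    for (std::size_t i = 0; i < s.size(); i += len) {
        const char32_t cp = decode_at(s, i, len);
        if (n == 0) return cp;
        --n;
    }
    return std::nullopt;
}
```

For the string u"a\U0001F600b", s[1] is the high surrogate 0xD83D
rather than a character, whereas nth_code_point(s, 1) yields U+1F600.
Grapheme clusters add a further segmentation layer on top of this walk.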

>
>> Furthermore, few existing specifications are phrased in terms of
>> grapheme clusters, so something like a validator wouldn’t want to run
>> on grapheme-segmented text, but at a lower abstraction level.
>>
>> Also, graphemes can be funky. A string containing only U+0301
>> (COMBINING ACUTE ACCENT) has one grapheme, but modifies the prior
>> grapheme upon concatenation. Such degenerate graphemes violate
>> algebraic reasoning in these corner cases.
>
> We are not aware of generic algorithms that rely on concatenation of
> collections conserving element counts, so we decided to simply
> document this quirk rather than saying that string is-not-a collection.

SG16 has previously discussed cases like this and I'm happy to hear you
haven't had to do anything special for it. This is a good example of
why we asked about inappropriate use of the String count property:
programmers assuming s1.count + s2.count == (s1 + s2).count.
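
A toy C++ sketch of why that assumption fails. This is deliberately not
UAX #29 grapheme breaking: it assumes that only the Combining
Diacritical Marks block (U+0300..U+036F) extends a cluster, and the
function names are hypothetical. It is enough to show the degenerate
case described above.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Assumption for illustration only: treat U+0300..U+036F as the complete
// set of cluster-extending code points. Real segmentation uses far more.
bool is_combining_mark(char32_t cp) { return cp >= 0x0300 && cp <= 0x036F; }

// Count clusters under the toy rule: a combining mark extends the previous
// cluster, except at the start of the string, where it stands alone.
std::size_t cluster_count(std::u32string_view s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (i == 0 || !is_combining_mark(s[i])) ++count;
    }
    return count;
}

// Usage: each operand has one cluster, and so does their concatenation.
void show_non_additive_count() {
    assert(cluster_count(U"e") == 1);
    assert(cluster_count(U"\u0301") == 1);   // degenerate, accent only
    assert(cluster_count(U"e\u0301") == 1);  // 1 + 1 != 1
}
```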

>
>> Unicode defines properties and most operations on scalars or code
>> points, and very little on top of graphemes.
>>
>>>> When porting code unit or code point based code to Swift strings
>>>> (e.g., when rewriting Objective-C code, or rewriting Swift code to
>>>> use String instead of NSString), has profiling revealed performance
>>>> regressions due to the switch to EGC based processing? If so, what
>>>> action was taken to correct it?
>>
>> We have many fast-paths in grapheme-breaking to identify common
>> situations surrounding single-scalar graphemes. If a developer wants
>> to work with Unicode at a lower level, String provides a UTF8View, a
>> UTF16View, and a UnicodeScalarView. Those views lazily
>> transcode/decode upon access.

Cool, it sounds like the answer to any such regressions was 1)
optimization in terms of fast-paths, and 2) fall back to code unit/point
processing otherwise.

>>
>> There are also performance concerns and annoyances when working with
>> ICU, but this is an implementation detail. If you’re interested in
>> using ICU, we can discuss further what has worked best for us.
>
> I think you're interested in (at least optionally) using ICU unless
> you have evidence of major investment in another open-source
> implementation of Unicode algorithms and tables. Otherwise, C++
> implementors could not afford to develop standard libraries.

Yes, definitely. For the foreseeable future, I think we need to ensure
that any interfaces we propose can be reasonably implemented using ICU.
However, Zach Laine has made impressive progress implementing many of
the Unicode algorithms without use of ICU in his proposed Boost.Text
library. See https://github.com/tzlaine/text and
https://tzlaine.github.io/text/doc/html/index.html.

>
>>
>>>>
>>>> Swift strings do not enforce storage in any particular Unicode
>>>> normalization form. Was consideration given to forcing storage in
>>>> a particular form such as FCC or NFC?
>>
>> Swift strings now sort with NFC (currently UTF-16 code unit order,
>> but likely to change to Unicode scalar value order). We didn’t find FCC
>> significantly more compelling in practice. Since NFC is far more
>> frequent in the wild (why waste space if you don’t have to), strings
>> are likely to already be in NFC. We have fast-paths to detect
>> on-the-fly normal sections of strings (e.g. all ASCII, all < U+0300,
>> NFC_QC=yes, etc.). We lazily normalize portions of string during
>> comparison when needed.
>>
>> As far as enforcing on creation, no. We do want to add an option to
>> perform a linear scan to set a performance flag, perhaps at creation,
>> so that comparison can take the memcmp-like fast-path.

Ok, my takeaway from this is that fast-pathing has been sufficient for
lazy normalization (when needed) to not be (much of) a performance
concern. At least, not enough to want to take the normalization cost on
every string construction up front.
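
As a rough C++ sketch of that shape (an assumption about structure, not
Swift's actual implementation): apply the "all < U+0300" check
mentioned above, compare code units directly when it passes, and
normalize only otherwise. The Normalizer callback is a placeholder for
whatever NFC implementation is available (ICU, for instance); all names
here are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <string>
#include <string_view>

// Fast-path test: every UTF-16 code unit is below U+0300, so the text
// contains no combining marks and is taken to already be in NFC.
bool below_first_combining_mark(std::u16string_view s) {
    return std::all_of(s.begin(), s.end(),
                       [](char16_t u) { return u < 0x0300; });
}

// Placeholder for a real NFC normalizer supplied by the caller.
using Normalizer = std::function<std::u16string(std::u16string_view)>;

// Canonical equality: direct code unit comparison on the fast path,
// normalization of both operands on the slow path.
bool canonically_equal(std::u16string_view a, std::u16string_view b,
                       const Normalizer& to_nfc) {
    if (below_first_combining_mark(a) && below_first_combining_mark(b)) {
        return a == b;  // memcmp-like comparison, no normalization needed
    }
    assert(to_nfc && "a normalizer is required for the slow path");
    return to_nfc(a) == to_nfc(b);
}
```

Comparing u"caf\u00E9" with itself never invokes the callback, while
comparing it with u"cafe\u0301" takes the slow path. A per-string
"known normalized" flag, as described above, would avoid even the
linear scan.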

>>
>>>> Swift strings support comparison via normalization. Has use of
>>>> canonical string equality been a performance issue? Or been a
>>>> source of surprise to programmers?
>>
>> This was a big performance issue on Linux, where we used to do
>> UCA+DUCET based comparisons. We switched to lexicographical order of
>> NFC-normalized UTF-16 code units (future: scalar values), and saw a
>> very significant speed up there. The remaining performance work
>> revolves around checking and tracking whether a string is known to
>> already be in a normal form, so we can just memcmp.

This is very helpful, thank you. We've suspected that full collation
(with or without tailoring) would be too expensive for use as a default
comparison operator, so it is good to hear that confirmed.

I'm curious why this was a larger performance issue for Linux than for
(presumably) macOS and/or iOS.

>>
>>>> Swift strings are not locale sensitive. Was any consideration
>>>> given to creation of a distinct locale sensitive string type?
>>
>> This is still up for debate and hasn’t been settled yet, but we think
>> it makes a lot of sense. If an array of strings is sorted, we
>> certainly don’t want a locale-change to violate programmer
>> invariants. A distinct type from string could avoid a lot of common
>> errors here, including forgetting to localize before presenting to a
>> user as part of a UI.
>>
>>>> Swift strings provide a count property as required to satisfy the
>>>> Collection protocol. How often do programmers use count (the
>>>> number of EGCs in the string) inappropriately?
>>
>> I’m not sure what would constitute inappropriate usage here. We do
>> not currently provide access to the underlying stored code units,
>> though this is a frequent request and we likely will in the future. I
>> haven’t seen anyone baking in the assumption that count is the same
>> for String and across all of String’s views (UTF-8, UTF-16, Unicode
>> scalars).
>
> One thing to consider is that as long as String is not random-access,
> count will be a worst-case O(N) operation. An inappropriate usage
> might involve computing the length once per loop iteration.

In addition to the above and prior mention of algebraic concerns, other
potential abuses we had in mind were using it to determine field widths
for display or code unit/point based storage.

C++ container requirements specify that .size() be O(1). For us to meet
container requirements would require computing and caching the count
during construction and mutation operations. We could potentially get
by just meeting range requirements though.
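
A minimal C++ sketch of the caching approach, under loud assumptions:
counted_text is a hypothetical type, and it counts code points in
well-formed UTF-8 as a stand-in for a real element count. Code point
counts are additive under concatenation, so append() can update the
cache from the appended text alone. An EGC count is not additive, so a
real implementation would also have to re-segment around the seam.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>

// Code points in well-formed UTF-8: count every byte that is not a
// continuation byte (bit pattern 10xxxxxx).
std::size_t count_code_points(std::string_view utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Hypothetical wrapper that pays the O(N) count at construction and
// mutation time so that size() is O(1).
class counted_text {
public:
    explicit counted_text(std::string utf8)
        : bytes_(std::move(utf8)), count_(count_code_points(bytes_)) {}

    void append(std::string_view utf8) {
        bytes_.append(utf8);
        count_ += count_code_points(utf8);  // valid for code points only
    }

    std::size_t size() const noexcept { return count_; }
    const std::string& bytes() const noexcept { return bytes_; }

private:
    std::string bytes_;
    std::size_t count_;
};

// Usage: size() reports code points, bytes().size() reports code units.
void show_cached_size() {
    counted_text t("h\xC3\xA9");  // "h" + U+00E9: 3 bytes, 2 code points
    t.append("\xE2\x82\xAC");     // U+20AC: 3 bytes, 1 code point
    assert(t.size() == 3);
    assert(t.bytes().size() == 6);
}
```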

>
>> I mentioned degenerate graphemes breaking algebraic properties of the
>> Collection protocol, but this hasn’t been a huge issue in practice so
>> far.
>>
>>>>
>>>> Swift strings support several memory unsafe initializers and
>>>> methods. How frequently are these used incorrectly?
>>
>> Many of these initializers come from NSString originally, and
>> developers migrating correct code to Swift maintain that correctness.
>> Rust has a similar situation, though they do validation at
>> creation-time and from_utf8_unchecked() voids memory-safety if the
>> contents are invalid.
>>
>>>> The Swift manifesto discussed three approaches to handling
>>>> substrings and Swift 4 changed from "same type, shared storage" to
>>>> "different type, shared storage". Any regrets?
>>
>> Having two types can be a bit of a pain, but we still think it was
>> the right thing to do. This is consistent with Swift treating slices
>> as a distinct type from the base collection.
>>
>>>>
>>>> How often do you find programmers doing work at the EGC level that
>>>> would be better performed at the code unit or code point level?
>>
>> Often, if a developer has strict requirements, they know what they’re
>> doing enough to operate at one of those lower levels.
>>
>> Not being able to random-access graphemes in a string is a common
>> source of frustration and confusion amongst new users.
>>
>>>> Likewise, how often do you find programmers working with
>>>> unicodeScalars, utf8, or utf16 views to do work better performed at
>>>> the EGC level? For what reasons does this occur? Perhaps to work
>>>> around differences in EGC boundaries across Unicode versions or the
>>>> underlying version of ICU in use?
>>
>> This was very prevalent in Swift’s early days. String wasn’t a
>> collection of graphemes by default prior to Swift 4,
>
> Well, it was. And then in Swift 2 or 3 it wasn't, due to the
> algebraic reasoning issue. Now it is again.
>
>> so without guidance many developers wrote code against the unicode
>> scalars view. We also didn’t have any fast-paths for common-case
>> situations back then, which further encouraged them to use one of the
>> other views.
>>
>> This is still done sometimes for performance-sensitive usage, or
>> someone wanting to handle Unicode themselves. However, as mentioned
>> previously, we don’t (yet) provide direct access to the actual storage.
>>
>> We haven’t seen much desire for reconciling behavior across Unicode
>> versions. This may be due to Swift being primarily an applications
>> level programming language for devices which only have one version of
>> Unicode that’s relevant (the current one).
>>
>>>> Has consideration been given to exposing Unicode character database
>>>> properties? CharacterSet exposes some of these properties, but have
>>>> more been requested?
>>
>> Yes, this was recently added to the language:
>> https://github.com/apple/swift-evolution/blob/master/proposals/0211-unicode-scalar-properties.md.
>> We surface much of the UCD via ICU.

Ah, nice. All kinds of fun to be had with that :)

>>
>>>> How firmly is the Swift string implementation tied to ICU? If the
>>>> C++ standard library were to add suitable Unicode support, what
>>>> would motivate reimplementing Swift strings on top of it?
>>
>> Swift’s tie to ICU is less firm than it used to be. We use ICU for
>> the following:
>>
>> 1. Grapheme breaking
>> 2. Normalization
>> 3. Accessing UCD properties
>> 4. Case conversion
>>
>> Each of these is not too tightly entwined with string; they’re
>> cordoned-off as a couple of shims called on fallback slow-paths.
>>
>> If the C++ standard library provided these operations, sufficiently
>> up-to-date with Unicode version and comparable to or better than ICU
>> in performance, we would be willing to switch. A big pain in
>> interacting with ICU is their limited support for UTF-8. Some users
>> would like to use a “lighter-weight” Swift and are unhappy at having
>> to link against ICU, as it’s fairly large and can complicate
>> security audits.

Got it. Increasing the size of the C++ standard library is a definite
concern for us as well. We imagine some C++ users would be similarly
unhappy if their standard library suddenly required linking against ICU.

>>
>>>> Do Swift programmers tend to prefer string interpolation or string
>>>> formatting functions?
>>
>> Users tend to prefer string interpolation. However, Swift currently
>> does not have much in the way of formatting control in
>> interpolations, and this is something we’re currently working on.
>>
>>>> What enhancements would you most like to see in C++ to improve
>>>> Unicode support?
>>
>> Swift’s string is perhaps geared as a higher-level construct than
>> what you may want for C++, and Swift has Cocoa-interoperability
>> concerns where everything is UTF-16. Rust might provide a closer
>> model to what you’re looking for:
>>
>> * Strings are a sequence of (valid) UTF-8 code units
>>     o Validation is done on creation
>>     o Invalid contents (e.g. Windows file paths) can be handled via
>>       something like WTF-8, which is not intended for interchange
>>
>> * String provides bidirectional iterators for:
>>     o Transcoded and/or normalized code units
>>     o Unicode scalar values (their “character” type)
>>     o Grapheme clusters
>>
>
> Michael, I think you're not answering the question asked. They are
> asking what Swift would want from C++, e.g., to allow us to decouple
> from ICU. Wouldn't we like to be able to do that?

This question was intended to ask you, as expert C++ programmers
independently from Swift, what additions to C++ you think would be most
helpful to improve our (very lacking) Unicode support. So, Michael's
response is on point (thank you; we'll take a closer look at Rust), as
are any comments regarding what would benefit Swift specifically.
Michael's earlier comments regarding what Swift currently uses ICU for
are suggestive of what Swift might want from C++. But I imagine the
form in which those features are provided would matter greatly; devils
and details.

Tom.

>
> -Dave
>
>

Received on 2018-08-03 07:26:44