sg16: Re: [SG16-Unicode] SG16 Unicode related questions for Swift and WebKit representatives

From: Tom Honermann <tom_at_[hidden]>
Date: Mon, 6 Aug 2018 22:14:29 -0400

On 08/03/2018 12:53 PM, Michael Ilseman wrote:
>
>
>> On Aug 2, 2018, at 10:26 PM, Tom Honermann <tom_at_[hidden]
>> <mailto:tom_at_[hidden]>> wrote:
>>
>> Thank you Michael and Dave! I appreciate the time and detail. All
>> of your answers look to confirm our expectations, so I interpret this
>> as a good sign we're thinking about the right things.
>>
>> I added a few inline comments/clarifications below.
>>
>> We had tentatively planned to meet Wednesday of next week, but it
>> turns out that two of our core SG16 members are going to be on
>> vacation so, at a minimum, I'd like to postpone. I'm also feeling
>> pretty content with the responses that we got from you and I think it
>> would suffice for us to just follow up with any remaining thoughts
>> via email. While I'd love for any of you to attend one (or more) of
>> our meetings (any time), I want to be sensitive to productive use of
>> your time. So, how about we play it by ear for now?
>>
>
> I’d be happy to meet up sometime. JF mentioned an in-person meeting
> sometime this fall. Feel free to grab me whenever you think I can add
> value.
>
>> On 08/02/2018 05:18 PM, Dave Abrahams wrote:
>>>
>>>
>>>> On Aug 1, 2018, at 12:04 PM, Michael Ilseman <milseman_at_[hidden]
>>>> <mailto:milseman_at_[hidden]>> wrote:
>>>>
>>>> Hello, I am the current maintainer of Swift’s String, and can speak
>>>> to my thoughts on the status quo and future directions. Dave, who
>>>> is on this thread, is much more familiar with the history behind
>>>> this and can likely provide deeper insight into the reasoning.
>>>
>>> Michael has done very well here; I only have a few things to add.
>>>
>>>>
>>>>> On Jul 23, 2018, at 7:39 PM, Tom Honermann <tom_at_[hidden]
>>>>> <mailto:tom_at_[hidden]>> wrote:
>>>>>
>>>>> SG16 is seeking input from Swift and WebKit representatives to
>>>>> help inform our work towards enhancing support for Unicode in the
>>>>> C++ standard. In particular, we recognize the significant amount
>>>>> of effort that went into the design of the Swift String type and
>>>>> would like to better understand the motivations that contributed
>>>>> to its current design and any pressures that might encourage
>>>>> further evolution or refinement; especially for any concerns that
>>>>> would be deemed significant enough to warrant backward
>>>>> incompatible changes.
>>>>> Though most of these questions specifically mention Swift, that is
>>>>> an artifact of our being more familiar with Swift than the
>>>>> internal workings of WebKit. Many of these questions would be
>>>>> applicable to any string type designed to support Unicode. We are
>>>>> therefore also interested in hearing about the string types used
>>>>> by WebKit, the motivations that guided their design, and the trade
>>>>> offs that have been made. Of particular interest would be the
>>>>> results of design decisions that contrast with the design of
>>>>> Swift's String type.
>>>>> Thank you in advance for any time and expertise you are willing
>>>>> and able to share with us.
>>>>>> The Swift string manifesto is about 1 1/2 years old. What have
>>>>>> you learned since writing it? What would you change? What have
>>>>>> you changed?
>>>>
>>>> We haven’t really diverged from that manifesto. Some things are
>>>> still in progress, minor details were tweaked, but the core
>>>> arguments are still relevant.
>>>>
>>>>>>
>>>>>> Swift strings are extended grapheme cluster (EGC) based. What
>>>>>> have been the best and worst consequences of this choice?
>>>>
>>>> I’ll use “grapheme” casually to mean EGC. Swift’s Character type
>>>> represents a grapheme cluster, Unicode.Scalar represents a Unicode
>>>> scalar value (non-surrogate code point).
>>>>
>>>> Cocoa APIs are UTF-16 code unit oriented, and thus there’s always
>>>> caution (via documentation) about making sure such indices align to
>>>> grapheme boundaries. This is a frequent source of bugs, especially
>>>> as part of internationalization. By making Swift strings be
>>>> grapheme-based by default, developers first reach for the correct APIs.
>>>>
>>>> Another good consequence is that people picking up Swift and
>>>> playing with string, e.g. in a repl or Playground, see Swift’s
>>>> notion of characters align with what is displayed. This includes
>>>> complex multi-component emoji such as family emoji (👨‍👨‍👧‍👧),
>>>> which is a single Character composed of 7 Unicode.Scalars.
>>>>
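As a minimal sketch of the family emoji point above: the emoji is written out with explicit scalar escapes so that the seven scalars are visible. The variable name is illustrative, and a toolchain whose grapheme breaking understands emoji ZWJ sequences (assumed here: Swift 4 or later) is required for the Character count shown.

```swift
// Four person scalars joined by three ZERO WIDTH JOINERs (U+200D).
let family = "\u{1F468}\u{200D}\u{1F468}\u{200D}\u{1F467}\u{200D}\u{1F467}"

print(family.count)                 // number of Characters (graphemes)
print(family.unicodeScalars.count)  // number of Unicode scalar values
```

The default view reports one Character, while the scalar view still exposes all seven components for code that needs them.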
>>>> This does have downsides. What is and is not a grapheme cluster
>>>> changes with each version of Unicode, and thus grapheme breaking is
>>>> inherently a run-time concern and can’t be checked at compile time.
>>>> Another is that while code units can be random-access, graphemes
>>>> cannot, which is confusing to developers used to UTF-16 code unit
>>>> access mostly working (until their users use non-BMP scalars or
>>>> emoji, that is).
>>>
>>> I'd say the biggest downside is that there are users who simply
>>> refuse to accept what we consider to be the fundamental
>>> non-random-access character of any efficient string representation.
>>> They are upset that they can't index a string directly with an
>>> integer, and can't be talked out of it. I still think we made the
>>> right decision in this regard; you'd have the same problem if your
>>> strings were unicode-scalar-based.
>>
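To illustrate what "can't index a string directly with an integer" looks like in practice, here is a small sketch. The sample string and names are made up for illustration; the only API assumed is the standard `index(_:offsetBy:)` on String.

```swift
let greeting = "Cafe\u{301} au lait"   // "e" followed by COMBINING ACUTE ACCENT

// greeting[3] does not compile: String has no Int subscript.
// Walking from startIndex makes the linear cost explicit.
let position = greeting.index(greeting.startIndex, offsetBy: 3)
let fourth = greeting[position]

print(fourth)                       // the fourth Character
print(fourth.unicodeScalars.count)  // scalars inside that one Character
```

The offset is counted in Characters, so the accented letter is returned whole even though it occupies two scalars.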
>> Are there common scenarios where programmers tend to be frustrated by
>> lack of random access? Perhaps most often when they are working with
>> inputs known to be ASCII only? Or is this mostly an education issue
>> and these programmers are having a difficult time accepting that
>> they've spent most of their career thus far writing bugs? :)
>>
>
> A lot of it is shaped by expectations coming from other languages,
> whose programming models do not prioritize operating on Unicode scalar
> values, let alone grapheme clusters. Objective-C’s default interface
> with Strings is random-access to UTF-16 code units, which “works”
> right up until you encounter an emoji or other scalar not on the BMP.
> It also “works” for graphemes right up until you encounter emoji or a
> language you didn’t test or non-NFC-normalized content in a
> language you did test.
>
> This gets compounded by the prevalence of strings in teaching,
> interviews, programming puzzles, etc., where a string is treated like
> an array with a more visual representation.
>
> Also note that even for fully ASCII strings we cannot provide random
> access to grapheme clusters, as “\r\n” is a single grapheme cluster.
> For pretty much every Unicode-correct operation we provide fast-paths
> for, there are nasty corner cases that complicate the model.

Thanks, I had not considered the "\r\n" case. Alas, there are no easy
cases.
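
A short sketch of the "\r\n" case, assuming nothing beyond the standard String views (the variable names are illustrative):

```swift
let crlf = "\r\n"

print(crlf.utf8.count)            // UTF-8 code units
print(crlf.unicodeScalars.count)  // Unicode scalar values
print(crlf.count)                 // Characters (graphemes)

// An all-ASCII string can therefore have fewer Characters than bytes.
let lines = "a\r\nb"
print(lines.utf8.count, lines.count)
```

Because the byte count and the Character count disagree even here, an ASCII-only check alone is not enough to offer random access to graphemes.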

>
>>>
>>>> Furthermore, few existing specifications are phrased in terms of
>>>> grapheme clusters, so something like a validator wouldn’t want to
>>>> run on grapheme-segmented text, but a lower abstraction level.
>>>>
>>>> Also, graphemes can be funky. A string containing only U+0301
>>>> (COMBINING ACUTE ACCENT) has one grapheme, but modifies the prior
>>>> grapheme upon concatenation. Such degenerate graphemes violate
>>>> algebraic reasoning in these corner cases.
>>>
>>> We are not aware of generic algorithms that rely on concatenation of
>>> collections conserving element counts, so we decided to simply
>>> document this quirk rather than saying that string is-not-a collection.
>>
>> SG16 has previously discussed cases like this and I'm happy to hear
>> you haven't had to do anything special for it. This is a good
>> example of why we asked about inappropriate use of the String count
>> property: programmers assuming s1.count + s2.count ==
>> s1.append(s2).count.
>>
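The count assumption can be checked with a minimal sketch. The names are illustrative, and the concatenation uses `+` since Swift's `append` mutates in place rather than returning a new string.

```swift
let base = "e"
let accent = "\u{301}"            // COMBINING ACUTE ACCENT on its own
let combined = base + accent

print(base.count, accent.count)   // counts of the two operands
print(combined.count)             // count after concatenation
```

Each operand reports one Character, yet the concatenation also reports one, so counts are not additive when the right-hand side starts with a combining mark.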
>>>
>>>> Unicode defines properties and most operations on scalars or code
>>>> points, and very little on top of graphemes.
>>>>
>>>>>> When porting code unit or code point based code to Swift strings
>>>>>> (e.g., when rewriting Objective-C code, or rewriting Swift code
>>>>>> to use String instead of NSString), has profiling revealed
>>>>>> performance regressions due to the switch to EGC based
>>>>>> processing? If so, what action was taken to correct it?
>>>>
>>>> We have many fast-paths in grapheme-breaking to identify common
>>>> situations surrounding single-scalar graphemes. If a developer
>>>> wants to work with Unicode at a lower level, String provides a
>>>> UTF8View, a UTF16View, and a UnicodeScalarView. Those views lazily
>>>> transcode/decode upon access.
>>
>> Cool, it sounds like the answer to any such regressions was 1)
>> optimization in terms of fast-paths, and 2) fall back to code
>> unit/point processing otherwise.
>>
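As a rough sketch of dropping to a lower level through the views mentioned above (the sample word is an assumption chosen for illustration):

```swift
let word = "caf\u{E9}"            // precomposed é, U+00E9

print(word.count)                             // Characters
print(Array(word.utf8))                       // UTF-8 code units
print(Array(word.utf16))                      // UTF-16 code units
print(word.unicodeScalars.map { $0.value })   // scalar values
```

The same string yields five UTF-8 code units but four UTF-16 code units and four scalars, which is why code working at these levels has to pick its view deliberately.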
>>>>
>>>> There are also performance concerns and annoyances when working
>>>> with ICU, but this is an implementation detail. If you’re
>>>> interested in using ICU, we can discuss further what has worked
>>>> best for us.
>>>
>>> I think you're interested in (at least optionally) using ICU unless
>>> you have evidence of major investment in another open-source
>>> implementation of Unicode algorithms and tables. Otherwise, C++
>>> implementors could not afford to develop standard libraries.
>>
>> Yes, definitely. For the foreseeable future, I think we need to
>> ensure that any interfaces we propose can be reasonably implemented
>> using ICU. However, Zach Laine has made impressive progress
>> implementing many of the Unicode algorithms without use of ICU in his
>> proposed Boost.Text library. See https://github.com/tzlaine/text and
>> https://tzlaine.github.io/text/doc/html/index.html.
>>
>>>
>>>>
>>>>>>
>>>>>> Swift strings do not enforce storage in any particular Unicode
>>>>>> normalization form. Was consideration given to forcing storage in
>>>>>> a particular form such as FCC or NFC?
>>>>
>>>> Swift strings now sort with NFC (currently UTF-16 code unit order,
>>>> but likely to change to Unicode scalar value order). We didn’t find
>>>> FCC significantly more compelling in practice. Since NFC is far
>>>> more frequent in the wild (why waste space if you don’t have to),
>>>> strings are likely to already be in NFC. We have fast-paths to
>>>> detect on-the-fly normal sections of strings (e.g. all ASCII, all <
>>>> U+0300, NFC_QC=yes, etc.). We lazily normalize portions of string
>>>> during comparison when needed.
>>>>
>>>> As far as enforcing on creation, no. We do want to add an option to
>>>> perform a linear scan to set a performance flag, perhaps at
>>>> creation, so that comparison can take the memcmp-like fast-path.
>>
>> Ok, my takeaway from this is that fast-pathing has been sufficient
>> for lazy normalization (when needed) to not be (much of) a
>> performance concern. At least, not enough to want to take the
>> normalization cost on every string construction up front.
>>
>>>>
>>>>>> Swift strings support comparison via normalization. Has use of
>>>>>> canonical string equality been a performance issue? Or been a
>>>>>> source of surprise to programmers?
>>>>
>>>> This was a big performance issue on Linux, where we used to do
>>>> UCA+DUCET based comparisons. We switched to lexicographical order of
>>>> NFC-normalized UTF-16 code units (future: scalar values), and saw a
>>>> very significant speed up there. The remaining performance work
>>>> revolves around checking and tracking whether a string is known to
>>>> already be in a normal form, so we can just memcmp.
>>
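A small sketch of what comparison via normalization means for users, assuming only standard String equality and hashing (the two spellings of the word are chosen for illustration):

```swift
let precomposed = "caf\u{E9}"      // é as a single scalar (NFC form)
let decomposed  = "cafe\u{301}"    // e followed by a combining accent (NFD form)

print(precomposed == decomposed)   // String equality
print(precomposed.unicodeScalars.elementsEqual(decomposed.unicodeScalars))
print(Set([precomposed, decomposed]).count)
```

The strings compare equal and hash alike even though their scalar sequences differ, which is the behavior the fast-paths for already-normalized text are meant to make cheap.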
>> This is very helpful, thank you. We've suspected that full collation
>> (with or without tailoring) would be too expensive for use as a
>> default comparison operator, so it is good to hear that confirmed.
>>
>> I'm curious why this was a larger performance issue for Linux than
>> for (presumably) macOS and/or iOS.
>>
>
> There were two main factors. The first is that on Darwin platforms,
> CFString had an implementation that we used instead of UCA+DUCET which
> was faster. The second is that Darwin platforms are typically
> up-to-date and have very recent versions of ICU. On Linux, we still
> support Ubuntu LTS 14.04 which has a version of ICU which predates
> Swift and didn’t have any fast-paths for ASCII or mostly-ASCII text.
>
> Switching to our own implementation based on NFC gave us many X
> improvement over CFString, which in turn was many X faster than
> UCA+DUCET (especially on older versions of ICU).

Thanks. My takeaway is that implementation quality matters; those fast
paths are important.

>
>>>>
>>>>>> Swift strings are not locale sensitive. Was any consideration
>>>>>> given to creation of a distinct locale sensitive string type?
>>>>
>>>> This is still up for debate and hasn’t been settled yet, but we
>>>> think it makes a lot of sense. If an array of strings is sorted, we
>>>> certainly don’t want a locale-change to violate programmer
>>>> invariants. A distinct type from string could avoid a lot of common
>>>> errors here, including forgetting to localize before presenting to
>>>> a user as part of a UI.
>>>>
>>>>>> Swift strings provide a count property as required to satisfy the
>>>>>> Collection protocol. How often do programmers use count (the
>>>>>> number of EGCs in the string) inappropriately?
>>>>
>>>> I’m not sure what would constitute inappropriate usage here. We do
>>>> not currently provide access to the underlying stored code units,
>>>> though this is a frequent request and we likely will in the future.
>>>> I haven’t seen anyone baking in the assumption that count is the
>>>> same for String and across all of String’s views (UTF-8, UTF-16,
>>>> Unicode scalars).
>>>
>>> One thing to consider is that as long as String is not
>>> random-access, count will be a worst-case O(N) operation. An
>>> inappropriate usage might involve computing the length once per loop
>>> iteration.
>>
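The once-per-iteration concern can be sketched as follows. This is only an illustration of hoisting the count, with made-up names and a made-up input, not a benchmark.

```swift
let text = String(repeating: "e\u{301}", count: 1_000)

// Avoid: `while i < text.count` may re-walk the string on every iteration.
// Prefer: compute the count once and reuse it.
let length = text.count
var visited = 0
for _ in 0..<length {
    visited += 1
}
print(visited)

// When the goal is simply to visit every Character, iterating the
// string directly avoids counting altogether.
var scalarTotal = 0
for character in text {
    scalarTotal += character.unicodeScalars.count
}
print(scalarTotal)
```

Direct iteration is usually the simpler choice, since it never needs the count in the first place.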
>> In addition to the above and prior mention of algebraic concerns,
>> other potential abuses we had in mind were using it to determine
>> field widths for display or code unit/point based storage.
>>
>
> Display width is a whole other concern accounting for rendering
> environment, font, etc. I don’t have expertise here.
>
>> C++ container requirements specify that .size() be O(1). For us to
>> meet container requirements would require computing and caching the
>> count during construction and mutation operations. We could
>> potentially get by just meeting range requirements though.
>>
>>>
>>>> I mentioned degenerate graphemes breaking algebraic properties of
>>>> the Collection protocol, but this hasn’t been a huge issue in
>>>> practice so far.
>>>>
>>>>>>
>>>>>> Swift strings support several memory unsafe initializers and
>>>>>> methods. How frequently are these used incorrectly?
>>>>
>>>> Many of these initializers come from NSString originally, and
>>>> developers migrating correct code to Swift maintain that
>>>> correctness. Rust has a similar situation, though they do
>>>> validation at creation-time and from_utf8_unchecked() voids
>>>> memory-safety if the contents are invalid.
>>>>
>>>>>> The Swift manifesto discussed three approaches to handling
>>>>>> substrings and Swift 4 changed from "same type, shared storage"
>>>>>> to "different type, shared storage". Any regrets?
>>>>
>>>> Having two types can be a bit of a pain, but we still think it was
>>>> the right thing to do. This is consistent with Swift treating
>>>> slices as a distinct type from the base collection.
>>>>
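A brief sketch of the two-type arrangement, assuming the standard `prefix(_:)` slicing call (the sample sentence and names are illustrative):

```swift
let sentence = "hello world"
let firstWord = sentence.prefix(5)   // a Substring, not a String
print(type(of: firstWord))

// Converting back to String is explicit, which is the usual step before
// keeping the result around independently of the original.
let owned = String(firstWord)
print(owned)
```

The explicit conversion is the "bit of a pain" mentioned above, and it also marks the point where a slice stops depending on its base string.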
>>>>>>
>>>>>> How often do you find programmers doing work at the EGC level
>>>>>> that would be better performed at the code unit or code point level?
>>>>
>>>> Often, if a developer has strict requirements, they know what
>>>> they’re doing enough to operate at one of those lower levels.
>>>>
>>>> Not being able to random-access graphemes in a string is a common
>>>> source of frustration and confusion amongst new users.
>>>>
>>>>>> Likewise, how often do you find programmers working with
>>>>>> unicodeScalars, utf8, or utf16 views to do work better performed
>>>>>> at the EGC level? For what reasons does this occur? Perhaps to
>>>>>> work around differences in EGC boundaries across Unicode versions
>>>>>> or the underlying version of ICU in use?
>>>>
>>>> This was very prevalent in Swift’s early days. String wasn’t a
>>>> collection of graphemes by default prior to Swift 4,
>>>
>>> Well, it was. And then in Swift 2 or 3 it wasn't, due to the
>>> algebraic reasoning issue. Now it is again.
>>>
>>>> so without guidance many developers wrote code against the unicode
>>>> scalars view. We also didn’t have any fast-paths for common-case
>>>> situations back then, which further encouraged them to use one of
>>>> the other views.
>>>>
>>>> This is still done sometimes for performance-sensitive usage, or
>>>> someone wanting to handle Unicode themselves. However, as mentioned
>>>> previously, we don’t (yet) provide direct access to the actual storage.
>>>>
>>>> We haven’t seen much desire for reconciling behavior across Unicode
>>>> versions. This may be due to Swift being primarily an applications
>>>> level programming language for devices which only have one version
>>>> of Unicode that’s relevant (the current one).
>>>>
>>>>>> Has consideration been given to exposing Unicode character
>>>>>> database properties? CharacterSet exposes some of these
>>>>>> properties, but have more been requested?
>>>>
>>>> Yes, this was recently added to the language:
>>>> https://github.com/apple/swift-evolution/blob/master/proposals/0211-unicode-scalar-properties.md.
>>>> We surface much of the UCD via ICU.
>>
>> Ah, nice. All kinds of fun to be had with that :)
>>
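For a taste of the scalar properties added by that proposal, here is a minimal sketch. It assumes a toolchain that ships SE-0211 (Swift 5 or later); the scalars chosen are illustrative.

```swift
let letter: Unicode.Scalar = "a"
let accent: Unicode.Scalar = "\u{301}"

print(letter.properties.isAlphabetic)
print(letter.properties.generalCategory == .lowercaseLetter)
print(accent.properties.generalCategory == .nonspacingMark)
print(letter.properties.name ?? "unnamed")
```

Properties are queried per scalar, which matches the earlier remark that Unicode defines most properties on scalars or code points and not on graphemes.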
>>>>
>>>>>> How firmly is the Swift string implementation tied to ICU? If the
>>>>>> C++ standard library were to add suitable Unicode support, what
>>>>>> would motivate reimplementing Swift strings on top of it?
>>>>
>>>> Swift’s tie to ICU is less firm than it used to be. We use ICU for
>>>> the following:
>>>>
>>>> 1. Grapheme breaking
>>>> 2. Normalization
>>>> 3. Accessing UCD properties
>>>> 4. Case conversion
>>>>
>>>> Each of these is not too tightly entwined with string; they’re
>>>> cordoned-off as a couple of shims called on fallback slow-paths.
>>>>
>>>> If the C++ standard library provided these operations, sufficiently
>>>> up-to-date with Unicode version and comparable to or better than ICU
>>>> in performance, we would be willing to switch. A big pain in
>>>> interacting with ICU is their limited support for UTF-8. Some users
>>>> would like to use a “lighter-weight” Swift and are unhappy at
>>>> having to link against ICU, as it’s fairly large, and it can
>>>> complicate security audits.
>>
>> Got it. Increasing the size of the C++ standard library is a
>> definite concern for us as well. We imagine some C++ users would be
>> similarly unhappy if their standard library suddenly required linking
>> against ICU.
>>
>
> If you go the route of implementing Unicode operations without ICU,
> would it be possible to separately link against Unicode support
> without also pulling in all of libc++? If your implementation is
> lighter-weight, yet current, it would be very appealing for Swift to
> consider switching over.

It would be up to the implementation to determine how it is packaged,
but I suspect there will be sufficient motivation for separating out the
heavier parts. Whether those heavier parts could then be used
separately from the rest of the library I can't say. I think this is
something for us to keep in mind as a design point though.

Tom.

>
>>>>
>>>>>> Do Swift programmers tend to prefer string interpolation or
>>>>>> string formatting functions?
>>>>
>>>> Users tend to prefer string interpolation. However, Swift currently
>>>> does not have much in the way of formatting control in
>>>> interpolations, and this is something we’re currently working on.
>>>>
>>>>>> What enhancements would you most like to see in C++ to improve
>>>>>> Unicode support?
>>>>
>>>> Swift’s string is perhaps geared as a higher-level construct than
>>>> what you may want for C++, and Swift has Cocoa-interoperability
>>>> concerns where everything is UTF-16. Rust might provide a closer
>>>> model to what you’re looking for:
>>>>
>>>> * Strings are a sequence of (valid) UTF-8 code units
>>>>     o Validation is done on creation
>>>>     o Invalid contents (e.g. Windows file paths) can be handled
>>>>       via something like WTF-8, which is not intended for interchange
>>>>
>>>> * String provides bidirectional iterators for:
>>>>     o Transcoded and/or normalized code units
>>>>     o Unicode scalar values (their “character” type)
>>>>     o Grapheme clusters
>>>>
>>>
>>> Michael, I think you're not answering the question asked. They are
>>> asking what Swift would want from C++, e.g., to allow us to decouple
>>> from ICU. Wouldn't we like to be able to do that?
>>
>> This question was intended to ask you, as expert C++ programmers
>> independently from Swift, what additions to C++ you think would be
>> most helpful to improve our (very lacking) Unicode support. So,
>> Michael's response is on point (thank you; we'll take a closer look
>> at Rust), as are any comments regarding what would benefit Swift
>> specifically. Michael's earlier comments regarding what Swift
>> currently uses ICU for are suggestive of what Swift might want from
>> C++. But I imagine the form in which those features are provided
>> would matter greatly; devils and details.
>>
>> Tom.
>>
>>>
>>> -Dave
>>>
>>>
>>
>

Received on 2018-08-07 04:14:33