On Fri, Aug 3, 2018 at 10:10 AM Michael Ilseman <milseman@apple.com> wrote:

On Aug 2, 2018, at 10:26 PM, Tom Honermann <tom@honermann.net> wrote:

Thank you Michael and Dave!  I appreciate the time and detail.  All of your answers look to confirm our expectations, so I interpret this as a good sign we're thinking about the right things.

I added a few inline comments/clarifications below.

We had tentatively planned to meet Wednesday of next week, but it turns out that two of our core SG16 members are going to be on vacation so, at a minimum, I'd like to postpone.  I'm also feeling pretty content with the responses that we got from you and I think it would suffice for us to just follow up with any remaining thoughts via email.  While I'd love for any of you to attend one (or more) of our meetings (any time), I want to be sensitive to productive use of your time.  So, how about we play it by ear for now?

I’d be happy to meet up sometime. JF mentioned an in-person meeting sometime this fall. Feel free to grab me whenever you think I can add value.

I meant the upcoming San Diego meeting, November 5–10: http://open-std.org/JTC1/SC22/WG21/docs/papers/2018/n4715.pdf

On 08/02/2018 05:18 PM, Dave Abrahams wrote:

On Aug 1, 2018, at 12:04 PM, Michael Ilseman <milseman@apple.com> wrote:

Hello, I am the current maintainer of Swift’s String, and can speak to my thoughts on the status quo and future directions. Dave, who is on this thread, is much more familiar with the history behind this and can likely provide deeper insight into the reasoning.

Michael has done very well here; I only have a few things to add.

On Jul 23, 2018, at 7:39 PM, Tom Honermann <tom@honermann.net> wrote:

SG16 is seeking input from Swift and WebKit representatives to help inform our work towards enhancing support for Unicode in the C++ standard.  In particular, we recognize the significant amount of effort that went into the design of the Swift String type and would like to better understand the motivations that contributed to its current design and any pressures that might encourage further evolution or refinement; especially for any concerns that would be deemed significant enough to warrant backward incompatible changes.
Though most of these questions specifically mention Swift, that is an artifact of our being more familiar with Swift than the internal workings of WebKit.  Many of these questions would be applicable to any string type designed to support Unicode.  We are therefore also interested in hearing about the string types used by WebKit, the motivations that guided their design, and the trade-offs that have been made.  Of particular interest would be the results of design decisions that contrast with the design of Swift's String type.
Thank you in advance for any time and expertise you are willing and able to share with us.
The Swift string manifesto is about 1 1/2 years old. What have you learned since writing it?  What would you change?  What have you changed?

We haven’t really diverged from that manifesto. Some things are still in progress, minor details were tweaked, but the core arguments are still relevant.

Swift strings are extended grapheme cluster (EGC) based.  What have been the best and worst consequences of this choice?

I’ll use “grapheme” casually to mean EGC. Swift’s Character type represents a grapheme cluster, Unicode.Scalar represents a Unicode scalar value (non-surrogate code point).

Cocoa APIs are UTF-16 code unit oriented, and thus there’s always caution (via documentation) about making sure such indices align to grapheme boundaries. This is a frequent source of bugs, especially as part of internationalization. By making Swift strings be grapheme-based by default, developers first reach for the correct APIs.

Another good consequence is that people picking up Swift and playing with strings, e.g. in a REPL or Playground, see Swift’s notion of characters align with what is displayed. This includes complex multi-component emoji such as family emoji (👨‍👨‍👧‍👧), which is a single Character composed of 7 Unicode.Scalars.
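A minimal sketch of the family-emoji point, assuming a Swift runtime whose grapheme-breaking data covers emoji ZWJ sequences. The variable name is illustrative, and the emoji is spelled with escapes so the three zero-width joiners are visible:

```swift
// Man, ZWJ, man, ZWJ, girl, ZWJ, girl: 4 emoji scalars + 3 joiners.
let family = "\u{1F468}\u{200D}\u{1F468}\u{200D}\u{1F467}\u{200D}\u{1F467}"

print(family.count)                 // Characters (grapheme clusters): 1 with current Unicode data
print(family.unicodeScalars.count)  // Unicode scalar values: 7
print(family.utf16.count)           // UTF-16 code units: 11 (4 surrogate pairs + 3 joiners)
print(family.utf8.count)            // UTF-8 code units: 25 (4 * 4 bytes + 3 * 3 bytes)
```

Only the first number depends on the Unicode version; the other three are fixed by the encoding.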

This does have downsides. What is and is not a grapheme cluster changes with each version of Unicode, and thus grapheme breaking is inherently a run-time concern and can’t be checked at compile time. Another is that while code units can be random-access, graphemes cannot, which is confusing to developers used to UTF-16 code unit access mostly working (until their users use non-BMP scalars or emoji, that is).

I'd say the biggest downside is that there are users who simply refuse to accept what we consider to be the fundamental non-random-access character of any efficient string representation.  They are upset that they can't index a string directly with an integer, and can't be talked out of it.  I still think we made the right decision in this regard; you'd have the same problem if your strings were unicode-scalar-based.
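To illustrate what "can't index a string directly with an integer" looks like in practice, here is a small sketch. The names are illustrative; the calls are the standard String.Index APIs:

```swift
let word = "na\u{EF}ve"   // "naïve"

// `word[2]` does not compile: String has no integer subscript.
// An index has to be derived by walking from a known position, which is O(n).
let thirdIndex = word.index(word.startIndex, offsetBy: 2)
let third = word[thirdIndex]

// The `limitedBy:` variant returns nil instead of trapping when the offset overshoots.
let past = word.index(word.startIndex, offsetBy: 10, limitedBy: word.endIndex)

print(third)        // ï
print(past == nil)  // true
```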

Are there common scenarios where programmers tend to be frustrated by lack of random access?  Perhaps most often when they are working with inputs known to be ASCII only?  Or is this mostly an education issue and these programmers are having a difficult time accepting that they've spent most of their career thus far writing bugs? :)

A lot of it is shaped by expectations coming from other languages, whose programming models do not prioritize operating on Unicode scalar values, let alone grapheme clusters. Objective-C’s default interface with Strings is random-access to UTF-16 code units, which “works” right up until you encounter an emoji or other scalar not on the BMP. It also “works” for graphemes right up until you encounter emoji or a language you didn’t test or non-NFC-normalized contents in a language you did test.
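The two failure modes described here can be made concrete with a short sketch (names are illustrative). It compares UTF-16 code unit counts with Character counts for a non-BMP scalar and for the same text in composed and decomposed form:

```swift
// A scalar outside the BMP needs a surrogate pair in UTF-16.
let smiley = "a\u{1F600}"
print(smiley.count)        // 2 Characters
print(smiley.utf16.count)  // 3 UTF-16 code units

// The same user-visible text in NFC and NFD form.
let composed = "\u{E9}"      // é as a single scalar
let decomposed = "e\u{301}"  // e followed by COMBINING ACUTE ACCENT
print(composed.utf16.count, decomposed.utf16.count)  // 1 2
print(composed.count, decomposed.count)              // 1 1
```

Code that treats one UTF-16 code unit as one character is correct for `composed` and wrong for both `smiley` and `decomposed`.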

This gets compounded by the prevalence of strings in teaching, interviews, programming puzzles, etc., where a string is treated like an array with a more visual representation.

Also note that even for fully ASCII strings we cannot provide random access to grapheme clusters, as “\r\n” is a single grapheme cluster. For pretty much every Unicode-correct operation we provide fast-paths for, there are nasty corner cases that complicate the model.
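A quick way to see the CR-LF corner case, as a sketch with illustrative names:

```swift
let crlf = "\r\n"
print(crlf.utf8.allSatisfy { $0 < 0x80 })  // true: every code unit is ASCII
print(crlf.unicodeScalars.count)           // 2 scalars
print(crlf.count)                          // 1 Character

// So an all-ASCII string can still have fewer Characters than bytes.
let twoLines = "a\r\nb"
print(twoLines.utf8.count, twoLines.count)  // 4 3
```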

Furthermore, few existing specifications are phrased in terms of grapheme clusters, so something like a validator wouldn’t want to run on grapheme-segmented text, but at a lower abstraction level.

Also, graphemes can be funky. A string containing only U+0301 (COMBINING ACUTE ACCENT) has one grapheme, but modifies the prior grapheme upon concatenation. Such degenerate graphemes violate algebraic reasoning in these corner cases.

We are not aware of generic algorithms that rely on concatenation of collections conserving element counts, so we decided to simply document this quirk rather than saying that string is-not-a collection.
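The quirk is easy to reproduce. The following sketch (illustrative names) shows concatenation failing to conserve the element count:

```swift
let base = "e"
let accent = "\u{301}"   // a string holding only COMBINING ACUTE ACCENT
let joined = base + accent

print(base.count, accent.count)  // 1 1
print(joined.count)              // 1, not 2: the accent attaches to the preceding "e"
print(joined == "\u{E9}")        // true: canonically equivalent to precomposed é
```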

SG16 has previously discussed cases like this and I'm happy to hear you haven't had to do anything special for it.  This is a good example of why we asked about inappropriate use of the String count property: programmers assuming s1.count + s2.count == s1.append(s2).count.

Unicode defines properties and most operations on scalars or code points, and very little on top of graphemes.

When porting code unit or code point based code to Swift strings (e.g., when rewriting Objective-C code, or rewriting Swift code to use String instead of NSString), has profiling revealed performance regressions due to the switch to EGC based processing?  If so, what action was taken to correct it?

We have many fast-paths in grapheme-breaking to identify common situations surrounding single-scalar graphemes. If a developer wants to work with Unicode at a lower level, String provides a UTF8View, a UTF16View, and a UnicodeScalarView. Those views lazily transcode/decode upon access.
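As a rough illustration of the three lower-level views, the sketch below inspects a flag emoji, which is a single Character built from two regional indicator scalars. The variable name is illustrative:

```swift
let flag = "\u{1F1FA}\u{1F1F8}"   // REGIONAL INDICATOR SYMBOL LETTER U + LETTER S

print(flag.count)                 // 1 Character
// Each view is a collection over the same storage at a different level.
print(flag.unicodeScalars.map { String($0.value, radix: 16) })  // ["1f1fa", "1f1f8"]
print(flag.utf16.count)           // 4 code units (two surrogate pairs)
print(flag.utf8.count)            // 8 code units (two 4-byte sequences)
```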

Cool, it sounds like the answer to any such regressions was 1) optimization in terms of fast-paths, and 2) fall back to code unit/point processing otherwise.

There are also performance concerns and annoyances when working with ICU, but this is an implementation detail. If you’re interested in using ICU, we can discuss further what has worked best for us.

I think you're interested in (at least optionally) using ICU unless you have evidence of major investment in another open-source implementation of Unicode algorithms and tables.  Otherwise, C++ implementors could not afford to develop standard libraries.

Yes, definitely.  For the foreseeable future, I think we need to ensure that any interfaces we propose can be reasonably implemented using ICU.  However, Zach Laine has made impressive progress implementing many of the Unicode algorithms without use of ICU in his proposed Boost.Text library.  See https://github.com/tzlaine/text and https://tzlaine.github.io/text/doc/html/index.html.

Swift strings do not enforce storage in any particular Unicode normalization form.  Was consideration given to forcing storage in a particular form such as FCC or NFC?

Swift strings now sort with NFC (currently UTF-16 code unit order, but likely to change to Unicode scalar value order). We didn’t find FCC significantly more compelling in practice. Since NFC is far more frequent in the wild (why waste space if you don’t have to), strings are likely to already be in NFC. We have fast-paths to detect on-the-fly normal sections of strings (e.g. all ASCII, all < U+0300, NFC_QC=yes, etc.). We lazily normalize portions of strings during comparison when needed.

As far as enforcing on creation, no. We do want to add an option to perform a linear scan to set a performance flag, perhaps at creation, so that comparison can take the memcmp-like fast-path.

Ok, my takeaway from this is that fast-pathing has been sufficient for lazy normalization (when needed) to not be (much of) a performance concern.  At least, not enough to want to take the normalization cost on every string construction up front.

Swift strings support comparison via normalization.  Has use of canonical string equality been a performance issue?  Or been a source of surprise to programmers?

This was a big performance issue on Linux, where we used to do UCA+DUCET based comparisons. We switched to lexicographical order of NFC-normalized UTF-16 code units (future: scalar values), and saw a very significant speed up there. The remaining performance work revolves around checking and tracking whether a string is known to already be in a normal form, so we can just memcmp.
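For readers less familiar with what comparison via normalization means at the API level, here is a small sketch with illustrative names. Two strings that differ at the scalar level compare equal, and therefore also collapse to one element in a Set:

```swift
let composed = "caf\u{E9}"      // é as one scalar
let decomposed = "cafe\u{301}"  // e + combining acute accent

print(composed == decomposed)  // true: canonical equivalence
print(composed.unicodeScalars.elementsEqual(decomposed.unicodeScalars))  // false
print(Set([composed, decomposed]).count)  // 1: hashing agrees with ==
```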

This is very helpful, thank you.  We've suspected that full collation (with or without tailoring) would be too expensive for use as a default comparison operator, so it is good to hear that confirmed.

I'm curious why this was a larger performance issue for Linux than for (presumably) macOS and/or iOS.

There were two main factors. The first is that on Darwin platforms, CFString had an implementation that we used instead of UCA+DUCET which was faster. The second is that Darwin platforms are typically up-to-date and have very recent versions of ICU. On Linux, we still support Ubuntu LTS 14.04 which has a version of ICU which predates Swift and didn’t have any fast-paths for ASCII or mostly-ASCII text.

Switching to our own implementation based on NFC gave us a many-X improvement over CFString, which in turn was many X faster than UCA+DUCET (especially on older versions of ICU).

Swift strings are not locale sensitive.  Was any consideration given to creation of a distinct locale sensitive string type?

This is still up for debate and hasn’t been settled yet, but we think it makes a lot of sense. If an array of strings is sorted, we certainly don’t want a locale-change to violate programmer invariants. A distinct type from string could avoid a lot of common errors here, including forgetting to localize before presenting to a user as part of a UI.
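A sketch of what locale-insensitive ordering means for a sorted array (names are illustrative). The default `<` orders by normalized code values, so the result is the same under every locale, and is not what a user-facing list would want:

```swift
let words = ["z", "Z", "a", "\u{E9}"]   // the last element is "é"
let sortedWords = words.sorted()

// Uppercase ASCII sorts before lowercase, and é (U+00E9) sorts after z.
print(sortedWords)  // ["Z", "a", "z", "é"]
```

Presenting such a list to a user would need a separate, locale-aware comparison, which is the role a distinct localized string type could play.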

Swift strings provide a count property as required to satisfy the Collection protocol.  How often do programmers use count (the number of EGCs in the string) inappropriately?

I’m not sure what would constitute inappropriate usage here. We do not currently provide access to the underlying stored code units, though this is a frequent request and we likely will in the future. I haven’t seen anyone baking in the assumption that count is the same for String and across all of String’s views (UTF-8, UTF-16, Unicode scalars).

One thing to consider is that as long as String is not random-access, count will be a worst-case O(N) operation.  An inappropriate usage might involve computing the length once per loop iteration.
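A sketch of the pattern to avoid and two alternatives, with illustrative names. Because `count` walks the string, calling it in a loop condition turns a linear loop into a quadratic one:

```swift
let text = String(repeating: "ab", count: 1_000)

// Avoid: `while i < text.count { ... }` recomputes an O(n) count on every iteration.

// Alternative 1: compute the count once.
let length = text.count

// Alternative 2: iterate the Characters directly and skip counting altogether.
var visited = 0
for _ in text {
    visited += 1
}

// For emptiness checks, `isEmpty` is O(1), unlike `count == 0`.
print(length, visited, text.isEmpty)  // 2000 2000 false
```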

In addition to the above and prior mention of algebraic concerns, other potential abuses we had in mind were using it to determine field widths for display or code unit/point based storage.

Display width is a whole other concern accounting for rendering environment, font, etc. I don’t have expertise here.

C++ container requirements specify that .size() be O(1).  For us to meet container requirements would require computing and caching the count during construction and mutation operations.  We could potentially get by just meeting range requirements though.

I mentioned degenerate graphemes breaking algebraic properties of the Collection protocol, but this hasn’t been a huge issue in practice so far.

Swift strings support several memory unsafe initializers and methods.  How frequently are these used incorrectly?

Many of these initializers come from NSString originally, and developers migrating correct code to Swift maintain that correctness. Rust has a similar situation, though they do validation at creation-time and from_utf8_unchecked() voids memory-safety if the contents are invalid.
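For contrast with the unchecked route, the sketch below shows one of the memory-safe initializers, `String(decoding:as:)`, which repairs invalid input rather than trusting it. The byte values and names are illustrative:

```swift
// "hi" followed by 0xFF, a byte that can never appear in well-formed UTF-8.
let bytes: [UInt8] = [0x68, 0x69, 0xFF]

// Invalid sequences are replaced with U+FFFD instead of being stored as-is.
let repaired = String(decoding: bytes, as: UTF8.self)

print(repaired.unicodeScalars.count)  // 3
print(repaired == "hi\u{FFFD}")       // true
```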

The Swift manifesto discussed three approaches to handling substrings and Swift 4 changed from "same type, shared storage" to "different type, shared storage".  Any regrets?

Having two types can be a bit of a pain, but we still think it was the right thing to do. This is consistent with Swift treating slices as a distinct type from the base collection.
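A brief sketch of the "different type, shared storage" model and the small pain it brings (names, including the helper function, are illustrative). Slicing yields a Substring, which has to be converted explicitly before it can be passed where a String is expected:

```swift
let sentence = "hello world"

// Slicing operations return Substring, which shares the base string's storage.
let firstWord = sentence.prefix(5)
print(type(of: firstWord))  // Substring

// A hypothetical API that takes String, not Substring.
func shout(_ s: String) -> String {
    return s.uppercased()
}

// shout(firstWord) would not compile; the copy into a String is explicit.
let owned = String(firstWord)
print(shout(owned))  // HELLO
```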

How often do you find programmers doing work at the EGC level that would be better performed at the code unit or code point level?

Often, if a developer has strict requirements, they know what they’re doing enough to operate at one of those lower levels.

Not being able to random-access graphemes in a string is a common source of frustration and confusion amongst new users.

Likewise, how often do you find programmers working with unicodeScalars, utf8, or utf16 views to do work better performed at the EGC level?  For what reasons does this occur?  Perhaps to work around differences in EGC boundaries across Unicode versions or the underlying version of ICU in use?

This was very prevalent in Swift’s early days. String wasn’t a collection of graphemes by default prior to Swift 4,

Well, it was.  And then in Swift 2 or 3 it wasn't, due to the algebraic reasoning issue.  Now it is again.

so without guidance many developers wrote code against the unicode scalars view. We also didn’t have any fast-paths for common-case situations back then, which further encouraged them to use one of the other views.

This is still done sometimes for performance-sensitive usage, or someone wanting to handle Unicode themselves. However, as mentioned previously, we don’t (yet) provide direct access to the actual storage.

We haven’t seen much desire for reconciling behavior across Unicode versions. This may be due to Swift being primarily an applications level programming language for devices which only have one version of Unicode that’s relevant (the current one).

Has consideration been given to exposing Unicode character database properties? CharacterSet exposes some of these properties, but have more been requested?

Yes, this was recently added to the language: https://github.com/apple/swift-evolution/blob/master/proposals/0211-unicode-scalar-properties.md. We surface much of the UCD via ICU.
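A short sketch of the API that proposal added, `Unicode.Scalar.properties`, assuming a toolchain that ships it (Swift 5 or later). Variable names are illustrative:

```swift
let letter: Unicode.Scalar = "A"
print(letter.properties.isAlphabetic)                         // true
print(letter.properties.generalCategory == .uppercaseLetter)  // true
print(letter.properties.name ?? "unnamed")                    // LATIN CAPITAL LETTER A

let accent: Unicode.Scalar = "\u{301}"
print(accent.properties.generalCategory == .nonspacingMark)   // true
print(accent.properties.canonicalCombiningClass.rawValue)     // 230
```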

Ah, nice.  All kinds of fun to be had with that :)

How firmly is the Swift string implementation tied to ICU?  If the C++ standard library were to add suitable Unicode support, what would motivate reimplementing Swift strings on top of it?

Swift’s tie to ICU is less firm than it used to be. We use ICU for the following:

1. Grapheme breaking
2. Normalization
3. Accessing UCD properties
4. Case conversion

None of these is too tightly entwined with string; they’re cordoned-off as a couple of shims called on fallback slow-paths.
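As one example of why case conversion needs Unicode data rather than a byte-wise table, the sketch below upper-cases a string containing ß. It uses only public standard library calls; whether a given toolchain routes them through ICU or a native implementation is not visible at this level, and the names are illustrative:

```swift
let street = "stra\u{DF}e"   // "straße"

// Full case mapping can change the length of the string.
print(street.uppercased())                    // STRASSE
print(street.count, street.uppercased().count)  // 6 7

// The same mapping is exposed per scalar through the UCD properties.
let sharpS: Unicode.Scalar = "\u{DF}"
print(sharpS.properties.uppercaseMapping)     // SS
```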

If the C++ standard library provided these operations, sufficiently up-to-date with the Unicode version and comparable to or better than ICU in performance, we would be willing to switch. A big pain in interacting with ICU is its limited support for UTF-8. Some users would like to use a “lighter-weight” Swift and are unhappy at having to link against ICU, as it’s fairly large and can complicate security audits.

Got it.  Increasing the size of the C++ standard library is a definite concern for us as well.  We imagine some C++ users would be similarly unhappy if their standard library suddenly required linking against ICU.

If you go the route of implementing Unicode operations without ICU, would it be possible to separately link against Unicode support without also pulling in all of libc++? If your implementation is lighter-weight, yet current, it would be very appealing for Swift to consider switching over.

Do Swift programmers tend to prefer string interpolation or string formatting functions?

Users tend to prefer string interpolation. However, Swift currently does not have much in the way of formatting control in interpolations, and this is something we’re currently working on.
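A sketch contrasting the two styles (names are illustrative). Interpolation is concise but offers no precision or width control of its own, so formatting control here falls back to Foundation's printf-style initializer:

```swift
import Foundation

let ratio = 2.0 / 3.0

// Interpolation: the value is rendered with its default description.
let interpolated = "ratio = \(ratio)"

// Formatting function: explicit control over precision, via Foundation.
let formatted = String(format: "ratio = %.2f", ratio)

print(interpolated)
print(formatted)  // ratio = 0.67
```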

What enhancements would you most like to see in C++ to improve Unicode support?

Swift’s string is perhaps geared as a higher-level construct than what you may want for C++, and Swift has Cocoa-interoperability concerns where everything is UTF-16. Rust might provide a closer model to what you’re looking for:

  • Strings are a sequence of (valid) UTF-8 code units
    • Validation is done on creation
    • Invalid contents (e.g. Windows file paths) can be handled via something like WTF-8, which is not intended for interchange
  • String provides bidirectional iterators for:
    • Transcoded and/or normalized code units
    • Unicode scalar values (their “character” type)
    • Grapheme clusters

Michael, I think you're not answering the question asked.  They are asking what Swift would want from C++, e.g., to allow us to decouple from ICU.  Wouldn't we like to be able to do that?

This question was intended to ask you, as expert C++ programmers independently from Swift, what additions to C++ you think would be most helpful to improve our (very lacking) Unicode support.  So, Michael's response is on point (thank you; we'll take a closer look at Rust), as are any comments regarding what would benefit Swift specifically.  Michael's earlier comments regarding what Swift currently uses ICU for are suggestive of what Swift might want from C++.  But I imagine the form in which those features are provided would matter greatly; devils and details.



SG16 Unicode mailing list