sg16: Re: [SG16-Unicode] SG16 Unicode related questions for Swift and WebKit representatives

From: JF Bastien <cxx_at_[hidden]>
Date: Fri, 3 Aug 2018 10:12:28 -0700

On Fri, Aug 3, 2018 at 10:10 AM Michael Ilseman <milseman_at_[hidden]> wrote:

>
>
> On Aug 2, 2018, at 10:26 PM, Tom Honermann <tom_at_[hidden]> wrote:
>
> Thank you Michael and Dave! I appreciate the time and detail. All of
> your answers look to confirm our expectations, so I interpret this as a
> good sign we're thinking about the right things.
>
> I added a few inline comments/clarifications below.
>
> We had tentatively planned to meet Wednesday of next week, but it turns
> out that two of our core SG16 members are going to be on vacation so, at a
> minimum, I'd like to postpone. I'm also feeling pretty content with the
> responses that we got from you and I think it would suffice for us to just
> follow up with any remaining thoughts via email. While I'd love for any of
> you to attend one (or more) of our meetings (any time), I want to be
> sensitive to productive use of your time. So, how about we play it by ear
> for now?
>
>
> I’d be happy to meet up sometime. JF mentioned an in-person meeting
> sometime this fall. Feel free to grab me whenever you think I can add value.
>

I meant the upcoming San Diego meeting, November 5–10:
http://open-std.org/JTC1/SC22/WG21/docs/papers/2018/n4715.pdf

On 08/02/2018 05:18 PM, Dave Abrahams wrote:
>
>
>
> On Aug 1, 2018, at 12:04 PM, Michael Ilseman <milseman_at_[hidden]> wrote:
>
> Hello, I am the current maintainer of Swift’s String, and can speak to my
> thoughts on the status quo and future directions. Dave, who is on this
> thread, is much more familiar with the history behind this and can likely
> provide deeper insight into the reasoning.
>
>
> Michael has done very well here; I only have a few things to add.
>
>
> On Jul 23, 2018, at 7:39 PM, Tom Honermann <tom_at_[hidden]> wrote:
>
> SG16 is seeking input from Swift and WebKit representatives to help inform
> our work towards enhancing support for Unicode in the C++ standard. In
> particular, we recognize the significant amount of effort that went into
> the design of the Swift String type and would like to better understand the
> motivations that contributed to its current design and any pressures that
> might encourage further evolution or refinement; especially for any
> concerns that would be deemed significant enough to warrant backward
> incompatible changes.
> Though most of these questions specifically mention Swift, that is an
> artifact of our being more familiar with Swift than the internal workings
> of WebKit. Many of these questions would be applicable to any string type
> designed to support Unicode. We are therefore also interested in hearing
> about the string types used by WebKit, the motivations that guided their
> design, and the trade-offs that have been made. Of particular interest
> would be the results of design decisions that contrast with the design
> of Swift's String type.
> Thank you in advance for any time and expertise you are willing and able
> to share with us.
>
> The Swift string manifesto is about 1 1/2 years old. What have you learned
> since writing it? What would you change? What have you changed?
>
>
> We haven’t really diverged from that manifesto. Some things are still in
> progress, minor details were tweaked, but the core arguments are still
> relevant.
>
>
> Swift strings are extended grapheme cluster (EGC) based. What have been
> the best and worst consequences of this choice?
>
>
> I’ll use “grapheme” casually to mean EGC. Swift’s Character type
> represents a grapheme cluster, Unicode.Scalar represents a Unicode scalar
> value (non-surrogate code point).
>
> Cocoa APIs are UTF-16 code unit oriented, and thus there’s always caution
> (via documentation) about making sure such indices align to grapheme
> boundaries. This is a frequent source of bugs, especially as part of
> internationalization. By making Swift strings be grapheme-based by default,
> developers first reach for the correct APIs.
>
> Another good consequence is that people picking up Swift and playing with
> strings, e.g. in a REPL or Playground, see Swift’s notion of characters
> align with what is displayed. This includes complex multi-component emoji
> such as family emoji (👨‍👨‍👧‍👧), which is a single Character composed of
> 7 Unicode.Scalars.
>
> This does have downsides. What is and is not a grapheme cluster changes
> with each version of Unicode, and thus grapheme breaking is inherently a
> run-time concern and can’t be checked at compile time. Another is that
> while code units can be random-access, graphemes cannot, which is confusing
> to developers used to UTF-16 code unit access mostly working (until their
> users use non-BMP scalars or emoji, that is).
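
A small Swift sketch of the counts described above, assuming a Swift 5 or later toolchain whose Unicode data knows about emoji ZWJ sequences. The variable names are illustrative only and do not come from the thread:

```swift
// The family emoji from the message above, written with explicit scalars:
// four person emoji joined by three ZERO WIDTH JOINERs (U+200D).
let family = "\u{1F468}\u{200D}\u{1F468}\u{200D}\u{1F467}\u{200D}\u{1F467}"

let graphemeCount = family.count               // Characters (grapheme clusters)
let scalarCount = family.unicodeScalars.count  // Unicode scalar values
let utf16Count = family.utf16.count            // UTF-16 code units
let utf8Count = family.utf8.count              // UTF-8 code units

print(graphemeCount, scalarCount, utf16Count, utf8Count)

// String has no integer subscript; positions are opaque indices reached by
// walking from a known index, which takes time linear in the offset.
let word = "caf\u{E9}!"
let fourth = word[word.index(word.startIndex, offsetBy: 3)]
print(fourth)
```

The same text reports a different length in each view, which is why code that assumes one shared notion of "length" tends to break on input like this.
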
>
>
> I'd say the biggest downside is that there are users who simply refuse to
> accept what we consider to be the fundamental non-random-access character
> of any efficient string representation. They are upset that they can't
> index a string directly with an integer, and can't be talked out of it. I
> still think we made the right decision in this regard; you'd have the same
> problem if your strings were unicode-scalar-based.
>
>
> Are there common scenarios where programmers tend to be frustrated by lack
> of random access? Perhaps most often when they are working with inputs
> known to be ASCII only? Or is this mostly an education issue and these
> programmers are having a difficult time accepting that they've spent most
> of their career thus far writing bugs? :)
>
>
> A lot of it is shaped by expectations coming from other languages, whose
> programming models do not prioritize operating on Unicode scalar values,
> let alone grapheme clusters. Objective-C’s default interface with Strings
> is random-access to UTF-16 code units, which “works” right up until you
> encounter an emoji or other scalar not on the BMP. It also “works” for
> graphemes right up until you encounter emoji or a language you didn’t test
> or non-NFC-normalized contents in a language you did test.
>
> This gets compounded by the prevalence of strings in teaching, interviews,
> programming puzzles, etc., where a string is treated like an array with a
> more visual representation.
>
> Also note that even for fully ASCII strings we cannot provide random
> access to grapheme clusters, as “\r\n” is a single grapheme cluster. For
> pretty much every Unicode-correct operation we provide fast-paths for,
> there are nasty corner cases that complicate the model.
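
The “\r\n” point can be checked directly. This is a minimal sketch assuming a Swift 5 or later toolchain; the names `lines`, `isASCII`, and `newline` are made up for illustration:

```swift
// An ASCII-only string of six UTF-8 bytes. CR followed by LF forms a single
// grapheme cluster, so the Character count is smaller than the byte count.
let lines = "ab\r\ncd"
let isASCII = lines.utf8.allSatisfy { $0 < 0x80 }
print(isASCII, lines.utf8.count, lines.count)

// The pair surfaces as one Character at offset 2.
let newline = lines[lines.index(lines.startIndex, offsetBy: 2)]
print(newline == "\r\n")
```

So even a verified all-ASCII string cannot map a Character offset to a byte offset without scanning for this pair.
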
>
>
> Furthermore, few existing specifications are phrased in terms of
> grapheme clusters, so something like a validator wouldn’t want to run on
> grapheme-segmented text, but at a lower abstraction level.
>
> Also, graphemes can be funky. A string containing only U+0301 (COMBINING
> ACUTE ACCENT) has one grapheme, but modifies the prior grapheme upon
> concatenation. Such degenerate graphemes violate algebraic reasoning in
> these corner cases.
>
>
> We are not aware of generic algorithms that rely on concatenation of
> collections conserving element counts, so we decided to simply document
> this quirk rather than saying that string is-not-a collection.
>
>
> SG16 has previously discussed cases like this and I'm happy to hear you
> haven't had to do anything special for it. This is a good example of why
> we asked about inappropriate use of the String count property: programmers
> assuming s1.count + s2.count == (s1 + s2).count.
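
The degenerate-grapheme case can be reproduced in a few lines. This is a sketch under the assumption of a Swift 5 or later toolchain, with illustrative names:

```swift
let base = "e"
let accent = "\u{301}"  // COMBINING ACUTE ACCENT on its own

// On its own, the accent counts as one (degenerate) grapheme cluster...
let separateTotal = base.count + accent.count

// ...but after concatenation it attaches to the preceding "e".
let joined = base + accent
let joinedTotal = joined.count

print(separateTotal, joinedTotal)
```

The two totals differ, so concatenation does not conserve the element count here.
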
>
>
> Unicode defines properties and most operations on scalars or code points,
> and very little on top of graphemes.
>
> When porting code unit or code point based code to Swift strings (e.g.,
> when rewriting Objective-C code, or rewriting Swift code to use String
> instead of NSString), has profiling revealed performance regressions due to
> the switch to EGC based processing? If so, what action was taken to
> correct it?
>
>
> We have many fast-paths in grapheme-breaking to identify common situations
> surrounding single-scalar graphemes. If a developer wants to work with
> Unicode at a lower level, String provides a UTF8View, a UTF16View, and a
> UnicodeScalarView. Those views lazily transcode/decode upon access.
>
>
> Cool, it sounds like the answer to any such regressions was 1)
> optimization in terms of fast-paths, and 2) fall back to code unit/point
> processing otherwise.
>
>
> There are also performance concerns and annoyances when working with ICU,
> but this is an implementation detail. If you’re interested in using ICU, we
> can discuss further what has worked best for us.
>
>
> I think you're interested in (at least optionally) using ICU unless you
> have evidence of major investment in another open-source implementation of
> Unicode algorithms and tables. Otherwise, C++ implementors could not
> afford to develop standard libraries.
>
>
> Yes, definitely. For the foreseeable future, I think we need to ensure
> that any interfaces we propose can be reasonably implemented using ICU.
> However, Zach Laine has made impressive progress implementing many of the
> Unicode algorithms without use of ICU in his proposed Boost.Text library.
> See https://github.com/tzlaine/text and
> https://tzlaine.github.io/text/doc/html/index.html.
>
>
>
>
> Swift strings do not enforce storage in any particular Unicode
> normalization form. Was consideration given to forcing storage in a
> particular form such as FCC or NFC?
>
>
> Swift strings now sort with NFC (currently UTF-16 code unit order, but
> likely to change to Unicode scalar value order). We didn’t find FCC
> significantly more compelling in practice. Since NFC is far more frequent
> in the wild (why waste space if you don’t have to), strings are likely to
> already be in NFC. We have fast-paths to detect on-the-fly normal sections
> of strings (e.g. all ASCII, all < U+0300, NFC_QC=yes, etc.). We lazily
> normalize portions of a string during comparison when needed.
>
> As far as enforcing on creation, no. We do want to add an option to
> perform a linear scan to set a performance flag, perhaps at creation, so
> that comparison can take the memcmp-like fast-path.
>
>
> Ok, my take away from this is that fast-pathing has been sufficient for
> lazy normalization (when needed) to not be (much of) a performance
> concern. At least, not enough to want to take the normalization cost on
> every string construction up front.
>
>
> Swift strings support comparison via normalization. Has use of canonical
> string equality been a performance issue? Or been a source of surprise to
> programmers?
>
>
> This was a big performance issue on Linux, where we used to do UCA+DUCET
> based comparisons. We switched to lexicographical order of NFC-normalized
> UTF-16 code units (future: scalar values), and saw a very significant speed
> up there. The remaining performance work revolves around checking and
> tracking whether a string is known to already be in a normal form, so we
> can just memcmp.
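
A short sketch of what comparison via normalization means at the API level, assuming a Swift 5 or later toolchain. It shows only the observable behavior, not the fast-path implementation described above, and the names are illustrative:

```swift
let precomposed = "caf\u{E9}"   // U+00E9 LATIN SMALL LETTER E WITH ACUTE
let decomposed = "cafe\u{301}"  // "e" followed by U+0301 COMBINING ACUTE ACCENT

// String == compares under canonical equivalence.
let equalAsStrings = (precomposed == decomposed)

// The lower-level views expose the differing representations.
let equalAsScalars = precomposed.unicodeScalars.elementsEqual(decomposed.unicodeScalars)
let equalAsUTF8 = precomposed.utf8.elementsEqual(decomposed.utf8)

print(equalAsStrings, equalAsScalars, equalAsUTF8)
```

The strings compare equal while their scalar and UTF-8 views do not, which is the case a plain memcmp cannot handle unless both sides are known to be normalized.
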
>
>
> This is very helpful, thank you. We've suspected that full collation
> (with or without tailoring) would be too expensive for use as a default
> comparison operator, so it is good to hear that confirmed.
>
> I'm curious why this was a larger performance issue for Linux than for
> (presumably) macOS and/or iOS.
>
>
> There were two main factors. The first is that on Darwin platforms,
> CFString had an implementation that we used instead of UCA+DUCET which was
> faster. The second is that Darwin platforms are typically up-to-date and
> have very recent versions of ICU. On Linux, we still support Ubuntu LTS
> 14.04 which has a version of ICU which predates Swift and didn’t have any
> fast-paths for ASCII or mostly-ASCII text.
>
> Switching to our own implementation based on NFC gave us a many-X
> improvement over CFString, which in turn was many X faster than UCA+DUCET
> (especially on older versions of ICU).
>
>
> Swift strings are not locale sensitive. Was any consideration given to
> creation of a distinct locale sensitive string type?
>
>
> This is still up for debate and hasn’t been settled yet, but we think it
> makes a lot of sense. If an array of strings is sorted, we certainly don’t
> want a locale-change to violate programmer invariants. A distinct type from
> string could avoid a lot of common errors here, including forgetting to
> localize before presenting to a user as part of a UI.
>
> Swift strings provide a count property as required to satisfy the
> Collection protocol. How often do programmers use count (the number of
> EGCs in the string) inappropriately?
>
>
> I’m not sure what would constitute inappropriate usage here. We do not
> currently provide access to the underlying stored code units, though this
> is a frequent request and we likely will in the future. I haven’t seen
> anyone baking in the assumption that count is the same for String and
> across all of String’s views (UTF-8, UTF-16, Unicode scalars).
>
>
> One thing to consider is that as long as String is not random-access,
> count will be a worst-case O(N) operation. An inappropriate usage might
> involve computing the length once per loop iteration.
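
One way to avoid the once-per-iteration cost is to read count a single time. This is a minimal sketch, assuming a Swift 5 or later toolchain and made-up names:

```swift
let text = String(repeating: "ab\r\n", count: 1_000)

// count walks the string to find grapheme boundaries, so read it once
// instead of re-evaluating it in the loop condition.
let total = text.count

var visited = 0
var index = text.startIndex
while visited < total {
    index = text.index(after: index)
    visited += 1
}
print(total, index == text.endIndex)
```

In most code a plain `for character in text` loop avoids the question altogether, since it never needs the count.
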
>
>
> In addition to the above and prior mention of algebraic concerns, other
> potential abuses we had in mind were using it to determine field widths for
> display or code unit/point based storage.
>
>
> Display width is a whole other concern accounting for rendering
> environment, font, etc. I don’t have expertise here.
>
> C++ container requirements specify that .size() be O(1). For us to meet
> container requirements would require computing and caching the count during
> construction and mutation operations. We could potentially get by just
> meeting range requirements though.
>
>
> I mentioned degenerate graphemes breaking algebraic properties of the
> Collection protocol, but this hasn’t been a huge issue in practice so far.
>
>
> Swift strings support several memory unsafe initializers and methods. How
> frequently are these used incorrectly?
>
>
> Many of these initializers come from NSString originally, and developers
> migrating correct code to Swift maintain that correctness. Rust has a
> similar situation, though they do validation at creation-time and
> from_utf8_unchecked() voids memory-safety if the contents are invalid.
>
> The Swift manifesto discussed three approaches to handling substrings and
> Swift 4 changed from "same type, shared storage" to "different type, shared
> storage". Any regrets?
>
>
> Having two types can be a bit of a pain, but we still think it was the
> right thing to do. This is consistent with Swift treating slices as a
> distinct type from the base collection.
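
For readers less familiar with the "different type, shared storage" model, here is a hedged sketch assuming a Swift 5 or later toolchain. The `shout` helper is hypothetical and exists only to show generic code over both types:

```swift
let greeting = "Hello, world"

// Slicing a String yields a Substring that shares the original's storage.
let firstWord: Substring = greeting.prefix(5)

// Substring indices are indices into the base string.
let sharesIndices = (firstWord.startIndex == greeting.startIndex)

// Converting back to String makes an independent copy for long-term storage.
let owned = String(firstWord)

// Generic code over StringProtocol accepts both types.
func shout<S: StringProtocol>(_ text: S) -> String {
    return text.uppercased() + "!"
}

print(shout(greeting), shout(firstWord), owned, sharesIndices)
```

The explicit `String(firstWord)` conversion is the "bit of a pain" mentioned above: it marks the point where a slice stops keeping the whole original buffer alive.
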
>
>
> How often do you find programmers doing work at the EGC level that would
> be better performed at the code unit or code point level?
>
>
> Often, if a developer has strict requirements, they know what they’re
> doing enough to operate at one of those lower levels.
>
> Not being able to random-access graphemes in a string is a common source
> of frustration and confusion amongst new users.
>
> Likewise, how often do you find programmers working with unicodeScalars,
> utf8, or utf16 views to do work better performed at the EGC level? For
> what reasons does this occur? Perhaps to work around differences in EGC
> boundaries across Unicode versions or the underlying version of ICU in use?
>
>
> This was very prevalent in Swift’s early days. String wasn’t a collection
> of graphemes by default prior to Swift 4,
>
>
> Well, it was. And then in Swift 2 or 3 it wasn't, due to the algebraic
> reasoning issue. Now it is again.
>
> so without guidance many developers wrote code against the unicode scalars
> view. We also didn’t have any fast-paths for common-case situations back
> then, which further encouraged them to use one of the other views.
>
> This is still done sometimes for performance-sensitive usage, or someone
> wanting to handle Unicode themselves. However, as mentioned previously, we
> don’t (yet) provide direct access to the actual storage.
>
> We haven’t seen much desire for reconciling behavior across Unicode
> versions. This may be due to Swift being primarily an applications level
> programming language for devices which only have one version of Unicode
> that’s relevant (the current one).
>
> Has consideration been given to exposing Unicode character database
> properties? CharacterSet exposes some of these properties, but have more
> been requested?
>
>
> Yes, this was recently added to the language:
> https://github.com/apple/swift-evolution/blob/master/proposals/0211-unicode-scalar-properties.md.
> We surface much of the UCD via ICU.
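
A brief sketch of the scalar-properties API from that proposal, assuming a Swift 5.0 or later toolchain where `Unicode.Scalar.properties` is available. The variable names are illustrative:

```swift
let scalar: Unicode.Scalar = "\u{E9}"
let props = scalar.properties

let isAlphabetic = props.isAlphabetic
let category = props.generalCategory
let name = props.name  // Optional: not every scalar has a name

print(isAlphabetic, category == .lowercaseLetter, name ?? "<unnamed>")
```
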
>
>
> Ah, nice. All kinds of fun to be had with that :)
>
>
> How firmly is the Swift string implementation tied to ICU? If the C++
> standard library were to add suitable Unicode support, what would motivate
> reimplementing Swift strings on top of it?
>
>
> Swift’s tie to ICU is less firm than it used to be. We use ICU for the
> following:
>
> 1. Grapheme breaking
> 2. Normalization
> 3. Accessing UCD properties
> 4. Case conversion
>
> None of these is too tightly entwined with string; they’re
> cordoned-off as a couple of shims called on fallback slow-paths.
>
> If the C++ standard library provided these operations, sufficiently
> up-to-date with Unicode version and comparable to or better than ICU in
> performance, we would be willing to switch. A big pain in interacting with
> ICU is their limited support for UTF-8. Some users would like to use a
> “lighter-weight” Swift and are unhappy at having to link against ICU, as
> it’s fairly large, and it can complicate security audits.
>
>
> Got it. Increasing the size of the C++ standard library is a definite
> concern for us as well. We imagine some C++ users would be similarly
> unhappy if their standard library suddenly required linking against ICU.
>
>
> If you go the route of implementing Unicode operations without ICU, would
> it be possible to separately link against Unicode support without also
> pulling in all of libc++? If your implementation is lighter-weight, yet
> current, it would be very appealing for Swift to consider switching over.
>
>
> Do Swift programmers tend to prefer string interpolation or string
> formatting functions?
>
>
> Users tend to prefer string interpolation. However, Swift currently does
> not have much in the way of formatting control in interpolations, and this
> is something we’re currently working on.
>
> What enhancements would you most like to see in C++ to improve Unicode
> support?
>
>
> Swift’s string is perhaps geared as a higher-level construct than what you
> may want for C++, and Swift has Cocoa-interoperability concerns where
> everything is UTF-16. Rust might provide a closer model to what you’re
> looking for:
>
>
> - Strings are a sequence of (valid) UTF-8 code units
>   - Validation is done on creation
>   - Invalid contents (e.g. Windows file paths) can be handled via
>     something like WTF-8, which is not intended for interchange
> - String provides bidirectional iterators for:
>   - Transcoded and/or normalized code units
>   - Unicode scalar values (their “character” type)
>   - Grapheme clusters
>
>
> Michael, I think you're not answering the question asked. They are asking
> what Swift would want from C++, e.g., to allow us to decouple from ICU.
> Wouldn't we like to be able to do that?
>
>
> This question was intended to ask you, as expert C++ programmers
> independently from Swift, what additions to C++ you think would be most
> helpful to improve our (very lacking) Unicode support. So, Michael's
> response is on point (thank you; we'll take a closer look at Rust), as are
> any comments regarding what would benefit Swift specifically. Michael's
> earlier comments regarding what Swift currently uses ICU for are suggestive
> of what Swift might want from C++. But I imagine the form in which those
> features are provided would matter greatly; devils and details.
>
> Tom.
>
>
> -Dave

Received on 2018-08-03 19:12:42