On 08/02/2018 05:18 PM, Dave Abrahams wrote:
Hello, I am the current maintainer of
Swift’s String, and can speak to my thoughts on the
status quo and future directions. Dave, who is on this
thread, is much more familiar with the history behind
this and can likely provide deeper insight into the
motivations behind the original design.
Michael has done very well here; I only have a few things to add.
On Jul 23, 2018, at 7:39 PM, Tom Honermann wrote:
SG16 is seeking input from Swift and
WebKit representatives to help inform our work
towards enhancing support for Unicode in the
C++ standard. In particular, we recognize the
significant amount of effort that went into
the design of the Swift String type and would
like to better understand the motivations that
contributed to its current design and any
pressures that might encourage further
evolution or refinement; especially for any
concerns that would be deemed significant
enough to warrant backward incompatible changes.
Though most of these questions specifically
mention Swift, that is an artifact of our
being more familiar with Swift than the
internal workings of WebKit. Many of these
questions would be applicable to any string
type designed to support Unicode. We are
therefore also interested in hearing about the
string types used by WebKit, the motivations
that guided their design, and the trade-offs
that have been made. Of particular interest
would be the results of design decisions that
contrast with the design of Swift's String type.
Thank you in advance for any time and
expertise you are willing and able to share.
The Swift string manifesto is
about 1 1/2 years old. What have you
learned since writing it? What would
you change? What have you changed?
We haven’t really diverged from that
manifesto. Some things are still in progress, minor
details were tweaked, but the core arguments still stand.
Swift strings are extended grapheme
cluster (EGC) based. What have been the
best and worst consequences of this design?
I’ll use “grapheme” casually to mean EGC.
Swift’s Character type represents a grapheme
cluster, Unicode.Scalar represents a Unicode scalar
value (non-surrogate code point).
Cocoa APIs are UTF-16 code unit oriented, and
thus there’s always caution (via documentation)
about making sure such indices align to grapheme
boundaries. This is a frequent source of bugs,
especially as part of internationalization. By
making Swift strings be grapheme-based by default,
developers first reach for the correct APIs.
Another good consequence is that people
picking up Swift and playing with strings, e.g. in a
REPL or Playground, see Swift’s notion of characters
align with what is displayed. This includes complex
multi-component emoji such as the family emoji
(👨‍👨‍👧‍👧), which is a single Character composed
of 7 Unicode.Scalars.
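For example, a minimal snippet showing the counts
the different views report for that emoji:

    let family: Character = "👨‍👨‍👧‍👧"
    let s = String(family)
    print(s.count)                 // 1 — one Character
    print(s.unicodeScalars.count)  // 7 — four person scalars joined by three ZWJs
    print(s.utf16.count)           // 11 — each person is a surrogate pair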
This does have downsides. What is and is not
a grapheme cluster changes with each version of
Unicode, and thus grapheme breaking is inherently a
run-time concern and can’t be checked at compile
time. Another is that while code units can be
random-access, graphemes cannot, which is confusing
to developers used to UTF-16 code unit access mostly
working (until their users use non-BMP scalars or
emoji, that is).
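As a small illustration of the resulting index-based
model:

    let s = "a👩‍🚀z"
    // s[1]   // error: String has no integer subscript
    let i = s.index(s.startIndex, offsetBy: 1)  // an O(n) walk over graphemes
    print(s[i])                                 // 👩‍🚀 — one Character, many code units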
I'd say the biggest downside is that there are users who
simply refuse to accept what we consider to be the fundamental
non-random-access character of any efficient string
representation. They are upset that they can't index a string
directly with an integer, and can't be talked out of it. I
still think we made the right decision in this regard; you'd
have the same problem if your strings were composed
of code points.
Are there common scenarios where programmers tend to be frustrated
by lack of random access? Perhaps most often when they are working
with inputs known to be ASCII only? Or is this mostly an education
issue and these programmers are having a difficult time accepting
that they've spent most of their career thus far writing bugs? :)
A lot of it is shaped by expectations coming from other languages, whose programming models do not prioritize operating on Unicode scalar values, let alone grapheme clusters. Objective-C’s default interface to strings is random access to UTF-16 code units, which “works” right up until you encounter an emoji or other scalar outside the BMP. It also “works” for graphemes right up until you encounter emoji, or a language you didn’t test, or non-NFC-normalized content in a language you did test.
This gets compounded by the prevalence of strings in teaching, interviews, programming puzzles, etc., where a string is treated like an array with a more visual representation.
Also note that even for fully ASCII strings we cannot provide random access to grapheme clusters, as “\r\n” is a single grapheme cluster. For pretty much every Unicode-correct operation we provide fast-paths for, there are nasty corner cases that complicate the model.
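For instance, a two-line demonstration:

    let crlf = "\r\n"
    print(crlf.count)                 // 1 — CR+LF is one grapheme cluster
    print(crlf.unicodeScalars.count)  // 2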
Furthermore, few existing specifications
are phrased in terms of grapheme clusters, so something
like a validator wouldn’t want to run on
grapheme-segmented text, but at a lower abstraction
level.
Also, graphemes can be funky. A string
containing only U+0301 (COMBINING ACUTE ACCENT) has
one grapheme, but modifies the prior grapheme upon
concatenation. Such degenerate graphemes violate
algebraic reasoning in these corner cases.
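Concretely, a minimal sketch of the quirk:

    let s1 = "e"          // one grapheme
    let s2 = "\u{301}"    // U+0301 alone — also one grapheme
    print(s1.count + s2.count)  // 2
    print((s1 + s2).count)      // 1 — the accent merges into "é"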
We are not aware of generic algorithms that rely on
concatenation of collections conserving element counts, so we
decided to simply document this quirk rather than saying that
String is-not-a Collection.
SG16 has previously discussed cases like this and I'm happy to hear
you haven't had to do anything special for it. This is a good
example of why we asked about inappropriate use of the String count
property: programmers assuming s1.count + s2.count ==
(s1 + s2).count.
Unicode defines properties and most
operations on scalars or code points, and very
little on top of graphemes.
When porting code unit or code
point based code to Swift strings (e.g.,
when rewriting Objective-C code, or
rewriting Swift code to use String
instead of NSString), has profiling
revealed performance regressions due to
the switch to EGC based processing? If
so, what action was taken to correct it?
We have many fast-paths in grapheme-breaking
to identify common situations surrounding
single-scalar graphemes. If a developer wants to
work with Unicode at a lower level, String provides
a UTF8View, a UTF16View, and a UnicodeScalarView.
Those views lazily transcode/decode upon access.
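For example (assuming an NFC source literal, so that
é is the single scalar U+00E9):

    let s = "é👍"
    print(s.count)                 // 2 graphemes
    print(s.unicodeScalars.count)  // 2 scalars
    print(s.utf16.count)           // 3 code units — 👍 needs a surrogate pair
    print(s.utf8.count)            // 6 code units — 2 for é, 4 for 👍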
Cool, it sounds like the answer to any such regressions was 1)
optimization in terms of fast-paths, and 2) fall back to code
unit/point processing otherwise.
There are also performance concerns and
annoyances when working with ICU, but this is an
implementation detail. If you’re interested in using
ICU, we can discuss further what has worked best for us.
I think you're interested in (at least optionally) using ICU
unless you have evidence of major investment in another
open-source implementation of Unicode algorithms and tables.
Otherwise, C++ implementors could not afford to develop
one themselves.
Yes, definitely. For the foreseeable future, I think we need to
ensure that any interfaces we propose can be reasonably implemented
using ICU. However, Zach Laine has made impressive progress
implementing many of the Unicode algorithms without use of ICU in
his proposed Boost.Text library.
Swift strings do not enforce storage in
any particular Unicode normalization
form. Was consideration given to
forcing storage in a particular form
such as FCC or NFC?
Swift strings now sort with NFC (currently
UTF-16 code unit order, but likely to change to
Unicode scalar value order). We didn’t find FCC
significantly more compelling in practice. Since NFC
is far more frequent in the wild (why waste space if
you don’t have to), strings are likely to already be
in NFC. We have fast-paths to detect, on the fly,
normalized sections of strings (e.g. all ASCII, all <
U+0300, NFC_QC=yes, etc.). We lazily normalize
portions of a string during comparison when needed.
As far as enforcing on creation, no. We do
want to add an option to perform a linear scan to
set a performance flag, perhaps at creation, so that
comparison can take the memcmp-like fast-path.
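For illustration, two canonically equivalent spellings
compare equal without either being stored normalized:

    let nfc = "caf\u{E9}"    // "café" precomposed (NFC)
    let nfd = "cafe\u{301}"  // "café" decomposed (NFD)
    print(nfc == nfd)        // true — equality is canonical equivalence
    print(nfc.unicodeScalars.count, nfd.unicodeScalars.count)  // 4 5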
Ok, my take away from this is that fast-pathing has been sufficient
for lazy normalization (when needed) to not be (much of) a
performance concern. At least, not enough to want to take the
normalization cost on every string construction up front.
Swift strings support
comparison via normalization. Has use
of canonical string equality been a
performance issue? Or been a source of
surprise to programmers?
This was a big performance issue on Linux,
where we used to do UCA+DUCET based comparisons. We
switched to lexicographical order of NFC-normalized
UTF-16 code units (future: scalar values), and saw a
very significant speed up there. The remaining
performance work revolves around checking and
tracking whether a string is known to already be in
a normal form, so we can just memcmp.
This is very helpful, thank you. We've suspected that full
collation (with or without tailoring) would be too expensive for use
as a default comparison operator, so it is good to hear that
confirmed.
I'm curious why this was a larger performance issue for Linux than
for (presumably) macOS and/or iOS.
There were two main factors. The first is that on Darwin platforms, CFString had an implementation that we used instead of UCA+DUCET, which was faster. The second is that Darwin platforms are typically up-to-date and have very recent versions of ICU. On Linux, we still support Ubuntu 14.04 LTS, which has a version of ICU that predates Swift and didn’t have any fast-paths for ASCII or mostly-ASCII text.
Switching to our own implementation based on NFC gave us a many-X improvement over CFString, which in turn was many-X faster than UCA+DUCET (especially on older versions of ICU).
Swift strings are not locale
sensitive. Was any consideration given
to creation of a distinct locale
sensitive string type?
This is still up for debate and hasn’t been
settled yet, but we think it makes a lot of sense.
If an array of strings is sorted, we certainly don’t
want a locale-change to violate programmer
invariants. A distinct type from string could avoid
a lot of common errors here, including forgetting to
localize before presenting to a user as part of a
Swift strings provide a count
property as required to satisfy the
Collection protocol. How often do
programmers use count (the number of
EGCs in the string) inappropriately?
I’m not sure what would constitute
inappropriate usage here. We do not currently
provide access to the underlying stored code units,
though this is a frequent request and we likely will
in the future. I haven’t seen anyone baking in the
assumption that count is the same for String and
across all of String’s views (UTF-8, UTF-16, and
Unicode scalars).
One thing to consider is that as long as String is not
random-access, count will be a worst-case O(N) operation. An
inappropriate usage might involve computing the length once per
loop iteration.
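A hypothetical shape of that anti-pattern in Swift,
with the idiomatic alternative:

    let s = "some user-supplied text"
    // Quadratic: count is O(n) and recomputed on every pass.
    var i = 0
    while i < s.count {
        i += 1
    }
    // Linear: advance an index; each step is O(1).
    var idx = s.startIndex
    while idx != s.endIndex {
        idx = s.index(after: idx)
    }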
In addition to the above and prior mention of algebraic concerns,
other potential abuses we had in mind were using it to determine
field widths for display or code unit/point based storage.
Display width is a whole other concern accounting for rendering environment, font, etc. I don’t have expertise here.
C++ container requirements specify that .size() be O(1). For us to
meet container requirements would require computing and caching the
count during construction and mutation operations. We could
potentially get by just meeting range requirements though.
I mentioned degenerate graphemes
breaking algebraic properties of the Collection
protocol, but this hasn’t been a huge issue in
practice so far.
Swift strings support several memory
unsafe initializers and methods. How
frequently are these used incorrectly?
Many of these initializers come from NSString
originally, and developers migrating correct code to
Swift maintain that correctness. Rust has a similar
situation, though they do validation at
creation-time and from_utf8_unchecked() voids
memory-safety if the contents are invalid.
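By contrast, a sketch of the validating paths
(String(decoding:as:) repairs rather than traps;
String(bytes:encoding:) comes from Foundation):

    import Foundation

    let bytes: [UInt8] = [0xF0, 0x9F, 0x98, 0x80]  // valid UTF-8 for 😀
    let repaired = String(decoding: bytes, as: UTF8.self)  // invalid input becomes U+FFFD
    let checked = String(bytes: bytes, encoding: .utf8)    // Optional: nil on invalid UTF-8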
The Swift manifesto discussed
three approaches to handling substrings
and Swift 4 changed from "same type,
shared storage" to "different type,
shared storage". Any regrets?
Having two types can be a bit of a pain, but
we still think it was the right thing to do. This is
consistent with Swift treating slices as a distinct
type from the base collection.
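A small sketch of the resulting model:

    let line = "key=value"
    let key = line.prefix(while: { $0 != "=" })  // Substring — shares line's storage
    let owned = String(key)                      // copying severs the tie to line's buffer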
How often do you find programmers doing
work at the EGC level that would be
better performed at the code unit or
code point level?
Often, if a developer has strict
requirements, they know what they’re doing enough to
operate at one of those lower levels.
Not being able to random-access graphemes in
a string is a common source of frustration and
confusion amongst new users.
Likewise, how often do you find
programmers working with unicodeScalars,
utf8, or utf16 views to do work better
performed at the EGC level? For what
reasons does this occur? Perhaps to
work around differences in EGC
boundaries across Unicode versions or
the underlying version of ICU in use?
This was very prevalent in Swift’s early
days. String wasn’t a collection of graphemes by
default prior to Swift 4,
Well, it was. And then in Swift 2 or 3 it wasn't, due to the
algebraic reasoning issue. Now it is again.
so without guidance many developers
wrote code against the unicode scalars view. We also
didn’t have any fast-paths for common-case
situations back then, which further encouraged them
to use one of the other views.
This is still done sometimes for
performance-sensitive usage, or someone wanting to
handle Unicode themselves. However, as mentioned
previously, we don’t (yet) provide direct access to
the actual storage.
We haven’t seen much desire for reconciling
behavior across Unicode versions. This may be due to
Swift being primarily an applications level
programming language for devices which only have one
version of Unicode that’s relevant (the current one).
Has consideration been given to
exposing Unicode character database
properties? CharacterSet exposes some of
these properties, but have more been considered?
Yes, this was recently added to the standard library.
We surface much of the UCD via ICU.
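For example, a small sample, assuming the spellings
of the recently accepted Unicode.Scalar.Properties API
(SE-0211):

    let scalar: Unicode.Scalar = "É"
    print(scalar.properties.name ?? "?")       // LATIN CAPITAL LETTER E WITH ACUTE
    print(scalar.properties.isAlphabetic)      // true
    print(scalar.properties.lowercaseMapping)  // é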
Ah, nice. All kinds of fun to be had with that :)
How firmly is the Swift string
implementation tied to ICU? If the C++
standard library were to add suitable
Unicode support, what would motivate
reimplementing Swift strings on top of it?
Swift’s tie to ICU is less firm than it used to be.
We use ICU for the following:
1. Grapheme breaking
2. Normalization
3. Accessing UCD properties
4. Case conversion
Each of these is loosely entwined
with String; they’re cordoned off as a couple of
shims called on fallback slow-paths.
If the C++ standard library provided these
operations, sufficiently up-to-date with the current
Unicode version and with performance comparable to or
better than ICU’s, we would be willing to switch. A big
pain in interacting with ICU is its limited
support for UTF-8. Some users would like to use
a “lighter-weight” Swift and are unhappy at having
to link against ICU, as it’s fairly large and
can complicate security audits.
Got it. Increasing the size of the C++ standard library is a
definite concern for us as well. We imagine some C++ users would be
similarly unhappy if their standard library suddenly required
linking against ICU.
If you go the route of implementing Unicode operations without ICU, would it be possible to separately link against Unicode support without also pulling in all of libc++? If your implementation is lighter-weight, yet current, it would be very appealing for Swift to consider switching over.
Do Swift programmers tend to
prefer string interpolation or string formatting?
Users tend to prefer string interpolation. However,
Swift currently does not have much in the way of
formatting control in interpolations, and this is
something we’re currently working on.
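For example, a quick sketch of the two styles
(String(format:) comes from Foundation):

    import Foundation

    let name = "SG16"
    let count = 12
    let a = "Hello, \(name); \(count) questions so far."       // interpolation
    let b = String(format: "Hello, %@; %d questions so far.",
                   name, count)                                // printf-style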
What enhancements would you
most like to see in C++ to improve its Unicode support?
Swift’s string is perhaps geared as a higher-level
construct than what you may want for C++, and Swift
has Cocoa-interoperability concerns where everything
is UTF-16. Rust might provide a closer model to what
you’re looking for:
- Strings are a sequence of (valid) UTF-8
- Validation is done on creation
- Invalid contents (e.g. Windows file
paths) can be handled via something like WTF-8,
which is not intended for interchange
- String provides bidirectional iterators over:
  - Transcoded and/or normalized code units
  - Unicode scalar values (their “char” type)
  - Grapheme clusters
Michael, I think you're not answering the question asked.
They are asking what Swift would want from C++, e.g., to allow
us to decouple from ICU. Wouldn't we like to be able to do
that?
This question was intended to ask you, as expert C++ programmers
independently from Swift, what additions to C++ you think would be
most helpful to improve our (very lacking) Unicode support. So,
Michael's response is on point (thank you; we'll take a closer look
at Rust), as are any comments regarding what would benefit Swift
specifically. Michael's earlier comments regarding what Swift
currently uses ICU for are suggestive of what Swift might want from
C++. But I imagine the form in which those features are provided
would matter greatly; devils and details.