sg16: [SG16-Unicode] It’s Time to Stop Adding New Features for Non-Unicode Execution Encodings in C++

From: Henri Sivonen <hsivonen_at_[hidden]>
Date: Wed, 24 Apr 2019 20:16:57 +0300

At the turn of the year, I commented on Text_view on Slack, and Tom
Honermann asked me to write my comments up in long form to this
mailing list. I have now written my comments at
https://hsivonen.fi/non-unicode-in-cpp/ (also pasted below for on-list
quotability.) My apologies for it taking me this long to get this
written.

# It’s Time to Stop Adding New Features for Non-Unicode Execution Encodings in C++

Henri Sivonen, 2019-04-24

Disclosure: I work at Mozilla, and my professional activity includes
being the Gecko module owner for character encodings.

Disclaimer: Even though this document links to code and documents
written as part of my Mozilla activities, this document is written
in personal capacity.

## Summary

Text processing facilities in the C++ standard library have been
mostly agnostic of the actual character encoding of text. The few
operations that are sensitive to the actual character encoding are
defined to behave according to the implementation-defined “narrow
execution encoding” (for buffers of `char`) and the
implementation-defined “wide execution encoding” (for buffers of
`wchar_t`).

Meanwhile, over the last two decades, a different dominant design has
arisen for text processing in other programming languages as well as
in C and C++ usage _despite_ what the C and C++ standard-library
facilities provide: Representing text as Unicode, and _only_ Unicode,
_internally_ in the application even if some other representation is
required _externally_ for backward compatibility.

I think the C++ standard should adopt the approach of “Unicode-only
internally” for _new_ text processing facilities and should not
support non-Unicode execution encodings in newly-introduced features.
This allows new features to have less abstraction obfuscation for
Unicode usage, avoids digging legacy applications deeper into
non-Unicode commitment, and avoids the specification and
implementation effort of adapting new features to make sense for
non-Unicode execution encodings.

Concretely, I suggest:

* In _new_ features, do not support numbers other than Unicode
scalar values as a numbering scheme for abstract characters, and
design new APIs to be aware of Unicode scalar values as appropriate
instead of allowing other numbering schemes. (I.e. make Unicode the
only coded character set supported for new features.)
* Use `char32_t` directly as the concrete type for an _individual_
Unicode scalar value without allowing for parametrization of the type
that conceptually represents a Unicode scalar value. (For sequences of
Unicode scalar values, UTF-8 is preferred.)
* When introducing _new_ text processing facilities (other than the
next item on this list), support only UTF in-memory text
representations: UTF-8 and, potentially, depending on feature, also
UTF-16 or also UTF-16 and UTF-32\. That is, do not seek to make _new_
text processing features applicable to non-UTF execution encodings.
(This document should not be taken as a request to add features for
UTF-16 or UTF-32 beyond iteration over string views by scalar value.
To avoid distraction from the main point, this document should also
not be taken as advocating against providing any particular feature
for UTF-16 or UTF-32.)
* Non-UTF character encodings may be supported in a conversion API
whose purpose is to convert from a legacy encoding into a UTF-only
representation near the IO boundary or at the boundary between a
legacy part (that relies on execution encoding) and a new part (that
uses Unicode) of an application. Such APIs should be `std::span`-based
instead of iterator-based.
* When an operation logically requires a valid sequence of Unicode
scalar values, the API must either define the operation to fail upon
encountering invalid UTF-8/16/32 or must replace each error with a
U+FFFD REPLACEMENT CHARACTER as follows: What constitutes a single
error in UTF-8 is [defined in the WHATWG Encoding
Standard](https://encoding.spec.whatwg.org/#utf-8-decoder) (which
matches the “best practice” from the Unicode Standard). In UTF-16,
each unpaired surrogate is an error. In UTF-32, each code unit whose
numeric value isn’t a valid Unicode scalar value is an error.
* Instead of standardizing Text_view as proposed, standardize a way
to obtain a Unicode scalar value iterator from `std::u8string_view`,
`std::u16string_view`, and `std::u32string_view`.
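The UTF-16 and UTF-32 error rules from the list above are simple enough to sketch in a few lines. The following is an illustration only, not a proposed interface: the function name `scalar_values` and the constant `kReplacement` are made up here, and the sketch eagerly returns a `std::u32string` where a real facility would presumably be a lazy iterator.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

constexpr char32_t kReplacement = 0xFFFD;  // U+FFFD REPLACEMENT CHARACTER

// Hypothetical helper: decode UTF-16 into Unicode scalar values, replacing
// each unpaired surrogate with U+FFFD.
inline std::u32string scalar_values(std::u16string_view in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char32_t unit = in[i];
        if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // High surrogate followed by low surrogate: combine the pair.
            out.push_back(static_cast<char32_t>(
                0x10000 + ((unit - 0xD800) << 10) + (in[i + 1] - 0xDC00)));
            ++i;
        } else if (unit >= 0xD800 && unit <= 0xDFFF) {
            out.push_back(kReplacement);  // unpaired surrogate
        } else {
            out.push_back(unit);
        }
    }
    return out;
}

// Hypothetical helper: pass UTF-32 through, replacing each code unit that is
// not a Unicode scalar value with U+FFFD.
inline std::u32string scalar_values(std::u32string_view in) {
    std::u32string out;
    for (char32_t unit : in) {
        const bool valid =
            unit <= 0x10FFFF && !(unit >= 0xD800 && unit <= 0xDFFF);
        out.push_back(valid ? unit : kReplacement);
    }
    return out;
}

// Usage: each error becomes exactly one U+FFFD.
inline void check_utf16_utf32_examples() {
    const char16_t lone[] = {u'a', 0xD800, u'b'};  // unpaired high surrogate
    assert(scalar_values(std::u16string_view(lone, 3)) == U"a\uFFFDb");
    const char32_t bad[] = {U'a', 0x110000, U'b'};  // beyond U+10FFFF
    assert(scalar_values(std::u32string_view(bad, 3)) == U"a\uFFFDb");
}
```

Both overloads yield `char32_t`, so code written against the result does not need to know which UTF the input used.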

## Context

This write-up is in response to (and in disagreement with) the
[“Character Types”
section](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0244r2.html#char_types)
in the P0244R2 Text_view paper:

> This library defines a character class template parameterized by character set type used to represent character values. The purpose of this class template is to make explicit the association of a code point value and a character set.
>
> It has been suggested that `char32_t` be supported as a character type that is implicitly associated with the Unicode character set and that values of this type always be interpreted as Unicode code point values. This suggestion is intended to enable UTF-32 string literals to be directly usable as sequences of character values (in addition to being sequences of code unit and code point values). This has a cost in that it prohibits use of the `char32_t` type as a code unit or code point type for other encodings. Non-Unicode encodings, including the encodings used for ordinary and wide string literals, would still require a distinct character type (such as a specialization of the character class template) so that the correct character set can be inferred from objects of the character type.
>
> This suggestion raises concerns for the author. To a certain degree, it can be accommodated by removing the current members of the character class template in favor of free functions and type trait templates. However, it results in ambiguities when enumerating the elements of a UTF-32 string literal; are the elements code point or character values? Well, the answer would be both (and code unit values as well). This raises the potential for inadvertently writing (generic) code that confuses code points and characters, runs as expected for UTF-32 encodings, but fails to compile for other encodings. The author would prefer to enforce correct code via the type system and is unaware of any particular benefits that the ability to treat UTF-32 string literals as sequences of character type would bring.
>
> It has also been suggested that `char32_t` might suffice as the only character type; that decoding of any encoded string include implicit transcoding to Unicode code points. The author believes that this suggestion is not feasible for several reasons:
>
> 1. Some encodings use character sets that define characters such that round trip transcoding to Unicode and back fails to preserve the original code point value. For example, Shift-JIS (Microsoft code page 932) defines duplicate code points for the same character for compatibility with IBM and NEC character set extensions.
> [https://support.microsoft.com/en-us/kb/170559](https://support.microsoft.com/en-us/kb/170559) [sic; dead link]
> 2. Transcoding to Unicode for all non-Unicode encodings would carry non-negligible performance costs and would pessimize platforms such as IBM’s z/OS that use EBCIDC by default for the non-Unicode execution character sets.

To summarize, it raises three concerns:

1. Ambiguity between code units and scalar values (the paper says
“code points”, but I say “scalar values” to emphasize the exclusion of
surrogates) in the UTF-32 case.
2. Some encodings, particularly Microsoft code page 932, can
represent one Unicode scalar value in more than one way, so the
distinction of which way does not round-trip.
3. Transcoding non-Unicode execution encodings to Unicode has a
performance cost that particularly pessimizes IBM z/OS.

## Terminology and Background

(This section and the next section should not be taken as ’splaining
to SG16 what they already know. The over-explaining is meant to make
this document more coherent for a broader audience of readers who
might be interested in C++ standardization without full familiarity
with text processing terminology or background, or the details of
Microsoft code page 932.)

An _abstract character_ is an atomic unit of text. Depending on
writing system, the analysis of what constitutes an atomic unit may
differ, but a given implementation on a computer has to identify some
things as atomic units. Unicode’s opinion of what is an abstract
character is the most widely applied opinion. In fact, Unicode itself
has multiple opinions on this, and [Unicode Normalization
Forms](https://unicode.org/reports/tr15/) bridge these multiple
opinions.

A _character set_ is a set of abstract characters. In principle, a set
of characters can be defined without assigning numbers to them.

A _coded character set_ assigns numbers, _code points_, to the items
in the character set, i.e. to each abstract character.

When the Unicode code space was extended beyond the Basic Multilingual
Plane, some code points were set aside for the UTF-16 surrogate
mechanism and, therefore, do not represent abstract characters. A
Unicode _scalar value_ is a Unicode code point that is not a surrogate
code point. For consistency with Unicode, I use the term scalar value
below when referring to non-Unicode coded character sets, too.
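The definition of a Unicode scalar value translates directly into a range check. This is a minimal sketch; the function name is made up for illustration.

```cpp
#include <cassert>

// True if `value` is a Unicode scalar value: a code point in the range
// U+0000..U+10FFFF that is not a surrogate code point (U+D800..U+DFFF).
// The function name is hypothetical.
constexpr bool is_unicode_scalar_value(char32_t value) {
    return value <= 0x10FFFF && !(value >= 0xD800 && value <= 0xDFFF);
}

// Usage: a few spot checks.
inline void check_scalar_value_examples() {
    assert(is_unicode_scalar_value(U'\u732A'));  // ordinary BMP character
    assert(is_unicode_scalar_value(0x10FFFF));   // last code point
    assert(!is_unicode_scalar_value(0xD800));    // first surrogate
    assert(!is_unicode_scalar_value(0x110000));  // beyond the code space
}
```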

A _character encoding_ is a way to represent a conceptual sequence of
scalar values from one _or more_ coded character sets as a concrete
sequence of bytes. The bytes are called _code units_. Unicode defines
in-memory _Unicode encoding forms_ whose code unit is not a byte:
UTF-16 and UTF-32\. (For these Unicode encoding forms, there are
corresponding _Unicode encoding schemes_ that use byte code units and
represent a non-byte code unit from a corresponding encoding form as
multiple bytes and, therefore, could be used in byte-oriented IO even
though UTF-8 is preferred for interchange. UTF-8, of course, uses byte
code units as both a Unicode encoding form and as a Unicode encoding
scheme.)

Coded character sets that assign scalar values in the range 0...255
(decimal) can be considered to trivially imply a character encoding
for themselves: You just store the scalar value as an unsigned byte
value. (Often such coded character sets import US-ASCII as the lower
half.)

However, it is possible to define less obvious encodings even for
character sets that only have up to 256 characters. IBM has
[several](https://en.wikipedia.org/wiki/EBCDIC_code_pages#Code_pages_with_Latin-1_character_sets)
EBCDIC character encodings for the set of characters defined in
ISO-8859-1\. That is, compared to the trivial ISO-8859-1 encoding (the
original, not the Web alias for windows-1252), these EBCDIC encodings
permute the byte value assignments.

Unicode is the universal coded character set that _by design_ includes
abstract characters from all notable legacy coded character sets such
that character encodings for legacy coded character sets can be
redefined to represent Unicode scalar values. Consider representing ż
in the ISO-8859-2 encoding. When we treat the ISO-8859-2 encoding as
an encoding for the Unicode coded character set (as opposed to treating
it as an encoding for the ISO-8859-2 coded character set), byte 0xBF
decodes to Unicode scalar value U+017C (and not as scalar value 0xBF).

A _compatibility character_ is a character that according to Unicode
principles should not be a distinct abstract character but that
Unicode nonetheless codes as a distinct abstract character because
some legacy coded character set treated it as distinct.

## The Microsoft Code Page 932 Issue

Usually in C++ a “character type” refers to a code unit type, but the
Text_view paper uses the term “character type” to refer to a Unicode
scalar value when the encoding is a Unicode encoding form. The paper
implies that an analogous non-Unicode type exists for Microsoft code
page 932 (Microsoft’s version of Shift_JIS), but does one really
exist?

Microsoft code page 932 takes the 8-bit encoding of JIS X 0201 coded
character set, whose upper half is half-width katakana and lower half
is ASCII-based, and replaces the lower half with actual US-ASCII
(moving the difference between US-ASCII and the lower half of
8-bit-encoded JIS X 0201 into a font problem!). It then takes the JIS
X 0208 coded character set and represents it with two-byte sequences
(for the lead byte making use of the unassigned range of JIS X 0201).
JIS X 0208 code points aren’t really one-dimensional scalars, but
instead two-dimensional row and column numbers in a 94 by 94 grid.
(See the first 94 rows of [the
visualization](https://encoding.spec.whatwg.org/jis0208.html) supplied
with the Encoding Standard; avoid opening the link on a RAM-limited
device!) Shift_JIS / Microsoft code page 932 does not put these two
numbers into bytes directly, but conceptually arranges each two rows
of 94 columns into one row of 188 columns and then transforms these
new row and column numbers into bytes with some offsetting.
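The rearrangement and offsetting can be sketched as follows. The helper name is made up, and the offset constants are written from the Shift_JIS encoder description in the WHATWG Encoding Standard as I understand it, so treat them as an assumption to check against that specification. Row 35, column 86 is used as a spot check because this arithmetic maps it to the byte pair 0x92 0x96.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical helper: map a one-based JIS X 0208 (row, column) cell to the
// two Shift_JIS bytes (lead, trail).
inline std::pair<std::uint8_t, std::uint8_t> shift_jis_bytes(int row,
                                                             int column) {
    assert(row >= 1 && row <= 94 && column >= 1 && column <= 94);
    // Index of the cell when the 94 by 94 grid is read row by row.
    const int pointer = (row - 1) * 94 + (column - 1);
    // Two 94-column rows become one 188-column row.
    const int wide_row = pointer / 188;
    const int wide_column = pointer % 188;
    // Lead bytes run 0x81..0x9F, then continue at 0xE0, skipping the range
    // that holds the single-byte half-width katakana.
    const int lead = wide_row + (wide_row < 0x1F ? 0x81 : 0xC1);
    // Trail bytes start at 0x40 and skip 0x7F.
    const int trail = wide_column + (wide_column < 0x3F ? 0x40 : 0x41);
    return {static_cast<std::uint8_t>(lead), static_cast<std::uint8_t>(trail)};
}

// Usage: the top left cell and one kanji cell.
inline void check_shift_jis_examples() {
    const auto first_cell = shift_jis_bytes(1, 1);
    assert(first_cell.first == 0x81 && first_cell.second == 0x40);
    const auto kanji_cell = shift_jis_bytes(35, 86);
    assert(kanji_cell.first == 0x92 && kanji_cell.second == 0x96);
}
```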

While the JIS X 0208 grid is rearranged into 47 rows of a 188-column
grid, the full 188-column grid has 60 rows. The last 13 rows are used
for IBM extensions and for private use. The private use area maps to
the (start of the) Unicode Private Use Area. (See a [visualization of
the rearranged grid](https://encoding.spec.whatwg.org/shift_jis.html)
with the private use part showing up as unassigned; again avoid
opening the link on a RAM-limited device.)

The extension part is where the concern that the Text_view paper seeks
to address comes in. NEC and IBM came up with some characters that
they felt JIS X 0208 needed to be extended with. NEC’s own extensions
go onto row 13 (in one-based numbering) of the 94 by 94 JIS X 0208
grid (unallocated in JIS X 0208 proper), so that extension can safely
be treated as if it had always been part of JIS X 0208 itself. The IBM
extension, however, goes onto the last 3 rows of the 60-row Shift_JIS
grid, i.e. outside the space that the JIS X 0208 94 by 94 grid maps
to. However, US-ASCII, the half-width katakana part of JIS X 0201, and
JIS X 0208 are also encoded, in a different way, by EUC-JP. EUC-JP can
only encode the 94 by 94 grid of JIS X 0208\. To make the IBM
extensions fit into the 94 by 94 grid, NEC relocated the IBM
extensions within the 94 by 94 grid in space that the JIS X 0208
standard left unallocated.

When considering IBM Shift_JIS and NEC EUC-JP (without later JIS X
0213 extension), both encode the same set of characters, but in a
different way. Furthermore, both can round-trip via Unicode. Unicode
principles analyze some of the IBM extension kanji as duplicates of
kanji that were already in the original JIS X 0208\. However, to
enable round-tripping (which was thought worthwhile to achieve at the
time), Unicode treats the IBM duplicates as compatibility characters.
(Round-tripping is lost, of course, if the text decoded into Unicode
is normalized such that compatibility characters are replaced with
their canonical equivalents before re-encoding.)

This brings us to the issue that the Text_view paper treats as
significant: Since Shift_JIS can represent the whole 94 by 94 JIS X
0208 grid and NEC put the IBM extension there, a naïve conversion from
EUC-JP to Shift_JIS can fail to relocate the IBM extension characters
to the end of the Shift_JIS code space and can put them in the
position where they land if the 94 by 94 grid is simply transformed as
the first 47 rows of the 188-column-wide Shift_JIS grid. When
_decoding_ to Unicode, Microsoft code page 932 supports both locations
for the IBM extensions, but when _encoding_ from Unicode, it has to
pick one way of doing things, and it picks the end of the Shift_JIS
code space.

That is, Unicode does not assign another set of compatibility
characters to Microsoft code page 932’s duplication of the IBM
extensions, so despite NEC EUC-JP and IBM Shift_JIS being
round-trippable via Unicode, Microsoft code page 932, i.e. Microsoft
Shift_JIS, is not. This makes sense considering that there is no
analysis that claims the IBM and NEC instances of the IBM extensions
as semantically different: They clearly have provenance that indicates
that the duplication isn’t an attempt to make a distinction in
meaning. The Text_view paper takes the position that C++ should
round-trip the NEC instance of the IBM extensions in Microsoft code
page 932 as distinct from the IBM instance of the IBM extensions even
though Microsoft’s own implementation does not. In fact, the whole
point of the Text_view paper mentioning Microsoft code page 932 is to
give an example of a legacy encoding that doesn’t round-trip via
Unicode, despite Unicode generally having been designed to round-trip
legacy encodings, and to opine that it ought to round-trip in C++.

So:

* The Text_view paper wants there to exist a non-transcoding-based,
non-Unicode analog for what for UTF-8 would be a Unicode scalar value
but for Microsoft code page 932 instead.
* The standards that Microsoft code page 932 has been built on do
not give us such a scalar.
* Even if the private use space and the extensions are
considered to occupy a consistent grid with the JIS X 0208 characters,
the US-ASCII plus JIS X 0201 part is not placed on the same grid.
* The canonical way of referring to JIS X 0208 independently of
bytes isn’t a reference by one-dimensional scalar but a reference by
two (one-based) numbers identifying a cell on the 94 by 94 grid.
* The Text_view paper wants the scalar to be defined such that a
distinction between the IBM instance of the IBM extensions and the NEC
instance of the IBM extensions is maintained even though Microsoft,
the originator of the code page, does not treat these two instances as
meaningfully distinct.

## Inferring a Coded Character Set from an Encoding

(This section is based on the constraints imposed by the Text_view paper
instead of being based on what the [reference
implementation](https://github.com/tahonermann/text_view) does for
Microsoft code page 932\. From code inspection, it appears that
support for multi-byte narrow execution encodings is unimplemented,
and when trying to verify this experimentally, I timed out trying to
get it running due to an internal compiler error when trying to build
with a newer GCC and a GCC compilation error when trying to build the
known-good GCC revision.)

While the standards don’t provide a scalar value definition for
Microsoft code page 932, it’s easy to make one up based on tradition:
Traditionally, the two-byte characters in CJK legacy encodings have
been referred to by interpreting the two bytes as a 16-bit big-endian
unsigned number presented as hexadecimal (and single-byte characters
as an 8-bit unsigned number).

As an example, let’s consider 猪 (which Wiktionary translates as wild
boar). Its canonical Unicode scalar value is U+732A. That’s what the
JIS X 0208 instance decodes to when decoding Microsoft code page 932
into Unicode. The compatibility character for the IBM kanji purpose is
U+FA16\. That’s what both the IBM instance of the IBM extension and
the NEC instance of the IBM extension decode to when decoding
Microsoft code page 932 into Unicode. (For reasons unknown to me,
Unicode couples U+FA16 with the IBM kanji compatibility purpose and
assigns _another_ compatibility character, U+FAA0, for compatibility
with the North Korean KPS 10721-2000 standard, which is irrelevant to
Microsoft code page 932\. Note that not all IBM kanji have
corresponding DPRK compatibility characters, so we couldn’t repurpose
the DPRK compatibility characters for distinguishing the IBM and NEC
instances of the IBM extensions even if we wanted to.)

When interpreting the Microsoft code page 932 bytes as a big-endian
integer, the JIS X 0208 instance of 猪 would be 0x9296, the IBM
instance would be 0xFB5E, and the NEC instance would be 0xEE42\. To
highlight how these “scalars” are coupled with the encoding instead of
the standard character sets that the encodings originally encode, in
EUC-JP the JIS X 0208 instance would be 0xC3F6 and the NEC instance
would be 0xFBA3\. Also, for illustration, if the same rule was applied
to UTF-8, the scalar would be 0xE78CAA instead of U+732A. Clearly, we
don’t want the scalars to be different between UTF-8, UTF-16, and
UTF-32, so it is at least theoretically unsatisfactory for Microsoft
code page 932 and EUC-JP to get different scalars for what are clearly
the same characters in the underlying character sets.
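The numbers in the preceding paragraph can be reproduced mechanically. The sketch below is an illustration under stated assumptions: the function names are made up, the Shift_JIS offsets follow my reading of the WHATWG Encoding Standard's Shift_JIS decoder, and EUC-JP is taken to store each zero-based grid coordinate plus 0xA1. It only handles byte pairs that fall inside the 94 by 94 grid, which is why the IBM instance 0xFB5E has no EUC-JP counterpart here.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Hypothetical helper: interpret a byte sequence as one big-endian number.
inline std::uint32_t bytes_as_big_endian(
    std::initializer_list<std::uint8_t> bytes) {
    std::uint32_t number = 0;
    for (std::uint8_t byte : bytes) number = (number << 8) | byte;
    return number;
}

// Hypothetical helper: given a two-byte Microsoft code page 932 sequence
// read as a big-endian number, return the big-endian number that the same
// grid cell gets in EUC-JP.
inline std::uint16_t cp932_number_to_euc_jp_number(std::uint16_t cp932) {
    const int lead = cp932 >> 8;
    const int trail = cp932 & 0xFF;
    // Only lead bytes that map into the 94 by 94 grid are accepted.
    assert((lead >= 0x81 && lead <= 0x9F) || (lead >= 0xE0 && lead <= 0xEF));
    assert(trail >= 0x40 && trail <= 0xFC && trail != 0x7F);
    const int wide_row = lead - (lead < 0xA0 ? 0x81 : 0xC1);
    const int wide_column = trail - (trail < 0x7F ? 0x40 : 0x41);
    const int pointer = wide_row * 188 + wide_column;
    const int row = pointer / 94;     // zero-based row in the 94 by 94 grid
    const int column = pointer % 94;  // zero-based column
    return static_cast<std::uint16_t>(((0xA1 + row) << 8) | (0xA1 + column));
}

// Usage: the values quoted in the text.
inline void check_pseudo_scalar_examples() {
    assert(cp932_number_to_euc_jp_number(0x9296) == 0xC3F6);  // JIS X 0208
    assert(cp932_number_to_euc_jp_number(0xEE42) == 0xFBA3);  // NEC instance
    // UTF-8 bytes of U+732A read the same way.
    assert(bytes_as_big_endian({0xE7, 0x8C, 0xAA}) == 0xE78CAA);
}
```

The same cell ends up with a different number per encoding, which is the coupling the paragraph objects to.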

It would be possible to do something else that’d give the same scalar
values for Shift_JIS and EUC-JP without a lookup table. We could
number the characters on the two-dimensional grid starting with 256
for the top left cell to reserve the scalars 0…255 for the JIS X 0201
part. It’s worth noting, though, that this approach wouldn’t work well
for Korean and Simplified Chinese encodings that take inspiration from
the 94 by 94 structure of JIS X 0208\. KS X 1001 and GB2312 also
define a 94 by 94 grid like JIS X 0208\. However, while Microsoft code
page 932 extends the grid down, so a consecutive numbering would just
add greater numbers to the end, Microsoft code pages 949 and 936
extend the KS X 1001 and GB2312 grids above and to the left, which
means that a consecutive numbering of the extended grid would be
totally different from the consecutive numbering of the unextended
grid. On the other hand, interpreting each byte pair as a big-endian
16-bit integer would yield the same values in the extended and
unextended Korean and Simplified Chinese cases. (See visualizations
for [949](https://encoding.spec.whatwg.org/euc-kr.html) and
[936](https://encoding.spec.whatwg.org/gb18030.html); again avoid
opening on a RAM-limited device. Search for “U+3000” to locate the top
left corner of the original 94 by 94 grid.)
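A small arithmetic sketch shows why consecutive numbering is unstable when a grid is extended above and to the left. It assumes, from my reading of the WHATWG Encoding Standard's EUC-KR decoder, that the extended grid has 190 columns, with rows starting at lead byte 0x81 and columns at trail byte 0x41; the function names are made up.

```cpp
#include <cassert>

// Consecutive number of a cell in the unextended 94 by 94 grid, given the
// two EUC-KR bytes (each in 0xA1..0xFE). The top left cell gets number 0.
inline int unextended_grid_number(int lead, int trail) {
    assert(lead >= 0xA1 && lead <= 0xFE && trail >= 0xA1 && trail <= 0xFE);
    return (lead - 0xA1) * 94 + (trail - 0xA1);
}

// Consecutive number of a byte pair in the extended grid (assumed layout:
// 190 columns, first row at lead 0x81, first column at trail 0x41).
inline int extended_grid_number(int lead, int trail) {
    assert(lead >= 0x81 && lead <= 0xFE && trail >= 0x41 && trail <= 0xFE);
    return (lead - 0x81) * 190 + (trail - 0x41);
}

// The byte pair read as a big-endian number does not depend on the grid.
inline int big_endian_number(int lead, int trail) {
    return (lead << 8) | trail;
}

// Usage: the top left corner of the original grid, bytes 0xA1 0xA1.
inline void check_grid_numbering_examples() {
    assert(unextended_grid_number(0xA1, 0xA1) == 0);
    assert(extended_grid_number(0xA1, 0xA1) == 6176);  // 32 * 190 + 96
    assert(big_endian_number(0xA1, 0xA1) == 0xA1A1);   // same in both cases
}
```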

## What About EBCDIC?

Text_view wants to avoid transcoding overhead on z/OS, but z/OS has
multiple character encodings for the ISO-8859-1 character set. It
seems conceptually bogus for all these to have different scalar values
for the same character set. However, for all of them to have the same
scalar values, a lookup table-based permutation would be needed. If
that table permuted to the ISO-8859-1 order, it would be the same as
the Unicode order, at which point the scalar values might as well be
Unicode scalar values, which Text_view wanted to avoid on z/OS citing
performance concerns. (Of course, z/OS also has EBCDIC encodings whose
character set is not ISO-8859-1.)
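The lookup-table permutation can be sketched with a deliberately partial table. Only the space, the digits, and the uppercase Latin letters are filled in, since their byte positions are the ones I am sure are shared across the EBCDIC Latin-1 code pages; every other entry is left as U+FFFD as a placeholder, which is an artifact of this sketch and not a statement about any real code page. The names are made up.

```cpp
#include <array>
#include <cassert>

// Hypothetical helper: build a partial EBCDIC-to-Unicode decoding table.
// Entries not filled in below stay U+FFFD (placeholder for this sketch).
inline std::array<char32_t, 256> make_partial_ebcdic_table() {
    std::array<char32_t, 256> table;
    table.fill(char32_t{0xFFFD});
    table[0x40] = U' ';
    for (int i = 0; i < 10; ++i)  // digits 0..9 at 0xF0..0xF9
        table[0xF0 + i] = static_cast<char32_t>(U'0' + i);
    for (int i = 0; i < 9; ++i)   // A..I at 0xC1..0xC9
        table[0xC1 + i] = static_cast<char32_t>(U'A' + i);
    for (int i = 0; i < 9; ++i)   // J..R at 0xD1..0xD9
        table[0xD1 + i] = static_cast<char32_t>(U'J' + i);
    for (int i = 0; i < 8; ++i)   // S..Z at 0xE2..0xE9
        table[0xE2 + i] = static_cast<char32_t>(U'S' + i);
    return table;
}

// Usage: decoding is a single table lookup per byte, and the result is
// already a Unicode scalar value.
inline void check_ebcdic_examples() {
    const auto table = make_partial_ebcdic_table();
    assert(table[0xC1] == U'A');
    assert(table[0xE9] == U'Z');
    assert(table[0xF0] == U'0');
    assert(table[0x00] == 0xFFFD);  // not filled in by this sketch
}
```

Because the ISO-8859-1 order coincides with the first 256 Unicode scalar values, a table that permutes to ISO-8859-1 order is a table that decodes to Unicode.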

## What About GB18030?

The whole point of GB18030 is that it encodes Unicode scalar values in
a way that makes the encoding byte-compatible with GBK (Microsoft code
page 936) and GB2312\. This operation is inherently lookup
table-dependent. Inventing a scalar definition for GB18030 that
achieved the Text_view goal of avoiding lookup tables would break the
design goal of GB18030 that it encodes all Unicode scalar values. (In
the Web Platform, due to legacy reasons, [all but one scalar value and
representing one scalar value
twice](https://encoding.spec.whatwg.org/#ref-for-index-gb18030%E2%91%A0).)

## What’s Wrong with This?

Let’s evaluate the above in the light of P1238R0, the [_SG16: Unicode
Direction_](http://open-std.org/JTC1/SC22/WG21/docs/papers/2018/p1238r0.html)
paper.

The reason why Text_view tries to fit Unicode-motivated operations
onto legacy encodings is that, as noted by “1.1 Constraint: The
ordinary and wide execution encodings are implementation defined”,
non-UTF execution encodings _exist_. This is, obviously, true.
However, I disagree with the conclusion of making new features apply
to these pre-existing execution encodings. I think there is _no
obligation_ to adapt _new features_ to make sense for non-UTF
execution encodings. It should be sufficient to keep existing legacy
code running, i.e. not removing existing features should be
sufficient. On the topic of `wchar_t`, the Unicode Direction paper
says “1.4\. Constraint: wchar_t is a portability deadend”. I think
`char` with non-UTF-8 execution encoding should also be declared as a
deadend whereas the Unicode Direction paper merely notes “1.3\.
Constraint: There is no portable primary execution encoding”. Making
new features work with a deadend foundation lures applications deeper
into deadends, which is bad.

While inferring scalar values for an encoding by interpreting the
encoded bytes for each character as a big-endian integer (thereby
effectively inferring a, potentially non-standard, coded character set
from an encoding) might be argued to be traditional enough to fit
“2.1\. Guideline: Avoid excessive inventiveness; look for existing
practice”, it is a bad fit for “1.6\. Constraint: Implementors cannot
afford to rewrite ICU”. If there is concern that implementors do not
have the bandwidth to implement text processing features from
scratch and, therefore, should be prepared to delegate to ICU, it
makes no sense to make implementations or the C++ standard come up with
non-Unicode numberings for abstract characters, since such numberings
aren’t supported by ICU and necessarily would require writing new code
for anachronistic non-Unicode schemes.

Aside: Maybe analyzing the approach of using byte sequences
interpreted as big-endian numbers looks like attacking a straw man and
there could be some other non-Unicode numbering instead, such as the
consecutive numbering outlined above. Any alternative non-Unicode
numbering would still fail “1.6\. Constraint: Implementors cannot
afford to rewrite ICU” and would _also_ fail “2.1\. Guideline: Avoid
excessive inventiveness; look for existing practice”.

Furthermore, I think the Text_view paper’s aspiration of
distinguishing between the IBM and NEC instances of the IBM extensions
in Microsoft code page 932 fails “2.1\. Guideline: Avoid excessive
inventiveness; look for existing practice”, because it effectively
amounts to inventing additional compatibility characters that aren’t
recognized as distinct by Unicode or the originator of the code page
(Microsoft).

Moreover, iterating over a buffer of text by scalar value is a
relatively simple operation when considering the range of operations
that make sense to offer for Unicode text but that may not obviously
fit non-UTF execution encodings. For example, in the light of “4.2\.
Directive: Standardize generic interfaces for Unicode algorithms” it
would be reasonable and expected to provide operations for performing
Unicode Normalization on strings. What does it mean to normalize a
string to Unicode Normalization Form D under the ISO-8859-1 execution
encoding? What does it mean to apply _any_ Unicode Normalization Form
under the windows-1258 execution encoding, which represents Vietnamese
in a way that doesn’t match any Unicode Normalization Form? If the
answer just is to make these no-ops for non-UTF encodings, would that
be the right answer for GB18030? Coming up with answers other than
just saying that new text processing operations shouldn’t try to fit
non-UTF encodings at all would very quickly violate the guideline to
“Avoid excessive inventiveness”.

Looking at other programming languages in the light of “2.1\.
Guideline: Avoid excessive inventiveness; look for existing practice”
provides the way forward. Notable other languages have settled on not
supporting coded character sets other than Unicode. That is, only the
Unicode way of assigning scalar values to abstract characters is
supported. Interoperability with legacy character _encodings_ is
achieved by decoding into Unicode upon input and, if non-UTF-8 output
is truly required for interoperability, by encoding into legacy
encoding upon output. The Unicode Direction paper already acknowledges
this dominant design in “4.4\. Directive: Improve support for
transcoding at program boundaries”. I think C++ should consider the
boundary between non-UTF-8 `char` and non-UTF-16/32 `wchar_t` on one
hand and Unicode (preferably represented as UTF-8) on the other hand
as a similar transcoding boundary between legacy code and new code
such that new text processing features (other than the encoding
conversion feature itself!) are provided on the
`char8_t`/`char16_t`/`char32_t` side but not on the non-UTF execution
encoding side. That is, while the Text_view paper says “Transcoding to
Unicode for all non-Unicode encodings would carry non-negligible
performance costs and would pessimize platforms such as IBM’s z/OS
that use EBCIDC [sic] by default for the non-Unicode execution
character sets.”, I think it’s more appropriate to impose such a cost
at the boundary of legacy and future parts of z/OS programs than to
contaminate all new text processing APIs with the question “What does
this operation even mean for non-UTF encodings generally and EBCDIC
encodings specifically?”. (In the case of Windows, the system already
works in UTF-16 internally, so all narrow execution encodings already
involve transcoding at the system interface boundary. In that context,
it seems inappropriate to pretend that the legacy narrow execution
encodings on Windows were somehow free of transcoding cost to begin
with.)

To avoid a distraction from my main point, I’m explicitly not opining
_in this document_ on whether new text processing features should be
available for sequences of `char` when the narrow execution encoding
is UTF-8, for sequences of `wchar_t` when `sizeof(wchar_t)` is 2 and
the wide execution encoding is UTF-16, or for sequences of `wchar_t`
when `sizeof(wchar_t)` is 4 and the wide execution encoding is UTF-32.

## The Type for a Unicode Scalar Value Should Be `char32_t`

The conclusion of the previous section is that new C++ facilities
should not support number assignments to abstract characters other
than Unicode, i.e. should not support coded character sets (either
standardized or inferred from an encoding) other than Unicode. The
conclusion makes it unnecessary to abstract type-wise over Unicode
scalar values and some other kinds of scalar values. It just leaves
the question of what the concrete type for a Unicode scalar value
should be.

The Text_view paper says:

> It has been suggested that `char32_t` be supported as a character type that is implicitly associated with the Unicode character set and that values of this type always be interpreted as Unicode code point values. This suggestion is intended to enable UTF-32 string literals to be directly usable as sequences of character values (in addition to being sequences of code unit and code point values). This has a cost in that it prohibits use of the `char32_t` type as a code unit or code point type for other encodings.

I disagree with this and am firmly in the camp that `char32_t` should
be the type for a Unicode scalar value.

The sentence “This has a cost in that it prohibits use of the
`char32_t` type as a code unit or code point type for other
encodings.” is particularly alarming. Seeking to use `char32_t` as a
code unit type for encodings other than UTF-32 would dilute the
meaning of `char32_t` into another `wchar_t` mess. (I’m happy to see
that P1041R4 “Make char16_t/char32_t string literals be UTF-16/32” was
voted into C++20.)

As for the appropriateness of using the same type both for a UTF-32
code unit and a Unicode scalar value, the _whole point_ of UTF-32 is
that its code unit value is directly the Unicode scalar value. That is
what UTF-32 is all about, and UTF-32 has nothing else to offer: The
value space that UTF-32 can represent is more compactly represented by
UTF-8 and UTF-16, both of which are more commonly needed for
interoperation with existing interfaces. When having the code units be
directly the scalar values is UTF-32’s whole point, it would be
unhelpful to distinguish type-wise between UTF-32 code units and
Unicode scalar values. (Also, considering that buffers of UTF-32 are
rarely useful but iterators yielding Unicode scalar values make sense,
it would be sad to make the iterators have a complicated type.)

To provide interfaces that are generic across `std::u8string_view`,
`std::u16string_view`, and `std::u32string_view` (and, thereby,
strings for which these views can be taken), all of these should have
a way to obtain a scalar value iterator that yields `char32_t` values.
To make sure such iterators really yield only Unicode scalar values in
an interoperable way, the iterator should yield U+FFFD upon error.
What constitutes a single error in UTF-8 is defined in the WHATWG
Encoding Standard (which matches the “best practice” from the Unicode
Standard). In UTF-16, each unpaired surrogate is an error. In UTF-32,
each code unit whose numeric value isn’t a valid Unicode scalar value
is an error. (The last sentence might be taken as admission that
UTF-32 code units and scalar values are not the same after all. It is
not. It is merely an acknowledgement that C++ does not statically
prevent programs that could erroneously put an invalid value into a
buffer that is supposed to be UTF-32.)
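
As a rough illustration of the UTF-16 and UTF-32 rules above, here is a minimal sketch of decoding with U+FFFD replacement. All names in it (`next_scalar_utf16`, `scalar_from_utf32_unit`, `to_scalar_values`, `kReplacementCharacter`) are hypothetical and chosen for this example only; they are not part of the standard library or of any proposal. UTF-8 is left out because its error boundaries are the more involved ones defined by the WHATWG Encoding Standard.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper names; none of this is a standard or proposed API.
constexpr char32_t kReplacementCharacter = 0xFFFD;

constexpr bool is_high_surrogate(char32_t u) { return u >= 0xD800 && u <= 0xDBFF; }
constexpr bool is_low_surrogate(char32_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Decodes one Unicode scalar value from the front of `rest` and advances
// `rest` past the code units that were consumed.
// Precondition: !rest.empty().
char32_t next_scalar_utf16(std::u16string_view& rest) {
    char32_t first = rest[0];
    rest.remove_prefix(1);
    if (is_high_surrogate(first) && !rest.empty() && is_low_surrogate(rest[0])) {
        char32_t second = rest[0];
        rest.remove_prefix(1);
        // A well-formed surrogate pair maps to a supplementary-plane value.
        return 0x10000 + ((first - 0xD800) << 10) + (second - 0xDC00);
    }
    if (is_high_surrogate(first) || is_low_surrogate(first)) {
        // Each unpaired surrogate is one error and yields one U+FFFD.
        return kReplacementCharacter;
    }
    return first;  // Non-surrogate code unit: already a scalar value.
}

// In UTF-32, a code unit is its own scalar value unless it is a surrogate
// or above U+10FFFF, in which case it is replaced with U+FFFD.
char32_t scalar_from_utf32_unit(char32_t unit) {
    bool valid = unit <= 0x10FFFF && !(unit >= 0xD800 && unit <= 0xDFFF);
    return valid ? unit : kReplacementCharacter;
}

// Convenience wrapper that collects every scalar value of a UTF-16 view.
std::u32string to_scalar_values(std::u16string_view text) {
    std::u32string out;
    while (!text.empty()) {
        out.push_back(next_scalar_utf16(text));
    }
    return out;
}
```

With this sketch, every value handed to the caller is a valid Unicode scalar value, so downstream code never needs to handle a lone surrogate or an out-of-range number.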

In general, new APIs should be defined to handle invalid UTF-8/16/32
either according to the replacement behavior described in the previous
paragraph or by stopping and signaling error on the first error. In
particular, the replacement behavior should not be left as
implementation-defined, considering that differences in the
replacement behavior between V8 and Blink led to a
[bug](https://bugs.chromium.org/p/chromium/issues/detail?id=662822#c13).
(See [another write-up on this
topic](https://hsivonen.fi/broken-utf-8/).)

## Transcoding Should Be `std::span`-Based Instead of Iterator-Based

Since the above contemplates a conversion facility between legacy
encodings and Unicode encoding forms, it seems on-topic to briefly
opine on what such an API should look like. The Text_view paper says:

> Transcoding between encodings that use the same character set is currently possible. The following example transcodes a UTF-8 string to UTF-16.
>
> > std::string in = get_a_utf8_string();
> > std::u16string out;
> > std::back_insert_iterator<std::u16string> out_it{out};
> > auto tv_in = make_text_view<utf8_encoding>(in);
> > auto tv_out = make_otext_iterator<utf16_encoding>(out_it);
> > std::copy(tv_in.begin(), tv_in.end(), tv_out);
>
> Transcoding between encodings that use different character sets is not currently supported due to lack of interfaces to transcode a code point from one character set to the code point of a different one.
>
> Additionally, naively transcoding between encodings using std::copy() works, but is not optimal; techniques are known to accelerate transcoding between some sets of encoding. For example, SIMD instructions can be utilized in some cases to transcode multiple code points in parallel.
>
> Future work is intended to enable optimized transcoding and transcoding between distinct character sets.

I agree with the assessment that iterator and `std::copy()`-based
transcoding is not optimal due to SIMD considerations. To enable the
use of SIMD, the input and output should be `std::span`s, which,
unlike iterators, allow the converter to look at more than one element
of the `std::span` at a time. I have designed and implemented such an
[API for C++](https://github.com/hsivonen/encoding_c/blob/905e4e336bb57e7103696971d1c75840df840508/include/encoding_rs_cpp.h),
and I invite SG16 to adopt its general API design. I have written a
document that covers the [API design
problems](https://hsivonen.fi/encoding_rs/#problems) that I sought to
address and the [design of the API](https://hsivonen.fi/encoding_rs/#api)
(in Rust but directly applicable to C++). (Please don’t be distracted
by the implementation internals being Rust instead of C++. The API
design is still valid for C++ even if the design constraint of the
implementation internals being behind C linkage is removed. Also,
please don’t be distracted by the API predating `char8_t`.)

## Implications for Text_view

Above I’ve opined that only UTF-8, UTF-16, and UTF-32 (as Unicode
encoding forms—not as Unicode encoding schemes!) should be supported
for iteration by scalar value and that legacy encodings should be
addressed by a conversion facility. Therefore, I think that Text_view
should not be standardized as proposed. Instead, I think
`std::u8string_view`, `std::u16string_view`, and `std::u32string_view`
should gain a way to obtain a Unicode scalar value iterator (that
yields values of type `char32_t`), and a `std::span`-based encoding
conversion API should be provided as a distinct feature (as opposed to
trying to connect Unicode scalar value iterators with `std::copy()`).

-- 
Henri Sivonen
hsivonen_at_[hidden]
https://hsivonen.fi/

Received on 2019-04-24 21:49:47