[SG16-Unicode] Comments on D1629R1 Standard Text Encoding

From: Henri Sivonen <hsivonen_at_[hidden]>
Date: Sat, 17 Aug 2019 22:25:57 +0300
I read through https://thephd.github.io/vendor/future_cxx/papers/d1629.html
. I'm happy to see work on this topic. Thank you! However, I have
various questions, comments, and concerns. They are inline. Quotes are
from the paper.

> Abstract
>
> The standard lacks facilities for transliterating

Why is transliterating mentioned? It seems serious scope creep
compared to character encoding conversion.

> 2.1. The Goal
>
> int main () {
> using namespace std::literals;
> std::text::u8text my_text = std::text::transcode<std::text::utf8>("안녕하세요 👋"sv);
> std::cout << my_text << std::endl; // prints 안녕하세요 👋 to a capable console

This does not look like a compelling elevator pitch, since with modern
terminal emulators, merely `fwrite` to `stdout` with a u8 string
literal works.
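
For illustration, with a UTF-8 terminal this already works without any
new library facility (C++20 for the u8 literal):

#include <cstdio>

int main() {
    const char8_t text[] = u8"안녕하세요 👋\n";
    // Plain byte output; the terminal does the rest.
    std::fwrite(text, 1, sizeof(text) - 1, stdout);
}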

Here's what I'd like to see sample code for:

There exist some functions `void parse_text(std::u8string_view
buffer)` and `void finish_parse()` such that, to incrementally parse a
text resource, `parse_text` must be called zero or more times to pass
the resource in chunks, each of which is valid UTF-8, and then the end
of the resource must be signaled by calling `finish_parse` once. (We
don't see the implementations of these.)

There is an HTTP networking library (which we also don't need to see)
that takes a listener object with three methods:
void on_start(std::optional<std::span<const std::byte>> charset);
void on_data(std::span<const std::byte> data);
void on_end();

The networking library, upon having received the HTTP headers, calls
`on_start` with the value of the `charset` parameter of the
`Content-Type` HTTP header if one exists. E.g. if the header was
`Content-Type: text/plain ; charset = sjis ; foo=bar`, the span will
contain the ASCII bytes for " sjis ".

Then, (from an event loop) the networking library calls `on_data` with
a chunk of bytes every time it receives a chunk of bytes from the
network stream from the kernel. Once the stream is finished, the
networking library calls `on_end`.

I'd like to see the code for `on_start`, `on_data`, and `on_end` such
that the WHATWG ["get an
encoding"](https://encoding.spec.whatwg.org/#concept-encoding-get)
algorithm is applied to `charset` in `on_start` and the resulting
encoding is used or windows-1252 is used as the fallback (either if
`charset` wasn't present or "get an encoding" failed) and the content
is decoded using the WHATWG
["decode"](https://encoding.spec.whatwg.org/#decode) algorithm (i.e.
the UTF-8, UTF-16LE, or UTF-16BE BOM takes precedence over HTTP
charset and malformed byte sequences are replaced with the REPLACEMENT
CHARACTER) and is passed without heap allocations to `parse_text`.
`on_end` is appropriately routed to `finish_parse()`.

This would allow evaluating the following:
1. Does incremental decode work when the decoder doesn't get to pull
from the source but chunks are pushed by an external event loop?
2. Is it easy and ergonomic to decide the external encoding to decode
from at run time based on a WHATWG label or from the BOM?
3. Is it easy to handle BOM sniffing in a scenario where the BOM might
trickle in over up to three input buffers? (I.e. is the BOM sniffing
state machine baked into the decoder so that the application developer
doesn't need to worry about it?)
4. Is it easy to handle the case where a multi-byte sequence in the
input is split across input buffers? (I.e. does the start of the
sequence get consumed into the decoder's state so that the application
developer doesn't need to worry about it?)
5. Is it easy to ensure that the output chunks don't split multi-code
unit Unicode scalar values? (I.e. does the decoder ensure that it
always outputs complete Unicode scalar values so that the application
developer doesn't need to worry about it?)
6. Does the decoder make it easy to handle end of the stream correctly
in the case where the decoder state contains an incomplete byte
sequence per point 4 above? (I.e. are the API ergonomics such that the
application developer writes the right code for this case?)
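
To make the target concrete, here is roughly the shape of code I'd
hope to be able to write. Every `std::text` name below is
hypothetical; the names merely stand in for whatever the proposal
would actually provide:

class listener {
    // Hypothetical type whose state includes the BOM sniffing state
    // machine and any incomplete byte sequence from a previous chunk.
    std::text::runtime_decoder decoder_;
    char8_t buffer_[4096]; // fixed output chunk; no heap allocation

public:
    void on_start(std::optional<std::span<const std::byte>> charset) {
        // WHATWG "get an encoding" on the label, windows-1252 fallback.
        auto encoding = charset ? std::text::encoding_for_label(*charset)
                                : std::nullopt;
        decoder_ = std::text::runtime_decoder(
            encoding.value_or(std::text::windows_1252),
            std::text::bom_policy::sniff);
    }

    void on_data(std::span<const std::byte> data) {
        while (!data.empty()) {
            // Incomplete byte sequences stay in decoder_'s state;
            // the output must end on a scalar value boundary.
            auto [bytes_read, units_written] =
                decoder_.decode_to_utf8(data, buffer_, /*last=*/false);
            parse_text(std::u8string_view(buffer_, units_written));
            data = data.subspan(bytes_read);
        }
    }

    void on_end() {
        // Flush: an incomplete sequence left in the state must come
        // out as U+FFFD before signaling the end of the resource.
        auto [bytes_read, units_written] = decoder_.decode_to_utf8(
            std::span<const std::byte>{}, buffer_, /*last=*/true);
        parse_text(std::u8string_view(buffer_, units_written));
        finish_parse();
    }
};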

> 2.2. Abstract

> from Chrome to Firefox, Qt to Copperspice, and more -- all have their own variations of hand-crafted text processing

This implies that this paper intends to provide a facility that would
address the use cases that Chrome, Firefox, Qt, and Copperspice
address on their own. However, at present, the paper lacks an
evaluation of whether what it proposes covers these use cases.

While I don't expect Chrome or Firefox to migrate to a
standard-library facility, it would still be a good idea to evaluate
the proposed facility from the point of view of whether it enables the
consumption of Web content (implies support for the WHATWG Encoding
Standard) and email (implies support for the WHATWG Encoding Standard
plus UTF-7 and, for increased compatibility, non-WHATWG pre-java.nio
Java encoding names as labels).

> while also building on the concepts and design choices found in both [range-v3] and provably fast text processing such as Windows’s WideCharToMultiByte interfaces, *nix utility iconv, and more.

I think it's fair to characterize the kernel32.dll conversion
functions as "provably fast", but I find it weird to put iconv in that
category. The most prominent implementation is the GNU one based on
the glibc function of the same name, which prioritizes extensibility
by shared object over performance. Based on the benchmarking that I
conducted (https://hsivonen.fi/encoding_rs/#results), I would not
characterize iconv as "provably fast".

> 3. Design

> study of ICU’s interface

Considering that Firefox was mentioned in the abstract, it would make
sense to study its internal API.

You can find the Firefox-internal API for dealing with external encodings at
https://searchfox.org/mozilla-central/source/intl/Encoding.h#82

You can find the underlying C-linkage API at
https://searchfox.org/mozilla-central/source/third_party/rust/encoding_c/include/encoding_rs.h

If you'd like to experiment with a C++ wrapper that doesn't depend on
mozilla:: types, there's one at
https://github.com/hsivonen/encoding_c/blob/master/include/encoding_rs_cpp.h
with a demo application at
https://github.com/hsivonen/recode_cpp/

Apologies for the libc I/O; the C++ version is a port of the demo app
for the C API and the parts that don't deal with the encoding API
remain C-ish. The C API demo app is at
https://github.com/hsivonen/recode_c/

For the time being (I'm in the process of relocating it), the Firefox
internal API for converting between internal encodings is at:
https://searchfox.org/mozilla-central/rev/15cff10fa2d10fcc763b22e909412dcd6e9c4e88/xpcom/string/nsReadableUtils.h#58

The underlying C-linkage API is at
https://searchfox.org/mozilla-central/source/third_party/rust/encoding_c_mem/include/encoding_rs_mem.h

Notably, the conversions between internal encodings are a set of free
functions that, by design, don't participate in the framework for
external encodings. In the case of conversions between internal
encodings, the conversion pair is known statically. When an external
encoding is involved, the framework generally assumes that the
external encoding is decided dynamically at runtime.

The design decisions are extensively documented at
https://hsivonen.fi/encoding_rs/
and
https://hsivonen.fi/modern-cpp-in-rust/

Notably, the reason why encoders and decoders are behind a pointer
instead of being movable values is explained at the end of the second
writeup. There is no fundamental barrier from handling them by value
in C++. However, making the C++ compiler aware of the alignment and
size of these objects is tricky enough as a build system matter that I
have not done it.
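
(By-value handling would look something like the sketch below, where
the constants would have to be generated by the Rust build and kept in
sync with it; that generation step is the tricky part.)

class Decoder {
    // Hypothetical constants that the Rust build would have to emit
    // so that sizeof/alignof match the Rust-side struct.
    alignas(DECODER_ALIGNMENT) unsigned char storage_[DECODER_SIZE];
    // ... methods forwarding to the C-linkage API ...
};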

> Consider the usage of the low-level facilities laid out below:

I think decoding to UTF-32 is not a good example if we want to promote
UTF-8 as the application internal encoding. Considering that Shift_JIS
tends to come up as a reason not to go UTF-8-only in various
situations, I think showing conversion from Shift_JIS (preferably
discovered dynamically at runtime) to UTF-8 would make more sense.
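
That is, something along these lines (all names hypothetical) would be
a more representative example:

// Label obtained at runtime, e.g. from protocol metadata:
auto encoding = std::text::encoding_for_label(u8"shift_jis");
if (!encoding) {
    // Unknown label; fall back or fail as appropriate.
}
// Decode the whole byte buffer to UTF-8, replacing malformed
// sequences with U+FFFD:
std::u8string utf8 = std::text::decode_to_utf8(
    *encoding, input_bytes, std::text::replacement_handler{});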

> if (std::empty(result.input)) {
> break;
> }

How does this take into account the input ending with an incomplete
byte sequence? (E.g. if the input ends with 0xE3 0x81, the first two
bytes of a three-byte UTF-8 sequence, the input becomes empty even
though the conversion hasn't legitimately finished.)

> On top of eagerly consuming free functions, there needs to be views that allow a person to walk some view of storage with a specified encoding.

I doubt that this is really necessary. I think post-decode Unicode
needs to be iterable by Unicode scalar value (i.e. std::u8string_view
and std::u16string_view should be iterable by char32_t), but I very
much doubt that it's worthwhile to provide such iteration directly
over legacy encodings. Providing such iteration competes for
implementor attention with SIMD-accelerated conversion of contiguous
buffers, and I think it's much more important to give attention to the
latter.
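
That is, I'd expect something like this (adapter name hypothetical) to
be the useful facility:

std::u8string_view view = u8"naïve";
for (char32_t scalar : std::text::as_scalar_values(view)) {
    // One Unicode scalar value per iteration; an ill-formed sequence
    // would surface as U+FFFD.
}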

To the extent that other programming languages with encoding
conversion in their standard library, such as Java, focus on
contiguous buffers rather than iteration, it's worthwhile to study
whether application developers really feel that something important is
missing.

> 3.2.1. Error Codes

> // input contains ill-formed sequences
> invalid_sequence = 0x01,
> // input contains incomplete sequences
> incomplete_sequence = 0x02,
> // input contains overlong encoding sequence
> // (e.g. for utf8)
> overlong_sequence = 0x03,
> // output cannot receive all the completed
> // code units
> insufficient_output_space = 0x04,
> // sequence can be encoded but resulting
> // code point is invalid (e.g., encodes a lone surrogate)
> invalid_output = 0x05,
> // leading code unit is wrong
> invalid_leading_sequence = 0x06,
> // leading code units were correct, trailing
> // code units were wrong
> invalid_trailing_sequence = 0x07

Is there really a use case for distinguishing between the types of
errors beyond saying that the input was malformed and perhaps
providing identification of which bytes were in error? Historically,
specs have been pretty bad at giving proper error definitions for
character encodings. The WHATWG Encoding Standard defines what,
precisely, constitutes an error but doesn't categorize the errors.
What authority would you use for specifying requirements for which
kind of error in which kind of encoding yields which of these error
codes? The comments suggest that overlong_sequence is for UTF-8, but
if you view UTF-8 decoding as a matter of matching with a DFA, you get
no distinction between overlong_sequence and invalid_leading_sequence.
Notably, the way the WHATWG Encoding Standard requires identifying
UTF-8 errors effectively requires taking the view of UTF-8 as a
regular grammar. See https://hsivonen.fi/broken-utf-8/ for extensive
discussion of the topic.

I recommend not attempting to categorize malformed sequences either
via error codes or human-readable std::error_condition message
strings.

> State& state;

I think the paper could use more explanation of why it uses free
functions with the state argument instead of encoder and decoder
objects whose `this` pointer provides the state argument via the
syntactic sugar associated with methods.
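
That is, the difference between the proposed shape and the obvious
alternative (type names illustrative only):

// As proposed: state threaded through explicitly.
std::text::utf8 encoding;
std::text::utf8::state st{};
auto result = encoding.decode(input, output, st, handler);

// Alternative: a decoder object whose `this` carries the state.
std::text::utf8_decoder decoder;
auto result2 = decoder.decode(input, output, handler);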

> 3.2.2.2. Implementation Challenge: Ranges are not the Sum of their Parts

The paper doesn't go into detail about why Ranges are needed instead of
spans. Spans are naturally friendly to SIMD-accelerated
implementations. Ranges are the new hammer, but is this use case
really the right nail for them?

> class assume_valid_handler;

Is this kind of UB invitation really necessary?

> The interface for an error handler will look as such:
>
> namespace std { namespace text {
>
> class an_error_handler {
> template <typename Encoding, typename InputRange,
> typename OutputRange, typename State>
> constexpr auto operator()(const Encoding& encoding,
> encode_result<InputRange, OutputRange, State> result) const {
> /* morph result or throw error */
> return result;
> }

I think this part needs a lot more explanation of how the error
handler is allowed to modify the ranges and what happens if the output
doesn't fit.

> template <typename Encoding, typename InputRange,
> typename OutputRange, typename State>
> constexpr auto operator()(const Encoding& encoding,
> decode_result<InputRange, OutputRange, State> result) const {
> /* morph result or throw error */
> return result;
> }

Custom error handlers for decoding seem unnecessary. Are there truly
use cases for behaviors other than replacing malformed sequences with
the REPLACEMENT CHARACTER or stopping conversion upon discovering the
first malformed sequence?

> Throwing is explicitly not recommended by default by prominent vendors and
> implementers (Mozilla, Apple, the Unicode Consortium, WHATWG, etc.)

I don't want to advocate throwing, but I'm curious: What Mozilla and
Apple advice is this referring to? Or is this referring to the Gecko
and WebKit code bases prohibiting C++ exceptions in general?

> For performance reasons and flexibility, the error callable must have a way
> to ensure that the user and implementation can agree on whether or not we
> invoke Undefined Behavior and assume that the text is valid.

The ability to opt into UB seems dangerous. Are there truly compelling
use cases for this?

> 3.2.3. The Encoding Object

> using code_point = char32_t;

This looks bad. As I've opined previously
(https://hsivonen.fi/non-unicode-in-cpp/), I think this should not be
a parameter. Instead, all encodings should be considered to be
conceptually decoding to or encoding from Unicode and char32_t should
be the type for a Unicode scalar value.

> using code_unit = char;

I think it would be better to make the general facility deal with
decoding from bytes and encoding to bytes only, and then to provide
conversion from wchar_t, char16_t, or char32_t to UTF-8 and from
wchar_t and char32_t to UTF-16 as separate non-streaming functions.
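
Concretely, that would mean the streaming machinery converts bytes to
UTF-8/UTF-16 and back, plus eager one-shot functions along these lines
(hypothetical signatures):

namespace std::text {
    // Eager, non-streaming conversions between internal forms:
    std::u8string  to_utf8(std::u16string_view);
    std::u8string  to_utf8(std::u32string_view);
    std::u8string  to_utf8(std::wstring_view);
    std::u16string to_utf16(std::u32string_view);
    std::u16string to_utf16(std::wstring_view);
}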

> using state = __ex_state;

Does this imply that the same state type is used for encode and
decode? That's odd.

Also, conceptually, it seems odd that the state is held in an
"encoding" as opposed to "decoder" and "encoder".

> static constexpr size_t max_code_unit_sequence = MB_LEN_MAX;

I take it that the use case for this parameter is knowing the minimum
buffer size that allows forward progress.

Does there exist an encoding that is worthwhile to support and for
which this parameter exceeds 4? Does this value need to be
parameterized instead of being fixed at 4?

Why aren't there methods for querying for the worst-case output size
given input size and the current conversion state?
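
encoding_rs exposes this as e.g. Decoder::max_utf8_buffer_length; in
this paper's terms such a query might look like this (hypothetical):

// Worst-case number of output code units for `input_size` further
// input code units given the pending conversion state; nullopt if
// the computation would overflow.
static std::optional<std::size_t>
max_decode_output_size(std::size_t input_size, const state& current_state);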

> static constexpr size_t max_code_point_sequence = 1;

Is this relevant if decode is not supported to UTF-32 and is only
supported to UTF-8 and to UTF-16? A single Big5 byte sequence can
decode into two Unicode scalar values, but it happens that a single
Big5 byte sequence cannot decode into more than 4 UTF-8 code units
or into more than 2 UTF-16 code units, which are the normal limits for
single Unicode scalar values in these encoding forms.

> // optional
> using is_encoding_injective = std::false_type;

Does this have a compelling use case?

> // optional
> using is_decoding_injective = std::true_type;

Does this have a compelling use case?

> // optional
> code_point replacement_code_point = '0xFFFD';

What's the use case for this? Is there ever a legitimate reason to
specify something else? (I'm aware of kernel32.dll and Internet
Explorer using a middle dot for code page 932, but WebKit followed by
Blink and Gecko get away with using U+FFFD for Shift_JIS on the Web.)

> // optional
> code_unit replacement_code_unit = '?';

To the extent the application developer doesn't want to request
replacement with a numeric escape sequence, such as &#9731;, does
anyone who wants a constant replacement ever want anything other than
the representation of U+FFFD if the encoding can represent it, or the
question mark otherwise? That is, what's the use case for this?

> // encodes exactly one full code unit sequence
> // into one full code point sequence
> template <typename In, typename Out, typename Handler>
> encode_result<In, Out, state> encode(
> In&& in_range,
> Out&& out_range,
> state& current_state,
> Handler&& handler
> );
>
> // decodes exactly one full code point sequence
> // into one full code unit sequence
> template <typename In, typename Out, typename Handler>
> decode_result<In, Out, state> decode(
> In&& in_range,
> Out&& out_range,
> state& current_state,
> Handler&& handler
> );

How do these integrate with SIMD acceleration?

> static void reset(state&);

What's the use case for this as opposed to constructing a new object?
Note that if you support baking the BOM sniffing state machine into
the state of the object, you won't be able to reset the object without
reserving space for information about what to reset to.

> 3.2.3.1. Encodings Provided by the Standard

> namespace std { namespace text {
>
> class ascii;
> class utf8;
> class utf16;
> class utf32;
> class narrow_execution;
> class wide_execution;

This is rather underwhelming for an application developer wishing to
consume Web content or email.

On the other hand, the emphasis of the design presented in this paper
being compile-time specializable seems weird in connection to
`narrow_execution`, whose implementation needs to be dispatched at
runtime. Presenting a low-level compile-time specializable interface
but then offering unnecessarily runtime-dispatched encoding through it
seems like a layering violation.

> If an individual knows their text is in purely ASCII ahead of time and they work in UTF8, this information can be used to bit-blast (memcpy) the data from UTF8 to ASCII.

Does one need this API to live dangerously with memcpy? (As opposed to
living dangerously with memcpy directly.)

> 3.2.3.2. UTF Encodings: variants?

> both CESU-8 and WTF-8 are documented and used internally for legacy reasons

This applies also to wide_execution, utf16, and utf32. (I wouldn't be
surprised if WTF-8 surpassed UTF-32 in importance in the future.)

I'm not saying that CESU-8 or WTF-8 should be included, but I think
non-byte-code-unit encodings don't have good justification for being
in the same interface that is used for consuming external data

> More pressingly, there is a wide body of code that operates with char as the code unit for their UTF8 encodings. This is also subtly wrong, because on a handful of systems char is not unsigned, but signed.

This is a weird remark. Signed char is a misfeature generally and is
bad for a lot of processing besides UTF-8.
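
The classic hazard, which isn't specific to UTF-8:

char c = '\xE3';
if (c >= 0x80) {
    // Never taken where char is signed: c holds -29, not 227.
    // Every byte-oriented algorithm needs casts to unsigned char.
}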

> template <typename CharT, bool encode_null, bool encode_lone_surrogates>

I don't think it's at all clear that it's a good idea, in terms of
application developers' understanding of the issues, to enable "UTF-8"
to be customized like this as opposed to having WTF-8 as a separate
thing.

> This is a transformative encoding type that takes the source (network) endianness

Considering that the "network byte order" is IETF speak for "big
endian", I think it's confusing to refer to whatever you get from an
external source in this manner. (Of UTF-16BE and UTF-16LE, you're more
likely to receive the latter.)

> This paper disagrees with that supposition and instead goes the route of providing a wrapping encoding scheme

FWIW, I believe that conversion from wchar_t, char16_t, or char32_t
into UTF-8 should not be forced into the same API as conversion from
external byte-oriented data sources, and I believe that it's
conceptually harmful to conflate the char16_t-oriented UTF-16 with the
byte-oriented UTF-16BE and UTF-16LE external encodings.

> 3.2.4. Stateful Objects, or Stateful Parameters?

> maintains that encoding objects can be cheap to construct, copy and move;

Surely objects that have methods (i.e. state is taken via `this`
syntactic sugar) can be cheap to construct, copy, and move.

> improves the general reusability of encoding objects by allowing state to be massaged into certain configurations by users;

It seems to me that allowing application developers to "massage" the
state is an anti-feature. What's the use case for this?

> and, allows users to set the state in a public way without having to prescribe a specific API for all encoders to do that.

Likewise, what's the use case for this?

> As a poignant example: consider the case of execution encoding character
> sets today, which often defer to the current locale. Locale is inherently
> expensive to construct and use: if the standard has to have an encoding
> that grabs or creates a codecvt or locale member, we will immediately lose
> a large portion of users over the performance drag during construction of
> higher-level abstractions that rely on the encoding. It is also notable that
> this is the same mistake std::wstring_convert shipped with and is one of
> the largest contributing reasons to its lack of use and subsequent
> deprecation (on top of its poor implementation in several libraries, from
> the VC++ standard library to libc++).

As noted, trying to provide a compile-time specialized API that
provides access to inherently runtime-discovered encodings seems like
a layering violation. Maybe the design needs to surface the
dynamically dispatched nature of these encodings and to see what that
leads to in terms of the API design.
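
For instance, the API could admit a type-erased handle for encodings
that are only discovered at runtime (sketch; all names hypothetical):

class runtime_encoding {
public:
    // Resolved from the current locale at runtime.
    static runtime_encoding current_locale();
    // WHATWG-style label lookup.
    static std::optional<runtime_encoding>
    for_label(std::span<const std::byte> label);
    // Dynamically dispatched conversion; no compile-time encoding
    // type anywhere in the signature (result type elided here).
    decode_result decode_to_utf8(std::span<const std::byte> input,
                                 std::span<char8_t> output, bool last);
};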

> 3.2.4.1. Self-Synchronizing State

> If an encoding is self-synchronizing, then at no point is there a need to refer to an "potentially correct but need to see more" state: the input is either wholly correct, or it is not.

Is this trying to say that a UTF-8 decoder wouldn't be responsible for
storing the prefix of buffer-boundary-crossing byte sequences into its
internal state and it would be the responsibility of the caller to
piece the parts together?

> 3.3.1. Transcoding Compatibility

What are the use cases for this? I suggest treating generalized
transcoding as a YAGNI matter and, if someone really needs it, letting
them pivot via UTF-8 or UTF-16.
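
Someone who really needs, say, Shift_JIS-to-GBK conversion can compose
it (names hypothetical, as before):

// Pivot through UTF-8 instead of requiring every encoding pair:
std::u8string pivot = std::text::decode_to_utf8(
    shift_jis, input_bytes, std::text::replacement_handler{});
std::vector<std::byte> output = std::text::encode_from_utf8(
    gbk, pivot, std::text::replacement_handler{});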

> 3.3.2. Eager, Fast Functions with Customizability

> Users should be able to write fast transcoding functions that the standard picks up for their own encoding types. From GB1032

Is 1032 the intended number here?

> WHATWG encodings

If it is the responsibility of the application developer to supply an
implementation of the WHATWG Encoding Standard, to my knowledge the
fastest and most correct option is to use encoding_rs via C linkage.

In that case, what's the value proposition for wrapping it in the API
proposed by this paper as opposed to using the API from
https://github.com/hsivonen/encoding_c/blob/master/include/encoding_rs_cpp.h
updated with C++20 types (notably std::span and char8_t) directly?

Regardless of the value proposition, writing the requisite glue code
could be a useful exercise to validate the API proposed in this paper:
If the glue code can't be written, is there a good reason why not?
Does the glue code foil the SIMD acceleration?

Also, to validate the API proposed here, it would be a good exercise
to encode, "with replacement" in the Encoding Standard sense, a string
consisting of the three Unicode scalar values U+000F, U+2603, and
U+3042 into ISO-2022-JP and to see what it takes API-wise to get the
Encoding Standard-compliant result. (Unless I misread the Encoding
Standard, U+000F has to come out as the numeric reference for U+FFFD
rather than for U+000F, U+2603 as &#9731;, and U+3042 as an escape
into the JIS X 0208 state followed by a transition back to ASCII at
the end.)

> 4. Implementation

> This paper’s r2 hopes to contain more benchmarks

I'd be interested in seeing encoding_rs (built with SIMD enabled, both
on x86_64 and aarch64) included in the benchmarks. (You can grab build
code from https://github.com/hsivonen/recode_cpp/ , but `cargo build
--release` needs to be replaced with `cargo build --release --features
simd-accel`, which requires a nightly compiler, to enable SIMD.)

> [WTF8]
> Simon Sapin. The WTF-8 encoding. September 26th, 2019. URL: https://simonsapin.github.io/wtf-8/

That date can't be right.

-- 
Henri Sivonen
hsivonen_at_[hidden]
https://hsivonen.fi/
