sg16: Re: [SG16-Unicode] Comments on D1629R1 Standard Text Encoding

From: Henri Sivonen <hsivonen_at_[hidden]>
Date: Sat, 5 Oct 2019 13:33:03 +0300

Hi,

Sorry about the slow reply.

On Fri, Sep 6, 2019 at 6:40 AM JeanHeyd Meneide <phdofthehouse_at_[hidden]>
wrote:

> On Sat, Aug 17, 2019 at 3:51 PM Henri Sivonen <hsivonen_at_[hidden]>
> wrote:
> > 2.1. The Goal
>
>> >
>> > int main () {
>> > using namespace std::literals;
>> > std::text::u8text my_text =
>> std::text::transcode<std::text::utf8>(“안녕하세요 👋”sv);
>> > std::cout << my_text << std::endl; // prints 안녕하세요 👋 to a capable
>> console
>>
>> This does not look like a compelling elevator pitch, since with modern
>> terminal emulators, merely `fwrite` to `stdout` with a u8 string
>> literal works.
>>
>> Here's what I'd like to see sample code for:
>>
>> ...
>>
>
> I certainly need a wide body of examples, but that's not going to fit in
> the initial proposal. At least, not in that version; the next version
> (which will probably be published post-Belfast) will have much more
> implementation and projects behind it.
>

The reason why I'd like to see this particular example is to be able to
understand how the API works with discontiguous input where byte sequences
are split across buffers that come in over time.

Here's the program that I asked for using encoding_rs (though not using
C++20 types):
https://github.com/hsivonen/sg16demo/blob/b2b948d5731a88efcaf44c14e95406f9e0345985/sg16demo.cpp#L60-L118

Notably:

* Dynamic lookup by run-time label works.
* No need for application code to handle BOM-sniffing (actual use not
demoed).
* No need for the application to deal with input buffers ending in the
middle of a byte sequence.
* Can convert with on-stack buffer even if input is too large to fit in it
at once.
* If the stream ends inside an incomplete byte sequence, U+FFFD is
generated instead of the error getting lost.
* There's no need for the converter to be able to pull all input from a
source (e.g. an iterator) in one go.

I still don't understand how your API proposal handles the above points,
which is why seeing code corresponding to this example would help.

> > 3. Design
>>
>> > study of ICU’s interface
>>
>> Considering that Firefox was mentioned in the abstract, it would make
>> sense to study its internal API.
>>
>
> ...
>
>
> Thank you for the links. I have read some of these before, but not all of
> them. Most of what I have read is CopperSpice's API for encoding,
>

Does CopperSpice have an encoding API other than the one inherited from Qt?
I don't see anything else in the repo. The Qt one is not a good role model,
because its streaming mode allocates on the heap instead of supporting a
caller-provided output buffer.

> libogonek's documentation and source, text_view's source and examples,
>

I didn't see support for legacy CJK encodings in either of those. Do they
have legacy CJK support? I wouldn't trust an API design to handle legacy
CJK before verifying that it can handle GB18030 decode when there are three
bytes that constitute a valid prefix of a 4-byte GB18030 sequence followed
by a fourth byte that's not valid as the last byte of the 4-byte sequence,
that it can handle decoding of the Big5 byte sequences that output two
Unicode scalars for a single Big5 byte pair, and that it can handle the
ISO-2022-JP encode case that I mentioned previously in a WHATWG-compliant
manner. (If those cases work, EUC-KR, EUC-JP, and Shift_JIS will be fine.)

> I'll make sure to give a good lookover the Firefox internals plus the
> rust transcoder you built.
>

The in-RAM conversions in Firefox that don't convert into an XPCOM string
target have now been relocated to:
https://searchfox.org/mozilla-central/rev/2f29d53865cb895bf16c91336cc575aecd996a17/mfbt/Utf8.h#280
https://searchfox.org/mozilla-central/rev/2f29d53865cb895bf16c91336cc575aecd996a17/mfbt/TextUtils.h
https://searchfox.org/mozilla-central/rev/2f29d53865cb895bf16c91336cc575aecd996a17/mfbt/Latin1.h

> > On top of eagerly consuming free functions, there needs to be views that
> allow a person to walk some view of storage with a specified encoding.
>
>>
>> I doubt that this is really necessary. I think post-decode Unicode
>> needs to be iterable by Unicode scalar value (i.e. std::u8string_view
>> and std::u16string_view should by iterable by char32_t), but I very
>> much doubt that it's worthwhile to provide such iteration directly
>> over legacy encodings. Providing such iteration competes over
>> implementor attention with SIMD-accelerated conversion of contiguous
>> buffers, and I think it's much more important to give attention to the
>> latter.
>>
>
> The goal is not to provide iteration over legacy encodings. The goal is to
> separate the algorithm (decoding/encoding text) from the storage
> (std::basic_string<char8_t>, __gnu_cxx::rope<char>, boost::unencoded_rope,
> trial::circular_buffer, sg14::ringspan, etc.). The basic algorithm -- if it
> does not require more than forward iterators and friends -- should work on
> those classes and iterators and ranges.
>

The main objection to the encoding_rs API design stated at
https://youtu.be/BdUipluIf1E?t=1445 was that it's "pointer-based" (I'd
rather describe it as span-based, but yeah, the C API indeed has bare
pointers), so it's hard to use with exotic storage, such as a deque or
ropes. (Thank you for mentioning encoding_rs in the talk!)

Why is the API a problem with ropes? In encoding_rs, a Decoder or an
Encoder processes a temporal sequence of spans, so you could iterate over a
rope and on each rope segment wrap it as a span, pass it to encoding_rs, and
make a segmented rope of the output, too. There are three caveats:

1. The API does not do the rope walking, you'd need to add that layer
specific to your rope type.
2. When encoding from UTF-16 or UTF-8, it would be up to you to ensure
Unicode scalar values aren't split across rope segments. (If you have ropes
that can split UTF-16 surrogate pairs or UTF-8 byte sequences across rope
segments, you're going to have a bad time in terms of complexity in your
code in general.) However, I want to emphasize that, by design, the Decoder
side accepts arbitrary boundaries, so you can feed a Decoder one-byte spans
and it works.
3. The encoding_rs::mem API, which is for in-RAM conversions, expects the
whole logical input stream to be in a contiguous span, since its point is
to optimize away objects that hold state. (Amusingly, this API is used in
Firefox in the one place where we do have rope strings: In SpiderMonkey. So
there the bulk of the calling function ends up being code for dealing with
UTF-16 surrogate pairs split across rope segments, because JavaScript
developers don't see the ropes directly and can do whatever:
https://searchfox.org/mozilla-central/rev/2f29d53865cb895bf16c91336cc575aecd996a17/js/src/vm/StringType.cpp#142
)

> This provides a greater flexibility of storage options for users and a
> robust composition story for algorithms. Having spent a small amount of
> time contributing to one standard library and observing the optimizations
> and specializations put into many of the already existing standard
> algorithms, I can assure you that no implementation will spend their time
> with only the default base encoding versions. (And if not, I have every
> intention on making sure at least the libraries I have the power to modify
> -- libstdc++ and libc++ -- are improved.)
>
> There were also previous chances at potential optimization with things
> like wstring_convert, which took either pointers or just a
> basic_string<CharT> outright. The complaints about these functions were
> rife and heavy (most of it due to its dependence on std::locale and much of
> its virtualized interface, but many of the implementations did not
> implement it correctly (
> https://github.com/OpenMPT/openmpt/blob/master/common/mptString.cpp#L587
> | https://sourceforge.net/p/mingw-w64/bugs/538/), let alone with speed in
> mind).
>
> Finally, I certainly agree that we want to focus on contiguous interfaces.
> But providing nothing at all for other storage types leaves a lot of users
> and use cases out in the cold and would require them to manually chunk,
> iterate, and poke at their code unit storage facilities.
>

With the kind of span-based interface that encoding_rs Decoder has you can
always pass in one-byte spans to accommodate weird storage with which
slower outcomes are acceptable. It's probably slower than a from-scratch
iterator-based implementation, but given limited development resources and
the acceptability of an iterator over unusual storage being slower anyway, it
seems acceptable to me to make long spans fast and input using one-byte
spans correct.

>
>
>
>> > 3.2.2.2. Implementation Challenge: Ranges are not the Sum of their Parts
>>
>> The paper doesn't go into detail into why Ranges are needed instead of
>> spans.
>>
>
> This part of the paper was cataloguing an implementation issue that has
> since been transferred to a different paper and likely to be solved soon:
> https://wg21.link/p1664 | https://thephd.github.io/reconstructible-ranges
>
> Ranges are used here because working with individual iterators will have
> consequences for encoding and decoding iterators. libogonek explored
> stacking such iterators on top of iterators for decoding and normalization:
> the result was not very register or cache friendly due to the sheer size of
> the resulting iterators (256+ bytes for a range in some cases). Ranges
> allow us to fix this with its concept of a "sentinel".
>

New Range sentinels are better than old sentinels that had to be of the
same type as the main iterator, yes.

However, SIMD requires data to be in contiguous buffers. The whole input
doesn't need to be in _one_ contiguous buffer, but having an arbitrary
iterator that yields one byte at a time won't work. std::span makes it
obvious that there's contiguous memory backing it, so conversion input and
output as a series of spans makes this explicit. AFAICT, in general a Range
erases information about whether the storage is contiguous. It seems
complicated to have an interface that seemingly erases this information and
then behind the scenes tries to re-discover that the Range is actually
contiguous. It seems to me that this makes it less obvious to the user what
the performance characteristics are.

Furthermore, with spans the caller knows that the code size cost of a span
is on the order of pointer and length. With Ranges being a compile-time
thing, how much caller-side monomorphization will result from the interface
using Ranges and discovering that the Ranges are
contiguous/unit-based/rope-like and adapting to that?

Is it really common enough for encoding conversion sources and sinks not to be
made of (series of) contiguous buffers to design for the case where the
source yields bytes in an arbitrary way and the sink accepts char32_t in an
arbitrary way? That is, why should the interface be suggestive of iterators
with tricks to recover the contiguous nature of input and output instead of
the contiguous nature being explicit and discontiguous sources having to
deploy the trick of wrapping their bytes in one-byte spans?

>
>> > class assume_valid_handler;
>>
>> Is this kind of UB invitation really necessary?
>>
>
> I did something similar in my first private implementation and it had its
> use cases there as well. I've been told and shown that not re-checking
> invariants on things people know are clean was useful and provided
> meaningful performance improvements in their codebases. I think if I write
> more examples showing where error handlers can be used, it would show that
> choosing such an error handler is an incredibly conscious decision at the
> end of a very verbose function call or template parameter: the cognitive
> cost for asking for UB is extraordinarily high for when you want it (as it
> should be):
>
> std::u8string i_know_its_fine = std::text::transcode("abc",
> std::text::latin1{}, std::text::utf8{},
> std::text::assume_valid_handler{});
>

UTF-8 to UTF-16 conversion becomes faster when assuming validity, but the
structure of the code can change more than by just compiling out some
things. E.g. in encoding_rs, the way lead bytes are checked changes from a
lookup table to plain less-than checks. While it might make sense to
provide this for in-RAM UTF-8 to UTF-16 specifically, I'm skeptical about
providing it generically as an error handler.
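
As an illustration of what "plain less-than checks" on the lead byte can
look like, here is a minimal sketch of an in-RAM UTF-8 to UTF-16
conversion that assumes valid input. It is not the encoding_rs code; the
function name is hypothetical, and on invalid input the behavior is
undefined (for example, it can read past the end of the input).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Sketch only. Precondition: `src` is valid UTF-8. Violating the
// precondition is undefined behavior.
std::u16string utf8_to_utf16_assume_valid(std::string_view src) {
    std::u16string dst;
    dst.reserve(src.size());  // UTF-16 never needs more units than UTF-8 bytes
    std::size_t i = 0;
    while (i < src.size()) {
        const std::uint32_t b0 = static_cast<unsigned char>(src[i]);
        // Payload bits of the k-th continuation byte after the lead byte.
        auto cont = [&](std::size_t k) -> std::uint32_t {
            return static_cast<unsigned char>(src[i + k]) & 0x3Fu;
        };
        if (b0 < 0x80) {            // one byte (ASCII)
            dst.push_back(static_cast<char16_t>(b0));
            i += 1;
        } else if (b0 < 0xE0) {     // two bytes
            dst.push_back(static_cast<char16_t>(((b0 & 0x1Fu) << 6) | cont(1)));
            i += 2;
        } else if (b0 < 0xF0) {     // three bytes
            dst.push_back(static_cast<char16_t>(
                ((b0 & 0x0Fu) << 12) | (cont(1) << 6) | cont(2)));
            i += 3;
        } else {                    // four bytes: emit a surrogate pair
            std::uint32_t c = ((b0 & 0x07u) << 18) | (cont(1) << 12) |
                              (cont(2) << 6) | cont(3);
            c -= 0x10000u;
            dst.push_back(static_cast<char16_t>(0xD800u + (c >> 10)));
            dst.push_back(static_cast<char16_t>(0xDC00u + (c & 0x3FFu)));
            i += 4;
        }
    }
    return dst;
}
```

A validating converter cannot classify lead bytes this way, because it
also has to reject bytes such as 0x80..0xC1 and 0xF5..0xFF and check each
continuation byte, which is why the two variants differ in structure and
not only in which checks are compiled out.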

>
> > 3.2.3. The Encoding Object
>>
>> > using code_point = char32_t;
>>
>> This looks bad. As I've opined previously
>> (https://hsivonen.fi/non-unicode-in-cpp/), I think this should not be
>> a parameter. Instead, all encodings should be considered to be
>> conceptually decoding to or encoding from Unicode and char32_t should
>> be the type for a Unicode scalar value.
>>
>
> A lot of people have this comment. I am more okay with having code_point
> be a parameter, with the explicit acknowledgement that if someone uses
> not-char32_t (not a Unicode Code Point), then nothing above the encoding
> level in the standard will work for them (no normalization, no segmentation
> algorithms, etc.). I have spoken to enough people who want to provide very
> specific encoding stories for legacy applications where this would help.
> Even if the encoding facilities work for them, I am very okay with letting
> them know that -- if they change this fundamental tenet -- they will lock
> themselves out of the rest of the algorithms, the ability to transcode with
> much else, and basically the rest of text handling in the Standard.
>
> They get to decide whether or not that's a worthwhile trade.
>

Letting them decide is not free for the ecosystem. It leaves cruft and
mental overhead for everyone else to see and understand.

> > static constexpr size_t max_code_unit_sequence = MB_LEN_MAX;
>
>>
>> Does there exist an encoding that is worthwhile to support and for
>> which this parameter exceeds 4? Does this value need to be
>> parameterized instead of being fixed at 4?
>>
>
> MB_LEN_MAX on Windows reports 5, but that might be because it includes
> the null terminator, so maybe there is no implementation where it exceeds 4?
>

Could be either 4 plus zero terminator or 2 for one Japanese character plus
3 for a state transition in ISO-2022-JP if their implementation insists on
outputting both in one go.

>
> Why aren't there methods for querying for the worst-case output size
>> given input size and the current conversion state?
>>
>
> This was commented on before, and I need to add it. Almost all encoding
> functionality today has it (usually by passing nullptr into the function).
>

My preference for API design, which can also be justified performance-wise
by optimizing out one branch, is to have a separate entry point for the
query and not have a bogus buffer mean something magic.
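
A minimal sketch of such a separate query entry point, for the UTF-8 to
UTF-16 case only. The function name and the `pending` parameter are
assumptions made for illustration, not an existing API. The bound relies
on the fact that each UTF-8 byte produces at most one UTF-16 code unit,
including when errors are replaced with U+FFFD.

```cpp
#include <cstddef>
#include <limits>
#include <optional>

// Hypothetical query, separate from the conversion entry point: the
// worst-case number of UTF-16 code units produced when decoding
// `byte_length` more bytes of UTF-8 while `pending` bytes of an
// incomplete sequence are already buffered in the decoder state.
// Returns std::nullopt if the sum would overflow std::size_t.
std::optional<std::size_t> max_utf16_length(std::size_t byte_length,
                                            std::size_t pending) {
    if (byte_length > std::numeric_limits<std::size_t>::max() - pending) {
        return std::nullopt;
    }
    return byte_length + pending;
}
```

The caller sizes the output buffer from this value and then calls the
conversion function, which never has to check for a null or zero-length
"query mode" buffer.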

> > // optional
>> > using is_encoding_injective = std::false_type;
>>
>> Does this have a compelling use case?
>>
>> > // optional
>> > using is_decoding_injective = std::true_type;
>>
>> Does this have a compelling use case?
>>
>
> This is part of a system wherein users will be errored at compile-time for
> any lossy transcoding they do. As in the example, ASCII is perfectly fine
> decoding into Unicode Scalar Values. Unicode Scalar Values are NOT fine
> with being encoded into ASCII. Therefore, the following should loudly yell
> at you at compile-time, not run-time:
>
> auto this_is_not_fine = std::text::transcode(U"☢️☢️", std::text::ascii{});
> // static assertion failed: blah blah blah
>
> The escape hatch is to provide the non-default text encoding handler:
>
> auto still_not_fine_but_whatever = std::text::transcode(U"☢️☢️",
> std::text::utf32{}, std::text::ascii{},
> std::text::replacement_error_handler{});
> // alright, it's your funeral...
>
> This is powered by the typedefs noted above.
> is_(decoding/encoding)_injective informs the implementation whether or not
> your implementation can perfectly encode from the code point to code
> units and vice versa. If it can't, it will be sure to loudly scold you if
> you use a top-level API that does not explicitly pass an error handler,
> which is your way of saying "I know what I'm doing, shut up stdlib".
>
> I have programmed this in before to an API and it was helpful to stop
> people from automatically converting text that was bad. See an old
> presentation I did on the subject when I first joined SG16 while it was
> still informal-text-wg: https://thephd.github.io/presentations/unicode/sg16/2018.03.07
> - ThePhD - a rudimentary unicode abstraction.pdf. It was mildly
> successful in smacking people's hands when they wanted to do e.g. utf8 ->
> latin1, and made them think twice. The feedback was generally positive. I
> had a different API back then, using converting constructors. I don't think
> anyone in the standard would be happy with converting constructors, but the
> same principles apply to the encode/decode/transcode functions.
>

Considering that pretty much anything but the UTFs says "Nope" if you ask if
you can encode from a UTF into them, I'm very skeptical of the benefit of
this API. Java has a fancy API on this theme. Does anyone use it for
anything useful? For encoding_rs, I added a method
can_encode_everything(), which I thought would say true for UTF-8 and
GB18030. Now the documentation remarks "(Only true if the output encoding
is UTF-8.)", since there's _one_ scalar value that (Web-flavored) GB18030
cannot encode...

> > // encodes exactly one full code unit sequence
>
>> > // into one full code point sequence
>> > template <typename In, typename Out, typename Handler>
>> > encode_result<In, Out, state> encode(
>> > In&& in_range,
>> > Out&& out_range,
>> > state& current_state,
>> > Handler&& handler
>> > );
>> >
>> > // decodes exactly one full code point sequence
>> > // into one full code unit sequence
>> > template <typename In, typename Out, typename Handler>
>> > decode_result<In, Out, state> decode(
>> > In&& in_range,
>> > Out&& out_range,
>> > state& current_state,
>> > Handler&& handler
>> > );
>>
>> How do these integrate with SIMD acceleration?
>>
>
> They don't. std::text::decode/encode/transcode free functions is where
> specializations for fast processing are meant to kick in. These are the
> basic, one-by-one encodings.
>

What is one-by-one, though? In your CppCon talk you talked about a code
point as the unit of decoder output (https://youtu.be/BdUipluIf1E?t=2142)
but without discussing Big5 having indivisible decode steps that output two
Unicode scalar values at once. That is, it is not the case that pushing one
more byte into a decoder either outputs nothing (a multibyte sequence isn't
complete) or outputs one scalar value. As I mentioned previously, this
doesn't matter if you only expose decoding to UTF-8 and decoding to UTF-16,
since one scalar value can be up to 4 code units in UTF-8 and up to two code
units in UTF-16, and Big5 _does_ fit the model that pushing one more byte
of input results in up to 4 code units of UTF-8 output or up to 2 code
units of UTF-16 output.

Does your design handle Big5 decode?
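
To make the Big5 case concrete, here is a minimal sketch of a decode step
whose result type can carry two scalar values. It covers only the four
byte pairs that the WHATWG Encoding Standard maps to a base letter plus a
combining mark; the type and function names are hypothetical, and a real
decoder would consult the full Big5 index for every other pair.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Result of one indivisible decode step: zero, one, or two scalar values.
struct DecodeStep {
    std::array<char32_t, 2> scalars{};
    std::size_t count = 0;
};

// Sketch: handles only the four special Big5 byte pairs. Any other pair
// is reported as count == 0 here, meaning "not handled by this sketch".
DecodeStep decode_big5_pair_special(std::uint8_t lead, std::uint8_t trail) {
    if (lead != 0x88) {
        return {};
    }
    switch (trail) {
        case 0x62: return {{0x00CA, 0x0304}, 2};  // E-circumflex + macron
        case 0x64: return {{0x00CA, 0x030C}, 2};  // E-circumflex + caron
        case 0xA3: return {{0x00EA, 0x0304}, 2};  // e-circumflex + macron
        case 0xA5: return {{0x00EA, 0x030C}, 2};  // e-circumflex + caron
        default:   return {};
    }
}
```

An interface that returns exactly one code point per step has no place to
put the second value, whereas output as UTF-8 or UTF-16 code units absorbs
it without a special case.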

I'm worried that trying to expose what look like the primitive building
blocks will result in needless complexity compared to exposing only the
more packaged operations that decode to UTF-8 or UTF-16 (but not to
individual "code points" or scalar values or to UTF-32).

>
> > static void reset(state&);
>>
>> What's the use case for this as opposed to constructing a new object?
>>
>
> Not any good use case, really: I should likely change this to just let
> someone default-construct and copy over the state. I have to reconcile
> encoding objects and states which should conceivably be
> non-default-constructible (e.g., they hold a string or enumeration value
> that contains some indication of which encoding to use plus any
> intermediate conversion state) and this design. The minimum API of "state"
> needs to be more fleshed out.
>

On the topic of state: In the talk (https://youtu.be/BdUipluIf1E?t=2173)
you said that for the three UTFs the state object is an empty struct and
that the state is for ISO-2022-JP and the like.

Let's say the emoji 🤔 arrives over the network as UTF-8 such that the
first two bytes "\xF0\x9F" are at the end of one buffer and the last two
"\xA4\x94" are at the start of another buffer. How do I decode this? I
would hope that there'd be a decoder object that I pass "\xF0\x9F" to and
it bakes them into its state so that when I pass "\xA4\x94" next, I get 🤔
out.

>>
>> On the other hand, the emphasis of the design presented in this paper
>> being compile-time specializable seems weird in connection to
>> `narrow_execution`, whose implementation needs to be dispatched at
>> runtime. Presenting a low-level compile-time specializable interface
>> but then offering unnecessarily runtime-dispatched encoding through it
>> seems like a layering violation.
>>
>> > If an individual knows their text is in purely ASCII ahead of time and
>> they work in UTF8, this information can be used to bit-blast (memcpy) the
>> data from UTF8 to ASCII.
>>
>> Does one need this API to live dangerously with memcpy? (As opposed to
>> living dangerously with memcpy directly.)
>>
>
> The idea is that the implementation can safely memcpy, because they have
> compile-time information that indicates they can do so. If they don't, they
> can't make that guarantee; I want to provide the implementation that
> guarantee.
>

How often do the requirements line up just right that you can do that? Does
it happen often enough that someone knows that their GB2312 input is valid
and they happen to be converting into GB18030? Note that you have to
special-case the euro sign when going from real-world GBK to GB18030. This
seems like a YAGNI thing that's going to have a very low success rate.

>
> > 3.2.4. Stateful Objects, or Stateful Parameters?
>>
>> > maintains that encoding objects can be cheap to construct, copy and
>> move;
>>
>> Surely objects that have methods (i.e. state is taken via `this`
>> syntactic sugar) can be cheap to construct, copy, and move.
>>
>> > improves the general reusability of encoding objects by allowing
>> state to be massaged into certain configurations by users;
>>
>> It seems to me that allowing application developers to "massage" the
>> state is an anti-feature. What's the use case for this?
>>
>> > and, allows users to set the state in a public way without having to
>> prescribe a specific API for all encoders to do that.
>>
>> Likewise, what's the use case for this?
>>
>
> The goal here is for when locale dependency or dynamic encodings come
> into play. We want to keep the encoding object itself cheap to create and
> use, while putting any heavy lifting inside of the state object which will
> get passed around explicitly by reference in the low-level API. Encoding
> tags, incredibly expensive locale objects, and more can all be placed onto
> the state itself, while the encoding object serves as the cheap handle that
> allows working with such a state.
>
> I would be interested in pursuing the alternate design where the
> encoding object just holds all the state, all the time. This means spinning
> up a fresh encoding object anytime state needs to be changed, but I can
> imagine it still amounting to the same level of work in many cases. I will
> be mildly concerned that doing so will front-load things like change in
> locale and such to filling the Encoding Object's API, or mandate certain
> constructor forms. This same thing happened to wstring_convert, codecvt_X
> and friends, so I am trying to do my best to avoid that pitfall. This also
> brings up a very concerning point: if "state" has special members that
> can't be easily reconstructed or even copied, how do we handle copying one
> encoding from one text object to another? Separating the state means it's
> tractable and controllable, not separating it means all of the copyability,
> movability, etc. becomes the encoding's concern now.
>

This all seems very odd and suspiciously complex coming from the point of
view that Encoding has no state but can create Decoder objects that have
state. (Encoder has state only for ISO-2022-JP.)

>
> > 3.2.4.1. Self-Synchronizing State
>>
>> > If an encoding is self-synchronizing, then at no point is there a need
>> to refer to an "potentially correct but need to see more" state: the input
>> is either wholly correct, or it is not.
>>
>> Is this trying to say that a UTF-8 decoder wouldn't be responsible for
>> storing the prefix of buffer-boundary-crossing byte sequences into its
>> internal state and it would be the responsibility of the caller to
>> piece the parts together?
>>
>
> The purpose of this is to indicate that a state has no "holdover"
> between encoding calls. Whatever it encodes or decodes results in a
> complete sequence, and incomplete sequences are left untouched (or encodes
> a replacement character and gets skipped over depending on the error
> handler, etc. etc.). This means that function calls end up being "pure"
> from the point of the encoder and decoder.
>
>
Making UTF-8 decode pure at the expense of the application programmer
having to take the last incomplete prefix that the previous buffer ended
with and arranging it to be at the start of the next buffer is a serious
ergonomic bug and also a performance bug due to the byte copying/moving
involved.

See the fourth bullet point under https://hsivonen.fi/encoding_rs/#problems
.

> > 3.3.1. Transcoding Compatibility
>>
>> What are the use cases for this? I suggest treating generalized
>> transcoding as a YAGNI matter, and if someone really needs it, letting
>> them pivot via UTF-8 or UTF-16.
>>
>
> The point of this section is to allow for encodings to clue the
> implementation in as to whether or not just doing `memcpy` or similar is
> acceptable. If my understanding is correct, a lot of the GBK encodings are
> bitwise compatible with GB18030. It would make sense for an implementation
> to speedily copy this into storage rather than have to roundtrip through
> transcoding.
>
>
GBK to GB18030 has to special-case the euro sign even if you accept UB in
the case the input isn't valid.

In general, I'm continuing to be of the opinion that transcoding from a
non-UTF encoding directly to another non-UTF without being able to bear the
cost of pivoting via either UTF-8 or UTF-16 is a niche use case and
shouldn't be a design target for the standard library.

>> If it is the responsibility of the application developer to supply an
>> implementation of the WHATWG Encoding Standard, to my knowledge the
>> fastest and most correct option is to use encoding_rs via C linkage.
>>
>> In that case, what's the value proposition for wrapping it in the API
>> proposed by this paper as opposed to using the API from
>>
>> https://github.com/hsivonen/encoding_c/blob/master/include/encoding_rs_cpp.h
>> updated with C++20 types (notably std::span and char8_t) directly?
>>
>
> The purpose of wrapping this API is to make it standard so that
> everyone doesn't have to keep reimplementing it.
>

Is the value in the API or in the converters? Stroustrup says "What [C++
features] you do use, you couldn’t hand code any better." From this, it
seems to me that at least for obviously wide-applicability operations like
UTF-8 validation, gathering potentially-invalid segments of UTF-8 from an
external source into a valid internal in-RAM UTF-8 representation and
conversions between UTF-8 and UTF-16, which obviously belong in the
standard library, the implementations in the standard library should be the
best ones available. In this light, the suggestion that Bob Steagall plug
in his UTF-8 decoder (https://youtu.be/BdUipluIf1E?t=3198) seems like an
inversion of the "you couldn’t hand code any better" promise.

-- 
Henri Sivonen
hsivonen_at_[hidden]
https://hsivonen.fi/

Received on 2019-10-05 12:33:19