Re: [SG16-Unicode] Comments on D1629R1 Standard Text Encoding

From: JeanHeyd Meneide <phdofthehouse_at_[hidden]>
Date: Thu, 5 Sep 2019 23:40:34 -0400
Dear Henri,

     Apologies for taking so long to get back to you; thank you so much for
the detailed feedback. I'll do my best to answer everything. Thoughts are a
bit scattered, so feel free to ask if something doesn't make any sense.

     Thank you for taking the time to go through everything. This has been
very helpful and I have a lot of work to do!

Best Wishes,
JeanHeyd

On Sat, Aug 17, 2019 at 3:51 PM Henri Sivonen <hsivonen_at_[hidden]> wrote:

> Why is transliterating mentioned? It seems serious scope creep
> compared to character encoding conversion.
>

Sorry; I need to go through and use the proper term -- transcoding -- for
what is being done here. This paper intends to only concern itself with
transcoding and a tiny bit of generalized text transformation (e.g.,
normalization).


>
> > 2.1. The Goal
> >
> > int main () {
> > using namespace std::literals;
> > std::text::u8text my_text =
> std::text::transcode<std::text::utf8>(“안녕하세요 👋”sv);
> > std::cout << my_text << std::endl; // prints 안녕하세요 👋 to a capable
> console
>
> This does not look like a compelling elevator pitch, since with modern
> terminal emulators, merely `fwrite` to `stdout` with a u8 string
> literal works.
>
> Here's what I'd like to see sample code for:
>
> ...
>

I certainly need a wide body of examples, but that's not going to fit in
the initial proposal. At least, not in that version; the next version
(which will probably be published post-Belfast) will have much more
implementation and projects behind it.


> I think it's fair to characterize the kernel32.dll conversion
> functions as "provably fast", but I find it weird to put iconv in that
> category. The most prominent implementation is the GNU one based on
> the glibc function of the same name, which prioritizes extensibility
> by shared object over performance. Based on the benchmarking that I
> conducted (https://hsivonen.fi/encoding_rs/#results), I would not
> characterize iconv as "provably fast".
>

That's fair, but it does cover a wide variety of encodings and is the
backbone of many *nix programs and systems, including GCC. I should split
that sentence up into "provably fast" and "full of features".


> > 3. Design
>
> > study of ICU’s interface
>
> Considering that Firefox was mentioned in the abstract, it would make
> sense to study its internal API.
>

...


Thank you for the links. I have read some of these before, but not all of
them. Most of what I have read is CopperSpice's API for encoding,
libogonek's documentation and source, text_view's source and examples,
Boost.Text's documentation and source, my own work, and many of the
proposals that have come before this. I'll make sure to give a good look
over the Firefox internals plus the Rust transcoder you built.

> Consider the usage of the low-level facilities laid out below:
>
> I think decoding to UTF-32 is not a good example if we want to promote
> UTF-8 as the application internal encoding. Considering that Shift_JIS
> tends to come up as a reason not to go UTF-8-only in various
> situations, I think showing conversion from Shift_JIS (preferably
> discovered dynamically at runtime) to UTF-8 would make more sense.
>

Agreed, more examples are good.

> > if (std::empty(result.input)) {
> > break;
> > }
>
> How does this take into account the input ending with an incomplete
> byte sequence?
>

Error reporting is done by the error handler, which you commented on below.
An incomplete sequence is handled by the encoding error check. The example
in the paper does not include lots of handling: the default text error
handler is the replacement text error handler, and it would blow up the
assertion with a failure. I should include more examples of such.
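
To make that shape concrete, here is a rough sketch of how a caller can tell
"I ran out of bytes mid-sequence" apart from "this is malformed". All of the
names and the toy decoder below are invented for illustration; they are not
the paper's exact API:

#include <cassert>
#include <cstddef>
#include <string_view>

enum class encoding_error {
    ok,
    incomplete_sequence,   // input ended in the middle of a code unit sequence
    invalid_sequence,      // input is malformed
    insufficient_output    // output buffer too small
};

struct decode_step_result {
    std::u8string_view input;   // whatever was not consumed
    encoding_error error_code;
};

// Toy single-step "decoder": handles ASCII, and flags a multi-byte lead with
// missing continuation bytes as incomplete rather than invalid. Continuation
// byte validation is deliberately elided.
decode_step_result decode_one(std::u8string_view in) {
    if (in.empty()) return {in, encoding_error::ok};
    unsigned char lead = static_cast<unsigned char>(in.front());
    if (lead < 0x80) return {in.substr(1), encoding_error::ok};
    std::size_t need = lead >= 0xF0 ? 4 : lead >= 0xE0 ? 3 : 2;
    if (in.size() < need) return {in, encoding_error::incomplete_sequence};
    return {in.substr(need), encoding_error::ok};
}

void drain(std::u8string_view in) {
    for (;;) {
        decode_step_result result = decode_one(in);
        if (result.error_code == encoding_error::incomplete_sequence) {
            // stash result.input somewhere and wait for more bytes
            break;
        }
        assert(result.error_code == encoding_error::ok && "malformed input");
        if (result.input.empty()) {
            break;
        }
        in = result.input;
    }
}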


> > On top of eagerly consuming free functions, there needs to be views that
> allow a person to walk some view of storage with a specified encoding.
>
> I doubt that this is really necessary. I think post-decode Unicode
> needs to be iterable by Unicode scalar value (i.e. std::u8string_view
> and std::u16string_view should by iterable by char32_t), but I very
> much doubt that it's worthwhile to provide such iteration directly
> over legacy encodings. Providing such iteration competes over
> implementor attention with SIMD-accelerated conversion of contiguous
> buffers, and I think it's much more important to give attention to the
> latter.
>

The goal is not to provide iteration over legacy encodings. The goal is to
separate the algorithm (decoding/encoding text) from the storage
(std::basic_string<char8_t>, __gnu_cxx::rope<char>, boost::unencoded_rope,
trial::circular_buffer, sg14::ringspan, etc.). The basic algorithm -- if it
does not require more than forward iterators and friends -- should work on
those classes, iterators, and ranges. This provides greater flexibility
of storage options for users and a robust composition story for algorithms.
Having spent a small amount of time contributing to one standard library
and observing the optimizations and specializations put into many of the
already existing standard algorithms, I can assure you that no
implementation will settle for only the default base encoding
versions. (And if not, I have every intention of making sure at least the
libraries I have the power to modify -- libstdc++ and libc++ -- are
improved.)
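
As a tiny illustration of that split (a toy UTF-8 decode loop with error
handling omitted; nothing here is the paper's actual spelling): the loop
only needs forward iteration, so the same algorithm runs over contiguous and
non-contiguous storage alike.

#include <iterator>
#include <list>
#include <string>
#include <vector>

template <typename FwdRange, typename OutIt>
OutIt decode_utf8_to_scalars(const FwdRange& storage, OutIt out) {
    auto first = std::begin(storage);
    auto last  = std::end(storage);
    while (first != last) {
        unsigned char lead = static_cast<unsigned char>(*first++);
        int trailing = lead >= 0xF0 ? 3 : lead >= 0xE0 ? 2 : lead >= 0xC0 ? 1 : 0;
        char32_t cp = trailing == 0 ? lead : (lead & (0x3F >> trailing));
        for (; trailing > 0 && first != last; --trailing)
            cp = (cp << 6) | (static_cast<unsigned char>(*first++) & 0x3F);
        *out++ = cp;
    }
    return out;
}

int main() {
    std::u8string contiguous = u8"abc";
    std::list<char8_t> noncontiguous(contiguous.begin(), contiguous.end());
    std::vector<char32_t> a, b;
    decode_utf8_to_scalars(contiguous, std::back_inserter(a));
    decode_utf8_to_scalars(noncontiguous, std::back_inserter(b));
    // a == b: same algorithm, different storage
}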

There were also previous chances at potential optimization with things like
wstring_convert, which took either pointers or just a basic_string<CharT>
outright. The complaints about these functions were rife and heavy (most of
them due to the dependence on std::locale and much of its virtualized
interface, but many of the implementations also did not implement it correctly (
https://github.com/OpenMPT/openmpt/blob/master/common/mptString.cpp#L587 |
https://sourceforge.net/p/mingw-w64/bugs/538/), let alone with speed in
mind).

Finally, I certainly agree that we want to focus on contiguous interfaces.
But providing nothing at all for other storage types leaves a lot of users
and use cases out in the cold and would require them to manually chunk,
iterate, and poke at their code unit storage facilities. I plan to write a
paper to the C Committee about providing at the very least low-level
conversion utilities for mbs|w -> u8/16/32 and vice-versa.
Unicode-to-Unicode will probably remain the user's responsibility, since
the C libs only own the narrow and wide locale conversions with their
hidden state and thus should be responsible for providing nice conversions
in and out of those. wchar_t is slightly problematic for UTF-16 wchar_t
platforms (Windows, IBM) because wchar_t cannot be a multi-width encoding; it
must be single-width by the standard, and many of its functions bake that
assumption implicitly into their out params and return types.


> > 3.2.1. Error Codes
>
> > ...
>
> Is there really a use case for distinguishing between the types of
> errors beyond saying that the input was malformed and perhaps
> providing identification of which bytes were in error? Historically,
> specs have been pretty bad at giving proper error definitions for
> character encodings. The WHATWG Encoding Standard defines...
>

I sought to create something that would be useful, but I realize that since
the error handler receives the full state of the encoder/decoder it can
likely do its own callouts for specific types of errors. I can agree that
overlong_sequence, etc. might be too much, but the rest of them
(insufficiently sized output buffer, incomplete code unit sequence, etc.)
are all necessary, so a future revision might axe the "informational" error
codes but keep the necessary ones.


> > State& state;
>
> I think the paper could use more explanation of why it uses free
> functions with the state argument instead of encoder and decoder
> objects whose `this` pointer provides the state argument via the
> syntactic sugar associated with methods.
>

This likely needs a lot more explanation and example. But it might change
in the future as I really wrestle with dynamic encodings and
non-default-constructible state types.


>
> > 3.2.2.2. Implementation Challenge: Ranges are not the Sum of their Parts
>
> The paper doesn't go into detail into why Ranges are needed instead of
> spans.
>

This part of the paper was cataloguing an implementation issue that has
since been transferred to a different paper and is likely to be solved soon:
https://wg21.link/p1664 | https://thephd.github.io/reconstructible-ranges

Ranges are used here because working with individual iterators has
consequences for the size of encoding and decoding iterators. libogonek
explored stacking such iterators on top of iterators for decoding and
normalization: the result was not very register or cache friendly due to the
sheer size of the resulting iterators (256+ bytes for a range in some
cases). Ranges allow us to fix this with their concept of a "sentinel".
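
A minimal picture of why the sentinel helps (purely illustrative types, not
the proposed ones): the "end" of such a view does not need to drag a full
copy of the iterator stack around, it just needs to answer "are we done?".

struct null_sentinel {};  // empty: contributes nothing to the range's size

struct cstr_iterator {
    const char* p;
    char operator*() const { return *p; }
    cstr_iterator& operator++() { ++p; return *this; }
    friend bool operator==(cstr_iterator it, null_sentinel) { return *it.p == '\0'; }
};

struct cstr_view {
    const char* p;
    cstr_iterator begin() const { return {p}; }
    null_sentinel end() const { return {}; }
};

static_assert(sizeof(null_sentinel) < sizeof(cstr_iterator),
              "the sentinel end costs (almost) nothing to store");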


> > class assume_valid_handler;
>
> Is this kind of UB invitation really necessary?
>

I did something similar in my first private implementation and it had its
use cases there as well. I've been told and shown that not re-checking
invariants on things people know are clean was useful and provided
meaningful performance improvements in their codebases. I think if I write
more examples showing where error handlers can be used, it would show that
choosing such an error handler is an incredibly conscious decision at the
end of a very verbose function call or template parameter: the cognitive
cost of asking for UB is extraordinarily high when you want it (as it
should be):

std::u8string i_know_its_fine = std::text::transcode("abc",
std::text::latin1{}, std::text::utf8{}, std::text::assume_valid_handler{});

     I can imagine a world where adding "ub" to that name might make it
more obvious what you're potentially opening the door for.


> > The interface for an error handler will look as such:
> >
> > namespace std { namespace text {
> >
> > class an_error_handler {
> > template <typename Encoding, typename InputRange,
> > typename OutputRange, typename State>
> > constexpr auto operator()(const Encoding& encoding,
> > encode_result<InputRange, OutputRange, State> result) const {
> > /* morph result or throw error */
> > return result;
> > }
>
> I think this part needs a lot more explanation of how the error
> handler is allowed to modify the ranges and what happens if the output
> doesn't fit.
>

The implementation does it better, but it's ugly:
https://github.com/ThePhD/phd/blob/master/include/phd/text/error_handler.hpp#L44

You can roll back the range's consumption if you like, you can insert
characters into the stream then return, you can change the returned error
code after inserting replacement characters, etc. It's a very flexible
interface and it was designed to allow for custom behaviors without loss of
(much) information when a person really wanted to dig into what happened.
Templates make it look verbose and ugly, but I am working on a "simpler
error handler" that just takes one or two callables and does something
extremely simple (like returns an optional<code_point>, or lets you throw).
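
For a flavor of the "morph the result" idea against a deliberately
simplified result type (the names here are invented, and the real
decode_result/encode_result carry the ranges and state rather than a raw
pointer):

#include <string>
#include <string_view>

enum class encoding_error { ok, invalid_sequence };

struct simple_decode_result {
    std::u8string_view input;   // what is left to consume
    std::u32string*    output;  // where decoded scalar values go
    encoding_error     error_code;
};

// One possible handler behavior: drop the offending code unit, emit U+FFFD,
// and report the error as handled so the caller keeps going.
struct replace_and_continue {
    template <typename Encoding>
    simple_decode_result operator()(const Encoding&, simple_decode_result r) const {
        if (!r.input.empty()) r.input.remove_prefix(1);
        r.output->push_back(U'\uFFFD');
        r.error_code = encoding_error::ok;
        return r;
    }
};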


>
> > template <typename Encoding, typename InputRange,
> > typename OutputRange, typename State>
> > constexpr auto operator()(const Encoding& encoding,
> > decode_result<InputRange, OutputRange, State> result) const {
> > /* morph result or throw error */
> > return result;
> > }
>
> Custom error handlers for decoding seem unnecessary. Are there truly
> use cases for behaviors other than replacing malformed sequences with
> the REPLACEMENT CHARACTER or stopping conversion upon discovering the
> first malformed sequence?
>
> > Throwing is explicitly not recommended by default by prominent vendors
> and
> > implementers (Mozilla, Apple, the Unicode Consortium, WHATWG, etc.)
>
> I don't want to advocate throwing, but I'm curious: What Mozilla and
> Apple advice is this referring to? Or is this referring to the Gecko
> and WebKit code bases prohibiting C++ exceptions in general?
>

From looking at both private and public codebases and from speaking to
implementers and developers, in SG16 telecons and otherwise. But I should
probably provide more direct citations and quotes here, rather than just
throwing the information out there; apologies, I'll make sure to improve
that for r1 so it's more properly sourced and accurate.


> > For performance reasons and flexibility, the error callable must have a
> way
> > to ensure that the user and implementation can agree on whether or not we
> > invoke Undefined Behavior and assume that the text is valid.
>
> The ability to opt into UB seems dangerous. Are there truly compelling
> use cases for this?
>

I have a handful of experiences where avoiding the checks (and
encoding/decoding without those checks) provided a measurable speedup. At
the time, we had not optimized our functions using vectorized instructions:
maybe doing so would have made such a thing moot. I'll see what the
benchmarks say.

> 3.2.3. The Encoding Object
>
> > using code_point = char32_t;
>
> This looks bad. As I've opined previously
> (https://hsivonen.fi/non-unicode-in-cpp/), I think this should not be
> a parameter. Instead, all encodings should be considered to be
> conceptually decoding to or encoding from Unicode and char32_t should
> be the type for a Unicode scalar value.
>

A lot of people have this comment. I am more okay with having code_point be
a parameter, with the explicit acknowledgement that if someone uses
not-char32_t (not a Unicode Code Point), then nothing above the encoding
level in the standard will work for them (no normalization, no segmentation
algorithms, etc.). I have spoken to enough people who want to provide very
specific encoding stories for legacy applications where this would help.
Even if the encoding facilities work for them, I am very okay with letting
them know that -- if they change this fundamental tenet -- they will lock
themselves out of the rest of the algorithms, the ability to transcode with
much else, and basically the rest of text handling in the Standard.

They get to decide whether or not that's a worthwhile trade. A goal of
working on all this is to make it so they are extremely squeamish about
making that choice, but if they are informed enough, or have a special use
case, they can make the trade-off knowingly.
Encoding objects are incredibly low-level and once the dust settles will
likely only be written by one person or one team in a given org, or in a
publicly available library (e.g., a WHATWG-encoding library for C++). They
should be able to make the tradeoff if they care, but the standard won't
support them: it's a fairly steep punishment.


> > using code_unit = char;
>
> I think it would be better to make the general facility deal with
> decode from bytes and encode to bytes only and then to provide
> conversion from wchar_t, char16_t, or char32_t to UTF-8 and from
> wchar_t and char32_t to UTF-16 as separate non-streaming functions.
>
> > using state = __ex_state;
>
> Does this imply that the same state type is used for encode and
> decode? That's odd.

Also, conceptually, it seems odd that the state is held in an
> "encoding" as opposed to "decoder" and "encoder".
>

I do need to look into having a clear delineation, or perhaps even
separating all encoding objects into encoder and decoder. I haven't had the
time to justify a full split, so maybe just separating the types will be
best.
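
If I go the "just separate the types" route, the shape would be roughly as
follows (hypothetical names; a multi-byte encoding like Shift-JIS can have a
pending lead byte between streaming decode calls, while its encode direction
needs nothing):

struct shift_jis_decode_state { unsigned char pending_lead = 0; };
struct shift_jis_encode_state { /* nothing needed in this direction */ };

class shift_jis {
public:
    using code_unit    = char;
    using code_point   = char32_t;
    using decode_state = shift_jis_decode_state;
    using encode_state = shift_jis_encode_state;
    // decode(...) would take a decode_state&, encode(...) an encode_state&
};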


> > static constexpr size_t max_code_unit_sequence = MB_LEN_MAX;
>
> Does there exist an encoding that is worthwhile to support and for
> which this parameter exceeds 4? Does this value need to be
> parameterized instead of being fixed at 4?
>

MB_LEN_MAX on Windows reports 5, but that might be because it includes the
null terminator, so maybe there is no implementation where it exceeds 4?


> Why aren't there methods for querying for the worst-case output size
> given input size and the current conversion state?
>

This was commented on before, and I need to add it. Almost all encoding
functionality today has it (usually by passing nullptr for the output buffer).
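
Something along these lines is what I have in mind (names invented; the
bound just multiplies by the encoding's advertised maximums):

#include <cstddef>

template <typename Encoding>
constexpr std::size_t max_decode_output(std::size_t input_code_units) {
    // worst case: every code unit begins its own code unit sequence
    return input_code_units * Encoding::max_code_point_sequence;
}

template <typename Encoding>
constexpr std::size_t max_encode_output(std::size_t input_code_points) {
    return input_code_points * Encoding::max_code_unit_sequence;
}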

> static constexpr size_t max_code_point_sequence = 1;
>
> Is this relevant if decode is not supported to UTF-32 and is only
> supported to UTF-8 and to UTF-16? A single Big5 byte sequence can
> decode into two Unicode scalar values, but it happens that a single
> Big5 byte sequence cannot to decode into more than 4 UTF-8 code units
> or into more than 2 UTF-16 code units, which are the normal limits for
> single Unicode scalar values in these encoding forms.
>
> > // optional
> > using is_encoding_injective = std::false_type;
>
> Does this have a compelling use case?
>
> > // optional
> > using is_decoding_injective = std::true_type;
>
> Does this have a compelling use case?
>

This is part of a system wherein users get a compile-time error for
any lossy transcoding they do. As in the example, ASCII is perfectly fine
decoding into Unicode Scalar Values. Unicode Scalar Values are NOT fine
with being encoded into ASCII. Therefore, the following should loudly yell
at you at compile-time, not run-time:

auto this_is_not_fine = std::text::transcode(U"☢️☢️", std::text::ascii{});
// static assertion failed: blah blah blah

The escape hatch is to provide the non-default text encoding handler:

auto still_not_fine_but_whatever = std::text::transcode(U"☢️☢️",
std::text::utf32{}, std::text::ascii{},
std::text::replacement_error_handler{});
// alright, it's your funeral...

This is powered by the typedefs noted above.
is_(decoding/encoding)_injective informs the implementation whether or not
your encoding can losslessly encode from code points to code
units and vice versa. If it can't, it will be sure to loudly scold you if
you use a top-level API that does not explicitly pass an error handler,
which is your way of saying "I know what I'm doing, shut up stdlib".

I have programmed this into an API before, and it was helpful in stopping
people from automatically performing conversions that were bad. See an old
presentation I did on the subject when I first joined SG16 while it was
still informal-text-wg:
https://thephd.github.io/presentations/unicode/sg16/2018.03.07
- ThePhD - a rudimentary unicode abstraction.pdf. It was mildly successful
in smacking people's hands when they wanted to do e.g. utf8 -> latin1, and
made them think twice. The feedback was generally positive. I had a
different API back then, using converting constructors. I don't think
anyone in the standard would be happy with converting constructors, but the
same principles apply to the encode/decode/transcode functions.
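
The machinery behind that scolding does not need to be much more than a
static_assert keyed off the typedef; roughly (simplified, invented names):

#include <type_traits>

struct ascii { using is_encoding_injective = std::false_type; };
struct utf8  { using is_encoding_injective = std::true_type;  };

template <typename To>
void transcode_with_default_handler(/* input, output, ... */) {
    static_assert(To::is_encoding_injective::value,
                  "this conversion can lose data; pass an explicit error "
                  "handler if you really mean it");
    // actual transcoding would happen here
}

// transcode_with_default_handler<utf8>();  // fine
// transcode_with_default_handler<ascii>(); // static assertion fires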


> > // optional
> > code_point replacement_code_point = '0xFFFD';
>
> What's the use case for this? ...
>
> > // optional
> > code_unit replacement_code_unit = '?';
>
> ... That is, what's the use case for this?
>

Not everyone uses ? or U+FFFD as their replacement, and that's pretty much
the sole reason. Whether or not we want to care about those use cases is
another question, and it certainly makes my life easier to toss it out the
window.


> > // encodes exactly one full code unit sequence
> > // into one full code point sequence
> > template <typename In, typename Out, typename Handler>
> > encode_result<In, Out, state> encode(
> > In&& in_range,
> > Out&& out_range,
> > state& current_state,
> > Handler&& handler
> > );
> >
> > // decodes exactly one full code point sequence
> > // into one full code unit sequence
> > template <typename In, typename Out, typename Handler>
> > decode_result<In, Out, state> decode(
> > In&& in_range,
> > Out&& out_range,
> > state& current_state,
> > Handler&& handler
> > );
>
> How do these integrate with SIMD acceleration?
>

They don't. The std::text::decode/encode/transcode free functions are where
specializations for fast processing are meant to kick in. These are the
basic, one-by-one encodings. This is to ensure any given storage can be
iterated over by a basic encoding object. A bit more information can be
found about the different optimization paths in a small presentation I gave
to the Committee about what this paper was trying to do and a few other
things:
https://thephd.github.io/presentations/unicode/sg16/K%C3%B6ln/ThePhD%20-%20K%C3%B6ln%202019%20Standards%20C++%20Meeting%20-%20Catch%20Up.pdf
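
In other words, the specialization hook sits at the free function, which can
see both the encoding pair and the storage type. A crude sketch of the idea
(invented names; the ascii -> utf32 loop below is simply standing in for the
spot where a vectorized kernel would go):

#include <span>
#include <vector>

struct ascii {};
struct utf32 {};

// "Fast path": contiguous ascii -> utf32 is a straight widening loop; a real
// implementation would hand this exact spot to a SIMD routine.
inline std::vector<char32_t> transcode(std::span<const char> in, ascii, utf32) {
    std::vector<char32_t> out;
    out.reserve(in.size());
    for (char c : in) out.push_back(static_cast<unsigned char>(c));
    return out;
}

// Generic path: any other (From, To) pair loops over the encoding objects'
// decode/encode members one code point sequence at a time (declaration only
// in this sketch).
template <typename InRange, typename From, typename To, typename Out>
Out transcode(const InRange& in, From from, To to, Out out);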

> static void reset(state&);
>
> What's the use case for this as opposed to constructing a new object?
>

Not any good use case, really: I should likely change this to just let
someone default-construct and copy over the state. I have to reconcile this
design with encoding objects and states which should conceivably be
non-default-constructible (e.g., they hold a string or enumeration value
that contains some indication of which encoding to use plus any
intermediate conversion state). The minimum API of "state" needs to be more
fleshed out.

> 3.2.3.1. Encodings Provided by the Standard
>
> > namespace std { namespace text {
> >
> > class ascii;
> > class utf8;
> > class utf16;
> > class utf32;
> > class narrow_execution;
> > class wide_execution;
>
> This is rather underwhelming for an application developer wishing to
> consume Web content or email.
>

I agree, but providing the entirety of what the WHATWG wants is something
that should likely be provided by an external library that can keep up with
the cadence of changes. The standard moves slower than a ball
of molasses going uphill, and people are even slower to port despite C++
gaining significant speed in recent standardization efforts and releases.
(The number of people on GCC 4.x and old no-longer-LTS versions of many
Linux distributions for "legacy reasons" is eye-popping and staggering.) We
pick the encodings that:

1) the standard is already responsible for in its entirety (narrow/wide);
2) are old-as-dirt standard and will not change anytime in my lifetime
(utf8, utf32, and utf16 if Aliens don't show up and overflow the allotted
21 bits with their new languages);
3) and, are old-as-dirt standard and provide reasonable speed gains if the
standard can optimize for them (ascii).

Having only these encodings also means that optimizations are much more
feasible for standard library developers once this paper lands than they
would be if they had to implement the full suite of WHATWG encodings.


> On the other hand, the emphasis of the design presented in this paper
> being compile-time specializable seems weird in connection to
> `narrow_execution`, whose implementation needs to be dispatched at
> runtime. Presenting a low-level compile-time specializable interface
> but then offering unnecessarily runtime-dispatched encoding through it
> seems like a layering violation.
>
> > If an individual knows their text is in purely ASCII ahead of time and
> they work in UTF8, this information can be used to bit-blast (memcpy) the
> data from UTF8 to ASCII.
>
> Does one need this API to live dangerously with memcpy? (As opposed to
> living dangerously with memcpy directly.)
>

The idea is that the implementation can safely memcpy because it has
compile-time information that indicates it can do so. Without that
information, it cannot make that guarantee; I want to provide the
implementation that guarantee.
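
That compile-time information could be as simple as a trait over the
encoding pair (the trait and function names here are invented for the
sketch):

#include <cassert>
#include <cstring>
#include <span>
#include <type_traits>

struct ascii {};
struct utf8 {};

template <typename From, typename To>
struct is_bit_compatible : std::false_type {};
// every valid ASCII code unit has the identical bit pattern in UTF-8
template <> struct is_bit_compatible<ascii, utf8> : std::true_type {};

template <typename From, typename To>
void transcode_into(std::span<const char> in, std::span<char8_t> out) {
    if constexpr (is_bit_compatible<From, To>::value) {
        // the "bit-blast": assumes the input really is valid in From
        assert(out.size() >= in.size());
        std::memcpy(out.data(), in.data(), in.size());
    } else {
        // fall back to the generic one-by-one encode/decode loop
    }
}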


> > 3.2.3.2. UTF Encodings: variants?
>
> > both CESU-8 and WTF-8 are documented and used internally for legacy
> reasons
>
> This applies also to wide_execution, utf16, and utf32. (I wouldn't be
> surprised if WTF-8 surpassed UTF-32 in importance in the future.)
>
> I'm not saying that CESU-8 or WTF-8 should be included, but I think
> non-byte-code-unit encodings don't have good justification for being
> in the same interface that is used for consuming external data
>

I will try to think of ways to separate the two APIs. I don't have many
ideas for this yet.


> > More pressingly, there is a wide body of code that operates with char as
> the code unit for their UTF8 encodings. This is also subtly wrong, because
> on a handful of systems char is not unsigned, but signed.
>
> This is a weird remark. Signed char is a misfeature generally and bad
> for a lot of processing besides UTF-8.
>

> > template <typename CharT, bool encode_null, bool
> encode_lone_surrogates>
>
> I don't think it's at all clear that it's a good idea in terms of
> application developer understanding of the issues to enable "UTF-8" to
> be customized like this as opposed to having WTF-8 as a separate
> thing.
>

That's a fair assessment. Some individuals in the meeting where I did a
presentation about this were very keen to have the ability to customize how
UTF8 is handled without having to rewrite the entirety of it themselves.

char is -- much to my great compiler aliasing analysis pains -- still being
used. Some people have std::is_unsigned_v<char> == true for their platform,
and all they care about is their platform, and they write all their UTF8
code using char, and they were already lining up to throw a hissy fit for
their big legacy application and interoperable codebases if we forced the
encoding object to use char8_t exclusively, all the time. I think char is
used far too much and should be swiftly burned out of a lot of APIs, but
the sheer magnitude of current-generation code means that not giving these
individuals an escape hatch is the swiftest way to burn a compatibility
bridge.

But maybe we should burn it...


> > This is a transformative encoding type that takes the source (network)
> endianness
>
> Considering that the "network byte order" is IETF speak for "big
> endian", I think it's confusing to refer to whatever you get from an
> external source in this manner.
>

I will change the wording to just keep "source".


> > This paper disagrees with that supposition and instead goes the route of
> providing a wrapping encoding scheme
>
> FWIW, I believe that conversion from wchar_t, char16_t, or char32_t
> into UTF-8 should not be forced into the same API as conversion from
> external byte-oriented data sources, and I believe that it's
> conceptually harmful to conflate the char16_t-oriented UTF-16 with the
> byte-oriented UTF-16BE and UTF-16LE external encodings.
>

     encoding_scheme<utfX> will require the input value_type to be
std::byte for decoding, and the output value_type will be std::byte for
encoding. std::byte has no implicit conversions, so it's a hard error to
give it anything that's not exactly an input or output range of
std::byte. (For the default implementation, anyhow. The template allows
you to change the Byte type used in encoding_scheme, but at that point
you've opted out of the strict safety and it's your problem now.) That
alleviated my and other people's safety concerns, and encoding_scheme has
already seen successful implementation experience.
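
The boundary looks roughly like this (a sketch only; the real template also
carries the byte order and the wrapped encoding's full interface):

#include <cstddef>

template <typename Encoding>
struct encoding_scheme {
    using code_unit  = std::byte;  // decode consumes bytes, encode produces bytes
    using code_point = typename Encoding::code_point;
    // decode/encode would byte-swap between the source order and the wrapped
    // Encoding's native code unit order; only ranges of std::byte are accepted
};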


> > 3.2.4. Stateful Objects, or Stateful Parameters?
>
> > maintains that encoding objects can be cheap to construct, copy and
> move;
>
> Surely objects that have methods (i.e. state is taken via `this`
> syntactic sugar) can be cheap to construct, copy, and move.
>
> > improves the general reusability of encoding objects by allowing
> state to be massaged into certain configurations by users;
>
> It seems to me that allowing application developers to "massage" the
> state is an anti-feature. What's the use case for this?
>
> > and, allows users to set the state in a public way without having to
> prescribe a specific API for all encoders to do that.
>
> Likewise, what's the use case for this?
>

     The goal here is for when locale dependency or dynamic encodings come
into play. We want to keep the encoding object itself cheap to create and
use, while putting any heavy lifting inside of the state object which will
get passed around explicitly by reference in the low-level API. Encoding
tags, incredibly expensive locale objects, and more can all be placed onto
the state itself, while the encoding object serves as the cheap handle that
allows working with such a state.
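
Concretely, the kind of split I mean (hypothetical names; the point is only
where the expensive parts live):

#include <cwchar>
#include <locale>

struct narrow_execution_state {
    std::locale    loc;     // the expensive, runtime-discovered part
    std::mbstate_t shift{}; // multibyte conversion shift state
};

struct narrow_execution {
    using state = narrow_execution_state;
    // decode/encode take a state& and read everything heavy from it, so the
    // encoding object itself stays trivially cheap to construct and copy
};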

     I would be interested in pursuing the alternate design where the
encoding object just holds all the state, all the time. This means spinning
up a fresh encoding object anytime state needs to be changed, but I can
imagine it still amounting to the same level of work in many cases. I will
be mildly concerned that doing so will front-load things like change in
locale and such to filling the Encoding Object's API, or mandate certain
constructor forms. This same thing happened to wstring_convert, codect_X
and friends, so I am trying to do my best to avoid that pitfall. This also
brings up a very concerning point: if "state" has special members that
can't be easily reconstructed or even copied, how do we handle copying one
encoding from one text object to another? Separating the state means it's
tractable and controllable, not separating it means all of the copyability,
movability, etc. becomes the encoding's concern now.


> > As a poignant example: consider the case of execution encoding character
> > sets today, which often defer to the current locale. Locale is inherently
> > expensive to construct and use: if the standard has to have an encoding
> > that grabs or creates a codecvt or locale member, we will immediately
> lose
> > a large portion of users over the performance drag during construction of
> > higher-level abstractions that rely on the encoding. It is also notable
> that
> > this is the same mistake std::wstring_convert shipped with and is one of
> > the largest contributing reasons to its lack of use and subsequent
> > deprecation (on top of its poor implementation in several libraries, from
> > the VC++ standard library to libc++).
>
> As noted, trying to provide a compile-time specialized API that
> provides access to inherently runtime-discovered encodings seems like
> a layering violation. Maybe the design needs to surface the
> dynamically dispatched nature of these encodings and to see what that
> leads to in terms of the API design.
>

     The encoding object's API is a concept. The member types and
definitions required at compile time are those relating to code
units and code points, but nothing prevents the user from making code_unit
= byte; and code_point = unicode_code_point; -- this is how the desired
encoding_scheme<...> type will work to serialize between an Encoding Object
and a byte-based representation suitable for network transmission. Nothing
stops the API from being pushed to runtime by making all the functions
virtual functions on the Encoding Object; in fact, that is exactly how I
plan to implement an iconv-based example.
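
The runtime-dispatched flavor would look something along these lines (names
invented; the iconv glue would live behind the virtual interface):

#include <cstddef>
#include <memory>
#include <span>

struct dynamic_encoding_impl {
    virtual ~dynamic_encoding_impl() = default;
    virtual std::size_t decode(std::span<const std::byte> in,
                               std::span<char32_t> out) = 0;
};

class runtime_encoding {
public:
    using code_unit  = std::byte;
    using code_point = char32_t;

    explicit runtime_encoding(std::shared_ptr<dynamic_encoding_impl> i)
        : impl(std::move(i)) {}

    std::size_t decode(std::span<const std::byte> in, std::span<char32_t> out) {
        return impl->decode(in, out); // runtime dispatch, e.g. into iconv
    }

private:
    std::shared_ptr<dynamic_encoding_impl> impl;
};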

     We'll see how it pans out. :D

> 3.2.4.1. Self-Synchronizing State
>
> > If an encoding is self-synchronizing, then at no point is there a need
> to refer to a "potentially correct but need to see more" state: the input
> is either wholly correct, or it is not.
>
> Is this trying to say that a UTF-8 decoder wouldn't be responsible for
> storing the prefix of buffer-boundary-crossing byte sequences into its
> internal state and it would be the responsibility of the caller to
> piece the parts together?
>

     The purpose of this is to indicate that a state has no "holdover"
between encoding calls. Whatever it encodes or decodes results in a
complete sequence, and incomplete sequences are left untouched (or a
replacement character is encoded and the sequence skipped over, depending
on the error handler, etc.). This means that function calls end up being
"pure" from the point of view of the encoder and decoder. There were some
useful bits here in detecting when state can be thrown out the window and
created on the fly by the implementation, rather than needing to be
preserved. A
micro-optimization, at best, and likely something that won't be pursued
until most of the other concerns the paper is trying to tackle are polished
up.


> > 3.3.1. Transcoding Compatibility
>
> What are the use cases for this? I suggest to treating generalized
> transcoding as a YAGNI matter, and if someone really needs it, letting
> them pivot via UTF-8 or UTF-16.
>

     The point of this section is to allow for encodings to clue the
implementation in as to whether or not just doing `memcpy` or similar is
acceptable. If my understanding is correct, a lot of the GBK encodings are
bitwise compatible with GB18030. It would make sense for an implementation
to speedily copy this into storage rather than have to roundtrip through
transcoding.


>
> > 3.3.2. Eager, Fast Functions with Customizability
>
> > Users should be able to write fast transcoding functions that the
> standard picks up for their own encoding types. From GB1032
>
> Is 1032 the intended number here?
>

     Nope; this should be GB18030. Thanks.


> > WHATWG encodings
>
> If it is the responsibility of the application developer to supply an
> implementation of the WHATWG Encoding Standard, to my knowledge the
> fastest and most correct option is to use encoding_rs via C linkage.
>
> In that case, what's the value proposition for wrapping it in the API
> proposed by this paper as opposed to using the API from
>
> https://github.com/hsivonen/encoding_c/blob/master/include/encoding_rs_cpp.h
> updated with C++20 types (notably std::span and char8_t) directly?
>

     The purpose of wrapping this API is to make it standard so that
everyone doesn't have to keep reimplementing it. It means that everyone can
write the code in one way and everyone gets the same optimizations,
similarly to how I've demonstrated that by having a single
bit_iterator/bit_view range abstraction, the standard library (or the user)
can optimize it and everyone else can benefit:
https://thephd.github.io/seize-bits-production-gsoc-2019

     The reason we don't just want to have span<T> interfaces is because of
the same flexibility iterators have brought us over time. That doesn't mean
your encoding object must be templated or deal with non-contiguous storage:
I wrote a (mock) encoding object here that only works with vector<T>,
basic_string<T>, and other contiguous containers by hardcoding span:
http://www.open-std.org/pipermail/unicode/2019-August/000633.html

     As noted in that e-mail, hard-coding such things means you can't have
deque<T> or rope<T> or gap_buffer<T> or whatever other kind of
non-contiguous storage, but if you know your payload is always contiguous then
you'll never hit an error. When you do hit a compiler error, you can make
the decision about where you want to apply the flexibility. The standard
library should serve everyone's needs, but there should be room -- and
there will be room -- to slim things down to just what you're interested in.
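
For reference, the slimmed-down shape looks about like this (a sketch in the
spirit of the mock from that e-mail, not its exact code):

#include <cstddef>
#include <span>

struct contiguous_only_utf8 {
    using code_unit  = char8_t;
    using code_point = char32_t;

    // Only spans in and out: a std::list<char8_t> simply will not convert,
    // and that refusal happens at compile time.
    std::size_t decode(std::span<const char8_t> in, std::span<char32_t> out);
    std::size_t encode(std::span<const char32_t> in, std::span<char8_t> out);
};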


> Regardless of the value proposition, writing the requisite glue code
> could be a useful exercise to validate the API proposed in this paper:
> If the glue code can't be written, is there a good reason why not?
> Does the glue code foil the SIMD acceleration?
>
> Also, to validate the API proposed here, it would be a good exercise
> to encode, "with replacement" in the Encoding Standard sense, a string
> consisting of the three Unicode scalar values U+000F, U+2603, and
> U+3042 into ISO-2022-JP and to see what it takes API-wise to get the
> Encoding Standard-compliant result.
>

    I will add an issue to attempt exactly that (after I move the
implementation to a more easily-accessible standalone repository).

> 4. Implementation
>
> > This paper’s r2 hopes to contain more benchmarks
>
> I'd be interested in seeing encoding_rs (built with SIMD enabled, both
> on x86_64 and aarch64) included in the benchmarks. (You can grab build
> code from https://github.com/hsivonen/recode_cpp/ , but `cargo
> --release` needs to be replaced with `cargo --release --features
> simd-accel`, which requires a nightly compiler, to enable SIMD.)
>

I will make sure that's part of the benchmarks when I move the
implementation to a standalone repository.

> [WTF8]
> > Simon Sapin. The WTF-8 encoding. September 26th, 2019. URL:
> https://simonsapin.github.io/wtf-8/
>
> That date can't be right.
>

     Yep, thanks for catching that. I'll fix it in the latest draft.

      Hopefully this was informative enough. I've read this over a few
times, but I might have dropped a sentence or word or two here and there.
I'll do my best to furnish a new paper including all of the feedback and
changing the APIs where applicable.

     Thank you, so much, for your time and effort in this.

Received on 2019-09-06 05:40:49