sg16: Re: [SG16] P1629 and replacement code units vs replacement code points

From: JeanHeyd Meneide <phdofthehouse_at_[hidden]>
Date: Thu, 6 Feb 2020 15:45:36 -0500

Dear SG16,

On Thu, Feb 6, 2020 at 1:09 PM Jens Maurer via SG16 <sg16_at_[hidden]>
wrote:

> > This also leaves room for a more performant language underneath.
>
> I'm reading this as "someone could provide a mapping class from EBCDIC to
> ASCII without doing EBCDIC -> char32_t -> ASCII, which is faster".
>
> Fair enough.
>
> Given that additional dimension of freedom, what should
> be standardized here?
>

     I talk about it in the larger presentations here (
https://www.youtube.com/watch?v=BdUipluIf1E&t=2663, timed link).

     In short, we recognize that encoding and decoding one code point at a
time is a slow process. Even if it is slow, it enables us to pivot through
Unicode Code Points from any one encoding to any other encoding on the
planet (modulo cutting edge insanity that does not encode/decode from/to
Unicode). That flexibility comes at the price of performance, but that does
not mean we have to give up on performance.

     The fix is to provide customization points that an end-user can use to
specialize the cases they are interested in. For example, transcoding
between utf_ebcdic and utf8 can be done much faster because utf_ebcdic has
an intermediary encoding step called "utf-8 mod", described in its
specification, that can be much faster than going through code points to
reach utf8. The customization points are as follows:

For "decode( input, encoding, output, error_handler)":
    - text_decode( input_range, output_range, encoding, error_handler )
For "encode( input, encoding, output, error_handler)":
    - text_encode( input_range, output_range, encoding, error_handler )
For "transcode( input, from_encoding, output, to_encoding,
from_error_handler, to_error_handler)":
    - text_transcode( input_range, from_encoding, output_range,
to_encoding, from_error_handler, to_error_handler )
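
     A minimal sketch of what a user-written text_transcode hook could look
like, using the parameter order listed above. The tag types `ascii` and
`latin1`, the namespace `my_lib`, the handler signature (a callable that
returns one replacement code unit), and the helper `ascii_to_latin1` are all
assumptions made for illustration; they are not taken from the actual
implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

namespace my_lib {
    // Hypothetical encoding tag types. They carry no data; they only
    // select the overload.
    struct ascii {};
    struct latin1 {};

    // User-provided hook. Because `ascii` and `latin1` live in my_lib, an
    // unqualified call to text_transcode finds this overload through ADL.
    // ASCII is a subset of Latin-1, so valid input is copied byte for byte
    // without going through code points. Returns the number of replacements.
    template <typename FromHandler, typename ToHandler>
    std::size_t text_transcode(std::string_view input, ascii,
                               std::string& output, latin1,
                               FromHandler&& on_decode_error, ToHandler&&) {
        std::size_t replaced = 0;
        for (unsigned char c : input) {
            if (c < 0x80) {
                output.push_back(static_cast<char>(c));
            } else {
                // Not ASCII: ask the (assumed) handler for a replacement unit.
                output.push_back(on_decode_error(c));
                ++replaced;
            }
        }
        return replaced;
    }
}

// Usage: the call is unqualified, so the hook is picked up via ADL.
inline std::string ascii_to_latin1(std::string_view in) {
    std::string out;
    auto replace = [](unsigned char) { return '?'; };
    text_transcode(in, my_lib::ascii{}, out, my_lib::latin1{}, replace, replace);
    assert(out.size() == in.size());  // one output unit per input unit
    return out;
}
```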

When writing these functions, the encoding types act as the "strongly
typed" tags used to catch the overload. The order of operations in my
implementation is as follows:

     - if text_{foo} can be called via ADL, call it (will appear in
standard)
     - if __internal_text_{foo} can be called via ADL, call it (this is an
implementation-specific hook, will not appear in Paper Specification)
     - otherwise, default implementation (will appear in standard)

     We always prioritize the user's hooks over the implementation-specific
hooks and the default implementation. The implementation-specific hook, in
turn, allows the standard library to optimize for ranges and routines that
are commonly deployed or for which it already has an optimized
implementation. For example, Windows has WideCharToMultiByte and
MultiByteToWideChar: in my implementation I detect if a call on Windows
uses "contiguous ranges", and if it does so with 2 encodings that are
recognized by the implementation, we drop into using those functions
rather than any default implementation.

      This is also the motivation for writing the C functions: if we detect
contiguous ranges, we use my personal optimized implementation of the C
functions I showed in the presentation linked at the beginning. (This is
also part of the reason for writing the C functions in the first place: we
need optimized implementations for encodings that the implementation is
aware of and that implementers will probably optimize anyway.) Platforms
with optimized ICU or iconv or Bink or etc. implementations can hook the
code in much the same way.
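
     A rough, portable sketch of the "contiguous range" fast path described
above. Everything named here is an assumption for illustration:
`c_latin1_to_utf32` stands in for an optimized C-style routine (or a
platform call such as WideCharToMultiByte), and `is_contiguous_chars` only
approximates contiguity in C++17 by checking that `std::data` yields a
`const char*`; C++20 code could use `std::ranges::contiguous_range` instead.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>    // only needed for the non-contiguous case in the usage note
#include <string>
#include <type_traits>
#include <utility>

namespace fastpath {
    // Hypothetical stand-in for an optimized pointer-based C routine.
    // Latin-1 byte values are identical to their Unicode code points.
    inline std::size_t c_latin1_to_utf32(const unsigned char* src, std::size_t n,
                                         char32_t* dst) {
        for (std::size_t i = 0; i < n; ++i) dst[i] = src[i];
        return n;
    }

    // True when std::data(range) exists and returns const char*,
    // i.e. the input is a contiguous buffer of char.
    template <typename R, typename = void>
    struct is_contiguous_chars : std::false_type {};
    template <typename R>
    struct is_contiguous_chars<
        R, std::void_t<decltype(std::data(std::declval<const R&>()))>>
        : std::is_same<decltype(std::data(std::declval<const R&>())), const char*> {};

    // Returns true if the pointer-based fast path was taken.
    template <typename Input>
    bool decode_latin1(const Input& input, std::u32string& output) {
        if constexpr (is_contiguous_chars<Input>::value) {
            const std::size_t n = std::size(input);
            output.resize(n);
            const std::size_t written = c_latin1_to_utf32(
                reinterpret_cast<const unsigned char*>(std::data(input)), n,
                output.data());
            assert(written == n);
            (void)written;  // silence unused warning when NDEBUG is defined
            return true;
        } else {
            // Generic path: walk the range one element at a time.
            output.clear();
            for (char c : input) {
                output.push_back(static_cast<unsigned char>(c));
            }
            return false;
        }
    }
}
```

With this sketch, a `std::string` argument takes the pointer-based branch,
while a `std::list<char>` holding the same bytes falls back to the
element-by-element loop and produces the same output.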

     There are other customization points as well, but these are the core
ones.

>> Oh, and the use of "code units" for one end in the code seems a bit
> misguided.
> >
> > I'm missing the context for this statement.
>
> In the demo implementation, there were some types referring to "code
> points"
> and others referring to "code units". I thought "decode_one" always deals
> with one code point and consumes/produces as many code units as necessary.
> (Note that, in general, both "in" and "out" might need more than one code
> unit to represent a code point.)
>

     I use "code units" to signify the units of encoded text. "Code points"
refers to the other end, which is the units of decoded text. When someone
encodes, they want to use the replacement code unit, if possible, because
that is guaranteed to fit. A code point of � (U+FFFD) may NOT fit in the
output stream of encoded text. Additionally, there are encodings where "?"
does not exist as a value either, and so we cannot just pick these two code
points ( � and ? ) and declare them the only replacement units that can be
used. The replacement is inherently a property of the encoding: therefore,
we let users who write their own (wonky) encodings get a say by having
those constexpr variables.
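
     To make the distinction concrete, here is a small sketch of encodings
that each declare their own replacements as constexpr variables. The member
names `replacement_code_unit` and `replacement_code_point`, the `can_encode`
helper, and the `digits_only` encoding are hypothetical and chosen only for
this illustration; the real implementation may spell them differently.

```cpp
#include <cassert>
#include <string>
#include <string_view>

namespace repl {
    // 7-bit ASCII: '?' exists, so it can serve as the encoded replacement.
    struct ascii {
        static constexpr char replacement_code_unit = '?';            // encode side
        static constexpr char32_t replacement_code_point = U'\uFFFD'; // decode side
        static constexpr bool can_encode(char32_t cp) { return cp < 0x80; }
    };

    // A deliberately wonky encoding that only knows the digits 0-9.
    // Neither U+FFFD nor '?' is representable, so it must pick its own unit.
    struct digits_only {
        static constexpr char replacement_code_unit = '0';
        static constexpr char32_t replacement_code_point = U'\uFFFD';
        static constexpr bool can_encode(char32_t cp) {
            return cp >= U'0' && cp <= U'9';
        }
    };

    // Encoding: anything unrepresentable becomes the encoding's own
    // replacement code unit, which is guaranteed to fit in the output.
    template <typename Encoding>
    std::string encode_lossy(std::u32string_view input) {
        std::string out;
        for (char32_t cp : input) {
            out.push_back(Encoding::can_encode(cp)
                              ? static_cast<char>(cp)
                              : Encoding::replacement_code_unit);
        }
        return out;
    }

    // Decoding: invalid code units become the replacement code point.
    template <typename Encoding>
    std::u32string decode_lossy(std::string_view input) {
        std::u32string out;
        for (unsigned char cu : input) {
            const char32_t cp = cu;
            out.push_back(Encoding::can_encode(cp)
                              ? cp
                              : Encoding::replacement_code_point);
        }
        return out;
    }
}
```

Under these assumptions, encoding "4x2" with `digits_only` substitutes '0'
for the unrepresentable 'x', while `ascii` would substitute '?' for a
non-ASCII code point.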

     That being said, `encode_replacement` and `decode_replacement` are
fundamentally better names, and I will make sure to change those in the
implementation ASAP.

     Is this a bit more clear?

Sincerely,
JeanHeyd

Received on 2020-02-06 14:48:27