> This also leaves room for a more performant language underneath.
I'm reading this as "someone could provide a mapping class from EBCDIC to
ASCII without doing EBCDIC -> char32_t -> ASCII, which is faster".
Fair enough.
Given that additional dimension of freedom, what should
be standardized here?
In short: we recognize that encoding and decoding one code point at a time is slow. Even so, it lets us pivot through Unicode code points from one encoding to any other encoding on the planet (modulo cutting-edge insanity that does not encode/decode from/to Unicode). That flexibility comes at the price of performance, but that does not mean we have to give up that performance.
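To make that pivot concrete, here is a minimal sketch of the generic one-by-one transcoding loop, assuming hypothetical encoding objects with `decode_one`/`encode_one` members (the real interfaces in the paper may differ):

```cpp
// Illustration only: hypothetical encoding objects with decode_one /
// encode_one members; the real interfaces may differ.
#include <iterator>
#include <span>
#include <vector>

template <typename FromEncoding, typename ToEncoding,
          typename FromErrorHandler, typename ToErrorHandler>
std::vector<typename ToEncoding::code_unit>
pivot_transcode(std::span<const typename FromEncoding::code_unit> input,
                FromEncoding from, ToEncoding to,
                FromErrorHandler from_error, ToErrorHandler to_error) {
    std::vector<typename ToEncoding::code_unit> output;
    while (!input.empty()) {
        // Decode exactly one code point from the source encoding...
        char32_t code_point{};
        input = from.decode_one(input, code_point, from_error);
        // ...then immediately re-encode that code point into the target.
        to.encode_one(code_point, std::back_inserter(output), to_error);
    }
    return output;
}
```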
The fix is to provide customization points that an end-user can use to specialize the cases they are interested in. For example, transcoding between utf_ebcdic and utf8 can be done much faster because UTF-EBCDIC's specification describes an intermediate form called "UTF-8-Mod", which can be converted much faster than pivoting through code points. The customization points are as follows (a rough sketch of a user-provided hook appears after the list below):
For "decode( input, encoding,
output,
error_handler)":
- text_decode( input_range,
output_range,
encoding, error_handler )
For "encode( input, encoding,
output,
error_handler)":
- text_encode( input_range, output_range, encoding, error_handler )
For "decode( input, from_encoding,
output, to_encoding, from_error_handler,
to_error_handler)":
- text_transcode( input_range, from_encoding, output_range, to_encoding, from_error_handler, to_error_handler )
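As a rough illustration of how an end-user would plug in (the namespaces, the `utf_ebcdic`/`utf8` type names, and the exact parameter list are placeholders, not the paper's wording), a `text_transcode` overload placed in the encoding's namespace is found by ADL and takes over for that pair:

```cpp
// Hypothetical user code: "my_lib::utf_ebcdic" and "txt::utf8" are
// placeholder names; only the hook name text_transcode comes from the
// list above.
namespace txt { struct utf8 { /* stand-in for the library's utf8 encoding */ }; }

namespace my_lib {
    struct utf_ebcdic { /* user-written encoding ... */ };

    // Found by ADL because utf_ebcdic lives in my_lib; per the dispatch
    // order below, this overload wins over any default implementation.
    template <typename Input, typename Output,
              typename FromHandler, typename ToHandler>
    void text_transcode(Input&& input, utf_ebcdic /*from*/,
                        Output&& output, txt::utf8 /*to*/,
                        FromHandler&& /*from_error*/, ToHandler&& /*to_error*/) {
        // ... convert through UTF-8-Mod directly instead of pivoting
        //     through char32_t code points ...
    }
}
```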
When writing these functions, the encoding types act as the "strongly typed" tags used to catch the overload. The order of operations in my implementation is as follows:
- if text_{foo} can be called via ADL, call it (will appear in the standard)
- if __internal_text_{foo} can be called via ADL, call it (this is an implementation-specific hook; it will not appear in the paper's specification)
- otherwise, fall back to the default implementation (will appear in the standard)
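A minimal sketch of that ordering for the decode case might look like the following; the `adl_text_decodable` concept is my own shorthand here, not wording from the paper, and the internal-hook and fallback branches are only stubbed out in comments:

```cpp
// Illustration only: a simplified three-step dispatch for decode().
#include <utility>

template <typename Input, typename Encoding, typename Output, typename Handler>
concept adl_text_decodable =
    requires(Input&& in, Encoding&& enc, Output&& out, Handler&& err) {
        // satisfied only if a text_decode overload is reachable via ADL
        text_decode(std::forward<Input>(in), std::forward<Output>(out),
                    std::forward<Encoding>(enc), std::forward<Handler>(err));
    };

template <typename Input, typename Encoding, typename Output, typename Handler>
void decode(Input&& in, Encoding&& enc, Output&& out, Handler&& err) {
    if constexpr (adl_text_decodable<Input, Encoding, Output, Handler>) {
        // 1. the user's ADL hook always wins
        text_decode(std::forward<Input>(in), std::forward<Output>(out),
                    std::forward<Encoding>(enc), std::forward<Handler>(err));
    } else {
        // 2. otherwise an implementation would probe its private
        //    __internal_text_decode hook the same way (omitted here), and
        // 3. finally fall back to the generic code-point-at-a-time loop.
    }
}
```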
We always prioritize the user's hooks over the implementation-specific hooks and the default implementation. This allows the standard library to optimize for ranges and routines that are commonly deployed, or for which it already has an optimized implementation. For example, Windows has WideCharToMultiByte and MultiByteToWideChar: in my implementation, I detect whether a call on Windows uses contiguous ranges with two encodings the implementation recognizes, and if so, we drop into WideCharToMultiByte rather than any default implementation.
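As a very rough, Windows-only sketch of what such a fast path can look like (the function name and the constraints are placeholders, not the real implementation), assuming a contiguous UTF-16 source going to UTF-8:

```cpp
// Windows-only sketch: a placeholder fast path for contiguous UTF-16 input
// going to UTF-8, using the platform routine instead of the generic pivot.
// Error handling and the encoding-tag detection are omitted.
#ifdef _WIN32
#include <windows.h>
#include <concepts>
#include <cstddef>
#include <ranges>
#include <string>

template <std::ranges::contiguous_range Input>
    requires std::same_as<std::ranges::range_value_t<Input>, wchar_t>
std::string fast_utf16_to_utf8(const Input& input) {
    const wchar_t* data = std::ranges::data(input);
    const int size = static_cast<int>(std::ranges::size(input));
    // First call asks how many UTF-8 bytes are needed...
    int needed = WideCharToMultiByte(CP_UTF8, 0, data, size, nullptr, 0, nullptr, nullptr);
    std::string out(static_cast<std::size_t>(needed), '\0');
    // ...second call performs the actual conversion into the buffer.
    WideCharToMultiByte(CP_UTF8, 0, data, size, out.data(), needed, nullptr, nullptr);
    return out;
}
#endif
```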
This is also the motivation for writing the C functions: if we detect contiguous ranges, we use my personal, optimized implementation of the C functions I showed in the presentation at the beginning. (This is part of the reason for writing the C functions in the first place: we need optimized implementations for the encodings the implementation is aware of, and implementers will probably optimize those anyway.) Platforms with optimized ICU, iconv, Bink, etc. implementations can hook the code in much the same way.
There are other customization points as well, but these are the core ones.
>> Oh, and the use of "code units" for one end in the code seems a bit misguided.
>
> I'm missing the context for this statement.
In the demo implementation, there were some types referring to "code points"
and others referring to "code units". I thought "decode_one" always deals
with one code point and consumes/produces as many code units as necessary.
(Note that, in general, both "in" and "out" might need more than one code
unit to represent a code point.)
I use "code units" to signify the units of encoded text. "code points" refers to the other end, which is the units of decoded text. When someone encodes, they want to use the replacement code unit, if possible, because that is guaranteed to fit. A code point of
� may NOT fit in the output stream of encoded text. Additionally, there are encodings where "?" does not exist as a value either, and so we cannot just pick these two code points (
� and ? ) and declare them the only replacement units that can be used. It is inherently a property of the encoding: therefore, we let users who write their own (wonky) encodings get a say by having those constexpr variables.
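Sketched concretely (the member names here are illustrative and predate the rename mentioned just below), an encoding type would carry its own replacement sequences as constexpr data, so a wonky encoding can supply whatever actually fits:

```cpp
// Illustrative only: member names are placeholders (and per the rename
// below would become encode_replacement / decode_replacement).
#include <array>

struct ascii_like_encoding {
    using code_unit  = char;
    using code_point = char32_t;

    // What to write into the *encoded* stream when encoding fails:
    // U+FFFD does not fit in an ASCII-family stream, so '?' is used instead.
    static constexpr std::array<code_unit, 1> replacement_code_units{ '?' };

    // What to hand back on the *decoded* side when decoding fails.
    static constexpr std::array<code_point, 1> replacement_code_points{ U'\uFFFD' };
};
```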
That being said, `encode_replacement` and `decode_replacement` are fundamentally better names, and I will make sure to change those in the implementation ASAP.
Is this a bit more clear?
Sincerely,
JeanHeyd