Re: [SG16] P1629 and replacement code units vs replacement code points

From: Tom Honermann <tom_at_[hidden]>
Date: Thu, 6 Feb 2020 11:38:32 -0500
On 2/6/20 11:11 AM, Jens Maurer via SG16 wrote:
> On 06/02/2020 16.03, Corentin Jabot via SG16 wrote:
>>
>> On Thu, Feb 6, 2020, 15:56 Tom Honermann via SG16 <sg16_at_[hidden]> wrote:
>>
>> In our discussion of P1629 <https://wg21.link/p1629> yesterday (a meeting summary will appear here <https://github.com/sg16-unicode/sg16-meetings#february-5th-2020> in the next few days), I raised the question of why encoding objects provided the ability to specify both replacement code units and replacement code points. I'm afraid I didn't follow the discussion well (I was distracted by kids and pizza delivery...). I'd like to better understand the motivation for both.
>>
>> My expectation is that only replacement code points should be required. This is based on the following observations:
>>
>> 1. When encoding, if a provided code point cannot be encoded, a replacement code point (that is guaranteed to be encodeable) is encoded. (Encoding can never produce an ill-formed code unit sequence, at least not without a contract violation).
>> 2. When decoding, if a code unit sequence is ill-formed, a replacement code point is produced (and the ill-formed code unit sequence skipped in an encoding dependent way subject to synchronization capabilities).
>> 3. When transcoding, if a decoded code point cannot be encoded, the replacement code point from the target encoding is encoded (and that is guaranteed to produce a well-formed code unit sequence).
>>
>> I don't see where a replacement code unit sequence fits into the above except as a possible optimization to avoid the overhead of encoding the replacement code point (in which case, the replacement code unit sequence had better match how the replacement code point would be encoded).
>>
>> Could someone please enlighten me? When would a replacement code unit sequence be used?
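
To make the intended behavior concrete, here is a minimal sketch of observations 1-3 using ASCII as the non-Unicode encoding. None of the names below come from P1629; they are illustrative only.

#include <cassert>
#include <string>
#include <string_view>

// The replacement code point produced when decoding ill-formed input.
constexpr char32_t decode_replacement = U'\uFFFD';
// The replacement code point encoded in place of unencodable code points.
// It must itself be encodable in the target encoding (U+FFFD is not
// ASCII-encodable), hence '?'.
constexpr char32_t ascii_encode_replacement = U'?';

// Observation 2: an ill-formed code unit (for ASCII, any byte >= 0x80)
// decodes to the replacement code point; decoding always yields code points.
std::u32string decode_ascii(std::string_view code_units) {
    std::u32string code_points;
    for (unsigned char cu : code_units) {
        code_points.push_back(cu < 0x80 ? char32_t{cu} : decode_replacement);
    }
    return code_points;
}

// Observations 1 and 3: a code point that cannot be encoded is replaced by a
// replacement code point that is guaranteed to be encodable, so the output
// code unit sequence is always well-formed.
std::string encode_ascii(std::u32string_view code_points) {
    std::string code_units;
    for (char32_t cp : code_points) {
        code_units.push_back(static_cast<char>(
            cp < 0x80 ? cp : ascii_encode_replacement));
    }
    return code_units;
}

int main() {
    // Bytes 0xC3 0xA9 are ill-formed ASCII; each decodes to U+FFFD.
    assert(decode_ascii("a\xC3\xA9") == U"a\uFFFD\uFFFD");
    // U+00E9 cannot be encoded as ASCII; the replacement code point is
    // encoded instead, producing a well-formed code unit sequence.
    assert(encode_ascii(U"a\u00E9") == "a?");
}

In this model, a separate replacement code unit sequence could only be a cached copy of how the replacement code point encodes, which is the optimization question raised above.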
>>
>>
>> I want to agree with everything.
>> I would add that we have a naming problem and would suggest:
>>
>> * encode_replacement
>> * decode_replacement
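
A purely hypothetical sketch of how those names might look on an encoding object (not taken from P1629), just to make the suggestion concrete:

struct ascii_encoding {
    // Substituted for ill-formed code unit sequences when decoding.
    static constexpr char32_t decode_replacement = U'\uFFFD';
    // Encoded in place of code points that cannot be represented; it must
    // itself be encodable in this encoding, so U+FFFD would not work here.
    static constexpr char32_t encode_replacement = U'?';
};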
>>
>> I noticed the same issue with many names, up to and including the name of the encoding itself.
>> Is it always implied that one end of the encoder object is Unicode, and as such that "ascii" implies "ascii <-> Unicode code points"?
> I also wondered about that.
>
> Since we now have char32_t in the standard, which is guaranteed UTF-32 / Unicode,
We now guarantee that char32_t character and string literals are UTF-32
encoded, but we don't have wording stating that char32_t objects always
hold a Unicode code point value (though I would recommend against using
them for other purposes).
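
For instance (illustrative only):

#include <cassert>

int main() {
    char32_t lit = U'\u00E9';   // char32_t literals are guaranteed UTF-32: U+00E9
    assert(lit == 0x00E9);
    char32_t raw = 0x7FFFFFFF;  // well-formed C++, but not a Unicode code point
                                // value; nothing constrains char32_t objects
    (void)raw;
}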
> it would seem to be a simplification to always have one end as Unicode.
> (Further, Unicode is guaranteed-lossless, so this doesn't lose anything.)

That would be a simplification in some respects, but would be strange in
others.

const char *ebcdic_text = "A"; // assuming the ordinary literal encoding is EBCDIC
assert(decode_one(ebcdic_text, std::text::ebcdic) == 'A'); // Fails: 'A' is the EBCDIC code unit value, not a Unicode code point
assert(decode_one(ebcdic_text, std::text::ebcdic) == U'A'); // Succeeds: decoding yields the Unicode code point U+0041

This also leaves room for a more performant language underneath.

> Oh, and the use of "code units" for one end in the code seems a bit misguided.

I'm missing the context for this statement.

Tom.

>
> Jens

Received on 2020-02-06 10:41:10