Re: [SG16] P1629 and replacement code units vs replacement code points

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Thu, 6 Feb 2020 19:09:29 +0100
On 06/02/2020 17.38, Tom Honermann wrote:
> On 2/6/20 11:11 AM, Jens Maurer via SG16 wrote:
>> On 06/02/2020 16.03, Corentin Jabot via SG16 wrote:
>>>
>>> On Thu, Feb 6, 2020, 15:56 Tom Honermann via SG16 <sg16_at_[hidden]> wrote:
>>>
>>> In our discussion of P1629 <https://wg21.link/p1629> yesterday (a meeting summary will appear here <https://github.com/sg16-unicode/sg16-meetings#february-5th-2020> in the next few days), I raised the question of why encoding objects provided the ability to specify both replacement code units and replacement code points. I'm afraid I didn't follow the discussion well (I was distracted by kids and pizza delivery...). I'd like to better understand the motivation for both.
>>>
>>> My expectation is that only replacement code points should be required. This is based on the following observations:
>>>
>>> 1. When encoding, if a provided code point cannot be encoded, a replacement code point (that is guaranteed to be encodable) is encoded. (Encoding can never produce an ill-formed code unit sequence, at least not without a contract violation).
>>> 2. When decoding, if a code unit sequence is ill-formed, a replacement code point is produced (and the ill-formed code unit sequence skipped in an encoding dependent way subject to synchronization capabilities).
>>> 3. When transcoding, if a decoded code point cannot be encoded, the replacement code point from the target encoding is encoded (and that is guaranteed to produce a well-formed code unit sequence).
>>>
>>> I don't see where a replacement code unit sequence fits into the above except as a possible optimization to avoid the overhead of encoding the replacement code point (in which case, the replacement code unit sequence had better match how a replacement code point sequence would be encoded).
>>>
>>> Could someone please enlighten me? When would a replacement code unit sequence be used?
>>>
>>>
>>> I want to agree with everything.
>>> I would add that we have a naming problem and would suggest:
>>>
>>> * encode_replacement
>>> * decode_replacement
>>>
>>> I noticed the same thing with many names, up to the name of the encoding.
>>> Is it always implied that one end of the encoder object is Unicode and as such "ascii" implies "ascii<->Unicode code points"?
>> I also wondered about that.
>>
>> Since we now have char32_t in the standard, which is guaranteed UTF-32 / Unicode,
> We now guarantee that char32_t character and string literals are UTF-32
> encoded, but we don't have wording stating that char32_t objects always
> hold a Unicode code point value (though I would recommend against using
> them for other purposes).

Sure, you can also hold random stuff in a "char", but you shouldn't.

>> it would seem to be a simplification to always have one end as Unicode.
>> (Further, Unicode is guaranteed-lossless, so this doesn't lose anything.)
>
> That would be a simplification in some respects, but would be strange in
> others.
>
> const char *ebcdic_text = "A";
> assert(decode_one(ebcdic_text, std::text::ebcdic) == 'A'); // Fails
> assert(decode_one(ebcdic_text, std::text::ebcdic) == U'A'); // Succeeds

(Assuming an EBCDIC execution character set.)

> This also leaves room for a more performant language underneath.

I'm reading this as "someone could provide a mapping class from EBCDIC to
ASCII without doing EBCDIC -> char32_t -> ASCII, which is faster".

Fair enough.
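
To make that trade-off concrete, here is a minimal sketch of the two routes.
Every name in it (ebcdic_to_unicode, unicode_to_ascii,
ebcdic_to_ascii_via_unicode, ebcdic_to_ascii_direct) is hypothetical and
not taken from P1629 or its demo implementation. It assumes EBCDIC code
page 037 and covers only the space, the uppercase letters and the digits.
Everything else maps to '?'.

```cpp
#include <array>
#include <cassert>
#include <optional>

// Hypothetical: decode one EBCDIC (code page 037) code unit to a code point.
// Only space, A-Z and 0-9 are covered in this sketch.
inline std::optional<char32_t> ebcdic_to_unicode(unsigned char cu) {
    if (cu == 0x40) return U' ';
    if (cu >= 0xC1 && cu <= 0xC9) return static_cast<char32_t>(U'A' + (cu - 0xC1));
    if (cu >= 0xD1 && cu <= 0xD9) return static_cast<char32_t>(U'J' + (cu - 0xD1));
    if (cu >= 0xE2 && cu <= 0xE9) return static_cast<char32_t>(U'S' + (cu - 0xE2));
    if (cu >= 0xF0 && cu <= 0xF9) return static_cast<char32_t>(U'0' + (cu - 0xF0));
    return std::nullopt;
}

// Hypothetical: encode one code point as a single ASCII code unit.
inline std::optional<char> unicode_to_ascii(char32_t cp) {
    if (cp <= 0x7F) return static_cast<char>(cp);
    return std::nullopt;
}

// Route 1: EBCDIC -> char32_t -> ASCII, with 0x3F ('?') as the replacement.
inline char ebcdic_to_ascii_via_unicode(unsigned char cu) {
    const char replacement = '\x3F';
    std::optional<char32_t> cp = ebcdic_to_unicode(cu);
    if (!cp) return replacement;
    std::optional<char> out = unicode_to_ascii(*cp);
    return out ? *out : replacement;
}

// Route 2: a direct mapping, here a 256-entry table built once from route 1.
inline char ebcdic_to_ascii_direct(unsigned char cu) {
    static const std::array<char, 256> table = [] {
        std::array<char, 256> t{};
        for (int i = 0; i < 256; ++i)
            t[i] = ebcdic_to_ascii_via_unicode(static_cast<unsigned char>(i));
        return t;
    }();
    return table[cu];
}
```

Both routes give the same answer for every input. The direct one is a
single table lookup per code unit and never materializes a char32_t.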

Given that additional dimension of freedom, what should
be standardized here?

>> Oh, and the use of "code units" for one end in the code seems a bit misguided.
>
> I'm missing the context for this statement.

In the demo implementation, there were some types referring to "code points"
and others referring to "code units". I thought "decode_one" always deals
with one code point and consumes/produces as many code units as necessary.
(Note that, in general, both "in" and "out" might need more than one code
unit to represent a code point.)
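
As an illustration of the shape I have in mind, here is a minimal sketch of
a decode step for UTF-8. The names decode_result and decode_one_utf8 are
hypothetical and are not the demo implementation's interface. It produces
exactly one code point per call and reports how many code units it
consumed. On ill-formed input it yields U+FFFD and skips a single code
unit, which is a simplified resynchronization policy chosen for this
sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

struct decode_result {
    char32_t code_point;   // decoded code point, or U+FFFD on ill-formed input
    std::size_t consumed;  // number of code units consumed from the input
};

// Hypothetical decode step: one code point out, 1 to 4 code units in.
inline decode_result decode_one_utf8(std::string_view in) {
    constexpr char32_t replacement = U'\uFFFD';
    if (in.empty()) return {replacement, 0};

    const unsigned char b0 = static_cast<unsigned char>(in[0]);
    if (b0 < 0x80) return {static_cast<char32_t>(b0), 1};

    std::size_t len = 0;
    char32_t cp = 0;
    char32_t min = 0;  // smallest code point allowed for this length
    if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
    else return {replacement, 1};  // stray continuation or invalid lead byte

    if (in.size() < len) return {replacement, 1};  // truncated sequence

    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        if ((b & 0xC0) != 0x80) return {replacement, 1};  // not a continuation
        cp = (cp << 6) | (b & 0x3F);
    }

    // Reject overlong forms, surrogates and values beyond U+10FFFF.
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return {replacement, 1};

    return {cp, len};
}
```

The encoding direction would mirror this: one code point in, and as many
code units out as the target encoding needs.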

Jens

Received on 2020-02-06 12:12:11