Re: [SG16] P1629 and replacement code units vs replacement code points

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Thu, 6 Feb 2020 17:11:35 +0100

On 06/02/2020 16.03, Corentin Jabot via SG16 wrote:
>
>
> On Thu, Feb 6, 2020, 15:56 Tom Honermann via SG16 <sg16_at_[hidden]> wrote:
>
> In our discussion of P1629 <https://wg21.link/p1629> yesterday (a meeting summary will appear here <https://github.com/sg16-unicode/sg16-meetings#february-5th-2020> in the next few days), I raised the question of why encoding objects provided the ability to specify both replacement code units and replacement code points. I'm afraid I didn't follow the discussion well (I was distracted by kids and pizza delivery...). I'd like to better understand the motivation for both.
>
> My expectation is that only replacement code points should be required. This is based on the following observations:
>
> 1. When encoding, if a provided code point cannot be encoded, a replacement code point (that is guaranteed to be encodable) is encoded. (Encoding can never produce an ill-formed code unit sequence, at least not without a contract violation.)
> 2. When decoding, if a code unit sequence is ill-formed, a replacement code point is produced (and the ill-formed code unit sequence skipped in an encoding dependent way subject to synchronization capabilities).
> 3. When transcoding, if a decoded code point cannot be encoded, the replacement code point from the target encoding is encoded (and that is guaranteed to produce a well-formed code unit sequence).
>
> I don't see where a replacement code unit sequence fits into the above except as a possible optimization to avoid the overhead of encoding the replacement code point (in which case, the replacement code unit sequence had better match how the replacement code point would be encoded).
>
> Could someone please enlighten me? When would a replacement code unit sequence be used?
>
>
> I want to agree with everything.
> I would add that we have a naming problem and would suggest:
>
> * encode_replacement
> * decode_replacement
>
> I noticed the same thing with many names, up to the name of the encoding.
> Is it always implied that one end of the encoder object is Unicode and as such "ascii" implies "ascii <-> Unicode code points"?

I also wondered about that.

Since we now have char32_t in the standard, which is guaranteed UTF-32 / Unicode,
it would seem to be a simplification to always have one end as Unicode.
(Further, Unicode is guaranteed-lossless, so this doesn't lose anything.)
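
To make that concrete, here is a minimal sketch of an encoding object whose
one end is always char32_t. This is not P1629's actual interface: the names
ascii_encoding, can_encode, encode_one, decode_one, encode and decode are
hypothetical, and encode_replacement / decode_replacement merely borrow the
naming suggested above. The choice of U+003F and U+FFFD as replacements is
also an assumption. Under those assumptions, only replacement code points
are needed:

```cpp
#include <string>
#include <string_view>

// Hypothetical encoding object: one end is always Unicode code points
// (char32_t), the other end is this encoding's code units (char).
struct ascii_encoding {
    using code_unit = char;

    // Code point substituted when a code point cannot be encoded.
    // It must itself be encodable in this encoding.
    static constexpr char32_t encode_replacement = U'?';

    // Code point produced when a code unit sequence is ill-formed.
    static constexpr char32_t decode_replacement = U'\uFFFD';

    static constexpr bool can_encode(char32_t cp) { return cp < 0x80; }

    // Encoding never yields an ill-formed code unit: unencodable code
    // points are replaced before being encoded.
    static constexpr code_unit encode_one(char32_t cp) {
        return static_cast<code_unit>(can_encode(cp) ? cp : encode_replacement);
    }

    // Each ill-formed code unit (>= 0x80) is skipped and reported as
    // the decode replacement code point.
    static constexpr char32_t decode_one(code_unit cu) {
        const auto u = static_cast<unsigned char>(cu);
        return u < 0x80 ? static_cast<char32_t>(u) : decode_replacement;
    }
};

// Code units -> code points.
template <class Encoding>
std::u32string decode(std::basic_string_view<typename Encoding::code_unit> in) {
    std::u32string out;
    for (auto cu : in) out.push_back(Encoding::decode_one(cu));
    return out;
}

// Code points -> code units.
template <class Encoding>
std::basic_string<typename Encoding::code_unit> encode(std::u32string_view in) {
    std::basic_string<typename Encoding::code_unit> out;
    for (auto cp : in) out.push_back(Encoding::encode_one(cp));
    return out;
}
```

Transcoding is then decode followed by encode with the target encoding, so a
code point the target cannot represent (including a U+FFFD produced while
decoding) goes through the target's encode_replacement. A replacement code
unit sequence never appears in the interface.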

Oh, and the use of "code units" for one end in the code seems a bit misguided.

Jens

Received on 2020-02-06 10:14:14