Re: [SG16] P1629 and replacement code units vs replacement code points

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Thu, 6 Feb 2020 16:03:10 +0100
On Thu, Feb 6, 2020, 15:56 Tom Honermann via SG16 <sg16_at_[hidden]> wrote:

> In our discussion of P1629 <https://wg21.link/p1629> yesterday (a meeting
> summary will appear here
> <https://github.com/sg16-unicode/sg16-meetings#february-5th-2020> in the
> next few days), I raised the question of why encoding objects provided the
> ability to specify both replacement code units and replacement code
> points. I'm afraid I didn't follow the discussion well (I was distracted
> by kids and pizza delivery...). I'd like to better understand the
> motivation for both.
> My expectation is that only replacement code points should be required.
> This is based on the following observations:
> 1. When encoding, if a provided code point cannot be encoded, a
> replacement code point (that is guaranteed to be encodable) is encoded.
> (Encoding can never produce an ill-formed code unit sequence, at least not
> without a contract violation.)
> 2. When decoding, if a code unit sequence is ill-formed, a replacement
> code point is produced (and the ill-formed code unit sequence is skipped in
> an encoding-dependent way, subject to synchronization capabilities).
> 3. When transcoding, if a decoded code point cannot be encoded, the
> replacement code point from the target encoding is encoded (and that is
> guaranteed to produce a well-formed code unit sequence).
> I don't see where a replacement code unit sequence fits into the above,
> except as a possible optimization to avoid the overhead of encoding the
> replacement code point (in which case the replacement code unit sequence
> had better match how a replacement code point sequence would be encoded).
> Could someone please enlighten me? When would a replacement code unit
> sequence be used?

I want to agree with everything.
I would add that we have a naming problem and would suggest:

* encode_replacement
* decode_replacement
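
To make the naming concrete, here is a minimal sketch, assuming a
hypothetical `ascii` encoding object whose other end is always Unicode code
points. None of this is P1629's actual interface: `encodable`, `encode`,
`decode`, `transcode` and `replacement_examples` are names made up for
illustration, and `?` and U+FFFD are only plausible choices of replacement.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical encoding object (an assumption, not the P1629 interface).
struct ascii {
    using code_unit  = char;
    using code_point = char32_t;

    // Code point encoded in place of one that ASCII cannot represent.
    static constexpr code_point encode_replacement = U'?';
    // Code point produced in place of an ill-formed code unit.
    static constexpr code_point decode_replacement = 0xFFFD;

    static constexpr bool encodable(code_point cp) { return cp < 0x80; }
};

// Observation 1: an unencodable code point is swapped for
// encode_replacement, so the output is always well-formed ASCII.
std::string encode(std::u32string_view in) {
    std::string out;
    for (char32_t cp : in) {
        if (!ascii::encodable(cp)) cp = ascii::encode_replacement;
        out.push_back(static_cast<char>(cp));
    }
    return out;
}

// Observation 2: an ill-formed code unit yields decode_replacement and is
// skipped (for ASCII, resynchronizing means moving on by one code unit).
std::u32string decode(std::string_view in) {
    std::u32string out;
    for (char c : in) {
        const auto u = static_cast<unsigned char>(c);
        out.push_back(u < 0x80 ? static_cast<char32_t>(u)
                               : ascii::decode_replacement);
    }
    return out;
}

// Observation 3: when transcoding, a decoded code point that the target
// cannot encode is replaced by the target's encode_replacement.
std::string transcode(std::string_view in) { return encode(decode(in)); }

// Usage: only replacement code points are involved at every step.
inline void replacement_examples() {
    assert(encode(U"caf\u00E9") == "caf?");
    assert(decode("a\xFFz")[1] == ascii::decode_replacement);
    assert(transcode("a\xFFz") == "a?z");
}
```

In this sketch no replacement code unit sequence is ever needed; at most it
could be a cached result of encoding `encode_replacement`.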

I noticed the same thing with many names, up to the name of the encoding.
Is it always implied that one end of the encoder object is Unicode, so that
"ascii" really means "ascii <-> Unicode code points"?

> Tom.
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16

Received on 2020-02-06 09:06:00