On Thu, Feb 6, 2020, 15:56 Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:

In our discussion of P1629 yesterday (a meeting summary will appear here in the next few days), I raised the question of why encoding objects provided the ability to specify both replacement code units and replacement code points.  I'm afraid I didn't follow the discussion well (I was distracted by kids and pizza delivery...).  I'd like to better understand the motivation for both.

My expectation is that only replacement code points should be required.  This is based on the following observations:

  1. When encoding, if a provided code point cannot be encoded, a replacement code point (that is guaranteed to be encodable) is encoded.  (Encoding can never produce an ill-formed code unit sequence, at least not without a contract violation).
  2. When decoding, if a code unit sequence is ill-formed, a replacement code point is produced (and the ill-formed code unit sequence is skipped in an encoding-dependent way, subject to synchronization capabilities); a concrete sketch of this case follows the list.
  3. When transcoding, if a decoded code point cannot be encoded, the replacement code point from the target encoding is encoded (and that is guaranteed to produce a well-formed code unit sequence).
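
For concreteness, here is a minimal sketch of the decoding case in point 2.  None of the names (decode_result, decode_one_utf8, decode_with_replacement) come from P1629; they are purely illustrative, and the decoder is deliberately simplistic.

    #include <cstddef>
    #include <optional>
    #include <span>
    #include <vector>

    // Hypothetical result of decoding a single code point (not P1629's interface).
    struct decode_result {
        std::optional<char32_t> code_point;  // empty if the sequence was ill-formed
        std::size_t code_units_consumed;     // how far to advance past the input
    };

    // Deliberately minimal UTF-8 decoder, for illustration only.  Requires a
    // non-empty input.  On error it reports how many code units to skip so the
    // caller can resynchronize.
    decode_result decode_one_utf8(std::span<const char8_t> in) {
        auto bad = [](std::size_t skip) { return decode_result{std::nullopt, skip}; };
        char32_t b0 = in[0];
        if (b0 < 0x80) return {b0, 1};
        std::size_t len; char32_t cp, min;
        if      ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
        else return bad(1);                         // invalid lead byte
        if (in.size() < len) return bad(in.size()); // truncated sequence
        for (std::size_t i = 1; i < len; ++i) {
            if ((in[i] & 0xC0) != 0x80) return bad(i);  // resynchronize at the bad byte
            cp = (cp << 6) | (in[i] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (0xD800 <= cp && cp <= 0xDFFF)) return bad(len);
        return {cp, len};
    }

    // Observation 2: every ill-formed code unit sequence becomes the replacement
    // code point, so decoding as a whole can never fail.
    std::vector<char32_t> decode_with_replacement(std::span<const char8_t> input,
                                                  char32_t replacement = U'\uFFFD') {
        std::vector<char32_t> output;
        while (!input.empty()) {
            decode_result r = decode_one_utf8(input);
            output.push_back(r.code_point.value_or(replacement));
            input = input.subspan(r.code_units_consumed);
        }
        return output;
    }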

I don't see where a replacement code unit sequence fits into the above except as a possible optimization to avoid the overhead of encoding the replacement code point (in which case, the replacement code unit sequence had better match how the replacement code point would be encoded).
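
If the code unit form exists only as that optimization, one way to guarantee the two notions can never disagree is to derive the cached code unit sequence from the replacement code point itself.  A minimal sketch, assuming a hypothetical Encoding interface with encode() and replacement_code_point() members (again, not P1629's names):

    #include <vector>

    // Caches the encoded form of an encoding's replacement code point so that
    // error handling does not re-encode it on every failure.  Because the code
    // units are produced from the replacement code point, they cannot drift
    // out of sync with it.
    template <typename Encoding>
    class replacement_cache {
    public:
        explicit replacement_cache(const Encoding& enc)
            : units_(enc.encode(enc.replacement_code_point())) {}  // guaranteed to succeed

        // Code units to emit in place of a code point that cannot be encoded.
        const std::vector<typename Encoding::code_unit>& code_units() const { return units_; }

    private:
        std::vector<typename Encoding::code_unit> units_;
    };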

Could someone please enlighten me?  When would a replacement code unit sequence be used?


I agree with all of the above.
I would add that we have a naming problem and would suggest:

* encode_replacement
* decode_replacement

I noticed the same ambiguity with many of the names, up to and including the name of the encoding itself.
Is it always implied that one end of the encoding object is Unicode, so that "ascii" really means "ascii <-> Unicode code points"?
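
To make the naming suggestion concrete (everything below is hypothetical illustration, not proposed wording): the intent is that the "decoded" side of an encoding object is always Unicode code points, and the replacements are named by the operation they apply to.

    // Hypothetical shape of an encoding object, purely to illustrate the naming;
    // this is not P1629's interface.  The decoded side is always Unicode code
    // points, the encoded side is the encoding's own code units (here, ASCII),
    // so the name "ascii" really describes ascii <-> Unicode code points.
    struct ascii {
        using code_unit  = char;      // encoded side
        using code_point = char32_t;  // decoded side: always a Unicode scalar value

        // Code point produced when an ill-formed code unit sequence is decoded;
        // the decoded side is Unicode, so U+FFFD is always available.
        static constexpr code_point decode_replacement = U'\uFFFD';

        // Code point encoded when an input code point has no ASCII representation;
        // it must itself be encodable, so U+FFFD would not work here.
        static constexpr code_point encode_replacement = U'?';

        // encode()/decode() members omitted; only the naming matters here.
    };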