sg16: [SG16] P1629 and replacement code units vs replacement code points

From: Tom Honermann <tom_at_[hidden]>
Date: Thu, 6 Feb 2020 09:56:30 -0500

In our discussion of P1629 <https://wg21.link/p1629> yesterday (a
meeting summary will appear here
<https://github.com/sg16-unicode/sg16-meetings#february-5th-2020> in the
next few days), I raised the question of why encoding objects provided
the ability to specify both replacement code units and replacement code
points. I'm afraid I didn't follow the discussion well (I was
distracted by kids and pizza delivery...). I'd like to better understand
the motivation for both.

My expectation is that only replacement code points should be required.
This is based on the following observations:

1. When encoding, if a provided code point cannot be encoded, a
    replacement code point (that is guaranteed to be encodable) is
    encoded. (Encoding can never produce an ill-formed code unit
    sequence, at least not without a contract violation.)
2. When decoding, if a code unit sequence is ill-formed, a replacement
    code point is produced (and the ill-formed code unit sequence is
    skipped in an encoding-dependent way, subject to synchronization
    capabilities).
3. When transcoding, if a decoded code point cannot be encoded, the
    replacement code point from the target encoding is encoded (and that
    is guaranteed to produce a well-formed code unit sequence).
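
To make the three observations concrete, here is a minimal sketch. It
does not use the P1629 interface: single_byte_encoding, can_encode,
encode, decode, transcode, ascii, and latin1 are all hypothetical names
invented for illustration, and '?' is an assumed replacement code point.
Only a replacement code point is stored; no replacement code unit
sequence is needed anywhere.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical single-byte encoding: code points below Limit map to one
// code unit with the same value. Not the P1629 encoding object interface.
template <char32_t Limit>
struct single_byte_encoding {
    // Assumed replacement code point; it must itself be encodable.
    static constexpr char32_t replacement_code_point = U'?';
    static_assert(U'?' < Limit, "replacement code point must be encodable");

    static bool can_encode(char32_t cp) { return cp < Limit; }

    // Observation 1: an unencodable code point is swapped for the
    // replacement code point, so the output is always well-formed.
    static void encode(char32_t cp, std::string& out) {
        if (!can_encode(cp)) cp = replacement_code_point;
        assert(can_encode(cp));  // contract: the replacement is encodable
        out.push_back(static_cast<char>(cp));
    }

    // Observation 2: an ill-formed code unit produces the replacement
    // code point, and the offending code unit is skipped.
    static char32_t decode(const std::string& in, std::size_t& pos) {
        unsigned char cu = static_cast<unsigned char>(in[pos++]);
        if (cu < Limit) return cu;
        return replacement_code_point;
    }
};

using ascii = single_byte_encoding<0x80>;
using latin1 = single_byte_encoding<0x100>;

// Observation 3: transcoding is decode followed by encode, so a code
// point the target cannot represent becomes the target's replacement.
template <class From, class To>
std::string transcode(const std::string& in) {
    std::string out;
    for (std::size_t pos = 0; pos < in.size();) {
        To::encode(From::decode(in, pos), out);
    }
    return out;
}
```

With these definitions, transcode<latin1, ascii>("caf\xE9") yields
"caf?": U+00E9 decodes fine from Latin-1 but cannot be encoded as ASCII,
so the ASCII replacement code point is encoded in its place.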

I don't see where a replacement code unit sequence fits into the above
except as a possible optimization to avoid the overhead of encoding the
replacement code point (in which case, the replacement code unit
sequence had better match how the replacement code point would be
encoded).
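
As a sketch of that optimization and the consistency requirement it
implies, assume UTF-8 with U+FFFD as the replacement code point. The
names is_scalar_value, encode_utf8, replacement_code_units, and
encode_or_replace are hypothetical and not taken from P1629.

```cpp
#include <cassert>
#include <string>

// True for Unicode scalar values (no surrogates, at most U+10FFFF).
bool is_scalar_value(char32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Minimal UTF-8 encoder; the caller must pass a scalar value.
std::string encode_utf8(char32_t cp) {
    assert(is_scalar_value(cp));  // contract: input is encodable
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical cached replacement code unit sequence: the UTF-8
// encoding of U+FFFD, stored so the encoder need not recompute it.
const std::string replacement_code_units = "\xEF\xBF\xBD";

// Fast path: emit the cached code units instead of encoding U+FFFD.
std::string encode_or_replace(char32_t cp) {
    if (!is_scalar_value(cp)) return replacement_code_units;
    return encode_utf8(cp);
}
```

The shortcut is only sound if replacement_code_units equals
encode_utf8(0xFFFD); if the two ever disagree, the result of encoding
depends on which path was taken.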

Could someone please enlighten me? When would a replacement code unit
sequence be used?

Tom.

Received on 2020-02-06 08:59:08