Date: Thu, 6 Feb 2020 09:56:30 -0500
In our discussion of P1629 <https://wg21.link/p1629> yesterday (a
meeting summary will appear here
<https://github.com/sg16-unicode/sg16-meetings#february-5th-2020> in the
next few days), I raised the question of why encoding objects provided
the ability to specify both replacement code units and replacement code
points. I'm afraid I didn't follow the discussion well (I was
distracted by kids and pizza delivery...). I'd like to better understand
the motivation for both.
My expectation is that only replacement code points should be required.
This is based on the following observations:
1. When encoding, if a provided code point cannot be encoded, a
   replacement code point (that is guaranteed to be encodable) is
   encoded instead. (Encoding can never produce an ill-formed code unit
   sequence, at least not without a contract violation.)
2. When decoding, if a code unit sequence is ill-formed, a replacement
   code point is produced (and the ill-formed code unit sequence is
   skipped in an encoding-dependent way, subject to synchronization
   capabilities).
3. When transcoding, if a decoded code point cannot be encoded, the
   replacement code point from the target encoding is encoded (and that
   is guaranteed to produce a well-formed code unit sequence).
I don't see where a replacement code unit sequence fits into the above
except as a possible optimization to avoid the overhead of encoding the
replacement code point (in which case, the replacement code unit
sequence had better match how the replacement code point would be
encoded).
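
To make the three observations and the optimization reading concrete,
here is a minimal sketch. It assumes C++17 and a hypothetical ASCII
encoding object with '?' as its replacement code point; ascii_encoding,
decode_latin1, and every other name below are illustrative only and are
not taken from P1629.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical encoding object for 7-bit ASCII (illustrative, not P1629's API).
struct ascii_encoding {
    // Assumption: '?' is the replacement code point; it must be encodable.
    static constexpr char32_t replacement_code_point = U'?';
    // Optional cache: the code units the replacement code point encodes to.
    static constexpr std::string_view replacement_code_units = "?";

    static bool can_encode(char32_t cp) { return cp < 0x80; }

    // Observation 1: an unencodable code point is swapped for the
    // replacement code point, which is then encoded normally.
    static void encode_one(char32_t cp, std::string& out) {
        assert(can_encode(replacement_code_point)); // contract on the encoding
        if (!can_encode(cp)) cp = replacement_code_point;
        out.push_back(static_cast<char>(cp));
    }

    // Same behavior, but the replacement is appended from the cached
    // code units instead of being encoded again.
    static void encode_one_cached(char32_t cp, std::string& out) {
        if (can_encode(cp)) out.push_back(static_cast<char>(cp));
        else out.append(replacement_code_units);
    }

    // Observation 2: an ill-formed code unit (here, any byte >= 0x80)
    // yields the replacement code point and is skipped.
    static std::u32string decode(std::string_view in) {
        std::u32string out;
        for (unsigned char u : in)
            out.push_back(u < 0x80 ? static_cast<char32_t>(u)
                                   : replacement_code_point);
        return out;
    }
};

// Hypothetical source encoding: every Latin-1 byte maps to one code point.
inline std::u32string decode_latin1(std::string_view in) {
    std::u32string out;
    for (unsigned char u : in) out.push_back(u);
    return out;
}

// Observation 3: transcoding decodes, then encodes with the target
// encoding, so the target's replacement code point is what gets used.
inline std::string transcode_latin1_to_ascii(std::string_view in) {
    std::string out;
    for (char32_t cp : decode_latin1(in)) ascii_encoding::encode_one(cp, out);
    return out;
}

// The cached code units are only a valid optimization if they equal the
// encoding of the replacement code point.
inline bool replacement_units_match() {
    std::string encoded;
    ascii_encoding::encode_one(ascii_encoding::replacement_code_point, encoded);
    return std::string_view(encoded) == ascii_encoding::replacement_code_units;
}
```

In this sketch encode_one and encode_one_cached always produce the same
output, so the replacement code unit sequence adds nothing beyond
skipping one encode step.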
Could someone please enlighten me? When would a replacement code unit
sequence be used?
Tom.
Received on 2020-02-06 08:59:08