sg16

Bidirectional invalid UTF decoding

From: Corentin <corentin.jabot_at_[hidden]>
Date: Fri, 24 Feb 2023 10:53:24 +0100
Hey

There is no technical challenge in decrementing an iterator over
well-formed UTF-8.

As alluded to in
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2728r0.html,
when the UTF-8 sequence is ill-formed, we need to keep track of the
beginning of the range so as not to go before the start (note that views
hide that complexity from users).

But the other thing that we have not talked about in the case of ill-formed
UTF-8 is that, if the error policy is to substitute the replacement
character (U+FFFD),
then we (probably?) can't guarantee that equals(invalid_utf_view,
reverse(reverse(invalid_utf_view))) holds, as different subsequences may be
substituted depending on the direction of traversal.
This leaves us with the options of either documenting it, and hoping
nothing breaks, or adding a constraint on operator-- that the error policy
is not the default substitution policy.
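
To make the mismatch concrete, here is a minimal sketch. It is not the
algorithm from P2728 or from Eugene's article: the forward decoder replaces
each maximal ill-formed prefix with one U+FFFD, and the backward decoder uses
a deliberately naive policy that I am assuming for illustration (replace only
the last byte when no well-formed sequence ends at the current position). All
names (decode_forward, decode_all_backward, ...) are hypothetical.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

constexpr char32_t kReplacement = 0xFFFD;

// Sequence length for a lead byte and the allowed range of the second byte.
// length == 0 means "not a valid lead byte".
struct LeadInfo { int length; unsigned char lo; unsigned char hi; };

constexpr LeadInfo lead_info(unsigned char b) {
    if (b <= 0x7F) return {1, 0, 0};
    if (b >= 0xC2 && b <= 0xDF) return {2, 0x80, 0xBF};
    if (b == 0xE0) return {3, 0xA0, 0xBF};   // excludes overlong forms
    if (b == 0xED) return {3, 0x80, 0x9F};   // excludes surrogates
    if (b >= 0xE1 && b <= 0xEF) return {3, 0x80, 0xBF};
    if (b == 0xF0) return {4, 0x90, 0xBF};   // excludes overlong forms
    if (b >= 0xF1 && b <= 0xF3) return {4, 0x80, 0xBF};
    if (b == 0xF4) return {4, 0x80, 0x8F};   // excludes values above U+10FFFF
    return {0, 0, 0};
}

struct Decoded { char32_t cp; std::size_t len; bool ok; };

// Decode one code point starting at s[i]. On error, report U+FFFD and the
// number of bytes that formed a valid prefix (at least 1).
Decoded decode_forward(std::string_view s, std::size_t i) {
    const auto byte = [&](std::size_t k) { return static_cast<unsigned char>(s[k]); };
    const LeadInfo info = lead_info(byte(i));
    if (info.length == 0) return {kReplacement, 1, false};
    if (info.length == 1) return {static_cast<char32_t>(byte(i)), 1, true};
    char32_t cp = byte(i) & (0xFF >> (info.length + 1));
    std::size_t n = 1;
    for (; n < static_cast<std::size_t>(info.length); ++n) {
        if (i + n >= s.size()) return {kReplacement, n, false};
        const unsigned char c = byte(i + n);
        const unsigned char lo = (n == 1) ? info.lo : 0x80;
        const unsigned char hi = (n == 1) ? info.hi : 0xBF;
        if (c < lo || c > hi) return {kReplacement, n, false};
        cp = (cp << 6) | (c & 0x3F);
    }
    return {cp, n, true};
}

std::vector<char32_t> decode_all_forward(std::string_view s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size();) {
        const Decoded d = decode_forward(s, i);
        out.push_back(d.cp);
        i += d.len;
    }
    return out;
}

// Naive backward policy (an assumption of this sketch): look for a
// well-formed sequence ending exactly at `end`, never looking before index 0;
// if there is none, replace the last byte only. Output is in reverse order.
std::vector<char32_t> decode_all_backward(std::string_view s) {
    std::vector<char32_t> out;
    std::size_t end = s.size();
    while (end > 0) {
        char32_t cp = kReplacement;
        std::size_t consumed = 1;
        for (std::size_t len = 1; len <= 4 && len <= end; ++len) {
            const Decoded d = decode_forward(s, end - len);
            if (d.ok && d.len == len) { cp = d.cp; consumed = len; break; }
        }
        out.push_back(cp);
        end -= consumed;
    }
    return out;
}
```

With the bytes 41 E2 82 42 (a truncated three-byte sequence between 'A' and
'B'), the forward pass yields three code points (A, U+FFFD, B), while this
backward pass yields four (B, U+FFFD, U+FFFD, A). On well-formed input the
two directions agree.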

(I don't think we should consider making decoding not bidirectional; it's
just too useful, e.g. to search for a grapheme from the end.)

Anyway, that thought is now in the record :)

(This was brought to my attention by Eugene Gershnik, the author of this
decoding algorithm, which I use in my implementation:
https://gershnik.github.io/2021/03/24/reverse-utf8-decoding.html )

Received on 2023-02-24 09:53:37