Re: Bidirectional invalid UTF decoding

From: Steve Downey <sdowney_at_[hidden]>
Date: Fri, 24 Feb 2023 13:38:42 -0500
Note that the current 'common practice' is described in Unicode 15,
https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf, section 3.9,
"U+FFFD Substitution of Maximal Subparts". It is also the recommendation of
the Web Encoding standard (which refers to an older recommendation from
Unicode): http://www.w3.org/TR/encoding

We should probably follow the interop guidance. The Unicode doc has a few
test cases for various flavors of ill-formed input.
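
For concreteness, here is a minimal sketch of forward decoding with that substitution policy, as I understand it from section 3.9. The function name decode_utf8_maximal_subparts is hypothetical and does not come from P2728 or any library; it is an illustration under those assumptions, not a reference implementation.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper (not from any proposal or library): decode UTF-8
// forward, emitting one U+FFFD per maximal subpart of an ill-formed
// subsequence.
inline std::vector<char32_t> decode_utf8_maximal_subparts(std::string_view in) {
    constexpr char32_t replacement = 0xFFFD;
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < in.size()) {
        const auto b0 = static_cast<unsigned char>(in[i++]);
        if (b0 < 0x80) { out.push_back(b0); continue; }  // ASCII

        // Permitted range for the next byte and the number of trailing
        // bytes, per the table of well-formed UTF-8 byte sequences.
        unsigned char lo = 0x80, hi = 0xBF;
        int trailing = 0;
        char32_t cp = 0;
        if (b0 >= 0xC2 && b0 <= 0xDF) {
            trailing = 1; cp = b0 & 0x1F;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            trailing = 2; cp = b0 & 0x0F;
            if (b0 == 0xE0) lo = 0xA0;  // reject overlong forms
            if (b0 == 0xED) hi = 0x9F;  // reject surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            trailing = 3; cp = b0 & 0x07;
            if (b0 == 0xF0) lo = 0x90;  // reject overlong forms
            if (b0 == 0xF4) hi = 0x8F;  // reject values above U+10FFFF
        } else {
            out.push_back(replacement); // 80..C1 and F5..FF never lead
            continue;
        }

        bool ok = true;
        for (int k = 0; k < trailing; ++k) {
            if (i == in.size()) { ok = false; break; }   // truncated input
            const auto b = static_cast<unsigned char>(in[i]);
            if (b < lo || b > hi) { ok = false; break; } // leave b unconsumed
            cp = (cp << 6) | (b & 0x3F);
            ++i;
            lo = 0x80; hi = 0xBF;  // only the second byte has a special range
        }
        out.push_back(ok ? cp : replacement);
    }
    return out;
}
```

With this sketch, the bytes 61 F1 80 80 E1 80 C2 62 80 63 80 BF 64 decode to 0061 FFFD FFFD FFFD 0062 FFFD 0063 FFFD FFFD 0064. A reverse decoder would have to find the same subpart boundaries to produce the same sequence when traversing backward.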



On Fri, Feb 24, 2023 at 9:47 AM Zach Laine via SG16 <sg16_at_[hidden]>
wrote:

> I managed to get the same sequence forward and backward, even in error
> cases. I have tests that cover this exact situation:
>
> https://github.com/tzlaine/text/blob/master/test/utf8.cpp#L784
>
> I think it's appropriate for users to expect that a sequence be the
> same, no matter which direction you traverse it.
>
> Zach
>
> On Fri, Feb 24, 2023 at 3:53 AM Corentin <corentin.jabot_at_[hidden]> wrote:
> >
> > Hey
> >
> > There is no technical challenge in decrementing a UTF-8 iterator when the
> > underlying sequence is well-formed.
> >
> > As alluded to in
> > https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2728r0.html,
> > when the UTF-8 sequence is ill-formed, we need to keep track of the
> > beginning of the range so as not to go before the start (note that views
> > hide that complexity from users).
> >
> > But the other thing that we have not talked about in the case of
> > ill-formed UTF-8 is that, if the error policy is to substitute the
> > replacement character,
> > then we (probably?) can't guarantee that equals(invalid_utf_view,
> > reverse(reverse(invalid_utf_view))) holds, as different subsequences may
> > be substituted.
> > This leaves us with the options of either documenting it and hoping
> > nothing breaks, or adding a constraint on operator-- that the error policy
> > is not the default substitution policy.
> >
> > (I don't think we should consider making decoding not bidirectional;
> > it's just too useful, e.g. to search for a grapheme from the end.)
> >
> > Anyway, that thought is now in the record :)
> >
> > (This was brought to my attention by Eugene Gershnik, the author of this
> > decoding algorithm, which I use in my implementation:
> > https://gershnik.github.io/2021/03/24/reverse-utf8-decoding.html )
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>

Received on 2023-02-24 18:38:57