Re: Bidirectional invalid UTF decoding

From: Tom Honermann <tom_at_[hidden]>
Date: Fri, 24 Feb 2023 15:38:57 -0500
Copying Eugene since, I think, he isn't subscribed to the SG16 mailing
list. Eugene, feel free to contribute any thoughts you have; we're
always happy to have yet another participant who has implemented UTF-8
encoding/decoding! :)

I agree that we should follow the Unicode/W3C/WhatWG guidance as a
default behavior (possibly with additional opt-in error handling behaviors).

Corentin, with regard to the issue of reverse decoding, I'm not sure I
understand the concern. This code:

    equals(invalid_utf_view, reverse(reverse(invalid_utf_view)))

isn't valid since, assuming reverse is ranges::reverse(), the inner call
returns an iterator and the outer call would then need a sentinel argument.

It also isn't clear to me whether invalid_utf_view is a range of
(decoded) code points (in which case the reverse() calls operate on code
points) or a range of (invalid) code units (in which case there seems to
be no decoding taking place).

Can you clarify the example and the concern?

I would expect a forward decode and a reverse decode to implement the
same substitution policy; basically, the reverse decode should behave
as-if each code point is decoded by first iterating to the lead code
unit of a well-formed sequence or to the first code unit of an invalid
code unit sequence (either at the beginning of input, or immediately
following a well-formed sequence), then decoding in the forward
direction, and then reversing again to one before the initially
identified code unit. That sounds inefficient, but I don't think it
needs to be (at least, not for the happy path).

Perhaps I'm missing something, but I can't think of a scenario in which
any of the three policy options described in PR-121
<http://unicode.org/review/pr-121.html> would (or should) produce
different substitution results.


On 2/24/23 1:38 PM, Steve Downey via SG16 wrote:
> Note that the current 'common practice' is noted by Unicode 15
> https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf section 3.9
> U+FFFD Substitution of Maximal Subparts which is also the Web encoding
> standard recommendation (which refers to an older rec from Unicode) :
> http://www.w3.org/TR/encoding
> We should probably follow the interop guidance. The Unicode doc has a
> few test cases for various flavors of ill-formed.
> On Fri, Feb 24, 2023 at 9:47 AM Zach Laine via SG16
> <sg16_at_[hidden]> wrote:
> I managed to get the same sequence forward and backward, even in error
> cases. I have tests that cover this exact situation:
> https://github.com/tzlaine/text/blob/master/test/utf8.cpp#L784
> I think it's appropriate for users to expect that a sequence be the
> same, no matter which direction you traverse it.
> Zach
> On Fri, Feb 24, 2023 at 3:53 AM Corentin
> <corentin.jabot_at_[hidden]> wrote:
> >
> > Hey
> >
> > There is no technical challenge in decrementing a UTF-8 iterator
> that is well-formed.
> >
> > As alluded to in
> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2728r0.html,
> > When the UTF-8 sequence is ill-formed, we need to keep track of
> the beginning of the range to not go before the start (note that
> views hide that complexity from users).
> >
> > But the other thing that we have not talked about in the case of
> ill-formed UTF-8, is that, if the error policy is to substitute
> the replacement character,
> > then we (probably?) can't guarantee that
> equals(invalid_utf_view, reverse(reverse(invalid_utf_view))), as
> different subsequences may be substituted.
> > This leaves us with the options of either documenting it, and
> hoping nothing breaks, or adding a constraint on operator-- that
> the error policy is not the default substitution policy.
> >
> > (I don't think we should consider making decoding not
> bidirectional; it's just too useful, e.g. to search for a grapheme
> from the end)
> >
> > Anyway, that thought is now in the record :)
> >
> > (This was brought to my attention by Eugene Gershnik, the author
> of this decoding algorithm, which I use in my implementation
> https://gershnik.github.io/2021/03/24/reverse-utf8-decoding.html )
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16

Received on 2023-02-24 20:39:01