Re: Bidirectional invalid UTF decoding

From: Corentin <corentin.jabot_at_[hidden]>
Date: Fri, 24 Feb 2023 22:10:07 +0100
On Fri, Feb 24, 2023, 21:38 Tom Honermann <tom_at_[hidden]> wrote:

> Copying Eugene since, I think, he isn't subscribed to the SG16 mailing
> list. Eugene, feel free to contribute any thoughts you have; we're always
> happy to have yet another participant that has implemented UTF-8
> encoding/decoding! :)
> I agree that we should follow the Unicode/W3C/WhatWG guidance as a default
> behavior (possibly with additional opt-in error handling behaviors).
> Corentin, with regard to the issue of reverse decoding, I'm not sure I
> understand the concern. This code:
> equals(invalid_utf_view, reverse(reverse(invalid_utf_view)))
> isn't valid since, assuming reverse is ranges::reverse(), the inner call
> returns an iterator and the outer call would then need a sentinel argument.
> It also isn't clear to me whether invalid_utf_view is a range of
> (decoded) code points (in which case the reverse() calls operate on code
> points) or a range of (invalid) code units (in which case there seems to be
> no decoding taking place).
> Can you clarify the example and the concern?

The example is not important; it's pseudocode.
I meant to say that reversing twice may not produce the same sequence as
decoding once forward.

> I would expect a forward decode and a reverse decode to implement the same
> substitution policy; basically, the reverse decode should behave as-if each
> code point is decoded by first iterating to the lead code unit of a
> well-formed sequence or to the first code unit of an invalid code unit
> sequence (either at the beginning of input, or immediately following a
> well-formed sequence), then decoding in the forward direction, and then
> reversing again to one before the initially identified code unit. That
> sounds inefficient, but I don't think it needs to be (at least, not for the
> happy path).

Finding that lead code unit may, depending on the substitution policy,
require looking back over many code units, which does sound inefficient.
I don't think there is a happy path, as we have to do that lookup every
time to know whether the path is happy.

But again it depends on the specific substitution policy.
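
A minimal sketch of the as-if reverse decode described above, assuming the
"maximal subparts" substitution policy (Unicode section 3.9). The names
Decoded, decode_forward, decode_backward, decode_all_forward and
decode_all_backward are hypothetical; they are not taken from P2728 or from
any implementation mentioned in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical result type: the code point and the number of code units it covers.
struct Decoded {
    char32_t cp;
    std::size_t len;
};

inline bool is_continuation(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Forward decode of one element starting at index i (i < s.size()).
// An ill-formed sequence yields U+FFFD covering its maximal subpart.
inline Decoded decode_forward(std::string_view s, std::size_t i) {
    assert(i < s.size());
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    if (b0 < 0x80) return {b0, 1};
    std::size_t need = 0;
    unsigned char lo = 0x80, hi = 0xBF;  // allowed range for the next byte
    char32_t cp = 0;
    if (b0 >= 0xC2 && b0 <= 0xDF) {
        need = 1; cp = b0 & 0x1Fu;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
        need = 2; cp = b0 & 0x0Fu;
        if (b0 == 0xE0) lo = 0xA0;  // reject overlong forms
        if (b0 == 0xED) hi = 0x9F;  // reject surrogates
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        need = 3; cp = b0 & 0x07u;
        if (b0 == 0xF0) lo = 0x90;  // reject overlong forms
        if (b0 == 0xF4) hi = 0x8F;  // reject values above U+10FFFF
    } else {
        return {0xFFFD, 1};  // stray continuation byte, C0, C1, or F5..FF
    }
    std::size_t len = 1;
    for (; need > 0; --need, ++len) {
        if (i + len == s.size()) return {0xFFFD, len};  // truncated by end of input
        const unsigned char b = static_cast<unsigned char>(s[i + len]);
        if (b < lo || b > hi) return {0xFFFD, len};     // maximal subpart ends here
        cp = (cp << 6) | (b & 0x3Fu);
        lo = 0x80; hi = 0xBF;
    }
    return {cp, len};
}

// Reverse decode of the element that ends at index `end`. Assumption: `end`
// is an element boundary of the forward decode and 0 < end <= s.size().
inline Decoded decode_backward(std::string_view s, std::size_t end) {
    assert(end > 0 && end <= s.size());
    std::size_t start = end - 1;
    // Step back over at most 3 continuation bytes to a candidate lead byte.
    for (int steps = 0; steps < 3 && start > 0 && is_continuation(s[start]); ++steps)
        --start;
    const Decoded d = decode_forward(s, start);
    if (start + d.len == end) return d;  // the forward element ends exactly here
    return {0xFFFD, 1};                  // otherwise the last byte is an error on its own
}

inline std::u32string decode_all_forward(std::string_view s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size();) {
        const Decoded d = decode_forward(s, i);
        out.push_back(d.cp);
        i += d.len;
    }
    return out;
}

// Decodes from the end; the result is in reverse order.
inline std::u32string decode_all_backward(std::string_view s) {
    std::u32string out;
    for (std::size_t end = s.size(); end > 0;) {
        const Decoded d = decode_backward(s, end);
        out.push_back(d.cp);
        end -= d.len;
    }
    return out;
}
```

Under this particular policy a non-continuation byte always starts a new
element, so the look-back is bounded by 3 code units and every backward step
re-runs one forward decode. A policy that replaces an entire ill-formed
subsequence with a single U+FFFD would not have that bound, which is the
cost concern raised here.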

Then again, if people can customize that policy we will get into trouble no
matter what.

> Perhaps I'm missing something, but I can't think of a scenario in which,
> for any of the three policy options described in PR-121
> <http://unicode.org/review/pr-121.html>, we would (or should) produce
> different substitution results.

I think this document illustrates the issue well. To produce option 1, which
is the pathological case, you need to look ahead (or rather back) 7
code units in that example.
Can you cache that? Maybe, but a cache may be expensive to maintain, and
base() needs to be maintained as well.

The "best" strategy may be to have 2 iterators chasing one another at a
distance that depends on the strategy - but it's still twice the work!

You are right that there may not be a case where we can't produce the same
output *if we know the substitution strategy* - but at what cost?

We also may not have a choice, as we want to satisfy the requirements of
bidirectional iterators, but that means that the error policy will have
quite an impact on performance!

> Tom.
> On 2/24/23 1:38 PM, Steve Downey via SG16 wrote:
> Note that the current 'common practice' is documented in Unicode 15
> https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf section 3.9,
> "U+FFFD Substitution of Maximal Subparts", which is also the Web encoding
> standard recommendation (which refers to an older recommendation from
> Unicode): http://www.w3.org/TR/encoding
> We should probably follow the interop guidance. The Unicode doc has a few
> test cases for various flavors of ill-formed.
> On Fri, Feb 24, 2023 at 9:47 AM Zach Laine via SG16 <sg16_at_[hidden]>
> wrote:
>> I managed to get the same sequence forward and backward, even in error
>> cases. I have tests that cover this exact situation:
>> https://github.com/tzlaine/text/blob/master/test/utf8.cpp#L784
>> I think it's appropriate for users to expect that a sequence be the
>> same, no matter which direction you traverse it.
>> Zach
>> On Fri, Feb 24, 2023 at 3:53 AM Corentin <corentin.jabot_at_[hidden]>
>> wrote:
>> >
>> > Hey
>> >
>> > There is no technical challenge in decrementing an iterator over
>> well-formed UTF-8.
>> >
>> > As alluded to in
>> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2728r0.html,
>> > When the UTF-8 sequence is ill-formed, we need to keep track of the
>> beginning of the range to not go before the start (note that views hide
>> that complexity from users).
>> >
>> > But the other thing that we have not talked about in the case of
>> ill-formed UTF-8, is that, if the error policy is to substitute the
>> replacement character,
>> > then we (probably?) can't guarantee that equals(invalid_utf_view,
>> reverse(reverse(invalid_utf_view))), as different subsequences may be
>> substituted.
>> > This leaves us with the options of either documenting it, and hoping
>> nothing breaks, or adding a constraint on operator-- that the error policy
>> is not the default substitution policy.
>> >
>> > (I don't think we should consider making decoding not bidirectional;
>> it's just too useful, e.g. to search for a grapheme from the end.)
>> >
>> > Anyway, that thought is now in the record :)
>> >
>> > (This was brought to my attention by Eugene Gershnik, the author of
>> this decoding algorithm, which I use in my implementation:
>> https://gershnik.github.io/2021/03/24/reverse-utf8-decoding.html )
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16

Received on 2023-02-24 21:10:20