ISOCPP sg16 List: Re: Bidirectional invalid UTF decoding

From: Eugene Gershnik <gershnik_at_[hidden]>
Date: Fri, 24 Feb 2023 22:39:23 +0000

Thank you Tom, and apologies for missing the meeting again!

I believe there are 3 issues here:

1. Is it a problem if reverse iteration produces different output. Yes it is, at least for user algorithms, even if no standard algorithms would be affected. Consider a simple case of wanting to trim whitespace on both ends of a string. The obvious and fastest implementation is to iterate from begin() forward and from rend() backward discarding Unicode whitespace until non whitespace is found or they meet. Then cut out the subsequence of this pair. But this needs both iterators to be reachable from each other. Which might not be the case if reverse one is jumping to “before” the forward one…

2. Can exact same sequences be produced. Yes it is possible but as Corentin noticed at a cost.

The thing is, the cost likely would have to be born by *both* forward and reverse iterators.

First of all looking at PR-121 option 2 (if I understand it correctly) is the one most decoders seem to follow. The fast decoders (I am very familiar with state machine ones) detect an error somewhere in the middle of the sequence and then defer to the “user” (e.g. code one level higher) to deal with it. An ideal “user” doesn’t have any state and performs no branches other than check whether the encoder needs more input, is done or has an error. Adding more branches and state there kinda defeats the whole purpose of having a fast decoder in the first place.

From what I discovered back in a day forward decoders usually use the following heuristic: restart from the last byte that caused the error unless it was the first byte in a codepoint. In other words, if the first byte in a codepoint is bad replace and move to the next one. If any of the trail bytes is bad they assume that everything before it is one replacement character and then restart from there. This _seems_ to be what PR-121 implies but I am not 100% sure.
You can see what your browser is doing at https://www.w3.org/2001/06/utf-8-wrong/UTF-8-test.html

But be it as it may, the reverse decoder, in order to match would need to exactly replicate this strategy. For a forward strategy described above this would become quite convoluted. You will essentially need to find the start, reverse the direction, decode the forward replacements, store them and return in opposite order. This seems pretty bad both in space and branch count (you kinda lose all the benefits of state machine with all the if()s it will require). It is true that the branching will be somewhat limited to the unhappy path but the space overhead is still there. Plus code complexity tends to affect optimizations even on the happy path. Hard to tell without implementing and measuring what end effect would be.

Now, you can use a different strategy to make forward iteration friendly to the reverse. For example option 1 from PR-121 seems to be like this. But this will penalize forward decoding. It will have to add complexity and branches looking for the proper “end” and the character. This approach would make forward and reverse iteration more or less the same in terms of speed and both suboptimal.

Note that I don’t know how this would play with branchless decoders, what strategy they use for efficient replacement etc.

3. What should be done about it? In my own code I am aware of the issue so I step carefully around it - basically whenever there is simultaneous forward and reverse iteration I need to carefully evaluate what can happen with substitution. I fear this is not a good general purpose approach for the standard library.
Prohibiting operator--() isn’t enough if reverse_iterator is still available because the same issue would happen when mixing iterator and reverse_iterator iteration.
Making iteration not bidirectional at all seems heavy handed because in most cases it works just fine.

Maybe using option 1 as the default replacement strategy (while still allowing a different one for people who want speed) can be made to work. But this would likely lead to observations like “C++ uff-8 decoding is slooooow!”. Also it will produce result visibly quite different from other software for invalid input. Whether this is an issue or not I don’t know.

Hope it helps somewhat!

--
Eugene

On Feb 24, 2023, at 1:10 PM, Corentin <corentin.jabot_at_[hidden]> wrote:

On Fri, Feb 24, 2023, 21:38 Tom Honermann <tom_at_honermann.net<mailto:tom_at_[hidden]>> wrote:

Copying Eugene since, I think, he isn't subscribed to the SG16 mailing list. Eugene, feel free to contribute any thoughts you have; we're always happy to have yet another participant that has implemented UTF-8 encoding/decoding! :)

I agree that we should follow the Unicode/W3C/WhatWG guidance as a default behavior (possibly with additional opt-in error handling behaviors).

Corentin, with regard to the issue of reverse decoding, I'm not sure I understand the concern. This code:

equals(invalid_utf_view, reverse(reverse(invalid_utf_view)))

isn't valid since, assuming reverse is ranges::reverse(), the inner call returns an iterator and the outer call would then need a sentinel argument.

It also isn't clear to me whether invalid_utf_view is a range of (decoded) code points (in which case the reverse() calls operate on code points) or a range of (invalid) code units (in which case there seems to be no decoding taking place)

Can you clarify the example and the concern?

The example is not important. It's pseudo code.
I meant to say that reversing twice may not produce the same sequence as decoding once forward.

I would expect a forward decode and a reverse decode to implement the same substitution policy; basically, the reverse decode should behave as-if each code point is decoded by first iterating to the lead code unit of a well-formed sequence or to the first code unit of an invalid code unit sequence (either at the beginning of input, or immediately following a well-formed sequence), then decoding in the forward direction, and then reversing again to one before the initially identified code unit. That sounds inefficient, but I don't think it needs to be (at least, not for the happy path).

Finding that lead codepoint, depending on the substitution policy, may require to look ahead many code units, which sounds inefficient indeed.
I don't think there is a happy path as we constantly have to do that lookup to know whether the path is happy.

But again it depends on the specific substitution policy.

Then again, if people can customize that policy we will get into trouble no matter what.

Perhaps I'm missing something, but I can't think of a scenario in which any of the three policy options described in PR-121<http://unicode.org/review/pr-121.html> we would (or should) produce different substitution result

I think this document is good to illustrate the issue. To produce 1, which is the pathological case, you need to look ahead (or rather back) 7 codepoints in that example.
Can you cache that? Maybe but a cache may be expensive to maintain and base() needs to be maintained.

The ""best"" strategy may be to have 2 iterators chasing one another at a distance that depends on the strategy - but it's still twice the work!

You are right that there may not be a case where we can't produce the same output *if we know the substitution strategy* - but at what cost?

We also may not have a choice as we want to satisfy the requirements of bidirectional iterators but that means that the error policy will have quite the impact on performance!

Tom.

On 2/24/23 1:38 PM, Steve Downey via SG16 wrote:
Note that the current 'common practice' is noted by Unicode 15 https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf section 3.9 U+FFFD Substitution of Maximal Subparts which is also the Web encoding standard recommendation (which refers to an older rec from Unicode) : http://www.w3.org/TR/encoding

We should probably follow the interop guidance. The Unicode doc has a few test cases for various flavors of ill-formed.

On Fri, Feb 24, 2023 at 9:47 AM Zach Laine via SG16 <sg16_at_[hidden]<mailto:sg16_at_[hidden]>> wrote:
I managed to get the same sequence forward and backward, even in error
cases. I have tests that cover this exact situation:

https://github.com/tzlaine/text/blob/master/test/utf8.cpp#L784

I think it's appropriate for users to expect that a sequence be the
same, no matter which direction you traverse it.

Zach

On Fri, Feb 24, 2023 at 3:53 AM Corentin <corentin.jabot_at_[hidden]<mailto:corentin.jabot_at_[hidden]>> wrote:
>
> Hey
>
> There is no technical challenge in decrementing a UTF-8 iterator that is well-formed.
>
> As eluded to in https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2728r0.html,
> When the UTF-8 sequence is ill-formed, we need to keep track of the beginning of the range to not go before the start (note that views hide that complexity from users).
>
> But the other thing that we have not talked about in the case of ill-formed UTF-8, is that, if the error policy is to substitute the replacement character,
> then we (probably?) can't guarantee that equals(invalid_utf_view, reverse(reverse(invalid_utf_view))), as different subsequences may be substituted.
> This leaves us with the options of either documenting it, and hoping nothing breaks, or adding a constraint on operator-- that the error policy is not the default substitution policy.
>
> (I don't think we should consider making decoding not bidirectional, it's just too useful, ie to search a grapheme from the end, for example)
>
> Anyway, that thought is now in the record :)
>
> (This was brought to my attention by Eugene Gershnik, the author of this decoding algorithm, which i use in my implementation https://gershnik.github.io/2021/03/24/reverse-utf8-decoding.html )
--
SG16 mailing list
SG16_at_[hidden]socpp.org<mailto:SG16_at_[hidden]>
https://lists.isocpp.org/mailman/listinfo.cgi/sg16

Received on 2023-02-24 22:39:25