sg16

Re: Bidirectional invalid UTF decoding

From: Zach Laine <whatwasthataddress_at_[hidden]>
Date: Fri, 24 Feb 2023 08:47:19 -0600
I managed to get the same sequence forward and backward, even in error
cases. I have tests that cover this exact situation:

https://github.com/tzlaine/text/blob/master/test/utf8.cpp#L784

I think it's appropriate for users to expect that a sequence be the
same, no matter which direction you traverse it.

Zach

On Fri, Feb 24, 2023 at 3:53 AM Corentin <corentin.jabot_at_[hidden]> wrote:
>
> Hey
>
> There is no technical challenge in decrementing an iterator over well-formed UTF-8.
>
> As alluded to in https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2728r0.html,
> when the UTF-8 sequence is ill-formed, we need to keep track of the beginning of the range so as not to go before the start (note that views hide that complexity from users).
>
> But the other thing that we have not talked about in the case of ill-formed UTF-8 is that, if the error policy is to substitute the replacement character,
> then we (probably?) can't guarantee that equals(invalid_utf_view, reverse(reverse(invalid_utf_view))) holds, as different subsequences may be substituted depending on the direction of traversal.
> This leaves us with the options of either documenting it, and hoping nothing breaks, or adding a constraint on operator-- that the error policy is not the default substitution policy.
>
> (I don't think we should consider making decoding not bidirectional; it's just too useful, e.g. to search for a grapheme from the end)
>
> Anyway, that thought is now in the record :)
>
> (This was brought to my attention by Eugene Gershnik, the author of this decoding algorithm, which I use in my implementation https://gershnik.github.io/2021/03/24/reverse-utf8-decoding.html )
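
To make the concern in the quoted message concrete, here is a minimal sketch. It is not the code from either implementation mentioned in this thread. All names (Decoded, decode_one, find_start, decode_forward, decode_backward) are hypothetical, and it assumes the error policy is "one U+FFFD per maximal ill-formed subsequence", which may not be the policy either author uses. The "naive" backward mode substitutes one byte at a time on failure. The "consistent" mode accepts whatever unit the forward decoder would have produced, which seems to be one way to get the same sequence in both directions.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper types and names, for illustration only.
struct Decoded {
    char32_t cp;      // decoded code point, or U+FFFD on error
    std::size_t len;  // bytes consumed, always >= 1
    bool ok;          // true if the bytes were well-formed
};

constexpr char32_t kReplacement = U'\uFFFD';

inline bool is_cont(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Decode one unit from s[pos, limit); requires pos < limit.
// On error, consumes the maximal ill-formed prefix (assumed policy).
Decoded decode_one(std::string_view s, std::size_t pos, std::size_t limit) {
    auto byte = [&](std::size_t i) { return static_cast<unsigned char>(s[i]); };
    const unsigned char b0 = byte(pos);
    if (b0 < 0x80) return {static_cast<char32_t>(b0), 1, true};
    std::size_t need = 0;                // continuation bytes expected
    unsigned char lo = 0x80, hi = 0xBF;  // allowed range of the next byte
    char32_t cp = 0;
    if (b0 >= 0xC2 && b0 <= 0xDF) { need = 1; cp = b0 & 0x1F; }
    else if (b0 >= 0xE0 && b0 <= 0xEF) {
        need = 2; cp = b0 & 0x0F;
        if (b0 == 0xE0) lo = 0xA0;       // rejects overlong forms
        if (b0 == 0xED) hi = 0x9F;       // rejects surrogates
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        need = 3; cp = b0 & 0x07;
        if (b0 == 0xF0) lo = 0x90;       // rejects overlong forms
        if (b0 == 0xF4) hi = 0x8F;       // rejects values above U+10FFFF
    } else {
        return {kReplacement, 1, false}; // stray continuation or invalid lead
    }
    std::size_t len = 1;
    for (std::size_t k = 0; k < need; ++k, ++len) {
        if (pos + len >= limit) return {kReplacement, len, false};
        const unsigned char b = byte(pos + len);
        if (b < lo || b > hi) return {kReplacement, len, false};
        cp = (cp << 6) | (b & 0x3F);
        lo = 0x80; hi = 0xBF;            // later bytes use the full range
    }
    return {cp, len, true};
}

std::vector<char32_t> decode_forward(std::string_view s) {
    std::vector<char32_t> out;
    for (std::size_t pos = 0; pos < s.size();) {
        const Decoded d = decode_one(s, pos, s.size());
        out.push_back(d.cp);
        pos += d.len;
    }
    return out;
}

// Step back over at most 3 continuation bytes, never before index 0.
std::size_t find_start(std::string_view s, std::size_t end) {
    std::size_t p = end - 1;
    for (int steps = 0; steps < 3 && p > 0 && is_cont(s[p]); ++steps) --p;
    return p;
}

// Returns code points in traversal order, i.e. last unit first.
std::vector<char32_t> decode_backward(std::string_view s, bool consistent) {
    std::vector<char32_t> out;
    for (std::size_t end = s.size(); end > 0;) {
        const std::size_t p = find_start(s, end);
        const Decoded d = decode_one(s, p, end);
        const bool reaches_end = (p + d.len == end);
        // Naive: only a well-formed unit is taken whole.
        // Consistent: an error unit that ends exactly at `end` is also taken whole.
        const bool accept = consistent ? reaches_end : (reaches_end && d.ok);
        if (accept) { out.push_back(d.cp); end = p; }
        else        { out.push_back(kReplacement); end -= 1; }
    }
    return out;
}
```

With the three-byte input E2 82 41 (a truncated three-byte sequence followed by 'A'), decode_forward yields two code points, U+FFFD then 'A'. The naive backward mode yields three ('A', U+FFFD, U+FFFD), so reversing it does not reproduce the forward result. The consistent mode yields 'A' then U+FFFD, which is the forward result reversed. The consistent mode relies on the observation that, under this policy, a non-continuation byte always starts a new unit in the forward direction; a different substitution policy would need a different argument.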

Received on 2023-02-24 14:47:31