ISOCPP sg16 List: Re: Handling ill-formed Unicode in the library

From: Tom Honermann <tom_at_[hidden]>
Date: Mon, 12 Sep 2022 15:30:50 -0400

On 9/12/22 3:12 PM, Corentin Jabot via SG16 wrote:
>
>
> On Mon, Sep 12, 2022, 20:44 Tom Honermann via SG16
> <sg16_at_[hidden]> wrote:
>
> Hi, Mark.
>
> Thank you for reporting this. I've tentatively put this on the
> agenda for September 28th (along with review of LWG issues 3767
> <https://cplusplus.github.io/LWG/issue3767> and 3412
> <https://cplusplus.github.io/LWG/issue3412>).
>
> Other comments inlined below.
>
> On 9/12/22 1:03 PM, Mark de Wever via SG16 wrote:
>> During the May 11th[1] telecon the paper
>>
>> P2286R8: Formatting Ranges
>>
>> was reviewed.
>>
>> There were concerns raised regarding the lack of specifications for
>> determining the boundaries of ill-formed code unit sequences. We decided
>> it was not a big issue since:
>> - the method used does not appear to be observable since each code unit
>> of the sequence is written to the output anyway.
>> - it should not matter for self-synchronizing encodings.
>>
>> I'm working on the implementation of this part of the paper in libc++
>> and I'm having concerns with example 5 [2]
>>
>> string s5 = format("[{:?}]", "\xc3\x28"); // invalid UTF-8
>> // s5 has value: ["\x{c3}\x{28}"]
>>
>> \xc3 is the start of a 2-byte UTF-8 code unit sequence
>> \x28 is not a valid successor byte
>> it is a valid 1-byte UTF-8 sequence for LEFT PARENTHESIS
>>
>> Based on Chapter 3 of Unicode 14 [3] Constraints on Conversion Processes
>>
>> If the converter encounters an ill-formed UTF-8 code unit sequence
>> which starts with a valid first byte, but which does not continue with
>> valid successor bytes (see Table 3-7), it must not consume the
>> successor bytes as part of the ill-formed subsequence whenever those
>> successor bytes themselves constitute part of a well-formed UTF-8 code
>> unit subsequence.
>>
>> I would have expected the output to be ["\x{c3}("]. So all code units
>> are written, but it isn't clear what the exact specification is.
> I think you are right and that the example is incorrect.
>
>
> I am not so sure whether it is correct or not.
> We need a consistent answer here. It's really important that error
> recovery behaves consistently across existing and future facilities
> and I tend to agree with Charlie on option 2 being desirable.
> Either way we do need a resolution.

I think the answer (for this case) turns out to be the same for all
three of the PR-121 <https://www.unicode.org/review/pr-121.html>
policies since the ill-formed subsequence consists of just the single
\xc3 code unit.

Per [format.string.escaped]p(2.2.3)
<https://eel.is/c++draft/format.string.escaped#2.2.3>, the intended
behavior corresponds to PR-121 policy option 3; each code unit of the
ill-formed code unit sequence is individually encoded (replaced) in the
formatted output.
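
To make the per-code-unit behavior concrete, here is a minimal sketch of
that recovery step. It is not the libc++ or MSVC STL implementation, and
the names well_formed_length and escape_ill_formed are hypothetical. It
assumes UTF-8 input, checks well-formedness against Table 3-7 of the
Unicode Standard, and handles only ill-formed code units; the other
escaping that {:?} performs (quotes, control characters, and so on) is
left out.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical helper: length of the well-formed UTF-8 sequence starting
// at s[i], or 0 if s[i] does not begin one (ranges per Unicode Table 3-7).
std::size_t well_formed_length(std::string_view s, std::size_t i) {
    auto byte = [&](std::size_t k) -> unsigned {
        return static_cast<unsigned char>(s[k]);
    };
    // True if s[k] exists and lies in [lo, hi].
    auto in = [&](std::size_t k, unsigned lo, unsigned hi) {
        return k < s.size() && byte(k) >= lo && byte(k) <= hi;
    };
    unsigned b0 = byte(i);
    if (b0 <= 0x7F) return 1;
    if (b0 >= 0xC2 && b0 <= 0xDF)
        return in(i + 1, 0x80, 0xBF) ? 2 : 0;
    if (b0 >= 0xE0 && b0 <= 0xEF) {
        unsigned lo = (b0 == 0xE0) ? 0xA0 : 0x80;  // reject overlong forms
        unsigned hi = (b0 == 0xED) ? 0x9F : 0xBF;  // reject surrogates
        return (in(i + 1, lo, hi) && in(i + 2, 0x80, 0xBF)) ? 3 : 0;
    }
    if (b0 >= 0xF0 && b0 <= 0xF4) {
        unsigned lo = (b0 == 0xF0) ? 0x90 : 0x80;  // reject overlong forms
        unsigned hi = (b0 == 0xF4) ? 0x8F : 0xBF;  // reject > U+10FFFF
        return (in(i + 1, lo, hi) && in(i + 2, 0x80, 0xBF) &&
                in(i + 3, 0x80, 0xBF)) ? 4 : 0;
    }
    return 0;  // 0x80..0xC1 and 0xF5..0xFF never start a sequence
}

// Hypothetical helper: copy well-formed sequences unchanged and write each
// code unit that is not part of one as \x{hh} (PR-121 policy option 3).
std::string escape_ill_formed(std::string_view s) {
    std::string out;
    for (std::size_t i = 0; i < s.size();) {
        if (std::size_t n = well_formed_length(s, i)) {
            out.append(s.substr(i, n));
            i += n;
        } else {
            char buf[8];
            std::snprintf(buf, sizeof buf, "\\x{%x}",
                          static_cast<unsigned>(
                              static_cast<unsigned char>(s[i])));
            out += buf;
            ++i;  // resume at the very next code unit
        }
    }
    return out;
}
```

With this sketch, escape_ill_formed("\xc3\x28") yields \x{c3}( because
only \xc3 fails the check and \x28 is then consumed as a well-formed
one-byte sequence.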

Tom.

>
>
>> During the telecon Charlie shared a link to Unicode PR-121 [4] and
>> suggested we use policy option 2. Both for handling ill-formed Unicode
>> in an escape string and for the width estimation introduced in
>>
>> P1868R2 🦄 width: clarifying units of width and precision in std::format
>>
>> P1868 doesn't discuss the width estimation of ill-formed Unicode.
>>
>> For P1868 libc++ uses policy option 1 for ill-formed Unicode in the
>> width estimation. MSVC STL uses policy option 2. This means there is
>> implementation divergence in the width estimation.
>>
>> At the moment I have two algorithms in libc++: one for P1868 and one for
>> how I interpret the rules of P2286. (The P2286 code hasn't been
>> reviewed and I expect reviewers to strongly dislike having two
>> algorithms.)
>>
>> I would propose writing a paper as a DR which
>> - Addresses the width estimation when encountering ill-formed Unicode.
>> When writing the algorithm I noticed most terminals used policy
>> option 1, however at the time I was unaware of PR-121. So I would like
>> some feedback on which policy option is preferred.
>> - Clearly specifies how to recover from ill-formed Unicode; preferably
>> referring to the Unicode Standard.
>
> It isn't clear to me that it is important for implementations to
> behave consistently when formatting text that contains ill-formed
> code unit sequences, but establishing a recommendation seems
> advisable in any case.
>
> I guess an argument could be made that [format.string.std]p13
> <https://eel.is/c++draft/format.string.std#13> states that an
> ill-formed code unit sequence has an unspecified width. I don't
> find such a reading very satisfying though so I agree we should
> add clarification.
>
> I do think it's sufficient: an invalid code unit sequence means the
> whole string is not in a Unicode encoding.
> We could add a note, as an LWG issue.
> (And unspecified seems appropriate)
>
> At a minimum, we should fix (or remove) the example
> <https://eel.is/c++draft/format.string.escaped#example-1>
> mentioned above.
>
> We could probably handle all of these as LWG issues as opposed to
> a paper if you prefer, but I'll happily schedule a paper should
> one appear!
>
> +1
>
> Tom.
>
>> Due to private obligations I'm not sure whether I will be back in time
>> to join the next telecon. So I would rather have it on the agenda for the
>> 28th if we want to discuss it in a telecon.
>>
>> [1]https://github.com/sg16-unicode/sg16-meetings#may-11th-2022
>> [2]http://eel.is/c++draft/format#string.escaped-example-1
>> [3]https://www.unicode.org/versions/Unicode14.0.0/ch03.pdf
>> [4]http://unicode.org/review/pr-121.html
>>
>> Mark

Received on 2022-09-12 19:30:56