ISOCPP sg16 List: Re: Handling ill-formed Unicode in the library

From: Tom Honermann <tom_at_[hidden]>
Date: Mon, 12 Sep 2022 14:44:44 -0400

Hi, Mark.

Thank you for reporting this. I've tentatively put this on the agenda
for September 28th (along with review of LWG issues 3767
<https://cplusplus.github.io/LWG/issue3767> and 3412
<https://cplusplus.github.io/LWG/issue3412>).

Other comments inlined below.

On 9/12/22 1:03 PM, Mark de Wever via SG16 wrote:
> During the May 11th[1] telecon the paper
>
> P2286R8: Formatting Ranges
>
> was reviewed.
>
> There were concerns raised regarding the lack of specifications for
> determining the boundaries of ill-formed code unit sequences. We decided
> it was not a big issue since:
> - the method used does not appear to be observable since each code unit
> of the sequence is written to the output anyway.
> - it should not matter for self-synchronizing encodings.
>
> I'm working on the implementation of this part of the paper in libc++
> and I'm having concerns with example 5 [2]
>
> string s5 = format("[{:?}]", "\xc3\x28"); // invalid UTF-8
> // s5 has value: ["\x{c3}\x{28}"]
>
> \xc3 is the start of a 2-byte UTF-8 code unit sequence
> \x28 is not a valid successor byte
> it is a valid 1-byte UTF-8 sequence for LEFT PARENTHESIS
>
> Based on Chapter 3 of Unicode 14 [3] Constraints on Conversion Processes
>
> If the converter encounters an ill-formed UTF-8 code unit sequence
> which starts with a valid first byte, but which does not continue with
> valid successor bytes (see Table 3-7), it must not consume the
> successor bytes as part of the ill-formed subsequence whenever those
> successor bytes themselves constitute part of a well-formed UTF-8 code
> unit subsequence.
>
> I would have expected the output to be ["\x{c3}("]. So all code units
> are written, but it isn't clear what the exact specification is.
I think you are right and that the example is incorrect.
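
To make the boundary rule concrete, here is a minimal sketch of how the quoted constraint could be applied when escaping. It assumes the boundaries are determined by Table 3-7 (well-formed sequence or maximal subpart). The names Subsequence, next_subsequence, and escape_ill_formed are hypothetical and not taken from P2286 or from any implementation. Well-formed sequences are copied through unchanged, which ignores the other escaping rules of the paper.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <string_view>

// Result of examining the code units that start at one position.
struct Subsequence {
    std::size_t length;  // number of code units consumed
    bool well_formed;    // true if they form a complete well-formed sequence
};

// Hypothetical helper: returns either a well-formed UTF-8 sequence or a
// maximal subpart of an ill-formed one, following Table 3-7.
// Precondition: pos < s.size().
inline Subsequence next_subsequence(std::string_view s, std::size_t pos) {
    const auto unit = [&](std::size_t i) { return static_cast<unsigned char>(s[i]); };
    const unsigned char lead = unit(pos);
    if (lead <= 0x7F) return {1, true};

    std::size_t trailing = 0;
    unsigned char lo = 0x80, hi = 0xBF;  // allowed range for the second byte
    if (lead >= 0xC2 && lead <= 0xDF) {
        trailing = 1;
    } else if (lead >= 0xE0 && lead <= 0xEF) {
        trailing = 2;
        if (lead == 0xE0) lo = 0xA0;  // excludes overlong forms
        if (lead == 0xED) hi = 0x9F;  // excludes surrogates
    } else if (lead >= 0xF0 && lead <= 0xF4) {
        trailing = 3;
        if (lead == 0xF0) lo = 0x90;  // excludes overlong forms
        if (lead == 0xF4) hi = 0x8F;  // excludes values above U+10FFFF
    } else {
        return {1, false};  // 80..BF, C0, C1, F5..FF never start a sequence
    }

    std::size_t length = 1;
    for (std::size_t i = 0; i < trailing; ++i) {
        if (pos + length >= s.size()) return {length, false};  // truncated input
        const unsigned char next = unit(pos + length);
        if (next < lo || next > hi) return {length, false};  // leave next unconsumed
        ++length;
        lo = 0x80;  // later bytes are plain continuation bytes
        hi = 0xBF;
    }
    return {length, true};
}

// Writes each code unit of an ill-formed subsequence as \x{hh} and copies
// well-formed sequences unchanged (a simplification of the real escaping).
inline std::string escape_ill_formed(std::string_view s) {
    std::string out;
    for (std::size_t pos = 0; pos < s.size();) {
        const Subsequence sub = next_subsequence(s, pos);
        if (sub.well_formed) {
            out.append(s.substr(pos, sub.length));
        } else {
            for (std::size_t i = 0; i < sub.length; ++i) {
                char buf[8];
                std::snprintf(buf, sizeof buf, "\\x{%x}",
                              static_cast<unsigned>(static_cast<unsigned char>(s[pos + i])));
                out += buf;
            }
        }
        pos += sub.length;
    }
    return out;
}
```

Under these assumptions, escape_ill_formed("\xc3\x28") yields \x{c3}( because \x28 is not consumed as part of the ill-formed subsequence, which matches the output Mark expected.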
>
> During the telecon Charlie shared a link to Unicode PR-121 [4] and
> suggested we use policy option 2. Both for handling ill-formed Unicode
> in an escape string and for the width estimation introduced in
>
> P1868R2 🦄 width: clarifying units of width and precision in std::format
>
> P1868 doesn't discuss the width estimation of ill-formed Unicode.
>
> For P1868 libc++ uses policy option 1 for ill-formed Unicode in the
> width estimation. MSVC STL uses policy option 2. This means there is
> implementation divergence in the width estimation.
>
> At the moment I have two algorithms in libc++: one for P1868 and one for
> how I interpret the rules of P2286. (The P2286 code hasn't been
> reviewed and I expect reviewers to strongly dislike having two
> algorithms.)
>
> I would propose writing a paper as a DR which
> - Addresses the width estimation when encountering ill-formed Unicode.
> When writing the algorithm I noticed most terminals used policy
> option 1, however at the time I was unaware of PR-121. So I would like
> some feedback on which policy option is preferred.
> - Clearly specifies how to recover from ill-formed Unicode; preferably
> referring to the Unicode Standard.

It isn't clear to me that it is important for implementations to behave
consistently when formatting text that contains ill-formed code unit
sequences, but establishing a recommendation seems advisable in any case.

I guess an argument could be made that [format.string.std]p13
<https://eel.is/c++draft/format.string.std#13> states that an ill-formed
code unit sequence has an unspecified width. I don't find such a reading
very satisfying though so I agree we should add clarification.
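
As a rough illustration of why the choice of policy is observable, the sketch below counts how many replacements each PR-121 policy option would produce for a given input. It assumes the options are read as: option 1 replaces each entire ill-formed subsequence, option 2 replaces each maximal subpart, and option 3 replaces each code unit. The names ReplacementCounts, scan, and count_replacements are hypothetical, and how an implementation turns these counts into an estimated width is outside the sketch.

```cpp
#include <cstddef>
#include <string_view>

// Number of replacements each policy option would produce.
struct ReplacementCounts {
    std::size_t per_subsequence = 0;      // option 1
    std::size_t per_maximal_subpart = 0;  // option 2
    std::size_t per_code_unit = 0;        // option 3
};

// Hypothetical helper: length of the well-formed sequence or maximal subpart
// starting at pos (Table 3-7 of the Unicode Standard). Sets ok to true only
// for a complete well-formed sequence. Precondition: pos < s.size().
inline std::size_t scan(std::string_view s, std::size_t pos, bool& ok) {
    const auto unit = [&](std::size_t i) { return static_cast<unsigned char>(s[i]); };
    const unsigned char lead = unit(pos);
    ok = lead <= 0x7F;
    if (ok) return 1;

    std::size_t trailing = 0;
    unsigned char lo = 0x80, hi = 0xBF;  // allowed range for the second byte
    if (lead >= 0xC2 && lead <= 0xDF) {
        trailing = 1;
    } else if (lead >= 0xE0 && lead <= 0xEF) {
        trailing = 2;
        if (lead == 0xE0) lo = 0xA0;
        if (lead == 0xED) hi = 0x9F;
    } else if (lead >= 0xF0 && lead <= 0xF4) {
        trailing = 3;
        if (lead == 0xF0) lo = 0x90;
        if (lead == 0xF4) hi = 0x8F;
    } else {
        return 1;  // byte that can never start a sequence
    }

    std::size_t n = 1;
    for (; n <= trailing; ++n) {
        if (pos + n >= s.size()) return n;  // truncated input
        const unsigned char next = unit(pos + n);
        if (next < lo || next > hi) return n;  // next starts a new subsequence
        lo = 0x80;
        hi = 0xBF;
    }
    ok = true;
    return n;
}

inline ReplacementCounts count_replacements(std::string_view s) {
    ReplacementCounts counts;
    bool in_ill_formed_run = false;  // previous subpart was ill-formed
    for (std::size_t pos = 0; pos < s.size();) {
        bool ok = false;
        const std::size_t n = scan(s, pos, ok);
        if (!ok) {
            if (!in_ill_formed_run) ++counts.per_subsequence;
            ++counts.per_maximal_subpart;
            counts.per_code_unit += n;
        }
        in_ill_formed_run = !ok;
        pos += n;
    }
    return counts;
}
```

For the code unit sequence <61 F1 80 80 E1 80 C2 62>, this sketch gives 1, 3, and 6 replacements for options 1, 2, and 3 respectively, so any width estimate derived from the replacement count would differ between implementations that pick different options.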

At a minimum, we should fix (or remove) the example
<https://eel.is/c++draft/format.string.escaped#example-1> mentioned above.

We could probably handle all of these as LWG issues as opposed to a
paper if you prefer, but I'll happily schedule a paper should one appear!

Tom.

>
> Due to private obligations I'm not sure whether I will be back in time
> to join the next telecon. So I'd rather have it on the agenda for the
> 28th if we want to discuss it in a telecon.
>
> [1] https://github.com/sg16-unicode/sg16-meetings#may-11th-2022
> [2] http://eel.is/c++draft/format#string.escaped-example-1
> [3] https://www.unicode.org/versions/Unicode14.0.0/ch03.pdf
> [4] http://unicode.org/review/pr-121.html
>
> Mark

Received on 2022-09-12 18:44:46