On Mon, Sep 12, 2022, 20:44 Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:
Hi, Mark.
Thank you for reporting this. I've tentatively put this on the agenda for September 28th (along with review of LWG issues 3767 and 3412).
Other comments inlined below.
On 9/12/22 1:03 PM, Mark de Wever via SG16 wrote:
I think you are right and that the example is incorrect.
During the May 11th [1] telecon the paper P2286R8: Formatting Ranges was reviewed. There were concerns raised regarding the lack of specifications for determining the boundaries of ill-formed code unit sequences. We decided it was not a big issue since:
- the method used does not appear to be observable since each code unit of the sequence is written to the output anyway.
- it should not matter for self-synchronizing encodings.
I'm working on the implementation of this part of the paper in libc++ and I'm having concerns with example 5 [2]:
  string s5 = format("[{:?}]", "\xc3\x28"); // invalid UTF-8
  // s5 has value: ["\x{c3}\x{28}"]
\xc3 is the start of a 2-byte UTF-8 code unit sequence
\x28 is not a valid successor byte; it is a valid 1-byte UTF-8 sequence for LEFT PARENTHESIS
Based on Chapter 3 of Unicode 14 [3]
Constraints on Conversion Processes
If the converter encounters an ill-formed UTF-8 code unit sequence which starts with a valid first byte, but which does not continue with valid successor bytes (see Table 3-7), it must not consume the successor bytes as part of the ill-formed subsequence whenever those successor bytes themselves constitute part of a well-formed UTF-8 code unit subsequence.
I would have expected the output to be ["\x{c3}("]. So all code units are written, but it isn't clear what the exact specification is.
I am not so sure whether it is correct or not. We need a consistent answer here. It's really important that error recovery behaves consistently across existing and future facilities, and I tend to agree with Charlie on option 2 being desirable. Either way, we do need a resolution.
I think the answer (for this case) turns out to be the same for all three of the PR-121 policies since the ill-formed subsequence consists of just the single \xc3 code unit.
Per [format.string.escaped]p(2.2.3),
the intended behavior corresponds to PR-121 policy option 3; each
code unit of the ill-formed code unit sequence is individually
encoded (replaced) in the formatted output.
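To illustrate the behavior described here, below is a minimal sketch of escaping only the ill-formed code units of a UTF-8 string. The names well_formed_length and escape_ill_formed are hypothetical; this is not libc++ or MSVC STL code, and it deliberately leaves out the rest of the {:?} rules (surrounding quotes, control characters, and so on). It assumes the well-formedness ranges of Table 3-7 in the Unicode Standard. Because every ill-formed code unit is escaped individually, the sketch only needs to know where a well-formed sequence starts, which matches the earlier observation that the exact boundaries inside an ill-formed run are not observable in the output.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical helper: length of the well-formed UTF-8 sequence starting at
// s[i], or 0 if no well-formed sequence starts there (Unicode Table 3-7).
std::size_t well_formed_length(std::string_view s, std::size_t i) {
    assert(i < s.size());
    auto at = [&](std::size_t k) { return static_cast<unsigned char>(s[k]); };
    const unsigned char b0 = at(i);
    if (b0 < 0x80) return 1;  // ASCII

    std::size_t len = 0;
    unsigned char lo = 0x80, hi = 0xBF;  // allowed range for the second byte
    if (b0 >= 0xC2 && b0 <= 0xDF) {
        len = 2;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
        len = 3;
        if (b0 == 0xE0) lo = 0xA0;  // excludes overlong encodings
        if (b0 == 0xED) hi = 0x9F;  // excludes surrogates
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        len = 4;
        if (b0 == 0xF0) lo = 0x90;  // excludes overlong encodings
        if (b0 == 0xF4) hi = 0x8F;  // excludes values above U+10FFFF
    } else {
        return 0;  // 80..C1 and F5..FF never start a well-formed sequence
    }
    if (s.size() - i < len) return 0;  // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char b = at(i + k);
        if (b < lo || b > hi) return 0;
        lo = 0x80;  // later trailing bytes use the full 80..BF range
        hi = 0xBF;
    }
    return len;
}

// Hypothetical helper: copy well-formed sequences unchanged and write each
// ill-formed code unit as \x{hh} (one escape per code unit).
std::string escape_ill_formed(std::string_view s) {
    std::string out;
    for (std::size_t i = 0; i < s.size();) {
        if (const std::size_t n = well_formed_length(s, i)) {
            out.append(s.substr(i, n));
            i += n;
        } else {
            char buf[8];
            std::snprintf(buf, sizeof buf, "\\x{%x}",
                          static_cast<unsigned>(static_cast<unsigned char>(s[i])));
            out += buf;
            ++i;
        }
    }
    return out;
}
```

With this sketch, escape_ill_formed("\xc3\x28") yields \x{c3}( because \x28 is itself a well-formed one-byte sequence and is therefore not treated as part of the ill-formed subsequence.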
Tom.
During the telecon Charlie shared a link to Unicode PR-121 [4] and suggested we use policy option 2, both for handling ill-formed Unicode in an escaped string and for the width estimation introduced in
P1868R2 🦄 width: clarifying units of width and precision in std::format
P1868 doesn't discuss the width estimation of ill-formed Unicode. For P1868 libc++ uses policy option 1 for ill-formed Unicode in the width estimation. MSVC STL uses policy option 2. This means there is implementation divergence in the width estimation.
At the moment I have two algorithms in libc++: one for P1868 and one for how I interpret the rules of P2286. (The P2286 code hasn't been reviewed and I expect reviewers to strongly dislike having two algorithms.)
I would propose to write a paper as DR which
- Addresses the width estimation when encountering ill-formed Unicode. When writing the algorithm I noticed most terminals used policy option 1; however, at the time I was unaware of PR-121. So I would like some feedback on which policy option is preferred.
- Clearly specifies how to recover from ill-formed Unicode; preferably referring to the Unicode Standard.
It isn't clear to me that it is important for implementations to behave consistently when formatting text that contains ill-formed code unit sequences, but establishing a recommendation seems advisable in any case.
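To make the difference between the three PR-121 policy options concrete, here is a minimal sketch that counts how many replacements each policy would produce for a UTF-8 string. The names Step, decode_step, Counts and count_replacements are hypothetical and this is not the libc++ or MSVC STL algorithm. It assumes that an "ill-formed subsequence" is a maximal run of code units between well-formed sequences, and that a "maximal subpart" is the longest prefix of a well-formed sequence per Table 3-7 (or a single code unit otherwise). If one further assumes that each replacement occupies one column, the counts also show how a width estimate would diverge between policies.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Result of examining the code units starting at one position.
struct Step {
    std::size_t len;  // code units consumed (always >= 1)
    bool ok;          // true: well-formed sequence; false: maximal subpart
};

// Hypothetical helper following the ranges of Unicode Table 3-7.
Step decode_step(std::string_view s, std::size_t i) {
    assert(i < s.size());
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    if (b0 < 0x80) return {1, true};  // ASCII

    std::size_t need = 0;                // number of trailing bytes expected
    unsigned char lo = 0x80, hi = 0xBF;  // allowed range for the next byte
    if (b0 >= 0xC2 && b0 <= 0xDF) {
        need = 1;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
        need = 2;
        if (b0 == 0xE0) lo = 0xA0;
        if (b0 == 0xED) hi = 0x9F;
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        need = 3;
        if (b0 == 0xF0) lo = 0x90;
        if (b0 == 0xF4) hi = 0x8F;
    } else {
        return {1, false};  // 80..C1 and F5..FF cannot start a sequence
    }

    std::size_t len = 1;
    for (std::size_t k = 0; k < need; ++k) {
        if (i + len >= s.size()) return {len, false};  // truncated input
        const unsigned char b = static_cast<unsigned char>(s[i + len]);
        if (b < lo || b > hi) return {len, false};  // do not consume this byte
        ++len;
        lo = 0x80;
        hi = 0xBF;
    }
    return {len, true};
}

// Number of replacements each PR-121 policy option would emit.
struct Counts {
    std::size_t policy1;  // one per ill-formed subsequence (whole run)
    std::size_t policy2;  // one per maximal subpart
    std::size_t policy3;  // one per ill-formed code unit
};

Counts count_replacements(std::string_view s) {
    Counts c{0, 0, 0};
    bool in_run = false;  // inside a run of consecutive ill-formed subparts?
    for (std::size_t i = 0; i < s.size();) {
        const Step st = decode_step(s, i);
        if (st.ok) {
            in_run = false;
        } else {
            if (!in_run) ++c.policy1;
            ++c.policy2;
            c.policy3 += st.len;
            in_run = true;
        }
        i += st.len;
    }
    return c;
}
```

For "\xc3\x28" all three counts are 1, since the ill-formed subsequence is the single \xc3 code unit. For the bytes 61 F1 80 80 E1 80 C2 62 the counts are 1, 3 and 6 respectively, which is where a policy 1 implementation and a policy 2 implementation would report different widths.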
I guess an argument could be made that [format.string.std]p13 states that an ill-formed code unit sequence has an unspecified width. I don't find such a reading very satisfying though so I agree we should add clarification.
I do think it's sufficient: an invalid code unit sequence makes the whole string not be in a Unicode encoding. We could add a note, as an LWG issue. (And unspecified seems appropriate.)
At a minimum, we should fix (or remove) the example mentioned above.
We could probably handle all of these as LWG issues as opposed to a paper if you prefer, but I'll happily schedule a paper should one appear!
+1
--
Tom.
Due to private obligations I'm not sure whether I will be back in time to join the next telecon. So I would rather have it on the agenda for the 28th if we want to discuss it in a telecon.
[1] https://github.com/sg16-unicode/sg16-meetings#may-11th-2022
[2] http://eel.is/c++draft/format#string.escaped-example-1
[3] https://www.unicode.org/versions/Unicode14.0.0/ch03.pdf
[4] http://unicode.org/review/pr-121.html
Mark
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16