Hi, Mark.
Thank you for reporting this. I've tentatively put this on the agenda for September 28th (along with review of LWG issues 3767 and 3412).
Other comments inlined below.
During the May 11th[1] telecon the paper
P2286R8: Formatting Ranges
was reviewed.
There were concerns raised regarding the lack of specifications for
determining the boundaries of ill-formed code unit sequences. We decided
it was not a big issue since:
- the method used does not appear to be observable since each code unit
of the sequence is written to the output anyway.
- it should not matter for self-synchronizing encodings.
I'm working on the implementation of this part of the paper in libc++
and I have concerns about example 5 [2]:
string s5 = format("[{:?}]", "\xc3\x28"); // invalid UTF-8
// s5 has value: ["\x{c3}\x{28}"]
\xc3 is the start of a 2-byte UTF-8 code unit sequence
\x28 is not a valid successor byte
it is a valid 1-byte UTF-8 sequence for LEFT PARENTHESIS
Based on Chapter 3 of Unicode 14 [3], "Constraints on Conversion Processes":
If the converter encounters an ill-formed UTF-8 code unit sequence
which starts with a valid first byte, but which does not continue with
valid successor bytes (see Table 3-7), it must not consume the
successor bytes as part of the ill-formed subsequence whenever those
successor bytes themselves constitute part of a well-formed UTF-8 code
unit subsequence.
I would have expected the output to be ["\x{c3}("]. So all code units
are written, but it isn't clear what the exact specification is.
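The reading described above can be sketched in C++ as follows. This is only a minimal illustration, not code from libc++ or from the paper: the names well_formed_length, escape_ill_formed, and check_example_5 are hypothetical, the well-formedness check follows Table 3-7 of the Unicode Standard, and every escaping rule other than the one for ill-formed code units is ignored. Each code unit that does not start a well-formed sequence is written as \x{hh}, and decoding resumes at the very next code unit, so a following well-formed sequence such as 0x28 is never consumed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Returns true if b lies in the inclusive range [lo, hi].
constexpr bool in_range(unsigned char b, unsigned char lo, unsigned char hi) {
    return lo <= b && b <= hi;
}

// Hypothetical helper: length of the well-formed UTF-8 sequence starting at
// s[i] (Unicode Table 3-7), or 0 if s[i] does not start one.
std::size_t well_formed_length(std::string_view s, std::size_t i) {
    const auto byte = [&](std::size_t k) {
        return static_cast<unsigned char>(s[k]);
    };
    const unsigned char b0 = byte(i);
    if (b0 <= 0x7F) return 1;

    std::size_t len = 0;
    unsigned char lo = 0x80, hi = 0xBF;  // allowed range of the second byte
    if (in_range(b0, 0xC2, 0xDF)) { len = 2; }
    else if (b0 == 0xE0) { len = 3; lo = 0xA0; }
    else if (b0 == 0xED) { len = 3; hi = 0x9F; }
    else if (in_range(b0, 0xE1, 0xEF)) { len = 3; }
    else if (b0 == 0xF0) { len = 4; lo = 0x90; }
    else if (in_range(b0, 0xF1, 0xF3)) { len = 4; }
    else if (b0 == 0xF4) { len = 4; hi = 0x8F; }
    else return 0;  // 80..C1 and F5..FF never start a sequence

    if (i + len > s.size()) return 0;              // truncated sequence
    if (!in_range(byte(i + 1), lo, hi)) return 0;  // bad second byte
    for (std::size_t k = 2; k < len; ++k)
        if (!in_range(byte(i + k), 0x80, 0xBF)) return 0;
    return len;
}

// Hypothetical helper: copies well-formed sequences unchanged and writes
// every other code unit as \x{hh}. No other escaping is performed.
std::string escape_ill_formed(std::string_view s) {
    static constexpr char hex[] = "0123456789abcdef";
    std::string out;
    for (std::size_t i = 0; i < s.size();) {
        if (const std::size_t len = well_formed_length(s, i)) {
            out.append(s.substr(i, len));
            i += len;
        } else {
            const auto b = static_cast<unsigned char>(s[i]);
            out += "\\x{";
            out += hex[b >> 4];
            out += hex[b & 0x0F];
            out += '}';
            ++i;  // resume at the very next code unit
        }
    }
    return out;
}

// The input of example 5: 0xC3 is escaped, 0x28 stays a LEFT PARENTHESIS.
void check_example_5() {
    assert(escape_ill_formed("\xc3\x28") == "\\x{c3}(");
}
```

Because every ill-formed code unit is escaped individually, the boundaries of the ill-formed subsequence do not show up in the output of this sketch; only the decision not to consume 0x28 does.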
I think you are right and that the example is incorrect.

During the telecon Charlie shared a link to Unicode PR-121 [4] and
suggested we use policy option 2. Both for handling ill-formed Unicode
in an escape string and for the width estimation introduced in
P1868R2 🦄 width: clarifying units of width and precision in std::format

P1868 doesn't discuss the width estimation of ill-formed Unicode. For
P1868 libc++ uses policy option 1 for ill-formed Unicode in the width
estimation. MSVC STL uses policy option 2. This means there is
implementation divergence in the width estimation.

At the moment I have two algorithms in libc++ one for P1868 and one for
how I interpret the rules of P2286. (The P2286 code hasn't been
reviewed and I expect reviewers to strongly dislike having two
algorithms.)

I would propose to write a paper as DR which
- Addresses the width estimation when encountering ill-formed Unicode.
When writing the algorithm I noticed most terminals used policy
option 1, however at the time I was unaware of PR-121. So I would
like some feedback on which policy option is preferred.
- Clearly specifies how to recover from ill-formed Unicode; preferably
referring to the Unicode Standard.
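To make the difference between the policy options concrete, here is a hedged C++ sketch that counts how many U+FFFD substitutions each option would produce for a byte string. It assumes my understanding of PR-121: option 1 replaces a whole ill-formed subsequence with one U+FFFD, option 2 replaces each maximal subpart, and option 3 replaces each code unit. The names replacement_counts, classify, count_replacements, and check_sample are hypothetical, and the sketch does not reproduce the libc++ or MSVC STL width estimation; it only shows where the options start to disagree.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Number of U+FFFD substitutions under each assumed policy option.
struct replacement_counts {
    std::size_t option1 = 0;  // one per whole ill-formed run
    std::size_t option2 = 0;  // one per maximal subpart
    std::size_t option3 = 0;  // one per ill-formed code unit
};

// Sequence announced by a lead byte: total length and the allowed range of
// the second byte (Unicode Table 3-7). len == 0 means "not a lead byte".
struct lead_info { std::size_t len; unsigned char lo, hi; };

constexpr lead_info classify(unsigned char b) {
    if (b <= 0x7F) return {1, 0, 0};
    if (b >= 0xC2 && b <= 0xDF) return {2, 0x80, 0xBF};
    if (b == 0xE0) return {3, 0xA0, 0xBF};
    if (b == 0xED) return {3, 0x80, 0x9F};
    if (b >= 0xE1 && b <= 0xEF) return {3, 0x80, 0xBF};
    if (b == 0xF0) return {4, 0x90, 0xBF};
    if (b >= 0xF1 && b <= 0xF3) return {4, 0x80, 0xBF};
    if (b == 0xF4) return {4, 0x80, 0x8F};
    return {0, 0, 0};
}

replacement_counts count_replacements(std::string_view s) {
    replacement_counts r;
    bool in_run = false;  // currently inside a run of ill-formed code units?
    for (std::size_t i = 0; i < s.size();) {
        const lead_info info = classify(static_cast<unsigned char>(s[i]));
        // Count the code units that match a prefix of a well-formed sequence.
        std::size_t matched = info.len == 0 ? 0 : 1;
        while (matched != 0 && matched < info.len && i + matched < s.size()) {
            const auto b = static_cast<unsigned char>(s[i + matched]);
            const unsigned char lo = matched == 1 ? info.lo : 0x80;
            const unsigned char hi = matched == 1 ? info.hi : 0xBF;
            if (b < lo || b > hi) break;
            ++matched;
        }
        if (matched != 0 && matched == info.len) {  // well-formed sequence
            in_run = false;
            i += matched;
            continue;
        }
        // Maximal subpart: the matched prefix, or a single code unit.
        const std::size_t subpart = matched == 0 ? 1 : matched;
        if (!in_run) ++r.option1;
        ++r.option2;
        r.option3 += subpart;
        in_run = true;
        i += subpart;
    }
    return r;
}

// Sample: 'a', three truncated sequences (F1 80 80, E1 80, C2), then 'b'.
void check_sample() {
    const replacement_counts r =
        count_replacements("\x61\xF1\x80\x80\xE1\x80\xC2\x62");
    assert(r.option1 == 1 && r.option2 == 3 && r.option3 == 6);
}
```

For the input of example 5, "\xc3\x28", all three counts are 1, so the options only diverge on longer ill-formed runs such as the sample above.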
It isn't clear to me that it is important for implementations to
behave consistently when formatting text that contains ill-formed
code unit sequences, but establishing a recommendation seems
advisable in any case.
I guess an argument could be made that [format.string.std]p13
states that an ill-formed code unit sequence has an unspecified
width. I don't find such a reading very satisfying though so I
agree we should add clarification.
At a minimum, we should fix (or remove) the example mentioned above.
We could probably handle all of these as LWG issues as opposed to
a paper if you prefer, but I'll happily schedule a paper should
one appear!
Tom.
Due to private obligations I'm not sure whether I will be back in time
to join the next telecon. So I'd rather have it on the agenda for the
28th if we want to discuss it in a telecon.

[1] https://github.com/sg16-unicode/sg16-meetings#may-11th-2022
[2] http://eel.is/c++draft/format#string.escaped-example-1
[3] https://www.unicode.org/versions/Unicode14.0.0/ch03.pdf
[4] http://unicode.org/review/pr-121.html

Mark