Date: Mon, 12 Sep 2022 19:03:04 +0200
During the May 11th [1] telecon, the paper
P2286R8: Formatting Ranges
was reviewed.
Concerns were raised about the lack of a specification for
determining the boundaries of ill-formed code unit sequences. We decided
it was not a big issue since:
- the method used does not appear to be observable since each code unit
of the sequence is written to the output anyway.
- it should not matter for self-synchronizing encodings.
I'm working on the implementation of this part of the paper in libc++
and I have concerns about example 5 [2]:
string s5 = format("[{:?}]", "\xc3\x28"); // invalid UTF-8
// s5 has value: ["\x{c3}\x{28}"]
\xc3 is the start of a 2-byte UTF-8 code unit sequence.
\x28 is not a valid successor byte;
it is a valid 1-byte UTF-8 sequence for LEFT PARENTHESIS.
Based on Chapter 3 of Unicode 14 [3], "Constraints on Conversion Processes":
If the converter encounters an ill-formed UTF-8 code unit sequence
which starts with a valid first byte, but which does not continue with
valid successor bytes (see Table 3-7), it must not consume the
successor bytes as part of the ill-formed subsequence whenever those
successor bytes themselves constitute part of a well-formed UTF-8 code
unit subsequence.
I would have expected the output to be ["\x{c3}("]. All code units are
written either way, but the result differs, and it isn't clear what the
exact specification is.
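
To make the expectation concrete, here is a minimal sketch of the recovery rule quoted above. It is not the libc++ code; the names Step, next_utf8 and escape_ill_formed are made up for illustration. It only escapes ill-formed code units and ignores the other escaping rules of P2286 (quotes, non-printable characters, and so on). The lead and successor byte ranges follow Table 3-7.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper type: how many code units to consume at a position,
// and whether they form a complete well-formed UTF-8 sequence.
struct Step {
    std::size_t len;
    bool well_formed;
};

// Returns the well-formed sequence starting at s[i], or otherwise the
// maximal subpart of the ill-formed sequence (at least one code unit).
Step next_utf8(std::string_view s, std::size_t i) {
    auto byte = [&](std::size_t k) { return static_cast<unsigned char>(s[k]); };
    unsigned char lead = byte(i);
    if (lead < 0x80) return {1, true};

    std::size_t need = 0;                 // number of successor bytes
    unsigned char lo = 0x80, hi = 0xBF;   // valid range of the first successor
    if (lead >= 0xC2 && lead <= 0xDF) {
        need = 1;
    } else if (lead >= 0xE0 && lead <= 0xEF) {
        need = 2;
        if (lead == 0xE0) lo = 0xA0;      // rejects overlong forms
        if (lead == 0xED) hi = 0x9F;      // rejects surrogates
    } else if (lead >= 0xF0 && lead <= 0xF4) {
        need = 3;
        if (lead == 0xF0) lo = 0x90;      // rejects overlong forms
        if (lead == 0xF4) hi = 0x8F;      // rejects values above U+10FFFF
    } else {
        return {1, false};                // not a valid first byte
    }

    std::size_t len = 1;
    for (std::size_t k = 0; k < need; ++k) {
        if (i + len >= s.size()) return {len, false};   // truncated
        unsigned char c = byte(i + len);
        if (c < lo || c > hi) return {len, false};      // do not consume c
        ++len;
        lo = 0x80;
        hi = 0xBF;
    }
    return {len, true};
}

// Copies well-formed sequences unchanged and writes every code unit of an
// ill-formed maximal subpart as \x{hh}.
std::string escape_ill_formed(std::string_view s) {
    static constexpr char hex[] = "0123456789abcdef";
    std::string out;
    for (std::size_t i = 0; i < s.size();) {
        Step st = next_utf8(s, i);
        if (st.well_formed) {
            out.append(s.substr(i, st.len));
        } else {
            for (std::size_t k = 0; k < st.len; ++k) {
                unsigned char c = static_cast<unsigned char>(s[i + k]);
                out += "\\x{";
                out += hex[c >> 4];
                out += hex[c & 0x0F];
                out += '}';
            }
        }
        i += st.len;
    }
    return out;
}
```

With this sketch, escape_ill_formed("\xc3\x28") gives \x{c3}( since \x28 is not consumed as part of the ill-formed subsequence.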
During the telecon Charlie shared a link to Unicode PR-121 [4] and
suggested we use policy option 2, both for handling ill-formed Unicode
in an escaped string and for the width estimation introduced in
P1868R2 🦄 width: clarifying units of width and precision in std::format
P1868 doesn't discuss the width estimation of ill-formed Unicode.
For P1868 libc++ uses policy option 1 for ill-formed Unicode in the
width estimation. MSVC STL uses policy option 2. This means there is
implementation divergence in the width estimation.
At the moment I have two algorithms in libc++: one for P1868 and one for
how I interpret the rules of P2286. (The P2286 code hasn't been
reviewed and I expect reviewers to strongly dislike having two
algorithms.)
I would propose to write a paper as a DR which:
- Addresses the width estimation when encountering ill-formed Unicode.
When writing the algorithm I noticed most terminals used policy
option 1; however, at the time I was unaware of PR-121. So I would like
some feedback on which policy option is preferred.
- Clearly specifies how to recover from ill-formed Unicode; preferably
referring to the Unicode Standard.
Due to private obligations I'm not sure whether I will be back in time
to join the next telecon. So I would rather have it on the agenda for the
28th if we want to discuss it in a telecon.
[1] https://github.com/sg16-unicode/sg16-meetings#may-11th-2022
[2] http://eel.is/c++draft/format#string.escaped-example-1
[3] https://www.unicode.org/versions/Unicode14.0.0/ch03.pdf
[4] http://unicode.org/review/pr-121.html
Mark
Received on 2022-09-12 17:03:08