ISOCPP sg16 List: Re: Handling ill-formed Unicode in the library

From: Mark de Wever <koraq_at_[hidden]>
Date: Tue, 13 Sep 2022 19:51:40 +0200

On Mon, Sep 12, 2022 at 09:12:30PM +0200, Corentin Jabot via SG16 wrote:
> On Mon, Sep 12, 2022, 20:44 Tom Honermann via SG16 <sg16_at_[hidden]>
> wrote:
>
> > I'm working on the implementation of this part of the paper in libc++
> > and I'm having concerns with example 5 [2]
> >
> > string s5 = format("[{:?}]", "\xc3\x28"); // invalid UTF-8
> > // s5 has value: ["\x{c3}\x{28}"]
> >
> > \xc3 is the start of a 2-byte UTF-8 code unit sequence
> > \x28 is not a valid successor byte
> > it is a valid 1-byte UTF-8 sequence for LEFT PARENTHESIS
> >
> > Based on Chapter 3 of Unicode 14 [3] Constraints on Conversion Processes
> >
> > If the converter encounters an ill-formed UTF-8 code unit sequence
> > which starts with a valid first byte, but which does not continue with
> > valid successor bytes (see Table 3-7), it must not consume the
> > successor bytes as part of the ill-formed subsequence whenever those
> > successor bytes themselves constitute part of a well-formed UTF-8 code
> > unit subsequence.
> >
> > I would have expected the output to be ["\x{c3}("]. So all code units
> > are written, but it isn't clear what the exact specification is.
> >
> > I think you are right and that the example is incorrect.
> >
>
> I am not so sure whether it is correct or not.
> We need a consistent answer here. It's really important that error recovery
> behaves consistently across existing and future facilities and I tend to
> agree with Charlie on option 2 being desirable.
> Either way we do need a resolution.

+1
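
To make the "\xc3\x28" case concrete, here is a minimal sketch of the
behaviour I would have expected, assuming the escaping step copies
well-formed UTF-8 sequences unchanged and writes every other code unit as
\x{hex}. The names well_formed_length and escape_ill_formed are
hypothetical and only for illustration; this is not the libc++ code and
not proposed wording.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical helper: length of the well-formed UTF-8 sequence starting
// at s[i] (byte ranges as in Unicode Table 3-7), or 0 if s[i] does not
// start a well-formed sequence.
std::size_t well_formed_length(std::string_view s, std::size_t i) {
    auto byte = [&](std::size_t k) -> unsigned {
        return static_cast<unsigned char>(s[k]);
    };
    auto in = [&](std::size_t k, unsigned lo, unsigned hi) {
        return k < s.size() && byte(k) >= lo && byte(k) <= hi;
    };
    const unsigned b0 = byte(i);
    if (b0 <= 0x7F) return 1;
    if (b0 >= 0xC2 && b0 <= 0xDF)
        return in(i + 1, 0x80, 0xBF) ? 2 : 0;
    if (b0 >= 0xE0 && b0 <= 0xEF) {
        const unsigned lo = (b0 == 0xE0) ? 0xA0 : 0x80;  // no overlong forms
        const unsigned hi = (b0 == 0xED) ? 0x9F : 0xBF;  // no surrogates
        return (in(i + 1, lo, hi) && in(i + 2, 0x80, 0xBF)) ? 3 : 0;
    }
    if (b0 >= 0xF0 && b0 <= 0xF4) {
        const unsigned lo = (b0 == 0xF0) ? 0x90 : 0x80;  // no overlong forms
        const unsigned hi = (b0 == 0xF4) ? 0x8F : 0xBF;  // at most U+10FFFF
        return (in(i + 1, lo, hi) && in(i + 2, 0x80, 0xBF) &&
                in(i + 3, 0x80, 0xBF)) ? 4 : 0;
    }
    return 0;  // lone continuation byte, 0xC0, 0xC1, or 0xF5..0xFF
}

// Hypothetical helper: escape only the code units that are not part of a
// well-formed sequence. A byte following a bad lead byte is re-examined
// on its own, so it is never swallowed by the ill-formed subsequence.
std::string escape_ill_formed(std::string_view s) {
    std::string out;
    for (std::size_t i = 0; i < s.size();) {
        if (const std::size_t n = well_formed_length(s, i)) {
            out.append(s.substr(i, n));  // well-formed: copy unchanged
            i += n;
        } else {
            char buf[16];
            std::snprintf(buf, sizeof buf, "\\x{%x}",
                          static_cast<unsigned>(
                              static_cast<unsigned char>(s[i])));
            out += buf;                  // ill-formed: escape this code unit
            ++i;
        }
    }
    return out;
}
```

With this sketch, escape_ill_formed("\xc3\x28") yields \x{c3}( because
0x28 is examined again as the start of a new sequence after 0xc3 fails.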

> > During the telecon Charlie shared a link to Unicode PR-121 [4] and
> > suggested we use policy option 2. Both for handling ill-formed Unicode
> > in an escape string and for the width estimation introduced in
> >
> > P1868R2 🦄 width: clarifying units of width and precision in std::format
> >
> > P1868 doesn't discuss the width estimation of ill-formed Unicode.
> >
> > For P1868 libc++ uses policy option 1 for ill-formed Unicode in the
> > width estimation. MSVC STL uses policy option 2. This means there is
> > implementation divergence in the width estimation.
> >
> > At the moment I have two algorithms in libc++, one for P1868 and one for
> > how I interpret the rules of P2286. (The P2286 code hasn't been
> > reviewed and I expect reviewers to strongly dislike having two
> > algorithms.)
> >
> > I would propose to write a paper as DR which
> > - Addresses the width estimation when encountering ill-formed Unicode.
> > When writing the algorithm I noticed most terminals used policy
> > option 1, however at the time I was unaware of PR-121. So I would like
> > some feedback on which policy option is preferred.
> > - Clearly specifies how to recover from ill-formed Unicode; preferably
> > referring to the Unicode Standard.
> >
> > It isn't clear to me that it is important for implementations to behave
> > consistently when formatting text that contains ill-formed code unit
> > sequences, but establishing a recommendation seems advisable in any case.
> >
> > I guess an argument could be made that [format.string.std]p13
> > <https://eel.is/c++draft/format.string.std#13> states that an ill-formed
> > code unit sequence has an unspecified width. I don't find such a reading
> > very satisfying though so I agree we should add clarification.
> >
> I do think it's sufficient: an invalid code unit sequence means the whole
> string is not in a Unicode encoding.
> We could add a note - as an LWG issue.
> (And unspecified seems appropriate)

Interesting, I never read that paragraph as "how to handle ill-formed
Unicode". I always interpreted non-Unicode as other character encodings.
I think it would be good to be a bit more precise, especially when we
want to add more Unicode support in the Standard.

> > At a minimum, we should fix (or remove) the example
> > <https://eel.is/c++draft/format.string.escaped#example-1> mentioned above.
> >
> > We could probably handle all of these as LWG issues as opposed to a paper
> > if you prefer, but I'll happily schedule a paper should one appear!
> >
> +1

I don't mind LWG issues over a paper, but let's see how big or small the
resolution becomes.

Mark

Received on 2022-09-13 17:51:46