Re: Handling ill-formed Unicode in the library

From: Mark de Wever <koraq_at_[hidden]>
Date: Wed, 5 Oct 2022 18:22:59 +0200
On Mon, Sep 12, 2022 at 02:44:44PM -0400, Tom Honermann wrote:
> > Based on Chapter 3 of Unicode 14 [3] Constraints on Conversion Processes
> >
> > If the converter encounters an ill-formed UTF-8 code unit sequence
> > which starts with a valid first byte, but which does not continue with
> > valid successor bytes (see Table 3-7), it must not consume the
> > successor bytes as part of the ill-formed subsequence whenever those
> > successor bytes themselves constitute part of a well-formed UTF-8 code
> > unit subsequence.
> >
> > I would have expected the output to be ["\x{c3}("]. So all code units
> > are written, but it isn't clear what the exact specification is.
> I think you are right and that the example is incorrect.

As discussed during the last telecon, I created a PR to address the issue
in the WP.
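
To make the quoted rule concrete, here is a minimal sketch of how the
expected output could be produced. It is not the library wording or any
implementation's code: the function names well_formed_length and
escape_ill_formed are hypothetical, and escaping each ill-formed code
unit on its own as \x{..} is an assumption chosen to match the expected
output above. It checks the byte ranges of Table 3-7 and never consumes
a byte that could start a well-formed sequence.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical helper: length of the well-formed UTF-8 sequence that
// starts at s[i] (byte ranges per Unicode Table 3-7), or 0 if the byte
// at s[i] does not start a well-formed sequence.
std::size_t well_formed_length(std::string_view s, std::size_t i) {
    auto byte = [&](std::size_t k) -> unsigned {
        return static_cast<unsigned char>(s[k]);
    };
    // True if position k exists and its byte lies in [lo, hi].
    auto in = [&](std::size_t k, unsigned lo, unsigned hi) {
        return k < s.size() && byte(k) >= lo && byte(k) <= hi;
    };

    const unsigned b0 = byte(i);
    if (b0 <= 0x7F) return 1;
    if (b0 >= 0xC2 && b0 <= 0xDF)
        return in(i + 1, 0x80, 0xBF) ? 2 : 0;
    if (b0 >= 0xE0 && b0 <= 0xEF) {
        const unsigned lo = (b0 == 0xE0) ? 0xA0 : 0x80;  // no overlong forms
        const unsigned hi = (b0 == 0xED) ? 0x9F : 0xBF;  // no surrogates
        return (in(i + 1, lo, hi) && in(i + 2, 0x80, 0xBF)) ? 3 : 0;
    }
    if (b0 >= 0xF0 && b0 <= 0xF4) {
        const unsigned lo = (b0 == 0xF0) ? 0x90 : 0x80;  // no overlong forms
        const unsigned hi = (b0 == 0xF4) ? 0x8F : 0xBF;  // at most U+10FFFF
        return (in(i + 1, lo, hi) && in(i + 2, 0x80, 0xBF) &&
                in(i + 3, 0x80, 0xBF)) ? 4 : 0;
    }
    return 0;  // 0x80..0xC1 and 0xF5..0xFF never start a sequence
}

// Hypothetical helper: copies well-formed sequences unchanged and writes
// every ill-formed code unit as \x{hex}. Only one byte is consumed per
// ill-formed unit, so a following well-formed byte such as '(' survives.
std::string escape_ill_formed(std::string_view s) {
    std::string out;
    for (std::size_t i = 0; i < s.size();) {
        const std::size_t n = well_formed_length(s, i);
        if (n == 0) {
            char buf[8];
            std::snprintf(buf, sizeof buf, "\\x{%x}",
                          static_cast<unsigned>(
                              static_cast<unsigned char>(s[i])));
            out += buf;
            i += 1;
        } else {
            out.append(s.data() + i, n);
            i += n;
        }
    }
    return out;
}
```

With this sketch, escape_ill_formed("\xc3(") yields \x{c3}( because 0xC3
is a valid first byte but 0x28 is not a valid successor, so only 0xC3 is
escaped and the parenthesis is written as-is. Surrounding quotes and the
other escapes a formatter would apply are left out.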



Received on 2022-10-05 16:23:04