<div dir="auto"><div><br><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Sep 12, 2022, 20:44 Tom Honermann via SG16 &lt;<a href="mailto:sg16@lists.isocpp.org" rel="noreferrer noreferrer" target="_blank">sg16@lists.isocpp.org</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
  
    
  
  <div>
    <p>Hi, Mark.</p>
    <p>Thank you for reporting this. I&#39;ve tentatively put this on the
      agenda for September 28th (along with review of LWG issues <a href="https://cplusplus.github.io/LWG/issue3767" rel="noreferrer noreferrer noreferrer" target="_blank">3767</a> and <a href="https://cplusplus.github.io/LWG/issue3412" rel="noreferrer noreferrer noreferrer" target="_blank">3412</a>).</p>
    <p>Other comments inlined below.<br>
    </p>
    <div>On 9/12/22 1:03 PM, Mark de Wever via
      SG16 wrote:<br>
    </div>
    <blockquote type="cite">
      <pre>During the May 11th[1] telecon the paper

   P2286R8: Formatting Ranges

was reviewed.

There were concerns raised regarding the lack of specifications for
determining the boundaries of ill-formed code unit sequences. We decided
it was not a big issue since:
- the method used does not appear to be observable since each code unit
  of the sequence is written to the output anyway.
- it should not matter for self-synchronizing encodings.

I&#39;m working on the implementation of this part of the paper in libc++
and I&#39;m having concerns with example 5 [2]

  string s5 = format(&quot;[{:?}]&quot;, &quot;\xc3\x28&quot;); // invalid UTF-8
                                            // s5 has value: [&quot;\x{c3}\x{28}&quot;]

\xc3 is the start of a 2-byte UTF-8 code unit sequence
\x28 is not a valid successor byte 
     it is a valid 1-byte UTF-8 sequence for LEFT PARENTHESIS

Based on Chapter 3 of Unicode 14 [3] Constraints on Conversion Processes

  If the converter encounters an ill-formed UTF-8 code unit sequence
  which starts with a valid first byte, but which does not continue with
  valid successor bytes (see Table 3-7), it must not consume the
  successor bytes as part of the ill-formed subsequence whenever those
  successor bytes themselves constitute part of a well-formed UTF-8 code
  unit subsequence.

I would have expected the output to be [&quot;\x{c3}(&quot;]. So all code units
are written, but it isn&#39;t clear what the exact specification is.</pre>
    </blockquote>
    I think you are right and that the example is incorrect.<br></div></blockquote></div></div><div dir="auto"><br></div><div dir="auto">I am not so sure whether it is correct or not.</div><div dir="auto">We need a consistent answer here. It&#39;s really important that error recovery behaves consistently across existing and future facilities, and I tend to agree with Charlie that option 2 is desirable.</div><div dir="auto">Either way, we do need a resolution.</div><div dir="auto"><br></div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>
    <blockquote type="cite">
      <pre>During the telecon Charlie shared a link to Unicode PR-121 [4] and
suggested we use policy option 2. Both for handling ill-formed Unicode
in an escape string and for the width estimation introduced in

  P1868R2 🦄 width: clarifying units of width and precision in std::format

P1868 doesn&#39;t discuss the width estimation of ill-formed Unicode.

For P1868 libc++ uses policy option 1 for ill-formed Unicode in the
width estimation. MSVC STL uses policy option 2. This means there is
implementation divergence in the width estimation.

At the moment I have two algorithms in libc++, one for P1868 and one for
how I interpret the rules of P2286. (The P2286 code hasn&#39;t been
reviewed and I expect reviewers to strongly dislike having two
algorithms.)

I would propose to write a paper as a DR which
- Addresses the width estimation when encountering ill-formed Unicode.
  When writing the algorithm I noticed most terminals used policy
  option 1, however at the time I was unaware of PR-121. So I would like
  some feedback on which policy option is preferred.
- Clearly specifies how to recover from ill-formed Unicode; preferably
  referring to the Unicode Standard.</pre>
    </blockquote>
    <p>It isn&#39;t clear to me that it is important for implementations to
      behave consistently when formatting text that contains ill-formed
      code unit sequences, but establishing a recommendation seems
      advisable in any case. <br>
    </p>
    <p>I guess an argument could be made that <a href="https://eel.is/c++draft/format.string.std#13" rel="noreferrer noreferrer noreferrer" target="_blank">[format.string.std]p13</a>
      states that an ill-formed code unit sequence has an unspecified
      width. I don&#39;t find such a reading very satisfying though so I
      agree we should add clarification.<br></p></div></blockquote></div></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><p></p></div></blockquote></div></div><div dir="auto">I do think it&#39;s sufficient: an invalid code unit sequence means the whole string is not in a Unicode encoding.</div><div dir="auto">We could add a note, as an LWG issue.</div><div dir="auto">(And unspecified seems appropriate.)</div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><p>
    </p>
    <p>At a minimum, we should fix (or remove) the <a href="https://eel.is/c++draft/format.string.escaped#example-1" rel="noreferrer noreferrer noreferrer" target="_blank">example</a>
      mentioned above.</p>
    <p>We could probably handle all of these as LWG issues as opposed to
      a paper if you prefer, but I&#39;ll happily schedule a paper should
      one appear!<br></p></div></blockquote></div></div><div dir="auto">+1</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><p>
    </p>
    <p>Tom.<br>
    </p>
    <blockquote type="cite">
      <pre>Due to private obligations I&#39;m not sure whether I will be back in time
to join the next telecon. So I would rather have it on the agenda for the
28th if we want to discuss it in a telecon.

[1] <a href="https://github.com/sg16-unicode/sg16-meetings#may-11th-2022" rel="noreferrer noreferrer noreferrer" target="_blank">https://github.com/sg16-unicode/sg16-meetings#may-11th-2022</a>
[2] <a href="http://eel.is/c++draft/format#string.escaped-example-1" rel="noreferrer noreferrer noreferrer" target="_blank">http://eel.is/c++draft/format#string.escaped-example-1</a>
[3] <a href="https://www.unicode.org/versions/Unicode14.0.0/ch03.pdf" rel="noreferrer noreferrer noreferrer" target="_blank">https://www.unicode.org/versions/Unicode14.0.0/ch03.pdf</a>
[4] <a href="http://unicode.org/review/pr-121.html" rel="noreferrer noreferrer noreferrer" target="_blank">http://unicode.org/review/pr-121.html</a>

Mark
</pre>
    </blockquote>
  </div>

</blockquote></div></div></div>
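<div dir="auto">To make the boundary question concrete, here is a minimal sketch of the rule Mark quotes from Unicode 14 Chapter 3: an ill-formed subsequence stops before any code unit that could itself start a well-formed sequence. The names <code>Scan</code>, <code>scan_utf8</code> and <code>escape_ill_formed</code> are hypothetical and not taken from P2286, libc++ or MSVC STL. The sketch only escapes ill-formed code units; it does not implement the rest of the P2286 escaping rules (quotes, control characters and so on).</div>

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical helper type: what was found at one position of the input.
struct Scan {
    std::size_t length;  // code units consumed, always >= 1
    bool well_formed;    // true if they form one complete UTF-8 sequence
};

// Examines the code units starting at s[i] against Table 3-7 of the Unicode
// Standard. For ill-formed input, `length` covers only the code units that
// were a valid prefix (the "maximal subpart"), so a following byte that could
// start its own sequence is never consumed.
Scan scan_utf8(std::string_view s, std::size_t i) {
    auto byte = [&](std::size_t k) { return static_cast<unsigned char>(s[k]); };
    const unsigned char b0 = byte(i);
    if (b0 <= 0x7F) return {1, true};

    std::size_t trail = 0;               // number of expected trailing bytes
    unsigned char lo = 0x80, hi = 0xBF;  // allowed range of the second byte
    if (b0 >= 0xC2 && b0 <= 0xDF) {
        trail = 1;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
        trail = 2;
        if (b0 == 0xE0) lo = 0xA0;  // excludes overlong encodings
        if (b0 == 0xED) hi = 0x9F;  // excludes surrogates
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        trail = 3;
        if (b0 == 0xF0) lo = 0x90;  // excludes overlong encodings
        if (b0 == 0xF4) hi = 0x8F;  // excludes values above U+10FFFF
    } else {
        return {1, false};  // 80..C1 and F5..FF cannot start a sequence
    }

    std::size_t n = 1;
    for (std::size_t t = 0; t < trail; ++t, ++n) {
        if (i + n >= s.size()) return {n, false};  // truncated sequence
        const unsigned char b = byte(i + n);
        if (b < lo || b > hi) return {n, false};   // stop before this byte
        lo = 0x80;  // only the second byte has a restricted range
        hi = 0xBF;
    }
    return {n, true};
}

// Copies well-formed sequences unchanged and writes every code unit of an
// ill-formed subsequence as \x{hh}.
std::string escape_ill_formed(std::string_view s) {
    std::string out;
    for (std::size_t i = 0; i < s.size();) {
        const Scan r = scan_utf8(s, i);
        if (r.well_formed) {
            out.append(s.substr(i, r.length));
        } else {
            for (std::size_t k = 0; k < r.length; ++k) {
                char buf[8];
                std::snprintf(buf, sizeof buf, "\\x{%02x}",
                              static_cast<unsigned>(
                                  static_cast<unsigned char>(s[i + k])));
                out += buf;
            }
        }
        i += r.length;
    }
    return out;
}
```

<div dir="auto">Under these assumptions, <code>escape_ill_formed("\xc3\x28")</code> produces <code>\x{c3}(</code>, which is the output Mark expected, rather than <code>\x{c3}\x{28}</code> as in the current example. The <code>length</code> reported for an ill-formed scan is also the unit that PR-121 policy option 2 would replace with a single U+FFFD, so the same scan could in principle serve both the escaping and the width estimation.</div>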

