<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, 9 Sep 2020 at 20:39, Tom Honermann &lt;<a href="mailto:tom@honermann.net">tom@honermann.net</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
  
    
  
  <div>
    <div>On 9/9/20 1:26 PM, Corentin wrote:<br>
    </div>
    <blockquote type="cite">
      
      <div dir="auto">
        <div><br>
          <br>
          <div class="gmail_quote">
            <div dir="ltr" class="gmail_attr">On Wed, Sep 9, 2020, 18:42
              Tom Honermann &lt;<a href="mailto:tom@honermann.net" target="_blank">tom@honermann.net</a>&gt; wrote:<br>
            </div>
            <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
              <div>
                <div>I conducted an experiment today that I&#39;ve been
                  meaning to do for a while now and that is relevant for
                  this paper (and perhaps a worthwhile addition to the
                  paper).</div>
              </div>
            </blockquote>
          </div>
        </div>
        <div dir="auto"><br>
        </div>
        <div dir="auto">Thanks Tom, this is very interesting.</div>
        <div dir="auto">But it is only marginally relevant to the paper,
          which, if I remember correctly, goes into some detail about
          round-tripping.</div>
        <div dir="auto">It is not something the paper is trying to
          either prevent or mandate.</div>
        <div dir="auto">We only seek to mandate that there exists a
          transformation from source to Unicode, which neither implies nor
          requires that the transformation be reversible.</div>
        <div dir="auto">Support (or lack thereof) for round-tripping is
          possible within these constraints and is never observable from
          the program. And as you observed, it doesn&#39;t match existing
          practice, except on some EDG-derived compilers.</div>
      </div>
    </blockquote>
    Discussion is not limited to what is proposed in the paper and is
    encouraged in order to probe suitability of a proposal within the
    full complexity of the C++ ecosystem.  This experiment and other
    discussion on the mailing list are intended to probe the full
    solution space.  The goal of such questioning is to identify
    solutions that increase consensus.<br></div></blockquote><div><br></div><div>Are you proposing that a specific round-tripping behavior for Shift JIS should be mandated? </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>
    <blockquote type="cite">
      <div dir="auto">
        <div dir="auto"><br>
        </div>
        <div dir="auto"><br>
        </div>
        <div dir="auto">
          <div class="gmail_quote">
            <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
              <div>
                <div><br>
                </div>
                <div>Microsoft code page 932 (Microsoft&#39;s Shift JIS
                  variant) defines a number of code points that do not
                  round trip through Unicode due to having duplicate
                  code point assignments that are (quite reasonably) not
                  duplicated in Unicode.  One of them is:</div>
                <blockquote>
                  <div>0x8795   -&gt; U+221a   -&gt; 0x81e3   Square
                    Root</div>
                </blockquote>
                <div>The following test case demonstrates existing
                  behavior as exhibited by gcc (11.0.0 snapshot) and
                  Visual C++ (2019).  Both of these compilers accept the
                  test case when compiled with the command lines shown. 
                  Repeating the experiment will require substituting the
                  replacement characters in the string literal with the
                  indicated Shift JIS double byte sequence (or using the
                  attached file if it survives transmission).<br>
                </div>
                <blockquote>
                  <div><tt>$ cat t.cpp</tt></div>
                  <div><tt>constexpr char sx8795[] = &quot;��&quot;; // 0x87 0x95
                      =&gt; CP932 0x8795 == U+221A (Square Root)</tt><tt><br>
                    </tt><tt>static_assert((unsigned char)sx8795[0] ==
                      0x81); // Converted to CP932 0x81e3</tt><tt><br>
                    </tt><tt>static_assert((unsigned char)sx8795[1] ==
                      0xe3); // Converted to CP932 0x81e3</tt><tt><br>
                    </tt><tt>static_assert((unsigned char)sx8795[2] ==
                      0);</tt><tt><br>
                    </tt><tt>constexpr char sx81e3[] = &quot;��&quot;;  // 0x81
                      0xe3 =&gt; CP932 0x81e3 == U+221A (Square Root)</tt><tt><br>
                    </tt><tt>static_assert((unsigned char)sx81e3[0] ==
                      0x81); // Preserved</tt><tt><br>
                    </tt><tt>static_assert((unsigned char)sx81e3[1] ==
                      0xe3); // Preserved</tt><tt><br>
                    </tt><tt>static_assert((unsigned char)sx81e3[2] ==
                      0);</tt></div>
                  <div><tt><br>
                    </tt></div>
                  <div><tt>$ g++ -c -finput-charset=cp932
                      -fexec-charset=cp932 -std=c++17 t.cpp</tt></div>
                  <div><tt>&lt;no errors&gt;<br>
                    </tt></div>
                  <div><tt><br>
                    </tt></div>
                  <div><tt>$ cl /c /std:c++17 /source-charset:.932
                      /execution-charset:.932 t.cpp</tt></div>
                  <div><tt>&lt;no errors&gt;</tt><br>
                  </div>
                </blockquote>
                <div>Note that both compilers converted the source
                  0x8795 double byte sequence to 0x81e3; the original
                  source bytes were not preserved.</div>
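                <div>The conversion noted above can be modeled as a decode
                  to Unicode followed by an encode back to CP932. The
                  following is a minimal sketch under that assumption, not
                  a description of how either compiler is implemented. The
                  names <tt>Mapping</tt>, <tt>kDecode</tt>,
                  <tt>to_unicode</tt>, <tt>to_cp932</tt>, and
                  <tt>round_trip</tt> are hypothetical, and the table holds
                  only the two Square Root entries quoted earlier.</div>

```cpp
#include <cstdint>

// Hypothetical excerpt of the CP932 mapping: only the two Square Root
// entries quoted in this thread are modeled; a real table is far larger.
struct Mapping {
    std::uint16_t cp932;
    char32_t unicode;
};

constexpr Mapping kDecode[] = {
    {0x8795, 0x221A},  // duplicate assignment
    {0x81E3, 0x221A},  // assignment chosen when encoding
};

// CP932 -> Unicode; returns 0 for codes outside this excerpt.
constexpr char32_t to_unicode(std::uint16_t code) {
    for (const Mapping& m : kDecode) {
        if (m.cp932 == code) return m.unicode;
    }
    return 0;
}

// Unicode -> CP932; only one code can be chosen per Unicode scalar value.
// Returns 0 for scalar values outside this excerpt.
constexpr std::uint16_t to_cp932(char32_t c) {
    return c == 0x221A ? 0x81E3 : 0;
}

// Decode to Unicode, then encode back to CP932.
constexpr std::uint16_t round_trip(std::uint16_t code) {
    return to_cp932(to_unicode(code));
}

static_assert(round_trip(0x8795) == 0x81E3);  // original bytes not preserved
static_assert(round_trip(0x81E3) == 0x81E3);  // preserved
```

                <div>Both assertions hold because U+221A can map back to
                  only one CP932 code, so 0x8795 cannot survive a trip
                  through Unicode in this model.</div>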
                <div><br>
                </div>
                <div>However, both compilers fail the test case if
                  character set conversions are not specified:</div>
                <blockquote>
                  <div><tt>$ g++ -c -std=c++17 t.cpp</tt><tt><br>
                    </tt><tt>t.cpp:2:40: error: static assertion failed</tt><tt><br>
                    </tt><tt>    2 | static_assert((unsigned
                      char)sx8795[0] == 0x81); // Converted to CP932
                      0x81e3</tt><tt><br>
                    </tt><tt>      |              
                      ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~</tt><tt><br>
                    </tt><tt>t.cpp:3:40: error: static assertion failed</tt><tt><br>
                    </tt><tt>    3 | static_assert((unsigned
                      char)sx8795[1] == 0xe3); // Converted to CP932
                      0x81e3</tt><tt><br>
                    </tt><tt>      |              
                      ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~</tt></div>
                  <div><tt><br>
                    </tt></div>
                  <div><tt>$ cl /nologo /diagnostics:caret /c /std:c++17
                      t.cpp</tt><tt><br>
                    </tt><tt>t.cpp</tt><tt><br>
                    </tt><tt>t.cpp(2,40): error C2607: static assertion
                      failed<br>
                      static_assert((unsigned char)sx8795[0] == 0x81);
                      // Converted to CP932 0x81e3<br>
                                                             ^<br>
                      t.cpp(3,40): error C2607: static assertion failed<br>
                      static_assert((unsigned char)sx8795[1] == 0xe3);
                      // Converted to CP932 0x81e3</tt></div>
                  <div><tt>                                       ^</tt><br>
                  </div>
                </blockquote>
                <div>Note that these double byte sequences are not valid
                  UTF-8 sequences, so the pass-the-bytes mode exhibited
                  by gcc is not an artifact of normal UTF-8 handling. 
                  The sequences are valid for Windows-1252 (which is the
                  default encoding used by Visual C++ on the system I
                  tested on), so this test is not indicative of a
                  pass-the-bytes mode for Visual C++.</div>
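                <div>One way to see why neither double byte sequence is
                  valid UTF-8 is to look at the first byte alone: 0x87 and
                  0x81 both fall in the continuation byte range (0x80 to
                  0xBF), so they cannot begin a UTF-8 sequence. The
                  following is a minimal sketch of that check; the function
                  name <tt>can_start_utf8_sequence</tt> is hypothetical, and
                  it tests only the first byte rather than validating a
                  whole sequence.</div>

```cpp
// True if `b` may begin a well-formed UTF-8 sequence: either an ASCII byte
// (0x00-0x7F) or a multi-byte lead byte (0xC2-0xF4). Continuation bytes
// (0x80-0xBF) and the bytes 0xC0, 0xC1, 0xF5-0xFF never begin a sequence.
constexpr bool can_start_utf8_sequence(unsigned char b) {
    return b <= 0x7F || (b >= 0xC2 && b <= 0xF4);
}

static_assert(!can_start_utf8_sequence(0x87));  // first byte of CP932 0x8795
static_assert(!can_start_utf8_sequence(0x81));  // first byte of CP932 0x81e3
static_assert(can_start_utf8_sequence(0xE2));   // U+221A in UTF-8 is E2 88 9A
```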
              </div>
            </blockquote>
          </div>
        </div>
        <div dir="auto"><br>
        </div>
        <div dir="auto">When gcc assumes the source is UTF-8, iconv is
          not called and no check or conversion is performed; I
          believe that&#39;s what you are seeing here.</div>
      </div>
    </blockquote>
    <p>Gcc exhibits the same behavior whether the source encoding is
      assumed to be UTF-8 or explicitly indicated as such.  I think the
      more complicated answer is that gcc doesn&#39;t use iconv for string
      literal contents when the source and execution character sets are
      the same (or it ignores conversion errors in that case).  Errors
      are (necessarily) issued when the source and execution character
      sets differ:<br>
    </p>
    <blockquote>
      <p><tt>$ g++ -c -std=c++17 -finput-charset=utf-8
          -fexec-charset=cp932 t.cpp</tt><tt><br>
        </tt><tt>t.cpp:1:27: error: converting to execution character
          set: Invalid or incomplete multibyte or wide character</tt><tt><br>
        </tt><tt>    1 | constexpr char sx8795[] = &quot;��&quot;; // 0x87 0x95
          =&gt; CP932 0x8795 == U+221A (Square Root)</tt><tt><br>
        </tt><tt>      |                           ^~~~</tt></p>
      <p><tt>...</tt></p>
      <p><tt>t.cpp:5:27: error: converting to execution character set:
          Invalid or incomplete multibyte or wide character</tt><tt><br>
        </tt><tt>    5 | constexpr char sx81e3[] = &quot;��&quot;;  // 0x81 0xe3
          =&gt; CP932 0x81e3 == U+221A (Square Root)</tt><tt><br>
        </tt><tt>      |                           ^~~~</tt><tt><br>
        </tt><tt>...</tt><br>
      </p>
    </blockquote>
    <p> This example, as well as the preceding and following quoted ones,
      are examples of mojibake and are used in this context to
      illustrate behavior with ill-formed UTF-8 input.  I failed to
      point that out previously.<br>
    </p>
    <p>Tom.<br>
    </p>
    <blockquote type="cite">
      <div dir="auto">
        <div dir="auto"><br>
        </div>
        <div dir="auto">
          <div class="gmail_quote">
            <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
              <div>
                <div><br>
                </div>
                <div>Finally, it is worth noting that explicitly
                  treating the source as UTF-8 does not cause any
                  additional errors for gcc, but does for Visual C++
                  (Gcc does not diagnose the ill-formed UTF-8 sequences
                  in the string literals, but Visual C++ does).<br>
                </div>
                <blockquote>
                  <div><tt>$ g++ -c -finput-charset=utf-8 -std=c++17
                      t.cpp</tt><tt><br>
                    </tt><tt>t.cpp:2:40: error: static assertion failed</tt><tt><br>
                    </tt><tt>    2 | static_assert((unsigned
                      char)sx8795[0] == 0x81); // Converted to CP932
                      0x81e3</tt><tt><br>
                    </tt><tt>      |              
                      ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~</tt><tt><br>
                    </tt><tt>t.cpp:3:40: error: static assertion failed</tt><tt><br>
                    </tt><tt>    3 | static_assert((unsigned
                      char)sx8795[1] == 0xe3); // Converted to CP932
                      0x81e3</tt><tt><br>
                    </tt><tt>      |              
                      ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~</tt></div>
                  <div><tt><br>
                    </tt></div>
                  <div><tt>$ cl /nologo /diagnostics:caret /c /std:c++17
                      /utf-8 t.cpp</tt><tt><br>
                    </tt><tt>t.cpp</tt><tt><br>
                    </tt><tt>t.cpp(1,1): warning C4828: The file
                      contains a character starting at offset 0x1b that
                      is illegal in the current source character set
                      (codepage 65001).</tt><tt><br>
                    </tt><tt>constexpr char sx8795[] = &quot;&quot;; // 0x87 0x95
                      =&gt; CP932 0x8795 == U+221A (Square Root)</tt><tt><br>
                    </tt><tt>^</tt><tt><br>
                    </tt><tt>t.cpp(1,1): warning C4828: The file
                      contains a character starting at offset 0x1c that
                      is illegal in the current source character set
                      (codepage 65001).</tt><tt><br>
                    </tt><tt>constexpr char sx8795[] = &quot;&quot;; // 0x87 0x95
                      =&gt; CP932 0x8795 == U+221A (Square Root)</tt><tt><br>
                    </tt><tt>^</tt><tt><br>
                    </tt><tt>t.cpp(1,1): warning C4828: The file
                      contains a character starting at offset 0x13e that
                      is illegal in the current source character set
                      (codepage 65001).</tt><tt><br>
                    </tt><tt>constexpr char sx8795[] = &quot;&quot;; // 0x87 0x95
                      =&gt; CP932 0x8795 == U+221A (Square Root)</tt><tt><br>
                    </tt><tt>^</tt><tt><br>
                    </tt><tt>t.cpp(1,1): warning C4828: The file
                      contains a character starting at offset 0x13f that
                      is illegal in the current source character set
                      (codepage 65001).</tt><tt><br>
                    </tt><tt>constexpr char sx8795[] = &quot;&quot;; // 0x87 0x95
                      =&gt; CP932 0x8795 == U+221A (Square Root)</tt><tt><br>
                    </tt><tt>^</tt><tt><br>
                    </tt><tt>t.cpp(2,40): error C2607: static assertion
                      failed</tt><tt><br>
                    </tt><tt>static_assert((unsigned char)sx8795[0] ==
                      0x81); // Converted to CP932 0x81e3</tt><tt><br>
                    </tt><tt>                                       ^</tt><tt><br>
                    </tt><tt>t.cpp(3,40): error C2607: static assertion
                      failed</tt><tt><br>
                    </tt><tt>static_assert((unsigned char)sx8795[1] ==
                      0xe3); // Converted to CP932 0x81e3</tt><tt><br>
                    </tt><tt>                                       ^</tt><tt><br>
                    </tt><tt>t.cpp(5,27): error C2001: newline in
                      constant</tt><tt><br>
                    </tt><tt>constexpr char sx81e3[] = &quot;&quot;;  // 0x81 0xe3
                      =&gt; CP932 0x81e3 == U+221A (Square Root)</tt><tt><br>
                    </tt><tt>                          ^</tt><tt><br>
                    </tt><tt>t.cpp(6,1): error C2143: syntax error:
                      missing &#39;;&#39; before &#39;static_assert&#39;</tt><tt><br>
                    </tt><tt>static_assert((unsigned char)sx81e3[0] ==
                      0x81); // Preserved</tt><tt><br>
                    </tt><tt>^</tt><tt><br>
                    </tt><tt>t.cpp(8,40): error C2607: static assertion
                      failed</tt><tt><br>
                    </tt><tt>static_assert((unsigned char)sx81e3[2] ==
                      0);</tt><tt><br>
                    </tt><tt>                                       ^</tt><br>
                  </div>
                </blockquote>
                <div>Tom.<br>
                </div>
                <div><br>
                </div>
                <div>On 9/9/20 11:40 AM, Tom Honermann via SG16 wrote:<br>
                </div>
                <blockquote type="cite">
                  <div>On 8/24/20 8:31 AM, Peter Brett via SG16 wrote:<br>
                  </div>
                  <blockquote type="cite">
                    <pre>Hi all,

In this week&#39;s meeting, we are going to discuss the remaining
proposals from P2178R1 &quot;Misc lexing and string handling improvements&quot;.
In particular, we will discuss proposal 9:

    Proposal 9: Reaffirming Unicode as the character set of the
    internal representation

In anticipation of a lively discussion, Corentin and I have written a
short new paper which will be appearing in the September mailing.

    P2194R0 The character set of C++ source code is Unicode
    <a href="https://isocpp.org/files/papers/P2194R0.pdf" rel="noreferrer" target="_blank">https://isocpp.org/files/papers/P2194R0.pdf</a></pre>
                  </blockquote>
                  <p>In preparation for this discussion, please also
                    (re-)read section 5.2.1 of the <a href="http://www.open-std.org/jtc1/sc22/wg14/www/C99RationaleV5.10.pdf" rel="noreferrer" target="_blank">C99 Rationale document</a>;
                    in particular the &quot;UCN Models&quot; section on pages 20
                    and 21.<br>
                  </p>
                  <p>Tom.<br>
                  </p>
                  <blockquote type="cite">
                    <pre>We hope that the study group finds this contribution helpful and
informative.

Best regards,

                       Peter

</pre>
                  </blockquote>
                  <p><br>
                  </p>
                  <br>
                </blockquote>
                <p><br>
                </p>
              </div>
            </blockquote>
          </div>
        </div>
      </div>
    </blockquote>
    <p><br>
    </p>
  </div>

</blockquote></div></div>

