<div dir="auto"><div><br><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Jun 3, 2020, 05:19 Tom Honermann &lt;<a href="mailto:tom@honermann.net">tom@honermann.net</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
  
    
  
  <div>
    <div>On 6/2/20 7:57 AM, Corentin Jabot via
      SG16 wrote:<br>
    </div>
    <blockquote type="cite">
      
      <div dir="ltr">
        <div dir="ltr"><br>
        </div>
        <div dir="auto"><br>
          <br>
          <div class="gmail_quote">
            <div dir="ltr" class="gmail_attr">On Tue, Jun 2, 2020, 13:34
              Alisdair Meredith via SG16 &lt;<a href="mailto:sg16@lists.isocpp.org" target="_blank" rel="noreferrer">sg16@lists.isocpp.org</a>&gt;
              wrote:<br>
            </div>
            <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Translation phase 1
              maps source code to either a member of the<br>
              basic character set, or a UCN corresponding to that
              character.<br>
              What if there is no such UCN?  Is that undefined behavior,
              or is<br>
              the program ill-formed?  I can find nothing on this in
              [lex.phases]<br>
              where we describe processing the source through an
              implementation<br>
              defined character mapping.<br>
              <br>
              When we get to [lex.charset] we can see it is clearly
              ill-formed if<br>
              the produced UCN is invalid - is that supposed to be the
              resolution<br>
              here?  Source must always map to a UCN, but the UCN need
              not<br>
              be valid, so we get an error parsing the (implied) UCN in
              a later<br>
              phase?<br>
            </blockquote>
            <div><br>
            </div>
            <div>One more reason I want to rewrite phase 1.</div>
            <div><br>
            </div>
            <div>Two things should be specified here:</div>
            <div><br>
            </div>
            <div><font face="arial, sans-serif">&gt; <span style="color:rgb(0,0,0);text-align:justify">Any source
                  file character not in the</span><span style="color:rgb(0,0,0);text-align:justify"> </span><a href="http://eel.is/c++draft/lex#def:basic_source_character_set" style="text-align:justify;text-decoration-line:none" target="_blank" rel="noreferrer">basic source character set</a><span style="color:rgb(0,0,0);text-align:justify"> </span><span style="color:rgb(0,0,0);text-align:justify">is
                  replaced by the</span><span style="color:rgb(0,0,0);text-align:justify">  </span>
                <div id="m_5563270604875249260gmail-phases-1.1.sentence-3" style="display:inline;color:rgb(0,0,0);text-align:justify"><a href="http://eel.is/c++draft/lex#nt:universal-character-name" style="text-decoration-line:none;font-style:italic" target="_blank" rel="noreferrer">universal-character-name</a> that
                  designates that character<a href="http://eel.is/c++draft/lex#phases-1.1.sentence-3" style="text-decoration-line:none;color:inherit" target="_blank" rel="noreferrer">.</a></div>
                <span style="color:rgb(0,0,0);text-align:justify"> </span></font></div>
            <div><span style="color:rgb(0,0,0);text-align:justify"><font face="arial, sans-serif"><br>
                </font></span></div>
            <div><span style="color:rgb(0,0,0);text-align:justify"><font face="arial, sans-serif">This is wrong: a character may
                  map to a sequence of UCNs, not just a single UCN.</font></span></div>
          </div>
        </div>
      </div>
    </blockquote>
    <font face="arial, sans-serif">Good point.  This can happen with a
      few legacy encodings.  For example, Big5-HKSCS includes a few
      characters that map to a Unicode code point pair.</font><br>
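    <p>As a rough illustration of such a one-to-many mapping, here is a minimal sketch. The function name <code>hkscs_pair_mapping</code> is hypothetical and is not part of any real library; the four entries are, to my understanding, the Big5-HKSCS codes that decode to a base letter followed by a combining mark. A real decoder would consult a complete table.</p>

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical helper (an assumption for illustration, not a real API):
// returns the Unicode codepoint sequence for the Big5-HKSCS codes that
// need a codepoint pair. Only those four codes are listed here.
std::optional<std::u32string> hkscs_pair_mapping(std::uint16_t code) {
    switch (code) {
    case 0x8862: return std::u32string(U"\u00CA\u0304"); // E circumflex + combining macron
    case 0x8864: return std::u32string(U"\u00CA\u030C"); // E circumflex + combining caron
    case 0x88A3: return std::u32string(U"\u00EA\u0304"); // e circumflex + combining macron
    case 0x88A5: return std::u32string(U"\u00EA\u030C"); // e circumflex + combining caron
    default:     return std::nullopt;                    // not one of the pair mappings
    }
}
```

    <p>The return type is a string rather than a single <code>char32_t</code>, which is the point: a phase 1 mapping that yields exactly one UCN per source character cannot express these entries.</p>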
    <blockquote type="cite">
      <div dir="ltr">
        <div dir="auto">
          <div class="gmail_quote">
            <div><span style="color:rgb(0,0,0);text-align:justify"><font face="arial, sans-serif"><br>
                </font></span></div>
            <div style="text-align:justify"><font face="arial,
                sans-serif" color="#000000">A character that has no
                representation in Unicode should make the program ill-formed, with
                the caveat that implementers can do _anything_ in phase
                0.</font></div>
          </div>
        </div>
      </div>
    </blockquote>
    <p><font face="arial, sans-serif">Phase 0?</font></p></div></blockquote></div></div><div dir="auto"><br></div><div dir="auto">Sorry, let me clarify. Whatever we specify, we can&#39;t prevent implementers from doing transformations before phase 1, so phase 1 is mostly guidance. </div><div dir="auto">I don&#39;t think there is a way around that, nor should there be.</div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>
    <p><font face="arial, sans-serif">Since source file encoding and
        phase 1 translation are implementation-defined, I don&#39;t think the
        standard can really say much about this scenario.  But it seems
        pretty clear that, post phase 1, the standard specifies no
        facilities to handle a non-Unicode character.<br>
      </font></p>
    <blockquote type="cite">
      <div dir="ltr">
        <div dir="auto">
          <div class="gmail_quote">
            <div style="text-align:justify"><font face="arial,
                sans-serif" color="#000000"><br>
              </font></div>
            <div style="text-align:justify"><font face="arial,
                sans-serif" color="#000000">Note that the existence of a
                mapping is different from the validity of a UCN.</font></div>
            <div style="text-align:justify"><font face="arial,
                sans-serif" color="#000000">Mapping characters that have no
                Unicode representation to nothing is one implementation
                strategy.</font></div>
            <div style="text-align:justify"><font face="arial,
                sans-serif" color="#000000">Another valid strategy would
                be to use the Private Use Area (PUA) to represent these characters.</font></div>
            <div style="text-align:justify"><font face="arial,
                sans-serif" color="#000000"><br>
              </font></div>
            <div style="text-align:justify"><font face="arial,
                sans-serif" color="#000000"><br>
              </font></div>
            <div style="text-align:justify"><span style="color:rgb(0,0,0);font-family:arial,sans-serif">To
                give you an idea of where I want to be, here is a very
                early draft of what I think phase 1 and 2 should do,
                pending</span><br>
            </div>
            <div style="text-align:justify"><font face="arial,
                sans-serif" color="#000000">a couple of design changes
                that EWG would have to look at.</font></div>
            <div style="text-align:justify"><font size="3" face="arial,
                sans-serif" color="#000000"><br>
              </font></div>
            <font face="monospace">1. If the physical source character
              set is the Unicode character set, each codepoint in the
              source<br>
              file is converted to the internal representation of that
              same codepoint. Codepoints that<br>
              are surrogate codepoints or invalid codepoints are
              ill-formed.<br>
              Otherwise, each abstract character in the source file is
              mapped in an implementation-defined<br>
              manner to a sequence of Unicode codepoints
              representing the same abstract<br>
              character (introducing new-line characters for
              end-of-line indicators if necessary).<br>
              An implementation may use any internal encoding able to
              uniquely represent any Unicode<br>
              codepoint. <b>If an abstract character in the source
                file is not representable in the<br>
                Unicode character set, the program is ill-formed.</b><br>
              An implementation supports source files representing a
              sequence of UTF-8 code units.<br>
              Any additional physical source file character sets
              accepted are implementation-defined.<br>
              How the character set of a source file is determined
              is implementation-defined.<br>
              <br>
            </font></div>
          <div class="gmail_quote"><font face="monospace">2. Each
              implementation-defined line termination sequence of
              characters is replaced by a<br>
              LINE FEED character (U+000A). Each instance of a
              BACKSLASH (\) immediately<br>
              followed by a LINE FEED</font><font face="monospace"> or
              at the end of a file is deleted, splicing physical source<br>
              lines to form logical source lines. Only the last
              backslash on any physical source line shall<br>
              be eligible for being part of such a splice. Except for
              splices reverted in a raw string literal,</font></div>
          <div class="gmail_quote"><font face="monospace">if a splice
              results in a codepoint sequence that matches the syntax of
              a universal-character-name,<br>
              the behavior is implementation-defined. A source
              file that is not empty and that does not end<br>
              file that is not empty and that does not end<br>
              in a <i>LINE FEED</i>, or that ends in a LINE FEED
              immediately preceded by a BACKSLASH</font><span style="font-family:monospace"> before any such splicing
              takes place, shall be processed as if an</span></div>
          <div class="gmail_quote"><font face="monospace">additional
              LINE FEED were appended to the file.<br>
              Sequences of whitespace codepoints at the end of each line
              are removed.<br>
              Each universal-character-name is replaced by the Unicode
              codepoint it designates.</font><br>
          </div>
        </div>
      </div>
    </blockquote>
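    <p>For reference, the codepoint check in the draft's phase 1 could be sketched as below. This assumes that "invalid codepoints" means values above U+10FFFF, which the draft does not state explicitly; the function name is made up for illustration.</p>

```cpp
// Sketch of the draft's phase 1 codepoint check.
// Assumption: "invalid" means outside the Unicode codespace (> U+10FFFF).
// A codepoint passes only if it is a Unicode scalar value, that is,
// within the codespace and not a surrogate (U+D800 through U+DFFF).
constexpr bool is_unicode_scalar_value(char32_t cp) {
    const bool in_codespace = cp <= 0x10FFFF;
    const bool is_surrogate = cp >= 0xD800 && cp <= 0xDFFF;
    return in_codespace && !is_surrogate;
}
```

    <p>Under this reading, a source file whose decoded content includes a value such as 0xD800 or 0x110000 would make the program ill-formed.</p>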
    <font face="monospace">The sentence regarding the removal of
      trailing whitespace doesn&#39;t specify whether the removal occurs for
      physical or logical lines (before or after splicing).</font><br>
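    <p>The difference is observable. The following is a minimal sketch, not a proposal for how an implementation should do it: both function names are hypothetical, only space and tab are treated as whitespace, and the draft's rule for a backslash at the very end of the file is ignored. It applies the two steps in either order to a line whose backslash is followed by a trailing space.</p>

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Removes spaces and tabs that directly precede each '\n' or the end of input.
// (Assumption: only ' ' and '\t' count as whitespace in this sketch.)
std::string strip_trailing_whitespace(std::string_view text) {
    std::string out;
    auto trim = [&out] {
        while (!out.empty() && (out.back() == ' ' || out.back() == '\t')) {
            out.pop_back();
        }
    };
    for (char c : text) {
        if (c == '\n') {
            trim();  // drop whitespace accumulated at the end of this line
        }
        out.push_back(c);
    }
    trim();  // the last line may lack a terminating '\n'
    return out;
}

// Deletes each backslash that is immediately followed by '\n',
// together with that '\n', joining the two lines.
std::string splice_lines(std::string_view text) {
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == '\\' && i + 1 < text.size() && text[i + 1] == '\n') {
            ++i;  // skip the newline as well
            continue;
        }
        out.push_back(text[i]);
    }
    return out;
}
```

    <p>For the input <code>"a \\ \nb\n"</code> (as a C++ string literal), stripping first and then splicing joins the two lines into <code>"a b\n"</code>, while splicing first leaves the backslash in place and yields <code>"a \\\nb\n"</code>. So the order determines whether a backslash followed by trailing whitespace forms a splice.</p>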
    <p><font face="monospace">Wording concerns aside, I think it would
        be helpful to list the intended behavioral changes.<br></font></p></div></blockquote></div></div><div dir="auto"><br></div><div dir="auto">Here: removal of trailing whitespace is mandated, and UTF-8 support is mandated.</div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><p><font face="monospace">
      </font></p>
    <p><font face="monospace">Tom.<br>
      </font></p>
    <blockquote type="cite">
      <div dir="ltr">
        <div dir="auto">
          <div class="gmail_quote">
            <div style="text-align:justify"><font size="3" face="arial,
                sans-serif" color="#000000"> </font></div>
            <div style="text-align:justify">Corentin</div>
            <div><br>
            </div>
            <div> </div>
            <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
              <br>
              AlisdairM<br>
              -- <br>
              SG16 mailing list<br>
              <a href="mailto:SG16@lists.isocpp.org" rel="noreferrer noreferrer" target="_blank">SG16@lists.isocpp.org</a><br>
              <a href="https://lists.isocpp.org/mailman/listinfo.cgi/sg16" rel="noreferrer noreferrer noreferrer" target="_blank">https://lists.isocpp.org/mailman/listinfo.cgi/sg16</a><br>
            </blockquote>
          </div>
        </div>
      </div>
      <br>
      <fieldset></fieldset>
    </blockquote>
    <p><br>
    </p>
  </div>

</blockquote></div></div></div>

