<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Apr 27, 2021 at 5:57 AM Tom Honermann &lt;<a href="mailto:tom@honermann.net">tom@honermann.net</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
  
    
  
  <div>
    <div>On 4/26/21 1:04 PM, Corentin Jabot via
      SG16 wrote:<br>
    </div>
    <blockquote type="cite">
      
      <div dir="ltr">
        <div dir="ltr"><br>
        </div>
        <br>
        <div class="gmail_quote">
          <div dir="ltr" class="gmail_attr">On Mon, Apr 26, 2021 at 6:19
            PM Tom Honermann via SG16 &lt;<a href="mailto:sg16@lists.isocpp.org" target="_blank">sg16@lists.isocpp.org</a>&gt;
            wrote:<br>
          </div>
          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
            <div>
              <div>On 4/19/21 10:58 AM, Tom Honermann via SG16 wrote:<br>
              </div>
              <blockquote type="cite">
                <p>SG16 will hold a telecon on Wednesday, April 28th at
                  19:30 UTC (<a href="https://www.timeanddate.com/worldclock/converter.html?iso=20210428T193000&amp;p1=1440&amp;p2=tz_pdt&amp;p3=tz_mdt&amp;p4=tz_cdt&amp;p5=tz_edt&amp;p6=tz_cest" target="_blank">timezone
                    conversion</a>).</p>
                <p>The agenda is:</p>
                <ul>
                  <li><a href="https://wg21.link/p2093r5" target="_blank">P2093R5:
                      Formatted output</a></li>
                  <li><a href="https://isocpp.org/files/papers/P2348R0.pdf" target="_blank">P2348R0:
                      Whitespaces Wording Revamp</a><br>
                  </li>
                </ul>
                <p>LEWG discussed P2093R5 at their 2021-04-06 telecon
                  and decided to refer the paper back to SG16 for
                  further discussion.  LEWG meeting minutes are
                  available <a href="https://wiki.edg.com/bin/view/Wg21telecons2021/P2093#Library-Evolution-2021-04-06" target="_blank">here</a>;
                  please review them prior to the telecon.  LEWG
                  reviewed the list of prior SG16 deferred questions
                  posted to them <a href="http://lists.isocpp.org/lib-ext/2021/03/18189.php" target="_blank">here</a>.  Of
                  those, they established consensus on an answer for #2
                  (they agreed not to block <tt>std::print()</tt> on a
                  proposal for underlying terminal facilities), but
                  referred the rest back to us.  My interpretation of
                  their actions is that LEWG would like a revision of
                  the paper to address these concerns based on SG16
                  input (e.g., discuss design options and SG16 consensus
                  or lack thereof).  We&#39;ll therefore focus on these
                  questions at this telecon.</p>
                <p>Hubert provided the following very interesting
                  example usage.</p>
                <p><tt>std::print(&quot;{:%r}\n&quot;,
                    std::chrono::system_clock::now().time_since_epoch());</tt></p>
                <p>At issue is the encoding used by locale sensitive
                  chrono formatters.  Search <a href="http://eel.is/c++draft/time.format" target="_blank">[time.format]</a>
                  for &quot;locale&quot; to find example chrono format specifiers
                  that are locale dependent.  The example above contains
                  the <tt>%r</tt> specifier and is locale sensitive
                  because AM/PM designations may be localized.  In a
                  Chinese locale the desired translation of &quot;PM&quot; is
                  &quot;下午&quot;, but the locale will provide the translation in
                  the locale encoding.  As specified in P2093R5, if the
                  execution (literal) encoding is UTF-8, then <tt>std::print()</tt>
                  will expect the translation to be provided in UTF-8,
                  but if the locale is not UTF-8-based (e.g., Big5;
                  perhaps Shift-JIS for the Japanese 午後 translation),
                  then the result is mojibake. This is a good example of
                  how locale conflates translation and character
                  encoding.</p>
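                <p>To make the failure mode concrete, here is a minimal C++ sketch (not taken from P2093R5) that checks only the lead/continuation byte structure of UTF-8. The names <tt>is_utf8_shaped</tt>, <tt>pm_utf8</tt> and <tt>pm_big5</tt> are hypothetical, and the Big5 byte values are an assumption made for illustration.</p>

```cpp
// Sketch: does a byte string have the shape of well-formed UTF-8?
// Simplified on purpose: it checks lead/continuation byte structure only
// and does not reject overlong forms or surrogates.
bool is_utf8_shaped(const unsigned char* s, unsigned long n) {
    unsigned long i = 0;
    while (i != n) {
        const unsigned char lead = s[i];
        unsigned long extra = 0;                          // continuation bytes expected
        if      ((lead & 0x80) == 0x00) extra = 0;        // 0xxxxxxx: ASCII
        else if ((lead & 0xE0) == 0xC0) extra = 1;        // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) extra = 2;        // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) extra = 3;        // 11110xxx
        else return false;                                // stray continuation or invalid lead
        if (extra >= n - i) return false;                 // sequence runs past the end
        for (unsigned long k = 1; k != extra + 1; ++k)
            if ((s[i + k] & 0xC0) != 0x80) return false;  // must be 10xxxxxx
        i += extra + 1;
    }
    return true;
}

// The Chinese "PM" designator as UTF-8 (U+4E0B U+5348).
const unsigned char pm_utf8[] = {0xE4, 0xB8, 0x8B, 0xE5, 0x8D, 0x88};
// The same designator as Big5 (byte values assumed for illustration).
const unsigned char pm_big5[] = {0xA4, 0x55, 0xA4, 0xC8};
```

                <p>Under these assumptions, <tt>is_utf8_shaped(pm_utf8, sizeof pm_utf8)</tt> is true and <tt>is_utf8_shaped(pm_big5, sizeof pm_big5)</tt> is false, because 0xA4 can only be a continuation byte in UTF-8. A consumer that expects UTF-8 therefore has nothing sensible to do with the Big5 bytes except substitute or misrender them.</p>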
                <p>Addressing the above will be our first order of
                  business.  Please reserve some time to independently
                  think about this problem (ignore responses to this
                  message for a few days if you need to).  I am
                  explicitly not listing possible approaches to address
                  this concern in this message so as to avoid adding
                  (further) bias in any specific direction.  I suspect
                  the answers to the previously deferred SG16 questions
                  will be easier to answer once this concern is
                  resolved.</p>
              </blockquote>
              <p>Now that we&#39;ve all had some time to think about this
                issue, here are some possible directions we can pursue
                to resolve it.  These are presented in no particular
                order.<br>
              </p>
              <ul>
                <li>Specialize <a href="https://en.cppreference.com/w/cpp/locale/locale" target="_blank"><tt>std::locale</tt>
                    facets</a> and related I/O manipulators like <a href="https://en.cppreference.com/w/cpp/io/manip/put_time" target="_blank"><tt>std::put_time()</tt></a>
                  for <tt>char8_t</tt>.  This would allow <tt>std::print()</tt>
                  to, when the literal encoding is UTF-8, opt-in to use
                  of the UTF-8/<tt>char8_t</tt> facets and I/O
                  manipulators.<br>
                </li>
                <li>When the literal encoding is UTF-8, stipulate that
                  running the program in a non-UTF-8 based locale is
                  non-conforming.  This would effectively require MSVC
                  programmers to, when building code with the <tt>/utf-8</tt>
                  option, also <a href="https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page" target="_blank">force
                    selection of a UTF-8 code page via a manifest</a>
                  and require use of Windows 10 build 1903 or later.</li>
                <li>When the literal encoding is UTF-8, specify that
                  non-UTF-8 based locale dependent translations be
                  implicitly transcoded (such transcoding should never
                  result in errors except perhaps for memory allocation
                  failures).<br>
                </li>
                <li>Drop the special case handling for the literal
                  encoding being UTF-8 and specify that, when bypassing
                  a stream to write directly to the console, the
                  output be implicitly transcoded from the current
                  locale dependent encoding (whatever it is) to the
                  console encoding (UTF-8). </li>
              </ul>
            </div>
          </blockquote>
          <div><br>
          </div>
          <div>We have a few things to explain to LEWG for print. We do
            not need to change the design, just to explain
            things to them in terms they can understand (and they want
            to rely on our expertise, which</div>
          <div>implies consensus among ourselves).</div>
          <div><br>
          </div>
          <div>1. It is always nonsense to interpret a string as
            encoding X when it is in fact not in that encoding.</div>
          <div>2. From there, if a string literal is in UTF-8, we HAVE
            to assume the execution encoding is also UTF-8. Why rely on
            the literal encoding and not the execution encoding? It is resilient to
            calls to setlocale and more efficient. Also, format strings
            are likely to be literals.</div>
          <div>3. From there, if that string is displayed on a
            terminal/console/screen/tty, it is text, so it has to be
            rendered correctly. On one specific system (Windows) there is
            a way to enforce that, because Windows has a separate
            mechanism for Unicode display and console handling that
            exists independently of the C++ execution encoding.</div>
          <div>4. &quot;We have to assume&quot; in 2. implies a precondition. That
            is true REGARDLESS of UTF-8 or not. In all cases the format
            string has to be interpreted as text, which assumes it is
            valid in the execution encoding. Cf. the Microsoft STL issue
            for braces in Shift-JIS.</div>
          <div>5. This means that converting to UTF-16 on Windows for
            the purpose of console display is always valid (no
            &quot;transcoding&quot; error) within the contract of the function,
            and as such does not have to be specified. Precondition
            violations are UB within the standard library and we should
            keep doing that. In practice the implementation (which is
            here the terminal, not the STL) will do character
            replacement the best it can, or render something horrible.</div>
        </div>
      </div>
    </blockquote>
    <p>I agree with all of that, but I don&#39;t see how it relates to the
      problematic example above.  The issue with the example is that the
      &quot;%r&quot; field specifier may cause non-UTF-8 content supplied by the
      locale to be written.<br></p></div></blockquote><div>I see two problems here.</div><div>One is that this should not be locale dependent by default - has that been discussed? It seems to run afoul of the fmt design.</div><div><br></div><div>The other is that, if print(&quot;xxx{}&quot;, foo) assumes that xxx is UTF-8, and the formatted result is displayed on a terminal, then the entire thing _has to_ be UTF-8. Note that this is because of</div><div>a precondition on the act of displaying on the terminal, which has nothing to do with formatting. It&#39;s a 2-step process (format -&gt; print on terminal), and the two steps have different preconditions: formatting puts a requirement on the format string, in order to parse it; print additionally puts a precondition that the result will be UTF-8, so the individual arguments have to be UTF-8 too.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><p>
    </p>
    <blockquote type="cite">
      <div dir="ltr">
        <div class="gmail_quote">
          <div><br>
          </div>
          <div>The locale in there is a red herring. Changing the
            execution encoding is always dicey - all strings that were
            correctly interpreted before the locale change are
            potentially no longer</div>
          <div>correctly interpreted because their encoding no longer
            matches the new execution encoding.</div>
          <div>The existence of a setlocale function doesn&#39;t imply that
            calling it leads to sensible results if the locale
            change also changes the encoding :) <br>
          </div>
        </div>
      </div>
    </blockquote>
    The example doesn&#39;t assume a locale change, at least not beyond an
    initial <tt>std::setlocale(LC_ALL, &quot;&quot;)</tt> during program startup.<br>
    <blockquote type="cite">
      <div dir="ltr">
        <div class="gmail_quote">
          <div><br>
          </div>
          <div><br>
          </div>
          <div>&gt; Specialize <a href="https://en.cppreference.com/w/cpp/locale/locale" target="_blank"><tt>std::locale</tt>
              facets</a> and related I/O manipulators like <a href="https://en.cppreference.com/w/cpp/io/manip/put_time" target="_blank"><tt>std::put_time()</tt></a>
            for <tt>char8_t</tt>.  This would allow <tt>std::print()</tt>
            to, when the literal encoding is UTF-8, opt-in to use of the
            UTF-8/<tt>char8_t</tt> facets and I/O manipulators.</div>
          <div><br>
          </div>
          <div>This is a different issue, one Peter and I have
            discussed. We should not try to shove char into char8_t.
            Both char8_t and UTF-8 char are valid use cases. Also, the
            whole point of fmt::print is to avoid all of that :)</div>
        </div>
      </div>
    </blockquote>
    <p>I think this is strongly related, or we are misunderstanding each
      other.  I see the point of <tt>std::print()</tt> being to bypass
      the implicit (wrong) console transcoding.</p></div></blockquote><div>fmt::print just dumps the bytes in the general case, similarly to printf, and those bytes are then interpreted incorrectly by the Windows console. I don&#39;t see where there might be transcoding</div><div>in the program (I expect the console to do interesting things, but that&#39;s outside of C++).</div><div><br></div><div>C++ thinks a string is UTF-8</div><div>System (incorrectly) disagrees</div><div>System has a method that allows it to agree</div><div>Do we use that method?</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>
    <p>I strongly agree that <tt>char8_t</tt> and UTF-8 <tt>char</tt>
      are valid use cases.<br>
    </p>
    <blockquote type="cite">
      <div dir="ltr">
        <div class="gmail_quote">
          <div><br>
          </div>
          <div>&gt; When the literal encoding is UTF-8, stipulate that
            running the program in a non-UTF-8 based locale is
            non-conforming.  This would effectively require MSVC
            programmers to, when building code with the <tt>/utf-8</tt>
            option, also <a href="https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page" target="_blank">force selection of
              a UTF-8 code page via a manifest</a> and require use of
            Windows 10 build 1903 or later.</div>
          <div><br>
          </div>
          <div>If your program contains literals that are not correctly
            interpreted by the execution encoding, the behavior of your
            program cannot be correct &lt;insert scary U word&gt;. So
            they should probably do that but it seems out of scope.</div>
          <div>The literalS encoding and the execution encoding should
            be consistent (each string literal should be correctly
            interpreted).</div>
          <div><br>
          </div>
          <div>&gt; When the literal encoding is UTF-8, specify that
            non-UTF-8 based locale dependent translations be implicitly
            transcoded</div>
          <div>Sorry, can you detail what you mean? I do not
            understand.<br>
          </div>
        </div>
      </div>
    </blockquote>
    In the example above, the &quot;%r&quot; field specifier indicates that a
    locale dependent 12-hour clock time be formatted.  The AM/PM
    designator to be formatted is locale dependent.  If the locale is
    not UTF-8 based, then mojibake is produced (if the literal encoding
    is UTF-8).  This suggestion addresses the problem by implicitly
    transcoding the locale dependent AM/PM designator from the locale
    encoding to UTF-8 when formatting the output.<br></div></blockquote><div><br></div><div>Think about the cases in which that can happen:</div><div>there is a non-UTF-8 locale and a UTF-8 string literal mixed together.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>
    <blockquote type="cite">
      <div dir="ltr">
        <div class="gmail_quote">
          <div><br>
          </div>
          <div>&gt; Drop the special case handling for the literal
            encoding being UTF-8 and specify that, when bypassing a
            stream to write directly to the console, the output be
            implicitly transcoded from the current locale dependent
            encoding (whatever it is) to the console encoding (UTF-8). </div>
          <div><br>
          </div>
          <div>Dropping the special case seems more difficult in terms
            of wording.</div>
        </div>
      </div>
    </blockquote>
    I think it is simpler actually; we would just have to say that the
    implicit transcoding is from the locale encoding to the console
    encoding.<br></div></blockquote><div><br></div><div>It&#39;s really hard to know what the console encoding is (it is a very Microsoft-specific thing), and the Windows console basically has a wide (UTF-16) and a narrow encoding (not sure it works exactly like that, but it&#39;s a good enough model).</div><div>Transcoding in the general case might be worse.</div><div>Wording that encourages vendors to... encourage UTF-8 content to not be misinterpreted as something else might help, but good luck wording that!</div><div>Especially as it needs to handle file redirection, etc.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>
    <blockquote type="cite">
      <div dir="ltr">
        <div class="gmail_quote">
          <div>If everything else fails, Microsoft could do the sensible
            thing as a matter of QOL.</div>
        </div>
      </div>
    </blockquote>
    <p>Agreed.</p>
    <p>Tom.<br>
    </p>
    <blockquote type="cite">
      <div dir="ltr">
        <div class="gmail_quote">
          <div><br>
          </div>
          <div> </div>
          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
            <div>
              <p>Please feel free to comment on these, or additional,
                approaches before our meeting on Wednesday.</p>
              <p>I think it would benefit LEWG if a revision of the
                paper presented each of these possibilities, the
                consequences, and the rationale (and hopefully SG16
                consensus) for the proposed direction.<br>
              </p>
              <p>Tom.<br>
              </p>
              <blockquote type="cite">
                <p>I do not intend to time limit discussion of P2093R5
                  as I believe this is an important matter to resolve. 
                  If we are able to complete discussion of P2093R5, then
                  we&#39;ll discuss P2348R0.<br>
                </p>
                <p>Tom.<br>
                </p>
                <br>
              </blockquote>
              <p><br>
              </p>
            </div>
          </blockquote>
        </div>
      </div>
      <br>
    </blockquote>
    <p><br>
    </p>
  </div>

</blockquote></div></div>

