<div dir="auto"><div><br><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Jan 7, 2020, 05:24 Tom Honermann &lt;<a href="mailto:tom@honermann.net">tom@honermann.net</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
  
    
  
  <div>
    <div>On 1/6/20 8:27 AM, Thiago Macieira via
      SG16 wrote:<br>
    </div>
    <blockquote type="cite">
      <pre>On Monday, 6 January 2020 10:09:26 -03 Corentin Jabot wrote:
</pre>
      <blockquote type="cite">
        <blockquote type="cite">
          <pre>And yet that is nonsense. It can&#39;t convert a codec name to its MIB number
unless that is in a table somewhere the implementation has access to. So
by
definition, the text_encoding_id is limited to the codecs the Standard
Library
knows about. Other libraries should deploy their own text_encoding_id
equivalents.
</pre>
        </blockquote>
        <pre>That is a good point, for which I think the solution might be to force
hosted implementations to
always provide the entire table (which is really not that big)?
</pre>
      </blockquote>
      <pre>It&#39;s not, but even then you have the problem that the table in the vendor&#39;s 
implementation may be out of date compared to what the application expects. 
And are vendors allowed to extend the table with other names, such as WTF-8?

Like I said, if all you wanted was the table, you can get the table. I&#39;ll 
write an XSL-T script for you to generate the table....

</pre>
      <blockquote type="cite">
        <pre>I think at some point we lost track of what the proposal is about:
It&#39;s about answering:
- What is the execution character encoding (which only the implementation
can do)
- What is the environment encoding (which the implementation can do better)
</pre>
      </blockquote>
      <pre>Ok, good points. If we restrict text_encoding_id to those, then 
text_encoding_id has no need to support the full table or unknown codecs. By 
definition, it supports only what the implementation supports.</pre>
    </blockquote>
    In Belfast, we discussed the following example in the context of <a href="http://eel.is/c++draft/time.duration#io-4" target="_blank" rel="noreferrer">[time.duration.io]p4</a>,
    namely printing of the micro unit suffix:
    <blockquote>
      <p><tt>template&lt;class traits, class Rep, class Period&gt;</tt><tt><br>
        </tt><tt>void print_fancy_suffix(basic_ostream&lt;char,
          traits&gt;&amp; os, const duration&lt;Rep, Period&gt;&amp; d)</tt><tt><br>
        </tt><tt>{</tt><tt><br>
        </tt><tt>  if constexpr (text_encoding::literal().mib() == text_encoding::id::UTF8)
          {</tt><tt><br>
        </tt><tt>    os &lt;&lt; d.count() &lt;&lt; &quot;\u00B5s&quot;;</tt><tt><br>
        </tt><tt>  } else {</tt><tt><br>
        </tt><tt>    os &lt;&lt; d.count() &lt;&lt; &quot;us&quot;;</tt><tt><br>
        </tt><tt>  }</tt><tt><br>
        </tt><tt>}</tt><br>
      </p>
    </blockquote>
    <p>I see that as one of the primary motivating use cases at
      present.  However, I don&#39;t think this represents the extent of use
      cases well.</p>
    <p>I would like to see these encoding identifiers adopted for use in
      ICU, iconv, Qt, or other encoding providers.  I think these
      encoding identifiers could be useful in the context of <a href="https://wg21.link/p1629" target="_blank" rel="noreferrer">P1629</a>.<br>
    </p>
    <p>I don&#39;t want to see code doing string comparisons to match
      encodings.</p></div></blockquote></div></div><div dir="auto"><br></div><div dir="auto">Yet that is how it has to work: iconv only exposes a name-based interface.</div><div dir="auto">Qt provides both name-based and MIB-based interfaces, and could provide a text_encoding-based interface as well.</div><div dir="auto"><br></div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>
    <p>The set of encodings an implementation cares about is hard to
      determine since it crosses compiler, standard library, and third
      party boundaries.  For example, my understanding is that gcc
      relies on the host system&#39;s iconv() implementation to determine
      the valid execution character set targets and to transcode from
      source file encoding, at least for many encodings.  So, for gcc,
      the set of encodings needed to provide complete support (e.g., to
      avoid .mib() returning <tt>other</tt> or <tt>unknown</tt> for a
      supported encoding) would involve negotiation between the
      compiler, run-time library, and host system.</p></div></blockquote></div></div><div dir="auto">Which is not implementable.</div><div dir="auto">Either we force hosted implementations to provide the full database, or we accept that the list might be incomplete.</div><div dir="auto">In practice, on a given platform there is a direct relation between the encodings supported by the compiler and by the system on which it runs.</div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>
    <blockquote type="cite">
      <pre></pre>
      <blockquote type="cite">
        <pre>And have that information be consistent across platforms when possible (for
interaction with libraries such as Qt, ICU, iconv) - everything else is
secondary.
Which means an implementation will provide information about the encodings
relevant to the platform.

Now, an encoding id is 3 things:
- A name,
- A mib when applicable
- Aliases when applicable
</pre>
      </blockquote>
      <pre>Agreed, though implementations should be wary that the alias list might be 
empty. Portable applications should rely on the MIB and on the official name.</pre>
    </blockquote>
    <p>What are you referring to as the &quot;official name&quot;?  The <a href="https://www.iana.org/assignments/character-sets/character-sets.xhtml" target="_blank" rel="noreferrer">IANA
        character registry</a> lists two names and a set of aliases. 
      One of the names is labeled as &quot;Preferred MIME Name&quot;, the other is
      just &quot;Name&quot;.  Not all registered character sets have a &quot;Preferred
      MIME Name&quot;.  They all do have a &quot;Name&quot;.  There are cases where
      neither the &quot;Preferred MIME Name&quot; nor the &quot;Name&quot; are reflected in
      the list of aliases.  All of the registered sets also contain an
      identifier friendly alias starting with &quot;cs&quot;.  The &quot;Name&quot; name is
      not a particularly friendly name (since it includes a version
      date), nor is it particularly familiar in many cases (e.g.,
      &quot;Extended_UNIX_Code_Packed_Format_for_Japanese&quot; vs &quot;EUC-JP&quot;).</p></div></blockquote></div></div><div dir="auto"><br></div><div dir="auto">If no preferred MIME name is specified, the MIME name is the same as the name.</div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>
    <blockquote type="cite">
      <pre></pre>
      <blockquote type="cite">
        <pre>And the name used to construct the object is used to look up the extra
optional information.
I think the only reason to differentiate &quot;unknown&quot; and &quot;other&quot; in the way
you suggest is if
we need to support aliases for unregistered encodings.
Is that the case?
</pre>
      </blockquote>
      <pre>I think the implementation should strive to never return &quot;unknown&quot;, except in 
case of an internal failure to determine what the encoding is. As a matter of 
quality, implementations should be designed not to do that.

And yet providing a list of well-known MIBs is useful in and of itself. In 
that case, mib::unknown is a valid and well-known value.

</pre>
    </blockquote>
    <p>I think the proposal is leaning too heavily on the IANA
      registry.  For example, <tt>operator==</tt> is specified in terms
      of what the <tt>.mib()</tt> member function returns.  In previous
      emails, Thiago suggested that the <tt>text_encoding_id</tt> class
      could be more opaque; e.g., it could have its own internal system
      for identifying whether two names refer to the same, potentially
      unregistered, encoding (in which case, <tt>.mib()</tt> would
      return <tt>other</tt>, but this would not impact the behavior of
      <tt>operator==</tt>).  I strongly agree with this direction.<br></p></div></blockquote></div></div><div dir="auto"><br></div><div dir="auto">This direction is not implementable portably.</div><div dir="auto"><br></div><div dir="auto">The IANA registry was always an implementation detail, but it is an important implementation detail nonetheless. </div><div dir="auto">We cannot offer reliable and consistent comparison without it.</div><div dir="auto">The discussion assumes that there exist unregistered encodings which have many different names on a given platform, and I don&#39;t see evidence of that.</div><div dir="auto"><br></div><div dir="auto">I would like to see a concrete example of a situation in which the provided comparison algorithm is not sufficient.</div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><p>
    </p>
    <p>Tom.<br>
    </p>
  </div>

</blockquote></div></div></div>
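<div dir="auto"><br></div><div dir="auto">For illustration, here is a minimal sketch of what name-based matching can look like. It assumes the comparison follows the charset alias matching rules of Unicode Technical Standard #22; the thread above does not spell the algorithm out, so treat that as an assumption. The helpers <tt>alias_key</tt> and <tt>same_encoding_name</tt> are hypothetical names and are not part of the proposal, iconv, or Qt.</div>

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper, not part of any proposal or library: reduce a charset
// name to a comparison key, assuming UTS #22 style alias matching:
//   1. drop every character that is not an ASCII letter or digit,
//   2. lowercase ASCII letters,
//   3. drop each '0' that is not preceded by a digit (leading zeros).
std::string alias_key(std::string_view name) {
    std::string key;
    bool in_number = false;  // true once a digit has been kept in the current number
    for (char ch : name) {
        const bool is_digit = ch >= '0' && ch <= '9';
        const bool is_upper = ch >= 'A' && ch <= 'Z';
        const bool is_lower = ch >= 'a' && ch <= 'z';
        if (is_digit) {
            if (ch == '0' && !in_number) continue;  // leading zero: ignored
            in_number = true;
            key.push_back(ch);
        } else if (is_upper || is_lower) {
            in_number = false;
            key.push_back(is_upper ? static_cast<char>(ch - 'A' + 'a') : ch);
        }
        // Anything else ('-', '_', ':', spaces, ...) is skipped entirely.
    }
    return key;
}

// Hypothetical helper: two names match when their keys are equal.
bool same_encoding_name(std::string_view a, std::string_view b) {
    return alias_key(a) == alias_key(b);
}

// Small usage example: spelling variants of one name compare equal,
// while distinct encodings stay distinct.
void usage_example() {
    assert(same_encoding_name("UTF-8", "utf8"));
    assert(!same_encoding_name("UTF-8", "UTF-16"));
}
```

<div dir="auto">A registry lookup (name or alias to MIB) could then be keyed on the normalized string, which keeps the string comparison inside the implementation rather than in user code.</div>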

