[SG16] Questions for LEWG for P1885: Naming Text Encodings to Demystify Them

From: Tom Honermann <tom_at_[hidden]>
Date: Sun, 7 Feb 2021 02:47:44 -0500
Greetings, LEWGabees!

The following are questions/concerns that came up during the various
SG16 reviews of P1885 <https://wg21.link/p1885> that are not strongly
SG16 related and are therefore being delegated to LEWG.

Minutes for prior SG16 reviews of P1885, in chronological order, are
available at:

  * SG16 in Belfast
    <https://wiki.edg.com/bin/view/Wg21belfast/SG16P1885R0>; review of
    P1885R0.
    (For reasons I don't recall now, polls for P1854 were mingled with
    the minutes for P1885.)
  * January 22nd, 2020 telecon
    <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2020.md#january-22nd-2020>;
    review of P1885R1.
  * SG16 in Prague
    <https://wiki.edg.com/bin/view/Wg21prague/SG16D1885R2>; review of a
    draft of P1885R2.
  * November 11th, 2020 telecon
    <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2020.md#november-11th-2020>;
    review of P1885R3.

Additional archived email discussion can be found at:

  * 2019-12-27: Bike shedding for Christmas: P1885 Naming Text Encodings
    <https://lists.isocpp.org/sg16/2019/12/0993.php>
    With multiple threads continued the next month
    <https://lists.isocpp.org/sg16/2020/01/index.php>.
  * 2020-01-23: Comment on P1885R0: Naming Text Encodings to Demystify
    Them <https://lists.isocpp.org/sg16/2020/01/1078.php>
  * 2020-03-24: UK national body concerns about P1885R1 'Naming Text
    Encodings to Demystify Them'
    <https://lists.isocpp.org/sg16/2020/03/1180.php>
  * 2020-10-27: LEWG(I) Weekly review - P1885: Naming Text Encodings to
    Demystify Them <https://lists.isocpp.org/lib-ext/2020/10/16547.php>
    With multiple threads continued the next month
    <https://lists.isocpp.org/lib-ext/2020/11/index.php>.

Questions raised include:

 1. Naming:
     1. The text_encoding type represents an encoding name and/or
        identifier as opposed to a type that provides encoding
        services. Should the name more strongly reflect that intended
        use as a name/identifier?
     2. The id and mib() members of text_encoding correspond to
        IANA-specific values and terms. It is conceivable that mappings
        to a different/additional registry could be desired at some time
        in the future. Should these names more strongly reflect their
        IANA association?
     3. The enumerators of text_encoding::id were obtained by, for each
        IANA registered encoding, taking the "cs" prefixed alias name
        (of which there is always exactly one), and dropping the "cs"
        prefix. A special change was then made to rename the one that
        would have been "Unicode" to "UCS2". Many of the resulting names
        consist of only capital letters and may be mistaken for macros.
        Are these names ok? Or do they intrude too much on the
        namespace of user identifiers?
     4. The literal() and wide_literal() members of text_encoding return
        names for what the standard calls the /execution character set/
        and /execution wide-character set/. Are these names ok? (SG16
        has discussed updating terminology used within the standard, but
        has not yet forwarded a paper containing such a proposal.)
     5. The system() and wide_system() members of text_encoding return
        names for the locale sensitive run-time encoding that was active
        at the start of the process (e.g., before any calls to
        setlocale()). Are these names ok? On Windows, system() would
        return an encoding corresponding to GetACP()
        <https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getacp>.
 2. Interface:
     1. The text_encoding type, if implemented as shown with the
        exposition data members, would have a minimum size of 68 bytes.
        SG16 has discussed future use of this type as a tag type or
        non-type template parameter to select an encoding at compile
        time. Does the size of the type raise any concerns for such use?
     2. The max_name_length member of text_encoding is specified with a
        length of 63 (not including a string terminator). The IANA
        character set registry
        <https://www.iana.org/assignments/character-sets/character-sets.xhtml>
        introductory text states that "The character set names may be up
        to 40 characters taken from the printable characters of
        US-ASCII". Should this length be adjusted to match or should
        the current length be retained? Additional encoding names known
        to ICU that are not registered with the IANA registry can be
        browsed with ICU's Converter Explorer
        <https://icu4c-demos.unicode.org/icu-bin/convexp?s=UTR22&s=IBM&s=WINDOWS&s=JAVA&s=IANA&s=MIME&s=-&s=ALL&ShowUnavailable=>.
        The longest name there (which includes IANA names) appears to be
        27 characters.
     3. The proposed design exposes a library solution that is not
        accessible to the preprocessor. Is LEWG ok with the (wide)
        execution character set continuing to be unknown for
        preprocessor directives? (A patch
        <https://github.com/gcc-mirror/gcc/commit/eccec8684142e05f2f92f0f5bd5b47dda3ba1529>
        accepted for GCC 11, written to support implementing this
        feature, exposes the names of these encodings as string
        literals via new __GNUC_EXECUTION_CHARSET_NAME and
        __GNUC_WIDE_EXECUTION_CHARSET_NAME predefined macros.)
     4. The interface allows implementations to extend the set of
        recognized encodings beyond those registered with IANA in a way
        that permits those additional implementation-known encodings to
        have associated aliases (e.g., the implementation could use
        negative values for additional text_encoding::id enumerators;
        RFC 3808 states only positive values will be used
        <https://tools.ietf.org/html/rfc3808#section-3>). However, a
        similar extension is not possible for user code (user code can
        construct instances of text_encoding with unrecognized names,
        but cannot establish alias sets for them). This means a polyfill
        will not be possible. Is this ok?
     5. The interface does not indicate an unrecognized encoding name
        other than by querying the mib() member to see if the name was
        mapped to id::other (in which case, it could still correspond
        to an encoding known to the implementation that is not
        registered with IANA). This is intentional since the
        application has no other mechanism for validating names and
        support for unknown names is an explicit design goal. Is this ok?
     6. Equality is defined partially, but not solely, in terms of
        text_encoding::id such that the following expressions all
        evaluate as indicated (where Foo, Bar, cz123, and CZ-12.3 are
        all unrecognized encoding names):
           text_encoding("US-ASCII") == text_encoding("ISO646-US")
             // True because .mib() returns the same value for each.
           text_encoding("Foo") == text_encoding("Bar")
             // False despite .mib() returning the same value (id::other)
             // for each.
           text_encoding("cz123") == text_encoding("CZ-12.3")
             // True because the names match (case-insensitive, ignoring
             // '-' and '.') despite .mib() returning id::other for each.
        Is this ok?
     7. Is the ability to compare a text_encoding object directly with
        an ID desirable?
           text_encoding("US-ASCII") == text_encoding::id::ASCII
        as opposed to requiring:
           text_encoding("US-ASCII").mib() == text_encoding::id::ASCII
     8. Is the name comparison algorithm denoted by COMP_NAME()
        acceptable? This algorithm corresponds to Unicode UTS#22
        <https://www.unicode.org/reports/tr22/tr22-8.html#Charset_Alias_Matching>
        which notes that it results in ambiguities for some of the IANA
        registered names.
     9. Are the preconditions for the text_encoding constructors acceptable?
 3. Name sources:
     1. Is dependence solely on the IANA registry acceptable? Some
        concerns were noted in the various mailing list discussions.
        ICU's Converter Explorer
        <https://icu4c-demos.unicode.org/icu-bin/convexp?s=UTR22&s=IBM&s=WINDOWS&s=JAVA&s=IANA&s=MIME&s=-&s=ALL&ShowUnavailable=>
        provides convenient means to browse encodings known to ICU that
        are not registered with IANA (Make sure "IANA" is selected along
        with other desired sources, then look for rows that have no
        entry in the IANA column).
 4. References:
     1. This is more of a question for LWG. The IANA registry is not
        versioned, but does contain a last updated time stamp. No
        stability guarantees are provided, nor is there an obvious way
        to access older revisions of the registry. Is a reference ok?
        Or do we need to include the contents in the standard? The IANA
        registry had not been updated for many years until just a month
        ago when "UTF-7-IMAP" was added.
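
For readers who want to experiment with the equality questions under
Interface (items 6 and 8), here is a minimal sketch. It is not the
wording or implementation from P1885. The names normalize_name,
toy_encoding, and toy_other are hypothetical, and the matching rule is
an assumption based on a reading of the UTS #22 alias matching rule:
drop everything except ASCII letters and digits, fold case, and drop
each 0 that is not preceded by a digit.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper (not part of P1885): reduce an encoding name to the
// form compared by the UTS #22 charset alias matching rule.
// Assumes an ASCII-compatible execution character set.
std::string normalize_name(std::string_view name) {
    std::string out;
    bool prev_is_digit = false;  // was the last kept character a digit?
    for (char c : name) {
        if (c >= 'A' && c <= 'Z') {
            out.push_back(static_cast<char>(c - 'A' + 'a'));  // fold case
            prev_is_digit = false;
        } else if (c >= 'a' && c <= 'z') {
            out.push_back(c);
            prev_is_digit = false;
        } else if (c >= '0' && c <= '9') {
            if (c == '0' && !prev_is_digit) {
                continue;  // a 0 not preceded by a digit is dropped
            }
            out.push_back(c);
            prev_is_digit = true;
        }
        // Any other character ('-', '.', '_', ...) is ignored and does
        // not affect prev_is_digit.
    }
    return out;
}

// Toy stand-in for text_encoding: the MIB value is supplied by the caller
// instead of being looked up from the name.
constexpr int toy_other = 1;  // stands in for text_encoding::id::other

struct toy_encoding {
    int mib;
    std::string name;
};

// Equality as described in item 6: compare names only when both sides are
// "other"; otherwise compare the MIB values.
bool operator==(const toy_encoding& a, const toy_encoding& b) {
    if (a.mib == toy_other && b.mib == toy_other) {
        return normalize_name(a.name) == normalize_name(b.name);
    }
    return a.mib == b.mib;
}
```

With this model, toy_encoding{3, "US-ASCII"} and
toy_encoding{3, "ISO646-US"} compare equal through the shared MIB value
(3 is the IANA MIBenum for US-ASCII), {toy_other, "Foo"} and
{toy_other, "Bar"} compare unequal, and {toy_other, "cz123"} and
{toy_other, "CZ-12.3"} compare equal because both names normalize to
"cz123".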

Tom.


Received on 2021-02-07 01:47:49