Greetings, LEWGabees!

The following are questions/concerns that came up during the various SG16 reviews of P1885 that are not strongly SG16 related and are therefore being delegated to LEWG.

Minutes for prior SG16 reviews of P1885, in chronological order, are available at:

Additional archived email discussion can be found at:

Questions raised include:

  1. Naming:
    1. The text_encoding type represents an encoding name and/or identifier as opposed to a type that provides encoding services.  Should the name more strongly reflect that intended use as a name/identifier?
    2. The id and mib() members of text_encoding correspond to IANA-specific values and terms.  It is conceivable that mappings to a different/additional registry could be desired at some time in the future.  Should these names more strongly reflect their IANA association?
    3. The enumerators of text_encoding::id were obtained by, for each IANA registered encoding, taking the "cs" prefixed alias name (of which there is always exactly one), and dropping the "cs" prefix.  A special change was then made to rename the one that would have been "Unicode" to "UCS2".  Many of the resulting names consist of only capital letters and may be mistaken for macros.  Are these names ok?  Or do they intrude too much on the namespace of user identifiers?
    4. The literal() and wide_literal() members of text_encoding return names for what the standard calls the execution character set and execution wide-character set.  Are these names ok?  (SG16 has discussed updating terminology used within the standard, but has not yet forwarded a paper containing such a proposal.)
    5. The system() and wide_system() members of text_encoding return names for the locale-sensitive run-time encoding that was active at the start of the process (e.g., before any calls to setlocale()).  Are these names ok?  On Windows, system() would return an encoding corresponding to GetACP().
  2. Interface:
    1. The text_encoding type, if implemented as shown with the exposition data members, would have a minimum size of 68 bytes.  SG16 has discussed future use of this type as a tag type or non-type template parameter to select an encoding at compile time.  Does the size of the type raise any concerns for such use?
    2. The max_name_length member of text_encoding is specified with a length of 63 (not including a string terminator).  The IANA character set registry introductory text states that "The character set names may be up to 40 characters taken from the printable characters of US-ASCII".  Should this length be adjusted to match or should the current length be retained?  Additional encoding names known to ICU that are not registered with the IANA registry can be browsed with ICU's Converter Explorer.  The longest name there (which includes IANA names) appears to be 27 characters.
    3. The proposed design exposes a library solution that is not accessible to the preprocessor.  Is LEWG ok with the (wide) execution character set continuing to be unknown for preprocessor directives?  (A patch accepted for gcc 11 to provide this information for the purposes of implementing this feature will expose the names of these encodings as string literals via new __GNUC_EXECUTION_CHARSET_NAME and __GNUC_WIDE_EXECUTION_CHARSET_NAME predefined macros.)
    4. The interface allows implementations to extend the set of recognized encodings beyond those registered with IANA in a way that permits those additional implementation-known encodings to have associated aliases (e.g., the implementation could use negative values for additional text_encoding::id enumerators; RFC 3808 states only positive values will be used).  However, similar extension is not possible for user code (user code can construct instances of text_encoding with unrecognized names, but cannot establish alias sets for them).  This means a polyfill will not be possible.  Is this ok?
    5. The interface does not provide indication of an unrecognized encoding name other than by querying the mib() member to see if the name was mapped to other (in which case, it could still correspond to an encoding known to the implementation that is not registered with IANA).  This is intentional since the application has no other mechanism for validating names and support for unknown names is an explicit design goal.  Is this ok?
    6. Equality is defined partially, but not solely, in terms of text_encoding::id such that the following expressions all evaluate as indicated (where Foo, Bar, cz123, and CZ-12.3 are all unrecognized encoding names):
        text_encoding("US-ASCII") == text_encoding("ISO646-US") // True because .mib() returns the same value for each.
        text_encoding("Foo")      == text_encoding("Bar")       // False despite .mib() returning the same value (id::other) for each.
        text_encoding("cz123")    == text_encoding("CZ-12.3")   // True because the names match (case-insensitive ignoring '-' and '.') despite .mib() returning id::other for each.
      Is this ok?
    7. Is the ability to compare a text_encoding object directly with an ID desirable?
        text_encoding("US-ASCII") == text_encoding::id::ASCII
      as opposed to requiring:
        text_encoding("US-ASCII").mib() == text_encoding::id::ASCII
    8. Is the name comparison algorithm denoted by COMP_NAME() acceptable?  This algorithm corresponds to Unicode UTS#22 which notes that it results in ambiguities for some of the IANA registered names.
    9. Are the preconditions for the text_encoding constructors acceptable?
  3. Name sources:
    1. Is dependence solely on the IANA registry acceptable?  Some concerns were noted in the various mailing list discussions.  ICU's Converter Explorer provides convenient means to browse encodings known to ICU that are not registered with IANA (Make sure "IANA" is selected along with other desired sources, then look for rows that have no entry in the IANA column).
  4. References:
    1. This is more of a question for LWG.  The IANA registry is not versioned, but does contain a last updated time stamp.  No stability guarantees are provided, nor is there an obvious way to access older revisions of the registry.  Is a reference ok?  Or do we need to include the contents in the standard?  The IANA registry had not been updated for many years until just a month ago when "UTF-7-IMAP" was added.
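
To make the size, equality, and name comparison questions (2.1, 2.6, 2.7, and 2.8) easier to discuss, here is a minimal sketch of the semantics described above.  It is not the wording or implementation from P1885.  The names encoding_sketch, mib_id, normalize_name, and make_encoding are hypothetical, the alias table holds only the two ASCII names used in the examples, and the normalization is one literal reading of the UTS#22 alias matching rule (drop everything but ASCII letters and digits, lowercase, drop each 0 not preceded by a digit) that assumes an ASCII-compatible literal encoding.  The sketch truncates over-long names rather than modelling the constructor preconditions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical subset of the id enumerators (values are the IANA MIBenum values).
enum class mib_id : std::int_least32_t { other = 1, unknown = 2, ASCII = 3 };

// One reading of the UTS#22 rule: keep only ASCII letters and digits,
// lowercase them, and drop each '0' that is not preceded by a digit.
std::string normalize_name(std::string_view name) {
    std::string out;
    bool prev_digit = false;
    for (char c : name) {
        if (c >= 'A' && c <= 'Z') c = static_cast<char>(c - 'A' + 'a');
        const bool digit = (c >= '0' && c <= '9');
        const bool letter = (c >= 'a' && c <= 'z');
        if (!digit && !letter) continue;        // punctuation is ignored
        if (c == '0' && !prev_digit) continue;  // zero not preceded by a digit
        out.push_back(c);
        prev_digit = digit;
    }
    return out;
}

// Mirrors the exposition-only layout: a 63 character name plus terminator and an id.
struct encoding_sketch {
    static constexpr std::size_t max_name_length = 63;
    char name[max_name_length + 1] = {};
    mib_id id = mib_id::unknown;
};

// Hypothetical factory; a real implementation would consult the full registry.
encoding_sketch make_encoding(std::string_view name) {
    encoding_sketch e;
    const std::size_t n = std::min(name.size(), encoding_sketch::max_name_length);
    std::copy_n(name.data(), n, e.name);  // sketch only: truncates long names
    const std::string key = normalize_name(name);
    const bool is_ascii = key == normalize_name("US-ASCII") ||
                          key == normalize_name("ISO646-US");
    e.id = is_ascii ? mib_id::ASCII : mib_id::other;
    return e;
}

// Equality per question 2.6: compare ids, except that two "other"
// encodings are equal only if their normalized names match.
bool operator==(const encoding_sketch& a, const encoding_sketch& b) {
    if (a.id == mib_id::other && b.id == mib_id::other)
        return normalize_name(a.name) == normalize_name(b.name);
    return a.id == b.id;
}

// Direct comparison with an id, as asked about in question 2.7.
bool operator==(const encoding_sketch& a, mib_id id) { return a.id == id; }
```

With a 4-byte id, the struct occupies at least 68 bytes, which is the size concern raised in 2.1.  Whether punctuation should reset the "preceded by a digit" state is exactly the kind of ambiguity 2.8 asks about; this sketch does not reset it.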
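
For question 2.3, the following sketch shows roughly what the gcc 11 predefined macros offer.  The variable names and the "unknown" fallback are hypothetical and used only for illustration.  Because the macros expand to string literals, they can be captured in constants as shown, but they still cannot be compared within an #if directive.

```cpp
#include <cassert>
#include <string_view>

// Name of the (narrow) execution character set, if the compiler exposes it.
#if defined(__GNUC_EXECUTION_CHARSET_NAME)
constexpr std::string_view literal_encoding_name = __GNUC_EXECUTION_CHARSET_NAME;
#else
constexpr std::string_view literal_encoding_name = "unknown";  // hypothetical fallback
#endif

// Name of the wide execution character set, if the compiler exposes it.
#if defined(__GNUC_WIDE_EXECUTION_CHARSET_NAME)
constexpr std::string_view wide_literal_encoding_name = __GNUC_WIDE_EXECUTION_CHARSET_NAME;
#else
constexpr std::string_view wide_literal_encoding_name = "unknown";  // hypothetical fallback
#endif
```

Other compilers fall through to the fallback, so portable code cannot rely on these names being available.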