Date: Sun, 7 Feb 2021 02:47:44 -0500
Greetings, LEWGabees!
The following are questions/concerns that came up during the various
SG16 reviews of P1885 <https://wg21.link/p1885> that are not strongly
SG16 related and are therefore being delegated to LEWG.
Minutes for prior SG16 reviews of P1885, in chronological order, are
available at:
* SG16 in Belfast
<https://wiki.edg.com/bin/view/Wg21belfast/SG16P1885R0>; review of
P1885R0.
(For reasons I don't recall now, polls for P1854 were mingled with
the minutes for P1885)
* January 22nd, 2020 telecon
<https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2020.md#january-22nd-2020>;
review of P1885R1.
* SG16 in Prague
<https://wiki.edg.com/bin/view/Wg21prague/SG16D1885R2>; review of a
draft of P1885R2.
* November 11th, 2020 telecon
<https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2020.md#november-11th-2020>;
review of P1885R3.
Additional archived email discussion can be found at:
* 2019-12-27: Bike shedding for Christmas: P1885 Naming Text Encodings
<https://lists.isocpp.org/sg16/2019/12/0993.php>
With multiple threads continued the next month
<https://lists.isocpp.org/sg16/2020/01/index.php>.
* 2020-01-23: Comment on P1885R0: Naming Text Encodings to Demystify
Them <https://lists.isocpp.org/sg16/2020/01/1078.php>
* 2020-03-24: UK national body concerns about P1885R1 'Naming Text
Encodings to Demystify Them'
<https://lists.isocpp.org/sg16/2020/03/1180.php>
* 2020-10-27: LEWG(I) Weekly review - P1885: Naming Text Encodings to
Demystify Them <https://lists.isocpp.org/lib-ext/2020/10/16547.php>
With multiple threads continued the next month
<https://lists.isocpp.org/lib-ext/2020/11/index.php>.
Questions raised include:
1. Naming:
1. The text_encoding type represents an encoding name and/or
identifier as opposed to a type that provides encoding
services. Should the name more strongly reflect that intended
use as a name/identifier?
2. The id and mib() members of text_encoding correspond to
IANA-specific values and terms. It is conceivable that mappings
to a different/additional registry could be desired at some time
in the future. Should these names more strongly reflect their
IANA association?
3. The enumerators of text_encoding::id were obtained by, for each
IANA registered encoding, taking the "cs" prefixed alias name
(of which there is always exactly one), and dropping the "cs"
prefix. A special change was then made to rename the one that
would have been "Unicode" to "UCS2". Many of the resulting names
consist of only capital letters and may be mistaken for macros.
Are these names ok? Or do they intrude too much on the
namespace of user identifiers?
4. The literal() and wide_literal() members of text_encoding return
names for what the standard calls the /execution character set/
and /execution wide-character set/. Are these names ok? (SG16
has discussed updating terminology used within the standard, but
has not yet forwarded a paper containing such a proposal).
5. The system() and wide_system() members of text_encoding return
names for the locale sensitive run-time encoding that was active
at the start of the process (e.g., before any calls to
setlocale()). Are these names ok? On Windows, system() would
return an encoding corresponding to GetACP()
<https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getacp>.
2. Interface:
1. The text_encoding type, if implemented as shown with the
exposition data members, would have a minimum size of 68 bytes.
SG16 has discussed future use of this type as a tag type or
non-type template parameter to select an encoding at compile
time. Does the size of the type raise any concerns for such use?
2. The max_name_length member of text_encoding is specified with a
length of 63 (not including a string terminator). The IANA
character set registry
<https://www.iana.org/assignments/character-sets/character-sets.xhtml>
introductory text states that "The character set names may be up
to 40 characters taken from the printable characters of
US-ASCII". Should this length be adjusted to match or should
the current length be retained? Additional encoding names known
to ICU that are not registered with the IANA registry can be
browsed with ICU's Converter Explorer
<https://icu4c-demos.unicode.org/icu-bin/convexp?s=UTR22&s=IBM&s=WINDOWS&s=JAVA&s=IANA&s=MIME&s=-&s=ALL&ShowUnavailable=>.
The longest name there (which includes IANA names) appears to be
27 characters.
3. The proposed design exposes a library solution that is not
accessible to the preprocessor. Is LEWG ok with the (wide)
execution character set continuing to be unknown for
preprocessor directives? (A patch
<https://github.com/gcc-mirror/gcc/commit/eccec8684142e05f2f92f0f5bd5b47dda3ba1529>
accepted for gcc 11, written to support implementing this
feature, will expose the names of these encodings as string
literals via the new __GNUC_EXECUTION_CHARSET_NAME and
__GNUC_WIDE_EXECUTION_CHARSET_NAME predefined macros.)
4. The interface allows implementations to extend the set of
recognized encodings beyond those registered with IANA in a way
that permits those additional implementation known encodings to
have associated aliases (e.g., the implementation could use
negative values for additional text_encoding::id enumerators;
RFC 3808 states only positive values will be used
<https://tools.ietf.org/html/rfc3808#section-3>). However,
a similar extension is not possible for user code (user code can
construct instances of text_encoding with unrecognized names,
but cannot establish alias sets for them). This means a polyfill
will not be possible. Is this ok?
5. The interface does not provide indication of an unrecognized
encoding name other than by querying the mib() member to see if
the name was mapped to id::other (in which case, it could still
correspond to an encoding known to the implementation that is
not registered with IANA). This is intentional since the
application has no other mechanism for validating names and
support for unknown names is an explicit design goal. Is this ok?
6. Equality is defined partially, but not solely, in terms of
text_encoding::id such that the following expressions all
evaluate as indicated (where Foo, Bar, cz123, and CZ-12.3 are
all unrecognized encoding names):
text_encoding("US-ASCII") == text_encoding("ISO646-US")
// True because .mib() returns the same value for each.
text_encoding("Foo") == text_encoding("Bar")
// False despite .mib() returning the same value (id::other) for each.
text_encoding("cz123") == text_encoding("CZ-12.3")
// True because the names match (case-insensitive, ignoring '-' and
// '.') despite .mib() returning id::other for each.
Is this ok?
7. Is the ability to compare a text_encoding object directly with
an ID desirable?
text_encoding("US-ASCII") == text_encoding::id::ASCII
as opposed to requiring:
text_encoding("US-ASCII").mib() == text_encoding::id::ASCII
8. Is the name comparison algorithm denoted by COMP_NAME()
acceptable? This algorithm corresponds to Unicode UTS#22
<https://www.unicode.org/reports/tr22/tr22-8.html#Charset_Alias_Matching>
which notes that it results in ambiguities for some of the IANA
registered names.
9. Are the preconditions for the text_encoding constructors acceptable?
3. Name sources:
1. Is dependence solely on the IANA registry acceptable? Some
concerns were noted in the various mailing list discussions.
ICU's Converter Explorer
<https://icu4c-demos.unicode.org/icu-bin/convexp?s=UTR22&s=IBM&s=WINDOWS&s=JAVA&s=IANA&s=MIME&s=-&s=ALL&ShowUnavailable=>
provides convenient means to browse encodings known to ICU that
are not registered with IANA (Make sure "IANA" is selected along
with other desired sources, then look for rows that have no
entry in the IANA column).
4. References:
1. This is more of a question for LWG. The IANA registry is not
versioned, but does contain a last updated time stamp. No
stability guarantees are provided, nor is there an obvious way
to access older revisions of the registry. Is a reference ok?
Or do we need to include the contents in the standard? The IANA
registry had not been updated for many years until just a month
ago when "UTF-7-IMAP" was added.
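For readers who want something concrete to poke at for questions 2.6 and 2.8, here is a minimal sketch of the equality rule and of UTS#22-style loose name matching. It is not the wording or implementation from P1885. The names normalize_name, comp_name, encoding_ref, and mib_other are hypothetical stand-ins invented for this illustration, and the MIB values used (1 for "other", 3 for US-ASCII) are taken from the IANA registry as assumptions. The matching rules assumed are: drop everything except ASCII letters and digits, fold letters to lowercase, then drop each 0 that is not preceded by a digit.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical stand-in for COMP_NAME(): builds a normalized key using the
// UTS#22 charset alias matching rules as summarized above. Assumes an
// ASCII-compatible ordering of 'A'..'Z' and 'a'..'z' in the source encoding.
inline std::string normalize_name(std::string_view name) {
    std::string out;
    bool prev_is_digit = false;  // whether the last *kept* character is a digit
    for (char c : name) {
        if (c >= 'A' && c <= 'Z') c = static_cast<char>(c - 'A' + 'a');
        const bool is_alpha = (c >= 'a' && c <= 'z');
        const bool is_digit = (c >= '0' && c <= '9');
        if (!is_alpha && !is_digit) continue;      // drop '-', '.', '_', etc.
        if (c == '0' && !prev_is_digit) continue;  // drop a 0 not preceded by a digit
        out.push_back(c);
        prev_is_digit = is_digit;
    }
    return out;
}

// Two names match when their normalized keys are identical.
inline bool comp_name(std::string_view a, std::string_view b) {
    return normalize_name(a) == normalize_name(b);
}

// Assumed MIB value for "other" (modeling text_encoding::id::other).
constexpr int mib_other = 1;

// Hypothetical model of a text_encoding value: a MIB number plus the name
// the object was constructed with. The caller supplies the MIB directly;
// no registry lookup is modeled here.
struct encoding_ref {
    int mib;
    std::string name;
};

// Equality as described in question 2.6: fall back to name comparison only
// when both sides are "other"; otherwise compare MIB values.
inline bool operator==(const encoding_ref& a, const encoding_ref& b) {
    if (a.mib == mib_other && b.mib == mib_other)
        return comp_name(a.name, b.name);
    return a.mib == b.mib;
}
```

Under these assumptions, encoding_ref{1, "cz123"} compares equal to encoding_ref{1, "CZ-12.3"}, while encoding_ref{1, "Foo"} and encoding_ref{1, "Bar"} compare unequal even though both carry the same MIB value. Loose matching also treats "UTF-8" and "u.t.f-008" as the same name but keeps "utf-80" distinct.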
Tom.
Received on 2021-02-07 01:47:49