ISOCPP sg16 List: Agenda for the 2022-10-12 SG16 telecon

From: Tom Honermann <tom_at_[hidden]>
Date: Thu, 6 Oct 2022 17:59:15 -0400

SG16 will hold a telecon on Wednesday, October 12th, at 19:30 UTC
(timezone conversion
<https://www.timeanddate.com/worldclock/converter.html?iso=20221012T193000&p1=1440&p2=tz_pdt&p3=tz_mdt&p4=tz_cdt&p5=tz_edt&p6=tz_cest>).

The agenda is:

  * A presentation by Michael Kuperstein regarding i18n and l10n and
    existing practice in the industry.
  * NB comment processing.

INCITS has made US NB comments available to its members. I reviewed the
list and identified the following as ones that I believe SG16 should
establish a position on. There are other comments that are related to
papers SG16 has previously discussed, but in those cases, I believe the
concerns raised do not require SG16 input.

Due to duplicated comments in the list of US comments, it is possible
that the comment identifiers below will change.

  US-2: [defns.multibyte] <http://eel.is/c++draft/defns.multibyte>

The notion of an "execution character set" is no longer given prominence
in the Draft standard, aside from some notes about its relationship to
the concept as defined by C, and clarifying that certain character
encodings are unrelated to this character set. This makes it a
questionable choice for use in the definition of "multibyte character".

*Proposed change:*

Change the definition of "multibyte character" to use a character
encoding with a more definite specification given by the Standard.

  US-38: [format.string.escaped]
  <https://eel.is/c++draft/format.string.escaped>

The subject subclause describes how characters or strings are "escaped"
to be formatted more suitably "for debugging or for logging".

The actual suitability for debugging or for logging depends on the needs
of the application, and there is a conflict between formatting for human
readability of the textual content and formatting for clarity and
fidelity of encoding nuances. Indeed, for the latter, there can still be
(for stateful encodings) a conflict between formatting for human visual
inspection and formatting for machine consumption of the output
sequence as a C++ string/character literal.

The current design introduces extensions to the API and to the format
string syntax that assume that there is one specific default that should
be chosen "for debugging or for logging". Neither the reasoning behind
the chosen default nor the extensibility of the current design appears
to have been sufficiently explored.

Note 1:
An example, for Unicode encodings, of a choice between prioritizing
human readability of the textual content and prioritizing visual clarity
of encoding nuances is in the treatment of characters having Unicode
property Grapheme_Extend=Yes. The current design favors visual clarity
of encoding nuances by outputting such characters as escape sequences.

Note 2:
For stateful encodings, the lack of return to the initial shift state at
the end of the sequence cannot be represented using a C++
string/character literal unless a prior shift sequence from the
initial shift state is rendered via escape sequence(s). It is not clear
that scanning forward is always an option (nor is it clear
that doing so is desirable).

*Proposed change:*

Narrow the purported scope and affirm the design choices of the default
behavior:
Modify "logging" to "technical logging" and spell out the priorities in
order in the description (this has the benefit of clearly communicating
intention and providing guidance for implementation choices).

1. The output is intended to be a C++ string/character literal that
    reproduces the encoded sequence. (This seems to be taken for granted
    and not made explicit in the current draft.)
2. Prefer visually distinguishing between different methods of encoding
    "equivalent" textual content.

Make any adjustments necessary to the API or the format string syntax
associated with "escaped" strings to allow for future additions for
alternative escaping.
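
To make the trade-off in Note 1 of US-38 concrete, the following is a
minimal sketch of an escaper with two selectable policies. It is not the
std::format implementation and does not follow the full rules in
[format.string.escaped]. The names escape_sketch, escape_policy,
append_utf8, and is_grapheme_extend_sketch are hypothetical, and the
Grapheme_Extend check is deliberately reduced to the Combining
Diacritical Marks block (U+0300..U+036F) as a stand-in for a real
property table.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical stand-in for a Unicode property lookup. Only the Combining
// Diacritical Marks block is recognized; a real implementation would
// consult the full Grapheme_Extend table.
bool is_grapheme_extend_sketch(char32_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// Append the UTF-8 encoding of a Unicode scalar value to `out`.
void append_utf8(std::string& out, char32_t c) {
    assert(c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF));
    if (c < 0x80) {
        out += static_cast<char>(c);
    } else if (c < 0x800) {
        out += static_cast<char>(0xC0 | (c >> 6));
        out += static_cast<char>(0x80 | (c & 0x3F));
    } else if (c < 0x10000) {
        out += static_cast<char>(0xE0 | (c >> 12));
        out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (c & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (c >> 18));
        out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (c & 0x3F));
    }
}

// Two illustrative priorities: show every encoding nuance, or keep the
// text readable by leaving combining characters unescaped.
enum class escape_policy { encoding_fidelity, readability };

// Produce a quoted, escaped UTF-8 rendering of `in` (simplified rules).
std::string escape_sketch(const std::u32string& in, escape_policy policy) {
    std::string out = "\"";
    for (char32_t c : in) {
        const bool escape_extend =
            policy == escape_policy::encoding_fidelity &&
            is_grapheme_extend_sketch(c);
        if (c == U'"' || c == U'\\') {
            out += '\\';
            out += static_cast<char>(c);
        } else if (c == U'\n') {
            out += "\\n";
        } else if (c == U'\t') {
            out += "\\t";
        } else if (c == U'\r') {
            out += "\\r";
        } else if (c < 0x20 || c == 0x7F || escape_extend) {
            char buf[16];
            std::snprintf(buf, sizeof buf, "\\u{%x}",
                          static_cast<unsigned>(c));
            out += buf;
        } else {
            append_utf8(out, c);
        }
    }
    out += '"';
    return out;
}
```

With the input U"e\u0301" (the letter e followed by a combining acute
accent), the encoding_fidelity policy renders the accent as the escape
sequence \u{301}, while the readability policy emits its UTF-8 bytes so
that the accented letter is displayed as a single grapheme. A policy
parameter of this kind is one possible shape for the "future additions
for alternative escaping" that the comment asks for.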

  US-64: [uaxid.pattern] <https://eel.is/c++draft/uaxid.pattern>

The Unicode org has clarified that the pattern whitespace and pattern
syntax rules apply to the lexing and parsing of computer languages.

*Proposed change:*

Replace with "UAX#31 describes how formal languages such as computer
languages should describe and implement their use of whitespace and
syntactically significant characters during the processes of lexing and
parsing. C++ does not claim conformance with this requirement."
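
For reference when discussing US-64, the following sketch shows the
Pattern_White_Space set that the UAX#31 pattern rules are built on, as a
predicate plus a helper that skips leading pattern whitespace. The
function names are hypothetical, and nothing here is a statement about
what C++ lexing does or should do; it only illustrates the property
itself.

```cpp
#include <cstddef>
#include <string>

// Pattern_White_Space as listed in the Unicode Character Database:
// U+0009..U+000D, U+0020, U+0085, U+200E, U+200F, U+2028, U+2029.
constexpr bool is_pattern_white_space(char32_t c) {
    return (c >= 0x0009 && c <= 0x000D) || c == 0x0020 || c == 0x0085 ||
           c == 0x200E || c == 0x200F || c == 0x2028 || c == 0x2029;
}

// Return the index of the first code point that is not pattern
// whitespace, or text.size() if there is none.
std::size_t skip_pattern_white_space(const std::u32string& text) {
    std::size_t i = 0;
    while (i < text.size() && is_pattern_white_space(text[i])) {
        ++i;
    }
    return i;
}
```

Note that the set includes characters such as U+200E LEFT-TO-RIGHT MARK
and U+2028 LINE SEPARATOR, and excludes U+00A0 NO-BREAK SPACE.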

Tom.

Received on 2022-10-06 21:59:19