sg16: Re: [SG16] [isocpp-core] New draft revision: D2029R2 (Proposed resolution for core issues 411, 1656, and 2333; numeric and universal character escapes in character and string literals)

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Wed, 1 Jul 2020 09:28:37 +0200

On 28/06/2020 06.50, Tom Honermann via Core wrote:
> A new draft revision of P2029 (Proposed resolution for core issues 411, 1656, and 2333; numeric and universal character escapes in character and string literals) is now available at https://rawgit.com/sg16-unicode/sg16/master/papers/d2029r2.html. This addresses the CWG feedback provided during the March 23rd, 2020 core issues processing teleconference <http://wiki.edg.com/bin/view/Wg21summer2020/IssuesProcessingTeleconference2020-03-23>.
>
> Wording review feedback prior to the next Core issues processing teleconference would be much appreciated!

[lex.ccon] pX (before p6):

In general, I'd like to see italicized definitions here
and a more prescriptive, less descriptive tone. I also
don't like the term "conditional character literal"; it's
too obscure. Suggestion: "non-encodable character literal".

Suggestion:

(quote)
A /multi-character literal/ is a /character-literal/ whose /c-char-sequence/ consists of
more than one /c-char/. A /non-encodable character literal/ is a character literal
whose /c-char-sequence/ consists of a single /c-char/ that is not a
/numeric-escape-sequence/ and that specifies a character that either lacks representation
in the applicable associated character encoding or that cannot be encoded in a single code unit.
The /encoding-prefix/ of a multi-character literal or a non-encodable character literal
shall be absent or 'L'. Such /character-literal/s are conditionally-supported.

The kind of a character-literal, its type, and its associated character encoding are determined
by its /encoding-prefix/ and by its /c-char-sequence/ as specified in table Y, where later
cases exclude earlier ones.
(end quote)

This makes "multi-character" and "non-encodable" attributes of both non-prefix and L
character literals, and we don't have to define the combinatorial explosion of terms
here. In table Y, remove italics for "none" and "L" second and third rows and use
"multi-character literal" and "non-encodable character literal" in both situations.
Reorder in the order "multi-character", "non-encodable", ordinary.
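
For readers less familiar with the term, here is a minimal sketch of what an
unprefixed multi-character literal looks like today. It only checks the type,
since the value is implementation-defined; the variable name `multi` is
illustrative, and compilers typically warn about such literals.

```cpp
#include <type_traits>

// A character literal with a single c-char has type char.
static_assert(std::is_same<decltype('a'), char>::value,
              "single-character ordinary literal has type char");

// An unprefixed multi-character literal (more than one c-char) has type int.
static_assert(std::is_same<decltype('ab'), int>::value,
              "unprefixed multi-character literal has type int");

// Its value is implementation-defined, so nothing is asserted about it here.
constexpr int multi = 'ab';
```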

[lex.ccon] pZ:

Simplify and reorder; we've said most of it in X already:

Z.1 A multi-character literal or a non-encodable character literal has an implementation-defined value.

Z.2 A character literal consisting of a single /numeric-escape-sequence/ specifying an integer value v
has the following value:
   - If v does not exceed the range of the type of the /character-literal/, its value is v.
   - Otherwise, if the /encoding-prefix/ of the /character-literal/ is absent or L, the value is implementation-defined.
   - Otherwise, the program is ill-formed.

Z.3 A character literal consisting of a single /conditional-escape-sequence/ is conditionally-supported
and has an implementation-defined value.

Z.4 Any other character literal has the value of the corresponding code unit in the applicable associated character encoding.
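
As a rough illustration of the first bullet of Z.2 above (the value fits the
literal's type, so the value is v), the following sketch uses only in-range
numeric escapes. The out-of-range cases are mentioned in comments only, because
they are either implementation-defined or ill-formed under the suggested wording.

```cpp
// In-range numeric escapes: the literal's value is the escape's value v.
static_assert('\x41' == 0x41, "ordinary literal, v fits in char");
static_assert(L'\101' == 0x41, "wide literal, octal escape, v fits in wchar_t");
static_assert(u'\xffff' == 0xFFFF, "char16_t literal, largest in-range v");
static_assert(U'\x10ffff' == 0x10FFFF, "char32_t literal, v fits in char32_t");

// Not compiled here:
//   u'\x10000'  v exceeds char16_t; ill-formed under the third bullet.
//   '\x100'     v exceeds an 8-bit char; implementation-defined under the
//               second bullet (absent encoding-prefix).
```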

The update does not address the concern that phase 5 encodes but phase 6
concatenates string literals, which might change the encoding.

Example: "a" and u8"b" are concatenated as u8"ab".
Suppose my ordinary literal encoding is EBCDIC.
Encoding first means "a" is encoded to EBCDIC and u8"b" is encoded to UTF-8.
Yet the result is normatively equivalent to u8"ab", in which both characters
are UTF-8. That doesn't add up.
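
A small sketch of the intended result of that concatenation, assuming only
what the standard already guarantees about UTF-8 literals. The name `joined`
is illustrative.

```cpp
// The unprefixed "a" is treated as if it had the u8 prefix of its neighbour,
// so the whole literal is UTF-8: 'a' is 0x61 and 'b' is 0x62.
constexpr auto& joined = "a" u8"b";

static_assert(sizeof(joined) / sizeof(joined[0]) == 3, "'a', 'b', terminating NUL");
static_assert(joined[0] == 0x61, "'a' as a UTF-8 code unit");
static_assert(joined[1] == 0x62, "'b' as a UTF-8 code unit");

// If "a" were encoded on its own first with an EBCDIC ordinary literal
// encoding, its code unit would be 0x81 rather than 0x61.
```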

If we concatenate first and then encode, we have the issue that
numeric-escape-sequences might alter their meaning by the concatenation.
Example: "\33" "3" under the status quo must be encoded as \33 followed by "3"
(two code units).
When we concatenate first, this becomes "\333" (presumably a single code unit).
This is particularly serious because string literal concatenation is the
only way I am aware of to reliably terminate a hexadecimal-escape-sequence.
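
The status-quo behavior described above can be checked directly; this is a
minimal sketch that compares the array sizes of the concatenated and the
single-literal spellings.

```cpp
// Escapes are recognized per literal before concatenation, so "\33" "3"
// yields two code units plus the terminating NUL.
static_assert(sizeof("\33" "3") == 3, "\\33, '3', NUL");
static_assert(("\33" "3")[0] == 033, "first element is the octal escape");
static_assert(("\33" "3")[1] == '3', "second element is the character '3'");

// Written as one literal, the digits merge into a single octal escape.
static_assert(sizeof("\333") == 2, "\\333, NUL");

// The same trick terminates a hexadecimal escape, which otherwise would
// swallow every following hex digit.
static_assert(sizeof("\x1" "b") == 3, "\\x1, 'b', NUL");
static_assert(sizeof("\x1b") == 2, "\\x1b, NUL");
```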

It seems we need a nested mini-lexer here so that we first recognize escape-sequences,
then we concatenate (to get the right type/encoding), then we encode. Ugh.
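
To make the ordering concern concrete, here is a toy sketch of the first step
only (escape recognition per literal). `element_count` and `is_octal` are
hypothetical helpers, not anything from the paper; they handle just plain
characters and octal escapes, and take the spelling between the quotes.

```cpp
#include <cstddef>

constexpr bool is_octal(char c) { return c >= '0' && c <= '7'; }

// Counts the elements (escape or plain character) recognized in one
// literal's spelling. Only octal escapes of up to three digits are handled.
constexpr std::size_t element_count(const char* s) {
    std::size_t count = 0;
    while (*s != '\0') {
        if (s[0] == '\\' && is_octal(s[1])) {
            ++s;  // skip the backslash
            int digits = 0;
            while (digits < 3 && is_octal(*s)) { ++s; ++digits; }
        } else {
            ++s;  // a plain character is one element
        }
        ++count;
    }
    return count;
}

// Recognize per literal, then concatenate: the boundary survives.
static_assert(element_count(R"(\33)") + element_count("3") == 2,
              "two elements: the escape and '3'");

// Concatenate the spellings first, then recognize: the digits merge.
static_assert(element_count(R"(\333)") == 1, "one element: a single escape");
```

Encoding would then run over the concatenated element sequence, with numeric
escapes passed through as code unit values.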

We should be clear in the text whether an implementation is allowed to encode
a sequence of non-numeric-escape-sequence s-chars as a whole, or whether
each character is encoded separately. There was concern that "separately"
doesn't address stateful encodings, where the encoding of string character
i+1 may depend on what string character i was.

Maybe replace "associated character encoding" -> "associated literal encoding"
globally to avoid the mention of "character" here.

"These sequences should have no effect on encoding state for stateful character encodings."
-> "These sequences are assumed not to affect encoding state for stateful character encodings."

In general, we can't use "should" (normative encouragement) in notes.

Jens

Received on 2020-07-01 02:31:55