sg16: Re: [SG16] [isocpp-core] New draft revision: D2029R2 (Proposed resolution for core issues 411, 1656, and 2333; numeric and universal character escapes in character and string literals)

From: Tom Honermann <tom_at_[hidden]>
Date: Thu, 2 Jul 2020 01:39:59 -0400

On 7/1/20 3:28 AM, Jens Maurer wrote:
> On 28/06/2020 06.50, Tom Honermann via Core wrote:
>> A new draft revision of P2029 (Proposed resolution for core issues 411, 1656, and 2333; numeric and universal character escapes in character and string literals) is now available at https://rawgit.com/sg16-unicode/sg16/master/papers/d2029r2.html. This addresses the CWG feedback provided during the March 23rd, 2020 core issues processing teleconference <http://wiki.edg.com/bin/view/Wg21summer2020/IssuesProcessingTeleconference2020-03-23>.
>>
>> Wording review feedback prior to the next Core issues processing teleconference would be much appreciated!
Thank you for the detailed suggestions!
>
> [lex.ccon] pX (before p6):
>
> In general, I'd like to see italics-text definitions here
> and less of a descriptive, but more of a prescriptive, tone
> applied. I also don't like the term "conditional character
> literal"; it's too obtuse. Suggestion: "non-encodable
> character literal".
That sounds good. I never cared much for "conditional character
literal" either. I'm fine with "non-encodable character literal".
>
> Suggestion:
>
> (quote)
> A /multi-character literal/ is a /character-literal/ whose /c-char-sequence/ consists of
> more than one /c-char/. A /non-encodable character literal/ is a character literal
> whose /c-char-sequence/ consists of a single /c-char/ that is not a
> /numeric-escape-sequence/ and that specifies a character that either lacks representation
> in the applicable associated character encoding or that cannot be encoded in a single code unit.
> The /encoding-prefix/ of a multi-character literal or a non-encodable character literal
> shall be absent or 'L'. Such /character-literal/s are conditionally-supported.
>
> The kind of a character-literal, its type, and its associated character encoding is determined
> by its /encoding-prefix/ and by its /c-char-sequence/ as specified in table Y, where latter
> cases exclude former ones.
> (end quote)
>
> This makes "multi-character" and "non-encodable" attributes of both non-prefix and L
> character literals, and we don't have to define the combinatorial explosion of terms
> here. In table Y, remove italics for "none" and "L" second and third rows and use
> "multi-character literal" and "non-encodable character literal" in both situations.
> Reorder in the order "multi-character", "non-encodable", ordinary.
Ok, this seems like a good direction. The only bit I'm questioning is
the "where latter cases exclude former ones" and reordering within the
table. The suggested ordering would seem to prioritize the special
cases (which makes sense), but then the "latter cases exclude former
ones" seems to reverse that such that the former cases aren't reached
(because no restrictions regarding number of c-chars or encodability are
placed on the ordinary cases). Perhaps the intent was that latter cases
are excluded by former ones?
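
To make the ordering question concrete, here is a rough sketch of the
"former cases exclude latter ones" reading, where the table rows are
tried top to bottom and the first match wins. The names Kind,
CCharSequence, and classify are hypothetical and exist only for this
illustration; nothing here is proposed wording.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of the three table rows, in the suggested order.
enum class Kind { MultiCharacter, NonEncodable, Ordinary };

// Hypothetical summary of a c-char-sequence (assumed inputs, not real API).
struct CCharSequence {
    std::size_t c_char_count;   // number of c-chars in the literal
    bool is_numeric_escape;     // single c-char is a numeric-escape-sequence
    bool fits_one_code_unit;    // character is encodable in one code unit
};

// Rows are tested in order; an earlier match excludes every later row.
constexpr Kind classify(CCharSequence s) {
    if (s.c_char_count > 1) return Kind::MultiCharacter;
    if (!s.is_numeric_escape && !s.fits_one_code_unit) return Kind::NonEncodable;
    return Kind::Ordinary;  // reached only when no special case applied
}

static_assert(classify({2, false, true}) == Kind::MultiCharacter, "first row wins");
static_assert(classify({1, false, false}) == Kind::NonEncodable, "second row");
static_assert(classify({1, true, false}) == Kind::Ordinary, "numeric escapes are not non-encodable");
static_assert(classify({1, false, true}) == Kind::Ordinary, "fallback row");
```

With the opposite reading, the unrestricted ordinary row would match
everything and the two special rows would never be selected.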
>
>
> In lex.ccon pZ
>
> Simplify and reorder; we've said most of it in X already:
>
> Z.1 A multi-character literal or a non-encodable character literal has an implementation-defined value.
>
> Z.2 A character literal consisting of a single /numeric-escape-sequence/ specifying an integer value v
> has the following value:
> - If v does not exceed the range of the type of the /character-literal/, its value is v.
> - Otherwise, if the /encoding-prefix/ of the /character-literal/ is absent or L, the value is implementation-defined
> - Otherwise, the program is ill-formed.
>
> Z.3 A character literal consisting of a single /conditional-escape-sequence/ is conditionally-supported
> and has an implementation-defined value.
>
> Z.4 Any other character literal has the value of the corresponding code unit in the applicable associated character encoding.
Thank you. I struggled with this and knew what is in the draft wouldn't
survive review. This is better.
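
As a sanity check on my reading of Z.2, the three bullets can be
modeled roughly as below. This is only a sketch: Prefix, Outcome, and
classify_numeric_escape are hypothetical names, and "the range of the
type" is approximated by a largest-code-unit-value parameter, which is
an assumption on my part rather than anything taken from the wording.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the encoding-prefix (none, L, u8, u, U).
enum class Prefix { None, L, U8, U16, U32 };

// Possible results of applying Z.2 to a numeric-escape-sequence.
enum class Outcome { ValueIsV, ImplementationDefined, IllFormed };

// v: integer value named by the escape.
// max_value: assumed largest value representable for the literal's type.
constexpr Outcome classify_numeric_escape(std::uint64_t v,
                                          std::uint64_t max_value,
                                          Prefix prefix) {
    if (v <= max_value) return Outcome::ValueIsV;            // first bullet
    if (prefix == Prefix::None || prefix == Prefix::L)
        return Outcome::ImplementationDefined;               // second bullet
    return Outcome::IllFormed;                               // third bullet
}

static_assert(classify_numeric_escape(0x41, 0xFF, Prefix::None) == Outcome::ValueIsV, "");
static_assert(classify_numeric_escape(0x100, 0xFF, Prefix::None) == Outcome::ImplementationDefined, "");
static_assert(classify_numeric_escape(0x100, 0xFF, Prefix::U8) == Outcome::IllFormed, "");
```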
>
>
>
> The update does not address the concern that phase 5 encodes but phase 6
> concatenates string literals, which might change the encoding.
>
> Example: "a" and u8"b" is concatenated as u8"ab".
> Suppose my ordinary literal encoding is EBCDIC.
> Encoding first means "a" is encoded to EBCDIC and u8"b" is encoded to UTF-8.
> And then, this is normatively equivalent to u8"ab". That doesn't add up.
>
> If we concatenate first and then encode, we have the issue that
> numeric-escape-sequences might alter their meaning by the concatenation.
> Example: "\33" "3" under the status quo must be encoded as \33 followed by "3"
> (two code units).
> When we concatenate first, this becomes "\333" (presumably a single code unit).
> This is particularly serious because using string literal concatenation is the
> only safe way I am aware of how to reliably terminate a hexadecimal-escape-sequence.
>
> It seems we need a nested mini-lexer here so that we first recognize escape-sequences,
> then we concatenate (to get the right type/encoding), then we encode. Ugh.

Yes, I have intentionally chosen not to address this concern in this
paper; in part because this paper is not intended to change behavior for
implementations (other than to fix what seem to be unintended bugs in
some implementations). But for this, there is implementation divergence
that I think is not due to unintended behavior; Visual C++ does appear
to implement the encode-first approach prescribed by the standard. See
https://github.com/sg16-unicode/sg16/issues/47 and
https://msvc.godbolt.org/z/4buyxk.

A core issue was requested for this in
https://lists.isocpp.org/core/2019/03/5770.php, but I don't think it was
ever added to the active issues list.
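
For reference, the escape sequence half of the concern can be checked
with a few static assertions. This is a minimal sketch that assumes an
8-bit char and an implementation that recognizes escape sequences
before concatenating adjacent string literals, as the standard
prescribes; the variable names are made up for the example.

```cpp
#include <cassert>
#include <cstddef>

// "\33" "3": the octal escape ends at the literal boundary, so the array
// holds code unit 033, the code unit for '3', and the terminating null.
constexpr std::size_t separate_size = sizeof("\33" "3");

// "\333": one octal escape (up to three digits) plus the terminating null.
constexpr std::size_t merged_size = sizeof("\333");

static_assert(separate_size == 3, "escape recognized before concatenation");
static_assert(merged_size == 2, "single three-digit octal escape");

// A hexadecimal escape has no digit limit, so splitting the literal is the
// usual way to terminate it before a following hex digit such as 'b'.
constexpr char hex_then_b[] = "\x1" "b";
static_assert(sizeof(hex_then_b) == 3, "\\x1, 'b', and the terminating null");
static_assert(hex_then_b[0] == 1 && hex_then_b[1] == 'b', "");
```

If concatenation happened before escape sequences were recognized, the
first and last pairs of assertions would fail.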

>
>
> We should be clear in the text whether an implementation is allowed to encode
> a sequence of non-numeric-escape-sequence s-chars as a whole, or whether
> each character is encoded separately. There was concern that "separately"
> doesn't address stateful encodings, where the encoding of string character
> i+1 may depend on what string character i was.
I added notes about that, but it sounds like you want something that
explicitly grants such allowances normatively, is that correct?
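
To illustrate why the distinction matters, here is a toy model of a
stateful encoding. Everything in it is hypothetical: the shift bytes,
the rule that uppercase letters stand in for characters needing the
shifted state, and the function names. It is a sketch of the general
problem, not of any real encoding or implementation.

```cpp
#include <cassert>
#include <string>

// Hypothetical shift bytes for a two-state toy encoding.
constexpr char SHIFT_OUT = '\x0E';  // enter the shifted state
constexpr char SHIFT_IN = '\x0F';   // return to the initial state

// Assumption for the toy: uppercase letters require the shifted state.
inline bool needs_shift(char c) { return c >= 'A' && c <= 'Z'; }

// Encode the sequence as a whole, carrying shift state between characters.
inline std::string encode_whole(const std::string& s) {
    std::string out;
    bool shifted = false;
    for (char c : s) {
        if (needs_shift(c) != shifted) {
            out += shifted ? SHIFT_IN : SHIFT_OUT;
            shifted = !shifted;
        }
        out += c;
    }
    if (shifted) out += SHIFT_IN;  // always end in the initial state
    return out;
}

// Encode each character in isolation, starting from the initial state.
inline std::string encode_separately(const std::string& s) {
    std::string out;
    for (char c : s) out += encode_whole(std::string(1, c));
    return out;
}
```

In this model, encoding "AB" as a whole yields four code units (shift
out, A, B, shift in), while encoding each character separately yields
six, so the two readings are observably different.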
>
> Maybe replace "associated character encoding" -> "associated literal encoding"
> globally to avoid the mention of "character" here.
Despite the use of the "C" word, "character encoding" is more consistent
with Unicode terminology. Though if we really want to be consistent, we
should use "character encoding form" (which ISO/IEC 10646 then calls
simply "encoding form"). This is something we could discuss at the SG16
meeting next week.
>
> "These sequences should have no effect on encoding state for stateful character encodings."
> -> "These sequences are assumed not to affect encoding state for stateful character encodings."
>
> In general, we can't use "should" (normative encouragement) in notes.

Ah, yes. I should know by now that I shouldn't use should.

Tom.

Received on 2020-07-02 00:43:15