sg16: Re: [SG16] [isocpp-core] New draft revision: D2029R2 (Proposed resolution for core issues 411, 1656, and 2333; numeric and universal character escapes in character and string literals)

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Thu, 2 Jul 2020 09:02:56 +0200

On 02/07/2020 07.39, Tom Honermann wrote:
> On 7/1/20 3:28 AM, Jens Maurer wrote:

>> Suggestion:
>>
>> (quote)
>> A /multi-character literal/ is a /character-literal/ whose /c-char-sequence/ consists of
>> more than one /c-char/. A /non-encodable character literal/ is a character literal
>> whose /c-char-sequence/ consists of a single /c-char/ that is not a
>> /numeric-escape-sequence/ and that specifies a character that either lacks representation
>> in the applicable associated character encoding or that cannot be encoded in a single code unit.
>> The /encoding-prefix/ of a multi-character literal or a non-encodable character literal
>> shall be absent or 'L'. Such /character-literal/s are conditionally-supported.
>>
>> The kind of a character-literal, its type, and its associated character encoding are determined
>> by its /encoding-prefix/ and by its /c-char-sequence/ as specified in table Y, where latter
>> cases exclude former ones.
>> (end quote)
>>
>> This makes "multi-character" and "non-encodable" attributes of both non-prefix and L
>> character literals, and we don't have to define the combinatorial explosion of terms
>> here. In table Y, remove italics for "none" and "L" second and third rows and use
>> "multi-character literal" and "non-encodable character literal" in both situations.
>> Reorder in the order "multi-character", "non-encodable", ordinary.
> Ok, this seems like a good direction. The only bit I'm questioning is the
> "where latter cases exclude former ones" and reordering within the
> table. The suggested ordering would seem to prioritize the special
> cases (which makes sense), but then the "latter cases exclude former
> ones" seems to reverse that such that the former cases aren't reached
> (because no restrictions regarding the number of c-chars or encodability are
> placed on the ordinary cases). Perhaps the intent was that latter cases
> are excluded by former ones?

Yeah, whatever makes sense. The point is that the sub-rows of the table
are not independent, but have an implied "otherwise". We need to say
something somewhere to make that happen.
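
A minimal sketch of the implied "otherwise", assuming the suggested row order
(multi-character, non-encodable, ordinary). The names Kind and classify and
the three parameters are hypothetical and appear in neither the paper nor the
standard; they only model "the first matching row wins".

```cpp
#include <cstddef>

// Hypothetical model of the sub-rows of table Y for one encoding-prefix.
enum class Kind { MultiCharacter, NonEncodable, Ordinary };

// Rows are tested in order; a later row is reached only if every earlier
// row failed to match, which is the implied "otherwise".
constexpr Kind classify(std::size_t c_char_count,
                        bool is_numeric_escape,
                        bool fits_in_one_code_unit) {
    if (c_char_count > 1) return Kind::MultiCharacter;
    if (!is_numeric_escape && !fits_in_one_code_unit) return Kind::NonEncodable;
    return Kind::Ordinary;  // no restriction of its own: it is the fallback row
}

// Two c-chars, one of them not encodable: the first row already matched.
static_assert(classify(2, false, false) == Kind::MultiCharacter, "");
// A single c-char that needs more than one code unit.
static_assert(classify(1, false, false) == Kind::NonEncodable, "");
// A numeric-escape-sequence is never "non-encodable" under the quoted wording.
static_assert(classify(1, true, false) == Kind::Ordinary, "");
```

Without the ordering, the last row would match every input, which is the
problem Tom describes.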

>> The update does not address the concern that phase 5 encodes but phase 6
>> concatenates string literals, which might change the encoding.
>>
>> Example: "a" and u8"b" are concatenated as u8"ab".
>> Suppose my ordinary literal encoding is EBCDIC.
>> Encoding first means "a" is encoded to EBCDIC and u8"b" is encoded to UTF-8.
>> And then, this is normatively equivalent to u8"ab". That doesn't add up.
>>
>> If we concatenate first and then encode, we have the issue that
>> numeric-escape-sequences might alter their meaning by the concatenation.
>> Example: "\33" "3" under the status quo must be encoded as \33 followed by "3"
>> (two code units).
>> When we concatenate first, this becomes "\333" (presumably a single code unit).
>> This is particularly serious because using string literal concatenation is the
>> only safe way I am aware of to reliably terminate a hexadecimal-escape-sequence.
>>
>> It seems we need a nested mini-lexer here so that we first recognize escape-sequences,
>> then we concatenate (to get the right type/encoding), then we encode. Ugh.
>
> Yes, I have intentionally chosen not to address this concern in this
> paper; in part because this paper is not intended to change behavior for
> implementations (other than to fix what seem to be unintended bugs in
> some implementations). But for this, there is implementation divergence
> that I think is not due to unintended behavior; Visual C++ does appear
> to implement the encode first approach prescribed by the standard. See
> https://github.com/sg16-unicode/sg16/issues/47 and
> https://msvc.godbolt.org/z/4buyxk.
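
The status-quo behavior described in the quoted examples can be checked at
compile time. This is a sketch only; the variable names are made up for
illustration, and the size checks assume nothing beyond what the quoted text
states (an escape sequence ends at the boundary of its own literal).

```cpp
// Escape sequences are recognized per literal, before concatenation.
constexpr char octal_split[]  = "\33" "3";  // \33 followed by '3'
constexpr char octal_joined[] = "\333";     // a single octal escape
constexpr char hex_split[]    = "\x1" "2";  // \x1 followed by '2'
constexpr char hex_joined[]   = "\x12";     // a single hex escape

// Sizes include the terminating null.
static_assert(sizeof(octal_split) == 3, "two code units");
static_assert(sizeof(octal_joined) == 2, "one code unit");
static_assert(octal_split[0] == 033 && octal_split[1] == '3', "");
static_assert(sizeof(hex_split) == 3, "two code units");
static_assert(hex_split[0] == 0x01 && hex_split[1] == '2', "");
static_assert(sizeof(hex_joined) == 2 && hex_joined[0] == 0x12, "");

// An unprefixed literal next to a u8 literal: the result is a UTF-8 string,
// so 'a' comes out as 0x61 whatever the ordinary literal encoding is.
constexpr const auto& mixed = "a" u8"b";
static_assert(sizeof(mixed) == 3, "");
static_assert(mixed[0] == 0x61 && mixed[1] == 0x62, "");
```

A "concatenate first, then lex escapes" model would turn the split forms into
the joined ones, which is why splitting a literal is currently a reliable way
to end a hexadecimal-escape-sequence.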

I don't understand the MSVC output for the case

const char8_t* u8_2 = u8"" "\u0102";
/execution-charset:utf-8 /std:c++latest

at all, assuming it is this line:

$SG2797 DB 0c3H, 084H, 0e2H, 080H, 09aH, 00H

I thought /execution-charset:utf-8 would select UTF-8 for
ordinary string literals, so even in an "encode first, then
concatenate" world, there should be no difference vs.
u8"" u8"\u0102".
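
One possible reading of those bytes, offered purely as an assumption and not
confirmed anywhere in this thread: they are what results from encoding U+0102
as UTF-8 (C4 82), reinterpreting each byte as Windows-1252, and encoding the
resulting characters as UTF-8 again. The sketch below only checks that
arithmetic; it says nothing about what MSVC actually does internally. The
helper names and the two-entry Windows-1252 mapping are hypothetical and
deliberately incomplete.

```cpp
#include <cstddef>
#include <cstdint>

// Small fixed-capacity buffer so the whole check can run at compile time.
struct Bytes {
    std::uint8_t data[16] = {};
    std::size_t size = 0;
    constexpr void push(std::uint8_t b) { data[size++] = b; }
};

// Append the UTF-8 encoding of a code point in the range U+0000..U+FFFF.
constexpr void append_utf8(Bytes& out, char32_t cp) {
    if (cp < 0x80) {
        out.push(static_cast<std::uint8_t>(cp));
    } else if (cp < 0x800) {
        out.push(static_cast<std::uint8_t>(0xC0 | (cp >> 6)));
        out.push(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    } else {
        out.push(static_cast<std::uint8_t>(0xE0 | (cp >> 12)));
        out.push(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        out.push(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    }
}

// Partial Windows-1252 decoder: correct only for the two bytes used here
// (0x82 -> U+201A, 0xC4 -> U+00C4); not a full table.
constexpr char32_t from_cp1252(std::uint8_t b) {
    return b == 0x82 ? static_cast<char32_t>(0x201A) : static_cast<char32_t>(b);
}

// UTF-8 encode, misread the bytes as Windows-1252, UTF-8 encode again.
constexpr Bytes encode_twice(char32_t cp) {
    Bytes once{};
    append_utf8(once, cp);
    Bytes twice{};
    for (std::size_t i = 0; i < once.size; ++i)
        append_utf8(twice, from_cp1252(once.data[i]));
    return twice;
}

constexpr Bytes observed = encode_twice(U'\u0102');
static_assert(observed.size == 5, "");
static_assert(observed.data[0] == 0xC3 && observed.data[1] == 0x84, "");
static_assert(observed.data[2] == 0xE2 && observed.data[3] == 0x80 &&
              observed.data[4] == 0x9A, "");
```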

> A core issue was requested for this in
> https://lists.isocpp.org/core/2019/03/5770.php, but I don't think it was
> ever added to the active issues list.

Mike, what's the number of the core issue for this?

Tom, please add half a paragraph to the front matter of your paper,
referring to the core issue and saying that your paper doesn't
(attempt to) address it. In the interim (so that we don't forget),
add the link to the e-mail you gave above.

>> We should be clear in the text whether an implementation is allowed to encode
>> a sequence of non-numeric-escape-sequence s-chars as a whole, or whether
>> each character is encoded separately. There was concern that "separately"
>> doesn't address stateful encodings, where the encoding of string character
>> i+1 may depend on what string character i was.

> I added notes about that, but it sounds like you want something that
> explicitly grants such allowances normatively, is that correct?

I'm seeing notes about stateful encodings; I'm not seeing a note about
the "sequence as a whole" approach in general.

Maybe in lex.string pZ.1, augment the note with:

"The encoding of a string may differ from the sequence of code units
obtained by encoding each character in the string individually."
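
A toy illustration of why the two can differ. The encoding below is entirely
hypothetical (loosely modeled on shift-out/shift-in schemes) and assumes an
ASCII-compatible source encoding: lowercase letters are emitted directly in
the initial state, uppercase letters need a preceding SO byte (0x0E), and the
encoder must return to the initial state with SI (0x0F) before finishing.

```cpp
#include <cstddef>

struct Encoded {
    unsigned char data[32] = {};
    std::size_t size = 0;
    constexpr void push(unsigned char b) { data[size++] = b; }
};

constexpr bool needs_shift(char c) { return c >= 'A' && c <= 'Z'; }

// Encode [first, last) as one unit, starting and ending in the initial state.
constexpr void encode(Encoded& out, const char* first, const char* last) {
    bool shifted = false;
    for (const char* p = first; p != last; ++p) {
        if (needs_shift(*p) != shifted) {
            shifted = !shifted;
            out.push(static_cast<unsigned char>(shifted ? 0x0E : 0x0F));
        }
        out.push(static_cast<unsigned char>(*p));
    }
    if (shifted) out.push(0x0F);  // always finish in the initial state
}

// The string encoded as a whole: encoding state carries across characters.
constexpr Encoded encode_whole(const char* s, std::size_t n) {
    Encoded out{};
    encode(out, s, s + n);
    return out;
}

// Each character encoded individually, results appended.
constexpr Encoded encode_each(const char* s, std::size_t n) {
    Encoded out{};
    for (std::size_t i = 0; i < n; ++i) encode(out, s + i, s + i + 1);
    return out;
}

static_assert(encode_whole("AB", 2).size == 4, "SO A B SI");
static_assert(encode_each("AB", 2).size == 6, "SO A SI SO B SI");
```

Both outputs decode to the same characters, but the code unit sequences
differ, which is the situation the note is meant to permit.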

>> Maybe replace "associated character encoding" -> "associated literal encoding"
>> globally to avoid the mention of "character" here.
> Despite the use of the "C" word, "character encoding" is more consistent
> with Unicode terminology. Though if we really want to be consistent, we
> should use "character encoding form" (which ISO/IEC 10646 then calls
> simply "encoding form"). This is something we could discuss at the SG16
> meeting next week.

The paper is in CWG's court; involving SG16 is not helpful at this stage
absent more severe concerns that would involve sending back the paper
as a CWG action. That said, everybody (including members of SG16) is
invited to CWG telecons to offer their opinion.

Off-topic remarks: Since we'd be using a new term such as
"literal encoding" here, I don't think Unicode will get in our
way. I'd like to point out that "character encoding" (also in the
Unicode meaning) sounds like a character-at-a-time encoding, which
we expressly don't want to require. So, choosing a different term
than one that has Unicode semantic connotations seems wise.

>> "These sequences should have no effect on encoding state for stateful character encodings."
>> -> "These sequences are assumed not to affect encoding state for stateful character encodings."
>>
>> In general, we can't use "should" (normative encouragement) in notes.
>
> Ah, yes. I should know by now that I shouldn't use should.

There are a few more "should"s that need fixing, beyond the one instance
I highlighted specifically.

Jens

Received on 2020-07-02 02:06:15