sg16: Re: [SG16] [isocpp-core] New draft revision: D2029R2 (Proposed resolution for core issues 411, 1656, and 2333; numeric and universal character escapes in character and string literals)

From: Tom Honermann <tom_at_[hidden]>
Date: Thu, 2 Jul 2020 12:43:47 -0400

On 7/2/20 3:02 AM, Jens Maurer via Core wrote:
> On 02/07/2020 07.39, Tom Honermann wrote:
>> On 7/1/20 3:28 AM, Jens Maurer wrote:
>>> Suggestion:
>>>
>>> (quote)
>>> A /multi-character literal/ is a /character-literal/ whose /c-char-sequence/ consists of
>>> more than one /c-char/. A /non-encodable character literal/ is a character literal
>>> whose /c-char-sequence/ consists of a single /c-char/ that is not a
>>> /numeric-escape-sequence/ and that specifies a character that either lacks representation
>>> in the applicable associated character encoding or that cannot be encoded in a single code unit.
>>> The /encoding-prefix/ of a multi-character literal or a non-encodable character literal
>>> shall be absent or 'L'. Such /character-literal/s are conditionally-supported.
>>>
>>> The kind of a character-literal, its type, and its associated character encoding are determined
>>> by its /encoding-prefix/ and by its /c-char-sequence/ as specified in table Y, where latter
>>> cases exclude former ones.
>>> (end quote)
>>>
>>> This makes "multi-character" and "non-encodable" attributes of both non-prefix and L
>>> character literals, and we don't have to define the combinatorial explosion of terms
>>> here. In table Y, remove italics for "none" and "L" second and third rows and use
>>> "multi-character literal" and "non-encodable character literal" in both situations.
>>> Reorder in the order "multi-character", "non-encodable", ordinary.
>> Ok, this seems like a good direction. The only bit I'm questioning is the
>> "where latter cases exclude former ones" and reordering within the
>> table. The suggested ordering would seem to prioritize the special
>> cases (which makes sense), but then the "latter cases exclude former
>> ones" seems to reverse that such that the former cases aren't reached
>> (because no restrictions regarding number of c-chars or encodability are
>> placed on the ordinary cases). Perhaps the intent was that latter cases
>> are excluded by former ones?
> Yeah, whatever makes sense. The point is that the sub-rows of the table
> are not independent, but have an implied "otherwise". We need to say
> something somewhere to make that happen.
Got it.
>
>>> The update does not address the concern that phase 5 encodes but phase 6
>>> concatenates string literals, which might change the encoding.
>>>
>>> Example: "a" and u8"b" is concatenated as u8"ab".
>>> Suppose my ordinary literal encoding is EBCDIC.
>>> Encoding first means "a" is encoded to EBCDIC and u8"b" is encoded to UTF-8.
>>> And then, this is normatively equivalent to u8"ab". That doesn't add up.
>>>
>>> If we concatenate first and then encode, we have the issue that
>>> numeric-escape-sequences might alter their meaning by the concatenation.
>>> Example: "\33" "3" under the status quo must be encoded as \33 followed by "3"
>>> (two code units).
>>> When we concatenate first, this becomes "\333" (presumably a single code unit).
>>> This is particularly serious because using string literal concatenation is the
>>> only safe way I am aware of to reliably terminate a hexadecimal-escape-sequence.
>>>
>>> It seems we need a nested mini-lexer here so that we first recognize escape-sequences,
>>> then we concatenate (to get the right type/encoding), then we encode. Ugh.
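
The numeric-escape concern above can be checked at compile time. The following is a minimal sketch, assuming an implementation that follows the status quo (escape sequences are recognized per literal before concatenation); the names `separate` and `merged` are illustrative only and do not come from the paper.

```cpp
#include <cstddef>

// Array sizes of string literals include the terminating null character.

// Two adjacent literals: the octal escape \33 ends with the first literal,
// so the result holds \33, then '3', then '\0'.
constexpr std::size_t separate = sizeof("\33" "3");

// One literal: \333 is lexed as a single octal escape, followed by '\0'.
constexpr std::size_t merged = sizeof("\333");

static_assert(separate == 3, "concatenation must not merge the escape with '3'");
static_assert(merged == 2, "\\333 is a single numeric escape");
static_assert(("\33" "3")[0] == '\33', "first element is the escaped value");
static_assert(("\33" "3")[1] == '3', "second element is the ordinary character");
```

If concatenation happened before escape sequences were recognized, the first two assertions would report the same size, which is the change in meaning described above.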
>> Yes, I have intentionally chosen not to address this concern in this
>> paper; in part because this paper is not intended to change behavior for
>> implementations (other than to fix what seem to be unintended bugs in
>> some implementations). But for this, there is implementation divergence
>> that I think is not due to unintended behavior; Visual C++ does appear
>> to implement the encode first approach prescribed by the standard. See
>> https://github.com/sg16-unicode/sg16/issues/47 and
>> https://msvc.godbolt.org/z/4buyxk.
> I don't understand the MSVC output for the case
>
> const char8_t* u8_2 = u8"" "\u0102";
> /execution-charset:utf-8 /std:c++latest
>
> at all, assuming it is this line:
>
> $SG2797 DB 0c3H, 084H, 0e2H, 080H, 09aH, 00H

Unchecking "Unused labels" in the "Filter..." drop down list makes it
easier to correlate the lines.

I think this case does actually reflect unintended behavior in the
compiler. What appears to be happening is that U+0102 is encoded as
UTF-8 (0xC4 0x82) and then those individual code units are treated as
Windows-1252 and again re-encoded as UTF-8. In Windows-1252, 0xC4 is
U+00C4, 0x82 is U+201A, and encoding those as UTF-8 produces the
sequence { 0xC3 0x84 } { 0xE2 0x80 0x9A }.
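
The suspected double encoding can be reproduced with a short sketch. It assumes the compiler behaves as guessed above, which is not confirmed. The helpers `encode_utf8`, `decode_windows1252`, and `double_encode` are hypothetical names written for this illustration, and the Windows-1252 decoder only covers the bytes needed here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode one Unicode scalar value as UTF-8 code units.
std::vector<std::uint8_t> encode_utf8(char32_t cp) {
    std::vector<std::uint8_t> out;
    if (cp < 0x80) {
        out.push_back(static_cast<std::uint8_t>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<std::uint8_t>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<std::uint8_t>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<std::uint8_t>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    }
    return out;
}

// Partial Windows-1252 decoder (illustration only): bytes 0x00-0x7F and
// 0xA0-0xFF map to the same code point; of the 0x80-0x9F range, only 0x82
// (U+201A) is handled because it is the only one this example needs.
char32_t decode_windows1252(std::uint8_t byte) {
    assert(byte < 0x80 || byte >= 0xA0 || byte == 0x82);
    return byte == 0x82 ? char32_t{0x201A} : char32_t{byte};
}

// Encode as UTF-8, reinterpret each code unit as a Windows-1252 character,
// then encode those characters as UTF-8 again.
std::vector<std::uint8_t> double_encode(char32_t cp) {
    std::vector<std::uint8_t> result;
    for (std::uint8_t unit : encode_utf8(cp)) {
        for (std::uint8_t b : encode_utf8(decode_windows1252(unit))) {
            result.push_back(b);
        }
    }
    return result;
}
```

Under these assumptions, `double_encode(0x0102)` yields 0xC3 0x84 0xE2 0x80 0x9A, matching the bytes in the MSVC listing before the terminating 00H.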

>
> I thought /execution-charset:utf-8 would select UTF-8 for
> ordinary string literals, so even in an "encode first, then
> concatenate" world, there should be no difference vs.
> u8"" u8"\u0102".
That is what I would expect as well.
>
>> A core issue was requested for this in
>> https://lists.isocpp.org/core/2019/03/5770.php, but I don't think it was
>> ever added to the active issues list.
> Mike, what's the number of the core issue for this?
I'll send a separate email requesting this so that it gets more visibility.
>
> Tom, please add half a paragraph to the front matter of your paper,
> referring to the core issue and saying that your paper doesn't
> (attempt to) address it. In the interim (so that we don't forget),
> add the link to the e-mail you gave above.
Good idea, will do.
>
>>> We should be clear in the text whether an implementation is allowed to encode
>>> a sequence of non-numeric-escape-sequence s-chars as a whole, or whether
>>> each character is encoded separately. There was concern that "separately"
>>> doesn't address stateful encodings, where the encoding of string character
>>> i+1 may depend on what string character i was.
>> I added notes about that, but it sounds like you want something that
>> explicitly grants such allowances normatively, is that correct?
> I'm seeing notes about stateful encodings; I'm not seeing a note about
> the "sequence as a whole" approach in general.
>
> Maybe in lex.string pZ.1 augment the note
>
> "The encoding of a string may differ from the sequence of code units
> obtained by encoding each character in the string individually."
Ok. I was intentional in not prescribing either a "sequence as a whole"
or a "one at a time" approach. Explicitly acknowledging both approaches
makes sense; I like your suggestion.
>
>>> Maybe replace "associated character encoding" -> "associated literal encoding"
>>> globally to avoid the mention of "character" here.
>> Despite the use of the "C" word, "character encoding" is more consistent
>> with Unicode terminology. Though if we really want to be consistent, we
>> should use "character encoding form" (which ISO/IEC 10646 then calls
>> simply "encoding form"). This is something we could discuss at the SG16
>> meeting next week.
> The paper is in CWG's court; involving SG16 is not helpful at this stage
> absent more severe concerns that would involve sending back the paper
> as a CWG action. That said, everybody (including members of SG16) is
> invited to CWG telecons to offer their opinion.
I meant only that we could discuss the terminology SG16 desires for the
future in conjunction with our current terminology discussions; that
need not have any impact on this paper at this time.
>
> off-topic remarks: Since we'd be using a new term such as
> "literal encoding" here, I don't think Unicode will get into our
> way. I'd like to point out that "character encoding" (also in the
> Unicode meaning) sounds like a character-at-a-time encoding, which
> we expressly don't want to require. So, choosing a different term
> than one that has Unicode semantic connotations seems wise.
I'll address this more in my response to Corentin's most recent reply,
but I believe the term "character encoding" is correct here. Wikipedia's
definition <https://en.wikipedia.org/wiki/Character_encoding> is
useful. Note that Shift-JIS is a character encoding despite the fact
that it encodes non-characters (e.g., escape sequences).
>
>>> "These sequences should have no effect on encoding state for stateful character encodings."
>>> -> "These sequences are assumed not to affect encoding state for stateful character encodings."
>>>
>>> In general, we can't use "should" (normative encouragement) in notes.
>> Ah, yes. I should know by now that I shouldn't use should.
> There are a few more "should"s that need fixing, beyond the one instance
> I highlighted specifically.

Thanks, I'll do a global search and replace.

Tom.

Received on 2020-07-02 11:47:03