Re: [SG16] [isocpp-core] New draft revision: D2029R2 (Proposed resolution for core issues 411, 1656, and 2333; numeric and universal character escapes in character and string literals)

From: Tom Honermann <tom_at_[hidden]>
Date: Thu, 2 Jul 2020 14:44:07 -0400
On 7/2/20 1:16 PM, Corentin via Core wrote:
> On Thu, Jul 2, 2020, 18:45 Tom Honermann via Core
> <core_at_[hidden]> wrote:
> On 7/2/20 3:02 AM, Jens Maurer via Core wrote:
>> On 02/07/2020 07.39, Tom Honermann wrote:
>>> On 7/1/20 3:28 AM, Jens Maurer wrote:
>>>> Suggestion:
>>>> (quote)
>>>> A /multi-character literal/ is a /character-literal/ whose /c-char-sequence/ consists of
>>>> more than one /c-char/. A /non-encodable character literal/ is a character literal
>>>> whose /c-char-sequence/ consists of a single /c-char/ that is not a
>>>> /numeric-escape-sequence/ and that specifies a character that either lacks representation
>>>> in the applicable associated character encoding or that cannot be encoded in a single code unit.
>>>> The /encoding-prefix/ of a multi-character literal or a non-encodable character literal
>>>> shall be absent or 'L'. Such /character-literal/s are conditionally-supported.
>>>> The kind of a character-literal, its type, and its associated character encoding are determined
>>>> by its /encoding-prefix/ and by its /c-char-sequence/ as specified in table Y, where latter
>>>> cases exclude former ones.
>>>> (end quote)
>>>> This makes "multi-character" and "non-encodable" attributes of both non-prefix and L
>>>> character literals, and we don't have to define the combinatorial explosion of terms
>>>> here. In table Y, remove italics for "none" and "L" second and third rows and use
>>>> "multi-character literal" and "non-encodable character literal" in both situations.
>>>> Reorder in the order "multi-character", "non-encodable", ordinary.
>>> Ok, this seems like a good direction. The only bit I'm questioning is the
>>> "where latter cases exclude former ones" and reordering within the
>>> table. The suggested ordering would seem to prioritize the special
>>> cases (which makes sense), but then the "latter cases exclude former
>>> ones" seems to reverse that such that the former cases aren't reached
>>> (because no restrictions regarding number of c-chars or encodability are
>>> placed on the ordinary cases). Perhaps the intent was that latter cases
>>> are excluded by former ones?
>> Yeah, whatever makes sense. The point is that the sub-rows of the table
>> are not independent, but have an implied "otherwise". We need to say
>> something somewhere to make that happen.
> Got it.
>>>> The update does not address the concern that phase 5 encodes but phase 6
>>>> concatenates string literals, which might change the encoding.
>>>> Example: "a" and u8"b" are concatenated as u8"ab".
>>>> Suppose my ordinary literal encoding is EBCDIC.
>>>> Encoding first means "a" is encoded to EBCDIC and u8"b" is encoded to UTF-8.
>>>> And then, this is normatively equivalent to u8"ab". That doesn't add up.
>>>> If we concatenate first and then encode, we have the issue that
>>>> numeric-escape-sequences might alter their meaning by the concatenation.
>>>> Example: "\33" "3" under the status quo must be encoded as \33 followed by "3"
>>>> (two code units).
>>>> When we concatenate first, this becomes "\333" (presumably a single code unit).
>>>> This is particularly serious because string literal concatenation is the
>>>> only safe way I am aware of to reliably terminate a hexadecimal-escape-sequence.
>>>> It seems we need a nested mini-lexer here so that we first recognize escape-sequences,
>>>> then we concatenate (to get the right type/encoding), then we encode. Ugh.
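The status quo described in the quoted text above (escape sequences recognized per literal, type taken from the combined literal) can be checked at compile time. The following is a minimal sketch, not proposed wording; `code_units` is a hypothetical helper introduced only for illustration, and the `char8_t` check is guarded because it assumes a C++20 compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical helper: number of code units in an ordinary string literal,
// not counting the terminating NUL.
template <std::size_t N>
constexpr std::size_t code_units(const char (&)[N]) { return N - 1; }

// An unprefixed literal adjacent to a prefixed one takes on that prefix,
// so the concatenated literal has the prefixed literal's element type.
static_assert(std::is_same_v<decltype("a" u"b"), const char16_t (&)[3]>);
#ifdef __cpp_char8_t
static_assert(std::is_same_v<decltype("a" u8"b"), const char8_t (&)[3]>);
#endif

// Escape sequences are recognized before concatenation:
// "\33" "3" is the octal escape \33 followed by the character '3'.
static_assert(code_units("\33" "3") == 2);
// Written as one literal, \333 is a single octal escape (one code unit).
static_assert(code_units("\333") == 1);

// Splitting the literal terminates a hexadecimal escape sequence.
constexpr char split[] = "\x1" "f";   // code unit 0x01, then 'f'
constexpr char joined[] = "\x1f";     // single code unit 0x1F
static_assert(split[0] == 0x01 && split[1] == 'f');
static_assert(sizeof(joined) == 2 && joined[0] == 0x1f);
```

A "concatenate first, then lex escapes" model would make the first and second `code_units` checks agree, which is the behavior change the quoted text warns about.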
>>> Yes, I have intentionally chosen not to address this concern in this
>>> paper; in part because this paper is not intended to change behavior for
>>> implementations (other than to fix what seem to be unintended bugs in
>>> some implementations). But for this, there is implementation divergence
>>> that I think is not due to unintended behavior; Visual C++ does appear
>>> to implement the encode first approach prescribed by the standard. See
>>> https://github.com/sg16-unicode/sg16/issues/47 and
>>> https://msvc.godbolt.org/z/4buyxk.
>> I don't understand the MSVC output for the case
>> const char8_t* u8_2 = u8"" "\u0102";
>> /execution-charset:utf-8 /std:c++latest
>> at all, assuming it is this line:
>> $SG2797 DB 0c3H, 084H, 0e2H, 080H, 09aH, 00H
> Unchecking "Unused labels" in the "Filter..." drop down list makes
> it easier to correlate the lines.
> I think this case does actually reflect unintended behavior in the
> compiler. What appears to be happening is that U+0102 is encoded
> as UTF-8 (0xC4 0x82) and then those individual code units are
> treated as Windows-1252 and again re-encoded as UTF-8. In
> Windows-1252, 0xC4 is U+00C4, 0x82 is U+201A, and encoding those
> as UTF-8 produces the sequence { 0xC3 0x84 } { 0xE2 0x80 0x9A }.
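The suspected double encoding can be reproduced by hand. The sketch below is only an illustration of the arithmetic described above, not a model of the compiler's actual code path; `to_utf8`, `from_windows_1252`, and `double_encode` are hypothetical names, and the Windows-1252 decoder deliberately handles only the bytes this example needs.

```cpp
#include <cassert>
#include <string>

// Encode one code point as UTF-8. Assumes cp <= U+FFFF and not a surrogate,
// which is all this example requires.
std::string to_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Partial Windows-1252 decoder. Bytes 0x00-0x7F and 0xA0-0xFF map to the
// code point with the same value; of the 0x80-0x9F range only 0x82 is
// handled here, the rest of that range is NOT decoded correctly.
char32_t from_windows_1252(unsigned char byte) {
    if (byte == 0x82) return 0x201A;  // SINGLE LOW-9 QUOTATION MARK
    return byte;
}

// Encode as UTF-8, misread each code unit as Windows-1252, encode again.
std::string double_encode(char32_t cp) {
    std::string out;
    for (char unit : to_utf8(cp)) {
        out += to_utf8(from_windows_1252(static_cast<unsigned char>(unit)));
    }
    return out;
}
```

With these definitions, `double_encode(0x0102)` yields the five bytes C3 84 E2 80 9A, matching the `DB` line in the MSVC output quoted earlier.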
>> I thought /execution-charset:utf-8 would select UTF-8 for
>> ordinary string literals, so even in an "encode first, then
>> concatenate" world, there should be no difference vs.
>> u8"" u8"\u0102".
> That is what I would expect as well.
>>> A core issue was requested for this in
>>> , but I don't think it was
>>> ever added to the active issues list.
>> Mike, what's the number of the core issue for this?
> I'll send a separate email requesting this so that it gets more
> visibility.
>> Tom, please add half a paragraph to the front matter of your paper,
>> referring to the core issue and saying that your paper doesn't
>> (attempt to) address it. In the interim (so that we don't forget),
>> add the link to the e-mail you gave above.
> Good idea, will do.
>>>> We should be clear in the text whether an implementation is allowed to encode
>>>> a sequence of non-numeric-escape-sequence s-chars as a whole, or whether
>>>> each character is encoded separately. There was concern that "separately"
>>>> doesn't address stateful encodings, where the encoding of string character
>>>> i+1 may depend on what string character i was.
>>> I added notes about that, but it sounds like you want something that
>>> explicitly grants such allowances normatively, is that correct?
>> I'm seeing notes about stateful encodings; I'm not seeing a note about
>> the "sequence as a whole" approach in general.
>> Maybe in lex.string pZ.1 augment the note
>> "The encoding of a string may differ from the sequence of code units
>> obtained by encoding each character in the string individually."
> Ok. I was intentional in not prescribing either a "sequence as a
> whole" or a "one at a time" approach. Explicitly acknowledging
> both approaches makes sense; I like your suggestion.
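To make the difference between the two approaches concrete, here is a minimal sketch using a made-up stateful encoding. Everything in it is an assumption for illustration: the toy encoding (ASCII in the initial state, Greek lowercase letters reachable after an SO byte, SI to return), and the names `encode_sequence` and `encode_individually`. It does not correspond to any real character set.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Toy stateful encoding (hypothetical): SO switches to the alternate set,
// SI switches back to the initial (ASCII) state.
constexpr char SO = 0x0E;
constexpr char SI = 0x0F;

// The alternate set covers U+03B1..U+03C9 in this sketch.
bool in_alt_set(char32_t c) { return c >= 0x3B1 && c <= 0x3C9; }

// Encode the sequence as a whole: emit a shift byte only when the state
// changes, and return to the initial state at the end.
// Characters outside the alternate set are assumed to be ASCII.
std::string encode_sequence(std::u32string_view text) {
    std::string out;
    bool shifted = false;
    for (char32_t c : text) {
        const bool alt = in_alt_set(c);
        if (alt != shifted) {
            out += alt ? SO : SI;
            shifted = alt;
        }
        out += alt ? static_cast<char>('a' + (c - 0x3B1))
                   : static_cast<char>(c);
    }
    if (shifted) out += SI;
    return out;
}

// Encode each character on its own, starting and ending in the initial
// state every time, then join the results.
std::string encode_individually(std::u32string_view text) {
    std::string out;
    for (char32_t c : text) {
        out += encode_sequence(std::u32string_view(&c, 1));
    }
    return out;
}
```

For the two-character input U"\u03B1\u03B2", `encode_sequence` produces four bytes (SO, 'a', 'b', SI) while `encode_individually` produces six (SO, 'a', SI, SO, 'b', SI). Both decode to the same text, which is the situation the suggested note is meant to permit.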
> I am a bit tired of explaining that but now an implementation can
> renormalize strings in phase 5....

Yes, and as mentioned previously, if we want to prohibit that for
Unicode encodings, that can be handled in a separate paper that fully
specifies translation phases 1 and 5 for Unicode encoded source files.
This is not that paper.


>>>> Maybe replace "associated character encoding" -> "associated literal encoding"
>>>> globally to avoid the mention of "character" here.
>>> Despite the use of the "C" word, "character encoding" is more consistent
>>> with Unicode terminology. Though if we really want to be consistent, we
>>> should use "character encoding form" (which ISO/IEC 10646 then calls
>>> simply "encoding form"). This is something we could discuss at the SG16
>>> meeting next week.
>> The paper is in CWG's court; involving SG16 is not helpful at this stage
>> absent more severe concerns that would involve sending back the paper
>> as a CWG action. That said, everybody (including members of SG16) is
>> invited to CWG telecons to offer their opinion.
> I meant only that we could discuss the terminology SG16 desires
> for the future in conjunction with our current terminology
> discussions; that need not have any impact on this paper at this time.
>> off-topic remarks: Since we'd be using a new term such as
>> "literal encoding" here, I don't think Unicode will get into our
>> way. I'd like to point out that "character encoding" (also in the
>> Unicode meaning) sounds like a character-at-a-time encoding, which
>> we expressly don't want to require. So, choosing a different term
>> than one that has Unicode semantic connotations seems wise.
> I'll address this more in my response to Corentin's most recent
> reply, but I believe the term "character encoding" is correct
> here. Wikipedia's definition
> <https://en.wikipedia.org/wiki/Character_encoding> is useful.
> Note that Shift-JIS is a character encoding despite the fact that
> it encodes non-characters (e.g., escape sequences).
> Yes, that term seems correct (FYI, non-graphical characters are still
> considered characters).
>>>> "These sequences should have no effect on encoding state for stateful character encodings."
>>>> -> "These sequences are assumed not to affect encoding state for stateful character encodings."
>>>> In general, we can't use "should" (normative encouragement) in notes.
>>> Ah, yes. I should know by now that I shouldn't use should.
>> There are a few more "should"s that need fixing, beyond the one instance
>> I highlighted specifically.
> Thanks, I'll do a global search and replace.
> Tom.
> _______________________________________________
> Core mailing list
> Core_at_[hidden] <mailto:Core_at_[hidden]>
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
> Link to this post: http://lists.isocpp.org/core/2020/07/9530.php
> _______________________________________________
> Core mailing list
> Core_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
> Link to this post: http://lists.isocpp.org/core/2020/07/9532.php

Received on 2020-07-02 13:47:26