SG16

Subject: Re: [isocpp-core] New draft revision: D2029R2 (Proposed resolution for core issues 411, 1656, and 2333; numeric and universal character escapes in character and string literals)
From: Tom Honermann (tom_at_[hidden])
Date: 2020-07-02 13:44:07


On 7/2/20 1:16 PM, Corentin via Core wrote:
>
>
> On Thu, Jul 2, 2020, 18:45 Tom Honermann via Core
> <core_at_[hidden]> wrote:
>
> On 7/2/20 3:02 AM, Jens Maurer via Core wrote:
>> On 02/07/2020 07.39, Tom Honermann wrote:
>>> On 7/1/20 3:28 AM, Jens Maurer wrote:
>>>> Suggestion:
>>>>
>>>> (quote)
>>>> A /multi-character literal/ is a /character-literal/ whose /c-char-sequence/ consists of
>>>> more than one /c-char/. A /non-encodable character literal/ is a character literal
>>>> whose /c-char-sequence/ consists of a single /c-char/ that is not a
>>>> /numeric-escape-sequence/ and that specifies a character that either lacks representation
>>>> in the applicable associated character encoding or that cannot be encoded in a single code unit.
>>>> The /encoding-prefix/ of a multi-character literal or a non-encodable character literal
>>>> shall be absent or 'L'. Such /character-literal/s are conditionally-supported.
>>>>
>>>> The kind of a character-literal, its type, and its associated character encoding is determined
>>>> by its /encoding-prefix/ and by its /c-char-sequence/ as specified in table Y, where latter
>>>> cases exclude former ones.
>>>> (end quote)
>>>>
>>>> This makes "multi-character" and "non-encodable" attributes of both non-prefix and L
>>>> character literals, and we don't have to define the combinatorial explosion of terms
>>>> here. In table Y, remove italics for "none" and "L" second and third rows and use
>>>> "multi-character literal" and "non-encodable character literal" in both situations.
>>>> Reorder in the order "multi-character", "non-encodable", ordinary.
>>> Ok, this seems like a good direction.  The only bit I'm questioning is the
>>> "where latter cases exclude former ones" and reordering within the
>>> table.  The suggested ordering would seem to prioritize the special
>>> cases (which makes sense), but then the "latter cases exclude former
>>> ones" seems to reverse that such that the former cases aren't reached
>>> (because no restrictions regarding number of c-chars or encodability are
>>> placed on the ordinary cases).  Perhaps the intent was that latter cases
>>> are excluded by former ones?
>> Yeah, whatever makes sense. The point is that the sub-rows of the table
>> are not independent, but have an implied "otherwise". We need to say
>> something somewhere to make that happen.
> Got it.
>>>> The update does not address the concern that phase 5 encodes but phase 6
>>>> concatenates string literals, which might change the encoding.
>>>>
>>>> Example: "a" and u8"b" are concatenated as u8"ab".
>>>> Suppose my ordinary literal encoding is EBCDIC.
>>>> Encoding first means "a" is encoded to EBCDIC and u8"b" is encoded to UTF-8.
>>>> And then, this is normatively equivalent to u8"ab". That doesn't add up.
>>>>
>>>> If we concatenate first and then encode, we have the issue that
>>>> numeric-escape-sequences might alter their meaning by the concatenation.
>>>> Example: "\33" "3" under the status quo must be encoded as \33 followed by "3"
>>>> (two code units).
>>>> When we concatenate first, this becomes "\333" (presumably a single code unit).
>>>> This is particularly serious because using string literal concatenation is the
>>>> only safe way I am aware of to reliably terminate a hexadecimal-escape-sequence.
>>>>
>>>> It seems we need a nested mini-lexer here so that we first recognize escape-sequences,
>>>> then we concatenate (to get the right type/encoding), then we encode. Ugh.
>>> Yes, I have intentionally chosen not to address this concern in this
>>> paper; in part because this paper is not intended to change behavior for
>>> implementations (other than to fix what seem to be unintended bugs in
>>> some implementations).  But for this, there is implementation divergence
>>> that I think is not due to unintended behavior; Visual C++ does appear
>>> to implement the encode first approach prescribed by the standard.  See
>>> https://github.com/sg16-unicode/sg16/issues/47 and
>>> https://msvc.godbolt.org/z/4buyxk.
>> I don't understand the MSVC output for the case
>>
>> const char8_t* u8_2 = u8"" "\u0102";
>> /execution-charset:utf-8 /std:c++latest
>>
>> at all, assuming it is this line:
>>
>> $SG2797 DB 0c3H, 084H, 0e2H, 080H, 09aH, 00H
>
> Unchecking "Unused labels" in the "Filter..." drop down list makes
> it easier to correlate the lines.
>
> I think this case does actually reflect unintended behavior in the
> compiler.  What appears to be happening is that U+0102 is encoded
> as UTF-8 (0xC4 0x82) and then those individual code units are
> treated as Windows-1252 and again re-encoded as UTF-8.  In
> Windows-1252, 0xC4 is U+00C4, 0x82 is U+201A, and encoding those
> as UTF-8 produces the sequence { 0xC3 0x84 } { 0xE2 0x80 0x9A }.
>
>> I thought /execution-charset:utf-8 would select UTF-8 for
>> ordinary string literals, so even in an "encode first, then
>> concatenate" world, there should be no difference vs.
>> u8"" u8"\u0102".
> That is what I would expect as well.
>>> A core issue was requested for this in
>>> , but I don't think it was
>>> ever added to the active issues list.
>> Mike, what's the number of the core issue for this?
> I'll send a separate email requesting this so that it gets more
> visibility.
>> Tom, please add half a paragraph to the front matter of your paper,
>> referring to the core issue and saying that your paper doesn't
>> (attempt to) address it. In the interim (so that we don't forget),
>> add the link to the e-mail you gave above.
> Good idea, will do.
>>>> We should be clear in the text whether an implementation is allowed to encode
>>>> a sequence of non-numeric-escape-sequence s-chars as a whole, or whether
>>>> each character is encoded separately. There was concern that "separately"
>>>> doesn't address stateful encodings, where the encoding of string character
>>>> i+1 may depend on what string character i was.
>>> I added notes about that, but it sounds like you want something that
>>> explicitly grants such allowances normatively, is that correct?
>> I'm seeing notes about stateful encodings; I'm not seeing a note about
>> the "sequence as a whole" approach in general.
>>
>> Maybe in lex.string pZ.1 augment the note
>>
>> "The encoding of a string may differ from the sequence of code units
>> obtained by encoding each character in the string individually."
> Ok.  I was intentional in not prescribing either a "sequence as a
> whole" or a "one at a time" approach. Explicitly acknowledging
> both approaches makes sense; I like your suggestion.
>
>
> I am a bit tired of explaining that but now an implementation can
> renormalize strings in phase 5....

Yes, and as mentioned previously, if we want to prohibit that for
Unicode encodings, that can be handled in a separate paper that fully
specifies translation phases 1 and 5 for Unicode encoded source files. 
This is not that paper.

Tom.

>>>> Maybe replace "associated character encoding" -> "associated literal encoding"
>>>> globally to avoid the mention of "character" here.
>>> Despite the use of the "C" word, "character encoding" is more consistent
>>> with Unicode terminology.  Though if we really want to be consistent, we
>>> should use "character encoding form" (which ISO/IEC 10646 then calls
>>> simply "encoding form").  This is something we could discuss at the SG16
>>> meeting next week.
>> The paper is in CWG's court; involving SG16 is not helpful at this stage
>> absent more severe concerns that would involve sending back the paper
>> as a CWG action. That said, everybody (including members of SG16) is
>> invited to CWG telecons to offer their opinion.
> I meant only that we could discuss the terminology SG16 desires
> for the future in conjunction with our current terminology
> discussions; that need not have any impact on this paper at this time.
>> off-topic remarks: Since we'd be using a new term such as
>> "literal encoding" here, I don't think Unicode will get in our
>> way. I'd like to point out that "character encoding" (also in the
>> Unicode meaning) sounds like a character-at-a-time encoding, which
>> we expressly don't want to require. So, choosing a different term
>> than one that has Unicode semantic connotations seems wise.
> I'll address this more in my response to Corentin's most recent
> reply, but I believe the term "character encoding" is correct
> here. Wikipedia's definition
> <https://en.wikipedia.org/wiki/Character_encoding> is useful. 
> Note that Shift-JIS is a character encoding despite the fact that
> it encodes non-characters (e.g., escape sequences).
>
>
> Yes, that term seems correct (FYI, non-graphical characters are still
> considered characters)
>
>>>> "These sequences should have no effect on encoding state for stateful character encodings."
>>>> -> "These sequences are assumed not to affect encoding state for stateful character encodings."
>>>>
>>>> In general, we can't use "should" (normative encouragement) in notes.
>>> Ah, yes.  I should know by now that I shouldn't use should.
>> There are a few more "should"s that need fixing, beyond the one instance
>> I highlighted specifically.
>
> Thanks, I'll do a global search and replace.
>
> Tom.
>
>
> _______________________________________________
> Core mailing list
> Core_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
> Link to this post: http://lists.isocpp.org/core/2020/07/9530.php
>
>
> _______________________________________________
> Core mailing list
> Core_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
> Link to this post: http://lists.isocpp.org/core/2020/07/9532.php



SG16 list run by sg16-owner@lists.isocpp.org