sg16: Re: [SG16] [isocpp-core] New draft revision: D2029R2 (Proposed resolution for core issues 411, 1656, and 2333; numeric and universal character escapes in character and string literals)

From: Corentin <corentin.jabot_at_[hidden]>
Date: Thu, 2 Jul 2020 19:16:13 +0200

On Thu, Jul 2, 2020, 18:45 Tom Honermann via Core <core_at_[hidden]>
wrote:

> On 7/2/20 3:02 AM, Jens Maurer via Core wrote:
>
> On 02/07/2020 07.39, Tom Honermann wrote:
>
> On 7/1/20 3:28 AM, Jens Maurer wrote:
>
> Suggestion:
>
> (quote)
> A /multi-character literal/ is a /character-literal/ whose /c-char-sequence/ consists of
> more than one /c-char/. A /non-encodable character literal/ is a character literal
> whose /c-char-sequence/ consists of a single /c-char/ that is not a
> /numeric-escape-sequence/ and that specifies a character that either lacks representation
> in the applicable associated character encoding or that cannot be encoded in a single code unit.
> The /encoding-prefix/ of a multi-character literal or a non-encodable character literal
> shall be absent or 'L'. Such /character-literal/s are conditionally-supported.
>
> The kind of a character-literal, its type, and its associated character encoding are determined
> by its /encoding-prefix/ and by its /c-char-sequence/ as specified in table Y, where latter
> cases exclude former ones.
> (end quote)
>
> This makes "multi-character" and "non-encodable" attributes of both non-prefix and L
> character literals, and we don't have to define the combinatorial explosion of terms
> here. In table Y, remove italics for "none" and "L" second and third rows and use
> "multi-character literal" and "non-encodable character literal" in both situations.
> Reorder in the order "multi-character", "non-encodable", ordinary.
>
> Ok, this seems like good direction. The only bit I'm questioning is the
> "where latter cases exclude former ones" and reordering within the
> table. The suggested ordering would seem to prioritize the special
> cases (which makes sense), but then the "latter cases exclude former
> ones" seems to reverse that such that the former cases aren't reached
> (because no restrictions regarding number of c-chars or encodability are
> placed on the ordinary cases). Perhaps the intent was that latter cases
> are excluded by former ones?
>
> Yeah, whatever makes sense. The point is that the sub-rows of the table
> are not independent, but have an implied "otherwise". We need to say
> something somewhere to make that happen.
>
> Got it.
>
> The update does not address the concern that phase 5 encodes but phase 6
> concatenates string literals, which might change the encoding.
>
> Example: "a" and u8"b" is concatenated as u8"ab".
> Suppose my ordinary literal encoding is EBCDIC.
> Encoding first means "a" is encoded to EBCDIC and u8"b" is encoded to UTF-8.
> And then, this is normatively equivalent to u8"ab". That doesn't add up.
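
The mismatch in the "a" u8"b" example can be modeled with a small simulation. This is only a sketch under stated assumptions: `Piece`, `encode_then_concatenate`, and `concatenate_then_encode` are hypothetical names, the ordinary literal encoding is assumed to be EBCDIC code page 037 (where 'a'..'i' occupy 0x81..0x89), and only the letters 'a'..'i' are handled.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical model of one string literal piece: its characters and whether
// it carried a u8 prefix. Only 'a'..'i' are supported in this sketch.
struct Piece {
    std::string chars;
    bool is_u8;
};

using CodeUnits = std::vector<std::uint8_t>;

// Assumed ordinary literal encoding: EBCDIC code page 037, 'a'..'i' at 0x81..0x89.
std::uint8_t to_ebcdic(char c) { return static_cast<std::uint8_t>(0x81 + (c - 'a')); }

// UTF-8 encodes 'a'..'i' as 0x61..0x69.
std::uint8_t to_utf8(char c) { return static_cast<std::uint8_t>(0x61 + (c - 'a')); }

// Encode each piece with its own encoding first, then join the code units.
CodeUnits encode_then_concatenate(const std::vector<Piece>& pieces) {
    CodeUnits out;
    for (const Piece& p : pieces)
        for (char c : p.chars)
            out.push_back(p.is_u8 ? to_utf8(c) : to_ebcdic(c));
    return out;
}

// Join the pieces first (any u8 piece makes the result a UTF-8 literal),
// then encode the joined characters once.
CodeUnits concatenate_then_encode(const std::vector<Piece>& pieces) {
    bool any_u8 = false;
    std::string joined;
    for (const Piece& p : pieces) {
        any_u8 = any_u8 || p.is_u8;
        joined += p.chars;
    }
    CodeUnits out;
    for (char c : joined)
        out.push_back(any_u8 ? to_utf8(c) : to_ebcdic(c));
    return out;
}
```

For the pieces "a" and u8"b", the first function yields 0x81 0x62 (an EBCDIC 'a' followed by a UTF-8 'b'), while the second yields 0x61 0x62, which is what u8"ab" would contain.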
>
> If we concatenate first and then encode, we have the issue that
> numeric-escape-sequences might alter their meaning by the concatenation.
> Example: "\33" "3" under the status quo must be encoded as \33 followed by "3"
> (two code units).
> When we concatenate first, this becomes "\333" (presumably a single code unit).
> This is particularly serious because using string literal concatenation is the
> only safe way I am aware of how to reliably terminate a hexadecimal-escape-sequence.
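
The status quo behavior described above can be checked at compile time. The following is a minimal sketch; the array name `split` is only illustrative, and the checks assume an implementation that interprets escape sequences before concatenating adjacent literals, as the current wording prescribes.

```cpp
// "\33" "3": the octal escape ends with the first literal, so the result
// holds two code units plus the terminating null.
static_assert(sizeof("\33" "3") == 3, "escape, '3', and null terminator");

// "\333": the same characters in one literal form a single three-digit
// octal escape, so only one code unit precedes the terminating null.
static_assert(sizeof("\333") == 2, "one code unit plus null terminator");

// Splitting the literal also ends a hexadecimal escape before a character
// that happens to be a hex digit: "\x41" "B" is two code units.
static_assert(sizeof("\x41" "B") == 3, "two code units plus null terminator");

// The first code unit of the split form is octal 33 (decimal 27).
constexpr char split[] = "\33" "3";
static_assert(split[0] == 27 && split[1] == '3', "escape value, then '3'");
```

Written without the split, "\x41B" would be read as one hexadecimal escape with the value 0x41B, which does not fit in a `char` code unit.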
>
> It seems we need a nested mini-lexer here so that we first recognize escape-sequences,
> then we concatenate (to get the right type/encoding), then we encode. Ugh.
>
> Yes, I have intentionally chosen not to address this concern in this
> paper; in part because this paper is not intended to change behavior for
> implementations (other than to fix what seem to be unintended bugs in
> some implementations). But for this, there is implementation divergence
> that I think is not due to unintended behavior; Visual C++ does appear
> to implement the encode first approach prescribed by the standard. See
> https://github.com/sg16-unicode/sg16/issues/47 and https://msvc.godbolt.org/z/4buyxk.
>
> I don't understand the MSVC output for the case
>
> const char8_t* u8_2 = u8"" "\u0102";
> /execution-charset:utf-8 /std:c++latest
>
> at all, assuming it is this line:
>
> $SG2797 DB 0c3H, 084H, 0e2H, 080H, 09aH, 00H
>
> Unchecking "Unused labels" in the "Filter..." drop down list makes it
> easier to correlate the lines.
>
> I think this case does actually reflect unintended behavior in the
> compiler. What appears to be happening is that U+0102 is encoded as UTF-8
> (0xC4 0x82) and then those individual code units are treated as
> Windows-1252 and again re-encoded as UTF-8. In Windows-1252, 0xC4 is
> U+00C4, 0x82 is U+201A, and encoding those as UTF-8 produces the sequence {
> 0xC3 0x84 } { 0xE2 0x80 0x9A }.
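
The suspected double encoding can be reproduced with a short sketch. This models the hypothesis stated above, not the compiler's actual code: the names `append_utf8`, `windows1252_to_code_point`, and `double_encode` are hypothetical, and the Windows-1252 decoder is deliberately partial.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Append the UTF-8 encoding of a code point below U+10000
// (one to three bytes, which is enough for this example).
void append_utf8(Bytes& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out.push_back(static_cast<std::uint8_t>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<std::uint8_t>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<std::uint8_t>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    }
}

// Partial Windows-1252 decoder: only 0x82 (U+201A) is special-cased.
// Other bytes are mapped as Latin-1, which is only correct outside 0x80..0x9F.
std::uint32_t windows1252_to_code_point(std::uint8_t b) {
    return b == 0x82 ? 0x201A : b;
}

// Encode as UTF-8, misread each resulting byte as Windows-1252,
// then encode those code points as UTF-8 again.
Bytes double_encode(std::uint32_t cp) {
    Bytes first;
    append_utf8(first, cp);
    Bytes second;
    for (std::uint8_t b : first)
        append_utf8(second, windows1252_to_code_point(b));
    return second;
}
```

Calling `double_encode(0x0102)` gives 0xC3 0x84 0xE2 0x80 0x9A, the same bytes as in the `$SG2797` line quoted above (without the terminating null).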
>
> I thought /execution-charset:utf-8 would select UTF-8 for
> ordinary string literals, so even in an "encode first, then
> concatenate" world, there should be no difference vs.
> u8"" u8"\u0102".
>
> That is what I would expect as well.
>
> A core issue was requested for this in
> , but I don't think it was
> ever added to the active issues list.
>
> Mike, what's the number of the core issue for this?
>
> I'll send a separate email requesting this so that it gets more visibility.
>
> Tom, please add half a paragraph to the front matter of your paper,
> referring to the core issue and saying that your paper doesn't
> (attempt to) address it. In the interim (so that we don't forget),
> add the link to the e-mail you gave above.
>
> Good idea, will do.
>
> We should be clear in the text whether an implementation is allowed to encode
> a sequence of non-numeric-escape-sequence s-chars as a whole, or whether
> each character is encoded separately. There was concern that "separately"
> doesn't address stateful encodings, where the encoding of string character
> i+1 may depend on what string character i was.
>
> I added notes about that, but it sounds like you want something that
> explicitly grants such allowances normatively, is that correct?
>
> I'm seeing notes about stateful encodings; I'm not seeing a note about
> the "sequence as a whole" approach in general.
>
> Maybe in lex.string pZ.1 augment the note
>
> "The encoding of a string may differ from the sequence of code units
> obtained by encoding each character in the string individually."
>
> Ok. I was intentional in not prescribing either a "sequence as a whole"
> or a "one at a time" approach. Explicitly acknowledging both approaches
> makes sense; I like your suggestion.
>
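
To illustrate why the two approaches can differ for a stateful encoding, here is a toy sketch. The encoding is invented for the example and is not any real character set: code points below 0x80 are single bytes in the initial mode, all others (assumed below 0x10000) are two bytes in a "wide" mode entered with 0x0E and left with 0x0F. The names `encode_whole` and `encode_each` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

constexpr std::uint8_t kShiftOut = 0x0E;  // enter wide mode (toy encoding)
constexpr std::uint8_t kShiftIn  = 0x0F;  // return to the initial mode

// Encode the sequence as a whole, tracking the shift state across characters
// and returning to the initial state at the end.
Bytes encode_whole(const std::vector<std::uint32_t>& text) {
    Bytes out;
    bool wide = false;
    for (std::uint32_t cp : text) {
        const bool needs_wide = cp >= 0x80;
        if (needs_wide != wide) {
            out.push_back(needs_wide ? kShiftOut : kShiftIn);
            wide = needs_wide;
        }
        if (needs_wide) {
            out.push_back(static_cast<std::uint8_t>(cp >> 8));
            out.push_back(static_cast<std::uint8_t>(cp & 0xFF));
        } else {
            out.push_back(static_cast<std::uint8_t>(cp));
        }
    }
    if (wide) out.push_back(kShiftIn);
    return out;
}

// Encode each character separately, each time starting and ending in the
// initial state, and join the results.
Bytes encode_each(const std::vector<std::uint32_t>& text) {
    Bytes out;
    for (std::uint32_t cp : text) {
        const Bytes one = encode_whole({cp});
        out.insert(out.end(), one.begin(), one.end());
    }
    return out;
}
```

For the two code points U+3042 and U+3044, `encode_whole` produces six bytes with one pair of shift bytes, while `encode_each` produces eight bytes because every character carries its own shift pair. Both decode to the same text, which is the situation the suggested note acknowledges.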

I am a bit tired of explaining that, but now an implementation can
renormalize strings in phase 5...

> Maybe replace "associated character encoding" -> "associated literal encoding"
> globally to avoid the mention of "character" here.
>
> Despite the use of the "C" word, "character encoding" is more consistent
> with Unicode terminology. Though if we really want to be consistent, we
> should use "character encoding form" (which ISO/IEC 10646 then calls
> simply "encoding form"). This is something we could discuss at the SG16
> meeting next week.
>
> The paper is in CWG's court; involving SG16 is not helpful at this stage
> absent more severe concerns that would involve sending back the paper
> as a CWG action. That said, everybody (including members of SG16) is
> invited to CWG telecons to offer their opinion.
>
> I meant only that we could discuss the terminology SG16 desires for the
> future in conjunction with our current terminology discussions; that need
> not have any impact on this paper at this time.
>
> off-topic remarks: Since we'd be using a new term such as
> "literal encoding" here, I don't think Unicode will get into our
> way. I'd like to point out that "character encoding" (also in the
> Unicode meaning) sounds like a character-at-a-time encoding, which
> we expressly don't want to require. So, choosing a different term
> than one that has Unicode semantic connotations seems wise.
>
> I'll address this more in my response to Corentin's most recent reply, but
> I believe the term "character encoding" is correct here. Wikipedia's
> definition <https://en.wikipedia.org/wiki/Character_encoding> is useful.
> Note that Shift-JIS is a character encoding despite the fact that it
> encodes non-characters (e.g., escape sequences).
>

Yes, that term seems correct (FYI, non-graphical characters are still
considered characters).

> "These sequences should have no effect on encoding state for stateful character encodings."
> -> "These sequences are assumed not to affect encoding state for stateful character encodings."
>
> In general, we can't use "should" (normative encouragement) in notes.
>
> Ah, yes. I should know by now that I shouldn't use should.
>
> There are a few more "should"s that need fixing, beyond the one instance
> I highlighted specifically.
>
> Thanks, I'll do global search and replace.
>
> Tom.
>
> Link to this post: http://lists.isocpp.org/core/2020/07/9530.php
>

Received on 2020-07-02 12:19:40