sg16: Re: [SG16] [isocpp-core] New draft revision: D2029R2 (Proposed resolution for core issues 411, 1656, and 2333; numeric and universal character escapes in character and string literals)

From: Corentin <corentin.jabot_at_[hidden]>
Date: Thu, 2 Jul 2020 09:15:32 +0200

On Thu, 2 Jul 2020 at 09:04, Jens Maurer via Core <core_at_[hidden]>
wrote:

> On 02/07/2020 07.39, Tom Honermann wrote:
> > On 7/1/20 3:28 AM, Jens Maurer wrote:
>
> >> Suggestion:
> >>
> >> (quote)
> >> A /multi-character literal/ is a /character-literal/ whose
> >> /c-char-sequence/ consists of more than one /c-char/. A
> >> /non-encodable character literal/ is a character literal whose
> >> /c-char-sequence/ consists of a single /c-char/ that is not a
> >> /numeric-escape-sequence/ and that specifies a character that either
> >> lacks representation in the applicable associated character encoding
> >> or that cannot be encoded in a single code unit.
> >> The /encoding-prefix/ of a multi-character literal or a non-encodable
> >> character literal shall be absent or 'L'. Such /character-literal/s
> >> are conditionally-supported.
> >>
> >> The kind of a character-literal, its type, and its associated
> >> character encoding are determined by its /encoding-prefix/ and by its
> >> /c-char-sequence/ as specified in table Y, where latter cases exclude
> >> former ones.
> >> (end quote)
> >>
> >> This makes "multi-character" and "non-encodable" attributes of both
> >> non-prefix and L character literals, and we don't have to define the
> >> combinatorial explosion of terms here. In table Y, remove italics for
> >> "none" and "L" second and third rows and use "multi-character literal"
> >> and "non-encodable character literal" in both situations.
> >> Reorder in the order "multi-character", "non-encodable", ordinary.
> > Ok, this seems like good direction. The only bit I'm questioning is the
> > "where latter cases exclude former ones" and reordering within the
> > table. The suggested ordering would seem to prioritize the special
> > cases (which makes sense), but then the "latter cases exclude former
> > ones" seems to reverse that such that the former cases aren't reached
> > (because no restrictions regarding number of c-chars or encodability are
> > placed on the ordinary cases). Perhaps the intent was that latter cases
> > are excluded by former ones?
>
> Yeah, whatever makes sense. The point is that the sub-rows of the table
> are not independent, but have an implied "otherwise". We need to say
> something somewhere to make that happen.
>
> >> The update does not address the concern that phase 5 encodes but phase 6
> >> concatenates string literals, which might change the encoding.
> >>
> >> Example: "a" and u8"b" is concatenated as u8"ab".
> >> Suppose my ordinary literal encoding is EBCDIC.
> >> Encoding first means "a" is encoded to EBCDIC and u8"b" is encoded to
> >> UTF-8.
> >> And then, this is normatively equivalent to u8"ab". That doesn't add
> >> up.
> >>
> >> If we concatenate first and then encode, we have the issue that
> >> numeric-escape-sequences might alter their meaning by the concatenation.
> >> Example: "\33" "3" under the status quo must be encoded as \33
> >> followed by "3" (two code units).
> >> When we concatenate first, this becomes "\333" (presumably a single
> >> code unit).
> >> This is particularly serious because using string literal concatenation
> >> is the only safe way I am aware of how to reliably terminate a
> >> hexadecimal-escape-sequence.
> >>
> >> It seems we need a nested mini-lexer here so that we first recognize
> >> escape-sequences, then we concatenate (to get the right type/encoding),
> >> then we encode. Ugh.
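
The two concatenation cases quoted above can be checked at compile time. The
following is a minimal sketch, not part of the paper; the variable names are
made up for illustration, and the last check assumes the implementation accepts
an unprefixed literal next to a u8 literal and treats the result as u8.

```cpp
#include <cassert>
#include <cstddef>

// Adjacent literals: the octal escape \33 ends at the literal boundary,
// so the array holds \33, '3', and the terminating null.
constexpr char split_octal[] = "\33" "3";

// Single literal: the octal escape consumes all three digits.
constexpr char joined_octal[] = "\333";

// A hexadecimal escape has no digit limit, so splitting the literal is
// how it is ended before a character that would read as a hex digit.
constexpr char split_hex[] = "\x1B" "3";

// Mixed prefixes: the concatenation behaves like u8"ab", so the first
// code unit is the UTF-8 value of 'a' whatever the ordinary encoding is.
constexpr std::size_t mixed_size = sizeof("a" u8"b");
constexpr int mixed_first = static_cast<int>(("a" u8"b")[0]);

static_assert(sizeof(split_octal) == 3, "two code units plus null");
static_assert(split_octal[0] == '\33' && split_octal[1] == '3',
              "escape is not extended across the literal boundary");
static_assert(sizeof(joined_octal) == 2, "one code unit plus null");
static_assert(sizeof(split_hex) == 3, "two code units plus null");
static_assert(mixed_size == 3, "two code units plus null");
static_assert(mixed_first == 0x61, "UTF-8 'a'");
```

An implementation that concatenated before recognizing escape sequences would
fail the split_octal and split_hex checks.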
> >
> > Yes, I have intentionally chosen not to address this concern in this
> > paper; in part because this paper is not intended to change behavior for
> > implementations (other than to fix what seem to be unintended bugs in
> > some implementations). But for this, there is implementation divergence
> > that I think is not due to unintended behavior; Visual C++ does appear
> > to implement the encode first approach prescribed by the standard. See
> > https://github.com/sg16-unicode/sg16/issues/47 and
> > https://msvc.godbolt.org/z/4buyxk.
>
> I don't understand the MSVC output for the case
>
> const char8_t* u8_2 = u8"" "\u0102";
> /execution-charset:utf-8 /std:c++latest
>
> at all, assuming it is this line:
>
> $SG2797 DB 0c3H, 084H, 0e2H, 080H, 09aH, 00H
>
> I thought /execution-charset:utf-8 would select UTF-8 for
> ordinary string literals, so even in an "encode first, then
> concatenate" world, there should be no difference vs.
> u8"" u8"\u0102".
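
For reference, U+0102 encodes in UTF-8 as C4 82. The bytes in the MSVC listing
(C3 84 E2 80 9A) are what results if those two bytes are read as Windows-1252
characters and encoded to UTF-8 a second time. The sketch below only
reproduces that arithmetic; it is an observation about the byte values, not a
confirmed description of what the compiler does. The helper names are
hypothetical and the Windows-1252 table is deliberately partial.

```cpp
#include <cassert>
#include <string>

// Encode one Unicode scalar value as UTF-8.
std::string to_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Partial Windows-1252 decoder (assumption: only 0x82 is special-cased;
// the other bytes in 0x80..0x9F are not handled correctly here).
char32_t from_cp1252(unsigned char b) {
    return b == 0x82 ? char32_t{0x201A} : char32_t{b};
}

// Encode to UTF-8, misread the bytes as Windows-1252, encode again.
std::string double_encode(char32_t cp) {
    std::string out;
    for (char c : to_utf8(cp)) {
        out += to_utf8(from_cp1252(static_cast<unsigned char>(c)));
    }
    return out;
}
```

With these helpers, to_utf8(0x0102) yields C4 82 and double_encode(0x0102)
yields C3 84 E2 80 9A, matching the listing apart from the trailing null.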
>
> > A core issue was requested for this in
> > https://lists.isocpp.org/core/2019/03/5770.php, but I don't think it was
> > ever added to the active issues list.
>
> Mike, what's the number of the core issue for this?
>
> Tom, please add half a paragraph to the front matter of your paper,
> referring to the core issue and saying that your paper doesn't
> (attempt to) address it. In the interim (so that we don't forget),
> add the link to the e-mail you gave above.
>
> >> We should be clear in the text whether an implementation is allowed to
> >> encode a sequence of non-numeric-escape-sequence s-chars as a whole, or
> >> whether each character is encoded separately. There was concern that
> >> "separately" doesn't address stateful encodings, where the encoding of
> >> string character i+1 may depend on what string character i was.
>
> > I added notes about that, but it sounds like you want something that
> > explicitly grants such allowances normatively, is that correct?
>
> I'm seeing notes about stateful encodings; I'm not seeing a note about
> the "sequence as a whole" approach in general.
>
> Maybe in lex.string pZ.1 augment the note
>
> "The encoding of a string may differ from the sequence of code units
> obtained by encoding each character in the string individually."
>
> >> Maybe replace "associated character encoding" -> "associated literal
> >> encoding" globally to avoid the mention of "character" here.
> > Despite the use of the "C" word, "character encoding" is more consistent
> > with Unicode terminology. Though if we really want to be consistent, we
> > should use "character encoding form" (which ISO/IEC 10646 then calls
> > simply "encoding form"). This is something we could discuss at the SG16
> > meeting next week.
>
> The paper is in CWG's court; involving SG16 is not helpful at this stage
> absent more severe concerns that would involve sending back the paper
> as a CWG action. That said, everybody (including members of SG16) is
> invited to CWG telecons to offer their opinion.
>
> Off-topic remarks: Since we'd be using a new term such as
> "literal encoding" here, I don't think Unicode will get in our
> way. I'd like to point out that "character encoding" (also in the
> Unicode meaning) sounds like a character-at-a-time encoding, which
> we expressly don't want to require. So, choosing a different term
> than one that has Unicode semantic connotations seems wise.
>

"Literal encoding" is a less ambiguous term either way.
We need terminology that lets us distinguish the encoding of literals
from that of runtime strings; "literal (associated) encoding" achieves that.

>
> >> "These sequences should have no effect on encoding state for stateful
> >> character encodings."
> >> -> "These sequences are assumed not to affect encoding state for
> >> stateful character encodings."
> >>
> >> In general, we can't use "should" (normative encouragement) in notes.
> >
> > Ah, yes. I should know by now that I shouldn't use should.
>
> There are a few more "should"s that need fixing, beyond the one instance
> I highlighted specifically.
>
> Jens
> _______________________________________________
> Core mailing list
> Core_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
> Link to this post: http://lists.isocpp.org/core/2020/07/9525.php
>

Received on 2020-07-02 02:18:58