Re: [SG16] [isocpp-core] New draft revision: D2029R2 (Proposed resolution for core issues 411, 1656, and 2333; numeric and universal character escapes in character and string literals)

From: Corentin <corentin.jabot_at_[hidden]>
Date: Wed, 1 Jul 2020 09:44:41 +0200
On Wed, 1 Jul 2020 at 09:29, Jens Maurer via Core <core_at_[hidden]>
wrote:

> On 28/06/2020 06.50, Tom Honermann via Core wrote:
> > A new draft revision of P2029 (Proposed resolution for core issues 411,
> 1656, and 2333; numeric and universal character escapes in character and
> string literals) is now available at
> https://rawgit.com/sg16-unicode/sg16/master/papers/d2029r2.html. This
> addresses the CWG feedback provided during the March 23rd, 2020 core issues
> processing teleconference <
> http://wiki.edg.com/bin/view/Wg21summer2020/IssuesProcessingTeleconference2020-03-23
> >.
> >
> > Wording review feedback prior to the next Core issues processing
> teleconference would be much appreciated!
>
>
> [lex.ccon] pX (before p6):
>
> In general, I'd like to see italics-text definitions here
> and less of a descriptive, but more of a prescriptive, tone
> applied. I also don't like the term "conditional character
> literal"; it's too obtuse. Suggestion: "non-encodable
> character literal".
>
> Suggestion:
>
> (quote)
> A /multi-character literal/ is a /character-literal/ whose
> /c-char-sequence/ consists of more than one /c-char/.
> A /non-encodable character literal/ is a character literal whose
> /c-char-sequence/ consists of a single /c-char/ that is not a
> /numeric-escape-sequence/ and that specifies a character that either
> lacks representation in the applicable associated character encoding
> or cannot be encoded in a single code unit.
> The /encoding-prefix/ of a multi-character literal or a non-encodable
> character literal shall be absent or 'L'. Such /character-literal/s are
> conditionally-supported.
>
> The kind of a character-literal, its type, and its associated character
> encoding are determined by its /encoding-prefix/ and by its
> /c-char-sequence/ as specified in table Y, where latter cases exclude
> former ones.
> (end quote)
>
> This makes "multi-character" and "non-encodable" attributes of both
> non-prefix and L character literals, and we don't have to define the
> combinatorial explosion of terms here. In table Y, remove italics for
> "none" and "L" second and third rows and use "multi-character literal"
> and "non-encodable character literal" in both situations.
> Reorder in the order "multi-character", "non-encodable", ordinary.
>
>
> In lex.ccon pZ
>
> Simplify and reorder; we've said most of it in X already:
>
> Z.1 A multi-character literal or a non-encodable character literal has an
> implementation-defined value.
>
> Z.2 A character literal consisting of a single /numeric-escape-sequence/
> specifying an integer value v has the following value:
> - If v does not exceed the range of the type of the
> /character-literal/, its value is v.
> - Otherwise, if the /encoding-prefix/ of the /character-literal/ is
> absent or L, the value is implementation-defined.
> - Otherwise, the program is ill-formed.
>
> Z.3 A character literal consisting of a single
> /conditional-escape-sequence/ is conditionally-supported
> and has an implementation-defined value.
>
> Z.4 Any other character literal has the value of the corresponding code
> unit in the applicable associated character encoding.
>
>
>
> The update does not address the concern that phase 5 encodes but phase 6
> concatenates string literals, which might change the encoding.
>
> Example: "a" and u8"b" are concatenated as u8"ab".
> Suppose my ordinary literal encoding is EBCDIC.
> Encoding first means "a" is encoded to EBCDIC and u8"b" is encoded to
> UTF-8.
> And then, this is normatively equivalent to u8"ab". That doesn't add up.
>
> If we concatenate first and then encode, we have the issue that
> numeric-escape-sequences might alter their meaning by the concatenation.
> Example: "\33" "3" under the status quo must be encoded as \33 followed
> by "3" (two code units).
> When we concatenate first, this becomes "\333" (presumably a single code
> unit).
> This is particularly serious because using string literal concatenation is
> the only safe way I am aware of to reliably terminate a
> hexadecimal-escape-sequence.
>
> It seems we need a nested mini-lexer here so that we first recognize
> escape-sequences, then we concatenate (to get the right type/encoding),
> then we encode. Ugh.
>

Yes, doing three phases seems to be the only viable option, though that may
be out of scope for this paper.
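
A minimal sketch of the escape-sequence concern quoted above, assuming an
8-bit char and an ASCII-compatible ordinary literal encoding. The names
split, joined, hex_split, and code_units are illustrative only and do not
come from the paper:

```cpp
#include <cstddef>

// Number of code units in a string literal, excluding the terminating NUL.
template <std::size_t N>
constexpr std::size_t code_units(const char (&)[N]) { return N - 1; }

// Status quo: escape sequences are interpreted per literal, then the
// literals are concatenated. This yields ESC (octal 33) followed by '3'.
constexpr char split[] = "\33" "3";

// If the source text were concatenated first, the same characters would
// form a single octal escape sequence.
constexpr char joined[] = "\333";

// Concatenation used to terminate a hexadecimal-escape-sequence:
// ESC (hex 1b) followed by 'c', rather than the escape sequence \x1bc.
constexpr char hex_split[] = "\x1b" "c";

static_assert(code_units(split) == 2, "two code units: \\33 then '3'");
static_assert(split[0] == 033 && split[1] == '3', "ESC followed by '3'");
static_assert(code_units(joined) == 1, "one code unit: \\333");
static_assert(static_cast<unsigned char>(joined[0]) == 0333, "octal 333");
static_assert(code_units(hex_split) == 2 && hex_split[1] == 'c',
              "hex escape ends at the literal boundary");
```

The assertions hold only because escape sequences are recognized before
concatenation; a concatenate-then-encode model without a separate
escape-recognition step would change the first and last results.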


>
> We should be clear in the text whether an implementation is allowed to
> encode
> a sequence of non-numeric-escape-sequence s-chars as a whole, or whether
> each character is encoded separately. There was concern that "separately"
> doesn't address stateful encodings, where the encoding of string character
> i+1 may depend on what string character i was.
>

We should be careful not to change the behavior here.
Encoding whole sequences would allow an implementation to encode
<latin small letter e, combining acute accent> as
<latin small letter e with acute>, which is not the behavior currently
described by the standard.
I think this question (whether or not an implementation should be able to
do that) is much more important than trying to describe the idiosyncrasies
of all encodings.
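
To make the distinction concrete, here is a small sketch using UTF-32
literals, where each code point maps to exactly one code unit. The names
decomposed, precomposed, and unit_count are illustrative assumptions, not
taken from the paper:

```cpp
#include <cstddef>

// Number of code units in a string literal, excluding the terminating NUL.
template <typename CharT, std::size_t N>
constexpr std::size_t unit_count(const CharT (&)[N]) { return N - 1; }

// U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT.
constexpr char32_t decomposed[] = U"e\u0301";

// U+00E9 LATIN SMALL LETTER E WITH ACUTE.
constexpr char32_t precomposed[] = U"\u00E9";

// Encoding each character separately keeps the two spellings distinct.
static_assert(unit_count(decomposed) == 2, "e + combining accent");
static_assert(decomposed[0] == U'e' && decomposed[1] == 0x0301,
              "code points preserved as written");
static_assert(unit_count(precomposed) == 1, "single precomposed code point");
static_assert(precomposed[0] == 0x00E9, "U+00E9");
```

An implementation that encoded the sequence as a whole and composed it
would make the first literal indistinguishable from the second, which is
the behavior change in question.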


>
> Maybe replace "associated character encoding" -> "associated literal
> encoding"
> globally to avoid the mention of "character" here.
>
> "These sequences should have no effect on encoding state for stateful
> character encodings."
> -> "These sequences are assumed not to affect encoding state for stateful
> character encodings."
>
> In general, we can't use "should" (normative encouragement) in notes.
>
> Jens
> _______________________________________________
> Core mailing list
> Core_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
> Link to this post: http://lists.isocpp.org/core/2020/07/9491.php
>

Received on 2020-07-01 02:48:06