On Fri, Oct 29, 2021 at 9:52 AM Jens Maurer <Jens.Maurer@gmx.net> wrote:

On 29/10/2021 04.53, Hubert Tong via SG16 wrote:
> Thanks Corentin for the paper. I hope this feedback helps the discussion.
>
> With respect to the contents of multicharacter literals, the paper does not give much motivation for disallowing numeric escape sequences which fit within a single unsigned char. Also, the wording says "shall be a member of the basic literal character set": this property of "being" is rather ambiguous in terms of authorial intent regarding the treatment of UCNs, etc. that designate members of the basic literal character set (a name for something is usually not the same as the thing it names).

I agree that the restriction

"Each c-char in a multicharacter literal shall be a member of the basic literal character set."

The goal is to avoid 'é', which might be two code units and would hence be a multicharacter literal.
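
To illustrate the concern, here is a minimal sketch that assumes UTF-8 is the ordinary literal encoding (the wording under discussion does not mandate that). The helper name utf8_length is hypothetical; it only counts how many code units a code point needs.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of UTF-8 code units needed for a code point.
// Assumes cp is a valid Unicode scalar value (no range checking here).
constexpr std::size_t utf8_length(std::uint32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// U+0065 (e) fits in a single code unit.
static_assert(utf8_length(0x0065) == 1, "U+0065 is one code unit");
// U+00E9 (e with acute) needs two, so a literal containing only that
// character would already hold more than one code unit.
static_assert(utf8_length(0x00E9) == 2, "U+00E9 is two code units");
```

Under a single-byte encoding such as ISO 8859-1 the same character would be one code unit, which is why the outcome depends on the encoding.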

I agree that the wording is currently not satisfactory; I did not intend to disallow numeric escape sequences.

Maybe

> The sequence of characters denoted by each contiguous sequence of basic-s-chars, r-chars, simple-escape-sequences, and universal-character-names is encoded to a code unit sequence using the string-literal’s associated character encoding. If a character lacks representation in the associated character encoding, then the string-literal is ill-formed.

is sufficient. Interestingly, this sentence doesn't specify what happens if a character has a representation that is more than one code unit.

Suggestion:

- Remove "Each c-char in a multicharacter literal shall be a member of the basic literal character set."

- Change

> If a character lacks representation in the associated character encoding, then the string-literal is ill-formed.

to

> If a character lacks representation in the associated character encoding <ins>or is not representable as a single code unit</ins>, then the string-literal is ill-formed.

(This works because all combining diacritics are represented by more than one code unit in all encodings I'm aware of.)
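
As a rough sketch of how the amended sentence would classify a character, the following models an encoding as a function returning a code unit count, with 0 meaning "no representation". The names Verdict, ascii_units, utf8_units and classify are all hypothetical, and the two encodings are only assumed examples of a literal encoding.

```cpp
#include <cstddef>
#include <cstdint>

enum class Verdict { Ok, NoRepresentation, MultipleCodeUnits };

// Assumption: 7-bit ASCII as the literal encoding; 0 means "not representable".
constexpr std::size_t ascii_units(std::uint32_t cp) {
    return cp < 0x80 ? 1 : 0;
}

// Assumption: UTF-8 as the literal encoding. Surrogates and values above
// U+10FFFF have no representation.
constexpr std::size_t utf8_units(std::uint32_t cp) {
    return ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) ? 0
         : cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Mirrors the two conditions of the amended sentence: ill-formed when the
// character lacks representation or needs more than one code unit.
constexpr Verdict classify(std::size_t units) {
    return units == 0 ? Verdict::NoRepresentation
         : units > 1 ? Verdict::MultipleCodeUnits
         : Verdict::Ok;
}

// U+0040 (@) is a single code unit in both assumed encodings.
static_assert(classify(utf8_units(0x0040)) == Verdict::Ok, "@ is accepted");
// U+00E9 is unrepresentable in ASCII and two code units in UTF-8.
static_assert(classify(ascii_units(0x00E9)) == Verdict::NoRepresentation, "no ASCII form");
static_assert(classify(utf8_units(0x00E9)) == Verdict::MultipleCodeUnits, "two UTF-8 units");
// U+0301 (combining acute accent) is two code units in UTF-8.
static_assert(classify(utf8_units(0x0301)) == Verdict::MultipleCodeUnits, "two UTF-8 units");
```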

is novel and should just go.

(Why would the use of "@" or "$" be particularly bad in a multicharacter literal?)

> With respect to the new encodability restriction for strings, I believe that unevaluated strings should not be treated the same way as strings that need to be translated into a literal encoding. I think we may need to advance P2361 ("Unevaluated strings") first.

The paragraph in question starts with
"String literal objects are initialized with the sequence of code unit values..."

The existing text using /string-literal/ in various places makes it clear when
and if a string-literal is converted to a string literal object.
In that respect, I consider P2361 superfluous.

Agreed, I don't see a reason to introduce a dependency between the two papers.

P2361's only changes are in regard to numeric escape sequences and encoding prefixes.

Jens