sg16: Re: [SG16] Feedback on P1854: Conversion to literal encoding should not lead to loss of meaning

From: Corentin <corentin.jabot_at_[hidden]>
Date: Fri, 29 Oct 2021 10:13:24 +0200

On Fri, Oct 29, 2021 at 9:52 AM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:

> On 29/10/2021 04.53, Hubert Tong via SG16 wrote:
> > Thanks Corentin for the paper. I hope this feedback helps the discussion.
> >
> > With respect to the contents of multicharacter literals, the paper does
> > not give much motivation for disallowing numeric escape sequences which fit
> > within a single unsigned char. Also, the wording says "shall be a member of
> > the basic literal character set": this property of "being" is rather
> > ambiguous in terms of authorial intent regarding the treatment of UCNs,
> > etc. that designate members of the basic literal character set (a name for
> > something is usually not the same as the thing it names).
>
> I agree that the restriction
>
> "Each c-char in a multicharacter literal shall be a member of the basic
> literal character set."
>

The goal is to avoid 'é', which might be 2 code units and hence a
multicharacter literal.
I agree that the wording is currently not satisfactory; I did not intend to
disallow numeric escape sequences.
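
To make the 'é' concern concrete, here is a minimal sketch that counts the
code units of U+00E9 in the three Unicode literal encodings. The helper name
code_units is hypothetical and exists only for this illustration; the ordinary
literal encoding of a given implementation may of course differ from UTF-8.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// U+00E9 (é) needs two code units in UTF-8 ...
static_assert(code_units(u8"\u00e9") == 2, "two UTF-8 code units");
// ... but only one in UTF-16 and UTF-32.
static_assert(code_units(u"\u00e9") == 1, "one UTF-16 code unit");
static_assert(code_units(U"\u00e9") == 1, "one UTF-32 code unit");
// 'e' followed by U+0301 COMBINING ACUTE ACCENT is two characters,
// so it takes at least two code units even in UTF-32.
static_assert(code_units(U"e\u0301") == 2, "two UTF-32 code units");
```

With a UTF-8 ordinary literal encoding, the same two code units are what would
end up inside the quotes of 'é'.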

Maybe
> The sequence of characters denoted by each contiguous sequence of
> basic-s-chars, r-chars, simple-escape-sequences, and
> universal-character-names is encoded to a code unit sequence using the
> string-literal’s associated character encoding. If a character lacks
> representation in the associated character encoding, then the
> string-literal is ill-formed.

is sufficient. Interestingly, this sentence doesn't specify what happens if
a character has a representation that is more than one code unit.

Suggestion:
- Remove "Each c-char in a multicharacter literal shall be a member of the
basic literal character set."
- Change
> If a character lacks representation in the associated character
> encoding, then the string-literal is ill-formed.
To
> If a character lacks representation in the associated character
> encoding <ins>or is not representable as a single code unit</ins>, then the
> string-literal is ill-formed.

(This works because all combining diacritics are represented by more than
one code unit in all encodings I'm aware of.)
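
As a rough sketch of one reading of the "representable as a single code unit"
condition, applied to the characters of a character literal and assuming a
UTF-8 literal encoding: both function names below are hypothetical and are not
taken from the paper or from any implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: UTF-8 code units needed for one Unicode scalar value.
constexpr std::size_t utf8_length(char32_t c) {
    return c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
}

// Hypothetical check: accept the literal only if every character encodes
// as exactly one code unit (assumption: the literal encoding is UTF-8).
inline bool each_char_is_single_code_unit(const std::u32string& chars) {
    for (char32_t c : chars) {
        if (utf8_length(c) != 1) {
            return false;  // e.g. U+00E9 or U+0301 in UTF-8
        }
    }
    return true;
}
```

Under these assumptions, U"ab" passes, while both U"\u00e9" and U"e\u0301" are
rejected, the latter because the combining accent alone needs two code units.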

>
> is novel and should just go.
>
> (Why would the use of "@" or "$" be particularly bad in a multicharacter
> literal?)
>
> > With respect to the new encodability restriction for strings, I believe
> > that unevaluated strings should not be treated the same way as strings that
> > need to be translated into a literal encoding. I think we may need to
> > advance P2361 ("Unevaluated strings") first.
>
> The paragraph in question starts with
> "String literal objects are initialized with the sequence of code unit
> values..."
>
> The existing text using /string-literal/ in various places makes it clear
> when and if a string-literal is converted to a string literal object.
> Insofar, I consider P2361 superfluous.
>

Agreed, I don't see a reason to introduce a dependency between the two
papers.
P2361's only changes are in regard to numeric escape sequences and encoding
prefixes.

>
> Jens
>

Received on 2021-10-29 03:13:38