sg16: Re: [SG16] Feedback on P1854: Conversion to literal encoding should not lead to loss of meaning

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Sat, 6 Nov 2021 11:22:05 -0400

On Sat, Nov 6, 2021 at 4:17 AM Corentin <corentin.jabot_at_[hidden]> wrote:

>
>
> On Sat, Nov 6, 2021 at 3:05 AM Hubert Tong <
> hubert.reinterpretcast_at_[hidden]> wrote:
>
>> The current R2 draft has this:
>>
>>> A multicharacter literal shall not have an encoding prefix. Each
>>> character represented by a *basic-c-char* or a
>>> *universal-character-name* in a multicharacter literal shall be
>>> encodable as a single code unit in the narrow literal encoding.
>>
>>
>> The above does not provide a restriction on *conditional-escape-sequence*s
>> and *numeric-escape-sequence*s in multicharacter literals. We presumably
>> only want to allow ones that are valid as the sole *c-char* in a
>> *character-literal* with no encoding prefix. Indeed, that general
>> description may be sufficient for all forms of *c-char*.
>>
>
> Why should it?
> My only goal is to forbid multicharacter literals that are visually
> indistinguishable from single-character literals, in scenarios where
> multiple code points result in a single glyph.
>

The paper is very close to implementing a possible secondary goal of having
the number of bytes contributed by a *c-char* in a multicharacter literal
be exactly one (and also strongly hints at what the value of the
corresponding byte should be).
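
To make that secondary goal concrete, here is a minimal C++ sketch of the
packing that GCC documents for multicharacter literals: one byte per
*c-char*, with earlier characters ending up in the more significant
positions. The helper name `pack_multichar` is hypothetical, and the scheme
is an assumption about one implementation; the standard only says the value
is implementation-defined.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not taken from P1854. It assumes 8-bit bytes and the
// "shift left by one byte, then OR in the next character" evaluation that
// GCC documents for multicharacter literals. The trailing '\0' of the
// string literal is skipped.
template <std::size_t N>
constexpr std::uint32_t pack_multichar(const char (&chars)[N]) {
    std::uint32_t value = 0;
    for (std::size_t i = 0; i + 1 < N; ++i) {
        value = (value << 8) | static_cast<unsigned char>(chars[i]);
    }
    return value;
}

// Two c-chars contribute exactly two bytes, in source order.
static_assert(pack_multichar("ab") ==
                  ((static_cast<std::uint32_t>('a') << 8) | 'b'),
              "each c-char contributes exactly one byte");
```

Under this model, if every *c-char* contributes exactly one byte, the number
of *c-char*s alone tells a reader how many bytes the literal contributes.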

> Given the implementation-defined nature of multicharacter literals, I do
> not think adding further restrictions on *numeric-escape-sequence*s has any
> value in this scenario. What would be the gain / pitfall avoided by further
> restriction?
>

See above re: the achievement of a possible secondary goal. Also, you're
asking about the numeric escape sequence case, but perhaps it is more
interesting to ask about conditional escape sequences that would contribute
more than one code unit when encountered in a string in the initial shift
state?

Anyhow, if the intent really is to help only with the visual ambiguity
problem, then it would be more consistent to allow
*universal-character-name*s that encode to more than one code unit in
multicharacter literals (because the literal is already a multicharacter
literal in that case).

With a focus on the visual ambiguity problem (thanks for the reminder), the
previous wording, which limited *basic-c-char*s to the basic character set,
is more effective: many Unicode display shenanigans (combining characters,
for example) will get through the current formulation if the ordinary
literal encoding is UCS-2 or UTF-16 (which is possible if CHAR_BIT is at
least 16).
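
The UTF-16 concern can be illustrated with a small C++ sketch. The helper
names below are hypothetical, and both functions assume they are given a
valid Unicode scalar value; they only count code units and do no encoding.

```cpp
// Hypothetical helpers: number of code units needed for one code point.
// Both assume `cp` is a valid Unicode scalar value.
constexpr int utf8_code_units(char32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

constexpr int utf16_code_units(char32_t cp) {
    return cp < 0x10000 ? 1 : 2;
}

// U+0301 COMBINING ACUTE ACCENT renders on top of the preceding character.
constexpr char32_t combining_acute = 0x0301;

// In UTF-8 it needs two code units, so the "single code unit" rule
// rejects it in a multicharacter literal.
static_assert(utf8_code_units(combining_acute) == 2,
              "rejected when the literal encoding is UTF-8");

// In UTF-16 it is a single code unit, so the same rule accepts it.
static_assert(utf16_code_units(combining_acute) == 1,
              "accepted when the literal encoding is UTF-16");
```

So with a UTF-16 ordinary literal encoding, a literal such as 'e\u0301'
would satisfy the "single code unit" rule even though it displays as a
single glyph.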

>
>
>>
>> Also, the title of the paper is not particularly helpful in terms of
>> indicating what it proposes. I think something like "Support only
>> straightforward multicharacter literals and encodable string literals"
>> would be better.
>>
>> -- HT
>>
>

Received on 2021-11-06 10:22:35