On Sat, Nov 6, 2021 at 4:17 AM Corentin <corentin.jabot@gmail.com> wrote:

On Sat, Nov 6, 2021 at 3:05 AM Hubert Tong <hubert.reinterpretcast@gmail.com> wrote:
The current R2 draft has this:
A multicharacter literal shall not have an encoding prefix. Each character represented by a basic-c-char or a universal-character-name in a multicharacter literal shall be encodable as a single code unit in the narrow literal encoding.

The above does not provide a restriction on conditional-escape-sequences and numeric-escape-sequences in multicharacter literals. We presumably only want to allow ones that are valid as the sole c-char in a character-literal with no encoding prefix. Indeed, that general description may be sufficient for all forms of c-char.

Why should it?
My only goal is to forbid multicharacter literals that are visually indistinguishable from single-character literals, in scenarios where multiple code points result in a single glyph.
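
To make the ambiguity concrete, here is a minimal sketch (an illustration only, not taken from the paper). It assumes the bytes are UTF-8 and spells them with hex escapes so that the source encoding does not matter; `count_code_points` is a hypothetical helper name. The two strings typically render as the same glyph, yet one holds a single code point and the other holds two.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: counts code points in a UTF-8 byte sequence by
// counting every byte that is not a continuation byte (10xxxxxx).
constexpr std::size_t count_code_points(std::string_view utf8) {
    std::size_t n = 0;
    for (char c : utf8) {
        if ((static_cast<unsigned char>(c) & 0xC0u) != 0x80u) {
            ++n;
        }
    }
    return n;
}

// U+00E9 LATIN SMALL LETTER E WITH ACUTE, precomposed (assumed UTF-8 bytes).
constexpr std::string_view precomposed = "\xC3\xA9";
// U+0065 'e' followed by U+0301 COMBINING ACUTE ACCENT (assumed UTF-8 bytes).
constexpr std::string_view decomposed = "e\xCC\x81";

static_assert(count_code_points(precomposed) == 1, "one code point");
static_assert(count_code_points(decomposed) == 2, "two code points, one glyph");
```

Between single quotes, the second form would be a multicharacter literal that a reader could easily mistake for an ordinary character literal.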

The paper is very close to implementing a possible secondary goal of having the number of bytes contributed by a c-char in a multicharacter literal be exactly one (and also strongly hints at what the value of the corresponding byte should be).
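
For reference, a rough model of what "exactly one byte per c-char" could look like. This is a sketch under assumptions, not wording from the paper: the value of a multicharacter literal is implementation-defined, and the packing below mirrors the behaviour GCC documents for targets with 8-bit characters (shift the previous value left by one character width, then OR in the new character). `pack_multichar` is a hypothetical name, and the result is modelled as an unsigned 32-bit value for simplicity.

```cpp
#include <cstdint>
#include <string_view>

// Hypothetical model: each element of `units` stands for the single code
// unit contributed by one c-char. Assumes 8-bit characters and at most
// four units; the first c-char ends up in the most significant byte.
constexpr std::uint32_t pack_multichar(std::string_view units) {
    std::uint32_t value = 0;
    for (char u : units) {
        value = (value << 8) | static_cast<unsigned char>(u);
    }
    return value;
}

// Numeric escapes are used so the check does not depend on the literal encoding.
static_assert(pack_multichar("\x01\x02") == 0x0102u, "two c-chars, two bytes");
static_assert(pack_multichar("\x01\xFF") == 0x01FFu, "a numeric escape still contributes one byte");
```

Under this model, a c-char that needed two code units would silently shift every preceding byte one position further, which is the kind of surprise the one-byte-per-c-char reading avoids.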

Given the implementation-defined nature of multicharacter literals, I do not think adding further restrictions on numeric-escape-sequences has any value in this scenario. What would be gained, or what pitfall avoided, by further restriction?

See above re: the achievement of a possible secondary goal. Also, you're asking about the numeric escape sequence case, but perhaps it is more interesting to ask about conditional escape sequences that would contribute more than one code unit when encountered in a string in the initial shift state?

Anyhow, if the intent really is to help only with the visual ambiguity problem, then it would be more consistent to allow universal-character-names that encode to more than one code unit in multicharacter literals (because it's in a multicharacter literal already).

With a focus on the visual ambiguity problem (thanks for the reminder), the previous wording, which limited basic-c-chars to the basic character set, is more effective: lots of Unicode display shenanigans will get through the current formulation if the ordinary literal encoding is UCS-2 or UTF-16 (which is possible if CHAR_BIT is large enough).
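
A small sketch of why the encoding matters here (illustrative only; the predicate names are hypothetical and the encodings are modelled directly rather than queried from an implementation). A combining mark such as U+0301 needs two code units in UTF-8 but only one in UTF-16, so a "single code unit" rule would reject it under the former and accept it under the latter.

```cpp
// Hypothetical predicates: does a code point encode as exactly one code unit?

// UTF-8: only U+0000..U+007F are single code units.
constexpr bool is_single_utf8_unit(char32_t cp) {
    return cp <= 0x7F;
}

// UTF-16: every BMP code point other than a surrogate is a single code unit.
constexpr bool is_single_utf16_unit(char32_t cp) {
    return cp <= 0xFFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

constexpr char32_t combining_acute = 0x0301;  // U+0301 COMBINING ACUTE ACCENT

static_assert(!is_single_utf8_unit(combining_acute), "rejected when the encoding is UTF-8");
static_assert(is_single_utf16_unit(combining_acute), "accepted when the encoding is UTF-16");
```

So with a UTF-16 ordinary literal encoding, a literal consisting of `e` followed by U+0301 would satisfy the R2 restriction while still looking like a single character.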

Also, the title of the paper is not particularly helpful in terms of indicating what it proposes. I think something like "Support only straightforward multicharacter literals and encodable string literals" would be better.

-- HT