sg16: Re: [SG16] Making wide-character literals containing multiple c-char ill-formed

From: Tom Honermann <tom_at_[hidden]>
Date: Sun, 31 May 2020 23:15:04 -0400

On 5/31/20 6:44 PM, Corentin via SG16 wrote:
> Hello,
>
> L'ab' currently has an implementation-defined value.
> GCC, MSVC and Clang treat that value as equivalent to L'a' and
> emit a warning.
>
> However, consider
>
> L'é' which after phase one is represented as L'e\u00B4' (LATIN SMALL
> LETTER E, ACUTE ACCENT).
>
> The author of the code probably intends the character to be a single
> c-char.
>
>
> Therefore, I think this should be made ill-formed.
>
> Note that this is less of an issue for multi character literals as no
> combining character has a representation in any single-byte encoding
> (that I know of).
> (And multi character literals are, to my dismay, used in production
> code.)
>
> However we should probably require that each individual c-char in a
> multi character literal has a representation in the execution encoding
> or is a member of the Basic Latin block.
>
> What do you think?

[lex.ccon]p1 <http://eel.is/c++draft/lex.ccon#1> already specifies that
the ordinary (non-wide) form of these cases is conditionally-supported,
so it need not be implemented.

> ... A multicharacter literal, or an ordinary character literal
> containing a single c-char <http://eel.is/c++draft/lex.ccon#nt:c-char>
> not representable in the execution character set, is
> conditionally-supported, has type int, and has an
> implementation-defined value.
> <http://eel.is/c++draft/lex.ccon#1.sentence-4>

Similar wording doesn't exist for the wide variants, but I've taken the
perspective that the omission is an oversight. P2029
<https://wg21.link/p2029> specifies wording changes intended to clarify
that the wide variants are also conditionally-supported.

Since the wide character set is implementation-defined, I don't think we
should attempt to make L'é' ill-formed, regardless of any Unicode
normalization concerns. If the implementation-defined wide character
set is able to represent the specified character in a single code unit,
I think that is ok and that there is little motivation to break existing
code.
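
To illustrate the single-code-unit case, here is a minimal sketch. It
assumes an implementation whose wide execution encoding is UTF-16 or
UTF-32, which is common but not required by the standard. The helper
code_units and the two constants are hypothetical names used only for
this example. Universal-character-names are used instead of typing the
accented character directly, so the result does not depend on how the
source file was normalized or encoded.

```cpp
#include <cstddef>

// Hypothetical helper: number of wchar_t code units in a wide string
// literal, not counting the terminating null.
template <std::size_t N>
constexpr std::size_t code_units(const wchar_t (&)[N]) { return N - 1; }

// Precomposed form: U+00E9 is a single c-char, so this is an ordinary
// wide-character literal with exactly one c-char.
constexpr wchar_t precomposed = L'\u00E9';

// Decomposed form: 'e' followed by U+0301 (COMBINING ACUTE ACCENT) is
// two c-chars. A wide string literal is used here because a
// wide-character literal with two c-chars is exactly the case under
// discussion.
constexpr std::size_t decomposed_units = code_units(L"e\u0301");

// Both checks assume a UTF-16 or UTF-32 wide execution encoding.
static_assert(precomposed == 0xE9,
              "U+00E9 fits in one wide code unit");
static_assert(decomposed_units == 2,
              "the decomposed form needs two wide code units");
```

Whether L'é' typed directly in source ends up as the first form or the
second depends on the source file's contents and on how the
implementation maps them in phase 1, which is why the two forms are
spelled out explicitly above.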

I'm in favor of updating wording for wide character and string literals
to better specify them and to better reflect existing practice, but I'm
not in favor of investing time improving or changing their behavior
otherwise.

Tom.

Received on 2020-05-31 22:18:12