sg16: [SG16] Fwd: Making wide-character literals containing multiple c-char ill-formed

From: Corentin <corentin.jabot_at_[hidden]>
Date: Mon, 1 Jun 2020 11:13:28 +0200

---------- Forwarded message ---------
From: Corentin <corentin.jabot_at_[hidden]>
Date: Mon, 1 Jun 2020 at 11:13
Subject: Re: [SG16] Making wide-character literals containing multiple
c-char ill-formed
To: Tom Honermann <tom_at_[hidden]>

On Mon, Jun 1, 2020, 05:15 Tom Honermann <tom_at_[hidden]> wrote:

> On 5/31/20 6:44 PM, Corentin via SG16 wrote:
>
> Hello,
>
> L'ab' currently has an implementation-defined value.
> GCC, MSVC and Clang treat that value as equivalent to L'a' and emit a
> warning.
>
> However, consider
>
> L'é' which after phase one is represented as L'e\u0301' (LATIN SMALL
> LETTER E, COMBINING ACUTE ACCENT).
>
> The author of the code probably intends the character to be a single
> c-char.
>
>
> Therefore, I think this should be made ill-formed.
>
> Note that this is less of an issue for multi character literals as no
> combining character has a representation in any single-byte encoding (that
> I know of).
> (And multi character literals are, to my dismay, used in production code).
>
> However we should probably require that each individual c-char in a multi
> character literal has a representation in the execution encoding or is a
> member of the basic latin block.
>
> What do you think?
>
> [lex.ccon]p1 <http://eel.is/c++draft/lex.ccon#1> already specifies that
> the ordinary (non-wide) form of these cases is conditionally-supported, so
> need not be implemented.
>
> ... A multicharacter literal, or an ordinary character literal containing
> a single c-char <http://eel.is/c++draft/lex.ccon#nt:c-char> not
> representable in the execution character set, is conditionally-supported,
> has type int, and has an implementation-defined value.
> <http://eel.is/c++draft/lex.ccon#1.sentence-4>
>
> Similar wording doesn't exist for the wide variants, but I've taken the
> perspective that the omission is an oversight. P2029
> <https://wg21.link/p2029> specifies wording changes intended to clarify
> that the wide variants are also conditionally-supported.
>

I believe the reason the wording is different is that a wide-character
literal with multiple c-chars still has type wchar_t, so it is not a
different kind of entity; only the value is different.
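
To make the type point concrete, here is a minimal sketch. The names
ordinary_multichar_is_int, wide_single_is_wchar_t and the opt-in macro
SKETCH_WIDE_MULTICHAR are made up for illustration. The wide multi-c-char
case is kept behind the macro on the assumption that not every compiler
accepts L'ab'.

```cpp
#include <cassert>
#include <type_traits>

// The ordinary multicharacter literal below triggers a default warning on
// GCC and Clang; silence it so the sketch compiles quietly there.
#if defined(__GNUC__) || defined(__clang__)
#pragma GCC diagnostic ignored "-Wmultichar"
#endif

// An ordinary multicharacter literal changes type: 'a' is char, 'ab' is int.
constexpr bool ordinary_single_is_char =
    std::is_same<decltype('a'), char>::value;
constexpr bool ordinary_multichar_is_int =
    std::is_same<decltype('ab'), int>::value;

// A wide-character literal with a single c-char has type wchar_t.
constexpr bool wide_single_is_wchar_t =
    std::is_same<decltype(L'a'), wchar_t>::value;

// Hypothetical opt-in: only meaningful on implementations that accept a
// wide-character literal with several c-chars. There the type stays
// wchar_t and only the value is implementation-defined.
#ifdef SKETCH_WIDE_MULTICHAR
static_assert(std::is_same<decltype(L'ab'), wchar_t>::value,
              "L'ab' keeps type wchar_t");
#endif
```

So in the ordinary case the multi-c-char form is a different kind of
literal (int rather than char), while in the wide case nothing but the
value distinguishes it.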

> Since the wide character set is implementation-defined, I don't think we
> should attempt to make L'é' ill-formed, regardless of any Unicode
> normalization concerns. If the implementation-defined wide character set
> is able to represent the specified character in a single code unit, I think
> that is ok and that there is little motivation to break existing code.
>
It is already a warning in all implementations, and the value is 'e'
regardless of whether they can represent ACUTE ACCENT or LATIN SMALL LETTER
E WITH ACUTE.
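
As a rough illustration of why the two spellings of é differ, the sketch
below counts code points in each form. It assumes the decomposed form is
'e' followed by U+0301 COMBINING ACUTE ACCENT. The helpers
is_combining_diacritical and has_combining_mark are hypothetical, and they
only look at the Combining Diacritical Marks block (U+0300 to U+036F), not
at every combining character in Unicode.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: true for code points in the Combining Diacritical
// Marks block only (U+0300..U+036F). Other combining characters exist.
constexpr bool is_combining_diacritical(char32_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// Hypothetical helper: does the sequence contain such a combining mark?
inline bool has_combining_mark(const std::u32string& s) {
    for (char32_t c : s) {
        if (is_combining_diacritical(c)) {
            return true;
        }
    }
    return false;
}

// Precomposed form: a single code point, LATIN SMALL LETTER E WITH ACUTE.
const std::u32string precomposed = U"\u00E9";

// Decomposed form: two code points, which would be two c-chars if written
// between the quotes of a character literal.
const std::u32string decomposed = U"e\u0301";
```

Both strings typically render as the same glyph in an editor, but only the
precomposed one would be a single c-char, which is why the author of the
code can end up with a multi-c-char literal without noticing.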

> I'm in favor of updating wording for wide character and string literals to
> better specify them and to better reflect existing practice, but I'm not in
> favor of investing time improving or changing their behavior otherwise.
>
> Tom.
>

Received on 2020-06-01 04:16:46