Subject: Fwd: Making wide-character literals containing multiple c-char ill-formed
From: Corentin (corentin.jabot_at_[hidden])
Date: 2020-06-01 04:13:28
---------- Forwarded message ---------
From: Corentin <corentin.jabot_at_[hidden]>
Date: Mon, 1 Jun 2020 at 11:13
Subject: Re: [SG16] Making wide-character literals containing multiple
To: Tom Honermann <tom_at_[hidden]>
On Mon, Jun 1, 2020, 05:15 Tom Honermann <tom_at_[hidden]> wrote:
> On 5/31/20 6:44 PM, Corentin via SG16 wrote:
> L'ab' currently has an implementation-defined value.
> GCC, MSVC and Clang treat that value as equivalent to L'a' and emit a
> warning.
> However, consider
> L'é' which, after phase one, is represented as L'e\u00B4' (LATIN SMALL
> LETTER E, ACUTE ACCENT).
> The author of the code probably intends the character to be a single
> Therefore, I think this should be made ill-formed.
> Note that this is less of an issue for multi character literals as no
> combining character has a representation in any single-byte encoding (that
> I know of).
> (And multi character literals are, to my dismay, used in production code.)
> However we should probably require that each individual c-char in a multi
> character literal has a representation in the execution encoding or is a
> member of the basic latin block.
> What do you think?
> [lex.ccon]p1 <http://eel.is/c++draft/lex.ccon#1> already specifies that
> the ordinary (non-wide) form of these cases is conditionally-supported, so
> need not be implemented.
> ... A multicharacter literal, or an ordinary character literal containing
> a single c-char <http://eel.is/c++draft/lex.ccon#nt:c-char> not
> representable in the execution character set, is conditionally-supported,
> has type int, and has an implementation-defined value.
> Similar wording doesn't exist for the wide variants, but I've taken the
> perspective that the omission is an oversight. P2029
> <https://wg21.link/p2029> specifies wording changes intended to clarify
> that the wide variants are also conditionally-supported.
I believe the reason the wording is different is that a wide-character
literal with multiple c-chars still has type wchar_t, so it is not a
different entity; only the value is different.
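To make the type distinction concrete, here is a minimal sketch of the
ordinary-literal cases from the quoted [lex.ccon] wording, written as
compile-time checks. The variable names are mine and purely illustrative. The
wide multi-c-char case is deliberately left out, since whether it is accepted
depends on the compiler and language mode. This assumes C++17 or later.

```cpp
#include <type_traits>

// One c-char, no prefix: the literal has type char.
constexpr bool ordinary_single_is_char =
    std::is_same_v<decltype('a'), char>;

// One c-char, L prefix: the literal has type wchar_t.
constexpr bool wide_single_is_wchar =
    std::is_same_v<decltype(L'a'), wchar_t>;

// Two c-chars, no prefix: conditionally-supported, type int, and an
// implementation-defined value. Compilers typically warn on this line.
constexpr bool ordinary_multi_is_int =
    std::is_same_v<decltype('ab'), int>;

static_assert(ordinary_single_is_char);
static_assert(wide_single_is_wchar);
static_assert(ordinary_multi_is_int);
```

So the ordinary multicharacter literal changes type (char to int), whereas
the wide one does not, which may explain why only the former was singled out.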
> Since the wide character set is implementation-defined, I don't think we
> should attempt to make L'é' ill-formed, regardless of any Unicode
> normalization concerns. If the implementation-defined wide character set
> is able to represent the specified character in a single code unit, I think
> that is ok and that there is little motivation to break existing code.
It is already a warning in all implementations, and the value is 'e'
regardless of whether they can represent ACUTE ACCENT or LATIN SMALL LETTER
E WITH ACUTE.
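For reference, a small sketch of why the two spellings differ in the number
of c-chars. It uses char32_t string literals with universal-character-names
so that each array element is one code point. The helper code_points is
hypothetical, and I am assuming the decomposed spelling uses U+0301
(COMBINING ACUTE ACCENT), which is the canonical decomposition of U+00E9.

```cpp
#include <cstddef>

// Number of code points in a char32_t string literal, excluding the
// terminating null. Hypothetical helper, for illustration only.
template <std::size_t N>
constexpr std::size_t code_points(const char32_t (&)[N]) {
    return N - 1;
}

// Precomposed: LATIN SMALL LETTER E WITH ACUTE, a single code point.
constexpr std::size_t precomposed = code_points(U"\u00E9");

// Decomposed: LATIN SMALL LETTER E followed by a combining accent,
// two code points, hence two c-chars in a character literal.
constexpr std::size_t decomposed = code_points(U"e\u0301");

static_assert(precomposed == 1);
static_assert(decomposed == 2);
```

Both render identically in most editors, so the author of the source file
cannot easily tell which of the two a given literal contains.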
> I'm in favor of updating wording for wide character and string literals to
> better specify them and to better reflect existing practice, but I'm not in
> favor of investing time improving or changing their behavior otherwise.