On 6/1/20 1:28 PM, Corentin via SG16 wrote:


On Mon, Jun 1, 2020, 18:55 Tom Honermann <tom@honermann.net> wrote:
On 6/1/20 5:13 AM, Corentin via SG16 wrote:


---------- Forwarded message ---------
From: Corentin <corentin.jabot@gmail.com>
Date: Mon, 1 Jun 2020 at 11:13
Subject: Re: [SG16] Making wide-character literals containing multiple c-char ill-formed
To: Tom Honermann <tom@honermann.net>




On Mon, Jun 1, 2020, 05:15 Tom Honermann <tom@honermann.net> wrote:
On 5/31/20 6:44 PM, Corentin via SG16 wrote:
Hello, 

L'ab' currently has an implementation-defined value.
GCC, MSVC and Clang treat that value as equivalent to L'a' and emit a warning.

However,  consider

L'é' which after phase one is represented as L'e\u00B4' (LATIN SMALL LETTER E, ACUTE ACCENT).

The representation indicated here is suspicious to me.  Can we clarify the example?  I don't know if you tested L'é' as U+00E9 or { U+0065 U+0301 }.  In either case, introduction of a U+00B4 (a non-combining acute accent) would be surprising to me.  Perhaps the implementation mapped (combining) U+0301 to U+0084?

I meant U+0301; you are right.


The author of the code probably intends the character to be a single c-char.


Therefore,  I think this should be made ill-formed.
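
To make the single-c-char concern concrete, here is a minimal sketch of the two spellings of é discussed above. The variable names are illustrative only, and char32_t string literals are used instead of wide literals so that each element is exactly one code point on every platform. It assumes C++17 for constexpr std::u32string_view.

```cpp
#include <string_view>

// Precomposed form: one code point, U+00E9 LATIN SMALL LETTER E WITH ACUTE.
constexpr std::u32string_view precomposed = U"\u00E9";

// Decomposed form: U+0065 followed by U+0301 COMBINING ACUTE ACCENT.
// (Illustrative names; neither variable comes from the original message.)
constexpr std::u32string_view decomposed = U"e\u0301";

// The two spellings usually render identically, but differ in length.
static_assert(precomposed.size() == 1, "precomposed form is a single code point");
static_assert(decomposed.size() == 2, "decomposed form is two code points");
static_assert(decomposed[0] == U'e', "first code point is the base letter");
static_assert(decomposed[1] == U'\u0301', "second code point is the combining accent");
```

If the source file stores the decomposed form, a literal written as L'é' contains two c-chars even though the author sees one character.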

Note that this is less of an issue for multicharacter literals, as no combining character has a representation in any single-byte encoding (that I know of).
(And multicharacter literals are, to my dismay, used in production code.)

However, we should probably require that each individual c-char in a multicharacter literal has a representation in the execution encoding or is a member of the Basic Latin block.

What do you think?

[lex.ccon]p1 already specifies that the ordinary (non-wide) form of these cases is conditionally-supported, so need not be implemented. 

... A multicharacter literal, or an ordinary character literal containing a single c-char not representable in the execution character set, is conditionally-supported, has type int, and has an implementation-defined value.
Similar wording doesn't exist for the wide variants, but I've taken the perspective that the omission is an oversight.  P2029 specifies wording changes intended to clarify that the wide variants are also conditionally-supported.
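
A small sketch of the types involved in the quoted wording, assuming a compiler that accepts ordinary multicharacter literals (they are conditionally-supported, so this is not guaranteed). The alias names are illustrative. The value of 'ab' is implementation-defined and is deliberately not asserted. The pragma is a GCC/Clang-specific attempt to quiet the multichar warning and may have no effect on some releases.

```cpp
#include <type_traits>

#if defined(__GNUC__) || defined(__clang__)
// Assumption: the compiler honors this pragma for the multichar warning.
#pragma GCC diagnostic ignored "-Wmultichar"
#endif

// Illustrative aliases, not names from the standard or the thread.
using OrdinaryType  = decltype('a');   // ordinary character literal
using MultiCharType = decltype('ab');  // multicharacter literal
using WideType      = decltype(L'a');  // wide character literal

static_assert(std::is_same<OrdinaryType, char>::value,
              "an ordinary character literal has type char in C++");
static_assert(std::is_same<MultiCharType, int>::value,
              "a multicharacter literal has type int");
static_assert(std::is_same<WideType, wchar_t>::value,
              "a wide character literal has type wchar_t");
```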


I believe the reason the wording is different is that a wide-character literal with multiple c-chars has type wchar_t, so it's not a different entity; only the value is different.
That doesn't explain the difference with respect to these forms being conditionally-supported.  The C standard doesn't specify the ordinary/narrow variants as conditionally-supported; I think that is a C++ addition and, if so, suggests that the omission of conditionally-supported for wide variants is unintentional.  At any rate, CWG will rule on this via P2029.

C does not have the notion of conditionally-supported; they use the term implementation-defined:

> The value of a wide character constant containing more than one multibyte character or a single multibyte character that maps to multiple members of the extended execution character set, or containing a multibyte character or escape sequence not represented in the extended execution character set, is implementation-defined
 
But the intent of the wording is the same (it is valid behavior for something implementation-defined to be ill-formed).
I think our discussion via Slack got us on the same page; that implementation-defined does not permit rejection and that C lacks a conditionally-supported concept.
 

Since the wide character set is implementation-defined, I don't think we should attempt to make L'é' ill-formed, regardless of any Unicode normalization concerns.  If the implementation-defined wide character set is able to represent the specified character in a single code unit, I think that is ok and that there is little motivation to break existing code.

It is already a warning in all implementations, and the value is 'e' regardless of whether they can represent ACUTE ACCENT or LATIN SMALL LETTER E WITH ACUTE.

I'm suspicious of your claim that you checked all implementations ;)

Can you share your method of testing?

The command line for MSVC is missing /source-charset:utf-8 (and it is relevant for this example).

I see a different result.  GCC and Clang both accept that test case with the following static asserts.  It looks like they use the value of the last code unit.

    static_assert(x == wchar_t(0x0301));
    static_assert(y == wchar_t(0x0002));
    static_assert(z == wchar_t(0x0301));

MSVC and ICC both accept with the following static asserts; it looks like they use the first code unit.

    static_assert(x == wchar_t(0x0065));
    static_assert(y == wchar_t(0x0001));
    static_assert(z == wchar_t(0x0065));
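
The two sets of asserts can be summarized as two selection policies over the sequence of code units in the literal. The following is only a model of the reported results, not how any compiler is implemented. MultiCharPolicy and pick_value are hypothetical names, and the inputs (e followed by U+0301, and \1\2) are assumptions inferred from the asserted values, since the test case itself is not reproduced here. char32_t is used in place of wchar_t to keep the sketch independent of the platform's wchar_t width, and C++17 is assumed.

```cpp
#include <string_view>

// Hypothetical policy names that model the two behaviors reported above.
enum class MultiCharPolicy { FirstCodeUnit, LastCodeUnit };

// Returns the code unit a compiler following `policy` would keep.
// Assumes `units` is non-empty.
constexpr char32_t pick_value(std::u32string_view units, MultiCharPolicy policy) {
    return policy == MultiCharPolicy::FirstCodeUnit ? units.front() : units.back();
}

// Models the GCC/Clang results: the last code unit wins.
static_assert(pick_value(U"e\u0301", MultiCharPolicy::LastCodeUnit) == 0x0301, "");
static_assert(pick_value(U"\1\2", MultiCharPolicy::LastCodeUnit) == 0x0002, "");

// Models the MSVC/ICC results: the first code unit wins.
static_assert(pick_value(U"e\u0301", MultiCharPolicy::FirstCodeUnit) == 0x0065, "");
static_assert(pick_value(U"\1\2", MultiCharPolicy::FirstCodeUnit) == 0x0001, "");
```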

Tom.


Tom.

 

I'm in favor of updating wording for wide character and string literals to better specify them and to better reflect existing practice, but I'm not in favor of investing time improving or changing their behavior otherwise.

Tom.