sg16: Re: [SG16] Fwd: Making wide-character literals containing multiple c-char ill-formed

From: Tom Honermann <tom_at_[hidden]>
Date: Mon, 1 Jun 2020 15:48:44 -0400

On 6/1/20 1:28 PM, Corentin via SG16 wrote:
>
>
> On Mon, Jun 1, 2020, 18:55 Tom Honermann <tom_at_[hidden]> wrote:
>
> On 6/1/20 5:13 AM, Corentin via SG16 wrote:
>>
>>
>> ---------- Forwarded message ---------
>> From: *Corentin* <corentin.jabot_at_[hidden]>
>> Date: Mon, 1 Jun 2020 at 11:13
>> Subject: Re: [SG16] Making wide-character literals containing
>> multiple c-char ill-formed
>> To: Tom Honermann <tom_at_[hidden]>
>>
>>
>>
>>
>> On Mon, Jun 1, 2020, 05:15 Tom Honermann <tom_at_[hidden]> wrote:
>>
>> On 5/31/20 6:44 PM, Corentin via SG16 wrote:
>>> Hello,
>>>
>>> L'ab' currently has an implementation-defined value.
>>> GCC, MSVC and Clang treat that value as equivalent to
>>> L'a' and emit a warning.
>>>
>>> However, consider
>>>
>>> L'é' which after phase one is represented as L'e\u00B4'
>>> (LATIN SMALL LETTER E, ACUTE ACCENT).
>>
> The representation indicated here is suspicious to me. Can we
> clarify the example? I don't know if you tested L'é' as U+00E9 or
> { U+0065 U+0301 }. In either case, introduction of a U+00B4 (a
> non-combining acute accent) would be surprising to me. Perhaps
> the implementation mapped (combining) U+0301 to U+0084?
>
> I meant U+0301, you are right.
>
>>>
>>> The author of the code probably intends the character to be
>>> a single c-char.
>>>
>>>
>>> Therefore, I think this should be made ill-formed.
>>>
>>> Note that this is less of an issue for multicharacter
>>> literals, as no combining character has a representation in
>>> any single-byte encoding (that I know of).
>>> (And multicharacter literals are, to my dismay, used in
>>> production code.)
>>>
>>> However, we should probably require that each individual
>>> c-char in a multicharacter literal has a representation in
>>> the execution encoding or is a member of the Basic Latin block.
>>>
>>> What do you think?
>>
>> [lex.ccon]p1 <http://eel.is/c++draft/lex.ccon#1> already
>> specifies that the ordinary (non-wide) form of these cases is
>> conditionally-supported, so need not be implemented.
>>
>>> ... A multicharacter literal, or an ordinary character
>>> literal containing a single c-char not representable in the
>>> execution character set, is conditionally-supported, has type
>>> int, and has an implementation-defined value.
>> Similar wording doesn't exist for the wide variants, but I've
>> taken the perspective that the omission is an oversight.
>> P2029 <https://wg21.link/p2029> specifies wording changes
>> intended to clarify that the wide variants are also
>> conditionally-supported.
>>
>>
>>
>> I believe the reason the wording is different is that a
>> wide-character literal with multiple c-chars has type wchar_t, so
>> it's not a different entity; only the value is different.
> That doesn't explain the difference with respect to these forms
> being conditionally-supported. The C standard doesn't specify the
> ordinary/narrow variants as conditionally-supported; I think that
> is a C++ addition and, if so, suggests that the omission of
> conditionally-supported for wide variants is unintentional. At
> any rate, CWG will rule on this via P2029.
>
>
> C does not have the notion of conditionally-supported; they use the
> term implementation-defined:
>
> > The value of a wide character constant containing more than one
> multibyte character or a single multibyte character that maps to
> multiple members of the extended execution character set, or
> containing a multibyte character or escape sequence not represented in
> the extended execution character set, is implementation-defined
>
> But the intent of the wording is the same (it is valid behavior for
> something implementation-defined to be ill-formed).
I think our discussion via Slack got us on the same page:
implementation-defined does not permit rejection, and C lacks a
conditionally-supported concept.
>
>> Since the wide character set is implementation-defined, I
>> don't think we should attempt to make L'é' ill-formed,
>> regardless of any Unicode normalization concerns. If the
>> implementation-defined wide character set is able to
>> represent the specified character in a single code unit, I
>> think that is ok and that there is little motivation to break
>> existing code.
>>
>> It is already a warning in all implementations, and the value is
>> 'e' regardless of whether they can represent ACUTE ACCENT
>> or LATIN SMALL LETTER E WITH ACUTE.
>
> I'm suspicious of your claim that you checked all implementations ;)
>
> Can you share your method of testing?
>
> https://godbolt.org/z/gZm4Ic

The command line for MSVC is missing /source-charset:utf-8 (and it is
relevant for this example).

I see a different result. GCC and Clang both accept that test case with
the following static asserts. It looks like they use the value of the
last code unit.

     static_assert(x == wchar_t(0x0301));
     static_assert(y == wchar_t(0x0002));
     static_assert(z == wchar_t(0x0301));

MSVC and ICC both accept with the following static asserts; it looks
like they use the first code unit.

     static_assert(x == wchar_t(0x0065));
     static_assert(y == wchar_t(0x0001));
     static_assert(z == wchar_t(0x0065));
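
As a rough illustration of the two behaviors above (not the Godbolt test
case itself, which isn't reproduced in this message), the following
sketch models "first code unit" versus "last code unit" selection over
the decomposed sequence { U+0065 U+0301 }. The names first_code_unit,
last_code_unit and decomposed_e_acute are hypothetical, introduced only
for this example; real compilers make this choice internally when they
evaluate a wide-character literal containing multiple c-chars.

```cpp
#include <cstddef>

// Hypothetical helper: models the result observed with MSVC and ICC,
// where the literal's value is the first code unit.
template <std::size_t N>
constexpr wchar_t first_code_unit(const wchar_t (&units)[N]) {
    return units[0];
}

// Hypothetical helper: models the result observed with GCC and Clang,
// where the literal's value is the last code unit.
template <std::size_t N>
constexpr wchar_t last_code_unit(const wchar_t (&units)[N]) {
    return units[N - 1];
}

// LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT,
// i.e. the decomposed form of 'é' discussed in this thread.
constexpr wchar_t decomposed_e_acute[] = {wchar_t(0x0065), wchar_t(0x0301)};

static_assert(first_code_unit(decomposed_e_acute) == wchar_t(0x0065),
              "first-code-unit model yields U+0065");
static_assert(last_code_unit(decomposed_e_acute) == wchar_t(0x0301),
              "last-code-unit model yields U+0301");
```

Because the sketch avoids multi-c-char wide literals entirely, it should
compile regardless of how (or whether) a given implementation accepts
L'ab'-style literals.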

Tom.

>
> Tom.
>
>> I'm in favor of updating wording for wide character and
>> string literals to better specify them and to better reflect
>> existing practice, but I'm not in favor of investing time
>> improving or changing their behavior otherwise.
>>
>> Tom.
>>
>>
>>
>
>

Received on 2020-06-01 14:51:51