sg16: Re: [SG16] Fwd: Making wide-character literals containing multiple c-char ill-formed

From: Tom Honermann <tom_at_[hidden]>
Date: Mon, 1 Jun 2020 12:55:57 -0400

On 6/1/20 5:13 AM, Corentin via SG16 wrote:
>
>
> ---------- Forwarded message ---------
> From: Corentin <corentin.jabot_at_[hidden]>
> Date: Mon, 1 Jun 2020 at 11:13
> Subject: Re: [SG16] Making wide-character literals containing multiple
> c-char ill-formed
> To: Tom Honermann <tom_at_[hidden]>
>
>
>
>
> On Mon, Jun 1, 2020, 05:15 Tom Honermann <tom_at_[hidden]> wrote:
>
> On 5/31/20 6:44 PM, Corentin via SG16 wrote:
>> Hello,
>>
>> L'ab' currently has an implementation-defined value.
>> GCC, MSVC and Clang treat that value as equivalent to L'a'
>> and emit a warning.
>>
>> However, consider
>>
>> L'é' which after phase one is represented as L'e\u00B4' (LATIN
>> SMALL LETTER E, ACUTE ACCENT).
>
The representation indicated here is suspicious to me. Can we clarify
the example? I don't know if you tested L'é' as U+00E9 or { U+0065
U+0301 }. In either case, introduction of a U+00B4 (a non-combining
acute accent) would be surprising to me. Perhaps the implementation
mapped (combining) U+0301 to U+00B4?
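
To make the two candidate spellings concrete, here is a minimal C++ sketch
(not something tested in this thread; the names precomposed, decomposed and
check_forms are made up for illustration). It only shows that the precomposed
and decomposed forms of é are different code point sequences, which is why it
matters which one the source file actually contained:

```cpp
#include <cassert>
#include <string>

// Precomposed form: U+00E9 LATIN SMALL LETTER E WITH ACUTE (one code point).
const std::u32string precomposed = U"\u00E9";

// Decomposed form: U+0065 LATIN SMALL LETTER E followed by
// U+0301 COMBINING ACUTE ACCENT (two code points).
const std::u32string decomposed = U"e\u0301";

// Hypothetical helper: checks that the two spellings differ in length
// and content even though they render as the same glyph.
inline void check_forms() {
    assert(precomposed.size() == 1);
    assert(decomposed.size() == 2);
    assert(precomposed != decomposed);
}
```

If the source file held the decomposed form, a literal written as L'é' would
contain two c-chars rather than one.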

>>
>> The author of the code probably intends the character to be a
>> single c-char.
>>
>>
>> Therefore, I think this should be made ill-formed.
>>
>> Note that this is less of an issue for multi character literals
>> as no combining character has a representation in any single-byte
>> encoding (that I know of).
>> (And multi character literals are, to my dismay, used in
>> production code).
>>
>> However we should probably require that each individual c-char in
>> a multi character literal has a representation in the execution
>> encoding or is a member of the basic latin block.
>>
>> What do you think ?
>
> [lex.ccon]p1 <http://eel.is/c++draft/lex.ccon#1> already specifies
> that the ordinary (non-wide) form of these cases is
> conditionally-supported, so need not be implemented.
>
>> ... A multicharacter literal, or an ordinary character literal
>> containing a single c-char
>> <http://eel.is/c++draft/lex.ccon#nt:c-char> not representable in
>> the execution character set, is conditionally-supported, has type
>> int, and has an implementation-defined value.
>> <http://eel.is/c++draft/lex.ccon#1.sentence-4>
> Similar wording doesn't exist for the wide variants, but I've
> taken the perspective that the omission is an oversight. P2029
> <https://wg21.link/p2029> specifies wording changes intended to
> clarify that the wide variants are also conditionally-supported.
>
>
>
> I believe the reason the wording is different is that a wide-character
> literal with multiple c-chars has type wchar_t, so it's not a different
> entity; only the value is different.
That doesn't explain the difference with respect to these forms being
conditionally-supported. The C standard doesn't specify the
ordinary/narrow variants as conditionally-supported; I think that is a
C++ addition and, if so, suggests that the omission of
conditionally-supported for wide variants is unintentional. At any
rate, CWG will rule on this via P2029.
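
As a rough illustration of the type distinction under discussion (a sketch
only, assuming an implementation that accepts multicharacter literals, since
they are only conditionally-supported; compilers typically warn on 'ab', and
the constant names below are invented for this example):

```cpp
#include <type_traits>

// An ordinary character literal with a single c-char has type char.
constexpr bool ordinary_has_type_char =
    std::is_same<decltype('a'), char>::value;

// An ordinary multicharacter literal, where supported, has type int,
// so it is a different kind of entity from 'a'.
constexpr bool multichar_has_type_int =
    std::is_same<decltype('ab'), int>::value;

// A wide-character literal with a single c-char has type wchar_t.
constexpr bool wide_has_type_wchar_t =
    std::is_same<decltype(L'a'), wchar_t>::value;

static_assert(ordinary_has_type_char, "'a' should have type char");
static_assert(multichar_has_type_int, "'ab' should have type int");
static_assert(wide_has_type_wchar_t, "L'a' should have type wchar_t");
```

The wide multi-c-char form (L'ab') is deliberately left out of the sketch,
since its status is exactly what is in question here.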
>
> Since the wide character set is implementation-defined, I don't
> think we should attempt to make L'é' ill-formed, regardless of any
> Unicode normalization concerns. If the implementation-defined
> wide character set is able to represent the specified character in
> a single code unit, I think that is ok and that there is little
> motivation to break existing code.
>
> It is already a warning in all implementations and the value is 'e'
> regardless of whether they can represent ACUTE ACCENT or LATIN SMALL
> LETTER E WITH ACUTE

I'm suspicious of your claim that you checked all implementations ;)

Can you share your method of testing?

Tom.

> I'm in favor of updating wording for wide character and string
> literals to better specify them and to better reflect existing
> practice, but I'm not in favor of investing time improving or
> changing their behavior otherwise.
>
> Tom.
>
>
>

Received on 2020-06-01 11:59:05