sg16: Re: [SG16] Fwd: Making wide-character literals containing multiple c-char ill-formed

From: Corentin <corentin.jabot_at_[hidden]>
Date: Mon, 1 Jun 2020 19:28:08 +0200

On Mon, Jun 1, 2020, 18:55 Tom Honermann <tom_at_[hidden]> wrote:

> On 6/1/20 5:13 AM, Corentin via SG16 wrote:
>
>
>
> ---------- Forwarded message ---------
> From: Corentin <corentin.jabot_at_[hidden]>
> Date: Mon, 1 Jun 2020 at 11:13
> Subject: Re: [SG16] Making wide-character literals containing multiple
> c-char ill-formed
> To: Tom Honermann <tom_at_[hidden]>
>
>
>
>
> On Mon, Jun 1, 2020, 05:15 Tom Honermann <tom_at_[hidden]> wrote:
>
>> On 5/31/20 6:44 PM, Corentin via SG16 wrote:
>>
>> Hello,
>>
>> L'ab' currently has an implementation-defined value.
>> GCC, MSVC and Clang treat that value as equivalent to L'a' and emit
>> a warning.
>>
>> However, consider
>>
>> L'é' which after phase one is represented as L'e\u00B4' (LATIN SMALL
>> LETTER E, ACUTE ACCENT).
>>
> The representation indicated here is suspicious to me. Can we clarify
> the example? I don't know if you tested L'é' as U+00E9 or { U+0065 U+0301
> }. In either case, introduction of a U+00B4 (a non-combining acute accent)
> would be surprising to me. Perhaps the implementation mapped (combining)
> U+0301 to U+0084?
>
I meant U+0301, you are right.

>> The author of the code probably intends the character to be a single
>> c-char.
>>
>>
>> Therefore, I think this should be made ill-formed.
>>
>> Note that this is less of an issue for multi character literals as no
>> combining character has a representation in any single-byte encoding (that
>> I know of).
>> (And multi character literals are, to my dismay, used in production
>> code.)
>>
>> However we should probably require that each individual c-char in a multi
>> character literal has a representation in the execution encoding or is a
>> member of the basic latin block.
>>
>> What do you think?
>>
>> [lex.ccon]p1 <http://eel.is/c++draft/lex.ccon#1> already specifies that
>> the ordinary (non-wide) form of these cases is conditionally-supported, so
>> need not be implemented.
>>
>> ... A multicharacter literal, or an ordinary character literal containing
>> a single c-char <http://eel.is/c++draft/lex.ccon#nt:c-char> not
>> representable in the execution character set, is conditionally-supported,
>> has type int, and has an implementation-defined value.
>> <http://eel.is/c++draft/lex.ccon#1.sentence-4>
>>
>> Similar wording doesn't exist for the wide variants, but I've taken the
>> perspective that the omission is an oversight. P2029
>> <https://wg21.link/p2029> specifies wording changes intended to clarify
>> that the wide variants are also conditionally-supported.
>>
>
>
> I believe the reason the wording is different is that a wide character
> literal with multiple c-chars has type wchar_t, so it's not a different
> entity; only the value is different.
>
> That doesn't explain the difference with respect to these forms being
> conditionally-supported. The C standard doesn't specify the
> ordinary/narrow variants as conditionally-supported; I think that is a C++
> addition and, if so, suggests that the omission of conditionally-supported
> for wide variants is unintentional. At any rate, CWG will rule on this via
> P2029.
>

C does not have the notion of conditionally-supported; they use the term
implementation-defined:

> The value of a wide character constant containing more than one multibyte
character or a single multibyte character that maps to multiple members of
the extended execution character set, or containing a multibyte character
or escape sequence not represented in the extended execution character set,
is implementation-defined

But the intent of the wording is the same (it is valid behavior for
something implementation-defined to be ill-formed).

>
>
>> Since the wide character set is implementation-defined, I don't think we
>> should attempt to make L'é' ill-formed, regardless of any Unicode
>> normalization concerns. If the implementation-defined wide character set
>> is able to represent the specified character in a single code unit, I think
>> that is ok and that there is little motivation to break existing code.
>>
> It is already a warning in all implementations, and the value is 'e'
> regardless of whether they can represent ACUTE ACCENT or LATIN SMALL LETTER
> E WITH ACUTE.
>
> I'm suspicious of your claim that you checked all implementations ;)
>
> Can you share your method of testing?
>
https://godbolt.org/z/gZm4Ic

> Tom.
>
>
>
>> I'm in favor of updating wording for wide character and string literals
>> to better specify them and to better reflect existing practice, but I'm not
>> in favor of investing time improving or changing their behavior otherwise.
>>
>> Tom.
>>
>
>
>
>

Received on 2020-06-01 12:31:26