sg16: Re: [SG16] Fwd: Making wide-character literals containing multiple c-char ill-formed

From: Corentin <corentin.jabot_at_[hidden]>
Date: Mon, 1 Jun 2020 22:19:57 +0200

On Mon, 1 Jun 2020 at 21:48, Tom Honermann <tom_at_[hidden]> wrote:

> On 6/1/20 1:28 PM, Corentin via SG16 wrote:
>
>
>
> On Mon, Jun 1, 2020, 18:55 Tom Honermann <tom_at_[hidden]> wrote:
>
>> On 6/1/20 5:13 AM, Corentin via SG16 wrote:
>>
>>
>>
>> ---------- Forwarded message ---------
>> From: Corentin <corentin.jabot_at_[hidden]>
>> Date: Mon, 1 Jun 2020 at 11:13
>> Subject: Re: [SG16] Making wide-character literals containing multiple
>> c-char ill-formed
>> To: Tom Honermann <tom_at_[hidden]>
>>
>>
>>
>>
>> On Mon, Jun 1, 2020, 05:15 Tom Honermann <tom_at_[hidden]> wrote:
>>
>>> On 5/31/20 6:44 PM, Corentin via SG16 wrote:
>>>
>>> Hello,
>>>
>>> L'ab' currently has an implementation-defined value.
>>> GCC, MSVC and Clang treat that value as equivalent to L'a' and emit
>>> a warning.
>>>
>>> However, consider
>>>
>>> L'é' which after phase one is represented as L'e\u00B4' (LATIN SMALL
>>> LETTER E, ACUTE ACCENT).
>>>
>>> The representation indicated here is suspicious to me. Can we clarify
>> the example? I don't know if you tested L'é' as U+00E9 or { U+0065 U+0301
>> }. In either case, introduction of a U+00B4 (a non-combining acute accent)
>> would be surprising to me. Perhaps the implementation mapped (combining)
>> U+0301 to U+00B4?
>>
> I meant U+0301, you are right.
>
>
>>> The author of the code probably intends the character to be a single
>>> c-char.
>>>
>>>
>>> Therefore, I think this should be made ill-formed.
>>>
>>> Note that this is less of an issue for multi character literals as no
>>> combining character has a representation in any single-byte encoding (that
>>> I know of).
>>> (And multi character literals are, to my dismay, used in production
>>> code).
>>>
>>> However we should probably require that each individual c-char in a
>>> multi character literal has a representation in the execution encoding or
>>> is a member of the Basic Latin block.
>>>
>>> What do you think?
>>>
>>> [lex.ccon]p1 <http://eel.is/c++draft/lex.ccon#1> already specifies that
>>> the ordinary (non-wide) form of these cases is conditionally-supported, so
>>> need not be implemented.
>>>
>> ... A multicharacter literal, or an ordinary character literal containing
>>> a single c-char <http://eel.is/c++draft/lex.ccon#nt:c-char> not
>>> representable in the execution character set, is conditionally-supported,
>>> has type int, and has an implementation-defined value.
>>> <http://eel.is/c++draft/lex.ccon#1.sentence-4>
>>>
>>> Similar wording doesn't exist for the wide variants, but I've taken the
>>> perspective that the omission is an oversight. P2029
>>> <https://wg21.link/p2029> specifies wording changes intended to clarify
>>> that the wide variants are also conditionally-supported.
>>>
>>
>>
>> I believe the reason the wording is different is that a wide character
>> literal with multiple c-chars has type wchar_t, so it's not a different
>> entity; only the value is different.
>>
>> That doesn't explain the difference with respect to these forms being
>> conditionally-supported. The C standard doesn't specify the
>> ordinary/narrow variants as conditionally-supported; I think that is a C++
>> addition and, if so, suggests that the omission of conditionally-supported
>> for wide variants is unintentional. At any rate, CWG will rule on this via
>> P2029.
>>
>
> C does not have the notion of conditionally-supported; they use the term
> implementation-defined.
>
> > The value of a wide character constant containing more than one
> multibyte character or a single multibyte character that maps to multiple
> members of the extended execution character set, or containing a multibyte
> character or escape sequence not represented in the extended execution
> character set, is implementation-defined
>
> But the intent of the wording is the same (it is a valid behavior for
> something implementation-defined to be ill-formed).
>
> I think our discussion via Slack got us on the same page; that
> implementation-defined does not permit rejection and that C lacks a
> conditionally-supported concept.
>
>
>>
>>> Since the wide character set is implementation-defined, I don't think we
>>> should attempt to make L'é' ill-formed, regardless of any Unicode
>>> normalization concerns. If the implementation-defined wide character set
>>> is able to represent the specified character in a single code unit, I think
>>> that is ok and that there is little motivation to break existing code.
>>>
>> It is already a warning in all implementations and the value is 'e'
>> regardless of whether they can represent ACUTE ACCENT or LATIN SMALL LETTER
>> E WITH ACUTE.
>>
>> I'm suspicious of your claim that you checked all implementations ;)
>>
>> Can you share your method of testing?
>>
> https://godbolt.org/z/gZm4Ic
>
> The command line for MSVC is missing /source-charset:utf-8 (and it is
> relevant for this example).
>
> I see a different result. Gcc and Clang both accept that test case with
> the following static asserts. It looks like they use the value of the last
> code unit.
>
> static_assert(x == wchar_t(0x0301));
> static_assert(y == wchar_t(0x0002));
> static_assert(z == wchar_t(0x0301));
>
> MSVC and Icc both accept with the following static asserts; it looks like
> they use the first code unit.
>
> static_assert(x == wchar_t(0x0065));
> static_assert(y == wchar_t(0x0001));
> static_assert(z == wchar_t(0x0065));
>

Yes, my initial message stating that all compilers use the first code point
was wrong; they use one code point.
Now, how is that a useful behavior? How is that an expected behavior?
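
To illustrate why a decomposed é amounts to two c-chars, here is a minimal
sketch. It is not the test case from the godbolt link; the helper name
code_units is hypothetical, and the asserts assume a wide execution encoding
of UTF-16 or UTF-32 (as on GCC, Clang and MSVC). It uses wide string
literals rather than L'...' so that it does not depend on how a compiler
treats multi-c-char wide character literals.

```cpp
#include <cstddef>

// Hypothetical helper: number of wchar_t code units in a wide string
// literal, excluding the terminating null.
template <std::size_t N>
constexpr std::size_t code_units(const wchar_t (&)[N]) {
    return N - 1;
}

// Precomposed form: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
// Assumption: the wide encoding represents it in one code unit.
static_assert(code_units(L"\u00E9") == 1, "precomposed form is one code unit");

// Decomposed form: U+0065 followed by U+0301 COMBINING ACUTE ACCENT.
// Two code points, hence two c-chars if written inside L'...'.
static_assert(code_units(L"e\u0301") == 2, "decomposed form is two code units");
```

Both forms usually render identically in an editor, which is why the author
of L'é' may not realize the literal contains two c-chars.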

>
> Tom.
>
>
> Tom.
>>
>>
>>
>>> I'm in favor of updating wording for wide character and string literals
>>> to better specify them and to better reflect existing practice, but I'm not
>>> in favor of investing time improving or changing their behavior otherwise.
>>>
>>> Tom.
>>>
>>
>>
>>
>>
>
>

Received on 2020-06-01 15:23:15