Re: [SG16] Feedback on P1854: Conversion to literal encoding should not lead to loss of meaning

From: Corentin <corentin.jabot_at_[hidden]>
Date: Mon, 8 Nov 2021 18:05:29 +0100
On Sun, Nov 7, 2021 at 6:31 PM Hubert Tong <hubert.reinterpretcast_at_[hidden]>
wrote:

> On Sun, Nov 7, 2021 at 10:58 AM Corentin <corentin.jabot_at_[hidden]> wrote:
>
>>
>>
>> On Sun, Nov 7, 2021 at 4:25 PM Hubert Tong <
>> hubert.reinterpretcast_at_[hidden]> wrote:
>>
>>> On Sun, Nov 7, 2021 at 8:55 AM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>>>
>>>> On 06/11/2021 23.21, Hubert Tong wrote:
>>>> > On Sat, Nov 6, 2021 at 4:07 PM Jens Maurer <Jens.Maurer_at_[hidden]
>>>> <mailto:Jens.Maurer_at_[hidden]>> wrote:
>>>> >
>>>> > On 06/11/2021 16.22, Hubert Tong via SG16 wrote:
>>>> > > Anyhow, if the intent really is to help only with the visual
>>>> ambiguity problem, then it would be more consistent to allow
>>>> /universal-character-name/s that encode to more than one code unit in
>>>> multicharacter literals (because it's in a multicharacter literal already).
>>>> >
>>>> > If we use a UCN, we have no source code visual ambiguity
>>>> > (because a UCN is expressed in basic characters).
>>>> > Is that a correct understanding of the situation / motivation?
>>>> >
>>>> >
>>>> > Yes.
>>>> >
>>>> >
>>>> > I can't connect your parenthetical remark to that.
>>>> >
>>>> >
>>>> > The UCN does not itself contribute to the visual ambiguity of the
>>>> character literal as being a single /c-char/.
>>>> >
>>>> >
>>>> >
>>>> > > With a focus on the visual ambiguity problem (thanks for
>>>> reminding), the previous wording to limit /basic-c-char/s to the basic
>>>> character set is more capable because lots of Unicode display shenanigans
>>>> will get through the current formulation if the ordinary literal encoding
>>>> is UCS-2 or UTF-16 (which is possible if CHAR_BIT is large enough).
>>>> >
>>>> > Do we have sufficient implementation experience / understanding of
>>>> > existing practice to estimate how much code will break if we
>>>> > restrict multi-character literals to the basic character set?
>>>> > (Note that neither @ nor $ is in the basic character set.)
>>>> >
>>>> > (I'm all for restricting multi-character literals as much as
>>>> possible,
>>>> > but we should probably avoid stepping on people's toes for
>>>> non-portable
>>>> > features that don't really hurt anyone.)
>>>> >
>>>> >
>>>> > We could just restrict "problematic" Unicode characters?
>>>>
>>>> Those are ones that take more than one code unit, I presume?
>>>>
>>>
>>> I meant the ones that don't display. After all, the code units may be
>>> the ones of the UTF-16 or UTF-32 encoding form.
>>>
>>
>> I'm not sure how much we care about this scenario.
>>
>> The options are:
>>
>> * Restrict to the basic character set - simple, but excludes $ and @. I
>> don't know how to measure the impact of that. I expect it to be a
>> non-issue, but I don't have data, and I don't know if people would find
>> that palatable.
>> * Restrict to U+0000-U+007F
>>
>
> I'm inclined to favour a combination of this one with the one below. That
> is, disallow $, `, and @ when the encoding does not have them as a single
> code unit in the initial shift state.
>

Something like this?

If a multicharacter literal contains a basic-c-char or a
universal-character-name representing a code point that is either outside of
the range U+0000-U+007F or is not encodable as a single code unit in the
ordinary literal encoding, the program is ill-formed.

We do not need to say anything about escape sequences (at least I did not
intend to do anything specific about conditionally-supported escape
sequences, which are already implementation-defined).
I am not sure mentioning the shift state adds any useful information, as we
do not specify (nor want to specify) how these things are encoded.
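
To make the two conditions in the wording above concrete, here is a minimal
sketch of the check, not proposed wording and not a real compiler API. The
three encoders are assumptions for illustration only: UTF-8 with 8-bit code
units, a hypothetical UTF-16 ordinary literal encoding (CHAR_BIT >= 16), and
a made-up encoding in which $, @ and ` are not available as single code
units. All names are hypothetical.

```cpp
// Hypothetical model of an ordinary literal encoding: returns the number of
// code units needed for a code point, or 0 if it cannot be encoded.
using CodeUnitCount = int (*)(char32_t);

constexpr bool is_scalar_value(char32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Assumption: UTF-8 with 8-bit code units.
constexpr int utf8_code_units(char32_t cp) {
    if (!is_scalar_value(cp)) return 0;
    if (cp <= 0x7F) return 1;
    if (cp <= 0x7FF) return 2;
    if (cp <= 0xFFFF) return 3;
    return 4;
}

// Assumption: UTF-16 as the ordinary literal encoding (needs CHAR_BIT >= 16).
constexpr int utf16_code_units(char32_t cp) {
    if (!is_scalar_value(cp)) return 0;
    return cp <= 0xFFFF ? 1 : 2;
}

// Assumption: a made-up encoding where $ (U+0024), @ (U+0040) and
// ` (U+0060) are not single code units; everything else in ASCII is.
constexpr int variant_code_units(char32_t cp) {
    if (cp == 0x24 || cp == 0x40 || cp == 0x60) return 0;
    return cp <= 0x7F ? 1 : 0;
}

// The rule as drafted: a basic-c-char or universal-character-name is
// acceptable in a multicharacter literal only if its code point is in
// U+0000-U+007F and encodes as exactly one code unit.
constexpr bool allowed_in_multichar(char32_t cp, CodeUnitCount units) {
    return cp <= 0x7F && units(cp) == 1;
}
```

Under these assumptions, U+00E9 is rejected for UTF-8 by the single-code-unit
condition and for the hypothetical UTF-16 case by the range condition, while
@ is accepted for UTF-8 and rejected for the made-up variant encoding.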




>
>
>> * Restrict to characters encodable as a single code unit. This indeed
>> doesn't work well on platforms where CHAR_BIT != 8, which isn't
>> really something I'm deeply concerned about.
>>
>> I think anything else is over-engineered, as it would only be somewhat
>> relevant to platforms where CHAR_BIT != 8 and the narrow encoding is
>> UTF-16/32 (we know of no such environment).
>> It would involve banning combining characters, ZWJ, and probably many
>> others (potentially doing grapheme clusterization and making graphemes of
>> size != 1 ill-formed).
>>
>> All 3 of the simple solutions seem satisfactory to me, as they achieve
>> the same goal (preventing accidental creation of a multicharacter literal)
>>
>>
>>
>>
>>
>>
>>
>>
>>>
>>>
>>>>
>>>> Jens
>>>>
>>>>

Received on 2021-11-08 11:05:42