sg16: Re: [SG16] Feedback on P1854: Conversion to literal encoding should not lead to loss of meaning

From: Corentin <corentin.jabot_at_[hidden]>
Date: Sun, 7 Nov 2021 16:58:43 +0100

On Sun, Nov 7, 2021 at 4:25 PM Hubert Tong <hubert.reinterpretcast_at_[hidden]>
wrote:

> On Sun, Nov 7, 2021 at 8:55 AM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>
>> On 06/11/2021 23.21, Hubert Tong wrote:
>> > On Sat, Nov 6, 2021 at 4:07 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>> >
>> > On 06/11/2021 16.22, Hubert Tong via SG16 wrote:
>> > > Anyhow, if the intent really is to help only with the visual
>> > > ambiguity problem, then it would be more consistent to allow
>> > > /universal-character-name/s that encode to more than one code unit in
>> > > multicharacter literals (because it's in a multicharacter literal already).
>> >
>> > If we use a UCN, we have no source code visual ambiguity
>> > (because a UCN is expressed in basic characters).
>> > Is that a correct understanding of the situation / motivation?
>> >
>> >
>> > Yes.
>> >
>> >
>> > I can't connect your parenthetical remark to that.
>> >
>> >
>> > The UCN does not itself contribute to the visual ambiguity of the
>> > character literal as being a single /c-char/.
>> >
>> >
>> >
>> > > With a focus on the visual ambiguity problem (thanks for
>> > > reminding), the previous wording to limit /basic-c-char/s to the basic
>> > > character set is more capable because lots of Unicode display shenanigans
>> > > will get through the current formulation if the ordinary literal encoding
>> > > is UCS-2 or UTF-16 (which is possible if CHAR_BIT is large enough).
>> >
>> > Do we have sufficient implementation experience / understanding of
>> > existing practice to estimate how much code will break if we
>> > restrict multi-character literals to the basic character set?
>> > (Note that neither @ nor $ is in the basic character set.)
>> >
>> > (I'm all for restricting multi-character literals as much as possible,
>> > but we should probably avoid stepping on people's toes for non-portable
>> > features that don't really hurt anyone.)
>> >
>> >
>> > We could just restrict "problematic" Unicode characters?
>>
>> Those are ones that take more than one code unit, I presume?
>>
>
> I meant the ones that don't display. After all, the code units may be the
> ones of the UTF-16 or UTF-32 encoding form.
>

I'm not sure how much we care about this scenario.

The options are:

* Restrict to the basic character set: simple, but it excludes $ and @. I
don't know how to measure the impact of that. I expect it to be a non-issue,
but I don't have data, and I don't know whether people would find that
palatable.
* Restrict to U+0000-U+007F.
* Restrict to characters encodable as a single code unit. That indeed
doesn't work well on platforms where CHAR_BIT != 8, which isn't
something I'm deeply concerned about.

I think anything else is over-engineered, as it would only be somewhat
relevant to platforms where CHAR_BIT != 8 and the narrow encoding is
UTF-16 or UTF-32 (we know of no such environment).
It would involve banning combining characters, ZWJ, and probably many others
(potentially doing grapheme clustering and making grapheme clusters of
size != 1 ill-formed).

All three of the simple solutions seem satisfactory to me, as they achieve
the same goal (preventing accidental creation of a multicharacter literal).
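
To make the difference between the three simple options concrete, here is a
rough sketch of each one as a predicate on a code point. The function names
are made up for illustration and are not proposed wording. It assumes the
C++20 basic source character set (96 characters, without $ and @), and for
the third option it assumes the ordinary literal encoding is UTF-8 with
8-bit code units and that the input is a valid Unicode scalar value.

```cpp
#include <string_view>

// Option 1 (sketch): membership in the basic character set.
// Assumption: the C++20 basic source character set, which has 96
// characters and contains neither '$' nor '@'.
constexpr bool in_basic_character_set(char32_t c) {
    constexpr std::u32string_view basic =
        U" \t\v\f\n"
        U"abcdefghijklmnopqrstuvwxyz"
        U"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        U"0123456789"
        U"_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'";
    return basic.find(c) != std::u32string_view::npos;
}

// Option 2 (sketch): restrict to U+0000-U+007F.
constexpr bool in_ascii_range(char32_t c) {
    return c <= 0x7F;
}

// Number of UTF-8 code units needed for a code point.
// Assumption: c is a valid Unicode scalar value (no surrogate or
// out-of-range check is done here).
constexpr int utf8_code_units(char32_t c) {
    if (c <= 0x7F)   return 1;
    if (c <= 0x7FF)  return 2;
    if (c <= 0xFFFF) return 3;
    return 4;
}

// Option 3 (sketch): encodable as a single code unit.
// Assumption: the ordinary literal encoding is UTF-8 and CHAR_BIT == 8.
constexpr bool is_single_code_unit_utf8(char32_t c) {
    return utf8_code_units(c) == 1;
}
```

Under these assumptions the second and third options accept exactly the same
code points, and both accept $ and @, while the first rejects them. The
options only start to diverge if the narrow encoding has wider code units,
which is the CHAR_BIT != 8 case discussed above.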

>
>
>>
>> Jens
>>
>>

Received on 2021-11-07 09:58:55