sg16: Re: [SG16] P2194R0 The character set of C++ source code is Unicode

From: Alisdair Meredith <alisdairm_at_[hidden]>
Date: Mon, 24 Aug 2020 16:38:24 -0400

> On Aug 24, 2020, at 16:23, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>
> On 24/08/2020 21.44, Alisdair Meredith via SG16 wrote:
>> Got another good corner case for you!
>>
>> In the template form of user defined literals, the template parameter pack
>> is instantiated with characters corresponding to the source text, currently
>> mapping non-basic characters to UCNs, so that the template parser can
>> assume all characters are members of the basic source character set:
>>
>> See [lex.ext] 5.13.8p3/4
>>
>> By no longer mapping to UCNs, we break any UDL parsers that work with
>> UCNs today. I don’t know how many there are in production, possibly zero,
>> but it is a risk to address, and it would need an entry in compatibility Annex C.
>
> UCNs may only be introduced for characters not in the basic source
> character set. Could you please point out which of the characters allowed
> in a user-defined-integer-literal or user-defined-floating-point-literal
> are not in the basic source character set?

I cannot find the part of the spec that restricts the contents of the token
passed to a numeric literal operator to some subset of characters that is
meaningful to the existing parsers built into the language - only that the
eventual result must be either an appropriate integral or floating point
type.

While I have no examples of users doing this in the wild, I see nothing
in the current spec that forbids such things. For example, base36 literals
will meaningfully parse all 26 letters in addition to the 10 digits. Why
could this not be extended (common sense aside) to use extended
characters that map to UCNs in phase 1?
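
To make the concern concrete, here is a minimal sketch of a literal
operator template that interprets the characters of the literal's spelling
as base-36 digits. The names `_b36` and `digit_value` are hypothetical and
invented for illustration. The sketch assumes C++14 or later and an
execution character set in which the letters are contiguous (as in ASCII).
The spellings used in the checks are ones that already lex as ordinary
integer literals followed by a ud-suffix.

```cpp
// Hypothetical helper: numeric value of one base-36 digit, or -1 if the
// character is not a base-36 digit. Assumes the letters are contiguous in
// the execution character set (true for ASCII, not for EBCDIC).
constexpr int digit_value(char c) {
    return (c >= '0' && c <= '9') ? c - '0'
         : (c >= 'a' && c <= 'z') ? c - 'a' + 10
         : (c >= 'A' && c <= 'Z') ? c - 'A' + 10
         : -1;
}

// Hypothetical literal operator template: Cs... holds the characters of the
// literal's spelling, without the ud-suffix.
template <char... Cs>
constexpr unsigned long long operator""_b36() {
    constexpr char text[] = {Cs...};
    unsigned long long value = 0;
    for (char c : text) {
        const int d = digit_value(c);
        // Characters that are not base-36 digits are skipped.
        if (d >= 0) value = value * 36 + static_cast<unsigned long long>(d);
    }
    return value;
}

// 1*36*36 + 2*36 + 3
static_assert(123_b36 == 1371ULL, "decimal spelling read as base 36");
// Digits 0, x (33), 1, F (15): ((0*36 + 33)*36 + 1)*36 + 15
static_assert(0x1F_b36 == 42819ULL, "hex spelling read as base 36");
```

A parser of this shape only ever compares against characters it expects,
so what it sees for a non-basic character (a UCN spelling versus the
character itself) is exactly the behaviour that would change.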

AlisdairM

Received on 2020-08-24 15:41:53