sg16: Re: [SG16] Unicode as the basic compiler character set

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Wed, 27 Jan 2021 09:51:14 +0100

On 27/01/2021 09.20, Jens Maurer via SG16 wrote:
> On 27/01/2021 04.53, Hubert Tong wrote:
>> On Tue, Jan 26, 2021 at 5:29 PM Jens Maurer via SG16 <sg16_at_[hidden]> wrote:
>
>> UCNs are translated eagerly outside of literals, but are kept
>> until phase 7 for literals.
>>
>> I am not sure we discussed stringization behaviour of tokens under this model. We no longer have the "which UCN shows up" problem. We have the "should the UCN stick around" problem.

(re-doing the example)

Reading [cpp.stringize], it seems the expected stringization
of the source-code token

K\u00f6ppe

would be

"K\\u00f6ppe"

and that would be the same if instead the actual Unicode character
for U+00F6 appeared in the source code under the C++20 rules
(because of early replacement of U+00F6 with ASCII-only \u00f6).

But that's not what's actually happening in the real world.
It seems lots of compilers produce a variation of "name<U+00F6>",
where <U+00F6> may be encoded in ISO 8859-1 or in UTF-8.
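
For anyone who wants to check their own compiler, a minimal sketch follows. The names STRINGIZE, stringized_name and ucn_spelled_out are made up for illustration, and the result is expected to vary with the compiler and its execution character set, so no particular output is claimed here.

```cpp
#include <cstring>

// Hypothetical macro name; any function-like macro using # would do.
#define STRINGIZE(x) #x

// The identifier token K\u00f6ppe is only stringized, never declared or used.
inline const char* stringized_name() { return STRINGIZE(K\u00f6ppe); }

// True if the result is the spelled-out form, i.e. the characters
// K \ u 0 0 f 6 p p e, rather than an encoded U+00F6.
inline bool ucn_spelled_out(const char* s) {
    return std::strcmp(s, "K\\u00f6ppe") == 0;
}
```

Printing stringized_name() byte by byte shows whether U+00F6 arrived as a single ISO 8859-1 byte, as a two-byte UTF-8 sequence, or as the spelled-out \u00f6.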

In order to retain the status quo, it seems we do want to keep
early replacement of UCNs outside of literals.

Hubert, any thoughts?

Jens

Received on 2021-01-27 02:51:18