Subject: Re: Unicode as the basic compiler character set
From: Jens Maurer (Jens.Maurer_at_[hidden])
Date: 2021-01-27 02:51:14

On 27/01/2021 09.20, Jens Maurer via SG16 wrote:
> On 27/01/2021 04.53, Hubert Tong wrote:
>> On Tue, Jan 26, 2021 at 5:29 PM Jens Maurer via SG16 <sg16_at_[hidden]> wrote:
>> UCNs are translated eagerly outside of literals, but are kept
>> until phase 7 for literals.
>> I am not sure we discussed stringization behaviour of tokens under this model. We no longer have the "which UCN shows up" problem. We have the "should the UCN stick around" problem.

(re-doing the example)

Reading [cpp.stringize], it seems the expected stringization
of the source-code token


would be


and that would be the same if instead the actual Unicode character
for U+00F6 appeared in the source code under the C++20 rules
(because of early replacement of U+00F6 with ASCII-only \u00f6).

But that's not what's actually happening in the real world.
It seems lots of compilers produce a variation of "name<U+00F6>",
where <U+00F6> may be encoded in ISO 8859-1 or in UTF-8.
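
One way to observe what a given compiler does is to stringize an identifier spelled with a UCN and dump the resulting bytes. The following is only a sketch: the names STRINGIZE, stringized, dump_bytes and inspect are hypothetical and not from this thread, and the bytes after the ASCII prefix depend on the implementation and its execution character set, so no particular output is claimed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Hypothetical helper macro: applies the # operator directly to its argument.
#define STRINGIZE(x) #x

// The identifier token is spelled with a UCN in the source code.
static const char stringized[] = STRINGIZE(name\u00f6);

// Print every byte in hex so the encoding chosen by the compiler is visible.
inline void dump_bytes(const char* s) {
    for (std::size_t i = 0; s[i] != '\0'; ++i) {
        std::printf("%02x ", static_cast<unsigned>(static_cast<unsigned char>(s[i])));
    }
    std::printf("\n");
}

// Check the ASCII prefix, dump the bytes, and return the length in bytes.
// Assumption: the "name" prefix is expected to be preserved as-is; only the
// representation of U+00F6 that follows it varies between implementations.
inline std::size_t inspect(const char* s) {
    assert(std::strncmp(s, "name", 4) == 0);
    dump_bytes(s);
    return std::strlen(s);
}
```

Calling inspect(stringized) from main prints the bytes of the stringized token; comparing the output across compilers and execution character sets shows which representation of U+00F6 ends up in the string.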

In order to retain the status quo, it seems we do want to keep
early replacement of UCNs outside of literals.

Hubert, any thoughts?


SG16 list run by sg16-owner@lists.isocpp.org