SG16

Subject: Re: Unicode as the basic compiler character set
From: Hubert Tong (hubert.reinterpretcast_at_[hidden])
Date: 2021-01-30 14:48:53


On Wed, Jan 27, 2021 at 3:51 AM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:

> On 27/01/2021 09.20, Jens Maurer via SG16 wrote:
> > On 27/01/2021 04.53, Hubert Tong wrote:
> >> On Tue, Jan 26, 2021 at 5:29 PM Jens Maurer via SG16 <sg16_at_[hidden]> wrote:
> >
> >> UCNs are translated eagerly outside of literals, but are kept
> >> until phase 7 for literals.
> >>
> >> I am not sure we discussed stringization behaviour of tokens under this
> >> model. We no longer have the "which UCN shows up" problem. We have the
> >> "should the UCN stick around" problem.
>
> (re-doing the example)
>
> Reading [cpp.stringize], it seems the expected stringization
> of the source-code token
>
> K\u00f6ppe
>
> would be
>
> "K\\u00f6ppe"
>
> and that would be the same if instead the actual Unicode character
> for U+00F6 appeared in the source code under the C++20 rules
> (because of early replacement of U+00F6 with ASCII-only \u00f6).
>
> But that's not what's actually happening in the real world.
> It seems lots of compilers produce a variation of "name<U+00F6>",
> where <U+00F6> may be encoded in ISO 8859-1 or in UTF-8.
>
> In order to retain the status quo, it seems we do want to keep
> early replacement of UCNs outside of literals.
>
> Hubert, any thoughts?
>
I think the early replacement is the right choice given a desire that
reasonable code maintains its meaning when UCN-ified.

>
> Jens
>


