sg16: Re: [SG16] Just realized that the UCN is not a single character

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Thu, 9 Apr 2020 08:17:53 +0200

On 09/04/2020 00.45, Steve Downey wrote:
> In phase 1 the bare accent is translated to a universal-character-name, so I don't think there is a detectable difference, at least that way, between #define accent(x) x ## \u0300 and #define accent(x) x ## `
> Now, in phase 3 we are converting to preprocessor-tokens and whitespace. So we get {x}{white-space}{##}{white-space} and then I think {\u0300}, not {\}{u}{0}{3}{0}{0}, because \u0300 is a "each non-white-space character that cannot be one of the above"

Stop right here.

There is no single character \u0300 here, there are six characters
spelled \ u 0 3 0 0 from a (low-level) lexer perspective.

Any interpretation of universal-character-names is done when
transitioning to a (slightly) higher level of abstraction.

Jens
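
[Illustration] The "six characters" reading above can be sketched with a toy
lexer. This is only a hypothetical model, not how any real compiler is
implemented: the function name naive_pp_tokens and its token rules are
assumptions made for this example. It gives universal-character-names no
special treatment, so the backslash falls out as a lone character and the
rest lexes as an identifier.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical toy lexer (an assumption for illustration only).
// It knows three things: identifier-like runs, the ## operator, and
// "any other single non-white-space character". It has no notion of a
// universal-character-name.
std::vector<std::string> naive_pp_tokens(std::string_view text) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < text.size()) {
        const unsigned char c = static_cast<unsigned char>(text[i]);
        if (std::isspace(c)) {
            ++i;  // white-space separates tokens and is dropped here
        } else if (std::isalpha(c) || c == '_') {
            // Identifier-like run: letter or underscore, then letters,
            // digits, or underscores.
            const std::size_t start = i;
            while (i < text.size() &&
                   (std::isalnum(static_cast<unsigned char>(text[i])) ||
                    text[i] == '_')) {
                ++i;
            }
            tokens.push_back(std::string(text.substr(start, i - start)));
        } else if (c == '#' && i + 1 < text.size() && text[i + 1] == '#') {
            tokens.push_back("##");
            i += 2;
        } else {
            // Any other non-white-space character becomes its own token.
            tokens.push_back(std::string(1, text[i]));
            ++i;
        }
    }
    return tokens;
}
```

Given the replacement list  x ## \u0300  (written "x ## \\u0300" as a C++
string literal), this sketch yields four tokens: x, ##, \ and u0300. A lexer
that recognized the UCN as a unit would instead yield three.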

> on the theory that if it weren't a combining character but instead something valid, like \u00C0 (À), we would expect the result to be the two characters pasted together? That is, a 'universal-character-name' names a character.
>
> This is one of the reasons to eventually get back to my paper trying to clean up 'character' and related terms, because they turn out to be much more complex and vague than expected. It needs substantive revision, though, because I had interpreted some of the ambiguous parts differently from how they seem to be intended.
>
> On Wed, Apr 8, 2020 at 6:00 PM Jens Maurer via SG16 <sg16_at_[hidden]> wrote:
>
> On 08/04/2020 23.49, Zach Laine via SG16 wrote:
> > On Wed, Apr 8, 2020 at 4:38 PM Jens Maurer via SG16 <sg16_at_[hidden]> wrote:
> >
> > On 08/04/2020 23.27, Hubert Tong via SG16 wrote:
> > > Seems GCC is right again...
> > >
> > > \u0300, whether the result of forming a UCN or physically present as a UCN, is a string of six characters from the basic source character set...
> >
> > So, it's six characters, so
> >
> > #define accent(x) x ## \u0300
> >
> > becomes
> >
> >
> > #define accent(x) x ## \ u0300
> >
> > with \ a lone character and u0300 a separate preprocessing-token / identifier.
> >
> > Disturbing UCNs like that is really counter-intuitive.
> > We should fix that.
> >
> > Jens
> >
> >
> > Does that also imply that this:
> >
> > #define accent(x) x ## \u0300
> >
> > and this:
> >
> > #define accent(x) x ## `
> >
> > are not equivalent? If so, I'm even more disturbed.
>
> Yes. Be disturbed.
>
> Jens
>

Received on 2020-04-09 01:20:50