sg16: Re: [SG16] Just realized that the UCN is not a single character

From: Steve Downey <sdowney_at_[hidden]>
Date: Thu, 9 Apr 2020 09:42:28 -0400

I realized this late last night. The only characters the lexer sees are
elements of the basic source character set. Any character outside the
basic source character set has to be described as "a
universal-character-name designating a character" or a similar phrase. And
now I understand the reason for that circumlocution.
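
To make the two readings concrete, here is a toy sketch of the low-level
view. It is not the standard's lexer: the helper names ucn_length_at and
split_pp_tokens are hypothetical, and the ucn_is_unit flag is only an
assumption used to switch between "six separate characters" and "one
universal-character-name".

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Toy model only. Returns the length of a \uXXXX or \UXXXXXXXX sequence
// starting at s[i], or 0 if there is none.
inline std::size_t ucn_length_at(const std::string& s, std::size_t i) {
    if (i + 1 >= s.size() || s[i] != '\\') return 0;
    const std::size_t digits = s[i + 1] == 'u' ? 4 : s[i + 1] == 'U' ? 8 : 0;
    if (digits == 0 || i + 2 + digits > s.size()) return 0;
    for (std::size_t k = 0; k < digits; ++k)
        if (!std::isxdigit(static_cast<unsigned char>(s[i + 2 + k]))) return 0;
    return 2 + digits;
}

// Splits a line into preprocessing-token-like pieces.
// ucn_is_unit == false: '\' falls through as a lone character and the rest
//                       is lexed as an ordinary identifier-like run.
// ucn_is_unit == true:  the whole \uXXXX sequence is kept as one piece.
inline std::vector<std::string> split_pp_tokens(const std::string& s,
                                                bool ucn_is_unit) {
    std::vector<std::string> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (std::isspace(c)) { ++i; continue; }
        if (ucn_is_unit) {
            if (const std::size_t n = ucn_length_at(s, i)) {
                out.push_back(s.substr(i, n));
                i += n;
                continue;
            }
        }
        if (std::isalnum(c) || c == '_') {
            // Identifier-like run (the toy does not separate pp-numbers).
            std::size_t j = i;
            while (j < s.size() &&
                   (std::isalnum(static_cast<unsigned char>(s[j])) || s[j] == '_'))
                ++j;
            out.push_back(s.substr(i, j - i));
            i = j;
            continue;
        }
        if (c == '#' && i + 1 < s.size() && s[i + 1] == '#') {
            out.push_back("##");
            i += 2;
            continue;
        }
        // "each non-white-space character that cannot be one of the above"
        out.push_back(std::string(1, s[i]));
        ++i;
    }
    return out;
}
```

Calling split_pp_tokens("x ## \\u0300", false) models the reading where
\ is a lone character followed by u0300, while passing true keeps \u0300
together as a single piece.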

Elements of the execution character set are not seen by the grammar, but are
effectively terminals in later phases of translation, such as the conversion
of character and string literals.
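
A small sketch of that later conversion, assuming a C++11 or newer
compiler. The constant names are made up for illustration. Inside ordinary
literals the universal-character-name designates one character, whereas in
a raw string literal the same spelling stays as six basic source
characters.

```cpp
#include <cstddef>

// In a char32_t literal, \u00C0 designates the single code point U+00C0.
static_assert(U'\u00C0' == 0x00C0, "one char32_t holding U+00C0");

// In a UTF-8 string literal, U+00C0 is converted to two code units
// (0xC3 0x80); sizeof also counts the terminating null.
constexpr std::size_t utf8_units = sizeof(u8"\u00C0") - 1;

// In a raw string literal no universal-character-name is interpreted, so
// the six characters \ u 0 3 0 0 are stored as written.
constexpr std::size_t raw_chars = sizeof(R"(\u0300)") - 1;

static_assert(utf8_units == 2, "U+00C0 takes two UTF-8 code units");
static_assert(raw_chars == 6, "raw string keeps the six-character spelling");
```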

On Thu, Apr 9, 2020, 02:17 Jens Maurer <Jens.Maurer_at_[hidden]> wrote:

> On 09/04/2020 00.45, Steve Downey wrote:
> > Phase 1 the bare accent is translated to universal-character-name, so I
> don't think there is a detectable difference, at least that way, for
> #define accent(x) x ## \u0300 vs #define accent(x) x ## `
> > Now, in phase 3 we are converting to preprocessor-tokens and whitespace.
> So we get {x}{white-space}{##}{white-space} and then I think {\u0300}, not
> {\}{u}{0}{3}{0}{0}, because \u0300 is a "each non-white-space character
> that cannot be one of the above"
>
> Stop right here.
>
> There is no single character \u0300 here, there are six characters
> spelled \ u 0 3 0 0 from a (low-level) lexer perspective.
>
> Any interpretation of universal-character-names is done when
> transitioning to a (slightly) higher level of abstraction.
>
> Jens
>
>
> on the theory that if it wasn't a combining character, but instead
> something valid, like \u00C0, À, we would expect the result to be the two
> characters pasted together? That is a 'universal-character-name' names a
> character.
> >
> > One of the reasons to eventually get back to my paper trying to clean up
> 'character' and related terms, because they turn out to be much more
> complex and vague than expected. It needs substantive revision, though,
> because some parts that are a bit ambiguous I had interpreted other than
> how it seems to be intended to be interpreted.
> >
> > On Wed, Apr 8, 2020 at 6:00 PM Jens Maurer via SG16 <sg16_at_[hidden]> wrote:
> >
> > On 08/04/2020 23.49, Zach Laine via SG16 wrote:
> > > On Wed, Apr 8, 2020 at 4:38 PM Jens Maurer via SG16 <sg16_at_[hidden]> wrote:
> > >
> > > On 08/04/2020 23.27, Hubert Tong via SG16 wrote:
> > > > Seems GCC is right again...
> > > >
> > > > \u0300, whether the result of forming a UCN or physically
> present as a UCN is a string of six characters from the basic source
> character set...
> > >
> > > So, it's six characters, so
> > >
> > > #define accent(x) x ## \u0300
> > >
> > > becomes
> > >
> > >
> > > #define accent(x) x ## \ u0300
> > >
> > > with \ a lone character and u0300 a separate
> preprocessing-token / identifier.
> > >
> > > Disturbing UCNs like that is really counter-intuitive.
> > > We should fix that.
> > >
> > > Jens
> > >
> > >
> > > Does that also imply that this:
> > >
> > > #define accent(x) x ## \u0300
> > >
> > > and this:
> > >
> > > #define accent(x) x ## `
> > >
> > > are not equivalent? If so, I'm even more disturbed.
> >
> > Yes. Be disturbed.
> >
> > Jens
> >
> >
>
>

Received on 2020-04-09 08:45:35