Date: Mon, 19 Apr 2021 23:46:39 +0200
Jens,
on Mon, 19 Apr 2021 22:40:53 +0200 you (Jens Maurer
<Jens.Maurer_at_[hidden]>) wrote:
> > In all of this, is there any chance that someday we let `\u2264` or
> > similar non-identifier universal characters survive tokenization
> > (as a single token) and leave it to phase 7 to decide if they can do
> > something with it (or not)?
>
> Do you have any particular use-case in mind for such?
Yes: let implementations decide whether they want to use universal
characters that are not part of the identifier categories for anything
else (e.g. accept them as certain punctuators ;-)
But at the same time this could guarantee proper tokenization.
Instead of `\u03c1\u2264` leading to a "wrong universal character in
identifier" error this would cut off the usable part of an identifier
(`\u03c1`) as a first token and then deliver a second universal
character (`\u2264`) as a token of its own.
This could also lead to nicer error messages, such as "universal
character \u2264 is not accepted outside wide characters or strings",
that do not talk about identifiers when the user meant punctuation.
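To make the intended split concrete, here is a minimal sketch of such a
tokenizer step. It is only an illustration under stated assumptions: the
names `next_token`, `parse_ucn`, `is_identifier_ucn` and the token kinds
are hypothetical, only the four-digit `\uXXXX` spelling is handled, and
the identifier check is a stand-in (Greek and Coptic only) for the real
identifier tables.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Token kinds for this sketch only; not taken from any standard. */
enum tok_kind { TOK_END, TOK_IDENT, TOK_LONE_UCN, TOK_OTHER };

struct token {
    enum tok_kind kind;
    const char *start;  /* first character of the token in the input */
    size_t len;         /* number of characters in the token */
};

/* Stand-in for the real identifier tables: here only Greek and Coptic,
   U+0370..U+03FF, counts as identifier material. This is an assumption
   made to keep the example short. */
static int is_identifier_ucn(unsigned long cp) {
    return cp >= 0x0370 && cp <= 0x03FF;
}

/* If s starts with \uXXXX (four hex digits), store the code point and
   return 6, the length of the spelling; otherwise return 0. */
static size_t parse_ucn(const char *s, unsigned long *cp) {
    char hex[5];
    if (s[0] != '\\' || s[1] != 'u') return 0;
    for (int i = 2; i < 6; i++)
        if (!isxdigit((unsigned char)s[i])) return 0;
    memcpy(hex, s + 2, 4);
    hex[4] = '\0';
    *cp = strtoul(hex, NULL, 16);
    return 6;
}

/* Length of one identifier unit at s, or 0 if there is none.
   Digits are only accepted after the first unit. */
static size_t ident_unit(const char *s, int first) {
    unsigned long cp;
    size_t n;
    if (isalpha((unsigned char)*s) || *s == '_') return 1;
    if (!first && isdigit((unsigned char)*s)) return 1;
    n = parse_ucn(s, &cp);
    if (n && is_identifier_ucn(cp)) return n;
    return 0;
}

/* Return the next token of s. A UCN outside the identifier categories
   ends the current identifier and is delivered as a token of its own. */
struct token next_token(const char *s) {
    struct token t = { TOK_END, s, 0 };
    unsigned long cp;
    size_t n;
    while (*s == ' ') s++;
    t.start = s;
    if (*s == '\0') return t;
    n = ident_unit(s, 1);
    if (n) {
        t.kind = TOK_IDENT;
        while (n) {
            t.len += n;
            n = ident_unit(s + t.len, 0);
        }
        return t;
    }
    n = parse_ucn(s, &cp);
    if (n) {
        t.kind = TOK_LONE_UCN;
        t.len = n;
        return t;
    }
    t.kind = TOK_OTHER;  /* any other single character */
    t.len = 1;
    return t;
}
```

Called repeatedly on the spelling `\u03c1\u2264`, this sketch yields an
identifier token for `\u03c1` followed by a lone-UCN token for `\u2264`,
so a later phase can decide what to do with the second one.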
> Currently, lone UCNs that aren't part of the other main token
> categories are just garbage and thus (eventually) ill-formed,
> I believe.
In C it seems to be a bit more complicated. Universal characters
may only appear in identifiers, but if they have the wrong category,
the behavior is undefined. So `\u03c1\u2264` could be delivered to
phase 7, but then `\u2264` would be part of that token and not form a
token of its own.
Thanks
Jens
-- :: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS ::: :: ::::::::::::::: office Strasbourg : +33 368854536 :: :: :::::::::::::::::::::: gsm France : +33 651400183 :: :: ::::::::::::::: gsm international : +49 15737185122 :: :: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::
Received on 2021-04-19 16:46:46