Date: Thu, 15 Apr 2021 14:54:02 -0400
There's a paper in the works to clean up the C++ translation model so that
the logical conversion from È (U+00C8) to the universal-character-name
\u00C8 no longer happens. No sensible implementation actually performs that
conversion, and it causes a bunch of issues in specifying how to revert to
the original spelling when that's required (in raw string literals, for
example). Instead we're just going to preserve the scalar value. See P2314,
which SG16 forwarded recently. Jens Maurer has been working on it in SG16,
because SG16 really wants to clean up all the Unicode, char, and string
literal problems, and the early phases of translation make that difficult.
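To make the current (pre-P2314) model concrete, here is a rough sketch of
what the logical phase 1 conversion amounts to. The function name to_ucn is
made up for illustration and is not taken from any compiler or from the
standard. It assumes well-formed UTF-8 input, and it treats all of ASCII as
"basic", which is a simplification (a few ASCII characters are not in the
basic source character set).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical helper: rewrites every non-ASCII code point in a UTF-8
// string as a universal-character-name (\uXXXX or \UXXXXXXXX).
// Assumption: the input is well-formed UTF-8; malformed input is not diagnosed.
std::string to_ucn(std::string_view utf8) {
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        if (lead < 0x80) {  // ASCII: copied through unchanged (simplification)
            out += static_cast<char>(lead);
            ++i;
            continue;
        }
        // The lead byte gives the sequence length and the first payload bits.
        int extra = 0;
        std::uint32_t cp = 0;
        if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else                            { extra = 3; cp = lead & 0x07; }
        ++i;
        // Each continuation byte contributes six more bits.
        for (int k = 0; k < extra && i < utf8.size(); ++k, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i]) & 0x3F);
        }
        char buf[11];  // "\U" + 8 hex digits + terminating NUL
        if (cp <= 0xFFFF) {
            std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(cp));
        } else {
            std::snprintf(buf, sizeof buf, "\\U%08X", static_cast<unsigned>(cp));
        }
        out += buf;
    }
    return out;
}
```

Calling to_ucn on the two UTF-8 bytes of È yields the six characters
\u00C8, which is the spelling the rest of translation logically sees today.
Under P2314 that round trip goes away and the scalar value is kept as is.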
This doesn't change the issue you're raising, though. Lexing still has to
handle those characters before we get to #if 0, and identifier characters
need to be valid identifier characters, which is something we are also
cleaning up in P1949.
On Thu, Apr 15, 2021 at 1:32 PM Joseph Myers via Liaison <
liaison_at_[hidden]> wrote:
> In existing versions of C++, translation phase 1 converts characters not
> in the basic source character set to universal character names. So any
> such character gets converted to a universal character name. Outside of
> strings, such UCNs then match the lexical syntax production for an
> identifier, but are outside of the ranges of characters permitted in
> identifiers. This means the use of such characters yields an invalid
> identifier and is generally invalid *even inside #if 0*, much like e.g.
> unmatched ' or " characters are invalid even inside #if 0.
>
> The matter of being invalid inside #if 0 is an important one. With new
> language features, normally it's possible to write code with #if
> conditionals on the value of __STDC_VERSION__ or __cplusplus, that only
> uses the new feature if the language version is new enough. When a new
> feature involves text that is invalid inside #if 0, that doesn't work.
> So you can't generally use such characters (in C++), or corresponding UCNs
> (in both C and C++), in such conditional code, because that usage is
> invalid in #if 0 for existing language versions; you'd have to put the
> new-language-version code in an entirely separate header, that's included
> by a #include that itself is conditional, so compilers for old language
> versions don't see the new-language-version code at all.
>
> Punctuator pp-tokens that are safe to add because they don't introduce
> this issue (although they could still have compatibility issues if they
> affect the interpretation of existing valid code) involve only characters
> in the basic source character set other than ' and ".
>
> --
> Joseph S. Myers
> joseph_at_[hidden]
> _______________________________________________
> Liaison mailing list
> Liaison_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/liaison
> Link to this post: http://lists.isocpp.org/liaison/2021/04/0446.php
>
Received on 2021-04-15 13:54:19