There's a paper in the works to clean up the C++ model so that the logical conversion of È (U+00C8) to the universal character name \u00C8 no longer happens. No one sensible actually implements it that way, and the conversion causes a bunch of issues in specifying how to revert to the original spelling when that's required. Instead, we're just going to preserve the scalar value. See P2314, which SG16 forwarded recently. Jens Maurer has been working on it in SG16, because SG16 really wants to clean up the Unicode, char, and string literal problems, and the early phases of translation make that difficult.
This doesn't change the issue you're raising, though. Text inside #if 0 still has to be lexed, and identifier characters need to be valid identifier characters, which we are also cleaning up in P1949.
In existing versions of C++, translation phase 1 converts every character
not in the basic source character set to a universal character name (UCN).
Outside of strings, such UCNs then match the lexical syntax production for
an identifier, but are outside of the ranges of characters permitted in
identifiers. This means the use of such characters yields an invalid
identifier and is generally invalid *even inside #if 0*, much like e.g.
unmatched ' or " characters are invalid even inside #if 0.
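
To make the phase 1 behaviour concrete, here is a rough sketch of that
conversion as a plain function. It is only a model of the wording described
above, not how any compiler is actually implemented; the names
is_basic_source_char and phase1_model are made up for this illustration, and
the character list assumes the 96-character basic source character set of
C++20 and earlier.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: true if c is in the basic source character set
// (as defined in C++20 and earlier: 5 whitespace + 91 graphic characters).
inline bool is_basic_source_char(char32_t c) {
    static const std::u32string basic =
        U" \t\v\f\n"
        U"abcdefghijklmnopqrstuvwxyz"
        U"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        U"0123456789"
        U"_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'";
    return basic.find(c) != std::u32string::npos;
}

// Hypothetical model of phase 1: every character outside the basic source
// character set is replaced by the spelling of a universal character name.
inline std::string phase1_model(const std::u32string& src) {
    std::string out;
    for (char32_t c : src) {
        if (is_basic_source_char(c)) {
            out += static_cast<char>(c);
        } else {
            char buf[11];  // "\U" + 8 hex digits + terminating NUL
            if (c <= 0xFFFF) {
                std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(c));
            } else {
                std::snprintf(buf, sizeof buf, "\\U%08X", static_cast<unsigned>(c));
            }
            out += buf;
        }
    }
    return out;
}
```

Under this model, a source line containing È becomes one containing the six
characters \u00C8 before any preprocessing directive, including #if 0, is
looked at.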
The matter of being invalid inside #if 0 is an important one. With new
language features, it's normally possible to write code with #if
conditionals on the value of __STDC_VERSION__ or __cplusplus, so that the
code only uses the new feature if the language version is new enough. When
a new feature involves text that is invalid inside #if 0, that doesn't work.
So you can't generally use such characters (in C++), or corresponding UCNs
(in both C and C++), in such conditional code, because that usage is
invalid in #if 0 for existing language versions; you'd have to put the
new-language-version code in an entirely separate header, that's included
by a #include that itself is conditional, so compilers for old language
versions don't see the new-language-version code at all.
New punctuator pp-tokens are safe to add, in the sense that they don't
introduce this issue, when they involve only characters in the basic source
character set other than ' and " (although they could still have
compatibility issues if they affect the interpretation of existing valid
code).
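
As an illustration of a punctuator that is safe in this sense, take the
C++20 <=> operator, which is spelled only with basic source characters. The
sketch below guards its use with a check on __cplusplus; an older compiler
skipping the first group simply lexes <=> as the two existing tokens <= and
>, so the skipped text stays valid. The function name three_way is invented
for the example.

```cpp
#if __cplusplus >= 202002L
#include <compare>

// C++20 and later: use the new punctuator directly.
inline int three_way(int a, int b) {
    auto c = a <=> b;
    return c < 0 ? -1 : (c > 0 ? 1 : 0);
}
#else
// Older language versions: the group above is skipped, but still lexed.
inline int three_way(int a, int b) {
    return (a > b) - (a < b);
}
#endif
```

Either branch gives the same results, e.g. three_way(1, 2) is -1,
three_way(2, 2) is 0 and three_way(3, 2) is 1. A feature spelled with a
character outside the basic source character set could not be guarded this
way for existing language versions.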
--
Joseph S. Myers
joseph@codesourcery.com
Link to this post: http://lists.isocpp.org/liaison/2021/04/0446.php