P2071 (Named universal character escapes) was approved for C++23 during the July 2022 virtual plenary, and implementations are in progress.
While reviewing a proposed implementation for gcc by Jakub Jelinek, Joseph Myers reported that the implementation, which allows use of the \N{<name>} named-universal-character syntax as an extension in C and in prior C++ language modes, caused a failure parsing the glibc sysdeps/powerpc/powerpc64/sysdep.h header file due to use of \NARG within a #ifdef __ASSEMBLER__ code section. Joseph further produced the following example that, prior to P2071, is valid C and C++ code.
#define z(x) 0
#define a z(
int x = a\NARG);
Prior to P2071, a\NARG is lexed as the three tokens a, \, and NARG. The preprocessor, in translation phase 4, then identifies a as a macro and passes the \ NARG token sequence as an unused argument to macro z such that the post-preprocessed statement is int x = 0;.
P2071 introduces named-universal-character (NUC) as a form of universal-character-name (UCN). UCNs are recognized and replaced during translation phase 3. An implementation that interprets \N as signifying the start of a NUC may then diagnose \NARG as ill-formed.
Jakub was kind enough to bring this concern to the attention of the Clang maintainers via a comment on a related code review.
I haven't tested the proposed implementation for gcc, and iteration on the proposed patch continues. It looks like the gcc direction will be to only recognize syntactically valid NUCs during translation phase 3. Thus \N, \NARGS, and \N{abc} (lowercase letters are not permitted in Unicode character names) will all lex as multiple preprocessing tokens, while \N{ABC} (ABC is syntactically valid, but not a defined name) will be diagnosed as an ill-formed NUC. This appears to me to be 1) consistent with the standard, and 2) useful from a backward compatibility perspective.
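To make that rule concrete, here is a rough sketch of the classification as described above. It is not taken from the gcc patch: the names NucKind, is_name_char, and classify are hypothetical, and the set of characters accepted in a name (uppercase letters, digits, space, and hyphen) is an assumption about what "syntactically valid" means here.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical result of inspecting text that begins with "\N".
enum class NucKind {
    NotNuc,       // lex as separate preprocessing tokens ('\', identifier, ...)
    CandidateNuc  // well-formed \N{...}; ill-formed if the name is not defined
};

// Assumption: names use only uppercase letters, digits, space, and hyphen.
constexpr bool is_name_char(char c) {
    return (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9') ||
           c == ' ' || c == '-';
}

// Decide whether the text starting at s has the shape of a NUC.
// Only the syntax is checked; no name lookup is performed.
constexpr NucKind classify(std::string_view s) {
    constexpr std::string_view prefix = "\\N{";
    if (s.substr(0, prefix.size()) != prefix) {
        return NucKind::NotNuc;  // e.g. \N or \NARGS
    }
    std::size_t i = prefix.size();
    while (i < s.size() && is_name_char(s[i])) {
        ++i;
    }
    const bool nonempty = i > prefix.size();
    const bool closed = i < s.size() && s[i] == '}';
    return (nonempty && closed) ? NucKind::CandidateNuc : NucKind::NotNuc;
}
```

Under these assumptions, classify reports NotNuc for \N, \NARGS, and \N{abc}, and CandidateNuc for \N{ABC}. Only the latter would then go through name lookup and be diagnosed when the name is not defined.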
Clang currently issues a diagnostic for the above example and matches what I understand to be the gcc direction for this case (the warning is, of course, permissible).
<source>:3:11: warning: incomplete universal character name; treating as '\' followed by identifier [-Wunicode]
int x = a\NARG);
^
Clang's behavior differs from the gcc direction when the example is changed to have \N{abc}, though; Clang diagnoses this as an ill-formed NUC (twice, apparently!).
<source>:3:11: error: 'abc' is not a valid Unicode character name
int x = a\N{abc});
^
<source>:3:11: error: 'abc' is not a valid Unicode character name
My motivation for bringing this discussion to WG21 is to:
Tom.