ISOCPP sg16 List: Backward compatibility impact from P2071 (Named universal character escapes)

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 31 Aug 2022 14:34:09 -0400

P2071 (Named universal character escapes) <https://wg21.link/p2071> was
approved for C++23 during the July, 2022 virtual plenary and
implementations are in progress.

While reviewing a proposed implementation
<https://gcc.gnu.org/pipermail/gcc-patches/2022-August/600059.html> for
gcc by Jakub Jelinek, Joseph Myers reported
<https://gcc.gnu.org/pipermail/gcc-patches/2022-August/600620.html> that
the implementation, which allows use of the \N{<name>}
/named-universal-character/ syntax as an extension in C and in prior C++
language modes, caused a failure parsing the glibc
sysdeps/powerpc/powerpc64/sysdep.h
<https://sourceware.org/git/?p=glibc.git;a=blob_plain;f=sysdeps/powerpc/powerpc64/sysdep.h;hb=HEAD>
header file due to use of \NARG within a #ifdef __ASSEMBLER__ code
section. Joseph further produced the following example that, prior to
P2071, is valid C and C++ code.

    #define z(x) 0
    #define a z(
    int x = a\NARG);

Prior to P2071, a\NARG is lexed as the three tokens a, \, and NARG. The
preprocessor, in translation phase 4
<http://eel.is/c++draft/lex.phases#1.4>, then identifies a as a macro
and replaces it with z(, which begins an invocation of the function-like
macro z. The \ NARG token sequence up to the closing ) is collected as
the unused argument to z, such that the post-preprocessed statement is
int x = 0;.
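
To make the expansion easier to follow, here is the same example again
with each step written out as comments. It is only an annotated restatement
of Joseph's example, not new behavior. It assumes an implementation that
still lexes \NARG as separate tokens; such an implementation may warn.

```cpp
#define z(x) 0   // function-like macro; its argument is never used
#define a z(     // object-like macro that opens an invocation of z

// Phase 3 (pre-P2071 lexing):  a  \  NARG  )  ;
// Phase 4, expand a:           z  (  \  NARG  )  ;
// Phase 4, expand z(\ NARG):   0  ;
int x = a\NARG);  // ends up as: int x = 0;
```

The stray \ never reaches translation phase 7, because it is discarded
together with the rest of the unused macro argument.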

P2071 introduces /named-universal-character/
<http://eel.is/c++draft/lex.charset#nt:named-universal-character> (NUC)
as a form of /universal-character-name/
<http://eel.is/c++draft/lex.charset#nt:universal-character-name> (UCN).
UCNs are recognized and replaced during translation phase 3
<http://eel.is/c++draft/lex.phases#1.3>. An implementation that
interprets \N as signifying the start of a NUC may then diagnose \NARG
as ill-formed.

Jakub was kind enough to bring this concern to the attention of the
Clang maintainers via a comment on a related code review
<https://reviews.llvm.org/D129664#3760500>.

I haven't tested the proposed implementation for gcc, and iteration on
the proposed patch continues. It looks like the gcc
direction will be to only recognize syntactically valid NUCs during
translation phase 3. Thus \N, \NARGS, and \N{abc} (lowercase letters are
not permitted in Unicode character names) will all lex as multiple
preprocessing tokens while \N{ABC} (ABC is syntactically valid, but not
a defined name) will be diagnosed as an ill-formed NUC. This appears to
me to be 1) consistent with the standard, and 2) useful from a backward
compatibility perspective.
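
The syntactic check described above could look roughly like the following
sketch. It is not taken from the gcc patch. The names scan_nuc and
is_name_char are hypothetical, and the accepted character set (uppercase
Latin letters, digits, space, and hyphen-minus) is an assumption based on
the characters used in Unicode character names. The sketch checks the form
only and does not look the name up.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: true for characters assumed to be usable in a
// Unicode character name (A-Z, 0-9, space, hyphen-minus).
constexpr bool is_name_char(char c) {
    return (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9') ||
           c == ' ' || c == '-';
}

// Hypothetical helper: if `s` begins with a syntactically valid
// \N{name}, return the length of that sequence; otherwise return 0 so
// that the caller lexes the backslash as a separate preprocessing token.
constexpr std::size_t scan_nuc(std::string_view s) {
    if (s.size() < 4 || s[0] != '\\' || s[1] != 'N' || s[2] != '{')
        return 0;                       // \N or \NARGS: not a NUC
    std::size_t i = 3;
    while (i < s.size() && is_name_char(s[i]))
        ++i;                            // consume the candidate name
    if (i == 3 || i >= s.size() || s[i] != '}')
        return 0;                       // empty name, bad character, or no }
    return i + 1;                       // include the closing brace
}
```

With this sketch, scan_nuc returns 0 for \N, \NARGS, and \N{abc}, so those
would continue to lex as multiple preprocessing tokens. It returns 7 for
\N{ABC}, which would then go on to name lookup and be diagnosed there as
an undefined name.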

Clang currently issues a diagnostic <https://godbolt.org/z/rfEE8cG4d>
for the above example and matches what I understand to be the gcc
direction for this case (the warning is, of course, permissible).

    <source>:3:11: warning: incomplete universal character name;
    treating as '\' followed by identifier [-Wunicode]
    int x = a\NARG);
               ^

Clang's behavior differs from the gcc direction when the example is
changed to have \N{abc} <https://godbolt.org/z/1MbqbKWx9> though; Clang
diagnoses this as an ill-formed NUC (twice apparently!).

    <source>:3:11: error: 'abc' is not a valid Unicode character name
    int x = a\N{abc});
               ^
    <source>:3:11: error: 'abc' is not a valid Unicode character name

My motivation for bringing this discussion to WG21 is to:

  * Report the implementation experience that NUCs cannot be simply
    recognized by observing only \N (at least, not without impacting
    backward compatibility).
  * Ensure that the direction being pursued for gcc is consistent with
    EWG's intent. Anyone who feels otherwise (e.g., that all instances
    of \N should be diagnosed as ill-formed NUCs) should report such
    perspectives in a reply or via an NB comment. Assuming no contrary
    direction, I'll work with Clang maintainers to bring Clang in line
    with the gcc direction for cases like \N{abc}.
  * Ask if anyone feels the standard is insufficiently clear here. If
    so, I'll file a core issue.

Tom.

Received on 2022-08-31 18:34:10