ISOCPP sg16 List: Re: Backward compatibility impact from P2071 (Named universal character escapes)

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 31 Aug 2022 16:51:23 -0400

It seems I failed to provide an actual example of the backward
compatibility impact in the initial message.

Prior to P2071 and P2290 (Delimited escape sequences)
<https://wg21.link/p2290>, the following was well-formed. It is now
ill-formed: each escape sequence combines with the preceding a to form a
longer identifier, so the macro a is not expanded, lookup for the longer
identifier fails, and the trailing ) is extraneous.

    #define z(x) 0
    #define a z(
    int x = a\N{LATIN SMALL LETTER E WITH ACUTE});
    int y = a\u{00E9});

This example is pretty contrived, but perhaps an Annex C entry should be
added.
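
To illustrate why a is no longer expanded in the example above, here is a
minimal sketch of identifier scanning with and without escape sequences
being allowed to continue an identifier. The helper names
(escape_length, first_identifier, check_identifier_examples) are
hypothetical and are not taken from any compiler. The sketch checks only
the spelling of an escape; it does not look up character names or check
that the designated character is permitted in an identifier.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: length of a \u{hex...} or \N{NAME} escape starting
// at text[pos], or 0 if no such escape starts there. Syntax only.
inline std::size_t escape_length(std::string_view text, std::size_t pos) {
    if (pos + 2 >= text.size() || text[pos] != '\\') return 0;
    const char kind = text[pos + 1];
    if ((kind != 'u' && kind != 'N') || text[pos + 2] != '{') return 0;
    std::size_t i = pos + 3;
    while (i < text.size() && text[i] != '}') {
        const unsigned char c = static_cast<unsigned char>(text[i]);
        // Assumption: names use only A-Z, 0-9, space, and hyphen.
        const bool ok = (kind == 'u')
            ? (std::isxdigit(c) != 0)
            : (std::isupper(c) || std::isdigit(c) || c == ' ' || c == '-');
        if (!ok) return 0;
        ++i;
    }
    if (i >= text.size() || i == pos + 3) return 0;  // unterminated or empty
    return i + 1 - pos;
}

// Hypothetical helper: the identifier at the start of text, where
// with_escapes selects whether an escape may continue the identifier.
inline std::string first_identifier(std::string_view text, bool with_escapes) {
    std::size_t i = 0;
    while (i < text.size()) {
        const unsigned char c = static_cast<unsigned char>(text[i]);
        if (std::isalpha(c) || c == '_' || (i > 0 && std::isdigit(c))) {
            ++i;
        } else if (with_escapes) {
            const std::size_t n = escape_length(text, i);
            if (n == 0) break;
            i += n;
        } else {
            break;
        }
    }
    return std::string(text.substr(0, i));
}

inline void check_identifier_examples() {
    // Old behavior: the identifier is just a, which names the macro.
    assert(first_identifier("a\\u{00E9});", false) == "a");
    // New behavior: the escape extends the identifier, so a is not seen.
    assert(first_identifier("a\\u{00E9});", true) == "a\\u{00E9}");
    assert(first_identifier("a\\N{LATIN SMALL LETTER E WITH ACUTE});", true)
           == "a\\N{LATIN SMALL LETTER E WITH ACUTE}");
}
```

Calling check_identifier_examples() exercises both modes. With escapes
disabled the scanner stops after a, which is the macro name; with escapes
enabled the whole spelling becomes a single identifier and the macro is
never considered.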

Tom.

On 8/31/22 2:34 PM, Tom Honermann via SG16 wrote:
>
> P2071 (Named universal character escapes) <https://wg21.link/p2071>
> was approved for C++23 during the July, 2022 virtual plenary and
> implementations are in progress.
>
> While reviewing a proposed implementation
> <https://gcc.gnu.org/pipermail/gcc-patches/2022-August/600059.html>
> for gcc by Jakub Jelinek, Joseph Myers reported
> <https://gcc.gnu.org/pipermail/gcc-patches/2022-August/600620.html>
> that the implementation, which allows use of the \N{<name>}
> /named-universal-character/ syntax as an extension in C and in prior
> C++ language modes, caused a failure parsing the glibc
> sysdeps/powerpc/powerpc64/sysdep.h
> <https://sourceware.org/git/?p=glibc.git;a=blob_plain;f=sysdeps/powerpc/powerpc64/sysdep.h;hb=HEAD>
> header file due to use of \NARG within a #ifdef __ASSEMBLER__ code
> section. Joseph further produced the following example that, prior to
> P2071, is valid C and C++ code.
>
>     #define z(x) 0
>     #define a z(
>     int x = a\NARG);
>
> Prior to P2071, a\NARG is lexed as the three tokens a, \, and NARG.
> The preprocessor, in translation phase 4
> <http://eel.is/c++draft/lex.phases#1.4>, then identifies a as a macro
> and passes the \ NARG token sequence as an unused argument to macro z
> such that the post-preprocessed statement is int x = 0;.
>
> P2071 introduces /named-universal-character/
> <http://eel.is/c++draft/lex.charset#nt:named-universal-character>
> (NUC) as a form of /universal-character-name/
> <http://eel.is/c++draft/lex.charset#nt:universal-character-name>
> (UCN). UCNs are recognized and replaced during translation phase 3
> <http://eel.is/c++draft/lex.phases#1.3>. An implementation that
> interprets \N as signifying the start of a NUC may then diagnose \NARG
> as ill-formed.
>
> Jakub was kind enough to bring this concern to the attention of the
> Clang maintainers via a comment on a related code review
> <https://reviews.llvm.org/D129664#3760500>.
>
> I haven't done any testing of the proposed implementation for gcc and
> iteration on the proposed patch continues. It looks like the gcc
> direction will be to only recognize syntactically valid NUCs during
> translation phase 3. Thus \N, \NARGS, and \N{abc} (lowercase letters
> are not permitted in Unicode character names) will all lex as multiple
> preprocessing tokens while \N{ABC} (ABC is syntactically valid, but
> not a defined name) will be diagnosed as an ill-formed NUC. This
> appears to me to be 1) consistent with the standard, and 2) useful
> from a backward compatibility perspective.
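
The classification described in the paragraph above can be sketched
roughly as follows. This is only an illustration under stated
assumptions, not the gcc patch: the names NClass, classify_backslash_n,
and check_classification_examples are hypothetical, and "syntactically
valid" is assumed to mean a non-empty, brace-delimited name made up only
of uppercase letters, digits, spaces, and hyphens.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

enum class NClass {
    NotNuc,        // lexes as separate preprocessing tokens, as before
    CandidateNuc   // treated as a NUC; an unknown name is then diagnosed
};

// Hypothetical classifier for a spelling that starts with \N.
inline NClass classify_backslash_n(std::string_view s) {
    if (s.size() < 4 || s[0] != '\\' || s[1] != 'N' || s[2] != '{') {
        return NClass::NotNuc;
    }
    std::size_t i = 3;
    for (; i < s.size() && s[i] != '}'; ++i) {
        const char c = s[i];
        // Assumption: the characters accepted in a name at this stage.
        const bool ok = (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9') ||
                        c == ' ' || c == '-';
        if (!ok) return NClass::NotNuc;
    }
    if (i == s.size() || i == 3) return NClass::NotNuc;  // no '}' or empty
    return NClass::CandidateNuc;
}

inline void check_classification_examples() {
    assert(classify_backslash_n("\\N") == NClass::NotNuc);
    assert(classify_backslash_n("\\NARGS") == NClass::NotNuc);
    assert(classify_backslash_n("\\N{abc}") == NClass::NotNuc);
    // ABC is well-formed as a name but undefined, so a later name lookup
    // (not modeled here) would report an ill-formed NUC.
    assert(classify_backslash_n("\\N{ABC}") == NClass::CandidateNuc);
}
```

Only spellings classified as CandidateNuc would go on to name lookup;
everything else keeps its pre-P2071 tokenization.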
>
> Clang currently issues a diagnostic <https://godbolt.org/z/rfEE8cG4d>
> for the above example and matches what I understand to be the gcc
> direction for this case (the warning is, of course, permissible).
>
>     <source>:3:11: warning: incomplete universal character name; treating as '\' followed by identifier [-Wunicode]
>     int x = a\NARG);
>     ^
>
> Clang's behavior differs from the gcc direction when the example is
> changed to have \N{abc} <https://godbolt.org/z/1MbqbKWx9> though;
> Clang diagnoses this as an ill-formed NUC (twice apparently!).
>
>     <source>:3:11: error: 'abc' is not a valid Unicode character name
>     int x = a\N{abc});
>     ^
>     <source>:3:11: error: 'abc' is not a valid Unicode character name
>
> My motivation for bringing this discussion to WG21 is to:
>
> * Report the implementation experience that NUCs cannot be simply
>   recognized by observing only \N (at least, not without impacting
>   backward compatibility).
> * Ensure that the direction being pursued for gcc is consistent with
>   EWG's intent. Anyone who feels otherwise (e.g., that all
>   instances of \N should be diagnosed as ill-formed NUCs) should
>   report such perspectives in a reply or via an NB comment. Assuming
>   no contrary direction, I'll work with Clang maintainers to bring
>   Clang in line with the gcc direction for cases like \N{abc}.
> * Ask if anyone feels the standard is insufficiently clear here. If
>   so, I'll file a core issue.
>
> Tom.
>
>

Received on 2022-08-31 20:51:27