ISOCPP sg16 List: Re: Backward compatibility impact from P2071 (Named universal character escapes)

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Wed, 31 Aug 2022 21:48:16 +0200

On 31/08/2022 20.34, Tom Honermann via SG16 wrote:
> P2071 (Named universal character escapes) <https://wg21.link/p2071> was approved for C++23 during the July, 2022 virtual plenary and implementations are in progress.
>
> While reviewing a proposed implementation <https://gcc.gnu.org/pipermail/gcc-patches/2022-August/600059.html> for gcc by Jakub Jelinek, Joseph Myers reported <https://gcc.gnu.org/pipermail/gcc-patches/2022-August/600620.html> that the implementation, which allows use of the \N{<name>} /named-universal-character/ syntax as an extension in C and in prior C++ language modes, caused a failure parsing the glibc sysdeps/powerpc/powerpc64/sysdep.h <https://sourceware.org/git/?p=glibc.git;a=blob_plain;f=sysdeps/powerpc/powerpc64/sysdep.h;hb=HEAD> header file due to use of \NARG within a #ifdef __ASSEMBLER__ code section. Joseph further produced the following example that, prior to P2071, is valid C and C++ code.
>
> #define z(x) 0
> #define a z(
> int x = a\NARG);
>
> Prior to P2071, a\NARG is lexed as the three tokens a, \, and NARG. The preprocessor, in translation phase 4 <http://eel.is/c++draft/lex.phases#1.4>, then identifies a as a macro and passes the \ NARG token sequence as an unused argument to macro z such that the post-preprocessed statement is int x = 0;.
>
> P2071 introduces /named-universal-character/ <http://eel.is/c++draft/lex.charset#nt:named-universal-character> (NUC) as a form of /universal-character-name/ <http://eel.is/c++draft/lex.charset#nt:universal-character-name> (UCN). UCNs are recognized and replaced during translation phase 3 <http://eel.is/c++draft/lex.phases#1.3>. An implementation that interprets \N as signifying the start of a NUC may then diagnose \NARG as ill-formed.
>
> Jakub was kind enough to bring this concern to the attention of the Clang maintainers via a comment on a related code review <https://reviews.llvm.org/D129664#3760500>.
>
> I haven't done any testing of the proposed implementation for gcc and iteration on the proposed patch continues. It looks like the gcc direction will be to only recognize syntactically valid NUCs during translation phase 3. Thus \N, \NARGS, and \N{abc} (lowercase letters are not permitted in Unicode character names) will all lex as multiple preprocessing tokens while \N{ABC} (ABC is syntactically valid, but not a defined name) will be diagnosed as an ill-formed NUC. This appears to me to be 1) consistent with the standard, and 2) useful from a backward compatibility perspective.
>
> Clang currently issues a diagnostic <https://godbolt.org/z/rfEE8cG4d> for the above example and matches what I understand to be the gcc direction for this case (the warning is, of course, permissible).
>
> <source>:3:11: warning: incomplete universal character name; treating as '\' followed by identifier [-Wunicode]
> int x = a\NARG);
> ^
>
> Clang's behavior differs from the gcc direction when the example is changed to have \N{abc} <https://godbolt.org/z/1MbqbKWx9> though; Clang diagnoses this as an ill-formed NUC (twice apparently!).
>
> <source>:3:11: error: 'abc' is not a valid Unicode character name
> int x = a\N{abc});
> ^
> <source>:3:11: error: 'abc' is not a valid Unicode character name
>
> My motivation for bringing this discussion to WG21 is to:
>
> * Report the implementation experience that NUCs cannot be simply recognized by observing only \N (at least, not without impacting backward compatibility).
> * Ensure that the direction being pursued for gcc is consistent with EWG's intent. Anyone that feels otherwise (e.g., that all instances of \N should be diagnosed as ill-formed NUCs) should report such perspectives in a reply or via an NB comment. Assuming no contrary direction, I'll work with Clang maintainers to bring Clang in line with the gcc direction for cases like \N{abc}.

The lexing grammar is pretty clear that only \N{caps-and-digits-and-space-and-minus}
forms a named-universal-character. Anything else is thus individual tokens;
warnings appreciated. In the examples above, the lexing grammar is not matched.

[lex.charset] could use a sentence that says the program is ill-formed if the
n-char-sequence does not designate any character.
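
As a rough illustration of the distinction discussed in this thread, the sketch below separates three outcomes: text that does not match the named-universal-character grammar (and so stays as individual tokens), text that matches the grammar but names no character (ill-formed), and a valid name. It assumes the character set described above (uppercase letters, digits, space, hyphen-minus) and an ASCII-compatible execution character set. The names NucResult, is_n_char, known_names, and classify are hypothetical, and the two-entry table merely stands in for the real Unicode name list. It is not taken from the gcc or Clang implementations.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

// Hypothetical outcome type for this sketch.
enum class NucResult {
    NotANuc,      // grammar not matched: lex as individual tokens
    UnknownName,  // grammar matched, but the name designates no character
    Valid         // grammar matched and the name is known
};

// Characters allowed in a name, as described in this thread:
// uppercase letters, digits, space, and hyphen-minus (ASCII assumed).
constexpr bool is_n_char(char c) {
    return (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9') ||
           c == ' ' || c == '-';
}

// Tiny stand-in for the Unicode character name table (assumption).
constexpr std::array<std::string_view, 2> known_names = {
    "LATIN SMALL LETTER A", "HYPHEN-MINUS"};

// Classify the text starting at a backslash.
constexpr NucResult classify(std::string_view s) {
    constexpr std::string_view prefix = "\\N{";
    if (s.substr(0, prefix.size()) != prefix) {
        return NucResult::NotANuc;  // e.g. a bare N or NARG after the backslash
    }
    std::size_t i = prefix.size();
    while (i < s.size() && is_n_char(s[i])) {
        ++i;
    }
    // Require at least one name character followed by a closing brace.
    if (i == prefix.size() || i == s.size() || s[i] != '}') {
        return NucResult::NotANuc;  // e.g. lowercase letters inside the braces
    }
    const std::string_view name = s.substr(prefix.size(), i - prefix.size());
    for (std::string_view known : known_names) {
        if (known == name) {
            return NucResult::Valid;
        }
    }
    return NucResult::UnknownName;
}
```

Under these assumptions, classify("\\NARG") and classify("\\N{abc}") both yield NotANuc, while classify("\\N{ABC}") yields UnknownName, which corresponds to the case where a diagnostic for an ill-formed program would be expected.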

> * Ask if anyone feels the standard is insufficiently clear here. If so, I'll file a core issue.

Jens

Received on 2022-08-31 19:48:28