Precomposed characters are the norm only in Western alphabets, where the combinations are few and historical mappings exist for round-trip purposes. In other scripts, combining characters are the norm and, for the most part, no precomposed characters exist.

Devanagari, for example, has combining marks as the norm. That we can have a z with umlaut, which doesn't exist in natural language, falls out from the general mechanism, but we don't want to preclude the general mechanism. 
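
A quick sketch of the difference, assuming a C++17 compiler (the variable names are only for illustration). It compares the precomposed and combining spellings of é, and shows that z with umlaut and a Devanagari syllable can only be spelled with a combining mark:

```cpp
#include <string_view>

// U+00E9 is the precomposed "e with acute": a single code point.
constexpr std::u32string_view precomposed_e = U"\u00E9";
// The same letter as 'e' plus U+0301 COMBINING ACUTE ACCENT: two code points.
constexpr std::u32string_view decomposed_e = U"e\u0301";
// z plus U+0308 COMBINING DIAERESIS. Unicode has no precomposed form for it.
constexpr std::u32string_view z_diaeresis = U"z\u0308";
// Devanagari KA (U+0915) plus VOWEL SIGN I (U+093F): also base plus mark.
constexpr std::u32string_view devanagari_ki = U"\u0915\u093F";

static_assert(precomposed_e.size() == 1, "one precomposed code point");
static_assert(decomposed_e.size() == 2, "base letter plus combining mark");
// Canonically equivalent text, but different code point sequences.
static_assert(precomposed_e != decomposed_e, "spellings differ");
static_assert(z_diaeresis.size() == 2, "combining form only");
static_assert(devanagari_ki.size() == 2, "combining form only");
```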

However, accomplishing that by token pasting is a non-goal.

My silly example 
std::string operator""_̈(const char*, std::size_t);
// convert text to heavy metal form.

Has real equivalents in real languages by the rules of Unicode, so it should be OK from the lexer's point of view.

If you can't add an umlaut to _ via macro, I am not sad. 

On Tue, May 5, 2020, 18:18 Jens Maurer via SG16 <> wrote:
On 05/05/2020 23.45, Tom Honermann wrote:
> I phrased that question poorly.  I meant to ask if we *want* that example to be well-formed (after correcting it for Hubert's observation that the constructed identifier is not in NFC as I initially claimed).
> Let's adjust it to:
>> #define accent(x)x##\u0327\u0300
>> constexpr int accent(Z) = 2;
>> constexpr int gv2 = Z\u0327\u0300;
>> static_assert(gv2 == 2, "whatever");
> According to (thanks for that link, Hubert), the constructed identifier (again, assuming that \u0327\u0300 were to be lexed as a single preprocessor token), is in NFC.
> I'm happy with the degenerate "universal-character-name that is none of the above" approach, I'm just wondering if it can/should be extended to munch multiple such characters.  If it can be so extended, do we have a good rationale for why we wouldn't have it do so?  If it can't be so extended, what is the technical reason?

First, there is no reason why you would want to write the above stuff.
An identifier containing combining marks that is, in fact, NFC
(i.e. the marks don't actually combine with the preceding character)
seems contrived.

Second, it seems to depend very much on the specific combining
character and the character preceding it whether you get a
combination that is NFC (i.e. well-formed) or not.  Your
example above would probably be ill-formed for accent(A), but
well-formed for accent(Z).  That seems a rather random outcome.
(My opinion would change if we would allow non-NFC identifiers
throughout.  But we don't, for good reason.)
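
A rough sketch of why the outcome depends on the base character, assuming C++14 or later. This is not a real normalizer: the function names are hypothetical, and the tables cover only the two marks from this thread plus two sample compositions. The rule it models is that a mark composes with the base letter if a precomposed pair exists and no earlier mark of equal or higher combining class blocks it.

```cpp
#include <cstddef>

// Canonical combining classes for the two marks discussed here (toy table).
constexpr int combining_class(char32_t c) {
    return c == 0x0327 ? 202   // COMBINING CEDILLA
         : c == 0x0300 ? 230   // COMBINING GRAVE ACCENT
         : 0;                  // anything else is outside this toy table
}

// Toy composition table: precomposed code point for (base, mark), or 0 if none.
constexpr char32_t compose(char32_t base, char32_t mark) {
    if (base == U'A' && mark == 0x0300) return 0x00C0;  // A + grave   -> U+00C0
    if (base == U'C' && mark == 0x0327) return 0x00C7;  // C + cedilla -> U+00C7
    return 0;
}

// True if `base` followed by `marks` is already in NFC under the toy tables.
// Assumes every mark is one of the two listed in combining_class above.
constexpr bool is_nfc_toy(char32_t base, const char32_t* marks, std::size_t n) {
    int prev = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const int ccc = combining_class(marks[i]);
        if (ccc < prev) return false;      // marks are not in canonical order
        const bool blocked = prev >= ccc;  // an earlier mark shields this one
        if (!blocked && compose(base, marks[i]) != 0) return false;
        prev = ccc;
    }
    return true;
}

constexpr char32_t accent_marks[] = {0x0327, 0x0300};

// A + cedilla + grave: the grave is not blocked, so NFC would be U+00C0 U+0327.
static_assert(!is_nfc_toy(U'A', accent_marks, 2), "accent(A) is not NFC");
// Z has no precomposed form with either mark, so the sequence is already NFC.
static_assert(is_nfc_toy(U'Z', accent_marks, 2), "accent(Z) is NFC");
```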

Third, Hubert's observation was that there might be inadvertent
combinations of a combining character with something that precedes
it. Your editor might display the combination, but C++ will lex
the "something that precedes it" separately from the combining
character. That seems unfriendly and should be made ill-formed
as much as we can.

Fourth, given the NFC requirement, it seems to me that combining
marks should never appear in source code outside of string literals
at all.  If you want them in your strings, go put them inside
string literals but don't disturb ## with them.

Finally, it seems you could do what you wanted using something like:

#define accent(x) x ## \u0327 ## \u0300
constexpr int accent(Z) = 2;

This produces the intermediate token Z\u0327 and the final token
Z\u0327\u0300. I guess both are NFC, so they are fine.

This seems a reasonable work-around for someone dying to do this.

> Rather than just extending "universal-character-name that is none of the above" to munch multiple characters, another approach would be to keep that (so as to avoid tearing of UCNs) and to add an additional non-terminal that matches /identifier-continue/ (but avoids ambiguity with /identifier/).  I believe this would suffice to enable construction via concat of every valid identifier at arbitrary UCN boundaries.  Whether that is a useful design consideration I withhold opinion on.

I'm not convinced such an approach is worth the effort.

Note it took us quite a while to arrive at the current status
in SG16.

SG16 mailing list