Hi Corentin,

Thanks for your work on these papers. The multiple meanings of 'character' (not just in C++) irritate me, so I'm very happy to see the term removed.

I started replying to the paper with some minor points but then I started to overthink things. I've left it all here in case it's useful but the bits that are directly related to the paper are the first set of bullets :)

---

I'm going through D2749 and have some comments/questions:

- "Unicode code point" feels awkward to read (and write!). Is it worth defining a term so it can be shortened? In [lex.ext] you've used "code point" without "Unicode".
- [lex.charset]: Table 1 (basic character set) retains "character", as does the text referring to it, but Table 3 (additional control characters in the basic literal character set) replaces "character" with "Unicode code point". The text referencing Table 3 describes the contents of Table 1 as "abstract characters" and those of Table 3 as "control characters". The text also contains a note saying that the U+xxxx values don't have a meaning in this context. Based on that note and the normative text, I think the column heading on both tables and the references in the text should be "abstract character" consistently.
- As an alternative to "abstract character" above, I think "Unicode code point" would also work if used everywhere due to all of the abstract characters in the basic set mapping to a single codepoint. It would become incorrect in the (unlikely?) event that the basic set expands to add abstract characters that are represented by combining sequences.
- [diff.cpp14.lex]: I'm not sure this section makes much sense now. It ends (with your changes) "as part of the implementation-defined mapping from input source file characters to Unicode." But is that still implementation-defined? With the removal of 'translation character set' and with the requirement to accept UTF-8 input files, the conversion from UTF-8 'characters' to Unicode is defined by the Unicode standard. Instead of replacing "the translation character set" with "Unicode", would it be better to strike the whole "as part of the..." clause from that sentence?

---

Expanding on the 2nd and 3rd bullets above: I'm on the fence in terms of usage of "Unicode code point" vs "abstract character" (or "Unicode abstract character") in some places. The code point phrasing is definitely appropriate where we're talking about e.g. named escape sequences and it's probably how everything will be implemented anyway, because the requirement that everything be in Normalization Form C means each abstract character has a 1:1 mapping with a combining sequence (I think?).

As an example, the 2nd paragraph of [lex.pptoken] says:

[...] If any ~~characters~~ Unicode code point not in the basic character set matches the last category, the program is ill-formed.

I think it'd be more correct to use "abstract character" rather than "code point" here because even though the mechanism will be codepoints, the underlying thing being talked about is the abstract character.

Or, to put it another way: what is "and͓"?

As specified & implemented, it's an identifier formed from 4 codepoints. So the mechanistic answer is that it's an identifier in the same way that "andd" is; the keyword "and" just happens to be a prefix.

But a human looking at it would say either "it's different because 'd' and 'd͓' are qualitatively different (i.e. they're different abstract characters); the shared prefix with 'and' is 'an'" or "it's the same as 'and'; the ◌͓ mark is not significant". (We've clearly decided combining marks are significant in C++ so we can ignore the latter interpretation.)
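To make the two readings concrete, here is a minimal C++17 sketch. Everything in it is hypothetical and for illustration only: the function names are mine, and treating U+0300..U+036F as "attaches to the previous code point" is a crude stand-in for real segmentation, not anything taken from the paper or the standard.

```cpp
#include <cstddef>
#include <string_view>

// Assumption for this sketch only: a code point in the Combining Diacritical
// Marks block (U+0300..U+036F) attaches to the code point before it.
constexpr bool is_combining_mark(char32_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// True if position `pos` in `s` holds a combining mark.
constexpr bool mark_at(std::u32string_view s, std::size_t pos) {
    return pos < s.size() && is_combining_mark(s[pos]);
}

// Mechanistic reading: number of leading code points shared by `a` and `b`.
constexpr std::size_t common_prefix_code_points(std::u32string_view a,
                                                std::u32string_view b) {
    std::size_t i = 0;
    while (i < a.size() && i < b.size() && a[i] == b[i]) ++i;
    return i;
}

// Human reading: the same, but a base and its marks are never split.
// If either string continues with a mark, the last matched base belongs to
// a different unit, so back off to just before that base.
constexpr std::size_t common_prefix_respecting_marks(std::u32string_view a,
                                                     std::u32string_view b) {
    std::size_t i = common_prefix_code_points(a, b);
    if (mark_at(a, i) || mark_at(b, i)) {
        while (i > 0 && is_combining_mark(a[i - 1])) --i;  // matched marks
        if (i > 0) --i;                                    // their base
    }
    return i;
}

// "and" followed by U+0353 COMBINING X BELOW: four code points.
static_assert(std::u32string_view(U"and\u0353").size() == 4);
// Code point view: "and" is a prefix, exactly as it is for "andd".
static_assert(common_prefix_code_points(U"and\u0353", U"and") == 3);
static_assert(common_prefix_code_points(U"andd", U"and") == 3);
// Unit view: only "an" is shared, because 'd' and 'd' + U+0353 differ.
static_assert(common_prefix_respecting_marks(U"and\u0353", U"and") == 2);
```

Either way the token is not the keyword "and"; the two readings only disagree about where the shared prefix ends.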

Having written this out, I'm mostly convinced that the mechanistic (codepoint) and human (abstract character) interpretations will produce the same results (I'd be interested in any ideas for counter-examples!) so I don't think there would be a behavioural change with either. I still think the intent matters though.

This may have all been discussed before (if so, apologies for making you read all of that) and it was decided that codepoint was the better term and/or that we definitely want to specify in terms of codepoints because of implementation constraints (or even that abstract characters are just too hard to nail down in meaningful standardese). It's hard to tell though because the current wording uses "character" everywhere and by changing to "Unicode code point", we're coming down on one side of that fence and I want to be sure that's a deliberate decision.

---

Because I've never actually read the lexing bits of the standard closely before, there are a few things (completely unrelated to this paper) I noticed and found surprising:

- [lex.ccon]/[lex.string]: basic-c-char allows any codepoint except apostrophe, backslash and newlines. Lots of interesting control and presentation chars (e.g. bidi overrides) are allowed! Ditto for basic-s-char but less surprising in string literals than character literals.
- [lex.charset]: n-char is similar to the above, despite the restricted alphabet that Unicode uses to name characters. I assume this is for implementation simplicity?
- [lex.ccon]: hexadecimal-escape-sequence looks wrong. Raised as a CWG issue.
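
To illustrate the basic-c-char / basic-s-char point, here is a rough sketch of the rule as I read it. The predicate names are hypothetical, and I'm assuming new-line means U+000A here; this is not the standard's wording.

```cpp
// Sketch of basic-c-char as described above: any code point except
// apostrophe, backslash and new-line (assumed to be U+000A).
constexpr bool is_basic_c_char(char32_t c) {
    return c != U'\'' && c != U'\\' && c != U'\n';
}

// Same idea for basic-s-char, with the double quote excluded instead.
constexpr bool is_basic_s_char(char32_t c) {
    return c != U'"' && c != U'\\' && c != U'\n';
}

static_assert(is_basic_c_char(U'a'));
static_assert(!is_basic_c_char(U'\''));
static_assert(is_basic_s_char(U'\''));
// U+202E RIGHT-TO-LEFT OVERRIDE passes both predicates.
static_assert(is_basic_c_char(U'\u202E'));
static_assert(is_basic_s_char(U'\u202E'));
```

The only exclusions are the delimiter, the escape introducer and new-line, so presentation controls such as bidi overrides pass.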

---

This was meant to be a short email when I started writing it :/

Fraser