Hi Corentin,

Thanks for your work on these papers. The multiple meanings of 'character' (not just in C++) are something that irritates me, so I'm very happy to see the term removed.

I started replying to the paper with some minor points but then I started to overthink things. I've left it all here in case it's useful but the bits that are directly related to the paper are the first set of bullets :)

---

I'm going through D2749 and have some comments/questions:
---

Expanding on the 2nd and 3rd bullets above: I'm on the fence in terms of usage of "Unicode code point" vs "abstract character" (or "Unicode abstract character") in some places. The code point phrasing is definitely appropriate where we're talking about e.g. named escape sequences, and it's probably how everything will be implemented anyway, because the requirement that everything be in Normalization Form C means each abstract character has a 1:1 mapping with a combining sequence (I think?).
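
To make the Normalization Form C point a little more concrete, here is a minimal sketch. It is in Python only because its standard library exposes Unicode normalization; `is_nfc` is a hypothetical helper name, not something from the paper or the standard. It shows one case where two canonically equivalent spellings collapse to a single code point sequence under NFC. It illustrates the idea and does not prove the 1:1 claim in general.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """Hypothetical helper: True if text is already in Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


# Two canonically equivalent spellings of the same abstract character.
decomposed = "e\u0301"  # U+0065 followed by U+0301 COMBINING ACUTE ACCENT
composed = "\u00e9"     # U+00E9 LATIN SMALL LETTER E WITH ACUTE

# NFC maps both spellings to the single precomposed code point.
nfc_of_decomposed = unicodedata.normalize("NFC", decomposed)

print(is_nfc(decomposed))             # False
print(is_nfc(composed))               # True
print(nfc_of_decomposed == composed)  # True
```

Under an NFC requirement only the composed spelling would be accepted, so in this case there is exactly one code point sequence for the abstract character.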

As an example, the 2nd paragraph of [lex.pptoken] says:
[...] If any [-characters-] {+Unicode code point+} not in the basic character set matches the last category, the program is ill-formed.
 
I think it'd be more correct to use "abstract character" rather than "code point" here because even though the mechanism will be codepoints, the underlying thing being talked about is the abstract character.

Or, to put it another way: what is "and͓"?

As specified & implemented, it's an identifier formed from 4 codepoints. So the mechanistic answer is that it's an identifier in the same way that "andd" is; the keyword "and" just happens to be a prefix.

But for a human looking at it, they'd say either "it's different because 'd' and 'd͓' are qualitatively different (i.e. they're different abstract characters); the shared prefix with 'and' is 'an'" or "it's the same as 'and'; the ◌͓ mark is not significant". (We've clearly decided combining marks are significant in C++ so we can ignore the latter interpretation.)

Having written this out, I'm mostly convinced that the mechanistic (codepoint) and human (abstract character) interpretations will produce the same results (I'd be interested in any ideas for counter-examples!) so I don't think there would be a behavioural change with either. I still think the intent matters though.
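
The two readings of "and͓" can be sketched side by side. This is Python again, purely for its Unicode database access. `group_marks` is a hypothetical helper and only a rough stand-in for the "human" view: it attaches combining marks to the preceding base code point and is not full grapheme cluster segmentation.

```python
import unicodedata

identifier = "and\u0353"  # 'a', 'n', 'd', then U+0353 COMBINING X BELOW

# Mechanistic view: the identifier is a sequence of four code points.
code_points = [f"U+{ord(ch):04X}" for ch in identifier]

# There is no precomposed form of 'd' + U+0353, so NFC leaves it unchanged.
already_nfc = unicodedata.normalize("NFC", identifier) == identifier

# In code point terms, the keyword "and" is a three code point prefix.
keyword_is_prefix = identifier.startswith("and")


def group_marks(text: str) -> list[str]:
    """Hypothetical helper: attach each combining mark to the preceding base."""
    groups: list[str] = []
    for ch in text:
        if groups and unicodedata.combining(ch) != 0:
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups


# "Human" view (approximation): three units, the last being 'd' plus its mark.
units = group_marks(identifier)

print(code_points)  # ['U+0061', 'U+006E', 'U+0064', 'U+0353']
print(len(units))   # 3
```

Either way the token differs from the keyword: by code points it is "and" plus one more code point, and by grouped units it shares only "an" with the keyword.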

This may have all been discussed before (if so, apologies for making you read all of that) and it was decided that codepoint was the better term and/or that we definitely want to specify in terms of codepoints because of implementation constraints (or even that abstract characters are just too hard to nail down in meaningful standardese). It's hard to tell though because the current wording uses "character" everywhere and by changing to "Unicode code point", we're coming down on one side of that fence and I want to be sure that's a deliberate decision.

---

Because I've never actually read the lexing bits of the standard closely before, there are a few things (completely unrelated to this paper) I noticed and found surprising:
---

This was meant to be a short email when I started writing it :/

Fraser

On Thu, 26 Jan 2023 at 04:42, Corentin via SG16 <sg16@lists.isocpp.org> wrote:
Hey folks:

I published https://isocpp.org/files/papers/P2736R1.pdf - with the changes requested by SG16 yesterday as part of the forwarding poll. Interesting changes were:

* to consistently mention UAX XX of The Unicode Standard
* __STDC_ISO_10646__ : An integer literal of the form yyyymmL (for example, 199712L). If this symbol is defined, then its value is implementation-defined
* Specifically mention UTF-8, UTF-16 and UTF-32 instead of Unicode encoding

https://isocpp.org/files/papers/D2749R0.pdf down with character:

- Remove the footnote about old linkers
- Apply the character -> codepoint changes to the annexes and [diff] sections
- Remove a stale cross-reference in translation phase 1
- various typos


Thanks,

Corentin
--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16