Thanks Steve & Corentin for your detailed replies. I'll try to ask shorter questions in future :)

Corentin - you've definitely convinced me on the "codepoint" vs "abstract character" front. The answer to my question is that yes, it's a deliberate decision. And the explanation you've given contains the answers if anyone does ask if the wording change causes differences in behaviour. Would it be worth summarising it in the paper?

The only thing I'm not 100% sure about is whether [diff.cpp14.lex] might now imply that trigraphs are disallowed in UTF-8 source files, an implication that wasn't there before (because the UTF-8 -> Unicode mapping isn't implementation-defined, while UTF-8 -> "translation character set" was). As long as you're happy it doesn't say that, I'll stop asking about it (like most people, I've only ever "used" trigraphs accidentally).


Corentin & Steve - I knew there'd be reasons but I couldn't think what they were: the lexer is scary and full of unintended consequences.

Fraser

On Fri, 27 Jan 2023 at 04:32, Corentin <corentin.jabot@gmail.com> wrote:


On Fri, Jan 27, 2023 at 4:05 AM Fraser Gordon <fraserjgordon+cpp@gmail.com> wrote:
Hi Corentin,

Thanks for your work on these papers. The multiple meanings of 'character' (not just in C++) is something that irritates me so I'm very happy to see it removed.

I started replying to the paper with some minor points but then I started to overthink things. I've left it all here in case it's useful but the bits that are directly related to the paper are the first set of bullets :)

---

I'm going through D2749 and have some comments/questions:
  • "Unicode code point" feels awkward to read (and write!). Is it worth defining a term so it can be shortened? In [lex.ext] you've used "code point" without "Unicode".
Code point is generally applicable to any encoding, not just to Unicode.
Where there is no possible confusion, I did try not to say "Unicode" redundantly. At the same time, I want to make sure that we can keep talking about a Shift-JIS code point, for example, especially in the library section.

Generally, the wording is juggling around five different encodings, which even SG-16 members have historically struggled to tell apart. Being exact is important here.
I think these changes look a bit dense in isolation, but if you were to read the whole standard with them applied it would not be insufferable.
 
  • [lex.charset]: Table 1 (basic character set) retains "character", as does the text referring to it, but Table 3 (additional control characters in the basic literal character set) replaces "character" with "Unicode code point". The text referencing Table 3 describes the contents of Table 1 as "abstract characters" and those of Table 3 as "control characters". The text also contains a note saying that the U+XXXX values don't have a meaning in this context. Based on that note and the normative text, I think the column heading on both tables and the references in the text should be "abstract character" consistently.
Yes, that change to Table 1 should be reverted for consistency, thanks
  • as an alternative to "abstract character" above, I think "Unicode code point" would also work if used everywhere due to all of the abstract characters in the basic set mapping to a single codepoint. It would become incorrect in the (unlikely?) event that the basic set expands to add abstract characters that are represented by combining sequences.
We could. In fact, I started doing that, but found it to be maybe too pedantic and too churny, as there is no ambiguity. But we could go further editorially after, if we want to.
For example, every time we say "a quotation character" we could say "U+0022 quotation mark".
 
  • [diff.cpp14.lex] I'm not sure this section makes much sense now. It ends (with your changes) "as part of the implementation-defined mapping from input source file characters to Unicode." But is that still implementation-defined? With the removal of 'translation character set' and with the requirement to accept UTF-8 input files, the conversion from UTF-8 'characters' to Unicode is defined by the Unicode standard. Instead of replacing "the translation character set" with "Unicode", would it be better to strike the whole "as part of the..." clause from that sentence?

Yes, it is. There are two scenarios now in phase 1. Either the source file is UTF-8, in which case you are just decoding it to code points, or it is anything else (including EBCDIC, Morse code, an ancient cuneiform tablet), and reading that source file requires converting it to Unicode code points somehow, which is left entirely implementation-defined.
Whether that conversion is to Unicode or the translation character set (which is isomorphic to Unicode) makes no difference.
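The two phase-1 scenarios can be sketched like this (an illustrative Python sketch; the function names are invented for illustration, not part of any specification):

```python
# Illustrative sketch of the two phase-1 scenarios described above.
# The function names here are made up; they are not from any paper.

def phase1_utf8(source_bytes: bytes) -> list[int]:
    """UTF-8 source: the mapping to Unicode code points is fully
    determined by the Unicode standard -- no implementation latitude."""
    return [ord(c) for c in source_bytes.decode("utf-8")]

def phase1_other(source_bytes: bytes, impl_defined_decode) -> list[int]:
    """Any other source encoding: the mapping to Unicode code points
    is left entirely implementation-defined (modeled here as a
    caller-supplied decoder)."""
    return [ord(c) for c in impl_defined_decode(source_bytes)]

# UTF-8 input: decoding is fixed by Unicode.
assert phase1_utf8("é".encode("utf-8")) == [0xE9]

# e.g. a Latin-1 source file, with a hypothetical implementation-defined decoder:
assert phase1_other(b"\xe9", lambda b: b.decode("latin-1")) == [0xE9]
```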
 
---

Expanding on the 2nd and 3rd bullets above: I'm on the fence in terms of usage of "Unicode code point" vs "abstract character" (or "Unicode abstract character") in some places. The code point phrasing is definitely appropriate where we're talking about e.g. named escape sequences and it's probably how everything will be implemented anyway, because the requirement that everything be in Normal Form C means each abstract character has a 1:1 mapping with a combining sequence (I think?).

As an example, the 2nd paragraph of [lex.pptoken] says:
[...] If any Unicode code point [was: "character"] not in the basic character set matches the last category, the program is ill-formed.
 
I think it'd be more correct to use "abstract character" rather than "code point" here because even though the mechanism will be codepoints, the underlying thing being talked about is the abstract character.

Lexing is completely unaware of abstract characters, which could, for example, be composed of multiple codepoints.
We read the next 21-bit value in a sequence of 21-bit values.
Abstract characters in general do not make sense in computer programs, and are only useful when mapping between two different encodings (both Unicode and EBCDIC have a codepoint that represents the letter 'a'), or when talking about characters in the abstract.
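A quick Python illustration of the point: one abstract character can be spelled as more than one codepoint sequence, and NFC is what collapses them.

```python
import unicodedata

# One abstract character, two different code point sequences:
precomposed = "\u00E9"    # é as a single code point, U+00E9
combining = "e\u0301"     # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# A codepoint-level lexer sees two different inputs:
assert precomposed != combining
assert len(precomposed) == 1 and len(combining) == 2

# Normal Form C maps the combining sequence to the precomposed form,
# which is why NFC source lets codepoint comparisons line up with
# human intuition about "the same character":
assert unicodedata.normalize("NFC", combining) == precomposed
```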
 

Or, to put it another way: what is "and͓"?

As specified & implemented, it's an identifier formed from 4 codepoints. So the mechanistic answer is that it's an identifier in the same way that "andd" is; the keyword "and" just happens to be a prefix.

But for a human looking at it, they'd say either "it's different because 'd' and 'd͓' are qualitatively different (i.e. they're different abstract characters); the shared prefix with 'and' is 'an'" or "it's the same as 'and'; the ◌͓ mark is not significant". (We've clearly decided combining marks are significant in C++ so we can ignore the latter interpretation.)

Having written this out, I'm mostly convinced that the mechanistic (codepoint) and human (abstract character) interpretations will produce the same results (I'd be interested in any ideas for counter-examples!) so I don't think there would be a behavioural change with either. I still think the intent matters though.
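To make the mechanistic view concrete (another illustrative Python sketch):

```python
import unicodedata

ident = "and\u0353"   # "and" followed by U+0353 COMBINING X BELOW

# Mechanistically this is four code points; "and" is merely a prefix,
# exactly as it is a prefix of "andd":
assert [hex(ord(c)) for c in ident] == ["0x61", "0x6e", "0x64", "0x353"]

# U+0353 has no precomposed form with 'd', so NFC leaves the sequence
# alone -- here the codepoint view and the abstract-character view
# cannot be told apart:
assert unicodedata.normalize("NFC", ident) == ident
```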

This may have all been discussed before (if so, apologies for making you read all of that) and it was decided that codepoint was the better term and/or that we definitely want to specify in terms of codepoints because of implementation constraints (or even that abstract characters are just too hard to nail down in meaningful standardese). It's hard to tell though, because the current wording uses "character" everywhere, and by changing to "Unicode code point" we're coming down on one side of that fence; I want to be sure that's a deliberate decision.

Where it matters is that we don't want implementations to be that clever.

Lexing does (and always has; we are just clarifying) look at the next codepoint. Looking at the next abstract character is not possible, because there is no specification that can describe what an abstract character is to a machine; it's a very human-centric concept. (You and I know what a 'U' is, or pretend to at least, because we could have a very long philosophical debate about whether 'U' actually is 'U' depending on historical context - but the compiler only knows that U+0055 has XID_Start=true.) We could look at the next grapheme, which is a Unicode specification for codepoint sequences representing whole abstract characters, but 1/ Unicode does not specify identifiers that way, and 2/ we don't want to force implementations to look at grapheme boundaries (because they have not historically had to, because it would force implementations to do extra lookups during lexing, which would have performance implications, and because it would make the behavior of lexing dependent on Unicode versions - none of which are desirable properties of a lexer), and because there is no clear benefit to behaving that way, as it would be mostly unobservable.
One could argue that it would produce better diagnostics, but then again, whether a grapheme would be a valid identifier does, as you note, depend not on any property of that grapheme but on the normalization form the source file is in. We also found that diagnostics are more actionable when they mention codepoints, because of invisible glyphs, normalization form and so forth.
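A codepoint-at-a-time identifier check in this spirit can be sketched in Python. (Caveat: `str.isidentifier()` applies Python's own UAX #31-based rules, so this only approximates C++'s XID_Start/XID_Continue check, but the codepoint-by-codepoint mechanism is the same.)

```python
keyword_like = "and\u0353"   # "and" + U+0353 COMBINING X BELOW

# A combining mark is a valid *continue* code point, so this is an
# identifier -- decided one code point at a time...
assert keyword_like.isidentifier()

# ...but it is not a valid *start* code point, and at no point did the
# check need to know where grapheme boundaries fall:
assert not "\u0353and".isidentifier()
```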

Another way to look at it is that, except for the purpose of extracting graphemes (segmentation), Unicode never considers graphemes: all behaviors are specified on codepoints, and the usefulness of graphemes is mostly limited to text rendering and counting the length of tweets.

 

---

Because I've never actually read the lexing bits of the standard closely before, there are a few things (completely unrelated to this paper) I noticed and found surprising:
  • [lex.ccon]/[lex.string]: basic-c-char allows any codepoint except apostrophe, backslash and newlines. Lots of interesting control and presentation chars (e.g. bidi overrides) are allowed! Ditto for basic-s-char but less surprising in string literals than character literals. 
  • [lex.charset]: n-char is similar to the above, despite the restricted alphabet that Unicode uses to name characters. I assume this is for implementation simplicity?
That was changed fairly recently, see https://cplusplus.github.io/CWG/issues/2640.html
I certainly do not think it made the implementation easier, but it changes when diagnostics are produced when these things appear in macros.
 
  • [lex.ccon]: hexadecimal-escape-sequence looks wrong. Raised as a CWG issue.
Thanks 
---

This was meant to be a short email when I started writing it :/

Fraser

On Thu, 26 Jan 2023 at 04:42, Corentin via SG16 <sg16@lists.isocpp.org> wrote:
Hey folks:

I published https://isocpp.org/files/papers/P2736R1.pdf - with the changes requested by SG16 yesterday as part of the forwarding poll. Interesting changes were:

* to constantly mention UAX XX of The Unicode Standard
* __STDC_ISO_10646__ : An integer literal of the form yyyymmL (for example, 199712L). If this symbol is defined, then its value is implementation-defined
* Specifically mention UTF-8, UTF-16 and UTF-32 instead of Unicode encoding


* Remove the footnote about old linkers
* Apply the character -> codepoint changes to the annexes and [diff] sections
* Remove a stale cross reference in phase 1 of translation
* Various typos


Thanks,

Corentin
--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16