sg16

From: Fraser Gordon <fraserjgordon+cpp_at_[hidden]> · Date: Thu, 26 Jan 2023 22:05:37 -0500

---
I'm going through D2749 and have some comments/questions:
   - "Unicode code point" feels awkward to read (and write!). Is it worth
   defining a term so it can be shortened? In [lex.ext] you've used "code
   point" without "Unicode".
   - [lex.charset]: Table 1 (basic character set) retains "character", as
   does the text referring to it, but Table 3 (additional control characters
   in the basic literal character set) replaces "character" with "Unicode code
   point". The text referencing Table 3 describes the contents of Table 1 as
   "abstract characters" and those of Table 3 as "control characters". The
   text also contains a note saying that the u+xxxx values don't have a
   meaning in this context. Based on that note and the normative text, I think
   the column heading on both tables and the references in the text should be
   "abstract character" consistently.
   - as an alternative to "abstract character" above, I think "Unicode code
   point" would also work if used everywhere due to all of the abstract
   characters in the basic set mapping to a single codepoint. It would become
   incorrect in the (unlikely?) event that the basic set expands to add
   abstract characters that are represented by combining sequences.
   - [diff.cpp14.lex] I'm not sure this section makes much sense now. It
   ends (with your changes) "as part of the implementation-defined mapping
   from input source file characters to Unicode." But is that still
   implementation-defined? With the removal of 'translation character set' and
   with the requirement to accept UTF-8 input files, the conversion from UTF-8
   'characters' to Unicode is defined by the Unicode standard. Instead of
   replacing "the translation character set" with "Unicode", would it be
   better to strike the whole "as part of the..." clause from that sentence?
---
Expanding on the 2nd and 3rd bullets above: I'm on the fence in terms of
usage of "Unicode code point" vs "abstract character" (or "Unicode abstract
character") in some places. The code point phrasing is definitely
appropriate where we're talking about e.g. named escape sequences and it's
probably how everything will be implemented anyway, because the requirement
that everything be in Normalization Form C (NFC) means each abstract character has a 1:1
mapping with a combining sequence (I think?).
As an example, the 2nd paragraph of [lex.pptoken] says:
> [...] If any Unicode code point not in the basic character set
> matches the last category, the program is ill-formed.
I think it'd be more correct to use "abstract character" rather than "code
point" here because even though the mechanism will be codepoints, the
underlying thing being talked about is the abstract character.
Or, to put it another way: what is "and͓"?
As specified & implemented, it's an identifier formed from 4 codepoints. So
the mechanistic answer is that it's an identifier in the same way that
"andd" is; the keyword "and" just happens to be a prefix.
But for a human looking at it, they'd say either "it's different because
'd' and 'd͓' are qualitatively different (i.e. they're different abstract
characters); the shared prefix with 'and' is 'an'" or "it's the same as
'and'; the ◌͓ mark is not significant". (We've clearly decided combining
marks are significant in C++ so we can ignore the latter interpretation.)
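
To make the mechanistic view concrete, here is a minimal sketch (not taken from the paper) that counts code points in UTF-8 text. It assumes the mark in the example is U+0353 COMBINING X BELOW, written out as the raw bytes 0xCD 0x93, and that the input is well-formed UTF-8; `count_code_points` and `and_with_mark` are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <string_view>

// Counts code points in well-formed UTF-8 by counting every byte that is
// not a continuation byte (continuation bytes have the form 10xxxxxx).
constexpr std::size_t count_code_points(std::string_view utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}

// "and" followed by U+0353 (assumed to be the mark used in the example),
// spelled as escaped UTF-8 bytes so the source stays plain ASCII.
constexpr std::string_view and_with_mark = "and\xCD\x93";

static_assert(count_code_points("and") == 3);
static_assert(count_code_points(and_with_mark) == 4);   // 4 code points
static_assert(and_with_mark.substr(0, 3) == "and");     // "and" is only a prefix
static_assert(and_with_mark != "and");                  // so it is a different token
```

Compared code point by code point, the marked spelling differs from the keyword in length alone, which is the same reason "andd" is not the keyword.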
Having written this out, I'm mostly convinced that the mechanistic
(codepoint) and human (abstract character) interpretations will produce the
same results (I'd be interested in any ideas for counter-examples!) so I
don't think there would be a behavioural change with either. I still think
the intent matters though.
This may have all been discussed before (if so, apologies for making you
read all of that) and it was decided that codepoint was the better term
and/or that we definitely want to specify in terms of codepoints because of
implementation constraints (or even that abstract characters are just too
hard to nail down in meaningful standardese). It's hard to tell though
because the current wording uses "character" everywhere and by changing to
"Unicode code point", we're coming down on one side of that fence and I
want to be sure that's a deliberate decision.
---
Because I've never actually read the lexing bits of the standard closely
before, there are a few things (completely unrelated to this paper) I noticed
and found surprising:
   - [lex.ccon]/[lex.string]: *basic-c-char* allows any codepoint except
   apostrophe, backslash and newlines. Lots of interesting control and
   presentation chars (e.g. bidi overrides) are allowed! Ditto for
   *basic-s-char* but less surprising in string literals than character
   literals.
   - [lex.charset]: *n-char* is similar to the above, despite the
   restricted alphabet that Unicode uses to name characters. I assume this is
   for implementation simplicity?
   - [lex.ccon]: *hexadecimal-escape-sequence* looks wrong. Raised as a CWG
   issue <https://cplusplus.github.io/CWG/issues/2691.html>.
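
As an illustration of the first bullet above, the sketch below checks whether UTF-8 text contains any explicit bidirectional formatting code points (U+202A..U+202E and U+2066..U+2069). It is only a sketch under stated assumptions: the input is taken to be well-formed UTF-8, malformed input is not diagnosed, and `is_bidi_control` and `contains_bidi_control` are hypothetical helpers, not anything the standard or the paper specifies. The override is written as escaped bytes so the listing itself displays safely.

```cpp
#include <cstddef>
#include <string_view>

// True for the explicit bidirectional formatting code points:
// U+202A..U+202E (embeddings and overrides) and U+2066..U+2069 (isolates).
constexpr bool is_bidi_control(char32_t cp) {
    return (cp >= 0x202A && cp <= 0x202E) || (cp >= 0x2066 && cp <= 0x2069);
}

// Hypothetical helper: decodes UTF-8 that is assumed to be well-formed and
// reports whether any decoded code point is a bidi control.
constexpr bool contains_bidi_control(std::string_view utf8) {
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::size_t len = 1;
        char32_t cp = lead;
        if (lead >= 0xF0)      { len = 4; cp = lead & 0x07; }
        else if (lead >= 0xE0) { len = 3; cp = lead & 0x0F; }
        else if (lead >= 0xC0) { len = 2; cp = lead & 0x1F; }
        // Fold in the payload bits of each continuation byte.
        for (std::size_t k = 1; k < len && i + k < utf8.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        }
        if (is_bidi_control(cp)) return true;
        i += len;
    }
    return false;
}

// U+202E RIGHT-TO-LEFT OVERRIDE is the byte sequence E2 80 AE in UTF-8.
// The literal is split so the following letter is not read as a hex digit.
static_assert(contains_bidi_control("x\xE2\x80\xAE" "y"));
static_assert(!contains_bidi_control("plain ASCII"));
```

A check along these lines could live in a linter or a compiler warning; the grammar itself, as noted above, accepts these code points in both character and string literals.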
---
This was meant to be a short email when I started writing it :/
Fraser
On Thu, 26 Jan 2023 at 04:42, Corentin via SG16 <sg16_at_[hidden]>
wrote:
> Hey folks:
>
> I published https://isocpp.org/files/papers/P2736R1.pdf - with the
> changes requested by SG16 yesterday as part of the forwarding poll.
> Interesting changes were
>
> * to consistently mention UAX XX of The Unicode Standard
> * __STDC_ISO_10646__ : An integer literal of the form yyyymmL (for
> example, 199712L). If this symbol is defined, then its value is
> implementation-defined
> * Specifically mention UTF-8, UTF-16 and UTF-32 instead of Unicode encoding
>
> https://isocpp.org/files/papers/D2749R0.pdf down with character:
>
> - Remove the footnote about old linkers
> - Apply the character -> codepoint changes to the annexes and [diff]
> sections
> - remove a stale cross reference in phase 1 of translations
> - various typos
>
>
> Thanks,
>
> Corentin