On Thu, Jan 26, 2023 at 10:05 PM Fraser Gordon via SG16 <sg16@lists.isocpp.org> wrote:

<snip>
---

Because I've never actually read the lexing bits of the standard closely before, there are a few things (completely unrelated to this paper) I noticed and found surprising:

The wording and machinery in lex predates all the modern encoding work, and it's probably for the best that hardly any implementor pays strict attention to it.

[lex.ccon]/[lex.string]: basic-c-char allows any codepoint except apostrophe, backslash and newlines. Lots of interesting control and presentation chars (e.g. bidi overrides) are allowed! Ditto for basic-s-char but less surprising in string literals than character literals.

Before Jens introduced the translation character set, the standard's lexing quite literally converted everything not in the basic character set to a UCN (the \uXXXX or \UXXXXXXXX form), so in theory it was spelled that way during parsing, and there might be places where that's still not entirely cleaned up. We want to be very liberal before we have any syntax, though, so permissive is mostly better.
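
To make the old model concrete, here is a rough sketch of what "convert it to a UCN" amounts to for a single code point. The helper name to_ucn is made up for illustration and is not anything from the standard or from an implementation; it just spells a code point in the four-digit form when it fits and the eight-digit form otherwise.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not from the standard or any compiler):
// spell a code point as a universal-character-name.
std::string to_ucn(char32_t cp) {
    char buf[11];  // "\U" + 8 hex digits + terminating NUL
    if (cp <= 0xFFFF) {
        // Short form: backslash, 'u', exactly four hex digits.
        std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(cp));
    } else {
        // Long form: backslash, 'U', exactly eight hex digits.
        std::snprintf(buf, sizeof buf, "\\U%08X", static_cast<unsigned>(cp));
    }
    return buf;
}
```

Under that sketch, a bidi override such as U+202E typed directly inside a literal would have been treated as if it had been written \u202E.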

[lex.charset]: n-char is similar to the above, despite the restricted alphabet that Unicode uses to name characters. I assume this is for implementation simplicity?

Specification simplicity. C++ implementations do much more sensible things these days. C implementations may not, but we try to be not gratuitously incompatible.

[lex.ccon]: hexadecimal-escape-sequence looks wrong. Raised as a CWG issue.

Oops. And we should really fix that before we ship it, as it is new.
Octal avoids it by only allowing 3 octal digits, which I'm fine with. I don't really want to spell char32_t values in raw octal. Just file permissions.
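
For anyone who hasn't run into the difference, a small sketch of how the two classic (non-delimited) escape forms behave. The array names are mine and purely illustrative. It only shows the long-standing behaviour: an octal escape stops after three digits, while a hexadecimal escape takes every hex digit that follows, so the usual trick is to split the literal.

```cpp
// Octal: at most three digits are consumed, so this is the escape \123
// followed by the ordinary character '4', plus the terminating NUL.
constexpr char octal_then_digit[] = "\1234";
static_assert(sizeof(octal_then_digit) == 3, "escape, '4', NUL");
static_assert(octal_then_digit[0] == '\123', "first element is the octal escape");
static_assert(octal_then_digit[1] == '4', "fourth digit is a separate character");

// Hexadecimal: there is no digit limit, so "\x412" would be one escape with
// value 0x412. Splitting the literal ends the escape; adjacent literals are
// concatenated after the escapes have been interpreted.
constexpr char hex_split[] = "\x41" "2";
static_assert(sizeof(hex_split) == 3, "escape, '2', NUL");
static_assert(hex_split[0] == '\x41', "first element is the hex escape");
static_assert(hex_split[1] == '2', "the '2' is a separate character");
```

If this compiles, the static_asserts have already checked the claims; nothing needs to run.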

---