Re: P2749 (down with "character") and P2736 (Referencing the Unicode Standard) updates

From: Steve Downey <sdowney_at_[hidden]>
Date: Thu, 26 Jan 2023 23:00:51 -0500
On Thu, Jan 26, 2023 at 10:05 PM Fraser Gordon via SG16 <
sg16_at_[hidden]> wrote:

> <snip>
> ---
> Because I've never actually read the lexing bits of the standard closely
> before there's a few things (completely unrelated to this paper) I noticed
> and found surprising:
The wording and machinery in lex predates all the modern encoding work, and
it's probably for the best that hardly any implementor pays strict
attention to it.

> - [lex.ccon]/[lex.string]: *basic-c-char* allows any codepoint except
> apostrophe, backslash and newlines. Lots of interesting control and
> presentation chars (e.g. bidi overrides) are allowed! Ditto for
> *basic-s-char* but less surprising in string literals than character
> literals.
>
Before Jens introduced the translation character set, the standard lexing
quite literally converted everything not in the basic character set to a
UCN, the \uxxxxxxx form, in theory spelled that way during parsing, and
there might be places where it's still not entirely cleaned up? We want to
be very liberal before we have any syntax though, so permissive is mostly

> - [lex.charset]: *n-char* is similar to the above, despite the
> restricted alphabet that Unicode uses to name characters. I assume this is
> for implementation simplicity?
>
Specification simplicity. C++ implementations do much more sensible things
these days. C implementations may not, but we try to be not gratuitously

> - [lex.ccon]: *hexadecimal-escape-sequence* looks wrong. Raised as a CWG
> issue <https://cplusplus.github.io/CWG/issues/2691.html>.
>
Ooops. And we should really fix that before we ship it as it is new.
Octal avoids it by only allowing 3 octal digits. Which I'm fine with, I
don't really want to spell char32_t types in raw octal. Just file

> ---

Received on 2023-01-27 04:01:04