sg16: Re: [SG16] P2194R0 The character set of C++ source code is Unicode

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Tue, 8 Sep 2020 19:21:32 +0200

On 08/09/2020 16.09, Peter Brett via SG16 wrote:
> Hi all,
>
> If you have already formulated any comments or suggestions with regard to this
> paper, I'd really appreciate it if you shared them now so that I have the chance
> to think about them in advance of tomorrow's SG16 meeting.

The paper talks a lot about "Unicode characters".

We should be clear whether we really mean
"characters with an assignment in Unicode"
(this set changes with every revision) or whether
we mean "code point except surrogate code points",
which is (I think) equivalent to "Unicode scalar
value". This is the range 0 through 0x10FFFF,
excluding the surrogate code points 0xD800 through
0xDFFF, and this range is stable.
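
To make the second reading concrete, here is a
minimal sketch of the range check. The function
names are hypothetical and are taken neither from
the paper nor from the standard; the sketch only
assumes the numeric ranges given above.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration only.

// A code point is any value in 0 through 0x10FFFF.
constexpr bool is_code_point(std::uint32_t v) {
    return v <= 0x10FFFF;
}

// Surrogate code points occupy 0xD800 through 0xDFFF.
constexpr bool is_surrogate(std::uint32_t v) {
    return v >= 0xD800 && v <= 0xDFFF;
}

// A scalar value is a code point that is not a surrogate.
// The result does not depend on which characters a given
// Unicode revision has assigned.
constexpr bool is_scalar_value(std::uint32_t v) {
    return is_code_point(v) && !is_surrogate(v);
}

static_assert(is_scalar_value(0x41));       // 'A'
static_assert(is_scalar_value(0x10FFFF));   // last code point
static_assert(!is_scalar_value(0xD800));    // first surrogate
static_assert(!is_scalar_value(0x110000));  // beyond the range
```

Note that the check involves no table of assigned
characters, which is why it stays the same from one
Unicode revision to the next.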

Also, I'd like to point out that Unicode apparently
has expressly declared control characters as
out-of-scope (because control characters are not
related to glyphs at all, I guess), but C++ does
expressly recognize several control characters
during lexing ("new-line", "whitespace") as well as
in string-literals. This feels a bit like an
impedance mismatch.

I understand that EBCDIC has a bunch of control
characters whose code values can be mapped to the
code points of control characters in Unicode, but
the assumed semantics are not preserved. That seems
fragile at least.

I'd like to hear from Hubert and other people
working with non-ASCII environments whether they
would like to keep a door open for e.g. distinguishing
an EBCDIC control character from a UCN that happens
to evaluate to the EBCDIC-to-Unicode mapping of
that EBCDIC control character.

Jens

Received on 2020-09-08 12:25:10