Re: [SG16] P2194R0 The character set of C++ source code is Unicode

From: Corentin <corentin.jabot_at_[hidden]>
Date: Tue, 8 Sep 2020 19:55:45 +0200
On Tue, 8 Sep 2020 at 19:21, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:

> On 08/09/2020 16.09, Peter Brett via SG16 wrote:
> > Hi all,
> >
> > If you have already formulated any comments or suggestions with regard
> > to this paper, I'd really appreciate that you share them now so that I
> > have the chance to think about them in advance of tomorrow's SG16 meeting.
>
> The paper talks a lot about "Unicode characters".
> We should be clear whether we really mean
> "characters with an assignment in Unicode"
> (this changes with every revision) or whether
> we mean "code point except surrogate code points"
> which is (I think) equivalent to "Unicode scalar
> value". This is the range 0-0x10ffff minus
> the surrogate code points, and this range is
> stable.
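
The stable range described above can be written down directly. The following is only a minimal sketch; the names is_surrogate and is_unicode_scalar_value are hypothetical helpers, not taken from the paper or from any library:

```cpp
// Sketch only: both function names are hypothetical.
// Surrogate code points occupy U+D800..U+DFFF.
constexpr bool is_surrogate(char32_t cp) {
    return cp >= 0xD800 && cp <= 0xDFFF;
}

// A Unicode scalar value is any code point in 0..0x10FFFF
// that is not a surrogate code point.
constexpr bool is_unicode_scalar_value(char32_t cp) {
    return cp <= 0x10FFFF && !is_surrogate(cp);
}

static_assert(is_unicode_scalar_value(0x0041), "U+0041 is a scalar value");
static_assert(!is_unicode_scalar_value(0xD800), "surrogates are excluded");
static_assert(!is_unicode_scalar_value(0x110000), "beyond the code space");
```

This check does not depend on which code points have an assignment, so it gives the same answer for every Unicode revision.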

Unicode scalar value, with the caveat that the meaning of these values has
to be known when a conversion is happening (which is always the case) or
when a property of the Unicode scalar value is observed (XID_Start for

> Also, I'd like to point out that Unicode apparently
> has expressly declared control characters as
> out-of-scope (because control characters are not
> related to glyphs at all, I guess), but C++ does
> expressly recognize several control characters
> during lexing ("new-line", "whitespace") as well as
> in string-literals. This feels a bit like an
> impedance mismatch.

A few inaccuracies here:

   - White space is not considered a control character
   - What Unicode considers control characters are characters that have no
   "spatial representation", which whitespace does (it is "visible" even if it
   doesn't use ink)
   - Unicode cares about control characters, of which there are many (bidi
   controls, language tags, the null character, and variation selectors are
   control characters with a specific meaning in Unicode - some of them are here)
   - In addition, Unicode reserves a number of code points considered
   control characters, to be used by the program for its own purposes. These are
   the C0 and C1 control character blocks. These exist, are fully documented
   and supported by Unicode but have no semantics (Unicode does however
   provide a default interpretation which maps to ISO 646 and ISO 6429)
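
For reference, the C0 and C1 ranges mentioned in the last point can be sketched as follows. The function names are hypothetical; the ranges are U+0000..U+001F for C0 and U+0080..U+009F for C1, with U+007F (DELETE) sitting between the two:

```cpp
// Sketch only: function names are hypothetical.
constexpr bool is_c0_control(char32_t cp) {
    return cp <= 0x001F;
}

constexpr bool is_c1_control(char32_t cp) {
    return cp >= 0x0080 && cp <= 0x009F;
}

// C0, DELETE (U+007F) and C1 together.
constexpr bool is_c0_c1_or_delete(char32_t cp) {
    return is_c0_control(cp) || cp == 0x007F || is_c1_control(cp);
}

static_assert(is_c0_control(0x000A), "U+000A LINE FEED is in C0");
static_assert(is_c1_control(0x0085), "U+0085 NEXT LINE is in C1");
static_assert(!is_c0_c1_or_delete(0x0020), "U+0020 SPACE is outside both");
```

The predicates only test ranges; they attach no meaning to the code points, which is left to the program.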

> I understand that EBCDIC has a bunch of control
> characters whose number can be mapped to some number
> of a control character in Unicode, but the assumed
> semantics are not preserved. That seems at least
> fragile.

The semantics are preserved. The value of the code point would pass through
Unicode unmodified. The semantics are not observable by the compiler.
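
A minimal sketch of what "pass through unmodified" could look like is below. The function names are hypothetical, and the three table entries are assumptions chosen to match the commonly published mapping for IBM code page 037; a real compiler would use a complete table for its actual source encoding:

```cpp
// Sketch only: names are hypothetical, table entries are assumed
// to match IBM code page 037 and cover just three controls.
constexpr char32_t ebcdic_control_to_unicode(unsigned char b) {
    return b == 0x05 ? 0x0009   // HT -> U+0009
         : b == 0x15 ? 0x0085   // NL -> U+0085
         : b == 0x25 ? 0x000A   // LF -> U+000A
         : 0xFFFD;              // not in this tiny table
}

constexpr unsigned char unicode_to_ebcdic_control(char32_t cp) {
    return cp == 0x0009 ? 0x05
         : cp == 0x0085 ? 0x15
         : cp == 0x000A ? 0x25
         : 0x3F;                // EBCDIC SUB as a fallback
}

// The round trip returns the original byte: the compiler only
// carries the value and never interprets it.
static_assert(unicode_to_ebcdic_control(ebcdic_control_to_unicode(0x15)) == 0x15,
              "EBCDIC NL survives the round trip");
```

Because the mapping is one-to-one for these entries, converting to Unicode and back yields the same byte the source file contained.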

> I'd like to hear from Hubert and other people
> related to non-ASCII environments whether they
> would like to have a door open for e.g. distinguishing
> an EBCDIC control character from a UCN that happens
> to evaluate to the EBCDIC-to-Unicode mapping of
> that EBCDIC control character.

> Jens

Received on 2020-09-08 12:59:26