On Tue, 8 Sep 2020 at 19:21, Jens Maurer <Jens.Maurer@gmx.net> wrote:
On 08/09/2020 16.09, Peter Brett via SG16 wrote:
> Hi all,
> If you have already formulated any comments or suggestions with regard to this
> paper, I'd really appreciate that you share them now so that I have the chance
> to think about them in advance of tomorrow's SG16 meeting.

The paper talks a lot about "Unicode characters".

We should be clear whether we really mean
"characters with an assignment in Unicode"
(this changes with every revision) or whether
we mean "code point except surrogate code points"
which is (I think) equivalent to "Unicode scalar
value".  This is the range 0-0x10ffff minus
the surrogate code points, and this range is

Unicode scalar value, with the caveat that the meaning of these values has to be known when a conversion happens (which is always the case) or when a property of the Unicode scalar value is observed (XID_Start, for example).
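
To make the range concrete, here is a minimal C++17 sketch of the "code point except surrogate code points" check described above. The helper names are hypothetical and are not taken from the paper or the standard; only the numeric bounds (0 to 0x10FFFF, surrogates 0xD800 to 0xDFFF) come from Unicode.

```cpp
// Hypothetical helpers, for illustration only.

// Surrogate code points occupy U+D800..U+DFFF.
constexpr bool is_surrogate(char32_t cp) {
    return cp >= 0xD800 && cp <= 0xDFFF;
}

// A Unicode scalar value is any code point in 0..0x10FFFF
// that is not a surrogate code point.
constexpr bool is_unicode_scalar_value(char32_t cp) {
    return cp <= 0x10FFFF && !is_surrogate(cp);
}

// Compile-time checks of the boundaries.
static_assert(is_unicode_scalar_value(U'A'));
static_assert(is_unicode_scalar_value(0xD7FF));
static_assert(!is_unicode_scalar_value(0xD800));
static_assert(!is_unicode_scalar_value(0xDFFF));
static_assert(is_unicode_scalar_value(0xE000));
static_assert(is_unicode_scalar_value(0x10FFFF));
static_assert(!is_unicode_scalar_value(0x110000));
```

Note that this check says nothing about whether a character is assigned at that code point; it does not change between Unicode revisions.
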
Also, I'd like to point out that Unicode apparently
has expressly declared control characters as
out-of-scope (because control characters are not
related to glyphs at all, I guess), but C++ does
expressly recognize several control characters
during lexing ("new-line", "whitespace") as well as
in string-literals.  This feels a bit like an
impedance mismatch.

A few inaccuracies here:
  • White space characters are not considered control characters.
  • What Unicode considers control characters are characters that have no "spatial representation", which white space does have (it is "visible" even if it doesn't use ink).
  • Unicode does care about control characters, and there are many of them: bidi controls, language tags, the null character, and variation selectors are control characters with a specific meaning in Unicode. Some of them are listed at https://www.fileformat.info/info/unicode/category/Cc/list.htm
  • In addition, Unicode reserves a number of code points considered control characters, to be used by the program for its own purposes. These are the C0 and C1 control character blocks. They exist, are fully documented and supported by Unicode, but have no semantics of their own (Unicode does, however, provide a default interpretation which maps to ISO 646 and ISO 6429).
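
As an illustration of the C0 and C1 blocks mentioned above, here is a small C++17 sketch. It covers only the code points with General_Category Cc (the ones on the linked list), which are U+0000..U+001F and U+007F..U+009F; the function names are hypothetical.

```cpp
// Hypothetical helpers, for illustration only.

// C0 controls: U+0000..U+001F.
constexpr bool is_c0_control(char32_t cp) {
    return cp <= 0x1F;
}

// C1 controls: U+0080..U+009F.
constexpr bool is_c1_control(char32_t cp) {
    return cp >= 0x80 && cp <= 0x9F;
}

// General_Category Cc: C0, DELETE (U+007F), and C1.
constexpr bool is_gc_control(char32_t cp) {
    return is_c0_control(cp) || cp == 0x7F || is_c1_control(cp);
}

static_assert(is_gc_control(0x0000));   // NULL
static_assert(is_gc_control(0x000A));   // LINE FEED is a C0 control
static_assert(is_gc_control(0x0085));   // NEXT LINE is a C1 control
static_assert(!is_gc_control(0x0020));  // SPACE is not in Cc
static_assert(!is_gc_control(0x00A0));  // first code point after C1
```

A compiler could carry any of these values through a string literal unchanged without ever needing to know what the program intends them to mean.
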
I understand that EBCDIC has a bunch of control
characters whose number can be mapped to some number
of a control character in Unicode, but the assumed
semantics are not preserved.  That seems at least

The semantics are preserved: the value of the code point would pass through Unicode unmodified. The semantics are not observable by the compiler.

I'd like to hear from Hubert and other people
related to non-ASCII environments whether they
would like to have a door open for e.g. distinguishing
an EBCDIC control character from a UCN that happens
to evaluate to the EBCDIC-to-Unicode mapping of
that EBCDIC control character.