sg16: Re: [SG16] P2194R0 The character set of C++ source code is Unicode

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Tue, 8 Sep 2020 20:00:38 +0200

On 08/09/2020 19.55, Corentin wrote:
>
>
> On Tue, 8 Sep 2020 at 19:21, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>
> On 08/09/2020 16.09, Peter Brett via SG16 wrote:
> > Hi all,
> >
> > If you have already formulated any comments or suggestions with regard to this
> > paper, I'd really appreciate that you share them now so that I have the chance
> > to think about them in advance of tomorrow's SG16 meeting.
>
> The paper talks a lot about "Unicode characters".
>
> We should be clear whether we really mean
> "characters with an assignment in Unicode"
> (this changes with every revision) or whether
> we mean "code point except surrogate code points"
> which is (I think) equivalent to "Unicode scalar
> value". This is the range 0-0x10ffff minus
> the surrogate code points, and this range is
> stable.
>
> Unicode scalar value, with the caveat that the meaning of these values has to be known when a conversion happens (which is always the case) or when a property of the Unicode scalar value is observed (XID_Start, for example).

Then P2194R0 should use the intended term here, instead of "Unicode character".
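
A minimal sketch of the stable definition discussed above, assuming the
intended term is "Unicode scalar value" (any code point in 0-0x10ffff that
is not a surrogate code point). The function names are hypothetical and are
not taken from P2194R0 or from the standard.

```cpp
// Hypothetical helpers; names are illustrative only.

// Surrogate code points occupy U+D800..U+DFFF.
constexpr bool is_surrogate(char32_t cp) {
    return cp >= 0xD800 && cp <= 0xDFFF;
}

// A Unicode scalar value is any code point in 0..0x10FFFF
// except the surrogate code points. This set does not depend
// on which characters a given Unicode revision has assigned.
constexpr bool is_unicode_scalar_value(char32_t cp) {
    return cp <= 0x10FFFF && !is_surrogate(cp);
}

static_assert(is_unicode_scalar_value(U'A'), "assigned character");
static_assert(is_unicode_scalar_value(0x10FFFF), "upper bound is included");
static_assert(!is_unicode_scalar_value(0xD800), "surrogates are excluded");
static_assert(!is_unicode_scalar_value(0x110000), "beyond the code space");
```

Note that the check says nothing about whether a value is assigned in any
particular Unicode revision; that is the distinction drawn above.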

> Also, I'd like to point out that Unicode apparently
> has expressly declared control characters as
> out-of-scope (because control characters are not
> related to glyphs at all, I guess), but C++ does
> expressly recognize several control characters
> during lexing ("new-line", "whitespace") as well as
> in string-literals. This feels a bit like an
> impedance mismatch.
>
>
> A few inaccuracies here:
>
> * Whitespace characters are not considered control characters

I believe "horizontal tab" is a control character, and it's
whitespace in the C++ sense. Maybe there are more of these.
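
A rough sketch of that overlap, assuming "control character" means Unicode
general category Cc (U+0000..U+001F and U+007F..U+009F), that new-line is
represented as U+000A, and that "whitespace in the C++ sense" is space,
horizontal tab, vertical tab, form feed, and new-line. The function names
are hypothetical.

```cpp
// Hypothetical helpers; names are illustrative only.

// General category Cc: the C0 controls, DELETE, and the C1 controls.
constexpr bool is_cc_control(char32_t cp) {
    return cp <= 0x1F || (cp >= 0x7F && cp <= 0x9F);
}

// Whitespace as recognized during C++ lexing (assumption: new-line
// is U+000A; the real mapping is implementation-defined).
constexpr bool is_cpp_whitespace(char32_t cp) {
    return cp == 0x20    // space
        || cp == 0x09    // horizontal tab
        || cp == 0x0A    // new-line (assumed U+000A)
        || cp == 0x0B    // vertical tab
        || cp == 0x0C;   // form feed
}

// Horizontal tab is both C++ whitespace and a Cc control character.
static_assert(is_cpp_whitespace(U'\t') && is_cc_control(U'\t'), "HT");
// The same holds for new-line, vertical tab, and form feed.
static_assert(is_cpp_whitespace(U'\n') && is_cc_control(U'\n'), "LF");
static_assert(is_cpp_whitespace(U'\v') && is_cc_control(U'\v'), "VT");
static_assert(is_cpp_whitespace(U'\f') && is_cc_control(U'\f'), "FF");
// Space is whitespace but not a control character.
static_assert(is_cpp_whitespace(U' ') && !is_cc_control(U' '), "SP");
```

Under these assumptions, four of the five C++ whitespace characters are
control characters; only space is not.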

> * What Unicode considers control characters are characters that have no "spatial representation", which whitespace does (it is "visible" even if it doesn't use ink)
> * Unicode cares about control characters, of which there are many (bidi controls, language tags, the null character, and variation selectors are control characters with a specific meaning in Unicode); some of them are listed at https://www.fileformat.info/info/unicode/category/Cc/list.htm
> * In addition, Unicode reserves a number of code points considered control characters, to be used by the program for its own purposes. These are the C0 and C1 control character blocks. They exist and are fully documented and supported by Unicode but have no semantic (Unicode does, however, provide a default interpretation that maps to ISO 646 and ISO 6429)

I'm sorry, "fully documented" and "no semantic" don't add up for me,
except as a glorified way of saying "out of scope".

Jens

Received on 2020-09-08 13:04:14