sg16: Re: [SG16] P2194R0 The character set of C++ source code is Unicode

From: Martinho Fernandes <rmf_at_[hidden]>
Date: Wed, 9 Sep 2020 10:04:02 +0200

On Tue, Sep 8, 2020 at 7:21 PM Jens Maurer via SG16 <sg16_at_[hidden]>
wrote:

> Also, I'd like to point out that Unicode apparently
> has expressly declared control characters as
> out-of-scope (because control characters are not
> related to glyphs at all, I guess), but C++ does
> expressly recognize several control characters
> during lexing ("new-line", "whitespace") as well as
> in string-literals. This feels a bit like an
> impedance mismatch.
>

To be clear, Unicode does assign semantics to a subset of control codes
(see table 23-1 of Unicode 13), and the only ones that C++ recognises that
are not part of that subset are backspace (\b) and alert (\a). AFAICT C++
doesn't really assign semantics to them; it just allows them to be used in
string and character literals.
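
As a rough illustration of the claim above, here is a minimal C++17 sketch
that checks which control characters reachable through simple escape
sequences fall outside that subset. The helper names are hypothetical, and
the code point ranges are my reading of table 23-1 (HT, LF, VT, FF, CR, FS,
GS, RS, US, NEL), so treat them as an assumption rather than a quotation of
the table. It also assumes that char32_t literals such as U'\a' carry the
usual code point values (U+0007 and so on), as they do on mainstream
compilers.

```cpp
#include <array>

// Hypothetical helper: true if `c` is one of the control codes that
// table 23-1 of Unicode 13 is assumed to give semantics to.
constexpr bool has_unicode_semantics(char32_t c) {
    return (c >= 0x09 && c <= 0x0D)    // HT, LF, VT, FF, CR
        || (c >= 0x1C && c <= 0x1F)    // FS, GS, RS, US
        || c == 0x85;                  // NEL
}

// The control characters that C++ simple escape sequences can name.
constexpr std::array<char32_t, 7> cpp_control_escapes = {
    U'\a', U'\b', U'\f', U'\n', U'\r', U'\t', U'\v'};

// Counts the escapes above that are not covered by the assumed table.
constexpr int count_outside_table() {
    int n = 0;
    for (char32_t c : cpp_control_escapes)
        if (!has_unicode_semantics(c)) ++n;
    return n;
}

// Only alert and backspace are expected to fall outside the subset.
static_assert(!has_unicode_semantics(U'\a'));
static_assert(!has_unicode_semantics(U'\b'));
static_assert(count_outside_table() == 2);
```

If the static_asserts compile, the seven escapes split as described: five
inside the subset, and \a and \b outside it.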

But still, I don't think this is an issue. See section 23.1 of Unicode 13:

> In general, the use of control codes constitutes a higher-level
> protocol and is beyond the scope of the Unicode Standard. For example,
> the use of ISO/IEC 6429 control sequences for controlling bidirectional
> formatting would be a legitimate higher-level protocol layered on top of
> the plain text of the Unicode Standard. Higher-level protocols are not
> specified by the Unicode Standard; their existence cannot be assumed without
> a separate agreement between the parties interchanging such data.

The way I see it, even if \b and \a do have semantics in C++, the C++
standard is an instance of such a "separate agreement", and thus can define
any "higher-level protocol" for their use.

Received on 2020-09-09 03:07:47