
sg16


Updated D2558 : "Add @, $, and ` to the basic character set"

From: Steve Downey <sdowney_at_[hidden]>
Date: Wed, 27 Apr 2022 00:38:48 -0400
Uploaded : https://isocpp.org/files/papers/D2558R1.html

New section with implications and consequences.
Please ignore the {add} green below, I've given up fighting between
markdown, html, the paper system and gmail for the evening.

3 Implications and Consequences
<https://isocpp.org/files/papers/D2558R1.html#implications-and-consequences>

Because this proposal does not make these characters available for
syntactic purposes, the changes are limited to how these characters are
encoded today, or are represented in source.
3.1 Literal Encoding
<https://isocpp.org/files/papers/D2558R1.html#literal-encoding>

Adding these characters to the basic character set means each must be
encoded in a single byte, with a positive value when stored in a char. This
is true for all POSIX encoded character sets, as @, $, and ` are part of
the portable character set. This also implies they are available in all
POSIX locales, and in particular the “POSIX” locale, which is equivalent to
the “C” locale. [POSIX
<https://isocpp.org/files/papers/D2558R1.html#ref-POSIX>] See 6. Character
Set
<https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html>
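
As a rough illustration of the single-byte, positive-value requirement, the
following sketch checks it at compile time for the implementation's ordinary
literal encoding. The helper name is_positive_char is hypothetical and not
from the paper; the checks are expected to hold on ASCII-based and
EBCDIC-based implementations, but that is an assumption about the target,
not a guarantee.

```cpp
// Sketch only: is_positive_char is a made-up helper for illustration.
constexpr bool is_positive_char(char c) {
    return c > 0;
}

// Each character literal should have type char (a single code unit).
static_assert(sizeof('@') == sizeof(char), "@ is not a single char");
static_assert(sizeof('$') == sizeof(char), "$ is not a single char");
static_assert(sizeof('`') == sizeof(char), "` is not a single char");

// Three one-byte characters plus the terminating null.
static_assert(sizeof("@$`") == 4, "each character should occupy one byte");

// Each value should be positive when stored in a char.
static_assert(is_positive_char('@'), "@ should be positive as a char");
static_assert(is_positive_char('$'), "$ should be positive as a char");
static_assert(is_positive_char('`'), "` should be positive as a char");
```

If any assertion fails on a given implementation, that implementation's
literal encoding would not satisfy the requirement described above.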
3.2 Runtime Encoding
<https://isocpp.org/files/papers/D2558R1.html#runtime-encoding>

A locale that does not provide for these characters would be
non-conforming. Interpreting the literal encoding in any encoded character
set, including the “C” LC_CTYPE character set if it does not match the
literal encoding, is already at best unspecified. Substitution ciphers are
apparently conforming, although misleading. There is a long history of
interpreting the Yen sign, ¥, as a path separator on Windows exactly
because of these encoding aliasing issues.
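
The aliasing behind the Yen sign example can be sketched with two decoders
for the same byte. The function names below are hypothetical, and the table
is limited to the well-known differences between ASCII and the JIS X 0201
Roman set (0x5C and 0x7E); it is an illustration, not a full converter.

```cpp
// Sketch only: decode one byte to a Unicode code point under two
// single-byte encodings. 0xFFFD stands in for "not handled here".
constexpr char32_t decode_ascii(unsigned char b) {
    return b < 0x80 ? b : 0xFFFD;
}

constexpr char32_t decode_jis_roman(unsigned char b) {
    return b == 0x5C ? 0x00A5      // YEN SIGN instead of REVERSE SOLIDUS
         : b == 0x7E ? 0x203E      // OVERLINE instead of TILDE
         : b < 0x80  ? b
                     : 0xFFFD;
}

// The same byte is a backslash under one interpretation and a Yen sign
// under the other.
static_assert(decode_ascii(0x5C) == 0x005C, "0x5C is a backslash in ASCII");
static_assert(decode_jis_roman(0x5C) == 0x00A5, "0x5C is a Yen sign in JIS Roman");

// The ASCII bytes for @ (0x40), $ (0x24) and ` (0x60) agree under both.
static_assert(decode_ascii(0x40) == decode_jis_roman(0x40), "@ agrees");
static_assert(decode_ascii(0x24) == decode_jis_roman(0x24), "$ agrees");
static_assert(decode_ascii(0x60) == decode_jis_roman(0x60), "` agrees");
```
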
3.3 Source Encoding and Representation
<https://isocpp.org/files/papers/D2558R1.html#source-encoding-and-representation>

There is a rule that characters in the basic character set may not be
expressed as UCNs, unless inside a character or string literal. For C there
are issues for characters in comments. This is not the case for C++. In
non-comment contexts, these characters are currently not allowed in
portable source, so the spelling of the character is irrelevant.
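
To make the UCN point concrete, here is a small sketch showing that, inside
a string literal, spelling these characters as UCNs is expected to produce
the same bytes as spelling them directly. The helper same_text is
hypothetical and written to be usable from C++11 onward.

```cpp
// Sketch only: same_text is a made-up constexpr comparison of two
// null-terminated strings.
constexpr bool same_text(const char* a, const char* b) {
    return *a == *b && (*a == '\0' || same_text(a + 1, b + 1));
}

// U+0040 is @, U+0024 is $, U+0060 is `. UCNs are permitted inside
// character and string literals even for basic characters.
static_assert(same_text("\u0040\u0024\u0060", "@$`"),
              "UCN spelling should match the direct spelling");
```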

For extensions that allow, for example, $ in identifiers, no one outside of
compiler test suites is using a UCN to spell it.

This should break no C++ source.

C++ places no constraints on source encoding. The closest we have is the
in-flight requirement that implementations that accept files be required to
accept UTF-8, and UTF-8 encodes these characters.

Received on 2022-04-27 04:39:02