Date: Wed, 27 Apr 2022 00:38:48 -0400
Uploaded : https://isocpp.org/files/papers/D2558R1.html
New section with implications and consequences.
Please ignore the {add} green below, I've given up fighting between
markdown, html, the paper system and gmail for the evening.
3 Implications and Consequences
<https://isocpp.org/files/papers/D2558R1.html#implications-and-consequences>
Because this proposal is not making these characters available for
syntactic purposes, the changes are limited to how these characters are
encoded today and how they are represented in source.
3.1 Literal Encoding
<https://isocpp.org/files/papers/D2558R1.html#literal-encoding>
Adding these characters to the basic character set means these will have to
be encoded in a single byte, with positive value when used as a char. This
is true for all POSIX encoded character sets, as @, $, and ` are part of
the portable character set. This also implies they are available in all
POSIX locales, and in particular the “POSIX” locale, which is equivalent to
the “C” locale. [POSIX
<https://isocpp.org/files/papers/D2558R1.html#ref-POSIX>] See 6. Character
Set
<https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html>
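As a rough illustration of the single-byte, positive-value requirement, the checks below are a minimal sketch and not part of the paper's wording. The helper name is_positive_as_char is hypothetical, and the sketch assumes an implementation that already accepts these characters in character literals, as current compilers do in practice.

```cpp
// Sketch only: is_positive_as_char is a hypothetical helper, not paper wording.
// It reports whether a value is strictly positive when held in a plain char.
constexpr bool is_positive_as_char(char c) {
    return c > 0;
}

// An ordinary character literal has type char, so it occupies one byte.
static_assert(sizeof('@') == 1, "'@' is a single byte");
static_assert(sizeof('$') == 1, "'$' is a single byte");
static_assert(sizeof('`') == 1, "'`' is a single byte");

// Each character must have a positive value as a char, regardless of
// whether plain char is signed on the implementation.
static_assert(is_positive_as_char('@'), "'@' is positive as a char");
static_assert(is_positive_as_char('$'), "'$' is positive as a char");
static_assert(is_positive_as_char('`'), "'`' is positive as a char");
```

The checks do not pin the characters to particular values, so they say nothing about which literal encoding is in use.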
3.2 Runtime Encoding
<https://isocpp.org/files/papers/D2558R1.html#runtime-encoding>
A locale that does not provide for these characters would be
non-conforming. Interpreting the literal encoding in any encoded character
set, including the “C” LC_CTYPE character set if it does not match the
literal encoding, is already at best unspecified. Substitution ciphers are
apparently conforming, although misleading. There is a long history of
interpreting the Yen sign, ¥, as a path separator on Windows exactly
because of these encoding aliasing issues.
3.3 Source Encoding and Representation
<https://isocpp.org/files/papers/D2558R1.html#source-encoding-and-representation>
There is a rule that characters in the basic character set may not be
expressed as UCNs, unless inside a character or string literal. For C there
are issues for characters in comments. This is not the case for C++. In
non-comment contexts, these characters are currently not allowed in
portable source, so the spelling of the character is irrelevant.
For extensions that allow, for example, $ in identifiers, no one outside of
compiler test suites is using a UCN to spell it.
This should break no C++ source.
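To make the literal case concrete, here is a small sketch of the spelling that remains permitted: a UCN for one of these characters inside a character or string literal. The helper same_text and the sample strings are hypothetical, and the sketch assumes a compiler that accepts these characters directly in literals.

```cpp
// Sketch only: same_text is a hypothetical helper that compares two
// null-terminated strings at compile time (written recursively so it is
// also valid in C++11).
constexpr bool same_text(const char* a, const char* b) {
    return *a == *b && (*a == '\0' || same_text(a + 1, b + 1));
}

// Inside a character literal, the UCN and the direct spelling name the
// same character.
static_assert('\u0040' == '@', "U+0040 is @");
static_assert('\u0024' == '$', "U+0024 is $");
static_assert('\u0060' == '`', "U+0060 is `");

// The same holds inside a string literal.
static_assert(same_text("user\u0040example", "user@example"),
              "UCN and direct spelling produce the same string");
```

Spelling any of these characters as a UCN outside a literal is the case the rule above excludes.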
C++ places no constraints on source encoding. The closest we have is the
in-flight requirement that implementations that accept files be required to
accept UTF-8, and UTF-8 encodes these characters.
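As a sketch of the last point, UTF-8 encodes each of the three characters as a single code unit equal to its ASCII value. The helper is_single_unit is hypothetical and is only there to keep the checks readable.

```cpp
// Sketch only: is_single_unit is a hypothetical helper. In UTF-8 a code
// unit below 0x80 is a complete one-byte encoding of a character.
constexpr bool is_single_unit(unsigned char unit) {
    return unit < 0x80;
}

// Three one-byte code units plus the terminating null.
static_assert(sizeof(u8"@$`") == 4, "each character is one UTF-8 code unit");

// The code units are the familiar ASCII values.
static_assert(u8"@"[0] == 0x40, "@ is 0x40 in UTF-8");
static_assert(u8"$"[0] == 0x24, "$ is 0x24 in UTF-8");
static_assert(u8"`"[0] == 0x60, "` is 0x60 in UTF-8");
static_assert(is_single_unit(u8"@"[0]), "@ needs no continuation bytes");
```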
Received on 2022-04-27 04:39:02