Date: Wed, 27 Apr 2022 07:55:11 +0000
Hi Steve,
Thank you for these updates.
I had been hoping that something would be added to the paper regarding consequences for h-char-sequence and q-char-sequence in #include directives.
Best wishes,
Peter
From: SG16 <sg16-bounces_at_[hidden]> On Behalf Of Steve Downey via SG16
Sent: 27 April 2022 05:39
To: SG16 <sg16_at_[hidden]>
Cc: Steve Downey <sdowney_at_[hidden]>
Subject: [SG16] Updated D2558 : "Add @, $, and ` to the basic character set"
Uploaded: https://isocpp.org/files/papers/D2558R1.html
New section with implications and consequences.
Please ignore the {add} green below, I've given up fighting between markdown, html, the paper system and gmail for the evening.
3 Implications and Consequences
Because this proposal is not making these characters available for syntactic purposes, the changes are limited to how these characters are encoded today, or how they are represented in source.
3.1 Literal Encoding
Adding these characters to the basic character set means they will have to be encoded in a single byte, with a positive value when used as a char. This is true for all POSIX encoded character sets, as @, $, and ` are part of the portable character set. This also implies they are available in all POSIX locales, and in particular the “POSIX” locale, which is equivalent to the “C” locale. [POSIX<https://isocpp.org/files/papers/D2558R1.html#ref-POSIX>] See 6. Character Set<https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html>
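As a rough illustration of the requirement above, the following sketch checks the positive-value property at compile time. It is not taken from the paper: the helper name is_positive_char is hypothetical, and the assertions only report what the compiler in use does with its own ordinary literal encoding.

```cpp
// Hypothetical helper (not from the paper): true when c has a positive
// value when stored in a plain char, as the basic character set requires.
constexpr bool is_positive_char(char c) {
    return c > 0;
}

// An ordinary character literal with one character has type char,
// so each of these occupies exactly one byte.
static_assert(sizeof('@') == 1, "'@' should be a single byte");
static_assert(sizeof('$') == 1, "'$' should be a single byte");
static_assert(sizeof('`') == 1, "'`' should be a single byte");

// The three characters under discussion should all be positive as char.
static_assert(is_positive_char('@'), "'@' should be positive as a char");
static_assert(is_positive_char('$'), "'$' should be positive as a char");
static_assert(is_positive_char('`'), "'`' should be positive as a char");
```

If any of these assertions failed on an implementation, that implementation's literal encoding would not meet the requirement described here.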
3.2 Runtime Encoding
A locale that does not provide for these characters would be non-conforming. Interpreting the literal encoding in any encoded character set, including the “C” LC_CTYPE character set if it does not match the literal encoding, is already at best unspecified. Substitution ciphers are apparently conforming, although misleading. There is a long history of interpreting the Yen sign, ¥, as a path separator on Windows exactly because of these encoding aliasing issues.
3.3 Source Encoding and Representation
There is a rule that characters in the basic character set may not be expressed as UCNs, unless inside a character or string literal. For C there are issues for characters in comments. This is not the case for C++. In non-comment contexts, these characters are currently not allowed in portable source, so the spelling of the character is irrelevant.
For extensions that allow, for example, $ in identifiers, no one outside of compiler test suites is using a UCN to spell it.
This should break no C++ source.
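To make the spelling point concrete, here is a small sketch, assuming a compiler whose ordinary literal encoding can represent $, @, and `. Inside a string literal, a UCN and the directly typed character denote the same character, so the two spellings should yield the same code unit. The function name same_first_unit is hypothetical and not from the paper.

```cpp
// Hypothetical helper (not from the paper): compares the first code unit
// of two ordinary string literals.
constexpr bool same_first_unit(const char* a, const char* b) {
    return a[0] == b[0];
}

// UCN spelling inside a string literal versus the character typed directly.
static_assert(same_first_unit("\u0024", "$"), "U+0024 should match '$'");
static_assert(same_first_unit("\u0040", "@"), "U+0040 should match '@'");
static_assert(same_first_unit("\u0060", "`"), "U+0060 should match '`'");
```

Only the in-literal spelling is exercised here. Spelling these characters as UCNs outside literals is the case the rule above concerns.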
C++ places no constraints on source encoding. The closest we have is the in-flight requirement that implementations that accept files be required to accept UTF-8, and UTF-8 encodes these characters.
Received on 2022-04-27 07:55:18