Re: SG16 meeting summary for April 13th, 2022

From: Steve Downey <sdowney_at_[hidden]>
Date: Wed, 27 Apr 2022 11:47:58 -0400
On Wed, Apr 27, 2022 at 11:32 AM Tom Honermann <tom_at_[hidden]> wrote:

> On 4/27/22 12:05 AM, Steve Downey wrote:
> On Tue, Apr 26, 2022 at 6:18 PM Corentin Jabot <corentinjabot_at_[hidden]>
> wrote:
>> On Tue, Apr 26, 2022 at 10:39 PM Steve Downey via SG16 <
>> sg16_at_[hidden]> wrote:
>>> <snip/>

>>> That's what I'm planning. The complicated bit is the implications for the
>>> "C" locale; it's not an issue for the "POSIX" locale. That said, I don't
>>> think it's a real-world concern these days that the default encoded
>>> character set doesn't have what POSIX calls the portable character set.
>>> Tracing the requirements is tedious because C++ defers to C, which in turn
>>> defers to the ISO version of the POSIX specification for much of the locale
>>> machinery.
>> Hey Steve,
>> Can you please explain in the paper what this buys us?
> Aside from C compatibility, I think there are two benefits:
> - The ability to (portably) use these characters in a character
> literal (this requires the single-code-unit encoding).
> - The ability to (portably) use these characters in string literals
> without having to use UCNs (thus avoiding pedantic warnings in some tools
> about use of a non-portable character).
>> * It doesn't change the set of identifiers (nor should it)
>> * It makes \N{DOLLAR} ill-formed. Is that desirable?
> Only in syntactic contexts, if I have understood correctly. You could use
> that in a literal, but not in an identifier.
>> * It makes a hypothetical implementation that would not support $ in the
>> literal encoding non-conforming, even if no such character is present in
>> any source file. Is that desirable?
> C and POSIX already require this; POSIX does so in outright normative text.
> You can't even write a POSIX charmap for a locale without specifying $.
> For reference, the POSIX portable character set is defined here:
> <https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html>
POSIX requires them, and the C standard refers to the ISO POSIX standard
for how locales and charmaps work. It's possible they didn't intend to
require that locales encode the portable character set, but that seems to
be the end result.
This isn't adding them to the basic character set; rather, the way locale
character sets are defined makes these characters required to be
available in C locales. C also still has the messy imprecision about
literal encoding vs locale, so it's harder to talk about.
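
As an illustration of the literal-related benefits quoted above and of the
locale requirement, here is a minimal C++ sketch. It assumes a literal
encoding and a "C" locale in which $, @, and ` are each encoded as a single
code unit (true of ASCII-based implementations, but an assumption here). The
helper names are hypothetical and do not come from the paper.

```cpp
#include <cassert>
#include <clocale>
#include <cstdlib>

// The three characters under discussion: U+0024, U+0040, U+0060.
// Inside a literal, a UCN names the same character as the direct spelling,
// so these hold wherever the literal encoding can represent them.
static_assert('$' == '\u0024', "DOLLAR SIGN");
static_assert('@' == '\u0040', "COMMERCIAL AT");
static_assert('`' == '\u0060', "GRAVE ACCENT");

// Hypothetical helper: true if `c` is encoded as exactly one byte in the
// currently selected C locale.
inline bool is_single_byte_in_current_locale(char c) {
    const char buf[2] = {c, '\0'};
    (void)std::mblen(nullptr, 0);  // reset any shift state
    return std::mblen(buf, 1) == 1;
}

// Hypothetical helper: selects the "C" locale and checks all three
// characters. Note that this changes the global locale.
inline bool c_locale_has_dollar_at_grave() {
    if (std::setlocale(LC_ALL, "C") == nullptr) return false;
    return is_single_byte_in_current_locale('$') &&
           is_single_byte_in_current_locale('@') &&
           is_single_byte_in_current_locale('`');
}

// Usage sketch: aborts (in builds without NDEBUG) if the check fails.
inline void check_portable_characters() {
    assert(c_locale_has_dollar_at_grave());
}
```

The static_asserts cover the literal-encoding side at compile time; the
runtime helpers cover the locale side, which is where C is less precise.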

> IBM documents the invariant subset of EBCDIC here
> <https://www.ibm.com/docs/en/i/7.1?topic=sets-invariant-character-set>
> (for IBMi) along with at least some of the EBCDIC code pages that do not
> align with it. I find the presentation on Wikipedia
> <https://en.wikipedia.org/wiki/EBCDIC#Code_page_layout> easier to view.
> There are a few interesting things to note:
> 1. The following current members of the basic character set are not in
> the invariant EBCDIC set:
> |, !, #, ~, ^, [, ], {, }, \
> 2. Alternative tokens are defined for these with the following
> exceptions:
> \
> 3. @ and $ are both members of the portable EBCDIC set
> <https://www.ibm.com/docs/en/i/7.1?topic=sets-portable-character-set>, but
> ` is not. This is weird as this set is apparently intended to align with
> the POSIX portable character set, but includes U+00B4 (ACUTE ACCENT) and
> not U+0060 (GRAVE ACCENT). I suppose that could be a doc bug.
> 4. If we were to start using the new characters outside of literals,
> adding new alternative tokens would be appropriate (but perhaps
> unnecessary).
> Steve, perhaps this would be useful information to add to the paper?
I will add.
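
For readers less familiar with the alternative tokens mentioned in item 2 of
the list above, here is a minimal sketch of a function written without
| ! ~ ^ [ ] { } in its code, next to the conventional spelling. The function
names are hypothetical and chosen only for illustration. As noted in the
list, backslash has no alternative token.

```cpp
#include <cassert>  // "%:include" would avoid the # character as well

// Spelled with alternative tokens only: <% %> for braces, <: :> for
// brackets, and compl / bitor / xor / not for the operators.
inline unsigned mix_alt(const unsigned* values) <%
    unsigned result = compl values<:0:>;
    result = result bitor values<:1:>;
    result = result xor values<:2:>;
    if (not result) <% return 1u; %>
    return result;
%>

// The same function with the primary spellings, for comparison.
inline unsigned mix_plain(const unsigned* values) {
    unsigned result = ~values[0];
    result = result | values[1];
    result = result ^ values[2];
    if (!result) { return 1u; }
    return result;
}

// Usage sketch: both spellings must agree on the same input.
inline void check_alternative_tokens() {
    const unsigned sample[3] = {~0u, 0x0Fu, 0x05u};
    assert(mix_alt(sample) == mix_plain(sample));
}
```

Both spellings are the same tokens after lexing, so the two functions
compile to identical behavior.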

> Tom.

Received on 2022-04-27 15:48:11