ISOCPP sg16 List: Re: SG16 meeting summary for April 13th, 2022

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 27 Apr 2022 11:32:02 -0400

On 4/27/22 12:05 AM, Steve Downey wrote:
>
>
> On Tue, Apr 26, 2022 at 6:18 PM Corentin Jabot
> <corentinjabot_at_[hidden]> wrote:
>
>
>
> On Tue, Apr 26, 2022 at 10:39 PM Steve Downey via SG16
> <sg16_at_[hidden]> wrote:
>
>
>
> On Tue, Apr 26, 2022 at 4:20 PM Tom Honermann via SG16
> <sg16_at_[hidden]> wrote:
>
> On 4/26/22 4:12 PM, Jens Maurer via SG16 wrote:
> > On 26/04/2022 22.06, Tom Honermann via SG16 wrote:
> >> The summary for the SG16 meeting held April 13th, 2022
> is now available. For those that attended, please review
> and suggest corrections.
> >>
> >> *
> https://github.com/sg16-unicode/sg16-meetings#april-13th-2022
> >>
> >> No decisions were made at this meeting.
> >>
> >> I again apologize for being so delinquent getting the
> summary published.
> >>
> >> Jens, I fear I misunderstood or incorrectly captured
> some of your comments. Please see the editor's note
> starting with "This behavior doesn't seem related to the
> proposed change since ...". If you recall the discussion
> being different than I wrote, I'll update it to reflect
> your recollection.
> > I think you should just strike all of this:
> >
> > Jens stated that this makes such intended use in
> identifiers ill-formed since, after this change, such a
> character would appear as a lone preprocessing-token.
> > [ Editor's note: This behavior doesn't seem related to
> the proposed change since, previously, a UCN naming one of
> these characters would also appear as a lone
> preprocessing-token. The editor is concerned that this
> portion of the discussion was not captured accurately. ]
>
> Done, thank you!
>
> >
> > I think there was some development during the discussion
> > about the current and future state with these new
> > characters. Having an updated paper clearly stating
> > the current and with-paper situations would be helpful.
>
> Agreed, I suspect Steve intends to provide that.
>
> Tom.
>
>
> That's what I'm planning. The complicated bit is the
> implications for the "C" locale; it's not an issue for the
> "POSIX" locale. That said, I don't think it's a real-world
> concern these days that the default encoded character set
> might lack what POSIX calls the portable character set.
> Tracing the requirements is tedious because C++ defers to C,
> which in turn defers to the ISO version of the POSIX
> specification for much of the locale machinery.
>
>
> Hey Steve,
> Can you please explain in the paper what this buys us?
>
Aside from C compatibility, I think there are two benefits:

  * The ability to (portably) use these characters in a character
    literal (this requires the single-code-unit encoding).
  * The ability to (portably) use these characters in string literals
    without having to use UCNs (thus avoiding pedantic warnings in some
    tools about use of a non-portable character).
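
To make those two points concrete, here is a minimal sketch. It assumes an
implementation whose ordinary literal encoding represents $, @, and ` in a
single code unit (e.g. UTF-8 or another ASCII-compatible encoding); all of
the names are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <cstring>

// Character literals: usable as a plain char only when the ordinary
// literal encoding represents the character in a single code unit.
constexpr char dollar  = '$';
constexpr char at_sign = '@';
constexpr char grave   = '`';

// The same three characters spelled directly in a string literal ...
constexpr char direct[] = "$@`";
// ... and spelled with universal-character-names (UCNs).
constexpr char via_ucn[] = "\u0024\u0040\u0060";

// Hypothetical helper: true when both spellings yield identical bytes.
bool same_bytes() {
    return sizeof(direct) == sizeof(via_ucn)
        && std::memcmp(direct, via_ucn, sizeof(direct)) == 0;
}

// Hypothetical self-check tying the character literals to the string.
void check_literals() {
    assert(direct[0] == dollar);
    assert(direct[1] == at_sign);
    assert(direct[2] == grave);
    assert(same_bytes());
}
```

The direct spelling is the one that some tools flag today as non-portable;
the UCN spelling is the workaround mentioned in the second bullet.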

>
> * It doesn't change the set of identifiers (nor should it)
> * It makes \N{DOLLAR} ill-formed. Is that desirable?
>
> Only in syntactic contexts, if I have understood correctly. You could
> use that in a literal, but not in an identifier.
>
> * It makes a hypothetical implementation that would not support $
> in the literal encoding non-conforming, even if no such character
> is present in any source file. Is that desirable?
>
> C and POSIX already require this; POSIX does so in straight-out normative
> text. You can't even write a POSIX charmap for a locale without
> specifying $.

For reference, the POSIX portable character set is defined here
<https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html>.

IBM documents the invariant subset of EBCDIC here
<https://www.ibm.com/docs/en/i/7.1?topic=sets-invariant-character-set>
(for IBM i) along with at least some of the EBCDIC code pages that do not
align with it. I find the presentation on Wikipedia
<https://en.wikipedia.org/wiki/EBCDIC#Code_page_layout> easier to view.
There are a few interesting things to note:

1. The following current members of the basic character set are not in
    the invariant EBCDIC set:
    |, !, #, ~, ^, [, ], {, }, \
2. Alternative tokens (including digraphs) are defined for each of these,
    with one exception:
    \
3. @ and $ are both members of the portable EBCDIC set
    <https://www.ibm.com/docs/en/i/7.1?topic=sets-portable-character-set>,
    but ` is not. This is weird as this set is apparently intended to align
    with the POSIX portable character set, but includes U+00B4 (ACUTE
    ACCENT) and not U+0060 (GRAVE ACCENT). I suppose that could be a doc
    bug.
4. If we were to start using the new characters outside of literals,
    adding new alternative tokens would be appropriate (but perhaps
    unnecessary).
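
As a reminder of what items 1 and 2 look like in practice, here is a small
sketch written without |, !, #, ~, ^, [, ], {, or } outside of comments,
using only the existing alternative tokens and digraphs. The function and
macro names are hypothetical. Backslash has no alternative spelling, so the
sketch simply avoids it.

```cpp
%:include <cassert>

%:define LOW_BYTE 0xFFu

// Hypothetical helper: counts zero entries in a fixed-size array.
unsigned count_zeros(const int (&values)<:4:>) <%
    unsigned zeros = 0;
    for (unsigned i = 0; i < 4; ++i) <%
        if (not values<:i:>) <% ++zeros; %>
    %>
    return zeros;
%>

// Hypothetical helper: bit operations spelled with bitor, xor, compl,
// and bitand.
unsigned mix(unsigned a, unsigned b) <%
    return (a bitor b) xor (compl a bitand LOW_BYTE);
%>

// Hypothetical self-check.
void check_alternative_tokens() <%
    const int sample<:4:> = <%0, 3, 0, 7%>;
    assert(count_zeros(sample) == 2u);
    assert(mix(0x0Fu, 0xF0u) == 0x0Fu);
%>
```

If @, $, or ` were ever given a role outside of literals, code like this
would need an analogous spelling for them, which is the concern in item 4.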

Steve, perhaps this would be useful information to add to the paper?

Tom.

Received on 2022-04-27 15:32:07