ISOCPP sg16 List: Re: SG16 meeting summary for April 13th, 2022

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Wed, 27 Apr 2022 00:18:11 +0200

On Tue, Apr 26, 2022 at 10:39 PM Steve Downey via SG16 <
sg16_at_[hidden]> wrote:

>
>
> On Tue, Apr 26, 2022 at 4:20 PM Tom Honermann via SG16 <
> sg16_at_[hidden]> wrote:
>
>> On 4/26/22 4:12 PM, Jens Maurer via SG16 wrote:
>> > On 26/04/2022 22.06, Tom Honermann via SG16 wrote:
>> >> The summary for the SG16 meeting held April 13th, 2022 is now
>> available. For those that attended, please review and suggest corrections.
>> >>
>> >> * https://github.com/sg16-unicode/sg16-meetings#april-13th-2022
>> >>
>> >> No decisions were made at this meeting.
>> >>
>> >> I again apologize for being so delinquent getting the summary
>> published.
>> >>
>> >> Jens, I fear I misunderstood or incorrectly captured some of your
>> comments. Please see the editor's note starting with "This behavior doesn't
>> seem related to the proposed change since ...". If you recall the
>> discussion being different than I wrote, I'll update it to reflect your
>> recollection.
>> > I think you should just strike all of this:
>> >
>> > Jens stated that this makes such intended use in identifiers ill-formed
>> since, after this change, such a character would appear as a lone
>> preprocessing-token.
>> > [ Editor's note: This behavior doesn't seem related to the proposed
>> change since, previously, a UCN naming one of these characters would also
>> appear as a lone preprocessing-token. The editor is concerned that this
>> portion of the discussion was not captured accurately. ]
>>
>> Done, thank you!
>>
>> >
>> > I think there was some development during the discussion
>> > about the current and future state with these new
>> > characters. Having an updated paper clearly stating
>> > the current and with-paper situations would be helpful.
>>
>> Agreed, I suspect Steve intends to provide that.
>>
>> Tom.
>>
>>
>> That's what I'm planning. The complicated bit is the implications for the
> "C" locale, although it's not an issue for the "POSIX" locale, although I
> don't think it's a real world concern these days that the default encoded
> character set doesn't have what POSIX calls the portable character set.
> Tracing the requirements is tedious because C++ defers to C, which in turn
> defers to the ISO version of the POSIX specification for much of the locale
> machinery.
>

Hey Steve,
Can you please explain in the paper what this buys us?

* It doesn't change the set of identifiers (nor should it)
* It makes \N{DOLLAR} ill-formed. Is that desirable?
* It makes a hypothetical implementation that would not support $ in the
literal encoding non-conforming, even if no such character is present in
any source file. Is that desirable?
* It doesn't affect the set of characters usable in grammar productions,
nor should it; these are orthogonal concerns. P2342 does not advocate for a
change to literal or execution character sets.
* A static analysis tool won't warn on dollars but may keep warning about
non-ASCII characters in string literals.

There is a simpler model here:
* The grammar determines the set of characters that are parts of grammar
elements
* String literals that cannot be encoded in whatever the literal encoding
is are ill-formed.
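
As a rough illustration of the second rule, here is a minimal sketch that models "can this literal be encoded?" as a predicate over code points. It assumes a hypothetical 7-bit ASCII literal encoding; the function names are invented for this example and do not come from any paper or implementation.

```cpp
#include <cassert>
#include <string_view>

// Assumption for this sketch: the literal encoding is 7-bit ASCII.
// A real implementation would consult its actual literal encoding.
constexpr bool encodable_in_literal_encoding(char32_t code_point) {
    return code_point <= 0x7F;
}

// Hypothetical helper: a literal is accepted only if every code point in it
// can be represented in the literal encoding.
constexpr bool literal_is_well_formed(std::u32string_view contents) {
    for (char32_t code_point : contents) {
        if (!encodable_in_literal_encoding(code_point)) {
            return false;
        }
    }
    return true;
}

// '$' is encodable in ASCII, so the literal is accepted.
static_assert(literal_is_well_formed(U"price: $5"));
// U+03B1 GREEK SMALL LETTER ALPHA is not, so the literal is rejected.
static_assert(!literal_is_well_formed(U"\u03B1"));
```

Under this model the check depends only on the literal encoding, not on any separately defined basic character set.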

I'm not saying that we should not be doing this, I would even argue that "C
did it" might be reason enough, but I want to make sure we all agree on the
intent.

But I wish both committees would consider completing the separation between
source encoding and literal encoding, which C++ mostly did; the "basic
character set" is at this point almost vestigial.

> In this document, glyphs are used to identify elements of the basic
character set
Glyphs can equally identify Unicode code points.

> If any character not in the basic character set matches the last
category, the program is ill-formed.
<https://eel.is/c++draft/full#lex.pptoken-2.sentence-5>
Do we have examples of lone characters in the basic character set that do
not match a pp-token and would not be ill-formed?

> conditional-escape-sequence-char:
<https://eel.is/c++draft/full#nt:conditional-escape-sequence-char> any
member of the basic character set that is not ...
Any reason that \α could not be conditionally supported?

> d-char: <https://eel.is/c++draft/full#nt:d-char>
any member of the basic character set except:
U+0020 SPACE, U+0028 LEFT PARENTHESIS, U+0029 RIGHT PARENTHESIS, U+005C REVERSE
SOLIDUS,
U+0009 CHARACTER TABULATION, U+000B LINE TABULATION, U+000C FORM FEED, and
new-line
This one would need some massaging.
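
For readers less familiar with d-char: it is the production for the characters of a raw string literal's delimiter. A small sketch of where it shows up, with a delimiter and contents made up for the example:

```cpp
#include <cassert>
#include <string_view>

// "xy" is the delimiter, a sequence of d-chars. The raw string ends only at
// the first occurrence of )xy" so the earlier )" is part of the contents.
constexpr std::string_view raw = R"xy(a )" b)xy";

// The contents are the six characters: a, space, ), ", space, b.
static_assert(raw == "a )\" b");
```

The excluded characters (space, parentheses, backslash, and the whitespace controls) are exactly those that would make the delimiter ambiguous, so any rewording of d-char has to keep those exclusions while replacing the reference to the basic character set.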

> A *letter* <https://eel.is/c++draft/full#def:letter> is any of the 26
lowercase or 26 uppercase letters in the basic character set.
<https://eel.is/c++draft/full#character.seq.general-1.3.sentence-1>
This one could be replaced easily.

There are 2 more uses in [locale].

Properly defining these things seems like it would be somewhat worthwhile.
At which point, the literal and execution character sets do not have to be
defined as supersets of something used in the grammar specification,
and it might be less confusing?

Even if we want to temporarily keep a basic character set around until we
get rid of all mentions of it,
we could make the literal/execution encodings not depend on it.

Anyway, sorry for the rambling!
Corentin

>
>
> Paper soon.
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>

Received on 2022-04-26 22:18:24