On Tue, Apr 26, 2022 at 10:39 PM Steve Downey via SG16 <sg16@lists.isocpp.org> wrote:

On Tue, Apr 26, 2022 at 4:20 PM Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:
On 4/26/22 4:12 PM, Jens Maurer via SG16 wrote:
> On 26/04/2022 22.06, Tom Honermann via SG16 wrote:
>> The summary for the SG16 meeting held April 13th, 2022 is now available. For those that attended, please review and suggest corrections.
>>
>> * https://github.com/sg16-unicode/sg16-meetings#april-13th-2022
>>
>> No decisions were made at this meeting.
>>
>> I again apologize for being so delinquent getting the summary published.
>>
>> Jens, I fear I misunderstood or incorrectly captured some of your comments. Please see the editor's note starting with "This behavior doesn't seem related to the proposed change since ...". If you recall the discussion being different than I wrote, I'll update it to reflect your recollection.
> I think you should just strike all of this:
>
> Jens stated that this makes such intended use in identifiers ill-formed since, after this change, such a character would appear as a lone preprocessing-token.
> [ Editor's note: This behavior doesn't seem related to the proposed change since, previously, a UCN naming one of these characters would also appear as a lone preprocessing-token. The editor is concerned that this portion of the discussion was not captured accurately. ]

Done, thank you!

>
> I think there was some development during the discussion
> about the current and future state with these new
> characters. Having an updated paper clearly stating
> the current and with-paper situations would be helpful.

Agreed, I suspect Steve intends to provide that.

Tom.

That's what I'm planning. The complicated bit is the implications for the "C" locale; it's not an issue for the "POSIX" locale. That said, I don't think it's a real-world concern these days that the default encoded character set might lack what POSIX calls the portable character set. Tracing the requirements is tedious because C++ defers to C, which in turn defers to the ISO version of the POSIX specification for much of the locale machinery.

Hey Steve,

Can you please explain in the paper what this buys us?

* It doesn't change the set of identifiers (nor should it)

* It makes \N{DOLLAR SIGN} ill-formed. Is that desirable?

* It makes a hypothetical implementation that would not support $ in the literal encoding non-conforming, even if no such character is present in any source file. Is that desirable?

* It doesn't affect the set of characters usable in grammar productions, nor should it; these are orthogonal concerns. P2342 does not advocate for a change to literal or execution character sets.

* A static analysis tool won't warn on dollars but may keep warning about non-ASCII characters in string literals.
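
To make the escape-sequence bullet concrete, here is a minimal sketch. It is my own illustration, not text from either paper, and the function name is hypothetical. It assumes a compiler that accepts $ inside literals, as current implementations do. Inside a literal, a universal-character-name for U+0024 and a plain $ designate the same character; the comment describes the case outside literals, which, as I read [lex.charset], is the part that changes once $ is in the basic character set.

```cpp
// Sketch only; ucn_matches_plain_dollar is a hypothetical name.
// In a UTF-8 string literal, the UCN for U+0024 and a plain '$'
// produce the same code unit.
constexpr bool ucn_matches_plain_dollar() {
    return u8"\u0024"[0] == u8"$"[0];
}
static_assert(ucn_matches_plain_dollar(), "UCN and plain $ should agree");

// Outside of character and string literals the situation differs: a UCN
// that names a member of the basic character set is ill-formed. So if $
// joins that set, a declaration such as
//     int a\u0024b;
// falls under that rule, whatever the implementation does with a plain $
// in identifiers.
```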

There is a simpler model here:

* The grammar determines the set of characters that are parts of grammar elements

* String literals that cannot be encoded in whatever the literal encoding is are ill-formed.
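
A rough sketch of the second rule, assuming C++17 and using ISO/IEC 8859-1 as a stand-in for "whatever the literal encoding is". Both function names are hypothetical and nothing here is proposed wording. The point is only that the check depends on the literal encoding and on nothing in the grammar.

```cpp
#include <string_view>

// Hypothetical stand-in for the literal encoding: ISO/IEC 8859-1, which can
// represent exactly the code points U+0000 through U+00FF.
constexpr bool encodable_in_latin1(char32_t c) {
    return c <= 0xFF;
}

// Under the simpler model, a string literal is acceptable only if every
// one of its code points can be encoded in the literal encoding.
constexpr bool literal_is_encodable(std::u32string_view s) {
    for (char32_t c : s) {
        if (!encodable_in_latin1(c)) {
            return false;
        }
    }
    return true;
}

static_assert(literal_is_encodable(U"price: $5"), "$ is U+0024, encodable");
static_assert(literal_is_encodable(U"caf\u00E9"), "U+00E9 is encodable");
static_assert(!literal_is_encodable(U"\u03B1"), "U+03B1 is not encodable");
```

Note that $ needs no special status here: it passes or fails like any other code point.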

I'm not saying that we should not be doing this. I would even argue that "C did it" might be reason enough, but I want to make sure we all agree on the intent.

But I wish both committees would consider completing the separation between source encoding and literal encoding, which C++ mostly did; "basic character set" is at this point almost vestigial.

> In this document, glyphs are used to identify elements of the basic character set

Glyphs can equally identify Unicode code points.

> If any character not in the basic character set matches the last category, the program is ill-formed.

Do we have examples of lone characters in the basic character set not matching a pp-token that would not be ill-formed?

> conditional-escape-sequence-char: any member of the basic character set that is not ...

Any reason that \α could not be conditionally supported?

> d-char: any member of the basic character set except:
> U+0020 SPACE, U+0028 LEFT PARENTHESIS, U+0029 RIGHT PARENTHESIS, U+005C REVERSE SOLIDUS,
> U+0009 CHARACTER TABULATION, U+000B LINE TABULATION, U+000C FORM FEED, and new-line

This one would need some massaging.

> A letter is any of the 26 lowercase or 26 uppercase letters in the basic character set.

This one could be replaced easily.
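
For instance, "letter" could be stated directly in terms of code point ranges. The sketch below is only an illustration of that idea (is_letter is a hypothetical name, not suggested wording); it relies on char32_t literals being UTF-32 encoded, so U'A' is U+0041 and U'a' is U+0061.

```cpp
// Sketch only: "letter" defined by code point range, with no reference to
// the basic character set.
constexpr bool is_letter(char32_t c) {
    return (c >= U'A' && c <= U'Z')   // U+0041..U+005A
        || (c >= U'a' && c <= U'z');  // U+0061..U+007A
}

static_assert(is_letter(U'q'), "lowercase letter");
static_assert(is_letter(U'Z'), "uppercase letter");
static_assert(!is_letter(U'$'), "U+0024 is not a letter");
static_assert(!is_letter(U'\u03B1'), "U+03B1 is outside both ranges");
```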

There are 2 more uses in [locale].

Properly defining these things seems like it would be somewhat worthwhile.

At which point, the literal and execution character sets do not have to be defined as supersets of something used in the grammar specification, and it might be less confusing?

Even if we want to temporarily keep a basic character set around until we get rid of all mentions of it, we could make the literal/execution encodings not depend on it.

Anyway, sorry for the rambling!

Corentin

Paper soon.