On Tue, Apr 26, 2022 at 6:18 PM Corentin Jabot <corentinjabot@gmail.com> wrote:


On Tue, Apr 26, 2022 at 10:39 PM Steve Downey via SG16 <sg16@lists.isocpp.org> wrote:


On Tue, Apr 26, 2022 at 4:20 PM Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:
On 4/26/22 4:12 PM, Jens Maurer via SG16 wrote:
> On 26/04/2022 22.06, Tom Honermann via SG16 wrote:
>> The summary for the SG16 meeting held April 13th, 2022 is now available.  For those that attended, please review and suggest corrections.
>>
>>    * https://github.com/sg16-unicode/sg16-meetings#april-13th-2022
>>
>> No decisions were made at this meeting.
>>
>> I again apologize for being so delinquent getting the summary published.
>>
>> Jens, I fear I misunderstood or incorrectly captured some of your comments. Please see the editor's note starting with "This behavior doesn't seem related to the proposed change since ...". If you recall the discussion being different than I wrote, I'll update it to reflect your recollection.
> I think you should just strike all of this:
>
> Jens stated that this makes such intended use in identifiers ill-formed since, after this change, such a character would appear as a lone preprocessing-token.
> [ Editor's note: This behavior doesn't seem related to the proposed change since, previously, a UCN naming one of these characters would also appear as a lone preprocessing-token. The editor is concerned that this portion of the discussion was not captured accurately. ]

Done, thank you!

>
> I think there was some development during the discussion
> about the current and future state with these new
> characters.  Having an updated paper clearly stating
> the current and with-paper situations would be helpful.

Agreed, I suspect Steve intends to provide that.

Tom.


That's what I'm planning. The complicated bit is the implications for the "C" locale; it's not an issue for the "POSIX" locale. Even so, I don't think it's a real-world concern these days that the default encoded character set might not have what POSIX calls the portable character set. Tracing the requirements is tedious because C++ defers to C, which in turn defers to the ISO version of the POSIX specification for much of the locale machinery.

Hey Steve, 
Can you please explain in the paper what this buys us?

* It doesn't change the set of identifiers (nor should it)
* It makes \N{DOLLAR} ill-formed. Is that desirable?
Only in syntactic contexts, if I have understood correctly. You could use that in a literal, but not in an identifier. 
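
To illustrate the literal case, here is a minimal C++ sketch. It spells the dollar sign as the universal-character-name \u0024 rather than \N{...}, since named escapes need a recent compiler, and it assumes an ordinary literal encoding in which $ occupies a single code unit (ASCII or UTF-8, for example). The name dollar_from_ucn is made up for this example.

```cpp
// Hypothetical name; illustration only.
// A UCN for U+0024 DOLLAR SIGN inside an ordinary string literal.
constexpr char dollar_from_ucn[] = "\u0024";

// Assumption: '$' is a single code unit in the ordinary literal encoding.
static_assert(sizeof(dollar_from_ucn) == 2,
              "one code unit plus the terminating null");
static_assert(dollar_from_ucn[0] == '$',
              "the UCN and the plain character encode identically");
```

The identifier case is left out on purpose, since that is the one under discussion here.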

* It makes a hypothetical implementation that would not support $ in the literal encoding non-conforming, even if no such character is present in any source file. Is that desirable?
C and POSIX already require this, POSIX in straight-out normative text. You can't even write a POSIX charmap for a locale without specifying $.
 
* It doesn't affect the set of characters usable in grammar production - nor should it, these are orthogonal concerns. P2342 does not advocate for a change to literal or execution character sets.
They aren't quite orthogonal, although C++ has lost some rationale. The basic character set is the set of abstract characters that allows you to express C and C++. Adding other characters to the grammar would break a lot of assumptions. C can be generated.

* A static analysis tool won't warn on dollars but may keep warning about non-ascii characters in string literals.

There is a simpler model here:
* The grammar determines the set of characters that are parts of grammar elements
* String literals that cannot be encoded in whatever the literal encoding is are ill-formed.
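
The second bullet can be sketched as a check over code points. The sketch assumes, purely for illustration, a literal encoding of ISO/IEC 8859-1 (Latin-1), where exactly the code points U+0000 through U+00FF are representable. A real implementation would make this decision during translation, and encodable_in_latin1 is a hypothetical helper, not anything from the standard.

```cpp
// Hypothetical helper: returns true when every code point of a
// null-terminated UTF-32 string is representable in Latin-1.
// Assumption: the literal encoding is ISO/IEC 8859-1 (U+0000..U+00FF).
constexpr bool encodable_in_latin1(const char32_t* s) {
    for (; *s != U'\0'; ++s) {
        if (*s > 0xFF) {
            return false;  // e.g. U+03B1 has no Latin-1 code unit
        }
    }
    return true;
}
```

Under this model a literal containing é (U+00E9) would be accepted and one containing α (U+03B1) rejected, independent of which characters the grammar itself uses.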

I'm not saying that we should not be doing this, I would even argue that "C did it" might be reason enough, but I want to make sure we all agree on the intent.

But I wish both committees would consider completing the separation between source encoding and literal encoding, which C++ mostly did; "basic character set" is at this point almost vestigial.

I agree we don't have source "encoding", but we do have the abstract character set that comprises what is necessary to express any C++ program.

In this document, glyphs are used to identify elements of the basic character set
Glyphs can equally identify Unicode code points

I think that's a one way mapping. Not sufficient?
 
If any character not in the basic character set matches the last category, the program is ill-formed.
Do we have examples of lone characters in the basic character set not matching a pp-token that would not be ill-formed?

I don't think this can happen now?
 
conditional-escape-sequence-char: any member of the basic character set that is not ...
Any reason that \α could not be conditionally supported?

Portability?  If someone wants it, explain how to write it in Latin-1?

any member of the basic character set except:
   U+0020 SPACE, U+0028 LEFT PARENTHESIS, U+0029 RIGHT PARENTHESIS, U+005C REVERSE SOLIDUS,
   U+0009 CHARACTER TABULATION, U+000B LINE TABULATION, U+000C FORM FEED, and new-line
This one would need some massaging. 

Not clear to me that it does? Not white-space-ish or starting an escape? 
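
For reference, the exclusion list quoted above can be written down as a predicate on code points. This is only a sketch: is_excluded_from_d_char is a hypothetical name, it tests the listed exceptions only and not membership in the basic character set, and it assumes new-line corresponds to U+000A.

```cpp
// Hypothetical predicate: true for the characters listed as exceptions
// in the quoted wording (raw-string delimiter characters).
// Assumption: "new-line" is modelled as U+000A LINE FEED.
constexpr bool is_excluded_from_d_char(char32_t c) {
    return c == U'\u0020'      // SPACE
        || c == U'('           // U+0028 LEFT PARENTHESIS
        || c == U')'           // U+0029 RIGHT PARENTHESIS
        || c == U'\\'          // U+005C REVERSE SOLIDUS
        || c == U'\t'          // U+0009 CHARACTER TABULATION
        || c == U'\v'          // U+000B LINE TABULATION
        || c == U'\f'          // U+000C FORM FEED
        || c == U'\n';         // new-line (assumed U+000A)
}
```
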
A letter is any of the 26 lowercase or 26 uppercase letters in the basic character set.
This one could be replaced easily

Not letters?
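
One possible way to state the "letter" definition without mentioning the basic character set is as two code point ranges. A minimal sketch follows; is_basic_letter is a hypothetical name, and this is not proposed wording.

```cpp
// Hypothetical predicate: the 26 lowercase and 26 uppercase Latin letters,
// expressed as code point ranges instead of via the basic character set.
constexpr bool is_basic_letter(char32_t c) {
    return (c >= U'a' && c <= U'z')    // U+0061..U+007A
        || (c >= U'A' && c <= U'Z');   // U+0041..U+005A
}
```
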
 
There are 2 more uses in [locale].

Properly defining these things seems like it would be somewhat worthwhile.
At which point, the literal and execution character sets do not have to be defined as supersets of something used in the grammar specification,
and it might be less confusing?

Even if we want to temporarily keep a basic character set around until we get rid of all mentions of it, 
we could make the literal/execution encodings not depend on it.

Anyway, sorry for the rambling!
Corentin

I think the next goal is getting rid of the poorly defined term "execution encoding". Literal encoding is clear. There's a default "C" and "POSIX" LC_CTYPE encoding, that could be better nailed down, and then there's whatever is in the current locale. I think "execution encoding" could be made redundant, and replaced with, as much as I dislike locale, what the C locale, current locale, and particular locale represent?