ISOCPP sg16 List: Re: SG16 meeting summary for April 13th, 2022

From: Steve Downey <sdowney_at_[hidden]>
Date: Wed, 27 Apr 2022 00:05:34 -0400

On Tue, Apr 26, 2022 at 6:18 PM Corentin Jabot <corentinjabot_at_[hidden]>
wrote:

>
>
> On Tue, Apr 26, 2022 at 10:39 PM Steve Downey via SG16 <
> sg16_at_[hidden]> wrote:
>
>>
>>
>> On Tue, Apr 26, 2022 at 4:20 PM Tom Honermann via SG16 <
>> sg16_at_[hidden]> wrote:
>>
>>> On 4/26/22 4:12 PM, Jens Maurer via SG16 wrote:
>>> > On 26/04/2022 22.06, Tom Honermann via SG16 wrote:
>>> >> The summary for the SG16 meeting held April 13th, 2022 is now
>>> available. For those that attended, please review and suggest corrections.
>>> >>
>>> >> * https://github.com/sg16-unicode/sg16-meetings#april-13th-2022
>>> >>
>>> >> No decisions were made at this meeting.
>>> >>
>>> >> I again apologize for being so delinquent getting the summary
>>> published.
>>> >>
>>> >> Jens, I fear I misunderstood or incorrectly captured some of your
>>> comments. Please see the editor's note starting with "This behavior doesn't
>>> seem related to the proposed change since ...". If you recall the
>>> discussion being different than I wrote, I'll update it to reflect your
>>> recollection.
>>> > I think you should just strike all of this:
>>> >
>>> > Jens stated that this makes such intended use in identifiers
>>> ill-formed since, after this change, such a character would appear as a
>>> lone preprocessing-token.
>>> > [ Editor's note: This behavior doesn't seem related to the proposed
>>> change since, previously, a UCN naming one of these characters would also
>>> appear as a lone preprocessing-token. The editor is concerned that this
>>> portion of the discussion was not captured accurately. ]
>>>
>>> Done, thank you!
>>>
>>> >
>>> > I think there was some development during the discussion
>>> > about the current and future state with these new
>>> > characters. Having an updated paper clearly stating
>>> > the current and with-paper situations would be helpful.
>>>
>>> Agreed, I suspect Steve intends to provide that.
>>>
>>> Tom.
>>>
>>>
>> That's what I'm planning. The complicated bit is the implications for
>> the "C" locale; it's not an issue for the "POSIX" locale. I don't think
>> it's a real-world concern these days that the default encoded character
>> set doesn't have what POSIX calls the portable character set.
>> Tracing the requirements is tedious because C++ defers to C, which in turn
>> defers to the ISO version of the POSIX specification for much of the locale
>> machinery.
>>
>
> Hey Steve,
> Can you please explain in the paper what this buys us?
>
> * It doesn't change the set of identifiers (nor should it)
> * It makes \N{DOLLAR} ill-formed. Is that desirable?
>
Only in syntactic contexts, if I have understood correctly. You could use
that in a literal, but not in an identifier.
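
A minimal C++ sketch of that distinction, offered only as an illustration and
not taken from the paper. It assumes C++17 and an ordinary literal encoding in
which "\u0024" and "$" yield the same code units (true for ASCII-compatible
encodings such as UTF-8). The names direct, via_ucn, same_spelling, and
check_spelling are hypothetical.

```cpp
#include <cassert>
#include <string_view>

// '$' written directly and as a universal-character-name, both inside
// string literals, where a UCN naming U+0024 is permitted.
constexpr std::string_view direct = "$";
constexpr std::string_view via_ucn = "\u0024";

// True when both spellings encode to the same code units.
constexpr bool same_spelling() { return direct == via_ucn; }

// Outside a literal the UCN spelling is not usable: U+0024 is not an
// identifier character, so a declaration such as
//     int \u0024x = 0;
// is rejected. (Shown as a comment because it must not compile.)

inline void check_spelling() { assert(same_spelling()); }
```

Calling check_spelling() should pass under the stated encoding assumption; on
an implementation with a different literal encoding the comparison is the part
to re-examine.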

> * It makes a hypothetical implementation that would not support $ in the
> literal encoding non-conforming, even if no such character is present in
> any source file. Is that desirable?
>
C and POSIX already require this; POSIX does so in outright normative text.
You can't even write a POSIX charmap for a locale without specifying $.

> * It doesn't affect the set of characters usable in grammar production -
> nor should it, these are orthogonal concerns. P2342 does not advocate for a
> change to literal or execution character sets.
>
They aren't quite orthogonal, although C++ has lost some rationale. The
basic character set is the set of abstract characters that allows you to
express C and C++. Adding other characters to the grammar would break a
lot of assumptions. C can be generated.

> * A static analysis tool won't warn on dollars but may keep warning about
> non-ASCII characters in string literals.
>
> There is a simpler model here:
> * The grammar determines the set of characters that are parts of grammar
> elements
> * String literals that cannot be encoded in whatever the literal encoding
> is are ill-formed.
>
> I'm not saying that we should not be doing this, I would even argue that
> "C did it" might be reason enough, but I want to make sure we all agree on
> the intent
>
> But I wish both committees would consider completing the separation between
> source encoding and literal encoding, which C++ mostly did; "basic
> character set" is at this point almost vestigial.
>
I agree we don't have a source "encoding", but we do have the abstract
character set that comprises what is necessary to express any C++ program.

> In this document, glyphs are used to identify elements of the basic
> character set
> Glyphs can equally identify Unicode code points
>
I think that's a one-way mapping. Not sufficient?

> > If any character not in the basic character set matches the last
> category, the program is ill-formed.
> <https://eel.is/c++draft/full#lex.pptoken-2.sentence-5>
> Do we have examples of lone characters in the basic character set not
> matching a pp-token that would not be ill-formed?
>
I don't think this can happen now?

> > conditional-escape-sequence-char:
> <https://eel.is/c++draft/full#nt:conditional-escape-sequence-char> any
> member of the basic character set that is not ...
> Any reason that \α could not be conditionally supported?
>
Portability? If someone wants it, explain how to write it in Latin-1?

> d-char: <https://eel.is/c++draft/full#nt:d-char>
> any member of the basic character set except:
> U+0020 SPACE, U+0028 LEFT PARENTHESIS, U+0029 RIGHT PARENTHESIS, U+005C REVERSE
> SOLIDUS,
> U+0009 CHARACTER TABULATION, U+000B LINE TABULATION, U+000C FORM FEED,
> and new-line
> This one would need some massaging.
>
Not clear to me that it does? The new characters are not white-space-ish and
do not start an escape.

> > A *letter* <https://eel.is/c++draft/full#def:letter> is any of the 26
> lowercase or 26 uppercase letters in the basic character set.
> <https://eel.is/c++draft/full#character.seq.general-1.3.sentence-1>
> This one could be replaced easily
>
Not letters?
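
For illustration only, the quoted definition of "letter" can be sketched in
C++ without depending on the numeric values of the encoding, by listing the 52
characters explicitly. This assumes C++17; basic_letters, is_basic_letter, and
check_letters are hypothetical names, not anything from the standard library.

```cpp
#include <cassert>
#include <string_view>

// The 26 lowercase and 26 uppercase letters of the basic character set,
// listed explicitly so the check does not assume contiguous code points.
constexpr std::string_view basic_letters =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

// True only for those 52 characters; '$', '_', and digits are not letters.
constexpr bool is_basic_letter(char c) {
    return basic_letters.find(c) != std::string_view::npos;
}

inline void check_letters() {
    assert(is_basic_letter('q'));
    assert(!is_basic_letter('$'));
}
```

Under this reading, adding $, @, and ` to the basic character set leaves the
set of letters unchanged.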

> There are 2 more uses in [locale].
>
> Properly defining these things seems like it would be somewhat worthwhile.
> At which point, the literal and execution character sets do not have to be
> defined as supersets of something used in the grammar specification,
> and it might be less confusing?
>
> Even if we want to temporarily keep a basic character set around until we
> get rid of all mentions of it,
> we could make the literal/execution encodings not depend on it.
>
> Anyway, sorry for the rambling!
> Corentin
>

I think the next goal is getting rid of the poorly defined term "execution
encoding". Literal encoding is clear. There's a default "C" and "POSIX"
LC_CTYPE encoding that could be better nailed down, and then there's
whatever is in the current locale. I think "execution encoding" could be
made redundant and replaced with, as much as I dislike locale, what the "C"
locale, the current locale, and a particular locale represent?
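
A rough C++ sketch of the "C" locale versus current locale distinction, given
as an illustration under stated assumptions rather than as proposed wording.
It uses only std::setlocale and std::ispunct; the helper names are
hypothetical, and it assumes an implementation that reports the "C" locale by
the name "C" and classifies '$' as punctuation there (typical for
ASCII-compatible implementations).

```cpp
#include <cassert>
#include <cctype>
#include <clocale>
#include <string>

// Name of the LC_CTYPE locale currently in effect (a query, no change).
std::string current_ctype_locale() {
    const char* name = std::setlocale(LC_CTYPE, nullptr);
    return name ? name : "";
}

// Select the "C" locale for LC_CTYPE; returns true on success.
bool use_c_locale() {
    return std::setlocale(LC_CTYPE, "C") != nullptr;
}

// Classification of '$' depends on the LC_CTYPE locale in effect.
bool dollar_is_punct() {
    return std::ispunct(static_cast<unsigned char>('$')) != 0;
}

inline void check_c_locale() {
    assert(use_c_locale());
    assert(dollar_is_punct());
}
```

A program would call std::setlocale(LC_CTYPE, "") to switch to the
environment's locale, at which point the answers above become whatever that
locale defines.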

Received on 2022-04-27 04:05:48