Re: SG16 meeting summary for April 13th, 2022

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Wed, 27 Apr 2022 17:50:21 +0200
On Wed, Apr 27, 2022 at 11:11 AM Corentin Jabot <corentinjabot_at_[hidden]>
wrote:

>
>
> On Wed, Apr 27, 2022 at 9:06 AM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>
>> On 27/04/2022 06.05, Steve Downey via SG16 wrote:
>> >
>> >
>> > On Tue, Apr 26, 2022 at 6:18 PM Corentin Jabot <corentinjabot_at_[hidden]
>> <mailto:corentinjabot_at_[hidden]>> wrote:
>> >
>> > Can you please explain in the paper what this buys us?
>> >
>> > * It doesn't change the set of identifiers (nor should it)
>> > * It makes \N{DOLLAR} ill-formed. Is that desirable?
>> >
>> > Only in syntactic contexts, if I have understood correctly. You could
>> use that in a literal, but not in an identifier.
>>
>> Yes.
>>
>> > * It makes a hypothetical implementation that would not support $
>> in the literal encoding non-conforming, even if no such character is
>> present in any source file. Is that desirable?
>>
>> We have discussed this in SG16, I believe.
>> I think nobody raised specific concerns;
>> the minutes quote Hubert as saying WG14
>> had sufficient quorum when discussing this
>> paper, but no concern was raised there.
>>
>> > C
>>
>> As in "to-be-published C23" ?
>>
>> My understanding is there is no published C standard that requires this.
>>
>> > and POSIX already require this. POSIX in straight out normative text.
>> You can't even write a POSIX charmap for a locale without specifying $.
>> >
>> >
>> > * It doesn't affect the set of characters usable in grammar
>> production - nor should it, these are orthogonal concerns. P2342 does not
>> advocate for a change to literal or execution character sets.
>>
>> I'm not seeing what "grammar production" and "literal/execution character
>> set"
>> should have in common.
>>
>> > They aren't quite orthogonal, although C++ has lost some rationale. The
>> basic character set is the set of abstract characters that allows you to
>> express C, and C++. Adding other characters to the grammar would break a
>> lot of assumptions. C can be generated.
>>
>> Corentin found an example where we say "any character of the basic
>> character set
>> except X". The effects here change when you change "basic character set".
>>
>> > * A static analysis tool won't warn on dollars but may keep warning
>> about non-ascii characters in string literals.
>> >
>> > There is a simpler model here:
>> > * The grammar determines the set of characters that are parts of
>> grammar elements
>> > * String literals that cannot be encoded in whatever the literal
>> encoding is are ill-formed.
>> >
>> > I'm not saying that we should not be doing this, I would even argue
>> that "C did it" might be reason enough, but I want to make sure we all
>> agree on the intent
>>
>> What, do you think, is the intent?
>>
>> > But I wish both committees would consider completing the separation
>> between source encoding and literal encoding - which C++ mostly did, "basic
>> character set" is at this point almost vestigial
>> Except that it establishes minimum requirements on the
>> execution character sets, too.
>>
>> > > If any character not in the basic character set matches the last
>> category, the program is ill-formed. <
>> https://eel.is/c++draft/full#lex.pptoken-2.sentence-5>
>> > Do we have examples of lone characters in the basic character set
>> not matching a pp-token that would not be ill-formed ?
>> >
>> > I don't think this can happen now?
>>
>>
>> Here's an example:
>>
>>
>> #include <stdio.h>
>>
>> #define STR(x) #x
>>
>> int main()
>> {
>>     printf("%s", STR(\\));
>> }
>>
>>
>> Neither of the two backslashes is part of any other pp-token
>> grammar production.
>>
>
> Thanks, that was helpful.
> I don't know if that rule is useful, and it doesn't seem to be enforced:
> https://godbolt.org/z/ezKr6jffj
> (clang 14's behavior seems like a regression; I'll look into that)
>

It turns out clang 14's behavior is consistent with past versions of Clang,
as well as GCC. I accidentally tested with a character that used to be valid
in identifiers, which is why neither Clang nor GCC complained:
https://godbolt.org/z/KvjhE7oxG

Not making the program ill-formed in that scenario before phase 7 would
require some surgery in Clang: we currently assume Unicode characters are
always part of an identifier and diagnose anything else as an invalid
identifier during phase 3, which is consistent with that restriction.


>
>
>
>> > > d-char: <https://eel.is/c++draft/full#nt:d-char>
>> > any member of the basic character set except:
>> > U+0020 SPACE, U+0028 LEFT PARENTHESIS, U+0029 RIGHT
>> PARENTHESIS, U+005C REVERSE SOLIDUS,
>> > U+0009 CHARACTER TABULATION, U+000B LINE TABULATION, U+000C FORM
>> FEED, and new-line
>> > This one would need some massaging.
>> >
>> > Not clear to me that it does? Not white-space-ish or starting an
>> escape?
>>
>> This restricts the delimiter characters in a raw string literal.
>> If you don't change this list, you'll add $ @ ` to the list of valid
>> delimiter characters in a raw string literal.
>>
>> Either you amend this exclusion list (retaining the status quo), or
>> you should highlight this change in behavior in the prose part of
>> your paper.
>>
>
> Yes, I think that is my broader point.
> "basic character set" is currently used to impose different sets of
> restrictions on grammar constructs and <locale> functions, but it is not
> used consistently. All of these uses are either imprecise or carry
> restriction lists, and we should look at each of these instances.
> Arguably, there are very few places that should support a $ or @ but
> should not also support other Unicode characters.
>
> And if each of the places restricted to "basic character set" had a
> well-defined rationale and a more precise description, we might not have
> any use for it.
>
>
>
>
>>
>> Jens
>>
>>

Received on 2022-04-27 15:50:33