On Wed, Apr 27, 2022 at 9:06 AM Jens Maurer <Jens.Maurer@gmx.net> wrote:
On 27/04/2022 06.05, Steve Downey via SG16 wrote:
> On Tue, Apr 26, 2022 at 6:18 PM Corentin Jabot <corentinjabot@gmail.com <mailto:corentinjabot@gmail.com>> wrote:
>     Can you please explain in the paper what this buys us?
>     * It doesn't change the set of identifiers (nor should it)
>     * It makes \N{DOLLAR} ill-formed. Is that desirable?
> Only in syntactic contexts, if I have understood correctly. You could use that in a literal, but not in an identifier.


>     * It makes a hypothetical implementation that does not support $ in the literal encoding non-conforming, even if no such character appears in any source file. Is that desirable?

We have discussed this in SG16, I believe.
I think nobody raised specific concerns;
the minutes quote Hubert as saying WG14
had sufficient quorum when discussing this
paper, but no concern was raised there.

> C

As in "to-be-published C23"?

My understanding is there is no published C standard that requires this.

>   and POSIX already require this. POSIX does so in outright normative text. You can't even write a POSIX charmap for a locale without specifying $. 
>     * It doesn't affect the set of characters usable in grammar production - nor should it, these are orthogonal concerns. P2342 does not advocate for a change to literal or execution character sets.

I'm not seeing what "grammar production" and "literal/execution character set"
should have in common.

> They aren't quite orthogonal, although C++ has lost some of the rationale. The basic character set is the set of abstract characters that allows you to express C and C++. Adding other characters to the grammar would break a lot of assumptions. C can be generated. 

Corentin found an example where we say "any character of the basic character set
except X". The effects here change when you change "basic character set".

>     * A static analysis tool won't warn on dollars but may keep warning about non-ASCII characters in string literals.
>     There is a simpler model here:
>     * The grammar determines the set of characters that are parts of grammar elements
>     * String literals that cannot be encoded in whatever the literal encoding is are ill-formed.
>     I'm not saying that we should not be doing this; I would even argue that "C did it" might be reason enough, but I want to make sure we all agree on the intent

What, do you think, is the intent?

>     But I wish both committees would consider completing the separation between source encoding and literal encoding, which C++ mostly did; "basic character set" is at this point almost vestigial
Except that it establishes minimum requirements on the
execution character sets, too.

>     > If any character not in the basic character set matches the last category, the program is ill-formed. <https://eel.is/c++draft/full#lex.pptoken-2.sentence-5>
>     Do we have examples of lone characters in the basic character set that match only that last category, and thus would not be ill-formed?
> I don't think this can happen now?

Here's an example:

#include <stdio.h>

#define STR(x) #x

int main() {
  printf("%s", STR(\\));
}

Neither of the two backslashes is part of any other pp-token
grammar production.

Thanks, that was helpful.
I don't know whether that rule is useful, and it doesn't seem to be enforced: https://godbolt.org/z/ezKr6jffj
(clang 14's behavior seems like a regression; I'll look into that)

>     > d-char: <https://eel.is/c++draft/full#nt:d-char>
>     any member of the basic character set except:
>     This one would need some massaging. 
> Not clear to me that it does? Not white-space-ish or starting an escape? 

This restricts the delimiter characters in a raw string literal.
If you don't change this list, you'll add $ @ ` to the list of valid
delimiter characters in a raw string literal.

Either you amend this exclusion list (retaining the status quo), or
you should highlight this change in behavior in the prose part of
your paper.

Yes, I think that is my broader point.
"basic character set" is currently used to impose different sets of restrictions on grammar constructs and <locale> functions, but it is not used consistently:
all of these uses are either imprecise or carry exclusion lists, and we should look at each of these instances.
Arguably, there are very few places that should support $ or @ but should not also support other Unicode characters.

And if each of the places restricted to the "basic character set" had a well-defined rationale and a more precise description, we might not have any use for it at all.