Date: Wed, 27 Apr 2022 09:06:17 +0200
On 27/04/2022 06.05, Steve Downey via SG16 wrote:
>
>
> On Tue, Apr 26, 2022 at 6:18 PM Corentin Jabot <corentinjabot_at_[hidden] <mailto:corentinjabot_at_[hidden]>> wrote:
>
> Can you please explain in the paper what this buys us?
>
> * It doesn't change the set of identifiers (nor should it)
> * It makes \N{DOLLAR} ill-formed. Is that desirable?
>
> Only in syntactic contexts, if I have understood correctly. You could use that in a literal, but not in an identifier.
Yes.
> * It makes a hypothetical implementation that would not support $ in the literal encoding non-conforming, even if no such character is present in any source file. Is that desirable?
We have discussed this in SG16, I believe.
I think nobody raised specific concerns;
the minutes quote Hubert as saying WG14
had sufficient quorum when discussing this
paper, but no concern was raised there.
> C
As in "to-be-published C23"?
My understanding is there is no published C standard that requires this.
> and POSIX already require this. POSIX in straight out normative text. You can't even write a POSIX charmap for a locale without specifying $.
>
>
> * It doesn't affect the set of characters usable in grammar production - nor should it, these are orthogonal concerns. P2342 does not advocate for a change to literal or execution character sets.
I'm not seeing what "grammar production" and "literal/execution character set"
should have in common.
> They aren't quite orthogonal, although C++ has lost some rationale. The basic character set is the set of abstract characters that allows you to express C, and C++. Adding other characters to the grammar would break a lot of assumptions. C can be generated.
Corentin found an example where we say "any character of the basic character set
except X". The effects here change when you change "basic character set".
> * A static analysis tool won't warn on dollars but may keep warning about non-ASCII characters in string literals.
>
> There is a simpler model here:
> * The grammar determines the set of characters that are parts of grammar elements
> * String literals that cannot be encoded in whatever the literal encoding is are ill-formed.
>
> I'm not saying that we should not be doing this, I would even argue that "C did it" might be reason enough, but I want to make sure we all agree on the intent
What, do you think, is the intent?
> But I wish both committees would consider completing the separation between source encoding and literal encoding, which C++ mostly did; "basic character set" is at this point almost vestigial
Except that it establishes minimum requirements on the
execution character sets, too.
> > If any character not in the basic character set matches the last category, the program is ill-formed. <https://eel.is/c++draft/full#lex.pptoken-2.sentence-5>
> Do we have examples of lone characters in the basic character set not matching a pp-token that would not be ill-formed?
>
> I don't think this can happen now?
Here's an example:
  #include <stdio.h>
  #define STR(x) #x   /* stringize the macro argument */
  int main()
  {
    printf("%s", STR(\\));
  }
Neither of the two backslashes is part of any other pp-token
grammar production.
> > d-char: <https://eel.is/c++draft/full#nt:d-char>
> any member of the basic character set except:
> U+0020 SPACE, U+0028 LEFT PARENTHESIS, U+0029 RIGHT PARENTHESIS, U+005C REVERSE SOLIDUS,
> U+0009 CHARACTER TABULATION, U+000B LINE TABULATION, U+000C FORM FEED, and new-line
> This one would need some massaging.
>
> Not clear to me that it does? Not white-space-ish or starting an escape?
This restricts the delimiter characters in a raw string literal.
If you don't change this list, you'll add $ @ ` to the list of valid
delimiter characters in a raw string literal.
Either you amend this exclusion list (retaining the status quo), or
you should highlight this change in behavior in the prose part of
your paper.
Jens
Received on 2022-04-27 07:06:23