Date: Wed, 27 Apr 2022 11:11:38 +0200
On Wed, Apr 27, 2022 at 9:06 AM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
> On 27/04/2022 06.05, Steve Downey via SG16 wrote:
> >
> >
> > On Tue, Apr 26, 2022 at 6:18 PM Corentin Jabot <corentinjabot_at_[hidden]
> <mailto:corentinjabot_at_[hidden]>> wrote:
> >
> > Can you please explain in the paper what this buys us?
> >
> > * It doesn't change the set of identifiers (nor should it)
> > * It makes \N{DOLLAR} ill-formed. Is that desirable?
> >
> > Only in syntactic contexts, if I have understood correctly. You could
> use that in a literal, but not in an identifier.
>
> Yes.
>
> > * It makes an hypothetical implementation that would not support $
> in the literal encoding non-conforming, even if no such character is
> present in any source file. Is that desirable?
>
> We have discussed this in SG16, I believe.
> I think nobody raised specific concerns;
> the minutes quote Hubert as saying WG14
> had sufficient quorum when discussing this
> paper, but no concern was raised there.
>
> > C
>
> As in "to-be-published C23" ?
>
> My understanding is there is no published C standard that requires this.
>
> > and POSIX already require this. POSIX in straight out normative text.
> You can't even write a POSIX charmap for a locale without specifying $.
> >
> >
> > * It doesn't affect the set of characters usable in grammar
> production - nor should it, these are orthogonal concerns. P2342 does not
> advocate for a change to literal or execution character sets.
>
> I'm not seeing what "grammar production" and "literal/execution character
> set"
> should have in common.
>
> > They aren't quite orthogonal, although C++ has lost some rationale. The
> basic character set is the set of abstract characters that allows you to
> express C, and C++. Adding other characters to the grammar would break a
> lot of assumptions. C can be generated.
>
> Corentin found an example where we say "any character of the basic
> character set
> except X". The effects here change when you change "basic character set".
>
> > * A static analysis tool won't warn on dollars but may keep warning
> about non-ascii characters in string literals.
> >
> > There is a simpler model here:
> > * The grammar determines the set of characters that are parts of
> grammar elements
> > * String literals that cannot be encoded in whatever the literal
> encoding is are ill-formed.
> >
> > I'm not saying that we should not be doing this, I would even argue
> that "C did it" might be reason enough, but I want to make sure we all
> agree on the intent
>
> What, do you think, is the intent?
>
> > But i wish both committee would consider completing the separation
> between source encoding and literal encoding - which C++ mostly did, "basic
> character set" is at this point almost vestigial
> Except that it establishes minimum requirements on the
> execution character sets, too.
>
> > > If any character not in the basic character set matches the last
> category, the program is ill-formed. <
> https://eel.is/c++draft/full#lex.pptoken-2.sentence-5>
> > Do we have examples of lone characters in the basic character set
> not matching a pp-token that would not be ill-formed ?
> >
> > I don't think this can happen now?
>
>
> Here's an example:
>
>
> #include <stdio.h>
>
> #define STR(x) #x
>
> int main()
> {
>     printf("%s", STR(\\));
> }
>
>
> Neither of the two backslashes is part of any other pp-token
> grammar production.
>
Thanks, that was helpful.
I don't know whether that rule is useful, and it doesn't seem to be enforced:
https://godbolt.org/z/ezKr6jffj
(clang 14's behavior seems like a regression; I'll look into that.)
> > > d-char: <https://eel.is/c++draft/full#nt:d-char>
> > any member of the basic character set except:
> > U+0020 SPACE, U+0028 LEFT PARENTHESIS, U+0029 RIGHT
> PARENTHESIS, U+005C REVERSE SOLIDUS,
> > U+0009 CHARACTER TABULATION, U+000B LINE TABULATION, U+000C FORM
> FEED, and new-line
> > This one would need some massaging.
> >
> > Not clear to me that it does? Not white-space-ish or starting an escape?
>
> This restricts the delimiter characters in a raw string literal.
> If you don't change this list, you'll add $ @ ` to the list of valid
> delimiter characters in a raw string literal.
>
> Either you amend this exclusion list (retaining the status quo), or
> you should highlight this change in behavior in the prose part of
> your paper.
>
Yes, I think that is my broader point.
"Basic character set" is currently used to impose different sets of
restrictions on grammar constructs and <locale> functions, but it is not
used consistently.
All of these uses are either imprecise or carry restriction lists, and we
should look at each of these instances.
Arguably, there are very few places that should support $ or @ but
should not also support other Unicode characters.
And if each of the places restricted to "basic character set" had a
well-defined rationale and a more precise description, we might not have
any use for it.
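
To make the d-char point concrete, here is a rough sketch (not wording, and
not taken from the paper) of how the set of valid raw string delimiter
characters follows from "basic character set minus an exclusion list". The
names basic_chars, proposed_additions, d_char_exclusions, contains and
is_d_char are made up for illustration, and the 96-character set is my
reading of the basic character set before any addition of $ @ `:

```cpp
#include <string_view>

// Assumed: the basic character set before $, @ and ` are added
// (52 letters, 10 digits, 29 punctuation characters, 5 whitespace characters).
constexpr std::string_view basic_chars =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'"
    " \t\v\f\n";

// The three characters under discussion in this thread.
constexpr std::string_view proposed_additions = "$@`";

// The d-char exclusion list quoted above: SPACE, parentheses,
// REVERSE SOLIDUS, the tabulation characters, FORM FEED and new-line.
constexpr std::string_view d_char_exclusions = " ()\\\t\v\f\n";

// Hypothetical helper: membership test on a character list.
constexpr bool contains(std::string_view set, char c) {
    return set.find(c) != std::string_view::npos;
}

// Hypothetical helper: "any member of the basic character set except ...".
// with_additions selects whether $ @ ` count as basic characters.
constexpr bool is_d_char(char c, bool with_additions) {
    const bool in_basic = contains(basic_chars, c) ||
                          (with_additions && contains(proposed_additions, c));
    return in_basic && !contains(d_char_exclusions, c);
}

// Unchanged either way: ordinary characters are delimiters, '(' is not.
static_assert(is_d_char('a', false) && is_d_char('a', true));
static_assert(!is_d_char('(', false) && !is_d_char('(', true));

// The change in behavior: with an unchanged exclusion list, $ @ ` become
// valid delimiter characters as soon as the basic character set grows.
static_assert(!is_d_char('$', false) && is_d_char('$', true));
static_assert(!is_d_char('@', false) && is_d_char('@', true));
static_assert(!is_d_char('`', false) && is_d_char('`', true));
```

Under that reading, something like R"$(text)$" goes from invalid to valid
purely as a side effect of extending the basic character set, unless the
exclusion list is amended at the same time.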
>
> Jens
>
>
Received on 2022-04-27 09:11:50