Date: Sun, 10 Jul 2022 00:03:56 +0200
On Sat, Jul 9, 2022 at 9:03 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
> On 09/07/2022 18.25, Corentin Jabot via SG16 wrote:
> > Hey Tom.
> > If you are *really* bored, I guess you could take a gander at
> >
> > https://isocpp.org/files/papers/P2620R0.pdf
>
> This feels thoroughly confused.
>
> Quote from the paper:
>
> "This is by no mean a major issue in C++, as we don’t put restrictions
> on universal-character-names (unlike C)"
>
Indeed, the important bit is missing. This should read:
"This is by no means a major issue in C++, as we don’t put restrictions
on universal-character-names in string literals (unlike C)"
>
> Quote from the standard:
>
> "If a universal-character-name outside the c-char-sequence,
> s-char-sequence, or r-char-sequence
> of a character-literal or string-literal (in either case, including within
> a user-defined-literal) corresponds
> to a control character or to a character in the basic character set, the
> program is ill-formed."
>
> So, there are restrictions in C++ as well.
>
> If the comparison with C wants to highlight the UCN vs. char/string-literal
> treatment, it should say so.
>
> Are there places other than identifiers where we can have UCNs
> outside of char/string literals? If not, maybe we should massage
> the grammar definition of _identifier_ instead of persisting
> the handwaving in lex.phases p4.
>
The idea of doing it there, as we form preprocessing tokens, is that we
don't want
int i\N{SEMICOLON} to do something (I don't think implementers would like
that). And i\N{SEMICOLON} doesn't match the grammar of an identifier, so I
think an eager replacement makes sense.
We should avoid having to carry these things around through phase 3.
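
To make the distinction concrete, here is a minimal sketch of what is and
isn't allowed today. The names caf\u00E9, semi and read_cafe are made up for
illustration, and \u003B stands in for \N{SEMICOLON} because named escapes
are a C++23 feature that older compilers reject.

```cpp
#include <cassert>

// \u00E9 is LATIN SMALL LETTER E WITH ACUTE, a valid identifier character,
// so a UCN can spell it outside of a literal.
int caf\u00E9 = 42;

// Inside a string literal, C++ lets a UCN name a basic character such as ';'.
const char semi[] = "\u003B";

// Ill-formed if uncommented: outside a literal, a UCN must not name ';'.
// int i\u003B

// Hypothetical helper, only here so the variable is read somewhere.
int read_cafe() { return caf\u00E9; }
```

Calling read_cafe() returns the value stored in the UCN-spelled variable,
and semi holds a single ';' followed by the terminator.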
>
> > https://isocpp.org/files/papers/P2621R0.pdf
>
> Should this go to SG12, because it discusses undefined behavior?
>
I'm hoping that removing UB from a context in which no UB can meaningfully
exist doesn't require a visit to every group. I do, however, want SG22 to
look at it to increase consistency between C and C++.
>
> > * Thinking about tailoring, unicode/cldr locales, localized numbers and
> > dates formatting. Personally I think this is out of scope for 26 and I'm
> > kinda hoping libicu 4x matures.
> > But there are questions worth asking. Namely can C++ mandate a
> > dependency on a lib like icu/icu4x, share a common implementation for all
> > vendors, or are we not concerned about the implementation burden of that?
> > Because I am. I think a locale object compliant with unicode is necessary,
> > but if implementers can't take dependencies, an implementation of the cldr
> > seems... asking too much.
>
> In general, I think we can't mandate the dependency on a particular
> third-party (non-ISO standardized) library, but we can certainly
> standardize an interface that makes it easy to use a specific
> third-party library to implement that interface.
>
> But I'd really like to see a roadmap of proposed modern C++-style
> interfaces before buying into anything at all.
>
> > I'm already not really comfortable with the implementation burden for
> > non tailored things (and at the same time unwilling to make the design
> > amenable to ICU).
>
> What does "non-tailored" mean?
>
Non-localized.
Unicode generally defines locale-agnostic text transformations, with the
express recommendation that the algorithms be tailored with locale-specific
data and behavior, which adds a lot of complexity.
To take a simple example, Greek has specific rules for casing sigma, and
German has ever-changing rules for SS, ß, ẞ, etc.
Normalization is locale-agnostic; casing has locale-specific behavior for a
handful of languages, but a default algorithm is usually useful.
Grapheme clustering can be localized (ch is a single grapheme in Slovak).
Some algorithms, namely collation, are either always localized or not
useful in their non-localized form.
The tailored versions are at least an order of magnitude more complex to
implement, in part because of the reliance on ever-changing data and the
need for locale-specific behaviors, which benefit from domain expertise.
They are also harder to specify.
We need to decide the extent to which these things are reasonable for the
committee to take on.
Both "this is an important feature for users" and "neither the committee
nor implementers have the expertise or bandwidth to deal with arbitrary
locales" seem reasonable positions.
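
As a rough illustration of default versus tailored behavior, here is a toy
sketch. toy_upper and its capital_sharp_s flag are hypothetical, it only
handles ASCII letters and U+00DF, and it is nowhere near a real case-mapping
implementation. The default branch follows the Unicode full case mapping of
sharp s to "SS"; the flag models a German tailoring that prefers U+1E9E.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Toy model covering ASCII letters and U+00DF (LATIN SMALL LETTER SHARP S).
// Default: sharp s upper-cases to "SS". With the (hypothetical) tailoring
// flag set, it upper-cases to U+1E9E (LATIN CAPITAL LETTER SHARP S) instead.
std::u32string toy_upper(std::u32string_view in, bool capital_sharp_s = false) {
    std::u32string out;
    for (char32_t c : in) {
        if (c == U'\u00DF') {
            if (capital_sharp_s) out += U'\u1E9E';
            else out += U"SS";
        } else if (c >= U'a' && c <= U'z') {
            out += static_cast<char32_t>(c - U'a' + U'A');
        } else {
            out += c;  // everything else is passed through untouched
        }
    }
    return out;
}
```

Even this single rule changes the length of the output, which hints at why
the tailored versions are so much harder to implement and to specify.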
> > * If you folks think text processing would need some rope-like
> > structure, I'm not sure it's text related but... I guess you could talk
> > about that!
> > * Some unicode algorithms are unbounded, and may require allocation. A
> > small_vector would help specification,
>
> Why would it help with the specification?
>
I guess we could say "does not allocate if some variable is smaller than
X", but implementers will need a small vector as an implementation detail
anyway. So if people think such a thing should be standardized, it's
something that could be useful.
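
To show the kind of thing I mean, here is a minimal sketch of a small-buffer
container. toy_small_vector is a made-up name, not a proposed interface, and
unlike a real small_vector it does not keep its elements contiguous: the
first N live inline and the rest spill into a std::vector.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical small-buffer container: the first N elements are stored
// inline, later ones go to a heap-backed std::vector.
template <class T, std::size_t N>
class toy_small_vector {
    std::array<T, N> inline_{};  // requires T to be default-constructible
    std::vector<T> spill_;       // only used once more than N elements exist
    std::size_t size_ = 0;

public:
    void push_back(const T& v) {
        if (size_ < N) inline_[size_] = v;
        else spill_.push_back(v);
        ++size_;
    }
    const T& operator[](std::size_t i) const {
        return i < N ? inline_[i] : spill_[i - N];
    }
    std::size_t size() const { return size_; }
    bool on_heap() const { return !spill_.empty(); }
};
```

With N chosen to cover the common case, an algorithm working on a short run
of code points never allocates, while longer inputs still work.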
> > but again that doesn't seem very in scope of this group.
> > * We need *some* unicode properties, I'm not sure which to be honest. My
> > current intent is to only provide things that would be generally useful
> > outside of the algorithms that are otherwise provided.
> > You could look at this swift paper
> > https://github.com/apple/swift-evolution/blob/main/proposals/0211-unicode-scalar-properties.md
>
> Jens
>
Received on 2022-07-09 22:04:08