On Sat, Jul 9, 2022 at 9:03 PM Jens Maurer <Jens.Maurer@gmx.net> wrote:

On 09/07/2022 18.25, Corentin Jabot via SG16 wrote:
> Hey Tom.
> If you are *really* bored, I guess you could take a gander at
>
> https://isocpp.org/files/papers/P2620R0.pdf

This feels thoroughly confused.

Quote from the paper:

"This is by no mean a major issue in C++, as we don’t put restrictions
on universal-character-names (unlike C)"

Indeed, the important bit is missing. This should read

"This is by no mean a major issue in C++, as we don’t put restrictions
on universal-character-names in string literals (unlike C)"

Quote from the standard:

"If a universal-character-name outside the c-char-sequence, s-char-sequence, or r-char-sequence
of a character-literal or string-literal (in either case, including within a user-defined-literal) corresponds
to a control character or to a character in the basic character set, the program is ill-formed."

So, there are restrictions in C++ as well.

If the comparison with C wants to highlight the UCN vs. char/string-literal
treatment, it should say so.
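
To make the quoted rule concrete, here is a minimal sketch. It assumes an ASCII-compatible ordinary literal encoding (for example UTF-8); the variable name is made up for illustration and does not come from the paper.

```cpp
#include <string>

// Inside a string-literal, a UCN may name a character of the basic
// character set: \u0041 is 'A' and \u003B is ';'.
// Assumption: the ordinary literal encoding is ASCII-compatible.
const std::string inside_literal = "\u0041\u003B";

// Outside a literal the quoted rule applies, so a line such as
//     int i = 0\u003B
// is ill-formed instead of being read as "int i = 0;".
```

Under that assumption, inside_literal compares equal to "A;".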

Are there places other than identifiers where we can have UCNs
outside of char/string literals? If not, maybe we should massage
the grammar definition of _identifier_ instead of persisting
the handwaving in lex.phases p4.

The idea of doing it there, as we form preprocessing tokens, is that we don't want int i\N{SEMICOLON} to do something (I don't think implementers would like that). Also, i\N{SEMICOLON} doesn't match the grammar of an identifier, so I think an eager replacement makes sense.
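
As a rough sketch of the check that the rule quoted from the standard implies for something like i\N{SEMICOLON}: the helper names below are hypothetical, and the basic character set is assumed to be the C++23 one (ASCII graphic characters except $, @ and `, plus space, horizontal tab, vertical tab, form feed and new-line).

```cpp
// Hypothetical helpers, not taken from P2620 or from any implementation.

// Control characters: U+0000..U+001F and U+007F..U+009F.
constexpr bool is_control(char32_t c) {
    return c <= 0x1F || (c >= 0x7F && c <= 0x9F);
}

// Assumption: basic character set as of C++23.
constexpr bool is_basic(char32_t c) {
    return c == 0x09 || c == 0x0A || c == 0x0B || c == 0x0C || c == 0x20 ||
           (c >= 0x21 && c <= 0x7E && c != 0x24 && c != 0x40 && c != 0x60);
}

// True if a UCN (or named escape) designating `c` would make the program
// ill-formed when it appears outside a character or string literal.
constexpr bool rejected_outside_literal(char32_t c) {
    return is_control(c) || is_basic(c);
}
```

With this sketch, U+003B (SEMICOLON) is rejected outside a literal, while U+00E9 (é) is not.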

We should avoid having to carry these things around through phase 3.

> https://isocpp.org/files/papers/P2621R0.pdf

Should this go to SG12, because it discusses undefined behavior?

I'm hoping that removing UB from a context in which no UB can meaningfully exist doesn't require visiting all the groups. I do, however, want SG22 to look at it, to increase consistency between C and C++.

> * Thinking about tailoring, unicode/cldr locales, localized numbers and dates formatting. Personally I think this is out of scope for 26 and I'm kinda hoping libicu 4x matures.
> But there are questions worth asking. Namely can C++ mandate a dependency on a lib like icu/icu4x, share a common implementation for all vendors, or are we not concerned about the implementation burden of that? Because I am. I think a locale object compliant with unicode is necessary, but if implementers can't take dependencies, an implementation of the cldr seems... asking too much.

In general, I think we can't mandate the dependency on a particular
third-party (non-ISO standardized) library, but we can certainly
standardize an interface that makes it easy to use a specific
third-party library to implement that interface.

But I'd really like to see a roadmap of proposed modern C++-style
interfaces before buying into anything at all.

> I'm already not really comfortable with the implementation burden for non tailored things (and at the same time unwilling to make the design amenable to ICU).

What does "non-tailored" mean?

Non-localized.

Unicode, in general, defines locale-agnostic text transformations, with the express recommendation that the algorithms should be tailored with locale-specific data and behavior, which adds a lot of complexity.

To take a simple example, Greek has some specific rules for casing sigma, and German has ever-changing rules for SS, ß, ẞ, etc.

Normalization is locale-agnostic. Casing has locale-specific behavior for a handful of languages, but a default algorithm is usually useful.
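
To illustrate the difference between default and tailored casing, here is a toy sketch. The function toy_upper and its turkic flag are hypothetical and not a proposed interface; it handles only ASCII letters, the default full mapping of ß to "SS", and the Turkic tailoring in which i uppercases to İ (U+0130). Real implementations are driven by the Unicode data files rather than hard-coded cases.

```cpp
#include <string>

// Toy uppercasing, for illustration only (hypothetical helper).
// - ASCII letters are mapped directly.
// - U+00DF (ß) expands to "SS", as in the default full case mapping.
// - With turkic == true, 'i' maps to U+0130 (İ) instead of 'I'.
std::u32string toy_upper(const std::u32string& in, bool turkic = false) {
    std::u32string out;
    for (char32_t c : in) {
        if (c == U'i' && turkic) {
            out += U'\u0130';
        } else if (c >= U'a' && c <= U'z') {
            out += static_cast<char32_t>(c - U'a' + U'A');
        } else if (c == U'\u00DF') {
            out += U"SS";  // one code point becomes two
        } else {
            out += c;
        }
    }
    return out;
}
```

Even this toy shows two things a per-code-point table can't express: the output can be longer than the input, and the same input gives different results depending on the locale.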

Grapheme clustering can be localized (ch is a single grapheme in Slovak).

Some algorithms, namely collation, are always localized or not useful in their non-localized form.

There is at least an order of magnitude greater implementation complexity for the tailored versions, in part because of the reliance on ever-changing data and the need for locale-specific behaviors, which benefit from domain expertise. It's also harder to specify.

We need to decide the extent to which these things are reasonable for the committee to take on.

Both "this is an important feature for users" and "neither the committee nor implementers have the expertise or bandwidth to deal with arbitrary locales" seem reasonable positions.

> * If you folks think text processing would need some rope-like structure, I'm not sure it's text related but... I guess you could talk about that!
> * Some unicode algorithms are unbounded, and may require allocation. A small_vector would help specification,

Why would it help with the specification?

I guess we could say "does not allocate if some variable is smaller than X", but implementers will have to have a small vector as an implementation detail, so if people have the idea that such a thing should be standardized anyway, it's something that could be useful.
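
For readers unfamiliar with the idea, a minimal sketch of a small-vector-like buffer follows. The name small_buffer and its interface are hypothetical, not a proposal; it assumes T is default-constructible and copyable, and it only supports appending and indexing.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical sketch: stores up to N elements inline and only touches the
// heap once more than N elements have been pushed.
template <class T, std::size_t N>
class small_buffer {
    std::array<T, N> inline_{};  // inline storage, no allocation
    std::vector<T> heap_;        // used only after the inline storage is full
    std::size_t size_ = 0;

public:
    void push_back(const T& v) {
        if (size_ < N) {
            inline_[size_] = v;
        } else {
            // First spill: copy the inline elements over, then keep growing.
            if (heap_.empty()) heap_.assign(inline_.begin(), inline_.end());
            heap_.push_back(v);
        }
        ++size_;
    }

    bool on_heap() const { return size_ > N; }
    std::size_t size() const { return size_; }

    const T& operator[](std::size_t i) const {
        return on_heap() ? heap_[i] : inline_[i];
    }
};
```

With something like small_buffer<char32_t, 4>, the first four push_back calls do not allocate, which is the kind of guarantee the "smaller than X" wording would be describing.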

> but again that doesn't seem very in scope of this group.
> * We need *some* unicode properties, I'm not sure which to be honest. My current intent is to only provide things that would be generally useful outside of the algorithms that are otherwise provided.
> You could look at this swift paper https://github.com/apple/swift-evolution/blob/main/proposals/0211-unicode-scalar-properties.md

Jens