Re: Considerations for Unicode algorithms

From: Corentin <corentin.jabot_at_[hidden]>
Date: Wed, 1 Feb 2023 10:22:15 +0100
On Mon, Jan 30, 2023 at 10:33 PM Zach Laine <whatwasthataddress_at_[hidden]> wrote:

> On Mon, Jan 30, 2023 at 7:36 AM Corentin <corentin.jabot_at_[hidden]> wrote:
> >
> > Hey folks.
> > As promised eons ago, I put some of my thoughts on Unicode algorithms in
> a paper.
> > I'll try to improve the form when I have time, but I wanted to give Zach
> and everyone else time to look at it before Issaquah, if we want to have
> something to discuss in the corridor track.
> >
> > https://isocpp.org/files/papers/D2773R0.pdf
> Thanks, Corentin. I agree with most of what you said, with the
> exception of these things:
> First, I think user tailoring is really important, and should be a
> first-order feature (especially w.r.t word breaking!). I think that
> language tailoring, even if incomplete, can be useful -- the more
> robust version done in terms of a proper std::ulocale can be added
> later, as another overload. To me, the important way to think of this
> is that any algorithm that is language-tailorable covers a subset of
> all possible languages one's program might process. This implies that
> the primary concern is to reduce the number N of unsupported
> languages. If I can have support for all but 5 languages, that's
> better than support for all but 10. Neither one is perfect, of
> course.

We agree tailoring is _super_ important.
But if the conclusion is that we have to deliver tailoring in the first set
of features... I think someone will have to come up with a ulocale object
very soon, and that may be stretching us (individually, SG16, LEWG, LWG) too
thin.
I would need to be convinced that having, for example, a special case for
Turkish would be useful to users.

Not as in "do Turkish people care about capitalization" - that's a given -
but how will developers (who statistically probably don't want to think
about Turkish capitalization) use that ad-hoc feature?
We don't have a way to map a std::locale to either a language or an
algorithm, so I don't know how I'd write

uppercase(is_turkish(some_std_locale_object)
              ? uppercase_locale_options::turkish
              : uppercase_locale_options::default, ...)

And even if we knew how to write that, would users do it?
And then, if we can write that, should we pass a std::locale object? How
reliable would that be?

It gets trickier for rules that may not just be language-dependent but
region-dependent, or that should only be applied in some cases.
And then, can ICU be used? Is ICU4X mature enough yet? We'd need to answer
these questions quite soon.

> Also, I think the algorithms should be generic. They should not work
> only with char32_t, or only with int, etc. Users should be free to
> use char8_t, char, unsigned char, etc., for UTF-8. std::byte if
> you're nasty.

cf. other mails :)

> Finally, I don't fully understand this part of your paper:
> "
> Unsurprisingly, Zach Laine came to the same conclusion, and the
> solution proposed in his work is to let iterators peek through layers
> of encode/decode iterators through their base() function - (such
> function exists on views and views’ iterators of the design and return
> an underlying iterator).
> The conclusion I came to is that we can let range adaptors add or
> remove these implicit decode functions when they are chained before
> any view is constructed. The effect is the same, although the
> implementation is simpler, as maintaining a base() iterator can be
> tricky, especially in the presence of a non-common or reverse
> iterator.
> "
> Could you be more specific on how your approach works? Some code
> would help I think.

I guess I should, because I think it's basically the one original trick I
came up with.

For any given view, we have the view itself and its range adaptor, such
that both foo_view(V) and V | foo work.

V | foo is defined by an operator|() taking a range (which need not be a
view, though it needs to be viewable).
And most of the time, that is the interface people are going to use when
using views; it's just much more convenient.

So, instead of defining a single operator|(range_of_char32_t) on that
range adaptor, we can define additional overloads that take ranges of
char8_t/char16_t and produce a unicode_algo_view<codepoint_view<charN_t>>
instead of a unicode_algo_view<all_t<...>>.

The benefits are:

   - We can support implicit decode/encode for the sake of ergonomics, in
   the place where ergonomics are most desirable.
   - The only place we need to specify decode/encode steps and their error
   policies is in these range adaptor overloads, which we could probably do
   once for all Unicode algorithms (both in terms of specification and
   implementation), without the views themselves having to take on double
   duty.
   - We can elide these implicit decode/encode steps when chaining
   algorithms (r | normalize | word_break): as these pipe operators
   basically construct a graph of algorithms, we can remove implicit
   nodes when they do redundant work, i.e., decode | normalize | encode |
   decode | word_break | encode should just be decode | normalize |
   word_break | encode.

Does that clarify?

> Zach

Received on 2023-02-01 09:22:30