Re: Considerations for Unicode algorithms

From: Zach Laine <whatwasthataddress_at_[hidden]>
Date: Mon, 30 Jan 2023 15:33:09 -0600
On Mon, Jan 30, 2023 at 7:36 AM Corentin <corentin.jabot_at_[hidden]> wrote:
> Hey folks.
> As promised eons ago, I put some of my thoughts on Unicode algorithms in a paper.
> I'll try to improve the form when I have time, but I wanted to give Zach and everyone else time to look at it before Issaquah, if we want to have something to discuss in the corridor track.
> https://isocpp.org/files/papers/D2773R0.pdf

Thanks, Corentin. I agree with most of what you said, with the
exception of these things:

First, I think user tailoring is really important, and should be a
first-order feature (especially w.r.t. word breaking!). I think that
language tailoring, even if incomplete, can be useful -- the more
robust version done in terms of a proper std::ulocale can be added
later, as another overload. To me, the important way to think of this
is that any algorithm that is language-tailorable covers a subset of
all possible languages one's program might process. This implies that
the primary concern is to reduce the number N of unsupported
languages. If I can have support for all but 5 languages, that's
better than support for all but 10. Neither one is perfect, of course.
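
To make the user-tailoring point concrete, here is a minimal sketch of what
a tailorable word-breaking interface could look like. Every name in it
(break_verdict, tailoring_fn, word_breaks, break_at_underscore) is
hypothetical and comes from neither the paper nor any existing library. The
default rule is a deliberately trivial stand-in (break wherever a space
meets a non-space), not the real UAX #29 rules; the only thing the sketch is
meant to show is the shape of the hook that lets a user override the default
decision at each candidate boundary.

```cpp
#include <cstddef>
#include <functional>
#include <string_view>
#include <vector>

// Hypothetical: what a tailoring callback may say about one candidate boundary.
enum class break_verdict { use_default, force_break, forbid_break };

// Hypothetical callback type: inspects the text around `pos` (the boundary
// lies between text[pos - 1] and text[pos]) and returns a verdict.
using tailoring_fn =
    std::function<break_verdict(std::u32string_view text, std::size_t pos)>;

// Stand-in for the default rules. This is NOT UAX #29; it only breaks
// where a space is adjacent to a non-space.
inline bool default_break(std::u32string_view text, std::size_t pos) {
    return (text[pos - 1] == U' ') != (text[pos] == U' ');
}

// Returns the interior break positions of `text`. An empty `tailor`
// means "use the default rules everywhere".
inline std::vector<std::size_t> word_breaks(std::u32string_view text,
                                            const tailoring_fn& tailor = {}) {
    std::vector<std::size_t> breaks;
    for (std::size_t pos = 1; pos < text.size(); ++pos) {
        const break_verdict v =
            tailor ? tailor(text, pos) : break_verdict::use_default;
        const bool is_break =
            v == break_verdict::force_break ||
            (v == break_verdict::use_default && default_break(text, pos));
        if (is_break) breaks.push_back(pos);
    }
    return breaks;
}

// Example user tailoring: always break on either side of an underscore,
// and defer to the default rules everywhere else.
inline break_verdict break_at_underscore(std::u32string_view text,
                                         std::size_t pos) {
    if (text[pos - 1] == U'_' || text[pos] == U'_')
        return break_verdict::force_break;
    return break_verdict::use_default;
}
```

With these definitions, word_breaks(U"a_b") finds no interior breaks, while
word_breaks(U"a_b", break_at_underscore) reports breaks at positions 1 and 2.
A later overload taking a locale object could sit next to the callback
overload without changing it.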

Also, I think the algorithms should be generic. They should not work
only with char32_t, or only with int, etc. Users should be free to
use char8_t, char, unsigned char, etc., for UTF-8. std::byte if
you're nasty.
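
As an illustration of that genericity, here is a small sketch of a UTF-8
decoder that is a template over the range type and accepts any 8-bit code
unit: char, unsigned char, std::byte, or char8_t in C++20. The names to_u32
and decode_utf8 are made up for this example. The decoder is intentionally
simplified: it replaces a bad lead byte or a missing continuation byte with
U+FFFD, but it does not reject overlong encodings, surrogates, or values
above U+10FFFF, all of which a conforming decoder must do.

```cpp
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical helper: widen any 8-bit code unit type to an unsigned value.
// Going through unsigned char avoids sign extension for plain char.
template <class CodeUnit>
constexpr std::uint32_t to_u32(CodeUnit cu) {
    static_assert(sizeof(CodeUnit) == 1, "expected a one-byte code unit type");
    return static_cast<std::uint32_t>(static_cast<unsigned char>(cu));
}

// Simplified generic UTF-8 decoder over any range of one-byte code units.
template <class Range>
std::vector<char32_t> decode_utf8(const Range& r) {
    std::vector<char32_t> out;
    auto it = std::begin(r);
    const auto last = std::end(r);
    while (it != last) {
        const std::uint32_t lead = to_u32(*it);
        ++it;
        int extra = 0;          // number of continuation bytes expected
        std::uint32_t cp = 0;   // code point being assembled
        if (lead < 0x80)                { extra = 0; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
        else { out.push_back(U'\uFFFD'); continue; }  // not a valid lead byte

        bool ok = true;
        for (int i = 0; i < extra; ++i) {
            // A continuation byte must look like 10xxxxxx.
            if (it == last || (to_u32(*it) & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (to_u32(*it) & 0x3F);
            ++it;
        }
        out.push_back(ok ? static_cast<char32_t>(cp) : U'\uFFFD');
    }
    return out;
}
```

The same function template then works unchanged on a std::string, an array
of unsigned char, or a std::vector<std::byte>; the only requirement placed
on the element type is that it is one byte wide and convertible to
unsigned char with static_cast.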

Finally, I don't fully understand this part of your paper:

Unsurprisingly, Zach Laine came to the same conclusion, and the
solution proposed in his work is to let iterators peek through layers
of encode/decode iterators through their base() function - (such
function exists on views and views’ iterators of the design and return
an underlying iterator).

The conclusion I came to is that we can let range adaptors add or
remove these implicit decode functions when they are chained before
any view is constructed. The effect is the same, although the
implementation is simpler, as maintaining a base() iterator can be
tricky, especially in the presence of a non-common or reverse

Could you be more specific about how your approach works? Some code
would help, I think.


Received on 2023-01-30 21:33:22