ISOCPP sg16 List: Re: utfN

From: Zach Laine <whatwasthataddress_at_[hidden]>
Date: Thu, 4 May 2023 10:44:17 -0500

On Thu, May 4, 2023 at 1:58 AM Jens Maurer <jens.maurer_at_[hidden]> wrote:
>
>
> On 04/05/2023 03.47, Zach Laine wrote:
> > Sorry for the late reply SG-16, I was waiting until I had done more
> > experimentation with the code before replying to this thread. Also,
> > I've updated the paper: https://isocpp.org/files/papers/D2728R1.html
>
> Thanks for the update.
>
>
> section 4.1
>
> template<class UTF16Range>
> void process_input(UTF16Range && r);
>
> We should have a concept for a UTF16Range and use it here.
> (Maybe just "value_type is char16_t".)

It's just an example.

> section 4.2
>
> The use / presence of std::uc::utf_8_to_16_iterator seems to be undermotivated;
> the views-based equivalent is so much shorter and nicer to look at.
>
>
> Please add more of the use-cases from R0 (e.g. transcode a buffer) and show
> code how to express that with the range adapter. (Addressing several
> use-cases with a single facility is good design.)

Sure, I can add them back in.

> section 5.3
>
> null_sentinel_t requires a utf_code_unit<T> for its operator==. I don't think
> we want that. I think this works for all null-terminated sequences, so we shouldn't
> constrain "T" except maybe for the ability to compare *p with literal 0.
> (You could have a null-terminated sequence of pointer values, for example.)
> Hm... Maybe a comparison against a value-initalized T is even better?
> That covers 0 and pointers and other stuff.

You're not the first to suggest this, and I like the idea as well. In
fact, I think with such a change, I think I want to take it out of
std::uc and just put it in std. I keep forgetting to ask for a poll
on this.

> section 5.6.4
>
> "A simple way to represent a transcoding view is as a pair of transcoding iterators. However, there is a problem with that approach, since a utf32_view<utf_8_to_32_iterator<char const *>> would be a range the size of 6 pointers. Worse yet, a utf32_view<utf_8_to_16_iterator<utf_16_to_32_iterator<char const *>>> would be the size of 18 pointers! Further, such a view would do a UTF-8 to UTF-16 to UTF-32 conversion, when it could have done a direct UTF-8 to UTF-32 conversion instead."
>
> That's why you should focus on ranges, not iterators and sentinels.
> And any optimizations such as just returning a subrange (because
> the input is already a char8_t-range) should be done in the range
> adaptor object, not at the level of individual iterators.
> See [range.take.overview] for an example.

That's already how things work. I don't understand why you think that
the unpacking happens on the iterators. It only happens when forming
ranges. In fact, the text you quoted above explicitly talks about
utfN32_views.

> If you've implemented this (plus the eager algorithm-based stuff), please add a few
> performance figures comparing views-based performance with eager performance to the
> text. We should document what we're buying into here.

That was already there in R0. It's about 2-3X, depending on whether
you use SIMD. Since everyone (except me) decided perf does not
matter, I removed it.

> section 5.4
>
> find_invalid_encoding takes an (iterator, sentinel) pair. It should instead
> simply take a view. Same for is_encoded. (Why can't the latter deal with
> sentinels?) Same for starts_encoded and ends_encoded.

Sure, I can change this too.

> section 5.6
>
> utf8_view still has a utf_iter template parameter and an (iterator, sentinel)
> pair for the constructor parameters. This is not how range adapters work.
> Please have a look at [range.transform.view]. You need to deal in views or
> ranges, not in iterators. Also, how about having a utf_view<charT>,
> where charT is one of char8_t, char16_t, char32_t and indicates both the
> "output type" and the desired UTF-x encoding? Then, utf8_view and friends
> become simple typedefs (and we can decide separately whether we want such
> typedefs or not).

Ah, you're right. I'm really used to using the iterator interface, so
I only did a partial transformation. I'll change this as well.

Zach

Received on 2023-05-04 15:44:32