ISOCPP sg16 List: Re: utfN

From: Zach Laine <whatwasthataddress_at_[hidden]>
Date: Thu, 4 May 2023 11:28:15 -0500

On Thu, May 4, 2023 at 11:02 AM Jens Maurer <jens.maurer_at_[hidden]> wrote:
>
> On 04/05/2023 17.44, Zach Laine wrote:
> > On Thu, May 4, 2023 at 1:58 AM Jens Maurer <jens.maurer_at_[hidden]> wrote:
>
> >> section 4.1
> >>
> >> template<class UTF16Range>
> >> void process_input(UTF16Range && r);
> >>
> >> We should have a concept for a UTF16Range and use it here.
> >> (Maybe just "value_type is char16_t".)
> >
> > It's just an example.
>
> People will want to write such code, and might find such
> a concept useful as a facility provided by the standard
> library.

There is such a concept in the paper: utf16_range.

> >> section 5.3
> >>
> >> null_sentinel_t requires a utf_code_unit<T> for its operator==. I don't think
> >> we want that. I think this works for all null-terminated sequences, so we shouldn't
> >> constrain "T" except maybe for the ability to compare *p with literal 0.
> >> (You could have a null-terminated sequence of pointer values, for example.)
> >> Hm... Maybe a comparison against a value-initialized T is even better?
> >> That covers 0 and pointers and other stuff.
> >
> > You're not the first to suggest this, and I like the idea as well. In
> > fact, with such a change, I think I want to take it out of
> > std::uc and just put it in std.
>
> Agreed.
>
> > I keep forgetting to ask for a poll
> > on this.
>
> Why do you need a poll for improving a facility?

Because consensus is important, and I've only heard from about 3 people on this.

> >> section 5.6.4
> >>
> >> "A simple way to represent a transcoding view is as a pair of transcoding iterators. However, there is a problem with that approach, since a utf32_view<utf_8_to_32_iterator<char const *>> would be a range the size of 6 pointers. Worse yet, a utf32_view<utf_8_to_16_iterator<utf_16_to_32_iterator<char const *>>> would be the size of 18 pointers! Further, such a view would do a UTF-8 to UTF-16 to UTF-32 conversion, when it could have done a direct UTF-8 to UTF-32 conversion instead."
> >>
> >> That's why you should focus on ranges, not iterators and sentinels.
> >> And any optimizations such as just returning a subrange (because
> >> the input is already a char8_t-range) should be done in the range
> >> adaptor object, not at the level of individual iterators.
> >> See [range.take.overview] for an example.
> >
> > That's already how things work. I don't understand why you think that
> > the unpacking happens on the iterators. It only happens when forming
> > ranges. In fact, the text you quoted above explicitly talks about
> > utf32_views.
>
> Well, the text I quoted talks about chaining iterators and then adding
> a view on top. That's not the "range adaptor" approach at all.
>
> >> If you've implemented this (plus the eager algorithm-based stuff), please add a few
> >> performance figures comparing views-based performance with eager performance to the
> >> text. We should document what we're buying into here.
> >
> > That was already there in R0. It's about 2-3X, depending on whether
> > you use SIMD. Since everyone (except me) decided perf does not
> > matter, I removed it.
>
> Please leave design decisions and rationale and related information
> in all paper revisions. This is a standing request by LEWG, too.
> (They will see the latest revision only and won't trawl through
> older papers to reconstruct rationale.)
>
> In any case, I think SG16 recognized this was a huge proposal and we
> definitely want views, so we should start with views. We can add
> more stuff later (including in separate papers) at any time.

Understood. I just think that information is better off in those
separate papers. LEWG has never seen any version of this paper, and
perf metrics are not part of the rationale.
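
As an aside on the size concern quoted from section 5.6.4 earlier in
this message: the 6-pointer and 18-pointer figures can be reproduced
with a small sketch. It assumes each transcoding iterator stores three
positions of the underlying sequence (start, current, end) and that the
view is a plain iterator pair. Both types below are hypothetical
stand-ins, not the paper's iterators.

```cpp
#include <cstddef>

// Hypothetical stand-in for a transcoding iterator: keeps the start,
// the current position, and the end of the underlying sequence.
template<class I>
struct transcoding_iterator_sketch {
    I first;
    I current;
    I last;
};

// Hypothetical view represented naively as a pair of iterators.
template<class I>
struct iterator_pair_view_sketch {
    I begin_;
    I end_;
};

using ptr = char const*;

// One level of transcoding: 2 iterators x 3 pointers.
using single_view =
    iterator_pair_view_sketch<transcoding_iterator_sketch<ptr>>;

// Two chained levels: 2 iterators x 3 inner iterators x 3 pointers.
using nested_view = iterator_pair_view_sketch<
    transcoding_iterator_sketch<transcoding_iterator_sketch<ptr>>>;

// These hold on typical ABIs, where a struct of pointers has no padding.
static_assert(sizeof(single_view) == 6 * sizeof(ptr));
static_assert(sizeof(nested_view) == 18 * sizeof(ptr));
```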

> >> section 5.6
> >>
> >> utf8_view still has a utf_iter template parameter and an (iterator, sentinel)
> >> pair for the constructor parameters. This is not how range adapters work.
> >> Please have a look at [range.transform.view]. You need to deal in views or
> >> ranges, not in iterators. Also, how about having a utf_view<charT>,
> >> where charT is one of char8_t, char16_t, char32_t and indicates both the
> >> "output type" and the desired UTF-x encoding? Then, utf8_view and friends
> >> become simple typedefs (and we can decide separately whether we want such
> >> typedefs or not).
> >
> > Ah, you're right. I'm really used to using the iterator interface, so
> > I only did a partial transformation. I'll change this as well.
>
> Great, thanks. Please check out [range.take.overview] and formulate
> some phrasing around the utf_view range adaptor object that provides
> some of the optimizations you seek (e.g. chaining two transcodings
> is short-circuited, and a no-op transcoding on a std::span delivers
> that exact std::span). Once we have the framework in place, we can
> discuss which optimizations in particular we want to provide in the
> standard. (Unfortunately, this area is not QoI, because decltype
> can determine which actual view type you get.)

Will do.

Zach

Received on 2023-05-04 16:28:30