ISOCPP sg16 List: Re: utfN

From: Jens Maurer <jens.maurer_at_[hidden]>
Date: Thu, 4 May 2023 18:02:30 +0200

On 04/05/2023 17.44, Zach Laine wrote:
> On Thu, May 4, 2023 at 1:58 AM Jens Maurer <jens.maurer_at_[hidden]> wrote:

>> section 4.1
>>
>> template<class UTF16Range>
>> void process_input(UTF16Range && r);
>>
>> We should have a concept for a UTF16Range and use it here.
>> (Maybe just "value_type is char16_t".)
>
> It's just an example.

People will want to write such code, and might find such
a concept useful as a facility provided by the standard
library.

>> section 5.3
>>
>> null_sentinel_t requires a utf_code_unit<T> for its operator==. I don't think
>> we want that. I think this works for all null-terminated sequences, so we shouldn't
>> constrain "T" except maybe for the ability to compare *p with literal 0.
>> (You could have a null-terminated sequence of pointer values, for example.)
>> Hm... Maybe a comparison against a value-initialized T is even better?
>> That covers 0 and pointers and other stuff.
>
> You're not the first to suggest this, and I like the idea as well. In
> fact, with such a change, I think I want to take it out of
> std::uc and just put it in std.

Agreed.

> I keep forgetting to ask for a poll
> on this.

Why do you need a poll for improving a facility?

>> section 5.6.4
>>
>> "A simple way to represent a transcoding view is as a pair of transcoding iterators. However, there is a problem with that approach, since a utf32_view<utf_8_to_32_iterator<char const *>> would be a range the size of 6 pointers. Worse yet, a utf32_view<utf_8_to_16_iterator<utf_16_to_32_iterator<char const *>>> would be the size of 18 pointers! Further, such a view would do a UTF-8 to UTF-16 to UTF-32 conversion, when it could have done a direct UTF-8 to UTF-32 conversion instead."
>>
>> That's why you should focus on ranges, not iterators and sentinels.
>> And any optimizations such as just returning a subrange (because
>> the input is already a char8_t-range) should be done in the range
>> adaptor object, not at the level of individual iterators.
>> See [range.take.overview] for an example.
>
> That's already how things work. I don't understand why you think that
> the unpacking happens on the iterators. It only happens when forming
> ranges. In fact, the text you quoted above explicitly talks about
> utf32_views.

Well, the text I quoted talks about chaining iterators and then adding
a view on top. That's not the "range adaptor" approach at all.

>> If you've implemented this (plus the eager algorithm-based stuff), please add a few
>> performance figures comparing views-based performance with eager performance to the
>> text. We should document what we're buying into here.
>
> That was already there in R0. It's about 2-3X, depending on whether
> you use SIMD. Since everyone (except me) decided perf does not
> matter, I removed it.

Please leave design decisions and rationale and related information
in all paper revisions. This is a standing request by LEWG, too.
(They will see only the latest revision and won't wade through
older papers to reconstruct rationale.)

In any case, I think SG16 recognized this was a huge proposal and we
definitely want views, so we should start with views. We can add
more stuff later (including in separate papers) at any time.

>> section 5.6
>>
>> utf8_view still has a utf_iter template parameter and an (iterator, sentinel)
>> pair for the constructor parameters. This is not how range adaptors work.
>> Please have a look at [range.transform.view]. You need to deal in views or
>> ranges, not in iterators. Also, how about having a utf_view<charT>,
>> where charT is one of char8_t, char16_t, char32_t and indicates both the
>> "output type" and the desired UTF-x encoding? Then, utf8_view and friends
>> become simple typedefs (and we can decide separately whether we want such
>> typedefs or not).
>
> Ah, you're right. I'm really used to using the iterator interface, so
> I only did a partial transformation. I'll change this as well.

Great, thanks. Please check out [range.take.overview] and formulate
some phrasing around the utf_view range adaptor object that provides
some of the optimizations you seek (e.g. chaining two transcodings
is short-circuited, and a no-op transcoding on a std::span delivers
that exact std::span). Once we have the framework in place, we can
discuss which optimizations in particular we want to provide in the
standard. (Unfortunately, this area is not QoI, because decltype
can determine which actual view type you get.)

Thanks,
Jens

Received on 2023-05-04 16:02:41