ISOCPP sg16 List: Re: utfN

From: Jens Maurer <jens.maurer_at_[hidden]>
Date: Thu, 4 May 2023 08:57:53 +0200

On 04/05/2023 03.47, Zach Laine wrote:
> Sorry for the late reply SG-16, I was waiting until I had done more
> experimentation with the code before replying to this thread. Also,
> I've updated the paper: https://isocpp.org/files/papers/D2728R1.html

Thanks for the update.

section 4.1

template<class UTF16Range>
void process_input(UTF16Range && r);

We should have a concept for a UTF16Range and use it here.
(Maybe just "value_type is char16_t".)

section 4.2

The use / presence of std::uc::utf_8_to_16_iterator seems to be undermotivated;
the views-based equivalent is so much shorter and nicer to look at.

Please add more of the use-cases from R0 (e.g. transcode a buffer) and show
code that expresses them with the range adaptor. (Addressing several
use-cases with a single facility is good design.)

section 5.3

null_sentinel_t requires a utf_code_unit<T> for its operator==. I don't think
we want that. I think this works for all null-terminated sequences, so we shouldn't
constrain "T" except maybe for the ability to compare *p with literal 0.
(You could have a null-terminated sequence of pointer values, for example.)
Hm... Maybe a comparison against a value-initialized T is even better?
That covers 0 and pointers and other stuff.

section 5.6.4

"A simple way to represent a transcoding view is as a pair of transcoding iterators. However, there is a problem with that approach, since a utf32_view<utf_8_to_32_iterator<char const *>> would be a range the size of 6 pointers. Worse yet, a utf32_view<utf_8_to_16_iterator<utf_16_to_32_iterator<char const *>>> would be the size of 18 pointers! Further, such a view would do a UTF-8 to UTF-16 to UTF-32 conversion, when it could have done a direct UTF-8 to UTF-32 conversion instead."

That's why you should focus on ranges, not iterators and sentinels.
And any optimizations such as just returning a subrange (because
the input is already a char8_t-range) should be done in the range
adaptor object, not at the level of individual iterators.
See [range.take.overview] for an example.

If you've implemented this (plus the eager algorithm-based stuff), please add a few
performance figures comparing views-based performance with eager performance to the
text. We should document what we're buying into here.

section 5.4

find_invalid_encoding takes an (iterator, sentinel) pair. It should instead
simply take a view. Same for is_encoded. (Why can't the latter deal with
sentinels?) Same for starts_encoded and ends_encoded.

section 5.6

utf8_view still has a utf_iter template parameter and an (iterator, sentinel)
pair for the constructor parameters. This is not how range adaptors work.
Please have a look at [range.transform.view]. You need to deal in views or
ranges, not in iterators. Also, how about having a utf_view<charT>,
where charT is one of char8_t, char16_t, char32_t and indicates both the
"output type" and the desired UTF-x encoding? Then, utf8_view and friends
become simple typedefs (and we can decide separately whether we want such
typedefs or not).

Thanks,
Jens

Received on 2023-05-04 06:58:08