ISOCPP sg16 List: Re: utfN

From: Jens Maurer <jens.maurer_at_[hidden]>
Date: Sun, 16 Apr 2023 21:54:47 +0200

On 16/04/2023 20.54, Zach Laine via SG16 wrote:
> We again talked about utfN_view at the last meeting. I was trying to
> justify their existence, and again I could not remember the salient
> point during the discussion. Now I have.

Thanks.

> Here is one of them:

What are the others?

> template<utf8_iter I, sentinel_for<I> S = I>
> struct utf8_view : view_interface<utf8_view<I, S>> {
> using iterator = I;
> using sentinel = S;
>
> constexpr utf8_view() {}
> constexpr utf8_view(iterator first, sentinel last);
>
> constexpr iterator begin() const;
> constexpr sentinel end() const;
>
> friend constexpr bool operator==(utf8_view lhs, utf8_view rhs)
> { return lhs.begin() == rhs.begin() && lhs.end() == rhs.end(); }
>
> template<class CharT, class Traits>
> friend basic_ostream<CharT, Traits>&
> operator<<(basic_ostream<CharT, Traits>& os, utf8_view v);
>
> private:
> using iterator_t = unspecified; // exposition only
> using sentinel_t = unspecified; // exposition only
>
> iterator_t first_; // exposition only
> [[no_unique_address]] sentinel_t last_; // exposition only
> };

So, I'm seeing several differences between subrange and your utf8_view.
subrange doesn't have operator== and doesn't have operator<<.
subrange can turn a non-sized range into a sized range.
subrange can use arbitrary iterators; utf8_view can only take a utf8_iter,
which requires a bidirectional iterator whose value type is one byte in size.

I notice that the CharT template parameter of utf8_view's operator<<
has no relationship to the value type of the utf8_view, which seems
surprising. I wouldn't expect to be able to stream a utf8_view to
a basic_ostream<int>.

What's the purpose of "operator=="? std::ranges::subrange doesn't have
it, and its absence seems like a good idea, given that it's unclear
from the outside whether == compares the iterators or the values in the
range.

Why is utf8_view limited to bidirectional iterators? Streaming from
a forward or input iterator should be entirely fine.

Can this be a view derived from std::ranges::subrange? After all, a
utf8_view is-a subrange, it seems.

I notice that utf8_view::operator<< is the only view operation that
looks at the contents of each element of the range; maybe that is
better represented by a "range consumer" similar to "std::ranges::to".

While looking at the concepts in

https://open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2728r0.html

I noticed the "enum class format". I sensed consensus in SG16 around
the idea that char8_t, char16_t, char32_t are the canonical UTF-x
code unit types, so I'd suggest parameterizing everything with these
types instead of spelling out the "utfN_something" names everywhere
and/or employing a helper enum. If, for particular entities, you
feel a typedef or other alias is appropriate, feel free to introduce
that at those points.

Looking some more at the concepts, I'm wondering why the value types
of the utfN_iters need to be exactly of the right size. After all,
a code unit sequence is just a sequence of integers, and (on input)
considering a sequence of 7-bit ASCII character values a valid UTF-32
sequence is sound. (When producing UTF-32 output, having just 7 bits
doesn't work, of course.)

> Note the operator<<. I don't know how to provide a general-purpose
> way to stream out a subrange<I, S>, when we know that it happens to
> contain UTF-8, so I created utf8_view, and added an operator<<. I
> have a similar concern about adding support for
> std::format-/std::print-ing ranges of UTF.

Does "utf8_view" embody the semantic constraint that it refers to
a valid UTF-8 code unit sequence? In other words, is it undefined
behavior if you construct a utf8_view over a sequence of elements
that isn't actually a valid sequence of UTF-8 code units?

Apparently, "utf8_iter" does not embody that constraint, because
we have functions such as find_invalid_encoding() that can find
non-UTF-8.

Oh, why is there no find_invalid_encoding that takes a range
instead of an iterator pair?

General design question: Do we want to differentiate in the type
system these two situations?

- a valid sequence of UTF-x code units
- a sequence of integers that may or may not be a valid sequence of UTF-x code units

Transcoding facilities would always produce "valid" sequences,
which might save the next step a potentially costly validation.
But maybe validation is so cheap that we'd prefer to always avoid
the undefined behavior inherent in the "type represents valid
sequence" option.
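
One way the first option could look in the type system, sketched under
assumptions: valid_utf32 and utf8_length are hypothetical names, and I'm
using UTF-32 only because its validity check is short. The factory
returns an empty optional for invalid input, so constructing the type
never involves undefined behavior, and consumers can skip re-validation.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

namespace sketch {

// A non-owning view that can only be obtained through validation.
class valid_utf32 {
public:
    // Validates once; returns std::nullopt for an invalid sequence.
    static std::optional<valid_utf32> check(std::u32string_view s) {
        for (char32_t c : s) {
            if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
                return std::nullopt;
        }
        return valid_utf32(s);
    }

    std::u32string_view code_units() const { return s_; }

private:
    explicit valid_utf32(std::u32string_view s) : s_(s) {}
    std::u32string_view s_;
};

// A consumer that relies on the invariant: it computes the number of
// UTF-8 code units needed without checking its input again.
inline std::size_t utf8_length(valid_utf32 v) {
    std::size_t n = 0;
    for (char32_t c : v.code_units())
        n += c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
    return n;
}

inline void demo() {
    const auto ok = valid_utf32::check(U"h\u00E9");
    assert(ok.has_value());
    assert(utf8_length(*ok) == 3);  // 1 code unit for 'h', 2 for U+00E9
}

}  // namespace sketch
```

The second option corresponds to passing a plain std::u32string_view
around and validating wherever it matters.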

> Streaming or printing a utfN_view "just works", and this convenience
> is used throughout Boost.Text and the examples in the papers I'm
> proposing. I think the value of this convenience is evident in the
> examples.

I've searched for "<<" in https://open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2728r0.html
and the only matches are the definitions of operator<<.
The same goes for "format(" and "print(". As it stands, the facility
seems undermotivated to me. I'd like to point out that a WG21
proposal should have complete motivation for every proposed feature,
if only for the historical record.

I do notice that you could create formatter specializations for
"subrange" with the constraint that the subrange's values are
sufficiently UTF-x-like. However, I'd like to point out that there
is currently no facility to print char8_t data via std::format
(I think), and I'm hesitant to introduce that as a drive-by with
a transcoding facility.

I'm not finding range adaptors in the style of [range.adaptors]
that can be chained with "|" and would transcode a UTF-8 range
into a (say) UTF-32 range. Could you point me to those in your
paper?

Jens

Received on 2023-04-16 19:54:52