ISOCPP sg16 List: Re: utfN

From: Zach Laine <whatwasthataddress_at_[hidden]>
Date: Wed, 3 May 2023 20:47:51 -0500

Sorry for the late reply, SG16. I was waiting until I had done more
experimentation with the code before replying to this thread. Also,
I've updated the paper: https://isocpp.org/files/papers/D2728R1.html

One thing I have not yet done in the paper is change the concepts for
utfN code units. More on that later in this thread.

On Sun, Apr 16, 2023 at 2:54 PM Jens Maurer <jens.maurer_at_[hidden]> wrote:
>
>
> On 16/04/2023 20.54, Zach Laine via SG16 wrote:
> > We again talked about utfN_view at the last meeting. I was trying to
> > justify their existence, and again I could not remember the salient
> > point during the discussion. Now I have.
>
> Thanks.
>
> > Here is one of them:
>
> What are the others?

No, I mean this is one of the views, not one of the reasons.

> > template<utf8_iter I, sentinel_for<I> S = I>
> > struct utf8_view : view_interface<utf8_view<I, S>> {
> >   using iterator = I;
> >   using sentinel = S;
> >
> >   constexpr utf8_view() {}
> >   constexpr utf8_view(iterator first, sentinel last);
> >
> >   constexpr iterator begin() const;
> >   constexpr sentinel end() const;
> >
> >   friend constexpr bool operator==(utf8_view lhs, utf8_view rhs)
> >     { return lhs.begin() == rhs.begin() && lhs.end() == rhs.end(); }
> >
> >   template<class CharT, class Traits>
> >     friend basic_ostream<CharT, Traits>&
> >       operator<<(basic_ostream<CharT, Traits>& os, utf8_view v);
> >
> > private:
> >   using iterator_t = unspecified;         // exposition only
> >   using sentinel_t = unspecified;         // exposition only
> >
> >   iterator_t first_;                      // exposition only
> >   [[no_unique_address]] sentinel_t last_; // exposition only
> > };
>
> So, I'm seeing several differences between subrange and your utf8_view.
> subrange doesn't have operator== and doesn't have operator<<.
> subrange can turn a non-sized range into a sized range.
> subrange can use arbitrary iterators, utf8_view can only take a utf8_iter,
> which requires a bidirectional iterator whose value type takes one byte.
>
> I notice that the CharT template parameter of utf8_view's operator<<
> has no relationship to the value type of the utf8_view, which seems
> surprising. I wouldn't expect to be able to stream a utf8_view to
> a basic_ostream<int>.
>
> What's the purpose of "operator=="? std::subrange doesn't seem to have
> it, and the absence appears to be a good idea given that it's unclear
> from the outside whether == compares the iterators or the values in the
> range.
>
> Why is utf8_view limited to bidirectional iterators? Streaming from
> a forward or input iterator should be entirely fine.
>
> Can this be a view derived from std::subrange? After all, a utf8_view
> is-a subrange, it seems.

I think all the above is moot now; I've changed the utfN_views to be
more like the ones in ranges, except for operator<<.

> I notice that utf8_view::operator<< is the only view operation that
> looks at the contents of each element of the range; maybe that is
> better represented by a "range consumer" similar to "std::ranges::to".

Most of the interfaces I'm proposing return a view, maybe a utfN_view,
and being able to print all of those things is important. Printing
text is always important.

> While looking at the concepts in
>
> https://open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2728r0.html
>
> I noticed the "enum class format". I sensed consensus in SG16 around
> the idea that char8_t, char16_t, char32_t are the canonical UTF-x
> code unit types, so I'd suggest to parameterize everything with these
> types instead of spelling out the "utfN_something" names everywhere
> and/or employing a helper enum. If, for particular entities, you
> feel a typedef or other alias is appropriate, feel free to introduce
> that at those points.
>
> Looking some more at the concepts, I'm wondering why the value types
> of the utfN_iters need to be exactly of the right size. After all,
> a code unit sequence is just a sequence of integers, and (on input)
> considering a sequence of 7-bit ASCII character values a valid UTF32
> sequence is sound. (When producing UTF-32 output, having just 7 bits
> doesn't work, of course.)
>
> > Note the operator<<. I don't know how to provide a general-purpose
> > way to stream out a subrange<I, S>, when we know that it happens to
> > contain UTF-8, so I created utf8_view, and added an operator<<. I
> > have a similar concern about adding support for
> > std::format-/std::print-ing ranges of UTF.
>
> Does "utf8_view" embody the semantic constraint that it refers to
> a valid UTF8 code unit sequence?

Yes. You can feed whatever garbage in one end, and you always get
valid UTF-N out of all the views and algorithms. If you transcode
from /dev/random, the output will probably have a large number of
replacement characters.
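
To make the "garbage in, valid UTF-N out" behavior concrete, here is a
minimal sketch of a lossy UTF-8 to UTF-32 decoder. It is not the paper's
implementation: the name decode_utf8_lossy is hypothetical, and the
choice to emit one U+FFFD per ill-formed subsequence (stopping at the
first offending byte without consuming it) is an assumption about the
replacement policy.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not from P2728): decodes UTF-8 to UTF-32 and
// substitutes U+FFFD for ill-formed input instead of invoking UB.
inline std::u32string decode_utf8_lossy(std::string_view in) {
    constexpr char32_t replacement = 0xFFFD;
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const auto b0 = static_cast<unsigned char>(in[i]);
        if (b0 < 0x80) {  // ASCII: one code unit, one code point
            out.push_back(b0);
            ++i;
            continue;
        }

        // The lead byte fixes the sequence length and, for some leads,
        // narrows the allowed range of the second byte.
        std::size_t len = 0;
        char32_t cp = 0;
        unsigned char lo = 0x80, hi = 0xBF;
        if (b0 >= 0xC2 && b0 <= 0xDF) {
            len = 2; cp = b0 & 0x1F;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            len = 3; cp = b0 & 0x0F;
            if (b0 == 0xE0) lo = 0xA0;  // rejects overlong forms
            if (b0 == 0xED) hi = 0x9F;  // rejects surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            len = 4; cp = b0 & 0x07;
            if (b0 == 0xF0) lo = 0x90;  // rejects overlong forms
            if (b0 == 0xF4) hi = 0x8F;  // rejects values above U+10FFFF
        } else {
            // Stray continuation byte or a byte that never occurs in UTF-8.
            out.push_back(replacement);
            ++i;
            continue;
        }

        std::size_t j = i + 1;
        bool ok = true;
        for (std::size_t k = 1; k < len; ++k, ++j) {
            if (j >= in.size()) { ok = false; break; }  // truncated input
            const auto b = static_cast<unsigned char>(in[j]);
            if (b < lo || b > hi) { ok = false; break; }
            cp = (cp << 6) | (b & 0x3F);
            lo = 0x80; hi = 0xBF;  // later bytes use the full range
        }
        out.push_back(ok ? cp : replacement);
        i = j;  // on failure j is the offending byte, which is re-examined
    }
    return out;
}
```

Every code path either appends a decoded scalar value or U+FFFD and
advances by at least one byte, so arbitrary input terminates and yields
only valid UTF-32.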

> In other words, is it undefined
> behavior if you construct a utf8_view over a sequence of elements
> that isn't actually a valid sequence of UTF8 code units?

No. None of the views introduces any UB.

> Apparently, "utf8_iter" does not embody that constraint, because
> we have functions such as find_invalid_encoding() that can find
> non-UTF-8.

Right. Input may be garbage.

> Oh, why is there no find_invalid_encoding that takes a range
> instead of an iterator pair?

Brevity. I don't expect these to be used very often. They're pretty
low-level. If there's a strong desire by the rest of SG-16, I'm not
opposed to adding them.

> General design question: Do we want to differentiate in the type
> system these two situations?
>
> - a valid sequence of UTF-x code units
> - a sequence of integers that may or may not be a valid sequence of UTF-x code units

I don't. Some people might want to.

> Transcoding facilities would always produce "valid" sequences,
> which might save the next step a potentially costly validation.
> But maybe validation is so cheap that we'd prefer to always avoid
> the undefined behavior inherent in the "type represents valid
> sequence" option.

By getting rid of the algorithms, we've closed the door on efficiency
discussions, IMO.

> > Streaming or printing a utfN_view "just works", and this convenience
> > is used throughout Boost.Text and the examples in the papers I'm
> > proposing. I think the value of this convenience is evident in the
> > examples.
>
> I've searched for "<<" in https://open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2728r0.html
> and the only matches are those for the definitions of operator<<.
> Similar for "format(" and "print(". As it stands, the facility
> seems undermotivated to me. I'd like to point out that a WG21
> proposal should have complete motivation for any proposed feature,
> if only for historical record.

Again, printing text is a first-order feature.

> I do notice that you could create formatter specializations for
> "subrange" with the constraint that the subrange's values are
> sufficiently UTF-x-like. However, I'd like to point out that there
> is currently no facility to print char8_t data via std::format
> (I think), and I'm hesitant to introduce that as a drive-by with
> a transcoding facility.

I prefer the system I'm proposing, because it's more concrete -- the
things that get printed/formatted/streamed are utfN_views, which are an
explicit part of the Unicode support. A more general facility might
catch things that look UTF-like that the user never intended.

> I'm not finding range adaptors in the style of [range.adaptors]
> that can be chained with "|" and would transcode a UTF-8 range
> into a (say) UTF-32 range. Could you point me to those in your
> paper?

They're in there now.

Zach

Received on 2023-05-04 01:48:04