Re: utfN_view

From: Tom Honermann <tom_at_[hidden]>
Date: Mon, 17 Apr 2023 14:56:59 -0400

On 4/16/23 2:54 PM, Zach Laine via SG16 wrote:
> We again talked about utfN_view at the last meeting. I was trying to
> justify their existence, and again I could not remember the salient
> point during the discussion. Now I have. Here is one of them:
>
> template<utf8_iter I, sentinel_for<I> S = I>
> struct utf8_view : view_interface<utf8_view<I, S>> {
>   using iterator = I;
>   using sentinel = S;
>
>   constexpr utf8_view() {}
>   constexpr utf8_view(iterator first, sentinel last);
>
>   constexpr iterator begin() const;
>   constexpr sentinel end() const;
>
>   friend constexpr bool operator==(utf8_view lhs, utf8_view rhs)
>     { return lhs.begin() == rhs.begin() && lhs.end() == rhs.end(); }
>
>   template<class CharT, class Traits>
>   friend basic_ostream<CharT, Traits>&
>     operator<<(basic_ostream<CharT, Traits>& os, utf8_view v);
>
> private:
>   using iterator_t = unspecified; // exposition only
>   using sentinel_t = unspecified; // exposition only
>
>   iterator_t first_; // exposition only
>   [[no_unique_address]] sentinel_t last_; // exposition only
> };
>
> Note the operator<<. I don't know how to provide a general-purpose
> way to stream out a subrange<I, S>, when we know that it happens to
> contain UTF-8, so I created utf8_view, and added an operator<<. I
> have a similar concern about adding support for
> std::format-/std::print-ing ranges of UTF.

I don't think the operator<< above works as a general-purpose method
regardless. What does it do when CharT is wchar_t?

> Streaming or printing a utfN_view "just works", and this convenience
> is used throughout Boost.Text and the examples in the papers I'm
> proposing.

I suspect this is not actually true. The paper doesn't explain what
operator<< actually does at present. Does it "just work" on Windows to
stream to stdout if the user hasn't changed the console encoding to
UTF-8 and is not using Microsoft's new Terminal? What would it do if
stdout is directed to a terminal in an EBCDIC environment? What if it
were directed to a text file in that same environment?

There are some hard questions here that I think need to be (separately)
answered before we can start supplying such operators.

> I think the value of this convenience is evident in the
> examples. If someone has a reasonable alternative, I'm happy to
> replace utfN_view with something that works more like a typical
> std::ranges view. Without such an alternative, I want to keep the
> current design.

For the case where UTF text is held in char or wchar_t based storage,
the solution I prefer is to give the programmer a tool for presenting
that data through an interface that exposes it as char8_t, char16_t, or
char32_t. Then, we can just rely on the type system to infer the right
encoding to use. Something like the following, where the unspecified
iterator converts the value type of the supplied iterator to char8_t:

    #include <concepts>
    #include <iterator>
    #include <ranges>

    // View presenting the code units of [I, S) as char8_t values.
    template<std::input_iterator I, std::sentinel_for<I> S>
    requires std::convertible_to<std::iter_value_t<I>, char8_t>
    struct as_utf8_view : std::ranges::view_base {
        using iterator = /* unspecified */;  // yields char8_t
        using sentinel = /* unspecified */;

        constexpr as_utf8_view();
        constexpr as_utf8_view(I, S);

        constexpr iterator begin() const;
        constexpr sentinel end() const;
    };

    // Convenience function; the caller keeps the underlying range alive.
    template<std::ranges::range R>
    requires std::convertible_to<std::ranges::range_value_t<R>, char8_t>
    auto as_utf8(R&& r) {
        return as_utf8_view(std::ranges::begin(r), std::ranges::end(r));
    }

That suffices to adapt a range of values of a type that is convertible
to char8_t to a view of char8_t values such that they can be used with
any interface that works with a range of char8_t.

(Feel free to substitute CTAD as desired)

Tom.

Received on 2023-04-17 18:57:01