sg16

Re: utfN_view

From: Zach Laine <whatwasthataddress_at_[hidden]>
Date: Wed, 3 May 2023 20:58:12 -0500
On Mon, Apr 17, 2023 at 1:57 PM Tom Honermann <tom_at_[hidden]> wrote:
>
> On 4/16/23 2:54 PM, Zach Laine via SG16 wrote:
>
> We again talked about utfN_view at the last meeting. I was trying to
> justify their existence, and again I could not remember the salient
> point during the discussion. Now I have. Here is one of them:
>
> template<utf8_iter I, sentinel_for<I> S = I>
> struct utf8_view : view_interface<utf8_view<I, S>> {
>   using iterator = I;
>   using sentinel = S;
>
>   constexpr utf8_view() {}
>   constexpr utf8_view(iterator first, sentinel last);
>
>   constexpr iterator begin() const;
>   constexpr sentinel end() const;
>
>   friend constexpr bool operator==(utf8_view lhs, utf8_view rhs)
>     { return lhs.begin() == rhs.begin() && lhs.end() == rhs.end(); }
>
>   template<class CharT, class Traits>
>     friend basic_ostream<CharT, Traits>&
>       operator<<(basic_ostream<CharT, Traits>& os, utf8_view v);
>
> private:
>   using iterator_t = unspecified; // exposition only
>   using sentinel_t = unspecified; // exposition only
>
>   iterator_t first_; // exposition only
>   [[no_unique_address]] sentinel_t last_; // exposition only
> };
>
> Note the operator<<. I don't know how to provide a general-purpose
> way to stream out a subrange<I, S>, when we know that it happens to
> contain UTF-8, so I created utf8_view, and added an operator<<. I
> have a similar concern about adding support for
> std::format-/std::print-ing ranges of UTF.
>
> I don't think the operator<< above works as a general-purpose method regardless. What does it do when CharT is wchar_t?

It transcodes to UTF-16, of course. In my implementation, I only
support printing to ostream<{char,wchar_t}>. I was asked (I can't
remember by whom anymore) to make it a template <typename CharT>
generalization. It should perhaps be constrained to utf_code_unit
CharT.
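
For readers following along, here is a rough sketch of the behavior described above: pass the code units through for a char stream, transcode for a wchar_t stream. It is not the Boost.Text or paper implementation. The names utf8_text and decode_one are made up for illustration, the input is assumed to be well-formed UTF-8, and the branch for a 32-bit wchar_t (one code point per wchar_t) is my own assumption for platforms where wchar_t is not 16 bits.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string_view>

// Hypothetical stand-in for utf8_view: a non-owning run of UTF-8 code units.
struct utf8_text {
    std::string_view bytes;
};

// Decode one code point starting at s[i] and advance i past it.
// Assumption: the input is well-formed UTF-8; there is no error handling.
inline char32_t decode_one(std::string_view s, std::size_t& i) {
    assert(i < s.size());
    const unsigned char lead = static_cast<unsigned char>(s[i++]);
    if (lead < 0x80) return lead;                 // single-byte (ASCII) case
    int extra = lead >= 0xF0 ? 3 : lead >= 0xE0 ? 2 : 1;
    char32_t cp = lead & (0x3F >> extra);         // payload bits of the lead byte
    while (extra-- > 0 && i < s.size())
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

// char stream: the UTF-8 code units pass through unchanged.
inline std::ostream& operator<<(std::ostream& os, utf8_text v) {
    return os.write(v.bytes.data(), static_cast<std::streamsize>(v.bytes.size()));
}

// wchar_t stream: UTF-16 where wchar_t is 16 bits wide,
// one code point per wchar_t where it is 32 bits wide (assumption).
inline std::wostream& operator<<(std::wostream& os, utf8_text v) {
    for (std::size_t i = 0; i < v.bytes.size();) {
        char32_t cp = decode_one(v.bytes, i);
        if (sizeof(wchar_t) >= 4 || cp < 0x10000) {
            os.put(static_cast<wchar_t>(cp));
        } else {
            cp -= 0x10000;                        // split into a surrogate pair
            os.put(static_cast<wchar_t>(0xD800 + (cp >> 10)));
            os.put(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return os;
}
```

Nothing here consults the stream's locale or the terminal's encoding; it only fixes which code units are handed to the stream.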

> Streaming or printing a utfN_view "just works", and this convenience
> is used throughout Boost.Text and the examples in the papers I'm
> proposing.
>
> I suspect this is not actually true. The paper doesn't explain what operator<< actually does at present. Does it "just work" on Windows to stream to stdout if the user hasn't changed the console encoding to UTF-8 and is not using Microsoft's new Terminal? What would it do if stdout is directed to a terminal in an EBCDIC environment? What if it were directed to a text file in that same environment?

If you configure your system to print Unicode and then try to read it
as non-Unicode, you get mojibake. That doesn't mean that it doesn't
"just work" when you don't do something batty. There has to be
support all up and down your data pipeline for Unicode, for you not to
get mojibake. I can't fix that, but I can specify how to produce
UTF-formatted output.

> There are some hard questions here that I think need to be (separately) answered before we can start supplying such operators.

I disagree. We don't need to lock the entire system down to Unicode
to take some text and spit it out to an ostream.

> I think the value of this convenience is evident in the
> examples. If someone has a reasonable alternative, I'm happy to
> replace utfN_view with something that works more like a typical
> std::ranges view. Without such an alternative, I want to keep the
> current design.
>
> For the case where UTF text is held in char or wchar_t based storage, the solution I prefer is to give the programmer a tool for presenting that data through an interface that exposes it as char8_t, char16_t, or char32_t. Then, we can just rely on the type system to infer the right encoding to use. Something like the following where the unspecified iterator converts the value type of the supplied iterator to char8_t.
>
> template<std::input_iterator I, std::sentinel_for<I> S>
>   requires std::convertible_to<std::value_type_t<I>, char8_t>
> struct as_utf8_view : std::ranges::view_base {
>   using iterator = /* unspecified */;
>   using sentinel = /* unspecified */;
>
>   constexpr as_utf8_view();
>   constexpr as_utf8_view(I, S);
>
>   constexpr iterator begin() const;
>   constexpr sentinel end() const;
> };
> template<std::ranges::range R>
>   requires std::convertible_to<std::ranges::range_value_t<R>, char8_t>
> auto as_utf8(R r) {
>   return as_utf8_view(std::ranges::begin(r), std::ranges::end(r));
> }
>
> That suffices to adapt a range of values of a type that is convertible to char8_t to a view of char8_t values such that they can be used with any interface that works with a range of char8_t.
>
> (Feel free to substitute CTAD as desired)

I think this is pretty close to what's in the P1 version of the paper,
modulo spelling.
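
To make the quoted idea concrete, here is a minimal sketch of such an adapter. It is neither Tom's code nor the paper's interface. as_utf8_iterator, utf8_unit, and as_utf8_demo are hypothetical names, the sentinel is assumed to have the same type as the iterator, and the constraints and view_base from the quoted version are left out so that the sketch also builds before C++20 (where utf8_unit falls back to unsigned char).

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

#if defined(__cpp_char8_t)
using utf8_unit = char8_t;        // C++20 and later
#else
using utf8_unit = unsigned char;  // assumption: stand-in for older compilers
#endif

// Hypothetical iterator adaptor: wraps I and converts each element to utf8_unit.
template <class I>
class as_utf8_iterator {
public:
    using iterator_category = std::input_iterator_tag;
    using value_type = utf8_unit;
    using difference_type = std::ptrdiff_t;
    using pointer = const utf8_unit*;
    using reference = utf8_unit;

    as_utf8_iterator() = default;
    explicit as_utf8_iterator(I it) : it_(it) {}

    utf8_unit operator*() const { return static_cast<utf8_unit>(*it_); }
    as_utf8_iterator& operator++() { ++it_; return *this; }
    as_utf8_iterator operator++(int) { auto tmp = *this; ++it_; return tmp; }

    friend bool operator==(const as_utf8_iterator& a, const as_utf8_iterator& b)
    { return a.it_ == b.it_; }
    friend bool operator!=(const as_utf8_iterator& a, const as_utf8_iterator& b)
    { return !(a == b); }

private:
    I it_{};
};

// Simplification: iterator and sentinel share one type here.
template <class I>
struct as_utf8_view {
    as_utf8_iterator<I> first, last;
    as_utf8_iterator<I> begin() const { return first; }
    as_utf8_iterator<I> end() const { return last; }
};

// The view does not own r; r must outlive the returned view.
template <class R>
auto as_utf8(const R& r) {
    using I = decltype(std::begin(r));
    return as_utf8_view<I>{as_utf8_iterator<I>(std::begin(r)),
                           as_utf8_iterator<I>(std::end(r))};
}

// Usage sketch: char storage holding UTF-8, read back as utf8_unit values.
inline void as_utf8_demo() {
    const std::string s = "h\xC3\xA9";  // "h" followed by U+00E9 in UTF-8
    std::vector<utf8_unit> units;
    for (utf8_unit u : as_utf8(s)) units.push_back(u);
    assert(units.size() == 3);
    assert(units[1] == static_cast<utf8_unit>(0xC3));
}
```

Once the element type is utf8_unit, an overload set can pick the encoding from the type alone, which is the property the quoted proposal relies on.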

Zach

Received on 2023-05-04 01:58:25