ISOCPP sg16 List: Re: utfN

From: Zach Laine <whatwasthataddress_at_[hidden]>
Date: Wed, 3 May 2023 21:06:36 -0500

On Wed, May 3, 2023 at 8:58 PM Zach Laine <whatwasthataddress_at_[hidden]> wrote:
>
> On Mon, Apr 17, 2023 at 1:57 PM Tom Honermann <tom_at_[hidden]> wrote:
> >
> > On 4/16/23 2:54 PM, Zach Laine via SG16 wrote:
> >
> > We again talked about utfN_view at the last meeting. I was trying to
> > justify their existence, and again I could not remember the salient
> > point during the discussion. Now I have. Here is one of them:
> >
> > template<utf8_iter I, sentinel_for<I> S = I>
> > struct utf8_view : view_interface<utf8_view<I, S>> {
> >   using iterator = I;
> >   using sentinel = S;
> >
> >   constexpr utf8_view() {}
> >   constexpr utf8_view(iterator first, sentinel last);
> >
> >   constexpr iterator begin() const;
> >   constexpr sentinel end() const;
> >
> >   friend constexpr bool operator==(utf8_view lhs, utf8_view rhs)
> >     { return lhs.begin() == rhs.begin() && lhs.end() == rhs.end(); }
> >
> >   template<class CharT, class Traits>
> >   friend basic_ostream<CharT, Traits>&
> >     operator<<(basic_ostream<CharT, Traits>& os, utf8_view v);
> >
> > private:
> >   using iterator_t = unspecified;         // exposition only
> >   using sentinel_t = unspecified;         // exposition only
> >
> >   iterator_t first_;                      // exposition only
> >   [[no_unique_address]] sentinel_t last_; // exposition only
> > };
> >
> > Note the operator<<. I don't know how to provide a general-purpose
> > way to stream out a subrange<I, S>, when we know that it happens to
> > contain UTF-8, so I created utf8_view, and added an operator<<. I
> > have a similar concern about adding support for
> > std::format-/std::print-ing ranges of UTF.
> >
> > I don't think the operator<< above works as a general-purpose method regardless. What does it do when CharT is wchar_t?
>
> It transcodes to UTF-16, of course. In my implementation, I only
> support printing to ostream<{char,wchar_t}>. I was asked (I can't
> remember by whom anymore) to make it a template <typename CharT>
> generalization. It should perhaps be constrained to utf_code_unit
> CharT.
>
> > Streaming or printing a utfN_view "just works", and this convenience
> > is used throughout Boost.Text and the examples in the papers I'm
> > proposing.
> >
> > I suspect this is not actually true. The paper doesn't explain what operator<< actually does at present. Does it "just work" on Windows to stream to stdout if the user hasn't changed the console encoding to UTF-8 and is not using Microsoft's new Terminal? What would it do if stdout is directed to a terminal in an EBCDIC environment? What if it were directed to a text file in that same environment?
>
> If you configure your system to print Unicode and then try to read it
> as non-Unicode, you get mojibake. That doesn't mean that it doesn't
> "just work" when you don't do something batty. There has to be
> support all up and down your data pipeline for Unicode, for you not to
> get mojibake. I can't fix that, but I can specify how to produce
> UTF-formatted output.
>
> > There are some hard questions here that I think need to be (separately) answered before we can start supplying such operators.
>
> I disagree. We don't need to lock the entire system down to Unicode
> to take some text and spit it out to an ostream.
>
> > I think the value of this convenience is evident in the
> > examples. If someone has a reasonable alternative, I'm happy to
> > replace utfN_view with something that works more like a typical
> > std::ranges view. Without such an alternative, I want to keep the
> > current design.
> >
> > For the case where UTF text is held in char or wchar_t based storage, the solution I prefer is to give the programmer a tool for presenting that data through an interface that exposes it as char8_t, char16_t, or char32_t. Then, we can just rely on the type system to infer the right encoding to use. Something like the following where the unspecified iterator converts the value type of the supplied iterator to char8_t.
> >
> > template<std::input_iterator I, std::sentinel_for<I> S>
> >   requires std::convertible_to<std::iter_value_t<I>, char8_t>
> > struct as_utf8_view : std::ranges::view_base {
> >   using iterator = /* unspecified */;
> >   using sentinel = /* unspecified */;
> >
> >   constexpr as_utf8_view();
> >   constexpr as_utf8_view(I, S);
> >
> >   constexpr iterator begin() const;
> >   constexpr sentinel end() const;
> > };
> > template<std::ranges::range R>
> >   requires std::convertible_to<std::ranges::range_value_t<R>, char8_t>
> > auto as_utf8(R r) {
> >   return as_utf8_view(std::ranges::begin(r), std::ranges::end(r));
> > }
> >
> > That suffices to adapt a range of values of a type that is convertible to char8_t to a view of char8_t values such that they can be used with any interface that works with a range of char8_t.
> >
> > (Feel free to substitute CTAD as desired)
>
> I think this is pretty close to what's in the P1 version of the paper,
> modulo spelling.
>
> Zach

One thing I forgot to address above: We could relax the bit-width
requirements on the concepts from exactly N bits to >=N bits && <N*2
bits for UTF-{8,16}, and >=32 bits for UTF-32. Any more general than
that, and we cannot deduce the UTF format from a given integral type.

Ok, new issue. This is the result of one of the polls from last time
we talked about all this:

    UTF transcoding interfaces provided by the C++ standard library should
    operate on `charN_t` types, with support for other types provided by
    adapters, possibly with a special case for `char` and `wchar_t` when
    their associated literal encodings are UTF.

    SF F N A SA
     5 1 0 0 1

    Attendance: 9

    Consensus: Consensus in favour

    Author position: SA - Precondition that input is intended to be UTF-8.
    You can't get around that by adding a wrapper. This doesn't help
    people find bugs.

So, what do people suggest here as the adaptor mechanism? Something like:

std::vector<int> code_points = /* something from ICU */;

auto r = code_points | std::uc::trust_me_its_utf32 | std::uc::as_utf16;

? Something else?

Zach

Received on 2023-05-04 02:06:49