Re: Thoughts on P2728R6: Unicode in the Library, Part 1: UTF Transcoding

From: Jens Maurer <jens.maurer_at_[hidden]>
Date: Thu, 14 Sep 2023 22:07:11 +0200

On 14/09/2023 17.59, Zach Laine via SG16 wrote:
> On Wed, Sep 13, 2023 at 2:30 PM Jens Maurer via SG16
> <sg16_at_[hidden]> wrote:
>>
>> On 13/09/2023 18.58, Tom Honermann via SG16 wrote:
>>> The following reflects some of my personal thoughts regarding this paper
>>
>> Same here.
>>
>> In general, I find the paper seriously lacking rationale.
>
> Please indicate which design decisions you think lack rationale.

This is an incomplete list:

 - For each defined concept, what is the rationale for exposing it
(as opposed to having it exposition-only)?
(Reducing the number of defined concepts goes a long way toward reducing
the amount of rationale needed.)

"It is common to want to view the same text as code points and code units at different times.
It is therefore important that transcoding iterators ..."

Some examples would help to demonstrate "common".
Why is it important that iterators (as opposed to ranges) have that
property? Why can't I just remember a view to the original range?

 - Why do the examples in 4.1 and 4.3 pass a range by value? Best practice seems
to be to pass ranges by &&. (And that makes sense, because you don't want
to copy a std::vector<char16_t> that you might happen to pass into
process_input.)

 - Why does the example in 4.1 declare process_input_again() the way it does?
It seems bad practice to ask for a specific type of view, instead of having
a template parameter with suitable constraints on e.g. value type and
other properties. This part is not motivating anything for me.

 - The last line in example 4.2 has a typo: utf16.end() -> utf16_view.end().
(Personally, I would normalize to "utf16", since utf16_view is already
defined by the library, so confusion is possible.)

 - Example 4.2 does not appear to motivate the presence of the utf_iterator
interface. After all, the view-based formulation is so much shorter and
nicer. Please drop the utf_iterator formulation from that example.

 - Example 4.3 talks about "accepts sequences of UTF-16", but then
it takes a utf8_range. That doesn't add up.

 - Example 4.4: Does the existing std::format allow putting a char8_t character
range into a char-based format string? I thought we recently disabled that
because it might not do the right thing. Does Victor have an opinion?
Which combinations are supported, exactly? The intro text talks about "code points"
(UTF-32), but the example then streams UTF-8 onto a std::ostream.
What happens if I stream char16_t on wostream on Linux and char32_t on
wostream on Windows?

 - There is no rationale for the chosen names. It's pretty obvious
for "utf8", but the as_utf8 vs. to_utf8 question seems to hint that
alternatives should be presented, together with some discussion
and rationale for the eventual choice.

 - 'This proposal depends on the existence of P2727 “std::iterator_interface”.'
Why? We know that std::iterator_interface doesn't do the right thing for
some corner cases. If that's just an implementation shortcut, it probably
shouldn't be part of the interface specification. (The latter is hard to
achieve with the over-explicit ranges specifications.)

 - Why is the hierarchy of concepts as shown? Atomic constraints affect
subsumption, which affects overload resolution. Having the long list
of disjunctions seems unhelpful for that. If we want to constrain a
template parameter that it is a UTF-8 code unit, we should end up
with the atomic constraints "same_as<T, char8_t> || same_as<T, char>"
(the second part depends on the "option 2" switch).
The "false" parts inside code_unit don't go away during constraint
normalization for e.g. utf8_code_unit; see [temp.constr.normal].
This means utf_code_unit has 9 atomic constraints even without option 2.
For the avoidance of doubt, 'same_as<T, char8_t> && F == format::utf8'
is an (entire) atomic constraint (because those need to be primary
expressions, and 'F == format::utf8' is not a primary expression).
That means "same_as" is (just) a boolean expression and doesn't get
the special recursive constraint normalization treatment.

 - Section 5.4.1 "That is, the adapting iterator that as_char32_t uses
is gone. This makes using as_char32_t more efficient, when used in
conjunction with as_utfN."
Did you check the generated assembly code? I'd expect the optimizer
to remove this entirely.

 - Section 5.4 utf_iterator: Whether this befriends any other utf_iterator
is an implementation detail that shouldn't be shown.

 - The fact that utf_iterator's constructor has different arity
depending on the properties of "I" gives me pause, and seems
actively unhelpful in generic contexts. If I want to iterate
with ++ only, the caller still needs to find out whether his
iterator happens to be a forward iterator or a bidirectional
iterator in order to call the right constructor. Can we allow
the three-argument constructor in more cases?

 - Given that utf_iterator takes less memory if it is forward-only,
can I somehow intentionally "downgrade" it to avoid overhead?


>> I notice we have
>>
>> template<format Format, class R>
>> utf_view(R &&) -> utf_view<Format, views::all_t<R>>;
>>
>> How is "Format" going to be deduced here?
>>
>>
>> The deduction guides for utf8_view and friends are not shown in the synopses.
>
> I find this confusing too, but apparently this is just how deductions
> guides interact with template aliases. Tomasz showed me how to get it
> to work. If you look down a bit from that declaration, you'll see the
> aliases for the utfN_views. The combination of the guide for utf_view
> and the alias for utfN_view makes this work.

You need a deduction guide for the alias template utf8_view, not for the
class template utf_view, I think.

>> Why is there a project_view? The existing transform_view seems to work quite
>> nicely in its place.
>
> It does not, because transform_view cannot be a borrowed_range.

Ah, ok.

> project_view was an attempt to remedy that, suggested in an SG-9
> meeting. Since then, SG-9 has decided they'd rather see a solution
> for making transform_view conditionally borrowed. I'm going to write
> a separate paper with Barry for that. More explicitly, transform_view
> will be gone from the next revision.

transform_view -> project_view in that last sentence, I think.
Otherwise: Wonderful.

>> How can I configure error handling for the utf_view? That seems to be missing.
>
> It is indeed missing. As Corentin pointed out, there's not really a
> mechanism in the ranges work for early termination on error.

As discussed in the telecon, throwing an exception is a perfectly fine
error handling strategy that is supported by ranges. But if error
handling goes away entirely (also for utf_iterator), then this is moot.

>> constexpr auto begin() {
>>   constexpr format from_format = format-of<ranges::range_value_t<V>>();
>>   if constexpr(is-charn-view<V>) {
>>     return make_begin<from_format>(base_.impl_.begin().base(), base_.impl_.end().base());
>>   } else {
>>     return make_begin<from_format>(ranges::begin(base_), ranges::end(base_));
>>   }
>> }
>>
>>
>> This apparently wants to do some unpacking in some special cases, beyond
>> unpacking utfN -> utfX -> utfY chains. It would be better to state that this unpacking
>> is done by the range adaptor object; see (for example) [range.drop.overview].
>
> That would be problematic, because it is inconvenient -- when you
> create a utf_view<format::utf8, char8_view<V>> foo (whether directly
> or via adaptors), you want foo.base() to return a char8_view<V>, not a
> V.

Agreed so far.

> Giving you a V would seem to silently convert foo.base() from
> char8_t to whatever it was adapted from (say char).

Ok.

I'm reading here that you're taking extra effort to instantiate
utf_iterator<iterator of V> instead of utf_iterator<iterator of charN_view<V>>.
I'm not seeing any benefit in doing so; charN_view is a very thin layer
that the compiler should just optimize away.

>> And, of course, instead of talking about as_utfN, we should talk about as_utf<T>
>> where T is one of char8_t, char16_t, or char32_t. as_utfN can still exist,
>> e.g. as references to the corresponding variable template specializations.
>
> That seems odd to me. Do you have a use case for someone wanting to
> write code that converts to "some UTF" generically? I'm all about
> genericity, but I'm having a hard time coming up with a realistic
> example. Usually you need to convert to exactly UTF-8, or UTF-32, or
> whatever. Coming *from* a generic input is likely to be a fairly
> common case, but converting to a generic output does not seem
> likely to be.

Suppose I want to write code that works well on platforms where char = 8 bit
and on platforms where char = 16 bit. In order to store text efficiently on
both platforms, it seems plausible to switch from utf8 to utf16 as the
internal representation, depending on CHAR_BIT. And that's best done
by template parameters, it seems.

Given the very small, and only syntactic, cost of enabling such use,
why not just do it?

(I'm still occasionally offended by the non-generic nature of int_leastN_t
and int_fastN_t.)

>> "5.4.2 Why utf_iterator is not a nested type within utf_view"
>>
>> I disagree with the rationale. You can get at the iterator if you have
>> a view, or you can construct a view type and ask for its iterator if you
>> feel the need.
>
> SG-9 voted on this and found weak consensus for having a separate iterator.

Where is that documented in the paper?

So, you're telling me I need to convince LEWG to throw a monkey wrench.

Jens

Received on 2023-09-14 20:07:16