Re: Thoughts on P2728R6: Unicode in the Library, Part 1: UTF Transcoding

From: Jens Maurer <jens.maurer_at_[hidden]>
Date: Thu, 14 Sep 2023 08:20:27 +0200
On 13/09/2023 21.30, Jens Maurer via SG16 wrote:
> On 13/09/2023 18.58, Tom Honermann via SG16 wrote:
>> The following reflects some of my personal thoughts regarding this paper
>
> Same here.
>
> In general, I find the paper seriously lacking rationale.

I understand I may have missed a few SG16 meetings and/or telecons.
I don't remember being present when the following changes were asked
for (leading to R4 of the paper):

    Removed the utility functions and Unicode-related constants, except replacement_character.
    Change null_sentinel_t back to being Unicode-specific.

I remember always advocating for null_sentinel_t being a std::-level
facility, so that corroborates my feeling I wasn't present for that
iteration.

Anyway, more comments:

5.2 says "Add concepts that describe parameters to transcoding APIs"

I'm not seeing a use of utf_input_range_like or of utf8_input_range_like
in the rest of the paper. Why do they exist?

Changing a concept after its initial publication must be considered a
backward-compatibility-breaking change, so we have been rather careful
about introducing named concepts, and have relied on exposition-only
concepts when we "just" needed to describe parameter constraints for
internal purposes. (A named concept might be employed by the user for
their own purposes, unrelated to the other facilities presented here.)

That means I want to see detailed rationale in the paper for the
introduction of each of the named concepts. In general, I think
we should refrain from introducing short-hand concepts such as
utf16_code_unit if an additional template parameter can handle
the situation just fine, e.g. code_unit<T, char16_t>
("is T a UTF-16 code unit?")


"The encoding of u8"text" is not necessarily UTF-8!"

That is an incorrect statement for a conforming implementation;
see [lex.string].

"It depends on the flags you pass to your compiler."

You can always pass flags to your compiler that make your
compiler non-conforming, but that doesn't mean we should
base standard library design decisions on that.

In short, I think the presence of such an argument weakens
the paper.


> In short, I think "text" | std::uc::as_utf32 should “just work”. Making users write "text" | std::uc::as_char8_t | std::uc::as_utf32, when that does not increase correctness or efficiency – and produces no different object code – seems wrongheaded to me.

I think the paper could benefit from a bit more discussion of the implied assumption
that an input range of char is UTF-8 encoded. In other parts of the standard (e.g. std::print),
we make that assumption only if the literal encoding is UTF-8. (Same for wchar_t, whose
encoding might be UTF-16 or UTF-32 or something non-UTF.)

Talking about "text" | blah in particular, where does the decay to pointer happen?
Isn't this an array-of-char that is passed as-is to "blah", causing it to attempt
to transcode the terminating 0 as well?


Section "Add unpack_iterator_and_sentinel CPO for iterator “unpacking”"

The section then shows this in an example:

    // Get the input as UTF-32. This may involve unpacking, so possibly decltype(r.begin()) != I.
    auto r = ranges::subrange(first, last) | uc::as_utf32;

Since this is a view layered on top of an iterator range, I don't think there is any
reasonable expectation at all that decltype(r.begin()) could ever be the type of "first" == I,
regardless of unpacking.

I think it's an SG9 question whether the "transcode_to_utf32" use-case (returning some
iterators telling me how far I got, using ranges algorithms) is a use-case that ranges
wants to support in the first place. If so, a "packing/unpacking" facility is missing
in general, it seems. (We have the same problem for something like views::drop, which
also special-cases certain views and thus runs into the same iterator type problems,
I believe.) And that means such a facility should be introduced and proposed on its
own merits.

Jens

Received on 2023-09-14 06:20:33