Re: [isocpp-sg16] UTF transcoding views draft D2728R14 includes .base_code_units() design section

From: Eddie Nolan <eddiejnolan_at_[hidden]>
Date: Tue, 9 Jun 2026 01:25:04 +0200

Hi Tom,

Thanks for the feedback. I've updated my draft to try to provide more
rationale: https://isocpp.org/files/papers/D2728R14.html#base_code_units

Unfortunately, I think the *non-propagating-cache* idea is not relevant
here; it's basically just a `std::optional` with unusual copy/move semantics
that discard the cached object, so the cache doesn't propagate when the view
is copied or moved.
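
For readers who haven't looked at [range.nonprop.cache], here is a rough sketch of the copy/move behaviour described above. The class name and the demo function are made up for illustration, and the exposition-only type in the standard has more members than shown here, so treat this as an approximation rather than the specified interface.

```cpp
#include <cassert>
#include <optional>
#include <utility>

// Simplified illustration only; not the standard's exposition-only type.
template <class T>
class non_propagating_cache_sketch {
  std::optional<T> value_;

public:
  non_propagating_cache_sketch() = default;

  // Copying yields an empty cache: the cached value is not propagated.
  non_propagating_cache_sketch(const non_propagating_cache_sketch&) noexcept {}

  // Moving yields an empty cache and also empties the source.
  non_propagating_cache_sketch(non_propagating_cache_sketch&& other) noexcept {
    other.value_.reset();
  }

  // Copy assignment discards this cache's value (unless self-assigning).
  non_propagating_cache_sketch& operator=(
      const non_propagating_cache_sketch& other) noexcept {
    if (this != &other) value_.reset();
    return *this;
  }

  // Move assignment discards both cached values.
  non_propagating_cache_sketch& operator=(
      non_propagating_cache_sketch&& other) noexcept {
    value_.reset();
    other.value_.reset();
    return *this;
  }

  template <class... Args>
  T& emplace(Args&&... args) {
    return value_.emplace(std::forward<Args>(args)...);
  }

  bool has_value() const noexcept { return value_.has_value(); }
  T& operator*() { return *value_; }
};

// Usage: a copy of a filled cache starts out empty; the original is untouched.
inline void demo_copy_discards_cache() {
  non_propagating_cache_sketch<int> original;
  original.emplace(42);
  non_propagating_cache_sketch<int> copy = original;
  assert(original.has_value());
  assert(!copy.has_value());
}
```

Nothing in this wrapper ties the lifetime of the cached object to anything the caller holds, which is why it doesn't help with the lifetime problem under discussion.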

The updated verbiage from the paper is copied below.

[...]

In response, it was suggested that the above lifetime issue could be
addressed by changing the return type of .base_code_units() to something
like std::inplace_vector<char8_t, 4>.

That creates a different lifetime problem. Consider this example. The
Unicode Tags block is intended for use in flag emojis but has been used for
LLM prompt injections. Say a user writes the following function, which
divides the stream of characters into Tags and non-Tags, and also imagine
that they have a custom sink type that accepts iterator pairs rather than
ranges:

    constexpr bool is_tag(char32_t c) { return (c & ~0x7F) == 0xE0000; }

    void partition_tags(std::ranges::range auto text, sink non_tags, sink tags) {
      auto utf_view = text | std::views::to_utf32;
      for (auto it = utf_view.begin(); it != utf_view.end(); ++it) {
        (is_tag(*it) ? tags : non_tags).consume(
          it.base_code_units().begin(), it.base_code_units().end());
      }
    }

Again, this works perfectly well when partition_tags is passed a forward
range. But when it’s passed an input range, each call to
.base_code_units() returns a separate temporary std::inplace_vector, so
it.base_code_units().begin() and it.base_code_units().end() point into
different objects, and the function invokes UB.
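
To make the failure mode concrete, here is a small sketch of the pattern a user would have to remember instead. Everything in it is a hypothetical stand-in: `code_units` plays the role of the by-value `std::inplace_vector<char8_t, 4>` (with `unsigned char` in place of `char8_t` so it builds without newer library support), `fake_input_iterator` is not the proposed iterator, and `sink` is just an iterator-pair consumer like the one assumed above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a small container returned by value.
struct code_units {
  std::array<unsigned char, 4> data{};
  std::size_t size = 0;
  const unsigned char* begin() const { return data.data(); }
  const unsigned char* end() const { return data.data() + size; }
};

// Hypothetical stand-in for a transcoding iterator over an input range.
struct fake_input_iterator {
  code_units current;
  code_units base_code_units() const { return current; }  // fresh copy per call
};

// Hypothetical iterator-pair sink.
struct sink {
  std::string bytes;
  void consume(const unsigned char* first, const unsigned char* last) {
    for (; first != last; ++first) bytes.push_back(static_cast<char>(*first));
  }
};

// Unsafe: begin() and end() would come from two different temporaries.
//   s.consume(it.base_code_units().begin(), it.base_code_units().end());
//
// Safe: materialize the container once, then take both iterators from it.
inline void consume_current(sink& s, const fake_input_iterator& it) {
  const code_units units = it.base_code_units();
  s.consume(units.begin(), units.end());
}

// Usage: the sink receives exactly the four cached code units.
inline void demo_consume_current() {
  fake_input_iterator it;
  it.current.data = {0xF3, 0xA0, 0x80, 0x81};  // UTF-8 for U+E0001
  it.current.size = 4;
  sink s;
  consume_current(s, it);
  assert(s.bytes.size() == 4);
}
```

The safe form is easy to write, but nothing stops a user from writing the unsafe one, and it compiles identically for forward and input ranges.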

10.5.4 Survey of Range Adaptors that Downgrade to Input
<https://isocpp.org/files/papers/D2728R14.html#survey-of-range-adaptors-that-downgrade-to-input>

Some range adaptors downgrade forward ranges into input ranges: these are,
to my understanding, views::as_input, views::cache_latest, views::join, and
views::join_with.

[range.as.input.overview] states, “This is useful to avoid overhead that
can be necessary to provide support for the operations needed for greater
iterator strength.” This use case is potentially relevant for transcoding
views, since the size of the iterator may be greater with a stronger
iterator category. For example, bidirectional transcoding iterators need to
store the begin iterator from the underlying range to avoid overrunning the
beginning when transcoding backwards, but forward iterators don’t need it.

But implementing .base_code_units() for input views would actually cause
views::as_input to *increase* the transcoding iterator’s overhead relative
to its forward-iterator implementation, because the iterator would need to
contain an additional code unit cache.
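
The overhead argument can be illustrated with a back-of-the-envelope layout comparison. The three structs below are hypothetical and model only the members mentioned above; a real transcoding iterator would carry additional decoding state, and `unsigned char` stands in for `char8_t`.

```cpp
#include <cstddef>

// Hypothetical state for a forward transcoding iterator.
template <class It>
struct forward_state {
  It current;
  It last;  // needed so decoding never reads past the end
};

// Hypothetical state for a bidirectional transcoding iterator.
template <class It>
struct bidirectional_state {
  It first;  // needed so decrementing never runs past the beginning
  It current;
  It last;
};

// Hypothetical state for an input iterator that supports .base_code_units():
// the code units of the current code point must be cached, because the
// underlying range cannot be read a second time.
template <class It>
struct input_state_with_cache {
  It current;
  It last;
  unsigned char cached_units[4];  // up to four UTF-8 code units
  unsigned char cached_size;
};

using code_unit_ptr = const unsigned char*;

static_assert(sizeof(bidirectional_state<code_unit_ptr>) >
                  sizeof(forward_state<code_unit_ptr>),
              "bidirectional state carries an extra iterator");
static_assert(sizeof(input_state_with_cache<code_unit_ptr>) >
                  sizeof(forward_state<code_unit_ptr>),
              "the code unit cache makes the input state larger than forward");
```

Under these assumptions the cached input layout is larger than the forward layout, which is the opposite of what a user reaching for views::as_input to shed overhead would expect.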

views::as_input was introduced by [P3725R3] <https://wg21.link/p3725r3>,
“Filter View Extensions for Safer Use,” and, rather than avoiding overhead,
its main motivation was composition with std::views::filter in order to
avoid pitfalls related to mutating through a filter.

This is potentially relevant to transcoding, in that someone might write a
filter-view pipeline on characters. Say a user wants to print the UTF-8
code units for all the non-ASCII code points in a range. That would look
like this:

    void print_nonascii_code_points_and_code_units(std::ranges::range auto text) {
      auto print_code_point{
        [](char32_t code_point, auto code_unit_range) {
          std::println(
            "{:#x} = {::#x}", static_cast<std::uint32_t>(code_point),
            code_unit_range | std::views::transform([](char8_t c) { return (std::uint8_t)c; }));
        }};
      auto code_points = text
        | std::views::filter([](char8_t c) { return c >= 0x80; })
        | std::views::to_utf32;
      for (auto it = code_points.begin(); it != code_points.end(); ++it) {
        print_code_point(*it, it.base_code_units());
      }
    }

A user following the [P3725R3] <https://wg21.link/p3725r3> guidance might
insert a views::as_input adaptor into the pipeline before std::views::filter,
which would continue to compile and work if we provided .base_code_units()
for input ranges, but which would cause
print_nonascii_code_points_and_code_units to fail to compile if we didn’t.

But views::as_input isn’t strictly necessary here. And we already need to
teach users that inserting views::as_input before std::views::filter will,
in rare cases, cause some uses of .base() to fail to compile. To
demonstrate why this isn’t a novelty, consider the following example:

    struct Task { int priority; };

    bool submit_batch(std::ranges::range auto batch);

    // Submit the high-priority tasks in batches; on a transient failure, hand the
    // remaining high-priority tasks to the retry queue.
    void submit_high_priority_tasks(std::vector<Task>& tasks) {
      auto high = tasks | std::views::filter([](Task const& t) { return t.priority > 100; });
      auto batches = high | std::views::chunk(BATCH_SIZE);
      for (auto it = batches.begin(); it != batches.end(); ++it) {
        if (!submit_batch(*it)) {
          requeue(std::ranges::subrange(it.base(), high.end()));
          return;
        }
      }
    }

This works as written, but if views::as_input is inserted in front of
views::filter, the call to it.base() fails to compile because
std::ranges::chunk_view’s iterator doesn’t provide .base() for input views.
But views::as_input is unnecessary here as well.

Furthermore, it’s worth noting that the list of plausible reasons to apply
a filter_view on code *units* as opposed to code *points* is extremely
short; ordinarily, doing so risks corrupting the output.
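
As a quick illustration of the corruption risk, the sketch below drops a single code unit value from a UTF-8 string and then runs a structural check on the result. Both helpers are hypothetical and written only for this message; the check looks at lead and continuation byte structure and deliberately ignores overlong forms, surrogates, and out-of-range values.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Structural UTF-8 check only: each lead byte must be followed by the right
// number of continuation bytes.
inline bool is_structurally_valid_utf8(const std::string& s) {
  std::size_t i = 0;
  while (i < s.size()) {
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    std::size_t continuation_count = 0;
    if (lead < 0x80) continuation_count = 0;
    else if ((lead & 0xE0) == 0xC0) continuation_count = 1;
    else if ((lead & 0xF0) == 0xE0) continuation_count = 2;
    else if ((lead & 0xF8) == 0xF0) continuation_count = 3;
    else return false;  // stray continuation byte or invalid lead byte
    ++i;
    for (std::size_t k = 0; k < continuation_count; ++k, ++i) {
      if (i >= s.size()) return false;  // truncated sequence
      if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) return false;
    }
  }
  return true;
}

// Hypothetical code-unit-level filter: drops every byte equal to `unwanted`.
inline std::string drop_code_unit(const std::string& s, unsigned char unwanted) {
  std::string out;
  for (char c : s) {
    if (static_cast<unsigned char>(c) != unwanted) out.push_back(c);
  }
  return out;
}

// Usage: removing the continuation byte of U+00E9 leaves a dangling lead byte.
inline void demo_code_unit_filter() {
  const std::string text = "caf\xC3\xA9";  // "caf" followed by U+00E9
  assert(is_structurally_valid_utf8(text));
  assert(!is_structurally_valid_utf8(drop_code_unit(text, 0xA9)));
}
```

A filter applied after transcoding, on code points, cannot split a sequence this way.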

Moving on to views::cache_latest: that one is an adaptor with a niche use
case and no special relevance to transcoding.

views::join is directly relevant, since it’s common for users to want to
reassemble a text string that had previously been broken up into separate
parts before transcoding it. views::join_with is also potentially relevant,
since users may want to transcode text after having used views::join_with
to add separators to it.

It’s important to note that in the common case, views::join and
views::join_with do not downgrade from forward to input. They do so only if
the range of ranges they are given is a range of *prvalue* ranges.

For example, the views::join adaptor in the following example does not
downgrade:

    void print_errors(std::ranges::range auto packets) {
      auto print_code_units =
        [](std::ranges::range auto code_unit) {
          std::println("{::#x}",
            code_unit
              | std::views::transform([](char8_t c) { return (std::uint8_t)c; }));
        };
      auto utf_view = packets
        | std::views::join
        | std::views::to_utf32_or_error;
      for (auto it = utf_view.begin(); it != utf_view.end(); ++it) {
        if (!(*it).has_value()) {
          print_code_units(it.base_code_units());
        }
      }
    }

But, if the packets need to be decrypted before transcoding, and the user
alters the pipeline like so:

    + std::u8string decrypt(std::u8string_view packet) {
    +   return packet
    +     | std::views::transform(
    +         [](char8_t c) {
    +           return static_cast<char8_t>(c ^ 0x55);
    +         })
    +     | std::ranges::to<std::u8string>();
    + }

      void print_errors(std::ranges::range auto packets) {
        auto print_code_units =
          [](std::ranges::range auto code_unit) {
            std::println("{::#x}",
              code_unit
                | std::views::transform([](char8_t c) { return (std::uint8_t)c; }));
          };
        auto utf_view = packets
    +     | std::views::transform(decrypt)
          | std::views::join
          | std::views::to_utf32_or_error;
        for (auto it = utf_view.begin(); it != utf_view.end(); ++it) {
          if (!(*it).has_value()) {
            print_code_units(it.base_code_units());
          }
        }
      }

Then it downgrades.

[...]

Thanks,

Eddie


On Mon, Jun 8, 2026 at 10:36 AM Tom Honermann <tom_at_[hidden]> wrote:

> Thank you, Eddie. I appreciate the treatise provided; I think it covers
> the concerns well.
>
> The lifetime safety concerns could, I think, be addressed by
> base_code_units() returning (in the case of an input range) a small
> container (e.g., std::array<char8_t, 4>) by value rather than returning a
> reference to a cache held in the iterator. Note that the subrange returned
> need not be mutable (and probably shouldn't be).
>
> I mentioned one concern that wasn't addressed; the propensity for input
> ranges to be encountered at an increased frequency due to use of range
> adapters. I don't have a good sense of how often demotion to an input range
> should be expected going forward. I think it would be helpful to include
> some discussion of that and/or a list of range adapters that result in such
> demotion (a search for "iterator_concept denotes input_iterator_tag" and
> "iterator_category denotes input_iterator_tag" may be helpful). There seems
> to be some tension in the ranges library regarding support for input ranges
> and their potential use to avoid overhead (see [range.as.input.overview]
> <https://eel.is/c++draft/range.as.input.overview>). The standard
> specifies a *non-propagating-cache* ([range.nonprop.cache]
> <https://eel.is/c++draft/range.nonprop.cache>) exposition-only type that
> looks like it might be relevant too; it is used in the specification for
> join_view, join_with_view, lazy_split_view, chunk_view (for input
> ranges), and cache_latest_view. You concluded that the ergonomics of easy
> access to the underlying code unit sequence is not favored relative to the
> safety aspects, but I'm yet to be convinced pending more analysis of range
> adapters.
>
> Tom.
> On 6/7/26 5:29 PM, Eddie Nolan via SG16 wrote:
>
> Cross-posting this message to both the SG9 and SG16 mailing lists.
>
> I've uploaded a new draft D2728R14 of my transcoding views paper, which
> includes a new discussion of the idea of adding a member function to the
> transcoding iterator to provide the underlying code unit range for the
> current code point, and my case against doing so, in the "Design Discussion
> and Alternatives" section.
>
> It can be found here:
>
> https://isocpp.org/files/papers/D2728R14.html#base_code_units
>
> This should address the requests for relevant code examples from the most
> recent SG16 telecon (2026-05-27
> <https://wiki.isocpp.org/2026_Telecons:SG16Teleconference2026-05-27>).
> See also previous discussions on the SG16 reflector
> <https://lists.isocpp.org/sg16/2026/05/4711.php> and mattermost
> <https://chat.isocpp.org/general/pl/6hms7iacbtbmppcyubezerc95a>.
>
> Thanks,
>
> Eddie
>
>

Received on 2026-06-08 23:25:20