Hi Tom,

Thanks for the feedback. I've updated my draft to try to provide more rationale: https://isocpp.org/files/papers/D2728R14.html#base_code_units

Unfortunately, I don't think the non-propagating-cache idea is relevant here; it's essentially a `std::optional` whose copy and move operations discard the stored object, so that the cache doesn't propagate when the view is copied or moved.
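
For reference, here is a minimal sketch of what I mean, loosely modeled on the exposition-only type in [range.nonprop.cache]. The class name non_propagating_cache_sketch is my own, and this is a simplification under that assumption rather than the standard's specification:

```cpp
#include <cassert>
#include <optional>
#include <utility>

// Hypothetical simplification of the exposition-only non-propagating cache:
// an optional whose copies and moves never carry the cached value along.
template <class T>
class non_propagating_cache_sketch {
  std::optional<T> value_;

public:
  non_propagating_cache_sketch() = default;

  // Copy construction yields an empty cache.
  non_propagating_cache_sketch(const non_propagating_cache_sketch&) noexcept {}

  // Move construction yields an empty cache and also empties the source.
  non_propagating_cache_sketch(non_propagating_cache_sketch&& other) noexcept {
    other.value_.reset();
  }

  // Copy assignment discards our own value unless self-assigning.
  non_propagating_cache_sketch& operator=(
      const non_propagating_cache_sketch& other) noexcept {
    if (this != &other) value_.reset();
    return *this;
  }

  // Move assignment discards both values.
  non_propagating_cache_sketch& operator=(
      non_propagating_cache_sketch&& other) noexcept {
    value_.reset();
    other.value_.reset();
    return *this;
  }

  template <class... Args>
  T& emplace(Args&&... args) {
    return value_.emplace(std::forward<Args>(args)...);
  }

  bool has_value() const noexcept { return value_.has_value(); }
  T& operator*() { return *value_; }
};
```

A copy or a move of a populated cache always starts out empty, and that is the only behavior it adds on top of `std::optional`.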

The updated verbiage from the paper is copied below.

[...]

In response, it was suggested that the above lifetime issue could be addressed by changing the return type of .base_code_units() to something like std::inplace_vector<char8_t, 4>.

That creates a different lifetime problem. Consider the Unicode Tags block (U+E0000 to U+E007F), which is intended for use in flag emojis but has been used for LLM prompt injections. Say a user has a custom sink type that accepts iterator pairs rather than ranges, and writes the following function to divide a stream of characters into Tags and non-Tags:

constexpr bool is_tag(char32_t c) { return (c & ~0x7F) == 0xE0000; }

void partition_tags(std::ranges::range auto text, sink non_tags, sink tags) {
  auto utf_view = text | std::views::to_utf32;
  for (auto it = utf_view.begin(); it != utf_view.end(); ++it) {
    (is_tag(*it) ? tags : non_tags).consume(
      it.base_code_units().begin(), it.base_code_units().end());
  }
}

Again, this works perfectly well when partition_tags is passed a forward range. But when it’s passed an input range, each call to .base_code_units() returns a separate temporary std::inplace_vector, so it.base_code_units().begin() and it.base_code_units().end() point into different objects, and the function invokes UB.
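
A minimal sketch of the problem and of the workaround a user would have to know about. All of the types here are hypothetical stand-ins: code_unit_buffer plays the role of std::inplace_vector<char8_t, 4>, input_iterator_sketch plays the role of the transcoding iterator, byte_sink plays the role of the user's sink, and unsigned char replaces char8_t so that the sketch compiles as C++17:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for std::inplace_vector<char8_t, 4>:
// a fixed-capacity buffer that is returned by value.
struct code_unit_buffer {
  std::array<unsigned char, 4> units{};
  std::size_t size = 0;

  const unsigned char* begin() const { return units.data(); }
  const unsigned char* end() const { return units.data() + size; }
};

// Hypothetical stand-in for an input transcoding iterator whose
// base_code_units() returns a fresh buffer object on every call.
struct input_iterator_sketch {
  code_unit_buffer cached;
  code_unit_buffer base_code_units() const { return cached; }
};

// Hypothetical sink that accepts an iterator pair rather than a range.
struct byte_sink {
  std::vector<unsigned char> bytes;
  void consume(const unsigned char* first, const unsigned char* last) {
    bytes.insert(bytes.end(), first, last);
  }
};

void consume_current(const input_iterator_sketch& it, byte_sink& out) {
  // UB if written as
  //   out.consume(it.base_code_units().begin(), it.base_code_units().end());
  // because the two calls return two distinct temporaries.
  const code_unit_buffer units = it.base_code_units();  // one named object
  out.consume(units.begin(), units.end());
}
```

Binding the result to a named local makes both iterators refer to the same object, but nothing in the forward-range version of the code hints that this is necessary.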

10.5.4 Survey of Range Adaptors that Downgrade to Input

Some range adaptors downgrade forward ranges into input ranges: these are, to my understanding, views::as_input, views::cache_latest, views::join, and views::join_with.

[range.as.input.overview] states, “This is useful to avoid overhead that can be necessary to provide support for the operations needed for greater iterator strength.” This use case is potentially relevant for transcoding views, since the size of the iterator may be greater with a stronger iterator category. For example, bidirectional transcoding iterators need to store the begin iterator from the underlying range to avoid overrunning the beginning when transcoding backwards, but forward iterators don’t need it.
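
The need for the begin iterator can be seen in a simplified sketch of stepping backwards over UTF-8. The name previous_sequence_start and its treatment of ill-formed input are assumptions made for illustration, not the paper's algorithm; a real transcoder would also have to decide how to report errors:

```cpp
#include <cassert>
#include <string>

// True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
constexpr bool is_continuation(unsigned char c) { return (c & 0xC0) == 0x80; }

// Step back from `it` to the start of the preceding code unit sequence.
// Precondition: it != begin.
// Steps over at most three continuation bytes and never moves before `begin`.
template <class BidirIt>
BidirIt previous_sequence_start(BidirIt begin, BidirIt it) {
  --it;
  for (int steps = 0;
       steps < 3 && it != begin &&
       is_continuation(static_cast<unsigned char>(*it));
       ++steps) {
    --it;
  }
  return it;
}
```

Without the `it != begin` check, input that starts with stray continuation bytes would walk the iterator past the front of the range. Forward-only decoding never moves backwards, so it has no use for `begin`.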

But implementing .base_code_units() for input views would actually cause views::as_input to increase the transcoding iterator’s overhead relative to its forward-iterator implementation, because the iterator would need to contain an additional code unit cache.

views::as_input was introduced by [P3725R3], “Filter View Extensions for Safer Use.” Rather than avoiding overhead, its main motivation was composition with std::views::filter, in order to avoid pitfalls related to mutating elements through a filter.

This is potentially relevant to transcoding, in that someone might write a filter-view pipeline on characters. Say a user wants to print the UTF-8 code units for all the non-ASCII code points in a range. That would look like this:

void print_nonascii_code_points_and_code_units(std::ranges::range auto text) {
  auto print_code_point =
    [](char32_t code_point, auto code_unit_range) {
      std::println(
        "{:#x} = {::#x}", static_cast<std::uint32_t>(code_point),
        code_unit_range | std::views::transform([](char8_t c) { return (std::uint8_t)c; }));
    };
  auto code_points = text
                     | std::views::filter([](char8_t c) { return c >= 0x80; })
                     | std::views::to_utf32;
  for (auto it = code_points.begin(); it != code_points.end(); ++it) {
    print_code_point(*it, it.base_code_units());
  }
}

A user following the [P3725R3] guidance might insert a views::as_input adaptor into the pipeline before std::views::filter, which would continue to compile and work if we provided .base_code_units() for input ranges, but which would cause print_nonascii_code_points_and_code_units to fail to compile if we didn’t.

But views::as_input isn’t strictly necessary here. And we already need to teach users that inserting views::as_input before std::views::filter will, in rare cases, cause some uses of .base() to fail to compile. To demonstrate why this isn’t a novelty, consider the following example:

struct Task { int priority; };

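inline constexpr std::ptrdiff_t BATCH_SIZE = 16;  // illustrative value

bool submit_batch(std::ranges::range auto batch);
void requeue(std::ranges::range auto remaining);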

// Submit the high-priority tasks in batches; on a transient failure, hand the
// remaining high-priority tasks to the retry queue.
void submit_high_priority_tasks(std::vector<Task>& tasks) {
  auto high = tasks | std::views::filter([](Task const& t) { return t.priority > 100; });
  auto batches = high | std::views::chunk(BATCH_SIZE);
  for (auto it = batches.begin(); it != batches.end(); ++it) {
    if (!submit_batch(*it)) {
      requeue(std::ranges::subrange(it.base(), high.end()));
      return;
    }
  }
}

This works as written, but if views::as_input is inserted in front of views::filter, the call to it.base() fails to compile because std::ranges::chunk_view’s iterator doesn’t provide .base() for input views. But views::as_input is unnecessary here as well.

Furthermore, it’s worth noting that the list of plausible reasons to apply a filter_view on code units as opposed to code points is extremely short; ordinarily, doing so risks corrupting the output.
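
A small sketch of how filtering code units can corrupt text. Everything here is a hypothetical stand-in: filter_code_units plays the role of a filter_view over code units, std::string holds the UTF-8 bytes so that the sketch compiles as C++17, and has_utf8_structure is a deliberately simplified check that does not reject overlong forms or surrogates:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Simplified structural check: every lead byte must be followed by the
// expected number of continuation bytes.
inline bool has_utf8_structure(const std::string& s) {
  std::size_t i = 0;
  while (i < s.size()) {
    const unsigned char c = static_cast<unsigned char>(s[i]);
    std::size_t trailing = 0;
    if (c < 0x80) trailing = 0;
    else if ((c & 0xE0) == 0xC0) trailing = 1;
    else if ((c & 0xF0) == 0xE0) trailing = 2;
    else if ((c & 0xF8) == 0xF0) trailing = 3;
    else return false;  // stray continuation byte or invalid lead byte
    if (s.size() - i - 1 < trailing) return false;  // truncated sequence
    for (std::size_t k = 1; k <= trailing; ++k) {
      if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80) return false;
    }
    i += trailing + 1;
  }
  return true;
}

// Stand-in for a filter_view over code units: keep the units for which
// `keep` returns true.
template <class Pred>
std::string filter_code_units(const std::string& s, Pred keep) {
  std::string out;
  std::copy_if(s.begin(), s.end(), std::back_inserter(out), keep);
  return out;
}
```

Keeping only code units that are at least 0x80 happens to preserve whole multi-byte sequences, which is why the earlier example is one of the few safe cases. A predicate that drops a single byte value such as 0xA9 from "caf\xC3\xA9" leaves a lone lead byte behind, and the result is no longer well-formed.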

Moving on to views::cache_latest: that one is an adaptor with a niche use case and no special relevance to transcoding.

views::join is directly relevant, since it’s common for users to want to reassemble a text string that had previously been broken up into separate parts before transcoding it. views::join_with is also potentially relevant, since users may want to transcode text after having used views::join_with to add separators to it.

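It’s important to note that in the common case, views::join and views::join_with do not downgrade from forward to input. They do so only if the range of ranges they’re given is a range of prvalue ranges, that is, if dereferencing the outer iterator yields each inner range by value rather than by reference.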

For example, the views::join adaptor in the following example does not downgrade:

void print_errors(std::ranges::range auto packets) {
  auto print_code_units =
    [](std::ranges::range auto code_unit) {
      std::println("{::#x}",
                   code_unit
                   | std::views::transform([](char8_t c) { return (std::uint8_t)c; }));
    };
  auto utf_view = packets
                | std::views::join
                | std::views::to_utf32_or_error;
  for (auto it = utf_view.begin(); it != utf_view.end(); ++it) {
    if (!(*it).has_value()) {
      print_code_units(it.base_code_units());
    }
  }
}

But if the packets need to be decrypted before transcoding, and the user alters the pipeline like so:

+ std::u8string decrypt(std::u8string_view packet) {
+   return packet
+          | std::views::transform(
+              [](char8_t c) {
+                return static_cast<char8_t>(c ^ 0x55);
+              })
+          | std::ranges::to<std::u8string>();
+ }

void print_errors(std::ranges::range auto packets) {
  auto print_code_units =
    [](std::ranges::range auto code_unit) {
      std::println("{::#x}",
                   code_unit
                   | std::views::transform([](char8_t c) { return (std::uint8_t)c; }));
    };
  auto utf_view = packets
+               | std::views::transform(decrypt)
                | std::views::join
                | std::views::to_utf32_or_error;
  for (auto it = utf_view.begin(); it != utf_view.end(); ++it) {
    if (!(*it).has_value()) {
      print_code_units(it.base_code_units());
    }
  }
}

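Then views::join downgrades to an input range, because each inner range it receives is now a prvalue std::u8string returned by decrypt.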

[...]

Thanks,

Eddie