Thank you, Eddie. I greatly appreciate your continued followup and willingness to update the paper to record the discussion.
Tom.
Hi Tom,
Thanks for the feedback. I've updated my draft to try to provide more rationale: https://isocpp.org/files/papers/D2728R14.html#base_code_units
Unfortunately I think the non-propagating-cache idea is not relevant here; it's basically just a `std::optional` with weird copy/move semantics that discard the object so the cache doesn't propagate when the view is copied/moved.
The updated verbiage from the paper is copied below.
Thanks,
Eddie
[...]
In response, it was suggested that the above lifetime issue could be addressed by changing the return type of `.base_code_units()` to something like `std::inplace_vector<char8_t, 4>`.
That creates a different lifetime problem. Consider this example. The Unicode Tags block is intended for use in flag emojis but has been used for LLM prompt injections. Say a user writes the following function, which divides the stream of characters into Tags and non-Tags, and also imagine that they have a custom sink type that accepts iterator pairs rather than ranges:

    constexpr bool is_tag(char32_t c) { return (c & ~0x7F) == 0xE0000; }

    void partition_tags(std::ranges::range auto text, sink non_tags, sink tags) {
      auto utf_view = text | std::views::to_utf32;
      for (auto it = utf_view.begin(); it != utf_view.end(); ++it) {
        (is_tag(*it) ? tags : non_tags).consume(
          it.base_code_units().begin(), it.base_code_units().end());
      }
    }

Again, this works perfectly well when `partition_tags` is passed a forward range, but then when it’s passed an input range, because each call to `.base_code_units()` returns a separate temporary `std::inplace_vector`, `it.base_code_units().begin()` and `it.base_code_units().end()` now point to different objects, so the function invokes UB.

10.5.4 Survey of Range Adaptors that Downgrade to Input
Some range adaptors downgrade forward ranges into input ranges: these are, to my understanding, `views::as_input`, `views::cache_latest`, `views::join`, and `views::join_with`.
[range.as.input.overview] states, “This is useful to avoid overhead that can be necessary to provide support for the operations needed for greater iterator strength.” This use case is potentially relevant for transcoding views, since the size of the iterator may be greater with a stronger iterator category. For example, bidirectional transcoding iterators need to store the begin iterator from the underlying range to avoid overrunning the beginning when transcoding backwards, but forward iterators don’t need it.
But implementing `.base_code_units()` for input views would actually cause `views::as_input` to increase the transcoding iterator’s overhead relative to its forward-iterator implementation, because the iterator would need to contain an additional code unit cache.
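The size argument above can be illustrated with a rough sketch. The three structs below are hypothetical layouts, not the paper's implementation: they assume pointer-based underlying iterators, and `unsigned char` stands in for `char8_t` so the sketch also builds as C++17. Under those assumptions, both the stored begin position and a four-code-unit cache make the iterator state larger than the forward-only layout.

```cpp
#include <cstdint>

// Stand-in for char8_t so the sketch also builds as C++17.
using code_unit = unsigned char;

// Hypothetical forward iterator state: current position and end of input.
struct forward_state {
    const code_unit* current;
    const code_unit* last;
};

// Hypothetical bidirectional state: additionally remembers where the input
// begins, so that transcoding backwards cannot run past the start.
struct bidirectional_state {
    const code_unit* first;
    const code_unit* current;
    const code_unit* last;
};

// Hypothetical input state that keeps the code units of the current code
// point, because a single-pass source cannot be re-read to recover them.
struct caching_input_state {
    const code_unit* current;
    const code_unit* last;
    code_unit cache[4];          // up to four UTF-8 code units
    std::uint8_t cached_count;   // how many entries of cache are in use
};

static_assert(sizeof(bidirectional_state) > sizeof(forward_state),
              "storing the begin position costs space");
static_assert(sizeof(caching_input_state) > sizeof(forward_state),
              "storing a code unit cache costs space");
```

A real transcoding iterator would hold more state than this (for example, decoded output), so the absolute sizes here mean nothing; only the relative growth is the point.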
`views::as_input` was introduced by [P3725R3], “Filter View Extensions for Safer Use,” and, rather than avoiding overhead, its main motivation was composition with `std::views::filter` in order to avoid pitfalls related to mutating through a filter.
This is potentially relevant to transcoding, in that someone might write a filter-view pipeline on characters. Say a user wants to print the UTF-8 code units for all the non-ASCII code points in a range. That would look like this:

    void print_nonascii_code_points_and_code_units(std::ranges::range auto text) {
      auto print_code_point{
        [](char32_t code_point, auto code_unit_range) {
          std::println(
            "{:#x} = {::#x}",
            static_cast<std::uint32_t>(code_point),
            code_unit_range
              | std::views::transform([](char8_t c) { return (std::uint8_t)c; }));
        }};
      auto code_points =
        text
        | std::views::filter([](char8_t c) { return c >= 0x80; })
        | std::views::to_utf32;
      for (auto it = code_points.begin(); it != code_points.end(); ++it) {
        print_code_point(*it, it.base_code_units());
      }
    }

A user following the [P3725R3] guidance might insert a `views::as_input` adaptor into the pipeline before `std::views::filter`, which would continue to compile and work if we provided `.base_code_units()` for input ranges, but which would cause `print_nonascii_code_points_and_code_units` to fail to compile if we didn’t.
But `views::as_input` isn’t strictly necessary here. And we already need to teach users that inserting `views::as_input` before `std::views::filter` will, in rare cases, cause some uses of `.base()` to fail to compile. To demonstrate why this isn’t a novelty, consider the following example:

    struct Task { int priority; };

    bool submit_batch(std::ranges::range auto batch);

    // Submit the high-priority tasks in batches; on a transient failure, hand the
    // remaining high-priority tasks to the retry queue.
    void submit_high_priority_tasks(std::vector<Task>& tasks) {
      auto high =
        tasks | std::views::filter([](Task const& t) { return t.priority > 100; });
      auto batches = high | std::views::chunk(BATCH_SIZE);
      for (auto it = batches.begin(); it != batches.end(); ++it) {
        if (!submit_batch(*it)) {
          requeue(std::ranges::subrange(it.base(), high.end()));
          return;
        }
      }
    }

This works as written, but if `views::as_input` is inserted in front of `views::filter`, the call to `it.base()` fails to compile because `std::ranges::chunk_view`’s iterator doesn’t provide `.base()` for input views. But `views::as_input` is unnecessary here as well.
Furthermore, it’s worth noting that the list of plausible reasons to apply a filter_view on code units as opposed to code points is extremely short; ordinarily, doing so risks corrupting the output.
Moving on to `views::cache_latest`: that one is an adaptor with a niche use case and no special relevance to transcoding.
`views::join` is directly relevant, since it’s common for users to want to reassemble a text string that had previously been broken up into separate parts before transcoding it. `views::join_with` is also potentially relevant, since users may want to transcode text after having used `views::join_with` to add separators to it.
It’s important to note that in the common case, `views::join` and `views::join_with` do not downgrade from forward to input. They only do so if the range of ranges they’re given is a range of prvalue ranges.
For example, the `views::join` adaptor in the following example does not downgrade:

    void print_errors(std::ranges::range auto packets) {
      auto print_code_units = [](std::ranges::range auto code_unit) {
        std::println("{::#x}", code_unit | std::views::transform([](char8_t c) {
          return (std::uint8_t)c;
        }));
      };
      auto utf_view =
        packets
        | std::views::join
        | std::views::to_utf32_or_error;
      for (auto it = utf_view.begin(); it != utf_view.end(); ++it) {
        if (!(*it).has_value()) {
          print_code_units(it.base_code_units());
        }
      }
    }

But, if the packets need to be decrypted before transcoding, and the user alters the pipeline like so:

    + std::u8string decrypt(std::u8string_view packet) {
    +   return packet
    +     | std::views::transform(
    +         [](char8_t c) {
    +           return static_cast<char8_t>(c ^ 0x55);
    +         })
    +     | std::ranges::to<std::u8string>();
    + }
      void print_errors(std::ranges::range auto packets) {
        auto print_code_units = [](std::ranges::range auto code_unit) {
          std::println("{::#x}", code_unit | std::views::transform([](char8_t c) {
            return (std::uint8_t)c;
          }));
        };
        auto utf_view =
          packets
    +     | std::views::transform(decrypt)
          | std::views::join
          | std::views::to_utf32_or_error;
        for (auto it = utf_view.begin(); it != utf_view.end(); ++it) {
          if (!(*it).has_value()) {
            print_code_units(it.base_code_units());
          }
        }
      }

Then it downgrades.
[...]
On Mon, Jun 8, 2026 at 10:36 AM Tom Honermann <tom@honermann.net> wrote:
Thank you, Eddie. I appreciate the treatise provided; I think it covers the concerns well.
The lifetime safety concerns could, I think, be addressed by base_code_units() returning (in the case of an input range) a small container (e.g., std::array<char8_t, 4>) by value rather than returning a reference to a cache held in the iterator. Note that the subrange returned need not be mutable (and probably shouldn't be).
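A minimal sketch of the by-value idea follows, with hypothetical names throughout: `code_unit_buffer`, `input_transcoding_iterator`, `sink`, `forward_code_units`, and `demo` are illustrations, not part of the paper. A `std::array` plus a length stands in for the small container, and `unsigned char` stands in for `char8_t` so the sketch also builds as C++17. It shows that every call returns a separate object, so a caller that needs an iterator pair has to store the result once and take both iterators from that single copy.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for char8_t so the sketch also builds as C++17.
using code_unit = unsigned char;

// Hypothetical by-value result: up to four code units plus a length.
struct code_unit_buffer {
    std::array<code_unit, 4> units{};
    std::size_t size = 0;

    const code_unit* begin() const { return units.data(); }
    const code_unit* end() const { return units.data() + size; }
};

// Hypothetical iterator over an input range; only the accessor is modeled.
struct input_transcoding_iterator {
    code_unit_buffer current;

    // Returns a fresh copy on every call, as a by-value API would.
    code_unit_buffer base_code_units() const { return current; }
};

// Hypothetical sink that accepts an iterator pair rather than a range.
struct sink {
    std::vector<code_unit> received;

    void consume(const code_unit* first, const code_unit* last) {
        received.insert(received.end(), first, last);
    }
};

// Materialize the buffer once, then take both iterators from that one object.
inline void forward_code_units(const input_transcoding_iterator& it, sink& out) {
    const code_unit_buffer units = it.base_code_units();
    out.consume(units.begin(), units.end());
}

inline void demo() {
    input_transcoding_iterator it;
    it.current.units = {0xF3, 0xA0, 0x80, 0x81};  // UTF-8 encoding of U+E0001
    it.current.size = 4;

    sink tags;
    forward_code_units(it, tags);
    assert(tags.received.size() == 4);

    // Two separate calls produce two separate buffers, so an iterator taken
    // from one must never be paired with an iterator taken from the other.
    const code_unit_buffer a = it.base_code_units();
    const code_unit_buffer b = it.base_code_units();
    assert(a.begin() != b.begin());
}
```

Whether users can be relied on to store the result before taking iterators is a separate question that the sketch does not settle.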
I mentioned one concern that wasn't addressed; the propensity for input ranges to be encountered at an increased frequency due to use of range adapters. I don't have a good sense of how often demotion to an input range should be expected going forward. I think it would be helpful to include some discussion of that and/or a list of range adapters that result in such demotion (a search for "iterator_concept denotes input_iterator_tag" and "iterator_category denotes input_iterator_tag" may be helpful). There seems to be some tension in the ranges library regarding support for input ranges and their potential use to avoid overhead (see [range.as.input.overview]). The standard specifies a non-propagating-cache ([range.nonprop.cache]) exposition-only type that looks like it might be relevant too; it is used in the specification for join_view, join_with_view, lazy_split_view, chunk_view (for input ranges), and cache_latest_view. You concluded that the ergonomics of easy access to the underlying code unit sequence is not favored relative to the safety aspects, but I'm yet to be convinced pending more analysis of range adapters.
Tom.
On 6/7/26 5:29 PM, Eddie Nolan via SG16 wrote:
Cross-posting this message to both the SG9 and SG16 mailing lists.
I've uploaded a new draft D2728R14 of my transcoding views paper, which includes a new discussion of the idea of adding a member function to the transcoding iterator to provide the underlying code unit range for the current code point, and my case against doing so, in the "Design Discussion and Alternatives" section.
This should address the requests for relevant code examples from the most recent SG16 telecon (2026-05-27). See also previous discussions on the SG16 reflector and mattermost.
Thanks,
Eddie