ISOCPP sg16 List: Re: [isocpp-sg16] UTF transcoding views draft D2728R14 includes .base_code

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 10 Jun 2026 02:59:01 -0400

Thank you, Eddie. I greatly appreciate your continued followup and
willingness to update the paper to record the discussion.

Tom.

On 6/8/26 7:25 PM, Eddie Nolan wrote:
> Hi Tom,
>
> Thanks for the feedback. I've updated my draft to try to provide more
> rationale: https://isocpp.org/files/papers/D2728R14.html#base_code_units
>
> Unfortunately I think the /non-propagating-cache/ idea is not relevant
> here; it's basically just a `std::optional` with weird copy/move
> semantics that discard the object so the cache doesn't propagate when
> the view is copied/moved.
>
> The updated verbiage from the paper is copied below.
>
> [...]
>
> In response, it was suggested that the above lifetime issue could be
> addressed by changing the return type of |.base_code_units()| to
> something like |std::inplace_vector<char8_t, 4>|.
>
> That creates a different lifetime problem. Consider this example. The
> Unicode Tags block is intended for use in flag emojis but has been
> used for LLM prompt injections. Say a user writes the following
> function, which divides the stream of characters into Tags and
> non-Tags, and also imagine that they have a custom sink type that
> accepts iterator pairs rather than ranges:
>
>     constexpr bool is_tag(char32_t c) { return (c & ~0x7F) == 0xE0000; }
>
>     void partition_tags(std::ranges::range auto text,
>                         sink non_tags, sink tags) {
>       auto utf_view = text | std::views::to_utf32;
>       for (auto it = utf_view.begin(); it != utf_view.end(); ++it) {
>         (is_tag(*it) ? tags : non_tags).consume(
>           it.base_code_units().begin(),
>           it.base_code_units().end());
>       }
>     }
>
> Again, this works perfectly well when |partition_tags| is passed a
> forward range, but then when it’s passed an input range, because each
> call to |.base_code_units()| returns a separate temporary
> |std::inplace_vector|, |it.base_code_units().begin()| and
> |it.base_code_units().end()| now point to different objects, so the
> function invokes UB.
>
>
> 10.5.4 Survey of Range Adaptors that Downgrade to Input
>
> Some range adaptors downgrade forward ranges into input ranges: these
> are, to my understanding, |views::as_input|, |views::cache_latest|,
> |views::join|, and |views::join_with|.
>
> |[range.as.input.overview]| states, “This is useful to avoid overhead
> that can be necessary to provide support for the operations needed for
> greater iterator strength.” This use case is potentially relevant for
> transcoding views, since the size of the iterator may be greater with
> a stronger iterator category. For example, bidirectional transcoding
> iterators need to store the begin iterator from the underlying range
> to avoid overrunning the beginning when transcoding backwards, but
> forward iterators don’t need it.
>
> But implementing |.base_code_units()| for input views would actually
> cause |views::as_input| to /increase/ the transcoding iterator’s
> overhead relative to its forward-iterator implementation, because the
> iterator would need to contain an additional code unit cache.
>
> |views::as_input| was introduced by [P3725R3]
> <https://wg21.link/p3725r3>, “Filter View Extensions for Safer Use,”
> and, rather than avoiding overhead, its main motivation was
> composition with |std::views::filter| in order to avoid pitfalls
> related to mutating through a filter.
>
> This is potentially relevant to transcoding, in that someone might
> write a filter-view pipeline on characters. Say a user wants to print
> the UTF-8 code units for all the non-ASCII code points in a range.
> That would look like this:
>
>     void print_nonascii_code_points_and_code_units(
>         std::ranges::range auto text) {
>       auto print_code_point{
>         [](char32_t code_point, auto code_unit_range) {
>           std::println(
>             "{:#x} = {::#x}",
>             static_cast<std::uint32_t>(code_point),
>             code_unit_range | std::views::transform(
>               [](char8_t c) { return (std::uint8_t)c; }));
>         }};
>       auto code_points = text
>         | std::views::filter([](char8_t c) { return c >= 0x80; })
>         | std::views::to_utf32;
>       for (auto it = code_points.begin(); it != code_points.end(); ++it) {
>         print_code_point(*it, it.base_code_units());
>       }
>     }
>
> A user following the [P3725R3] <https://wg21.link/p3725r3> guidance
> might insert a |views::as_input| adaptor into the pipeline before
> |std::views::filter|, which would continue to compile and work if we
> provided |.base_code_units()| for input ranges, but which would cause
> |print_nonascii_code_points_and_code_units| to fail to compile if we
> didn’t.
>
> But |views::as_input| isn’t strictly necessary here. And we already
> need to teach users that inserting |views::as_input| before
> |std::views::filter| will, in rare cases, cause some uses of |.base()|
> to fail to compile. To demonstrate why this isn’t a novelty, consider
> the following example:
>
>     struct Task { int priority; };
>     bool submit_batch(std::ranges::range auto batch);
>
>     // Submit the high-priority tasks in batches; on a transient failure,
>     // hand the remaining high-priority tasks to the retry queue.
>     void submit_high_priority_tasks(std::vector<Task>& tasks) {
>       auto high = tasks
>         | std::views::filter(
>             [](Task const& t) { return t.priority > 100; });
>       auto batches = high | std::views::chunk(BATCH_SIZE);
>       for (auto it = batches.begin(); it != batches.end(); ++it) {
>         if (!submit_batch(*it)) {
>           requeue(std::ranges::subrange(it.base(), high.end()));
>           return;
>         }
>       }
>     }
>
> This works as written, but if |views::as_input| is inserted in front
> of |views::filter|, the call to |it.base()| fails to compile because
> |std::ranges::chunk_view|’s iterator doesn’t provide |.base()| for
> input views. But |views::as_input| is unnecessary here as well.
>
> Furthermore, it’s worth noting that the list of plausible reasons to
> apply a |filter_view| to code /units/ as opposed to code /points/ is
> extremely short; ordinarily, doing so risks corrupting the output.
>
> Moving on to |views::cache_latest|: that one is an adaptor with a
> niche use case and no special relevance to transcoding.
>
> |views::join| is directly relevant, since it’s common for users to
> want to reassemble a text string that had previously been broken up
> into separate parts before transcoding it. |views::join_with| is also
> potentially relevant, since users may want to transcode text after
> having used |views::join_with| to add separators to it.
>
> It’s important to note that in the common case, |views::join| and
> |views::join_with| do not downgrade from forward to input. They only
> do so if the range of ranges they’re given is a range of /prvalue/ ranges.
>
> For example, the |views::join| adaptor in the following example does
> not downgrade:
>
>     void print_errors(std::ranges::range auto packets) {
>       auto print_code_units = [](std::ranges::range auto code_unit) {
>         std::println("{::#x}",
>           code_unit | std::views::transform(
>             [](char8_t c) { return (std::uint8_t)c; }));
>       };
>       auto utf_view = packets
>         | std::views::join
>         | std::views::to_utf32_or_error;
>       for (auto it = utf_view.begin(); it != utf_view.end(); ++it) {
>         if (!(*it).has_value()) {
>           print_code_units(it.base_code_units());
>         }
>       }
>     }
>
> But, if the packets need to be decrypted before transcoding, and the
> user alters the pipeline like so:
>
>     + std::u8string decrypt(std::u8string_view packet) {
>     +   return packet
>     +     | std::views::transform(
>     +         [](char8_t c) {
>     +           return static_cast<char8_t>(c ^ 0x55);
>     +         })
>     +     | std::ranges::to<std::u8string>();
>     + }
>       void print_errors(std::ranges::range auto packets) {
>         auto print_code_units = [](std::ranges::range auto code_unit) {
>           std::println("{::#x}",
>             code_unit | std::views::transform(
>               [](char8_t c) { return (std::uint8_t)c; }));
>         };
>         auto utf_view = packets
>     +     | std::views::transform(decrypt)
>           | std::views::join
>           | std::views::to_utf32_or_error;
>         for (auto it = utf_view.begin(); it != utf_view.end(); ++it) {
>           if (!(*it).has_value()) {
>             print_code_units(it.base_code_units());
>           }
>         }
>       }
>
> Then it downgrades.
>
> [...]
>
>
> Thanks,
>
> Eddie
>
>
> On Mon, Jun 8, 2026 at 10:36 AM Tom Honermann <tom_at_[hidden]> wrote:
>
> Thank you, Eddie. I appreciate the treatise provided; I think it
> covers the concerns well.
>
> The lifetime safety concerns could, I think, be addressed by
> base_code_units() returning (in the case of an input range) a
> small container (e.g., std::array<char8_t, 4>) by value rather
> than returning a reference to a cache held in the iterator. Note
> that the subrange returned need not be mutable (and probably
> shouldn't be).
>
> I mentioned one concern that wasn't addressed; the propensity for
> input ranges to be encountered at an increased frequency due to
> use of range adapters. I don't have a good sense of how often
> demotion to an input range should be expected going forward. I
> think it would be helpful to include some discussion of that
> and/or a list of range adapters that result in such demotion (a
> search for "iterator_concept denotes input_iterator_tag" and
> "iterator_category denotes input_iterator_tag" may be helpful).
> There seems to be some tension in the ranges library regarding
> support for input ranges and their potential use to avoid overhead
> (see [range.as.input.overview]
> <https://eel.is/c++draft/range.as.input.overview>). The standard
> specifies a /non-propagating-cache/ ([range.nonprop.cache]
> <https://eel.is/c++draft/range.nonprop.cache>) exposition-only
> type that looks like it might be relevant too; it is used in the
> specification for join_view, join_with_view, lazy_split_view,
> chunk_view (for input ranges), and cache_latest_view. You
> concluded that the ergonomics of easy access to the underlying
> code unit sequence is not favored relative to the safety aspects,
> but I'm yet to be convinced pending more analysis of range adapters.
>
> Tom.
>
> On 6/7/26 5:29 PM, Eddie Nolan via SG16 wrote:
>> Cross-posting this message to both the SG9 and SG16 mailing lists.
>>
>> I've uploaded a new draft D2728R14 of my transcoding views paper,
>> which includes a new discussion of the idea of adding a member
>> function to the transcoding iterator to provide the underlying
>> code unit range for the current code point, and my case against
>> doing so, in the "Design Discussion and Alternatives" section.
>>
>> It can be found here:
>>
>> https://isocpp.org/files/papers/D2728R14.html#base_code_units
>>
>> This should address the requests for relevant code examples from
>> the most recent SG16 telecon (2026-05-27
>> <https://wiki.isocpp.org/2026_Telecons:SG16Teleconference2026-05-27>).
>> See also previous discussions on the SG16 reflector
>> <https://lists.isocpp.org/sg16/2026/05/4711.php> and mattermost
>> <https://chat.isocpp.org/general/pl/6hms7iacbtbmppcyubezerc95a>.
>>
>> Thanks,
>>
>> Eddie
>>

Received on 2026-06-10 06:59:07