On Wed, May 27, 2026 at 2:34 PM Tom Honermann <tom@honermann.net> wrote:

On 5/27/26 2:27 PM, Tom Honermann via SG16 wrote:

On 5/27/26 2:47 AM, Eddie Nolan wrote:

Thanks for providing this feedback. Here are my thoughts:

But the general idea of a small cache is applicable; when the underlying range (only) models input range, the (specialized) iterator can hold a (4 byte) input buffer just as is done for the output code unit buffer. Unlike the output code unit buffer, there is no buffer index to maintain since base() would always return an iterator to the beginning of that buffer.

The closest thing I can think of to a precedent that justifies this approach is that the views API can have range adaptors perform optimizations that result in .base() not returning an iterator to the view that was passed in to the range adaptor. For example, passing an instance of std::ranges::reverse_view to std::views::reverse yields the base of the std::ranges::reverse_view instead of a reverse_view of a reverse_view; so, when you invoke .base(), you don't get a std::reverse_iterator. P2728 takes advantage of this to enable double-transcode optimizations.

However, I don't think we have precedent for a view type giving out an iterator from .base() whose type is unrelated to the iterator type of the underlying view. That approach seems like it violates the expectations users might have of the way that .base() works.

I agree; I didn't intend to suggest that base() should return a different iterator type.

In P2728R13 I just removed .base() for non-forward input ranges.

The text_view iterators also expose a base_range() member that returns a range of the underlying code unit sequence corresponding to base() + code-unit-sequence-length (which I think is equivalent to to_increment_ in P2728). Is there a reason not to expose such a member? As is, it appears that obtaining that range would require constructing a subrange using base() from one iterator and base() from another iterator that has advanced to the next character. Such a subrange would not be valid in the case of specialized input iterators that use an input buffer cache as I suggested above (the two iterators would not point into the same range).

In a previous revision of the paper (P2728R7), rather than having _or_error views that give out std::expected as the value_type, I tried to address error handling with a .success() member function on the iterator that gave out std::expected<void, utf_transcoding_error>. I was advised by the chair at that time that adding member functions other than .base() was objectionable to SG9, because users now have built an expectation that they can implement classes that wrap views by providing a limited set of member functions, which includes .base() but which does not include any novel designs. Unfortunately, I can't point to the minutes, since I was given this advice during an "unofficial" session during Wrocław. I would worry about experiencing similar resistance to the idea of adding a .base_range() member function.

I don't understand the objection. I don't see how adding iterator-specific member functions removes the ability to wrap views; such wrappers simply wouldn't expose those members, which seems fine to me.

However, I currently haven't seen any use cases that .base_range() would enable that can't be implemented using .base(), other than input ranges, of course. In the previous telecon, I presented examples of sophisticated use cases for .base(), which are now added to P2728R13 in cleaner form. These are the "Transcoding into a buffer of a fixed number of code units without truncating code points" and "Performing code unit substitutions on cuneiform strings" examples.

I don't see why those use cases aren't applicable to input ranges. I may be mistaken, but I expect programmers to encounter input ranges more frequently going forward because they may be produced by range adapters.

I now see the later correction regarding those use cases.

A classic example of where access to the original code units is useful is when substitutions occur. When U+FFFD is produced, access to the original code unit sequence provides the ability to analyze, log, or ameliorate the effects of the (presumably) incorrect code unit sequence.

Tom.

Furthermore, another problem with giving out views into an internal buffer in the transcoding iterator is that users will inevitably try to do the following:

1. Store a view or iterator pointing into the cache buffer

2. Increment the transcoding iterator, invalidating the aforementioned cache buffer view/iterator

3. Obtain a new view/iterator to the cache buffer of the incremented transcoding iterator

4. Try to compare the new view to the first one

This is a footgun.

Wouldn't this be addressed by appropriate use of std::ranges::dangling and/or std::ranges::borrowed_subrange_t?

Ultimately, I just don't think supporting .base() on input views is feasible right now. The current paper design doesn't provide it for input views, which still leaves the door open to adding it in the future if we change our minds.

Adding base() in the future is possible. Adding base_code_units() as I suggested would be an ABI-breaking change, though, since it requires an additional cache in the iterator.

Tom.

Thanks,

Eddie

On Tue, May 26, 2026 at 3:48 PM Tom Honermann <tom@honermann.net> wrote:

It sounds like we'll be continuing discussion of P2728R12 tomorrow. I would like to discuss the approach suggested below as a resolution for the concerns raised last time regarding the behavior of base() for input ranges. Please share any thoughts ahead of the meeting if possible.

Tom.

On 5/16/26 5:35 PM, Tom Honermann via SG16 wrote:

Thank you for the presentation on Wednesday, Eddie. I was glad for us to finally get back to this paper! I have a few comments now that I've read through the latest revision.

We briefly discussed what the behavior for base() should be for transcoding iterators that work with an underlying range that only models std::ranges::input_range. The proposed wording has this note in 24.7.?.6 ([range.transcoding.iterator]).

[ Note: to_utf_view::iterator maintains invariants on base() which differ depending on whether it’s an input iterator. In both cases, if *this is at the end of the range being adapted, then base() == end(). But if it’s not at the end of the adapted range, and it’s an input iterator, then the position of base() is always at the end of the input subsequence corresponding to the current code point. On the other hand, for forward and bidirectional iterators, the position of base() is always at the beginning of the input subsequence corresponding to the current code point. — end note ]

When I was working on text_view many years ago, I addressed this concern for encoding and decoding iterator types with an underlying input iterator through specialization; partial specializations of those types substituted a caching iterator for the original underlying input iterator. The exact way that I went about this would not be appropriate for the P2728 design (the cache consists of a cooperatively managed look ahead buffer that is incrementally retired as iterators are advanced; we don't want that here). But the general idea of a small cache is applicable; when the underlying range (only) models input range, the (specialized) iterator can hold a (4 byte) input buffer just as is done for the output code unit buffer. Unlike the output code unit buffer, there is no buffer index to maintain since base() would always return an iterator to the beginning of that buffer. For consistency with forward (and better) iterators, it would be useful for the iterator returned by base() to be comparable to the underlying (input) iterator for the purposes of comparison against end(); but see an alternative approach below.

The text_view iterators also expose a base_range() member that returns a range of the underlying code unit sequence corresponding to base() + code-unit-sequence-length (which I think is equivalent to to_increment_ in P2728). Is there a reason not to expose such a member? As is, it appears that obtaining that range would require constructing a subrange using base() from one iterator and base() from another iterator that has advanced to the next character. Such a subrange would not be valid in the case of specialized input iterators that use an input buffer cache as I suggested above (the two iterators would not point into the same range).

I think it would be useful to differentiate access to the (complete) underlying range vs access to the input code unit sequence for the current character. Obviously, access to the complete underlying range isn't possible for input iterators, but access to the current input code unit sequence is (with the caching approach described above). The iterators could expose this interface:

// Forward+ iterators only; returns an iterator into the underlying range.
constexpr const iterator_t<Base>& base() const & noexcept requires forward_range<Base> { ... }
constexpr iterator_t<Base> base() && requires forward_range<Base> { ... }

// Input+ iterators; returns a subrange containing the input code units for the current character.
// References the input code unit sequence cache for input iterators.
// References the underlying range otherwise.
constexpr subrange<...> base_code_units() const noexcept { ... }

Unlike base(), base_code_units() would not necessarily contain iterators for the underlying range (e.g., in the case of a caching input iterator). Note that base() could be used to modify the underlying range (likely ill-advised) while the subrange returned by base_code_units() could restrict such writes thereby ensuring consistent behavior for input and forward+ iterators.

Tom.