On 5/27/26 3:25 PM, Eddie Nolan wrote:

I agree; I didn't intend to suggest that base() should return a different iterator type.

I'm confused. If the transcoding iterator is wrapping an arbitrary input iterator, how is .base() supposed to return an instance of that type that points into the iterator's internal cache buffer? For example, if it's wrapping a std::istreambuf_iterator.

It isn't. The suggestion is for base() to do what it already does (and not providing it for input ranges is fine). Additionally, provide a base_code_units() member that returns a subrange for (only) the code units decoded when this iterator was last created/advanced. For a forward-or-better iterator, this would return a subrange that uses the same iterator as is returned from base(). For an input iterator, this would return a subrange referencing the internal cache in the iterator. The lifetime of the subrange is therefore tied to the iterator's current state.

Wouldn't this be addressed by appropriate use of ranges::dangling and/or std::ranges::borrowed_subrange_t?

No, to my understanding, the ranges::dangling method isn't intended to support that use case; it's a very narrowly scoped facility intended to prevent returning an iterator from a function that takes an owning view by value.

Hmm, ok. I'm no expert in this area. Maybe other range experts or SG9 participants have ideas.

I expect programmers to encounter input ranges more frequently going forward because they may be produced by range adapters.

That's true, but it's also the case that input iterators and ranges tend to result in annoying special cases in many places throughout the standard views API. We support them as best we can, but I don't think they necessarily justify bending over backwards to support, especially when the proposed mechanisms for doing so are problematic.

It's not ideal, but if you have an input range and you really need forward-range-only features, frequently it's feasible to address that by copying the input range into a buffer first.
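As a rough sketch of that workaround, the single-pass source can be drained into an owning buffer, after which the buffer is a contiguous range that supports saved positions and multiple passes. The helper name `buffer_code_units` is hypothetical, and a `std::istream` is assumed as the input source purely for illustration.

```cpp
#include <cassert>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>

// Hypothetical helper: drains a single-pass stream into an owning buffer.
// This is the extra copy under discussion; afterwards the std::string can
// be traversed as often as needed.
std::string buffer_code_units(std::istream& in) {
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

The cost is one allocation and one copy of the whole input, which is exactly the overhead that a cache inside the iterator would avoid.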

In particular, many of the non-view-based transcoding APIs inherently require creating a buffer for the input, so users whose existing use case is transcoding an input range will likely be copying it already.

I don't disagree; this is a subjective area where there isn't a right technical answer. I would like to avoid imposing copies on programmers where they otherwise aren't needed.

A classic example of where access to the original code units is useful is when substitutions occur. When U+FFFD is produced, access to the original code unit sequence provides the ability to analyze, log, or ameliorate the effects of the (presumably) incorrect code unit sequence.

Except for the case of input iterators, this can be implemented using .base() without needing .base_range(). Here is a modified version of my transcode_or_throw example from the paper which has been updated to add the code units that produced the error to the exception message:

Right, but my concern is support for input ranges.

Tom.

template <typename FromChar, typename ToChar>
std::basic_string<ToChar> transcode_or_throw(std::basic_string_view<FromChar> input) {
  std::basic_string<ToChar> result;
  auto view = input | std::views::to_utf_or_error<ToChar>;
  for (auto it = view.begin(), end = view.end(); it != end; ++it) {
    if ((*it).has_value()) {
      result.push_back(**it);
    } else {
      throw std::runtime_error("error at position " +
                               std::to_string(it.base() - input.begin()) + ": " +
                               enum_to_string((*it).error()) + "; code units:" +
                               [&] {
                                  std::string s;
                                  for (auto p = it.base(); p != std::next(it).base(); ++p)
                                      s += std::format(" 0x{:02X}", static_cast<unsigned>(*p));
                                  return s;
                               }());
    }
  }
  return result;
}

- Eddie



On Wed, May 27, 2026 at 2:34 PM Tom Honermann <tom@honermann.net> wrote:
On 5/27/26 2:27 PM, Tom Honermann via SG16 wrote:
On 5/27/26 2:47 AM, Eddie Nolan wrote:

Thanks for providing this feedback. Here are my thoughts:

But the general idea of a small cache is applicable; when the underlying range (only) models input range, the (specialized) iterator can hold a (4 byte) input buffer just as is done for the output code unit buffer. Unlike the output code unit buffer, there is no buffer index to maintain since base() would always return an iterator to the beginning of that buffer.

The closest thing I can think of to a precedent that justifies this approach is that the views API can have range adaptors perform optimizations that result in .base() not returning an iterator to the view that was passed in to the range adaptor. For example, passing an instance of std::ranges::reverse_view to std::views::reverse yields the base of the std::ranges::reverse_view instead of a reverse_view of a reverse_view; so, when you invoke .base(), you don't get a std::reverse_iterator. P2728 takes advantage of this to enable double-transcode optimizations.

However, I don't think we have precedent for a view type giving out an iterator from .base() whose type is unrelated to the iterator type of the underlying view. That approach seems like it violates the expectations users might have of the way that .base() works.

I agree; I didn't intend to suggest that base() should return a different iterator type.

In P2728R13 I just removed .base() for non-forward input ranges.

The text_view iterators also expose a base_range() member that returns a range of the underlying code unit sequence corresponding to base() + code-unit-sequence-length (which I think is equivalent to to_increment_ in P2728). Is there a reason not to expose such a member? As is, it appears that obtaining that range would require constructing a subrange using base() from one iterator and base() from another iterator that has advanced to the next character. Such a subrange would not be valid in the case of specialized input iterators that use an input buffer cache as I suggested above (the two iterators would not point into the same range).

In a previous revision of the paper (P2728R7), rather than having _or_error views that give out std::expected as the value_type, I tried to address error handling with a .success() member function on the iterator that gave out std::expected<void, utf_transcoding_error>. I was advised by the chair at that time that adding member functions other than .base() was objectionable to SG9, because users have now built an expectation that they can implement classes that wrap views by providing a limited set of member functions, which includes .base() but which does not include any novel designs. Unfortunately, I can't point to the minutes, since I was given this advice during an "unofficial" session during Wrocław. I would worry about experiencing similar resistance to the idea of adding a .base_range() member function.

I don't understand the objection. I don't see how adding iterator-specific member functions removes the ability to wrap views; such wrappers simply wouldn't expose those members which seems fine to me.

However, I currently haven't seen any use cases that .base_range() would enable that can't be implemented using .base(), other than input ranges, of course. In the previous telecon, I presented examples of sophisticated use cases for .base(), which are now added to P2728R13 in cleaner form. These are the "Transcoding into a buffer of a fixed number of code units without truncating code points" and "Performing code unit substitutions on cuneiform strings" examples.

I don't see why those use cases aren't applicable to input ranges. I may be mistaken, but I expect programmers to encounter input ranges more frequently going forward because they may be produced by range adapters.

I now see the later correction regarding those use cases.

A classic example of where access to the original code units is useful is when substitutions occur. When U+FFFD is produced, access to the original code unit sequence provides the ability to analyze, log, or ameliorate the effects of the (presumably) incorrect code unit sequence.

Tom.

Furthermore, another problem with giving out views into an internal buffer in the transcoding iterator is that users will inevitably try to do the following:

  • Store a view or iterator pointing into the cache buffer
  • Increment the transcoding iterator, invalidating the aforementioned cache buffer view/iterator
  • Obtain a new view/iterator to the cache buffer of the incremented transcoding iterator
  • Try to compare the new view to the first one

This is a footgun.

Wouldn't this be addressed by appropriate use of ranges::dangling and/or std::ranges::borrowed_subrange_t?

Ultimately, I just don't think supporting .base() on input views is feasible right now. The current paper design doesn't provide it for input views, which still leaves the door open to adding it in the future if we change our minds about that fact.

Adding base() in the future is possible. Adding base_code_units() as I suggested would be an ABI breaking change though since it requires an additional cache in the iterator.

Tom.

Thanks,

Eddie



On Tue, May 26, 2026 at 3:48 PM Tom Honermann <tom@honermann.net> wrote:

It sounds like we'll be continuing discussion of P2728R12 tomorrow. I would like to discuss the approach suggested below as a resolution for the concerns raised last time regarding the behavior of base() for input ranges. Please share any thoughts ahead of the meeting if possible.

Tom.

On 5/16/26 5:35 PM, Tom Honermann via SG16 wrote:

Thank you for the presentation on Wednesday, Eddie. I was glad for us to finally get back to this paper! I have a few comments now that I've read through the latest revision.

We briefly discussed what the behavior for base() should be for transcoding iterators that work with an underlying range that only models std::ranges::input_range. The proposed wording has this note in 24.7.?.6 ([range.transcoding.iterator]).

[ Note: to_utf_view::iterator maintains invariants on base() which differ depending on whether it’s an input iterator. In both cases, if *this is at the end of the range being adapted, then base() == end(). But if it’s not at the end of the adapted range, and it’s an input iterator, then the position of base() is always at the end of the input subsequence corresponding to the current code point. On the other hand, for forward and bidirectional iterators, the position of base() is always at the beginning of the input subsequence corresponding to the current code point. — end note ]

When I was working on text_view many years ago, I addressed this concern for encoding and decoding iterator types with an underlying input iterator through specialization; partial specializations of those types substituted a caching iterator for the original underlying input iterator. The exact way that I went about this would not be appropriate for the P2728 design (the cache consists of a cooperatively managed look ahead buffer that is incrementally retired as iterators are advanced; we don't want that here). But the general idea of a small cache is applicable; when the underlying range (only) models input range, the (specialized) iterator can hold a (4 byte) input buffer just as is done for the output code unit buffer. Unlike the output code unit buffer, there is no buffer index to maintain since base() would always return an iterator to the beginning of that buffer. For consistency with forward (and better) iterators, it would be useful for the iterator returned by base() to be comparable to the underlying (input) iterator for the purposes of comparison against end(); but see an alternative approach below.

The text_view iterators also expose a base_range() member that returns a range of the underlying code unit sequence corresponding to base() + code-unit-sequence-length (which I think is equivalent to to_increment_ in P2728). Is there a reason not to expose such a member? As is, it appears that obtaining that range would require constructing a subrange using base() from one iterator and base() from another iterator that has advanced to the next character. Such a subrange would not be valid in the case of specialized input iterators that use an input buffer cache as I suggested above (the two iterators would not point into the same range).

I think it would be useful to differentiate access to the (complete) underlying range vs access to the input code unit sequence for the current character. Obviously, access to the complete underlying range isn't possible for input iterators, but access to the current input code unit sequence is (with the caching approach described above). The iterators could expose this interface:

// Forward+ iterators only; returns an iterator into the underlying range.
constexpr const iterator_t<Base>& base() const & noexcept requires forward_range<Base> { ... }
constexpr iterator_t<Base> base() && requires forward_range<Base> { ... }

// Input+ iterators; returns a subrange containing the input code units for the current character.
// References the input code unit sequence cache for input iterators.
// References the underlying range otherwise.
constexpr subrange<...> base_code_units() const noexcept { ... }

Unlike base(), base_code_units() would not necessarily contain iterators for the underlying range (e.g., in the case of a caching input iterator). Note that base() could be used to modify the underlying range (likely ill-advised) while the subrange returned by base_code_units() could restrict such writes thereby ensuring consistent behavior for input and forward+ iterators.

Tom.