ISOCPP sg16 List: Re: [isocpp-sg16] Comments on P2728R12 Unicode in the Library, Part 1: UTF Transcoding

From: Eddie Nolan <eddiejnolan_at_[hidden]>
Date: Wed, 27 May 2026 02:47:03 -0400

Thanks for providing this feedback. Here are my thoughts:

> But the general idea of a small cache is applicable; when the underlying
> range (only) models input range, the (specialized) iterator can hold a (4
> byte) input buffer just as is done for the output code unit buffer. Unlike
> the output code unit buffer, there is no buffer index to maintain since
> base() would always return an iterator to the beginning of that buffer.

The closest thing I can think of to a precedent that justifies this
approach is that the views API can have range adaptors perform
optimizations that result in .base() not returning an iterator to the view
that was passed in to the range adaptor. For example, passing an instance
of std::ranges::reverse_view to std::views::reverse yields the base of the
std::ranges::reverse_view instead of a reverse_view of a reverse_view; so,
when you invoke .base(), you don't get a std::reverse_iterator. P2728 takes
advantage of this to enable double-transcode optimizations.

However, I don't think we have precedent for a view type giving out an
iterator from .base() whose type is unrelated to the iterator type of the
underlying view. That approach seems like it violates the expectations
users might have of the way that .base() works.

In P2728R13 <https://isocpp.org/files/papers/P2728R13.html> I just removed
.base() for non-forward input ranges.

> The text_view iterators also expose a base_range() member that returns a
> range of the underlying code unit sequence corresponding to base() +
> code-unit-sequence-length (which I think is equivalent to to_increment_ in
> P2728). Is there a reason not to expose such a member? As is, it appears
> that obtaining that range would require constructing a subrange using
> base() from one iterator and base() from another iterator that has advanced
> to the next character. Such a subrange would not be valid in the case of
> specialized input iterators that use an input buffer cache as I suggested
> above (the two iterators would not point in to the same range).

In a previous revision of the paper (P2728R7), rather than having _or_error
views that give out std::expected as the value_type, I tried to address
error handling with a .success() member function on the iterator that gave
out std::expected<void, utf_transcoding_error>. I was advised by the chair
at that time that adding member functions other than .base() was
objectionable to SG9, because users now have built an expectation that they
can implement classes that wrap views by providing a limited set of member
functions, which includes .base() but which does not include any novel
designs. Unfortunately, I can't point to the minutes, since I was given
this advice during an "unofficial" session at the Wrocław meeting. I would worry
about experiencing similar resistance to the idea of adding a .base_range()
member function.

However, I currently haven't seen any use cases that .base_range() would
enable that can't be implemented using .base(), other than input ranges, of
course. In the previous telecon, I presented examples of sophisticated use
cases for .base(), which are now added to P2728R13 in cleaner form. These
are the "Transcoding into a buffer of a fixed number of code units without
truncating code points" and "Performing code unit substitutions on
cuneiform strings" examples.

Furthermore, another problem with giving out views into an internal buffer
in the transcoding iterator is that users will inevitably try to do the
following:

   - Store a view or iterator pointing into the cache buffer
   - Increment the transcoding iterator, invalidating the aforementioned
   cache buffer view/iterator
   - Obtain a new view/iterator to the cache buffer of the incremented
   transcoding iterator
   - Try to compare the new view to the first one

This is a footgun.

Ultimately, I just don't think supporting .base() on input views is
feasible right now. The current paper design doesn't provide it for input
views, which still leaves the door open to adding it in the future if we
change our minds.

Thanks,

Eddie

On Tue, May 26, 2026 at 3:48 PM Tom Honermann <tom_at_[hidden]> wrote:

> It sounds like we'll be continuing discussion of P2728R12 tomorrow. I
> would like to discuss the approach suggested below as a resolution for the
> concerns raised last time regarding the behavior of base() for input
> ranges. Please share any thoughts ahead of the meeting if possible.
>
> Tom.
> On 5/16/26 5:35 PM, Tom Honermann via SG16 wrote:
>
> Thank you for the presentation on Wednesday, Eddie. I was glad for us to
> finally get back to this paper! I have a few comments now that I've read
> through the latest revision.
>
> We briefly discussed what the behavior for base() should be for
> transcoding iterators that work with an underlying range that only models
> std::ranges::input_range. The proposed wording has this note in 24.7.?.6
> ([range.transcoding.iterator]).
>
> [ *Note:* to_utf_view::iterator maintains invariants on base() which
> differ depending on whether it’s an input iterator. In both cases, if
> *this is at the end of the range being adapted, then base() == end(). But
> if it’s not at the end of the adapted range, and it’s an input iterator,
> then the position of base() is always at the end of the input subsequence
> corresponding to the current code point. On the other hand, for forward and
> bidirectional iterators, the position of base() is always at the
> beginning of the input subsequence corresponding to the current code point.
> — *end note* ]
>
> When I was working on text_view
> <https://github.com/tahonermann/text_view/> many years ago, I addressed
> this concern for encoding and decoding iterator types with an underlying
> input iterator through specialization; partial specializations of those
> types substituted a caching iterator
> <https://github.com/tahonermann/text_view/blob/master/include/text_view_detail/caching_iterator.hpp>
> for the original underlying input iterator. The exact way that I went about
> this would not be appropriate for the P2728 design (the cache consists of a
> cooperatively managed look ahead buffer that is incrementally retired as
> iterators are advanced; we don't want that here). But the general idea of a
> small cache is applicable; when the underlying range (only) models input
> range, the (specialized) iterator can hold a (4 byte) input buffer just as
> is done for the output code unit buffer. Unlike the output code unit
> buffer, there is no buffer index to maintain since base() would always
> return an iterator to the beginning of that buffer. For consistency with
> forward (and better) iterators, it would be useful for the iterator
> returned by base() to be comparable to the underlying (input) iterator
> for the purposes of comparison against end(); but see an alternative
> approach below.
>
> The text_view iterators also expose a base_range() member that returns a
> range of the underlying code unit sequence corresponding to base() +
> *code-unit-sequence-length* (which I think is equivalent to to_increment_
> in P2728). Is there a reason not to expose such a member? As is, it appears
> that obtaining that range would require constructing a subrange using
> base() from one iterator and base() from another iterator that has
> advanced to the next character. Such a subrange would not be valid in the
> case of specialized input iterators that use an input buffer cache as I
> suggested above (the two iterators would not point in to the same range).
>
> I think it would be useful to differentiate access to the (complete)
> underlying range vs access to the input code unit sequence for the current
> character. Obviously, access to the complete underlying range isn't
> possible for input iterators, but access to the current input code unit
> sequence is (with the caching approach described above). The iterators
> could expose this interface:
>
> // Forward+ iterators only; returns an iterator into the underlying range.
> constexpr const iterator_t<Base>& base() const & noexcept
>   requires forward_range<Base> { ... }
> constexpr iterator_t<Base> base() &&
>   requires forward_range<Base> { ... }
>
> // Input+ iterators; returns a subrange containing the input code units
> // for the current character.
> // References the input code unit sequence cache for input iterators.
> // References the underlying range otherwise.
> constexpr subrange<...> base_code_units() const noexcept { ... }
>
> Unlike base(), base_code_units() would not necessarily contain iterators
> for the underlying range (e.g., in the case of a caching input iterator).
> Note that base() could be used to modify the underlying range (likely
> ill-advised) while the subrange returned by base_code_units() could
> restrict such writes thereby ensuring consistent behavior for input and
> forward+ iterators.
>
> Tom.
>
>

Received on 2026-05-27 06:47:18