ISOCPP sg16 List: Re: [isocpp-sg16] Comments on P2728R12 Unicode in the Library, Part 1: UTF Transcoding

From: Eddie Nolan <eddiejnolan_at_[hidden]>
Date: Wed, 27 May 2026 03:32:29 -0400

Correction: "Transcoding into a buffer of a fixed number of code units without truncating code points" does not use .base(). The relevant examples are “Transcoding Strings and Throwing a Descriptive Exception on Invalid UTF” and “Performing code unit substitutions on cuneiform strings.”

On May 27, 2026, at 2:47 AM, Eddie Nolan <eddiejnolan_at_[hidden]> wrote:

Thanks for providing this feedback. Here are my thoughts:

But the general idea of a small cache is applicable; when the underlying range (only) models input range, the (specialized) iterator can hold a (4 byte) input buffer just as is done for the output code unit buffer. Unlike the output code unit buffer, there is no buffer index to maintain since base() would always return an iterator to the beginning of that buffer.

The closest thing I can think of to a precedent that justifies this approach is that the views API can have range adaptors perform optimizations that result in .base() not returning an iterator to the view that was passed in to the range adaptor. For example, passing an instance of std::ranges::reverse_view to std::views::reverse yields the base of the std::ranges::reverse_view instead of a reverse_view of a reverse_view; so, when you invoke .base(), you don't get a std::reverse_iterator. P2728 takes advantage of this to enable double-transcode optimizations.

However, I don't think we have precedent for a view type giving out an iterator from .base() whose type is unrelated to the iterator type of the underlying view. That approach seems like it violates the expectations users might have of the way that .base() works.

In P2728R13 I just removed .base() for non-forward input ranges.

The text_view iterators also expose a base_range() member that returns a range of the underlying code unit sequence corresponding to base() + code-unit-sequence-length (which I think is equivalent to to_increment_ in P2728). Is there a reason not to expose such a member? As is, it appears that obtaining that range would require constructing a subrange using base() from one iterator and base() from another iterator that has advanced to the next character. Such a subrange would not be valid in the case of specialized input iterators that use an input buffer cache as I suggested above (the two iterators would not point in to the same range).

In a previous revision of the paper (P2728R7), rather than having _or_error views that give out std::expected as the value_type, I tried to address error handling with a .success() member function on the iterator that gave out std::expected<void, utf_transcoding_error>. I was advised by the chair at that time that adding member functions other than .base() was objectionable to SG9, because users have now built an expectation that they can implement classes that wrap views by providing a limited set of member functions, which includes .base() but which does not include any novel designs. Unfortunately, I can't point to the minutes, since I was given this advice during an "unofficial" session during Wrocław. I would worry about experiencing similar resistance to the idea of adding a .base_range() member function.

However, I currently haven't seen any use cases that .base_range() would enable that can't be implemented using .base(), other than input ranges, of course. In the previous telecon, I presented examples of sophisticated use cases for .base(), which are now added to P2728R13 in cleaner form. These are the "Transcoding into a buffer of a fixed number of code units without truncating code points" and "Performing code unit substitutions on cuneiform strings" examples.

Furthermore, another problem with giving out views into an internal buffer in the transcoding iterator is that users will inevitably try to do the following:

1. Store a view or iterator pointing into the cache buffer
2. Increment the transcoding iterator, invalidating the aforementioned cache buffer view/iterator
3. Obtain a new view/iterator to the cache buffer of the incremented transcoding iterator
4. Try to compare the new view to the first one

This is a footgun.

Ultimately, I just don't think supporting .base() on input views is feasible right now. The current paper design doesn't provide it for input views, which still leaves the door open to adding it in the future if we change our minds about that fact.

Thanks,

Eddie

On Tue, May 26, 2026 at 3:48 PM Tom Honermann <tom_at_[hidden]> wrote:

It sounds like we'll be continuing discussion of P2728R12 tomorrow. I would like to discuss the approach suggested below as a resolution for the concerns raised last time regarding the behavior of base() for input ranges. Please share any thoughts ahead of the meeting if possible.

Tom.

On 5/16/26 5:35 PM, Tom Honermann via SG16 wrote:

Thank you for the presentation on Wednesday, Eddie. I was glad for us to finally get back to this paper! I have a few comments now that I've read through the latest revision.

We briefly discussed what the behavior for base() should be for transcoding iterators that work with an underlying range that only models std::ranges::input_range. The proposed wording has this note in 24.7.?.6 ([range.transcoding.iterator]).

[ Note: to_utf_view::iterator maintains invariants on base() which differ depending on whether it’s an input iterator. In both cases, if *this is at the end of the range being adapted, then base() == end(). But if it’s not at the end of the adapted range, and it’s an input iterator, then the position of base() is always at the end of the input subsequence corresponding to the current code point. On the other hand, for forward and bidirectional iterators, the position of base() is always at the beginning of the input subsequence corresponding to the current code point. — end note ]

When I was working on text_view many years ago, I addressed this concern for encoding and decoding iterator types with an underlying input iterator through specialization; partial specializations of those types substituted a caching iterator for the original underlying input iterator. The exact way that I went about this would not be appropriate for the P2728 design (the cache consists of a cooperatively managed look ahead buffer that is incrementally retired as iterators are advanced; we don't want that here). But the general idea of a small cache is applicable; when the underlying range (only) models input range, the (specialized) iterator can hold a (4 byte) input buffer just as is done for the output code unit buffer. Unlike the output code unit buffer, there is no buffer index to maintain since base() would always return an iterator to the beginning of that buffer. For consistency with forward (and better) iterators, it would be useful for the iterator returned by base() to be comparable to the underlying (input) iterator for the purposes of comparison against end(); but see an alternative approach below.

The text_view iterators also expose a base_range() member that returns a range of the underlying code unit sequence corresponding to base() + code-unit-sequence-length (which I think is equivalent to to_increment_ in P2728). Is there a reason not to expose such a member? As is, it appears that obtaining that range would require constructing a subrange using base() from one iterator and base() from another iterator that has advanced to the next character. Such a subrange would not be valid in the case of specialized input iterators that use an input buffer cache as I suggested above (the two iterators would not point in to the same range).

I think it would be useful to differentiate access to the (complete) underlying range vs access to the input code unit sequence for the current character. Obviously, access to the complete underlying range isn't possible for input iterators, but access to the current input code unit sequence is (with the caching approach described above). The iterators could expose this interface:

// Forward+ iterators only; returns an iterator into the underlying range.
constexpr const iterator_t<Base>& base() const & noexcept requires forward_range<Base> { ... }
constexpr iterator_t<Base> base() && requires forward_range<Base> { ... }

// Input+ iterators; returns a subrange containing the input code units for the current character.
// References the input code unit sequence cache for input iterators.
// References the underlying range otherwise.
constexpr subrange<...> base_code_units() const noexcept { ... }

Unlike base(), base_code_units() would not necessarily contain iterators for the underlying range (e.g., in the case of a caching input iterator). Note that base() could be used to modify the underlying range (likely ill-advised) while the subrange returned by base_code_units() could restrict such writes thereby ensuring consistent behavior for input and forward+ iterators.

Tom.

Received on 2026-05-27 07:32:45