ISOCPP sg16 List: Re: [isocpp-sg16] Comments on P2728R12 Unicode in the Library, Part 1: UTF Transcoding

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 27 May 2026 15:45:14 -0400

On 5/27/26 3:25 PM, Eddie Nolan wrote:
>
> I agree; I didn't intend to suggest that base() should return a
> different iterator type.
>
> I'm confused. If the transcoding iterator is wrapping an arbitrary
> input iterator, how is |.base()| supposed to return an instance of
> that type that points into the iterator's internal cache buffer? For
> example, if it's wrapping a |std::istreambuf_iterator|.
>
It isn't. The suggestion is for base() to do what it already does (and
not providing it for input ranges is fine). Additionally, provide a
base_code_units() member that returns a subrange for (only) the code
units decoded when this iterator was last created/advanced. For a
forward-or-better iterator, this would return a subrange that uses the
same iterator as is returned from base(). For an input iterator, this
would return a subrange referencing the internal cache in the iterator.
The lifetime of the subrange is therefore tied to the iterator's current
state.
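
To make the shape of that suggestion concrete, here is a minimal sketch, not the P2728 design. The names caching_utf8_decoder, at_end, and bad_sequences are hypothetical, std::string_view stands in for the subrange, and validation is reduced to lead-byte and continuation-byte checks (no overlong or surrogate detection).

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical, simplified decoder over a single-pass source. It remembers
// the code units consumed for the current code point in a 4-byte cache.
template <typename InputIt, typename Sentinel>
class caching_utf8_decoder {
public:
    caching_utf8_decoder(InputIt first, Sentinel last)
        : cur_(first), last_(last) { read_one(); }

    bool at_end() const { return len_ == 0; }
    char32_t operator*() const { assert(!at_end()); return value_; }
    caching_utf8_decoder& operator++() { read_one(); return *this; }

    // Stand-in for the suggested base_code_units(): a view of the cache.
    // The view is only meaningful until the decoder is advanced or destroyed.
    std::string_view base_code_units() const { return {cache_, len_}; }

private:
    void read_one() {
        len_ = 0;
        value_ = 0;
        if (cur_ == last_) return;                  // input exhausted
        const auto lead = static_cast<unsigned char>(*cur_);
        ++cur_;
        cache_[len_++] = static_cast<char>(lead);
        int extra = 0;
        if (lead < 0x80)            { value_ = lead; }
        else if (lead >> 5 == 0x06) { extra = 1; value_ = lead & 0x1F; }
        else if (lead >> 4 == 0x0E) { extra = 2; value_ = lead & 0x0F; }
        else if (lead >> 3 == 0x1E) { extra = 3; value_ = lead & 0x07; }
        else { value_ = 0xFFFD; return; }           // invalid lead byte
        for (int i = 0; i < extra; ++i) {
            if (cur_ == last_) { value_ = 0xFFFD; return; }
            const auto cu = static_cast<unsigned char>(*cur_);
            // A non-continuation unit is left in place for the next read.
            if ((cu & 0xC0) != 0x80) { value_ = 0xFFFD; return; }
            ++cur_;
            cache_[len_++] = static_cast<char>(cu);
            value_ = (value_ << 6) | (cu & 0x3F);
        }
    }

    InputIt cur_;
    Sentinel last_;
    char cache_[4] = {};
    std::size_t len_ = 0;
    char32_t value_ = 0;
};

// Hypothetical usage: collect the code units behind each U+FFFD result.
// A literal U+FFFD in the input (EF BF BD) would be collected as well.
std::vector<std::string> bad_sequences(std::istream& in) {
    using It = std::istreambuf_iterator<char>;
    std::vector<std::string> out;
    for (caching_utf8_decoder<It, It> d{It(in), It()}; !d.at_end(); ++d) {
        if (*d == char32_t{0xFFFD}) out.emplace_back(d.base_code_units());
    }
    return out;
}
```

bad_sequences copies each view into a std::string before advancing, because the view refers to storage inside the decoder and is overwritten by the next increment.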
>
> Wouldn't this be addressed by appropriate use of ranges::dangling
> and/or std::ranges::borrowed_subrange_t?
>
> No, to my understanding, the |ranges::dangling| method isn't intended
> to support that use case; it's a very narrowly scoped facility
> intended to prevent returning an iterator from a function that takes
> an owning view by value.
>
Hmm, ok. I'm no expert in this area. Maybe other range experts or SG9
participants have ideas.
>
> I expect programmers to encounter input ranges more frequently
> going forward because they may be produced by range adapters.
>
> That's true, but it's also the case that input iterators and ranges
> tend to result in annoying special cases in many places throughout the
> standard views API. We support them as best we can, but I don't think
> they necessarily justify bending over backwards to support, especially
> when the proposed mechanisms for doing so are problematic.
>
> It's not ideal, but if you have an input range and you really need
> forward-range-only features, frequently it's feasible to address that
> by copying the input range into a buffer first.
>
> In particular, many of the non-view-based transcoding APIs inherently
> require creating a buffer for the input, so users whose existing use
> case is transcoding an input range will likely be copying it already.
>
I don't disagree; this is a subjective area where there isn't a right
technical answer. I would like to avoid imposing copies on programmers
where they otherwise aren't needed.
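
For comparison, the buffering workaround mentioned in the quoted text might look like the following sketch. The helper names (buffer_input, count_code_points, two_pass_demo) are hypothetical, and the code point count performs no validation. The extra std::string is the copy in question.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>

// Hypothetical helper: drains a single-pass stream into an owned buffer.
std::string buffer_input(std::istream& in) {
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Hypothetical helper: counts UTF-8 code points by counting every code unit
// that is not a continuation byte (10xxxxxx). No validation is performed.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char cu : utf8) {
        if ((cu & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Two passes over the same data: possible on the buffer, not on the stream.
bool two_pass_demo() {
    std::istringstream in("h\xC3\xA9llo");
    const std::string buf = buffer_input(in);
    const std::size_t first = count_code_points(buf);
    const std::size_t second = count_code_points(buf);
    assert(first == second);
    // The stream itself has been consumed; a second drain yields nothing.
    return first == 5 && buf.size() == 6 && buffer_input(in).empty();
}
```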
>
> A classic example of where access to the original code units is
> useful is when substitutions occur. When U+FFFD is produced,
> access to the original code unit sequence provides the ability to
> analyze, log, or ameliorate the effects of the (presumably)
> incorrect code unit sequence.
>
> Except for the case of input iterators, this can be implemented using
> |.base()| without needing |.base_range()|. Here is a modified version
> of my |transcode_or_throw| example from the paper which has been
> updated to add the code units that produced the error to the exception
> message:
>
Right, but my concern is support for input ranges.

Tom.

>     template <typename FromChar, typename ToChar>
>     std::basic_string<ToChar>
>     transcode_or_throw(std::basic_string_view<FromChar> input) {
>       std::basic_string<ToChar> result;
>       auto view = input | std::views::to_utf_or_error<ToChar>;
>       for (auto it = view.begin(), end = view.end(); it != end; ++it) {
>         if ((*it).has_value()) {
>           result.push_back(**it);
>         } else {
>           throw std::runtime_error(
>               "error at position " +
>               std::to_string(it.base() - input.begin()) + ": " +
>               enum_to_string((*it).error()) + "; code units:" + [&] {
>                 std::string s;
>                 for (auto p = it.base(); p != std::next(it).base(); ++p)
>                   s += std::format(" 0x{:02X}", static_cast<unsigned>(*p));
>                 return s;
>               }());
>         }
>       }
>       return result;
>     }
>
> - Eddie
>
>
>
> On Wed, May 27, 2026 at 2:34 PM Tom Honermann <tom_at_[hidden]> wrote:
>
> On 5/27/26 2:27 PM, Tom Honermann via SG16 wrote:
>> On 5/27/26 2:47 AM, Eddie Nolan wrote:
>>>
>>> Thanks for providing this feedback. Here are my thoughts:
>>>
>>> But the general idea of a small cache is applicable; when
>>> the underlying range (only) models input range, the
>>> (specialized) iterator can hold a (4 byte) input buffer just
>>> as is done for the output code unit buffer. Unlike the
>>> output code unit buffer, there is no buffer index to
>>> maintain since base() would always return an iterator to the
>>> beginning of that buffer.
>>>
>>> The closest thing I can think of to a precedent that justifies
>>> this approach is that the views API can have range adaptors
>>> perform optimizations that result in |.base()| not returning an
>>> iterator to the view that was passed in to the range adaptor.
>>> For example, passing an instance of |std::ranges::reverse_view|
>>> to |std::views::reverse| yields the base of the
>>> |std::ranges::reverse_view| instead of a |reverse_view| of a
>>> |reverse_view|; so, when you invoke |.base()|, you don't get a
>>> |std::reverse_iterator|. P2728 takes advantage of this to enable
>>> double-transcode optimizations.
>>>
>>> However, I don't think we have precedent for a view type giving
>>> out an iterator from |.base()| whose type is unrelated to the
>>> iterator type of the underlying view. That approach seems like
>>> it violates the expectations users might have of the way that
>>> |.base()| works.
>>>
>> I agree; I didn't intend to suggest that base() should return a
>> different iterator type.
>>>
>>> In P2728R13 <https://isocpp.org/files/papers/P2728R13.html> I
>>> just removed |.base()| for non-forward input ranges.
>>>
>>> The text_view iterators also expose a base_range() member
>>> that returns a range of the underlying code unit sequence
>>> corresponding to base() + code-unit-sequence-length (which I
>>> think is equivalent to to_increment_ in P2728). Is there a
>>> reason not to expose such a member? As is, it appears that
>>> obtaining that range would require constructing a subrange
>>> using base() from one iterator and base() from another
>>> iterator that has advanced to the next character. Such a
>>> subrange would not be valid in the case of specialized input
>>> iterators that use an input buffer cache as I suggested
>>> above (the two iterators would not point into the same range).
>>>
>>> In a previous revision of the paper (P2728R7), rather than
>>> having |_or_error| views that give out |std::expected| as the
>>> |value_type|, I tried to address error handling with a
>>> |.success()| member function on the iterator that gave out
>>> |std::expected<void, utf_transcoding_error>|. I was advised by
>>> the chair at that time that adding member functions other than
>>> |.base()| was objectionable to SG9, because users now have built
>>> an expectation that they can implement classes that wrap views
>>> by providing a limited set of member functions, which includes
>>> |.base()| but which does not include any novel designs.
>>> Unfortunately, I can't point to the minutes, since I was given
>>> this advice during an "unofficial" session during Wrocław. I
>>> would worry about experiencing similar resistance to the idea of
>>> adding a |.base_range()| member function.
>>>
>> I don't understand the objection. I don't see how adding
>> iterator-specific member functions removes the ability to wrap
>> views; such wrappers simply wouldn't expose those members which
>> seems fine to me.
>>>
>>> However, I currently haven't seen any use cases that
>>> |.base_range()| would enable that can't be implemented using
>>> |.base()|, other than input ranges, of course. In the previous
>>> telecon, I presented examples of sophisticated use cases for
>>> |.base()|, which are now added to P2728R13 in cleaner form.
>>> These are the "Transcoding into a buffer of a fixed number of
>>> code units without truncating code points" and "Performing code
>>> unit substitutions on cuneiform strings" examples.
>>>
>> I don't see why those use cases aren't applicable to input
>> ranges. I may be mistaken, but I expect programmers to encounter
>> input ranges more frequently going forward because they may be
>> produced by range adapters.
>
> I now see the later correction regarding those use cases.
>
> A classic example of where access to the original code units is
> useful is when substitutions occur. When U+FFFD is produced,
> access to the original code unit sequence provides the ability to
> analyze, log, or ameliorate the effects of the (presumably)
> incorrect code unit sequence.
>
> Tom.
>
>>> Furthermore, another problem with giving out views into an
>>> internal buffer in the transcoding iterator is that users will
>>> inevitably try to do the following:
>>>
>>> * Store a view or iterator pointing into the cache buffer
>>> * Increment the transcoding iterator, invalidating the
>>> aforementioned cache buffer view/iterator
>>> * Obtain a new view/iterator to the cache buffer of the
>>> incremented transcoding iterator
>>> * Try to compare the new view to the first one
>>>
>>> This is a footgun.
>>>
>> Wouldn't this be addressed by appropriate use of ranges::dangling
>> and/or std::ranges::borrowed_subrange_t?
>>>
>>> Ultimately, I just don't think supporting |.base()| on input
>>> views is feasible right now. The current paper design doesn't
>>> provide it for input views, which still leaves the door open to
>>> adding it in the future if we change our minds about that fact.
>>>
>> Adding base() in the future is possible. Adding
>> base_code_units() as I suggested would be an ABI breaking change
>> though since it requires an additional cache in the iterator.
>>
>> Tom.
>>
>>> Thanks,
>>>
>>> Eddie
>>>
>>>
>>>
>>> On Tue, May 26, 2026 at 3:48 PM Tom Honermann
>>> <tom_at_[hidden]> wrote:
>>>
>>> It sounds like we'll be continuing discussion of P2728R12
>>> tomorrow. I would like to discuss the approach suggested
>>> below as a resolution for the concerns raised last time
>>> regarding the behavior of base() for input ranges. Please
>>> share any thoughts ahead of the meeting if possible.
>>>
>>> Tom.
>>>
>>> On 5/16/26 5:35 PM, Tom Honermann via SG16 wrote:
>>>>
>>>> Thank you for the presentation on Wednesday, Eddie. I was
>>>> glad for us to finally get back to this paper! I have a few
>>>> comments now that I've read through the latest revision.
>>>>
>>>> We briefly discussed what the behavior for base() should be
>>>> for transcoding iterators that work with an underlying
>>>> range that only models std::ranges::input_range. The
>>>> proposed wording has this note in 24.7.?.6
>>>> ([range.transcoding.iterator]).
>>>>
>>>> [ /Note:/ to_utf_view::iterator maintains invariants on
>>>> base() which differ depending on whether it’s an input
>>>> iterator. In both cases, if *this is at the end of the
>>>> range being adapted, then base() == end(). But if it’s
>>>> not at the end of the adapted range, and it’s an input
>>>> iterator, then the position of base() is always at the
>>>> end of the input subsequence corresponding to the
>>>> current code point. On the other hand, for forward and
>>>> bidirectional iterators, the position of base() is
>>>> always at the beginning of the input subsequence
>>>> corresponding to the current code point. — /end note/ ]
>>>>
>>>> When I was working on text_view
>>>> <https://github.com/tahonermann/text_view/> many years ago,
>>>> I addressed this concern for encoding and decoding iterator
>>>> types with an underlying input iterator through
>>>> specialization; partial specializations of those types
>>>> substituted a caching iterator
>>>> <https://github.com/tahonermann/text_view/blob/master/include/text_view_detail/caching_iterator.hpp>
>>>> for the original underlying input iterator. The exact way
>>>> that I went about this would not be appropriate for the
>>>> P2728 design (the cache consists of a cooperatively managed
>>>> look ahead buffer that is incrementally retired as
>>>> iterators are advanced; we don't want that here). But the
>>>> general idea of a small cache is applicable; when the
>>>> underlying range (only) models input range, the
>>>> (specialized) iterator can hold a (4 byte) input buffer
>>>> just as is done for the output code unit buffer. Unlike the
>>>> output code unit buffer, there is no buffer index to
>>>> maintain since base() would always return an iterator to
>>>> the beginning of that buffer. For consistency with forward
>>>> (and better) iterators, it would be useful for the iterator
>>>> returned by base() to be comparable to the underlying
>>>> (input) iterator for the purposes of comparison against
>>>> end(); but see an alternative approach below.
>>>>
>>>> The text_view iterators also expose a base_range() member
>>>> that returns a range of the underlying code unit sequence
>>>> corresponding to base() +
>>>> /code-unit-sequence-length/ (which I think is equivalent to
>>>> to_increment_ in P2728). Is there a reason not to expose
>>>> such a member? As is, it appears that obtaining that range
>>>> would require constructing a subrange using base() from one
>>>> iterator and base() from another iterator that has advanced
>>>> to the next character. Such a subrange would not be valid
>>>> in the case of specialized input iterators that use an
>>>> input buffer cache as I suggested above (the two iterators
>>>> would not point into the same range).
>>>>
>>>> I think it would be useful to differentiate access to the
>>>> (complete) underlying range vs access to the input code
>>>> unit sequence for the current character. Obviously, access
>>>> to the complete underlying range isn't possible for input
>>>> iterators, but access to the current input code unit
>>>> sequence is (with the caching approach described above).
>>>> The iterators could expose this interface:
>>>>
>>>>     // Forward+ iterators only; returns an iterator into the
>>>>     // underlying range.
>>>>     constexpr const iterator_t<Base>& base() const & noexcept
>>>>       requires forward_range<Base> { ... }
>>>>     constexpr iterator_t<Base> base() &&
>>>>       requires forward_range<Base> { ... }
>>>>
>>>>     // Input+ iterators; returns a subrange containing the input
>>>>     // code units for the current character.
>>>>     // References the input code unit sequence cache for input
>>>>     // iterators.
>>>>     // References the underlying range otherwise.
>>>>     constexpr subrange<...> base_code_units() const noexcept { ... }
>>>>
>>>> Unlike base(), base_code_units() would not necessarily
>>>> contain iterators for the underlying range (e.g., in the
>>>> case of a caching input iterator). Note that base() could
>>>> be used to modify the underlying range (likely ill-advised)
>>>> while the subrange returned by base_code_units() could
>>>> restrict such writes thereby ensuring consistent behavior
>>>> for input and forward+ iterators.
>>>>
>>>> Tom.
>>>>
>>>>
>>

Received on 2026-05-27 19:45:18