ISOCPP sg16 List: Re: [isocpp-sg16] Comments on P2728R12 Unicode in the Library, Part 1: UTF Transcoding

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 27 May 2026 14:34:28 -0400

On 5/27/26 2:27 PM, Tom Honermann via SG16 wrote:
> On 5/27/26 2:47 AM, Eddie Nolan wrote:
>>
>> Thanks for providing this feedback. Here are my thoughts:
>>
>> But the general idea of a small cache is applicable; when the
>> underlying range (only) models input range, the (specialized)
>> iterator can hold a (4 byte) input buffer just as is done for the
>> output code unit buffer. Unlike the output code unit buffer,
>> there is no buffer index to maintain since base() would always
>> return an iterator to the beginning of that buffer.
>>
>> The closest thing I can think of to a precedent that justifies this
>> approach is that the views API can have range adaptors perform
>> optimizations that result in |.base()| not returning an iterator to
>> the view that was passed in to the range adaptor. For example,
>> passing an instance of |std::ranges::reverse_view| to
>> |std::views::reverse| yields the base of the
>> |std::ranges::reverse_view| instead of a |reverse_view| of a
>> |reverse_view|; so, when you invoke |.base()|, you don't get a
>> |std::reverse_iterator|. P2728 takes advantage of this to enable
>> double-transcode optimizations.
>>
>> However, I don't think we have precedent for a view type giving out
>> an iterator from |.base()| whose type is unrelated to the iterator
>> type of the underlying view. That approach seems like it violates the
>> expectations users might have of the way that |.base()| works.
>>
> I agree; I didn't intend to suggest that base() should return a
> different iterator type.
>>
>> In P2728R13 <https://isocpp.org/files/papers/P2728R13.html> I just
>> removed |.base()| for non-forward input ranges.
>>
>> The text_view iterators also expose a base_range() member that
>> returns a range of the underlying code unit sequence
>> corresponding to base() + code-unit-sequence-length (which I
>> think is equivalent to to_increment_ in P2728). Is there a reason
>> not to expose such a member? As is, it appears that obtaining
>> that range would require constructing a subrange using base()
>> from one iterator and base() from another iterator that has
>> advanced to the next character. Such a subrange would not be
>> valid in the case of specialized input iterators that use an
>> input buffer cache as I suggested above (the two iterators would
>> not point in to the same range).
>>
>> In a previous revision of the paper (P2728R7), rather than having
>> |_or_error| views that give out |std::expected| as the |value_type|,
>> I tried to address error handling with a |.success()| member function
>> on the iterator that gave out |std::expected<void,
>> utf_transcoding_error>|. I was advised by the chair at that time that
>> adding member functions other than |.base()| was objectionable to
>> SG9, because users now have built an expectation that they can
>> implement classes that wrap views by providing a limited set of
>> member functions, which includes |.base()| but which does not include
>> any novel designs. Unfortunately, I can't point to the minutes, since
>> I was given this advice during an "unofficial" session during
>> Wrocław. I would worry about experiencing similar resistance to the
>> idea of adding a |.base_range()| member function.
>>
> I don't understand the objection. I don't see how adding
> iterator-specific member functions removes the ability to wrap views;
> such wrappers simply wouldn't expose those members which seems fine to
> me.
>>
>> However, I currently haven't seen any use cases that |.base_range()|
>> would enable that can't be implemented using |.base()|, other than
>> input ranges, of course. In the previous telecon, I presented
>> examples of sophisticated use cases for |.base()|, which are now
>> added to P2728R13 in cleaner form. These are the "Transcoding into a
>> buffer of a fixed number of code units without truncating code
>> points" and "Performing code unit substitutions on cuneiform strings"
>> examples.
>>
> I don't see why those use cases aren't applicable to input ranges. I
> may be mistaken, but I expect programmers to encounter input ranges
> more frequently going forward because they may be produced by range
> adapters.

I now see the later correction regarding those use cases.

A classic example of where access to the original code units is useful
is when substitutions occur. When U+FFFD is produced, access to the
original code unit sequence provides the ability to analyze, log, or
ameliorate the effects of the (presumably) incorrect code unit sequence.
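
A minimal sketch of that use case, assuming a hand-rolled UTF-8 decoder over a
single-pass input range rather than the P2728 views. The names decoded,
decode_one, and decode_all are hypothetical and appear in neither P2728 nor
text_view. Each result carries a small (4 byte) copy of the code units that
were consumed to produce it, so the bytes behind a U+FFFD remain available
after the underlying input iterator has moved on. Ill-formed input is replaced
one subsequence at a time, stopping at the first byte that cannot continue the
sequence.

```cpp
#include <cstddef>
#include <istream>
#include <iterator>
#include <sstream>
#include <vector>

// Hypothetical result type: one code point plus the code units consumed for it.
struct decoded {
    char32_t code_point = 0xFFFD;   // replacement character unless decoding succeeds
    unsigned char units[4] = {};    // cached copy of the original code units
    std::size_t size = 0;           // number of valid entries in units
};

// Decodes one code point from a single-pass range. Precondition: it != end.
template <class It, class Sent>
decoded decode_one(It& it, Sent end) {
    decoded d;
    const unsigned char lead = static_cast<unsigned char>(*it);
    ++it;
    d.units[d.size++] = lead;

    int need = 0;                          // continuation bytes still expected
    char32_t cp = 0;
    unsigned char lo = 0x80, hi = 0xBF;    // valid range for the next byte
    if (lead < 0x80) {
        d.code_point = lead;
        return d;
    } else if (lead >= 0xC2 && lead <= 0xDF) {
        need = 1; cp = lead & 0x1F;
    } else if (lead >= 0xE0 && lead <= 0xEF) {
        need = 2; cp = lead & 0x0F;
        if (lead == 0xE0) lo = 0xA0;       // reject overlong forms
        if (lead == 0xED) hi = 0x9F;       // reject surrogates
    } else if (lead >= 0xF0 && lead <= 0xF4) {
        need = 3; cp = lead & 0x07;
        if (lead == 0xF0) lo = 0x90;       // reject overlong forms
        if (lead == 0xF4) hi = 0x8F;       // reject values above U+10FFFF
    } else {
        return d;                          // invalid lead byte: U+FFFD, one unit cached
    }

    for (; need > 0; --need) {
        if (it == end) return d;           // truncated sequence
        const unsigned char c = static_cast<unsigned char>(*it);
        if (c < lo || c > hi) return d;    // leave c for the next call
        ++it;
        d.units[d.size++] = c;
        cp = (cp << 6) | (c & 0x3F);
        lo = 0x80; hi = 0xBF;              // only the first continuation is restricted
    }
    d.code_point = cp;
    return d;
}

// Decodes a whole stream through istreambuf_iterator, an input-only iterator.
inline std::vector<decoded> decode_all(std::istream& in) {
    std::istreambuf_iterator<char> it(in), end;
    std::vector<decoded> out;
    while (it != end) out.push_back(decode_one(it, end));
    return out;
}
```

Feeding the bytes 41 E2 82 AC E2 82 42 FF through decode_all yields five
results. The third is U+FFFD with cached units E2 82 (a truncated three-byte
sequence) and the fifth is U+FFFD with cached unit FF, which is the
information a caller would need in order to log or repair the bad input.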

Tom.

>> Furthermore, another problem with giving out views into an internal
>> buffer in the transcoding iterator is that users will inevitably try
>> to do the following:
>>
>> * Store a view or iterator pointing into the cache buffer
>> * Increment the transcoding iterator, invalidating the
>> aforementioned cache buffer view/iterator
>> * Obtain a new view/iterator to the cache buffer of the incremented
>> transcoding iterator
>> * Try to compare the new view to the first one
>>
>> This is a footgun.
>>
> Wouldn't this be addressed by appropriate use of ranges::dangling
> and/or std::ranges::borrowed_subrange_t?
>>
>> Ultimately, I just don't think supporting |.base()| on input views is
>> feasible right now. The current paper design doesn't provide it for
>> input views, which still leaves the door open to adding it in the
>> future if we change our minds about that fact.
>>
> Adding base() in the future is possible. Adding base_code_units() as I
> suggested would be an ABI breaking change though since it requires an
> additional cache in the iterator.
>
> Tom.
>
>> Thanks,
>>
>> Eddie
>>
>>
>>
>> On Tue, May 26, 2026 at 3:48 PM Tom Honermann <tom_at_[hidden]> wrote:
>>
>> It sounds like we'll be continuing discussion of P2728R12
>> tomorrow. I would like to discuss the approach suggested below as
>> a resolution for the concerns raised last time regarding the
>> behavior of base() for input ranges. Please share any thoughts
>> ahead of the meeting if possible.
>>
>> Tom.
>>
>> On 5/16/26 5:35 PM, Tom Honermann via SG16 wrote:
>>>
>>> Thank you for the presentation on Wednesday, Eddie. I was glad
>>> for us to finally get back to this paper! I have a few comments
>>> now that I've read through the latest revision.
>>>
>>> We briefly discussed what the behavior for base() should be for
>>> transcoding iterators that work with an underlying range that
>>> only models std::ranges::input_range. The proposed wording has
>>> this note in 24.7.?.6 ([range.transcoding.iterator]).
>>>
>>> [ /Note:/ to_utf_view::iterator maintains invariants on
>>> base() which differ depending on whether it’s an input
>>> iterator. In both cases, if *this is at the end of the range
>>> being adapted, then base() == end(). But if it’s not at the
>>> end of the adapted range, and it’s an input iterator, then
>>> the position of base() is always at the end of the input
>>> subsequence corresponding to the current code point. On the
>>> other hand, for forward and bidirectional iterators, the
>>> position of base() is always at the beginning of the input
>>> subsequence corresponding to the current code point. — /end
>>> note/ ]
>>>
>>> When I was working on text_view
>>> <https://github.com/tahonermann/text_view/> many years ago, I
>>> addressed this concern for encoding and decoding iterator types
>>> with an underlying input iterator through specialization;
>>> partial specializations of those types substituted a caching
>>> iterator
>>> <https://github.com/tahonermann/text_view/blob/master/include/text_view_detail/caching_iterator.hpp>
>>> for the original underlying input iterator. The exact way that I
>>> went about this would not be appropriate for the P2728 design
>>> (the cache consists of a cooperatively managed look ahead buffer
>>> that is incrementally retired as iterators are advanced; we
>>> don't want that here). But the general idea of a small cache is
>>> applicable; when the underlying range (only) models input range,
>>> the (specialized) iterator can hold a (4 byte) input buffer just
>>> as is done for the output code unit buffer. Unlike the output
>>> code unit buffer, there is no buffer index to maintain since
>>> base() would always return an iterator to the beginning of that
>>> buffer. For consistency with forward (and better) iterators, it
>>> would be useful for the iterator returned by base() to be
>>> comparable to the underlying (input) iterator for the purposes
>>> of comparison against end(); but see an alternative approach below.
>>>
>>> The text_view iterators also expose a base_range() member that
>>> returns a range of the underlying code unit sequence
>>> corresponding to base() + /code-unit-sequence-length/ (which I
>>> think is equivalent to to_increment_ in P2728). Is there a
>>> reason not to expose such a member? As is, it appears that
>>> obtaining that range would require constructing a subrange using
>>> base() from one iterator and base() from another iterator that
>>> has advanced to the next character. Such a subrange would not be
>>> valid in the case of specialized input iterators that use an
>>> input buffer cache as I suggested above (the two iterators would
>>> not point in to the same range).
>>>
>>> I think it would be useful to differentiate access to the
>>> (complete) underlying range vs access to the input code unit
>>> sequence for the current character. Obviously, access to the
>>> complete underlying range isn't possible for input iterators,
>>> but access to the current input code unit sequence is (with the
>>> caching approach described above). The iterators could expose
>>> this interface:
>>>
>>>     // Forward+ iterators only; returns an iterator into the
>>>     // underlying range.
>>>     constexpr const iterator_t<Base>& *base()* const & noexcept
>>>       *requires forward_range<Base>* { ... }
>>>     constexpr iterator_t<Base> *base()* &&
>>>       *requires forward_range<Base>* { ... }
>>>
>>>     // Input+ iterators; returns a subrange containing the input
>>>     // code units for the current character.
>>>     // References the input code unit sequence cache for input
>>>     // iterators.
>>>     // References the underlying range otherwise.
>>>     constexpr subrange<...> *base_code_units()* const noexcept { ... }
>>>
>>> Unlike base(), base_code_units() would not necessarily contain
>>> iterators for the underlying range (e.g., in the case of a
>>> caching input iterator). Note that base() could be used to
>>> modify the underlying range (likely ill-advised) while the
>>> subrange returned by base_code_units() could restrict such
>>> writes thereby ensuring consistent behavior for input and
>>> forward+ iterators.
>>>
>>> Tom.
>>>
>>>
>

Received on 2026-05-27 18:34:31