Re: [isocpp-sg16] Comments on P2728R12 Unicode in the Library, Part 1: UTF Transcoding

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 27 May 2026 14:27:30 -0400
On 5/27/26 2:47 AM, Eddie Nolan wrote:
>
> Thanks for providing this feedback. Here are my thoughts:
>
>> But the general idea of a small cache is applicable; when the
>> underlying range (only) models input range, the (specialized)
>> iterator can hold a (4 byte) input buffer just as is done for the
>> output code unit buffer. Unlike the output code unit buffer, there
>> is no buffer index to maintain since base() would always return an
>> iterator to the beginning of that buffer.
>
> The closest thing I can think of to a precedent that justifies this
> approach is that the views API can have range adaptors perform
> optimizations that result in |.base()| not returning an iterator to
> the view that was passed in to the range adaptor. For example, passing
> an instance of |std::ranges::reverse_view| to |std::views::reverse|
> yields the base of the |std::ranges::reverse_view| instead of a
> |reverse_view| of a |reverse_view|; so, when you invoke |.base()|, you
> don't get a |std::reverse_iterator|. P2728 takes advantage of this to
> enable double-transcode optimizations.
>
> However, I don't think we have precedent for a view type giving out an
> iterator from |.base()| whose type is unrelated to the iterator type
> of the underlying view. That approach seems like it violates the
> expectations users might have of the way that |.base()| works.
>
I agree; I didn't intend to suggest that base() should return a
different iterator type.
>
> In P2728R13 <https://isocpp.org/files/papers/P2728R13.html> I just
> removed |.base()| for non-forward input ranges.
>
>> The text_view iterators also expose a base_range() member that
>> returns a range of the underlying code unit sequence corresponding
>> to base() + code-unit-sequence-length (which I think is equivalent
>> to to_increment_ in P2728). Is there a reason not to expose such a
>> member? As is, it appears that obtaining that range would require
>> constructing a subrange using base() from one iterator and base()
>> from another iterator that has advanced to the next character.
>> Such a subrange would not be valid in the case of specialized
>> input iterators that use an input buffer cache as I suggested
>> above (the two iterators would not point into the same range).
>
> In a previous revision of the paper (P2728R7), rather than having
> |_or_error| views that give out |std::expected| as the |value_type|, I
> tried to address error handling with a |.success()| member function on
> the iterator that gave out |std::expected<void,
> utf_transcoding_error>|. I was advised by the chair at that time that
> adding member functions other than |.base()| was objectionable to SG9,
> because users now have built an expectation that they can implement
> classes that wrap views by providing a limited set of member
> functions, which includes |.base()| but which does not include any
> novel designs. Unfortunately, I can't point to the minutes, since I
> was given this advice during an "unofficial" session during Wrocław. I
> would worry about experiencing similar resistance to the idea of
> adding a |.base_range()| member function.
>
I don't understand the objection. I don't see how adding
iterator-specific member functions removes the ability to wrap views;
such wrappers simply wouldn't expose those members, which seems fine to me.
>
> However, I currently haven't seen any use cases that |.base_range()|
> would enable that can't be implemented using |.base()|, other than
> input ranges, of course. In the previous telecon, I presented examples
> of sophisticated use cases for |.base()|, which are now added to
> P2728R13 in cleaner form. These are the "Transcoding into a buffer of
> a fixed number of code units without truncating code points" and
> "Performing code unit substitutions on cuneiform strings" examples.
>
I don't see why those use cases aren't applicable to input ranges. I may
be mistaken, but I expect programmers to encounter input ranges more
frequently going forward because they may be produced by range adaptors.
>
> Furthermore, another problem with giving out views into an internal
> buffer in the transcoding iterator is that users will inevitably try
> to do the following:
>
> * Store a view or iterator pointing into the cache buffer
> * Increment the transcoding iterator, invalidating the
> aforementioned cache buffer view/iterator
> * Obtain a new view/iterator to the cache buffer of the incremented
> transcoding iterator
> * Try to compare the new view to the first one
>
> This is a footgun.
>
Wouldn't this be addressed by appropriate use of std::ranges::dangling and/or
std::ranges::borrowed_subrange_t?
>
> Ultimately, I just don't think supporting |.base()| on input views is
> feasible right now. The current paper design doesn't provide it for
> input views, which still leaves the door open to adding it in the
> future if we change our minds about that fact.
>
Adding base() in the future is possible. Adding base_code_units() as I
suggested would be an ABI breaking change though since it requires an
additional cache in the iterator.

Tom.

> Thanks,
>
> Eddie
>
>
>
> On Tue, May 26, 2026 at 3:48 PM Tom Honermann <tom_at_[hidden]> wrote:
>
> It sounds like we'll be continuing discussion of P2728R12
> tomorrow. I would like to discuss the approach suggested below as
> a resolution for the concerns raised last time regarding the
> behavior of base() for input ranges. Please share any thoughts
> ahead of the meeting if possible.
>
> Tom.
>
> On 5/16/26 5:35 PM, Tom Honermann via SG16 wrote:
>>
>> Thank you for the presentation on Wednesday, Eddie. I was glad
>> for us to finally get back to this paper! I have a few comments
>> now that I've read through the latest revision.
>>
>> We briefly discussed what the behavior for base() should be for
>> transcoding iterators that work with an underlying range that
>> only models std::ranges::input_range. The proposed wording has
>> this note in 24.7.?.6 ([range.transcoding.iterator]).
>>
>> [ /Note:/ to_utf_view::iterator maintains invariants on
>> base() which differ depending on whether it’s an input
>> iterator. In both cases, if *this is at the end of the range
>> being adapted, then base() == end(). But if it’s not at the
>> end of the adapted range, and it’s an input iterator, then
>> the position of base() is always at the end of the input
>> subsequence corresponding to the current code point. On the
>> other hand, for forward and bidirectional iterators, the
>> position of base() is always at the beginning of the input
>> subsequence corresponding to the current code point. — /end
>> note/ ]
>>
>> When I was working on text_view
>> <https://github.com/tahonermann/text_view/> many years ago, I
>> addressed this concern for encoding and decoding iterator types
>> with an underlying input iterator through specialization; partial
>> specializations of those types substituted a caching iterator
>> <https://github.com/tahonermann/text_view/blob/master/include/text_view_detail/caching_iterator.hpp>
>> for the original underlying input iterator. The exact way that I
>> went about this would not be appropriate for the P2728 design
>> (the cache consists of a cooperatively managed look ahead buffer
>> that is incrementally retired as iterators are advanced; we don't
>> want that here). But the general idea of a small cache is
>> applicable; when the underlying range (only) models input range,
>> the (specialized) iterator can hold a (4 byte) input buffer just
>> as is done for the output code unit buffer. Unlike the output
>> code unit buffer, there is no buffer index to maintain since
>> base() would always return an iterator to the beginning of that
>> buffer. For consistency with forward (and better) iterators, it
>> would be useful for the iterator returned by base() to be
>> comparable to the underlying (input) iterator for the purposes of
>> comparison against end(); but see an alternative approach below.
>>
>> The text_view iterators also expose a base_range() member that
>> returns a range of the underlying code unit sequence
>> corresponding to base() + /code-unit-sequence-length/ (which I
>> think is equivalent to to_increment_ in P2728). Is there a reason
>> not to expose such a member? As is, it appears that obtaining
>> that range would require constructing a subrange using base()
>> from one iterator and base() from another iterator that has
>> advanced to the next character. Such a subrange would not be
>> valid in the case of specialized input iterators that use an
>> input buffer cache as I suggested above (the two iterators would
>> not point into the same range).
>>
>> I think it would be useful to differentiate access to the
>> (complete) underlying range vs access to the input code unit
>> sequence for the current character. Obviously, access to the
>> complete underlying range isn't possible for input iterators, but
>> access to the current input code unit sequence is (with the
>> caching approach described above). The iterators could expose
>> this interface:
>>
>> // Forward+ iterators only; returns an iterator into the
>> // underlying range.
>> constexpr const iterator_t<Base>& base() const & noexcept
>>   requires forward_range<Base> { ... }
>> constexpr iterator_t<Base> base() &&
>>   requires forward_range<Base> { ... }
>>
>> // Input+ iterators; returns a subrange containing the input
>> // code units for the current character.
>> // References the input code unit sequence cache for input
>> // iterators.
>> // References the underlying range otherwise.
>> constexpr subrange<...> base_code_units() const noexcept
>>   { ... }
>>
>> Unlike base(), base_code_units() would not necessarily contain
>> iterators for the underlying range (e.g., in the case of a
>> caching input iterator). Note that base() could be used to modify
>> the underlying range (likely ill-advised) while the subrange
>> returned by base_code_units() could restrict such writes thereby
>> ensuring consistent behavior for input and forward+ iterators.
>>
>> Tom.
>>
>>

Received on 2026-05-27 18:27:35