Re: [isocpp-sg16] Comments on P2728R12 Unicode in the Library, Part 1: UTF Transcoding

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 26 May 2026 15:48:44 -0400
It sounds like we'll be continuing discussion of P2728R12 tomorrow. I
would like to discuss the approach suggested below as a resolution for
the concerns raised last time regarding the behavior of base() for input
ranges. Please share any thoughts ahead of the meeting if possible.

Tom.

On 5/16/26 5:35 PM, Tom Honermann via SG16 wrote:
>
> Thank you for the presentation on Wednesday, Eddie. I was glad for us
> to finally get back to this paper! I have a few comments now that I've
> read through the latest revision.
>
> We briefly discussed what the behavior for base() should be for
> transcoding iterators that work with an underlying range that only
> models std::ranges::input_range. The proposed wording has this note in
> 24.7.?.6 ([range.transcoding.iterator]).
>
> [ /Note:/ to_utf_view::iterator maintains invariants on base()
> which differ depending on whether it’s an input iterator. In both
> cases, if *this is at the end of the range being adapted, then
> base() == end(). But if it’s not at the end of the adapted range,
> and it’s an input iterator, then the position of base() is always
> at the end of the input subsequence corresponding to the current
> code point. On the other hand, for forward and bidirectional
> iterators, the position of base() is always at the beginning of
> the input subsequence corresponding to the current code point. —
> /end note/ ]
>
> When I was working on text_view
> <https://github.com/tahonermann/text_view/> many years ago, I
> addressed this concern for encoding and decoding iterator types with
> an underlying input iterator through specialization; partial
> specializations of those types substituted a caching iterator
> <https://github.com/tahonermann/text_view/blob/master/include/text_view_detail/caching_iterator.hpp>
> for the original underlying input iterator. The exact way that I went
> about this would not be appropriate for the P2728 design (the cache
> consists of a cooperatively managed look ahead buffer that is
> incrementally retired as iterators are advanced; we don't want that
> here). But the general idea of a small cache is applicable; when the
> underlying range (only) models input range, the (specialized) iterator
> can hold a (4 byte) input buffer just as is done for the output code
> unit buffer. Unlike the output code unit buffer, there is no buffer
> index to maintain since base() would always return an iterator to the
> beginning of that buffer. For consistency with forward (and better)
> iterators, it would be useful for the iterator returned by base() to
> be comparable to the underlying (input) iterator for the purposes of
> comparison against end(); but see an alternative approach below.
>
> The text_view iterators also expose a base_range() member that returns
> the range of underlying code units for the current character, from
> base() to base() + /code-unit-sequence-length/ (which I think is
> equivalent to to_increment_ in P2728). Is there a reason not to expose such a
> member? As is, it appears that obtaining that range would require
> constructing a subrange using base() from one iterator and base() from
> another iterator that has advanced to the next character. Such a
> subrange would not be valid in the case of specialized input iterators
> that use an input buffer cache as I suggested above (the two iterators
> would not point into the same range).
>
> I think it would be useful to differentiate access to the (complete)
> underlying range vs access to the input code unit sequence for the
> current character. Obviously, access to the complete underlying range
> isn't possible for input iterators, but access to the current input
> code unit sequence is (with the caching approach described above).
> The iterators could expose this interface:
>
>     // Forward+ iterators only; returns an iterator into the underlying range.
>     constexpr const iterator_t<Base>& base() const & noexcept
>       requires forward_range<Base> { ... }
>     constexpr iterator_t<Base> base() &&
>       requires forward_range<Base> { ... }
>
>     // Input+ iterators; returns a subrange containing the input code
>     // units for the current character.
>     // References the input code unit sequence cache for input iterators.
>     // References the underlying range otherwise.
>     constexpr subrange<...> base_code_units() const noexcept { ... }
>
> Unlike base(), base_code_units() would not necessarily contain
> iterators for the underlying range (e.g., in the case of a caching
> input iterator). Note that base() could be used to modify the
> underlying range (likely ill-advised) while the subrange returned by
> base_code_units() could restrict such writes thereby ensuring
> consistent behavior for input and forward+ iterators.
>
> Tom.
>
>
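
To make the caching idea from the quoted message concrete, here is a minimal sketch of a single-pass UTF-8 decoder that keeps the code units of the current character in a 4-element buffer and exposes them through a read-only base_code_units() member. It is an illustration only, not the P2728 to_utf_view::iterator and not the text_view caching iterator: the names caching_utf8_cursor, utf8_length, and code_units_per_character are hypothetical, the cursor is not a full iterator type, it returns a std::string_view rather than a subrange, and it assumes well-formed UTF-8 (no error handling or replacement characters).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <sstream>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical helper: code unit count implied by a UTF-8 lead byte.
// Assumes well-formed input; any unexpected byte is treated as length 1.
constexpr std::size_t utf8_length(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

// Hypothetical single-pass decoder over an input iterator/sentinel pair.
// The underlying iterator always sits just past the current character,
// but the consumed code units remain available from the cache.
template <class I, class S>
class caching_utf8_cursor {
public:
    caching_utf8_cursor(I first, S last)
        : it_(std::move(first)), end_(std::move(last)) { advance(); }

    // An empty cache means the underlying range is exhausted.
    bool at_end() const { return len_ == 0; }

    char32_t code_point() const {
        assert(!at_end());
        return cp_;
    }

    // Read-only view of the input code units for the current character.
    // It refers to the cache, not to the underlying range.
    std::string_view base_code_units() const { return {buf_.data(), len_}; }

    // Reads and decodes the next character, refilling the cache.
    void advance() {
        len_ = 0;
        if (it_ == end_) return;
        static constexpr unsigned char lead_mask[] = {0x7F, 0x1F, 0x0F, 0x07};
        const auto lead = static_cast<unsigned char>(*it_);
        const std::size_t n = utf8_length(lead);
        cp_ = lead & lead_mask[n - 1];  // payload bits of the lead byte
        buf_[len_++] = static_cast<char>(lead);
        ++it_;
        // Stop early if the input ends in the middle of a sequence.
        while (len_ < n && it_ != end_) {
            const auto unit = static_cast<unsigned char>(*it_);
            cp_ = (cp_ << 6) | (unit & 0x3F);
            buf_[len_++] = static_cast<char>(unit);
            ++it_;
        }
    }

private:
    I it_;
    S end_;
    std::array<char, 4> buf_{};  // input code unit cache
    std::size_t len_ = 0;        // number of valid entries in buf_
    char32_t cp_ = 0;
};

// Usage: collect the cached code units for each character read from a
// stream, which only offers single-pass (input) iteration.
inline std::vector<std::string>
code_units_per_character(const std::string& utf8) {
    std::istringstream in(utf8);
    using It = std::istreambuf_iterator<char>;
    caching_utf8_cursor<It, It> cur{It(in), It()};
    std::vector<std::string> out;
    for (; !cur.at_end(); cur.advance())
        out.emplace_back(cur.base_code_units());
    return out;
}
```

Because the view returned by base_code_units() is read-only and refers to the cache, callers cannot use it to write to the underlying range, and the same member could be offered for forward and better iterators by referring to the underlying range directly.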

Received on 2026-05-26 19:48:49