ISOCPP sg16 List: [isocpp-sg16] Comments on P2728R12 Unicode in the Library, Part 1: UTF Transcoding

From: Tom Honermann <tom_at_[hidden]>
Date: Sat, 16 May 2026 17:35:51 -0400

Thank you for the presentation on Wednesday, Eddie. I was glad for us to
finally get back to this paper! I have a few comments now that I've read
through the latest revision.

We briefly discussed what the behavior for base() should be for
transcoding iterators that work with an underlying range that only
models std::ranges::input_range. The proposed wording has this note in
24.7.?.6 ([range.transcoding.iterator]).

    [ /Note:/ to_utf_view::iterator maintains invariants on base() which
    differ depending on whether it’s an input iterator. In both cases,
    if *this is at the end of the range being adapted, then base() ==
    end(). But if it’s not at the end of the adapted range, and it’s an
    input iterator, then the position of base() is always at the end of
    the input subsequence corresponding to the current code point. On
    the other hand, for forward and bidirectional iterators, the
    position of base() is always at the beginning of the input
    subsequence corresponding to the current code point. — /end note/ ]

When I was working on text_view
<https://github.com/tahonermann/text_view/> many years ago, I addressed
this concern for encoding and decoding iterator types with an underlying
input iterator through specialization; partial specializations of those
types substituted a caching iterator
<https://github.com/tahonermann/text_view/blob/master/include/text_view_detail/caching_iterator.hpp>
for the original underlying input iterator. The exact way that I went
about this would not be appropriate for the P2728 design (the cache
consists of a cooperatively managed look ahead buffer that is
incrementally retired as iterators are advanced; we don't want that
here). But the general idea of a small cache is applicable; when the
underlying range (only) models input_range, the (specialized) iterator
can hold a (4-byte) input buffer just as is done for the output code
unit buffer. Unlike the output code unit buffer, there is no buffer
index to maintain since base() would always return an iterator to the
beginning of that buffer. For consistency with forward (and better)
iterators, it would be useful for the iterator returned by base() to be
comparable to the underlying (input) iterator for the purposes of
comparison against end(); but see an alternative approach below.

The text_view iterators also expose a base_range() member that returns
the range of underlying code units for the current character, that is,
[base(), base() + /code-unit-sequence-length/) (where I think the length
is equivalent to to_increment_ in P2728). Is there a reason not to
expose such a member?
As is, it appears that obtaining that range would require constructing a
subrange using base() from one iterator and base() from another iterator
that has advanced to the next character. Such a subrange would not be
valid in the case of specialized input iterators that use an input
buffer cache as I suggested above (the two iterators would not point
into the same range).

I think it would be useful to differentiate access to the (complete)
underlying range vs access to the input code unit sequence for the
current character. Obviously, access to the complete underlying range
isn't possible for input iterators, but access to the current input code
unit sequence is (with the caching approach described above). The
iterators could expose this interface:

    // Forward+ iterators only; returns an iterator into the underlying
    // range.
    constexpr const iterator_t<Base>& base() const & noexcept
        requires forward_range<Base> { ... }
    constexpr iterator_t<Base> base() &&
        requires forward_range<Base> { ... }

    // Input+ iterators; returns a subrange containing the input code
    // units for the current character.
    // References the input code unit sequence cache for input iterators.
    // References the underlying range otherwise.
    constexpr subrange<...> base_code_units() const noexcept { ... }

Unlike base(), base_code_units() would not necessarily contain iterators
for the underlying range (e.g., in the case of a caching input
iterator). Note that base() could be used to modify the underlying range
(likely ill-advised) while the subrange returned by base_code_units()
could restrict such writes thereby ensuring consistent behavior for
input and forward+ iterators.

Tom.

Received on 2026-05-16 21:35:57