Re: [isocpp-sg16] Comments on P2728R12 Unicode in the Library, Part 1: UTF Transcoding

From: Eddie Nolan <eddiejnolan_at_[hidden]>
Date: Wed, 27 May 2026 15:25:15 -0400
> I agree; I didn't intend to suggest that base() should return a different
> iterator type.

I'm confused. If the transcoding iterator is wrapping an arbitrary input
iterator, how is .base() supposed to return an instance of that type that
points into the iterator's internal cache buffer? For example, if it's
wrapping a std::istreambuf_iterator.

> Wouldn't this be addressed by appropriate use of ranges::dangling and/or
> std::ranges::borrowed_subrange_t?

No, to my understanding, the ranges::dangling mechanism isn't intended to
support that use case; it's a very narrowly scoped facility intended to
prevent returning an iterator from a function that takes an owning view by
value.

> I expect programmers to encounter input ranges more frequently going
> forward because they may be produced by range adapters.

That's true, but it's also the case that input iterators and ranges tend to
result in annoying special cases in many places throughout the standard
views API. We support them as best we can, but I don't think they
necessarily justify bending over backwards to support, especially when the
proposed mechanisms for doing so are problematic.

It's not ideal, but if you have an input range and you really need
forward-range-only features, frequently it's feasible to address that by
copying the input range into a buffer first.

In particular, many of the non-view-based transcoding APIs inherently
require creating a buffer for the input, so users whose existing use case
is transcoding an input range will likely be copying it already.

> A classic example of where access to the original code units is useful is
> when substitutions occur. When U+FFFD is produced, access to the original
> code unit sequence provides the ability to analyze, log, or ameliorate the
> effects of the (presumably) incorrect code unit sequence.

Except for the case of input iterators, this can be implemented using
.base() without needing .base_range(). Here is a modified version of my
transcode_or_throw example from the paper which has been updated to add the
code units that produced the error to the exception message:

template <typename FromChar, typename ToChar>
std::basic_string<ToChar>
transcode_or_throw(std::basic_string_view<FromChar> input) {
  std::basic_string<ToChar> result;
  auto view = input | std::views::to_utf_or_error<ToChar>;
  for (auto it = view.begin(), end = view.end(); it != end; ++it) {
    if ((*it).has_value()) {
      result.push_back(**it);
    } else {
      throw std::runtime_error(
          "error at position " + std::to_string(it.base() - input.begin()) +
          ": " + enum_to_string((*it).error()) + "; code units:" + [&] {
            // The offending code units span from this iterator's base() to
            // the base() of the next transcoding iterator.
            std::string s;
            for (auto p = it.base(); p != std::next(it).base(); ++p)
              // Convert through the unsigned type so that a plain char with
              // the high bit set is not sign-extended.
              s += std::format(
                  " 0x{:02X}",
                  static_cast<unsigned>(
                      static_cast<std::make_unsigned_t<FromChar>>(*p)));
            return s;
          }());
    }
  }
  return result;
}

- Eddie


On Wed, May 27, 2026 at 2:34 PM Tom Honermann <tom_at_[hidden]> wrote:

> On 5/27/26 2:27 PM, Tom Honermann via SG16 wrote:
>
> On 5/27/26 2:47 AM, Eddie Nolan wrote:
>
> Thanks for providing this feedback. Here are my thoughts:
>
> But the general idea of a small cache is applicable; when the underlying
> range (only) models input range, the (specialized) iterator can hold a (4
> byte) input buffer just as is done for the output code unit buffer. Unlike
> the output code unit buffer, there is no buffer index to maintain since
> base() would always return an iterator to the beginning of that buffer.
>
> The closest thing I can think of to a precedent that justifies this
> approach is that the views API can have range adaptors perform
> optimizations that result in .base() not returning an iterator to the
> view that was passed in to the range adaptor. For example, passing an
> instance of std::ranges::reverse_view to std::views::reverse yields the
> base of the std::ranges::reverse_view instead of a reverse_view of a
> reverse_view; so, when you invoke .base(), you don't get a
> std::reverse_iterator. P2728 takes advantage of this to enable
> double-transcode optimizations.
>
> However, I don't think we have precedent for a view type giving out an
> iterator from .base() whose type is unrelated to the iterator type of the
> underlying view. That approach seems like it violates the expectations
> users might have of the way that .base() works.
>
> I agree; I didn't intend to suggest that base() should return a different
> iterator type.
>
> In P2728R13 <https://isocpp.org/files/papers/P2728R13.html> I just
> removed .base() for non-forward input ranges.
>
> The text_view iterators also expose a base_range() member that returns a
> range of the underlying code unit sequence corresponding to base() +
> code-unit-sequence-length (which I think is equivalent to to_increment_ in
> P2728). Is there a reason not to expose such a member? As is, it appears
> that obtaining that range would require constructing a subrange using
> base() from one iterator and base() from another iterator that has advanced
> to the next character. Such a subrange would not be valid in the case of
> specialized input iterators that use an input buffer cache as I suggested
> above (the two iterators would not point into the same range).
>
> In a previous revision of the paper (P2728R7), rather than having
> _or_error views that give out std::expected as the value_type, I tried to
> address error handling with a .success() member function on the iterator
> that gave out std::expected<void, utf_transcoding_error>. I was advised
> by the chair at that time that adding member functions other than .base()
> was objectionable to SG9, because users now have built an expectation that
> they can implement classes that wrap views by providing a limited set of
> member functions, which includes .base() but which does not include any
> novel designs. Unfortunately, I can't point to the minutes, since I was
> given this advice during an "unofficial" session during Wrocław. I would
> worry about experiencing similar resistance to the idea of adding a
> .base_range() member function.
>
> I don't understand the objection. I don't see how adding iterator-specific
> member functions removes the ability to wrap views; such wrappers simply
> wouldn't expose those members, which seems fine to me.
>
> However, I currently haven't seen any use cases that .base_range() would
> enable that can't be implemented using .base(), other than input ranges,
> of course. In the previous telecon, I presented examples of sophisticated
> use cases for .base(), which are now added to P2728R13 in cleaner form.
> These are the "Transcoding into a buffer of a fixed number of code units
> without truncating code points" and "Performing code unit substitutions on
> cuneiform strings" examples.
>
> I don't see why those use cases aren't applicable to input ranges. I may
> be mistaken, but I expect programmers to encounter input ranges more
> frequently going forward because they may be produced by range adapters.
>
> I now see the later correction regarding those use cases.
>
> A classic example of where access to the original code units is useful is
> when substitutions occur. When U+FFFD is produced, access to the original
> code unit sequence provides the ability to analyze, log, or ameliorate the
> effects of the (presumably) incorrect code unit sequence.
>
> Tom.
>
> Furthermore, another problem with giving out views into an internal buffer
> in the transcoding iterator is that users will inevitably try to do the
> following:
>
> - Store a view or iterator pointing into the cache buffer
> - Increment the transcoding iterator, invalidating the aforementioned
> cache buffer view/iterator
> - Obtain a new view/iterator to the cache buffer of the incremented
> transcoding iterator
> - Try to compare the new view to the first one
>
> This is a footgun.
>
> Wouldn't this be addressed by appropriate use of ranges::dangling and/or
> std::ranges::borrowed_subrange_t?
>
> Ultimately, I just don't think supporting .base() on input views is
> feasible right now. The current paper design doesn't provide it for input
> views, which still leaves the door open to adding it in the future if we
> change our minds about that fact.
>
> Adding base() in the future is possible. Adding base_code_units() as I
> suggested would be an ABI breaking change though since it requires an
> additional cache in the iterator.
>
> Tom.
>
> Thanks,
>
> Eddie
>
>
> On Tue, May 26, 2026 at 3:48 PM Tom Honermann <tom_at_[hidden]> wrote:
>
>> It sounds like we'll be continuing discussion of P2728R12 tomorrow. I
>> would like to discuss the approach suggested below as a resolution for the
>> concerns raised last time regarding the behavior of base() for input
>> ranges. Please share any thoughts ahead of the meeting if possible.
>>
>> Tom.
>> On 5/16/26 5:35 PM, Tom Honermann via SG16 wrote:
>>
>> Thank you for the presentation on Wednesday, Eddie. I was glad for us to
>> finally get back to this paper! I have a few comments now that I've read
>> through the latest revision.
>>
>> We briefly discussed what the behavior for base() should be for
>> transcoding iterators that work with an underlying range that only models
>> std::ranges::input_range. The proposed wording has this note in 24.7.?.6
>> ([range.transcoding.iterator]).
>>
>> [ *Note:* to_utf_view::iterator maintains invariants on base() which
>> differ depending on whether it’s an input iterator. In both cases, if
>> *this is at the end of the range being adapted, then base() == end().
>> But if it’s not at the end of the adapted range, and it’s an input
>> iterator, then the position of base() is always at the end of the input
>> subsequence corresponding to the current code point. On the other hand, for
>> forward and bidirectional iterators, the position of base() is always at
>> the beginning of the input subsequence corresponding to the current code
>> point. — *end note* ]
>>
>> When I was working on text_view
>> <https://github.com/tahonermann/text_view/> many years ago, I addressed
>> this concern for encoding and decoding iterator types with an underlying
>> input iterator through specialization; partial specializations of those
>> types substituted a caching iterator
>> <https://github.com/tahonermann/text_view/blob/master/include/text_view_detail/caching_iterator.hpp>
>> for the original underlying input iterator. The exact way that I went about
>> this would not be appropriate for the P2728 design (the cache consists of a
>> cooperatively managed look ahead buffer that is incrementally retired as
>> iterators are advanced; we don't want that here). But the general idea of a
>> small cache is applicable; when the underlying range (only) models input
>> range, the (specialized) iterator can hold a (4 byte) input buffer just as
>> is done for the output code unit buffer. Unlike the output code unit
>> buffer, there is no buffer index to maintain since base() would always
>> return an iterator to the beginning of that buffer. For consistency with
>> forward (and better) iterators, it would be useful for the iterator
>> returned by base() to be comparable to the underlying (input) iterator
>> for the purposes of comparison against end(); but see an alternative
>> approach below.
>>
>> The text_view iterators also expose a base_range() member that returns a
>> range of the underlying code unit sequence corresponding to base() +
>> *code-unit-sequence-length* (which I think is equivalent to to_increment_
>> in P2728). Is there a reason not to expose such a member? As is, it appears
>> that obtaining that range would require constructing a subrange using
>> base() from one iterator and base() from another iterator that has
>> advanced to the next character. Such a subrange would not be valid in the
>> case of specialized input iterators that use an input buffer cache as I
>> suggested above (the two iterators would not point into the same range).
>>
>> I think it would be useful to differentiate access to the (complete)
>> underlying range vs access to the input code unit sequence for the current
>> character. Obviously, access to the complete underlying range isn't
>> possible for input iterators, but access to the current input code unit
>> sequence is (with the caching approach described above). The iterators
>> could expose this interface:
>>
>> // Forward+ iterators only; returns an iterator into the underlying range.
>> constexpr const iterator_t<Base>& base() const & noexcept
>>   requires forward_range<Base> { ... }
>> constexpr iterator_t<Base> base() &&
>>   requires forward_range<Base> { ... }
>>
>> // Input+ iterators; returns a subrange containing the input code units
>> // for the current character.
>> // References the input code unit sequence cache for input iterators.
>> // References the underlying range otherwise.
>> constexpr subrange<...> base_code_units() const noexcept { ... }
>>
>> Unlike base(), base_code_units() would not necessarily contain iterators
>> for the underlying range (e.g., in the case of a caching input iterator).
>> Note that base() could be used to modify the underlying range (likely
>> ill-advised) while the subrange returned by base_code_units() could
>> restrict such writes thereby ensuring consistent behavior for input and
>> forward+ iterators.
>>
>> Tom.
>>
>>
>

Received on 2026-05-27 19:25:30