On Wed, Sep 14, 2022 at 8:02 PM Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:

I had hoped to send this days ago, but alas. I don't have expectations of this being read before today's meeting.

P2626R0 proposes a set of functions to cast between character types. These functions are listed below (each "see below" describes a return type that is a pointer to a type that preserves the cv-qualifiers of the input).

template <class From>
constexpr see below cast_as_utf_unchecked(From* ptr, size_t n) noexcept;
template <class To, class From>
constexpr see below cast_utf_to(From* ptr, size_t n) noexcept;
template <class T>
constexpr see below cast_as_utf_unchecked(basic_string_view<T> && v) noexcept;
template <class To, class From>
constexpr auto cast_utf_to(basic_string_view<From> && v) noexcept;

It is well acknowledged that these functions have sharp edges that are inherent in what the paper seeks to accomplish and that there is no solution that won't have sharp edges. However, since these interfaces are not intended to be expert-only (non-expert programmers are expected to use them), I'd like to try to dull the edges somewhat.

Our August 24th discussion and prior mailing list discussion of the paper raised concerns that it may be important, from a core language perspective, that object type conversions performed using the proposed utilities be undone prior to object destruction in order to avoid the possibility of, for example, an object of type char8_t being destructed as an object of type char (particularly in constant evaluation). We have not yet concluded discussion regarding these concerns, but regardless, I think a scoped mechanism to limit the duration under which objects of one type are accessed as objects of another type would be beneficial. For example (using iterator+size interfaces to avoid questions regarding destruction of subobjects):

void f(const char*, std::size_t);
void g(const char8_t*, std::size_t);
void h(const char8_t *c8p, std::size_t n) {
f(cast_utf_to<char>(c8p, n), n);
g(c8p, n); // UB, c8p still points to converted-to-char objects.
}

I've been playing with an alternate RAII-based library interface that would help to avoid the problem above by enabling scoped access from a compatible type. The high level interfaces are:

template<typename T, typename R>
requires std::ranges::contiguous_range<R>
      && std::ranges::sized_range<R>
      && std::ranges::view<R>
class borrowed_object_view {
public:
using borrowed_element_type = /* T-with-cv-qualifiers-from-R */;
using borrowed_view_type = /* subrange of T-with-cv-qualifiers-from-R and extent of R */;

constexpr operator borrowed_view_type() const;
constexpr operator borrowed_element_type*() const;
};

template<typename T, typename R>
requires std::ranges::contiguous_range<R>
      && std::ranges::sized_range<R>
      && std::ranges::borrowed_range<R>
constexpr auto borrow_as(R &&r) -> borrowed_object_view<T, R>;

template<typename T, typename E>
constexpr auto borrow_as(E *p, std::size_t n) -> borrowed_object_view<T, std::span<E>>;

With these, the example above becomes:

void h(const char8_t *c8p, std::size_t n) {
f(borrow_as<char>(c8p, n), n);
g(c8p, n); // Ok.
}

A prototype implementation is available at https://godbolt.org/z/jE7qxTe7b and it passes tests using both gcc and clang. The prototype of the above functions and class (defined in <borrow>) implements the borrowing functionality using the following interfaces (defined in <underlying_type_access>) where underlying_type_ex_t extends std::underlying_type to specify underlying types for the various character types. Some basic tests are present (in example.cpp). The tests are annotated with a few cases that are considered UB (and these are not detected during constant evaluation).

template<typename T, typename E>
requires std::same_as<detail::underlying_type_ex_t<T>, detail::underlying_type_ex_t<E>>
struct underlying_type_access_handle {
/* unspecified */
};

template<typename T, typename E>
requires std::same_as<detail::underlying_type_ex_t<T>, detail::underlying_type_ex_t<E>>
constexpr auto acquire_underlying_type_access(E *p, std::size_t n = 1) -> underlying_type_access_handle<T, E>;

template<typename T, typename E>
constexpr void release_underlying_type_access(underlying_type_access_handle<T, E> &);

For prototyping purposes, the acquire operation allocates and initializes new T's and the release operation deallocates them (after copying the T elements back over the E elements to preserve mutations when E is non-const). Such an implementation is poor quality from a performance standpoint (and therefore what we want to avoid), but serves as an interesting model for what we are trying to accomplish since it 1) works for both constant and non-constant evaluation, 2) avoids aliasing concerns, and 3) models what programmers have to do today to avoid UB and support constant evaluation.

In order to achieve performance goals, additional behavior needs to be ascribed to these functions. While it would remain UB for the objects to be accessed by the "wrong" type in between the acquire and release operations, I think we can resolve the aliasing concerns at the point of the calls with something like this:

For acquire_underlying_type_access:
Behaves as though an object of type E of unknown provenance has been read (thus ensuring that the objects (of type E) in the range [p, p+n) have all been stored).

For release_underlying_type_access:
Behaves as though an object of type T of unknown provenance has been read (thus ensuring that the objects (of type T) in the range [p, p+n) have all been stored) followed by a write to an object of type E of unknown provenance (thus ensuring that future reads of objects in the range [p, p+n) will force a load).

(It might be simpler to describe the semantics in terms of a copy, but the overlapping storage makes that difficult).

This doesn't address questions of how implementations would detect UB during constant evaluation.

These interfaces do not help to avoid UB in cases like these:

void j(const char*, const char*);
void k(const char8_t *c8p, std::size_t n) {
j(borrow_as<char>(c8p, n), borrow_as<char>(c8p, n)); // UB, the same range is borrowed multiple times in the same full expression.
}

void m(const char8_t *c8p, std::size_t n) {
const char *cp = borrow_as<char>(c8p, n);
cp[0]; // UB, the borrowed objects have already been released.

{
const auto &borrowed = borrow_as<char>(c8p, n);
const char* cp2 = borrowed;
cp2[0]; // Ok.
c8p[0]; // UB, the borrowed objects have not yet been released.
}
c8p[0]; // Ok.
}

Your thoughts welcome.

Thanks for your work, Tom.

This was covered in the paper and considered.

I think it is *less* safe; or rather, it feels safer than it is.

Such an interface cannot guarantee unique access, so, as your examples show, it is too easy to use such an object as a temporary that doesn't outlive the pointer, to store the """borrowed""" pointer somewhere, etc.

The only benefit is that *if* used correctly, destruction of the buffer can happen correctly. But I think we can handle that case at its core.

There is no construct that could enforce unique access to a memory region.

If we can't make it safe, I don't know if there is value in making it look easier to use than it is.