ISOCPP sg16 List: An alternative interface for P2626R0 (charN_t incremental adoption: Casting pointers of UTF character types)

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 14 Sep 2022 14:02:37 -0400

I had hoped to send this days ago, but alas. I don't have expectations
of this being read before today's meeting.

P2626R0 <https://wg21.link/p2626r0> proposes a set of functions to cast
between character types. These functions are listed below (the /see
below/ describes a return type that is a pointer to a type that
preserves CV qualifiers from the input).

    template <class From>
    constexpr /see below/ cast_as_utf_unchecked(From* ptr, size_t n) noexcept;
    template <class To, class From>
    constexpr /see below/ cast_utf_to(From* ptr, size_t n) noexcept;
    template <class T>
    constexpr /see below/ cast_as_utf_unchecked(basic_string_view<T> && v) noexcept;
    template <class To, class From>
    constexpr auto cast_utf_to(basic_string_view<From> && v) noexcept;

It is well acknowledged that these functions have sharp edges that are
inherent in what the paper seeks to accomplish and that there is no
solution that won't have sharp edges. However, since these interfaces
are not intended to be expert only (non-expert programmers are expected
to use them), I'd like to try to dull the edges somewhat.

Our August 24th discussion
<https://github.com/sg16-unicode/sg16-meetings#august-24th-2022> and
prior mailing list discussion
<https://lists.isocpp.org/sg16/2022/08/3354.php> of the paper raised
concerns that it may be important, from a core language perspective,
that object type conversions performed using the proposed utilities be
undone prior to object destruction in order to avoid the possibility of,
for example, an object of type char8_t being destructed as an object of
type char (particularly in constant evaluation). We have not yet
concluded discussion regarding these concerns, but regardless, I think a
scoped mechanism to limit the duration under which objects of one type
are accessed as objects of another type would be beneficial. For example
(using iterator+size interfaces to avoid questions regarding destruction
of subobjects):

    void f(const char*, std::size_t);
    void g(const char8_t*, std::size_t);
    void h(const char8_t *c8p, std::size_t n) {
       f(cast_utf_to<char>(c8p, n), n);
       g(c8p, n); // UB, c8p still points to converted-to-char objects.
    }

I've been playing with an alternate RAII-based library interface that
would help to avoid the problem above by enabling scoped access from a
compatible type. The high level interfaces are:

    template<typename T, typename R>
    requires std::ranges::contiguous_range<R>
           && std::ranges::sized_range<R>
           && std::ranges::view<R>
    class borrowed_object_view {
    public:
       using borrowed_element_type = /* T-with-cv-qualifiers-from-R */;
       using borrowed_view_type =
          /* subrange of T-with-cv-qualifiers-from-R and extent of R */;

       constexpr operator borrowed_view_type() const;
       constexpr operator borrowed_element_type*() const;
    };

    template<typename T, typename R>
    requires std::ranges::contiguous_range<R>
           && std::ranges::sized_range<R>
           && std::ranges::borrowed_range<R>
    constexpr auto borrow_as(R &&r) -> borrowed_object_view<T, R>;

    template<typename T, typename E>
    constexpr auto borrow_as(E *p, std::size_t n)
       -> borrowed_object_view<T, std::span<E>>;

With these, the example above becomes:

    void h(const char8_t *c8p, std::size_t n) {
       f(borrow_as<char>(c8p, n), n);
       g(c8p, n); // Ok.
    }

A prototype implementation is available at
https://godbolt.org/z/jE7qxTe7b and it passes tests using both gcc and
clang. The prototype of the above functions and class (defined in
<borrow>) implements the borrowing functionality using the following
interfaces (defined in <underlying_type_access>) where
underlying_type_ex_t extends std::underlying_type to specify underlying
types for the various character types. Some basic tests are present (in
example.cpp). The tests are annotated with a few cases that are
considered UB (and these are not detected during constant evaluation).

    template<typename T, typename E>
    requires std::same_as<detail::underlying_type_ex_t<T>,
                          detail::underlying_type_ex_t<E>>
    struct underlying_type_access_handle {
       /* unspecified */
    };

    template<typename T, typename E>
    requires std::same_as<detail::underlying_type_ex_t<T>,
                          detail::underlying_type_ex_t<E>>
    constexpr auto acquire_underlying_type_access(E *p, std::size_t n = 1)
       -> underlying_type_access_handle<T, E>;

    template<typename T, typename E>
    constexpr void
    release_underlying_type_access(underlying_type_access_handle<T, E> &);

For prototyping purposes, the acquire operation allocates and
initializes new T's and the release operation deallocates them (after
copying the T elements back over the E elements to preserve mutations
when E is non-const). Such an implementation is poor quality from a
performance standpoint (and therefore what we want to avoid), but serves
as an interesting model for what we are trying to accomplish since it 1)
works for both constant and non-constant evaluation, 2) avoids aliasing
concerns, and 3) models what programmers have to do today to avoid UB
and support constant evaluation.

In order to achieve performance goals, additional behavior needs to be
ascribed to these functions. While it would remain UB for the objects to
be accessed by the "wrong" type in between the acquire and release
operations, I think we can resolve the aliasing concerns at the point of
the calls with something like this:

  * For acquire_underlying_type_access:
    Behaves as though an object of type E of unknown provenance has been
    read (thus ensuring that the objects (of type E) in the range
    [p, p+n) have all been stored).
  * For release_underlying_type_access:
    Behaves as though an object of type T of unknown provenance has been
    read (thus ensuring that the objects (of type T) in the range
    [p, p+n) have all been stored) followed by a write to an object of
    type E of unknown provenance (thus ensuring that future reads of
    objects in the range [p, p+n) will force a load).

(It might be simpler to describe the semantics in terms of a copy, but
the overlapping storage makes that difficult).

This doesn't address questions of how implementations would detect UB
during constant evaluation.

These interfaces do not help to avoid UB in cases like these:

    void j(const char*, const char*);
    void k(const char8_t *c8p, std::size_t n) {
       // UB, the same range is borrowed multiple times in the same full
       // expression.
       j(borrow_as<char>(c8p, n), borrow_as<char>(c8p, n));
    }

    void m(const char8_t *c8p, std::size_t n) {
       const char *cp = borrow_as<char>(c8p, n);
       cp[0]; // UB, the borrowed objects have already been released.

       {
          const auto &borrowed = borrow_as<char>(c8p, n);
          const char* cp2 = borrowed;
          cp2[0]; // Ok.
          c8p[0]; // UB, the borrowed objects have not yet been released.
       }
       c8p[0]; // Ok.
    }
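
The two outcomes in m() follow from ordinary temporary lifetime rules, which the sketch below demonstrates in isolation. The tracer type and make function are hypothetical; tracer stands in for a borrow object, logging "A" when access is acquired and "R" when it is released.

```cpp
#include <cassert>
#include <string>

namespace sketch {

// Hypothetical stand-in for a borrow object that records its lifetime.
struct tracer {
    std::string *log;
    explicit tracer(std::string &l) : log(&l) { *log += "A"; } // acquire
    ~tracer() { *log += "R"; }                                 // release
    operator const char *() const { return "x"; }
};

inline tracer make(std::string &log) { return tracer(log); }

inline std::string demo() {
    std::string log;

    // The temporary is destroyed at the end of this full-expression, so
    // the release happens before p can be used.
    const char *p = make(log);
    (void)p;
    log += "1";

    {
        // Binding to a const reference extends the temporary's lifetime
        // to the end of the enclosing block.
        const auto &held = make(log);
        (void)held;
        log += "2";
    }
    log += "3";
    return log; // "AR1A2R3"
}

} // namespace sketch
```

The log shows the first release ("R") occurring before step 1, and the second release occurring only after step 2, which matches the comments in m().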

Your thoughts welcome.

Tom.

Received on 2022-09-14 18:02:39