ISOCPP sg16 List: Re: An alternative interface for P2626R0 (charN_t incremental adoption: Casting pointers of UTF character types)

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 14 Sep 2022 18:59:58 -0400

On 9/14/22 3:31 PM, Corentin Jabot via SG16 wrote:
>
>
> On Wed, Sep 14, 2022 at 8:02 PM Tom Honermann via SG16
> <sg16_at_[hidden]> wrote:
>
> I had hoped to send this days ago, but alas. I don't have
> expectations of this being read before today's meeting.
>
> P2626R0 <https://wg21.link/p2626r0> proposes a set of functions to
> cast between character types. These functions are listed below
> (the /see below/ describes a return type that is a pointer to a
> type that preserves CV qualifiers from the input).
>
> template <class From>
> constexpr /see below/ cast_as_utf_unchecked(From* ptr, size_t
> n) noexcept;
> template <class To, class From>
> constexpr /see below/ cast_utf_to(From* ptr, size_t n) noexcept;
> template <class T>
> constexpr /see below/
> cast_as_utf_unchecked(basic_string_view<T> && v) noexcept;
> template <class To, class From>
> constexpr auto cast_utf_to(basic_string_view<From> && v) noexcept;
>
> It is well acknowledged that these functions have sharp edges that
> are inherent in what the paper seeks to accomplish and that there
> is no solution that won't have sharp edges. However, since these
> interfaces are not intended to be expert only (non-expert
> programmers are expected to use them), I'd like to try to dull the
> edges somewhat.
>
> Our August 24th discussion
> <https://github.com/sg16-unicode/sg16-meetings#august-24th-2022>
> and prior mailing list discussion
> <https://lists.isocpp.org/sg16/2022/08/3354.php> of the paper
> raised concerns that it may be important, from a core language
> perspective, that object type conversions performed using the
> proposed utilities be undone prior to object destruction in order
> to avoid the possibility of, for example, an object of type
> char8_t being destructed as an object of type char (particularly
> in constant evaluation). We have not yet concluded discussion
> regarding these concerns, but regardless, I think a scoped
> mechanism to limit the duration under which objects of one type
> are accessed as objects of another type would be beneficial. For
> example (using iterator+size interfaces to avoid questions
> regarding destruction of subobjects):
>
> void f(const char*, std::size_t);
> void g(const char8_t*, std::size_t);
> void h(const char8_t *c8p, std::size_t n) {
> f(cast_utf_to<char>(c8p, n), n);
> g(c8p, n); // UB, c8p still points to converted-to-char objects.
> }
>
> I've been playing with an alternate RAII-based library interface
> that would help to avoid the problem above by enabling scoped
> access from a compatible type. The high level interfaces are:
>
> template<typename T, typename R>
> requires std::ranges::contiguous_range<R>
> && std::ranges::sized_range<R>
> && std::ranges::view<R>
> class borrowed_object_view {
> public:
> using borrowed_element_type = /* T-with-cv-qualifiers-from-R */;
> using borrowed_view_type = /* subrange of
> T-with-cv-qualifiers-from-R and extent of R */;
>
> constexpr operator borrowed_view_type() const;
> constexpr operator borrowed_element_type*() const;
> };
>
> template<typename T, typename R>
> requires std::ranges::contiguous_range<R>
> && std::ranges::sized_range<R>
> && std::ranges::borrowed_range<R>
> constexpr auto borrow_as(R &&r) -> borrowed_object_view<T, R>;
>
> template<typename T, typename E>
> constexpr auto borrow_as(E *p, std::size_t n) ->
> borrowed_object_view<T, std::span<E*>>;
>
> With these, the example above becomes:
>
> void h(const char8_t *c8p, std::size_t n) {
> f(borrow_as<char>(c8p, n), n);
> g(c8p, n); // Ok.
> }
>
> A prototype implementation is available at
> https://godbolt.org/z/jE7qxTe7b and it passes tests using both gcc
> and clang. The prototype of the above functions and class (defined
> in <borrow>) implements the borrowing functionality using the
> following interfaces (defined in <underlying_type_access>) where
> underlying_type_ex_t extends std::underlying_type to specify
> underlying types for the various character types. Some basic tests
> are present (in example.cpp). The tests are annotated with a few
> cases that are considered UB (and these are not detected during
> constant evaluation).
>
> template<typename T, typename E>
> requires std::same_as<detail::underlying_type_ex_t<T>,
> detail::underlying_type_ex_t<E>>
> struct underlying_type_access_handle {
> /* unspecified */
> };
>
> template<typename T, typename E>
> requires std::same_as<detail::underlying_type_ex_t<T>,
> detail::underlying_type_ex_t<E>>
> constexpr auto acquire_underlying_type_access(E *p,
> std::size_t n = 1) -> underlying_type_access_handle<T, E>;
>
> template<typename T, typename E>
> constexpr void
> release_underlying_type_access(underlying_type_access_handle<T,
> E> &);
>
> For prototyping purposes, the acquire operation allocates and
> initializes new T's and the release operation deallocates them
> (after copying the T elements back over the E elements to preserve
> mutations when E is non-const). Such an implementation is poor
> quality from a performance standpoint (and therefore what we want
> to avoid), but serves as an interesting model for what we are
> trying to accomplish since it 1) works for both constant and
> non-constant evaluation, 2) avoids aliasing concerns, and 3)
> models what programmers have to do today to avoid UB and support
> constant evaluation.
>
> In order to achieve performance goals, additional behavior needs
> to be ascribed to these functions. While it would remain UB for
> the objects to be accessed by the "wrong" type in between the
> acquire and release operations, I think we can resolve the
> aliasing concerns at the point of the calls with something like this:
>
> * For acquire_underlying_type_access:
> Behaves as though an object of type E of unknown provenance
> has been read (thus ensuring that the objects (of type E) in
> the range [p, p+n) have all been stored).
> * For release_underlying_type_access:
> Behaves as though an object of type T of unknown provenance
> has been read (thus ensuring that the objects (of type T) in
> the range [p, p+n) have all been stored) followed by a write to
> an object of type E of unknown provenance (thus ensuring that
> future reads of objects in the range [p, p+n) will force a load).
>
> (It might be simpler to describe the semantics in terms of a copy,
> but the overlapping storage makes that difficult).
>
> This doesn't address questions of how implementations would detect
> UB during constant evaluation.
>
> These interfaces do not help to avoid UB in cases like these:
>
> void j(const char*, const char*);
> void k(const char8_t *c8p, std::size_t n) {
> // UB, the same range is borrowed multiple times in the same full
> // expression.
> j(borrow_as<char>(c8p, n), borrow_as<char>(c8p, n));
> }
>
> void m(const char8_t *c8p, std::size_t n) {
> const char *cp = borrow_as<char>(c8p, n);
> cp[0]; // UB, the borrowed objects have already been released.
>
> {
> const auto &borrowed = borrow_as<char>(c8p, n);
> const char* cp2 = borrowed;
> cp2[0]; // Ok.
> c8p[0]; // UB, the borrowed objects have not yet been released.
> }
> c8p[0]; // Ok.
> }
>
> Your thoughts welcome.
>
>
> Thanks for your work Tom
> This was covered in the paper and considered.
>
> I think it is *less* safe. Or rather, it feels safer than it is.
> Such an interface cannot guarantee unique access, so, as your examples
> show, it is too easy to use such an object as a temporary that
> doesn't outlive the pointer, to store the """borrowed""" pointer
> somewhere, etc.
> The only benefit is that *if* used correctly, destruction of the
> buffer can happen correctly. But I think we can handle that case at
> its core.
> There is no construct that could enforce unique access to a memory region.
> If we can't make it safe, I don't know if there is value in making it
> look easier to use than it is.

Thank you for your comments.

I acknowledge the lifetime concerns and I agree that there is no
mechanism to ensure consistent access by the right type. But there is a
more fundamental concern that this approach addresses that the
interfaces proposed in P2626R0 do not. Consider the following:

    void n(char8_t *c8p, std::size_t n) {
       c8p[0] = '1';
       char *cp = use_as<char>(c8p, n);
       cp[0] = '2';
       ...
       // True or false? What triggers reification with the write via cp[0]?
       c8p[0] == '2';
    }

Note that a write via the converted type is not required in order for
problems to occur in implementations that use TBAA:

    void p(const char8_t *c8p, std::size_t n) {
       const char *cp = use_as<const char>(c8p, n);
       // True or false? What ensures the write to c8a[0] in q() has
       // resulted in a store?
       cp[0] == '1';
    }
    void q() {
       char8_t c8a[] = u8"text";
       c8a[0] = '1';
       p(c8a, sizeof(c8a));
    }

The semantics I described above are intended to denote barriers that
ensure that stores have completed such that future loads by the "other"
type will load the "current" value. The lifetime concerns may suggest
that a different syntax would be preferred; perhaps some form of block
syntax:

    void r(const char8_t *c8p, std::size_t n) {
       c8p[0]; // Ok.
       borrow(c8p, n) as const char *cp {
         cp[0]; // Ok.
         c8p[0]; // UB, the borrowed objects have not yet been released.
       }
       c8p[0]; // Ok.
    }
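
Short of new syntax, similar scoping can be approximated in today's C++
with a callback. The following is a sketch only, under stated
assumptions: with_borrowed_as and utf8_char are hypothetical names that
appear in neither P2626R0 nor the prototype, and the helper follows the
prototype's copy-in/copy-out model rather than real aliasing, so it
illustrates the intended semantics and not the intended performance. It
is also not constexpr, for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Assumption: a stand-in alias so the sketch also builds before C++20.
// With C++20 (or -fchar8_t) this is simply char8_t.
#if defined(__cpp_char8_t)
using utf8_char = char8_t;
#else
using utf8_char = unsigned char;
#endif

// Hypothetical helper, not part of P2626R0 or the prototype.
// Gives fn scoped access to [p, p + n) as objects of type T.
template <typename T, typename E, typename F>
void with_borrowed_as(E* p, std::size_t n, F&& fn) {
    static_assert(sizeof(T) == sizeof(E), "element sizes must match");
    assert(p != nullptr || n == 0);

    using U = std::remove_cv_t<T>;
    std::vector<U> tmp(n);
    for (std::size_t i = 0; i < n; ++i) {
        tmp[i] = static_cast<U>(p[i]);      // "acquire": copy in
    }

    T* view = tmp.data();
    fn(view, n);                            // the borrow ends when fn returns

    if constexpr (!std::is_const_v<E>) {
        for (std::size_t i = 0; i < n; ++i) {
            p[i] = static_cast<E>(tmp[i]);  // "release": copy mutations back
        }
    }
}

// Mirrors n() above: a write through the char view is visible afterwards.
inline bool write_through_borrow_is_visible() {
    utf8_char buf[] = {utf8_char('1'), utf8_char('x'), utf8_char('\0')};
    with_borrowed_as<char>(buf, sizeof(buf), [](char* cp, std::size_t) {
        cp[0] = '2';
    });
    return buf[0] == utf8_char('2');
}

// Mirrors p() and q() above: a prior store is visible through the char view.
inline bool borrow_sees_prior_store() {
    utf8_char buf[] = {utf8_char('t'), utf8_char('e'), utf8_char('\0')};
    buf[0] = utf8_char('1');
    const utf8_char* c8p = buf;
    bool seen = false;
    with_borrowed_as<const char>(c8p, sizeof(buf),
                                 [&seen](const char* cp, std::size_t) {
        seen = (cp[0] == '1');
    });
    return seen;
}
```

The char pointer exists only inside the callback, so it cannot be used
after the release point unless the callback deliberately leaks it. In
this copying model the copy-in and copy-out play the role of the
barriers described above. Nothing here stops the callback from touching
the original range through c8p while the borrow is active.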

(note and apology to Jens because I've done this in the past and he has
corrected me before and I forgot until now: I used discarded-value
expression statements like c8p[0]; above that I annotated as UB because
they "access" an object using the wrong type. The problem is that no
access occurs because no lvalue-to-rvalue conversion occurs for these
cases. Please pretend that they are all volatile qualified).
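
To make the distinction concrete, here is a small sketch of which forms
actually read the object. The function names are illustrative only and
come from neither the paper nor the prototype. static_cast<void> is used
just to silence unused-value warnings; its operand is still a
discarded-value expression.

```cpp
// No access: a discarded-value expression on a non-volatile glvalue gets
// no lvalue-to-rvalue conversion, so p[0] is not read.
inline void no_read(const char* p) {
    static_cast<void>(p[0]);
}

// Access: a volatile-qualified glvalue produced by subscripting does get
// the lvalue-to-rvalue conversion, even though the value is discarded.
inline void volatile_read(const volatile char* p) {
    static_cast<void>(p[0]);
}

// Access: initializing an object from the glvalue reads it.
inline char read_by_copy(const char* p) {
    char c = p[0];
    return c;
}
```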

Tom.

>
> Thanks,
>
> corentin
>
> Tom.

Received on 2022-09-14 23:00:00