On 9/14/22 3:31 PM, Corentin Jabot via SG16 wrote:


On Wed, Sep 14, 2022 at 8:02 PM Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:

I had hoped to send this days ago, but alas. I don't have expectations of this being read before today's meeting.

P2626R0 proposes a set of functions to cast between character types. These functions are listed below (each "see below" denotes a return type that is a pointer to a type that preserves the cv-qualifiers of the input).

template <class From>
constexpr see below cast_as_utf_unchecked(From* ptr, size_t n) noexcept;
template <class To, class From>
constexpr see below cast_utf_to(From* ptr, size_t n) noexcept;
template <class T>
constexpr see below cast_as_utf_unchecked(basic_string_view<T> && v) noexcept;
template <class To, class From>
constexpr auto cast_utf_to(basic_string_view<From> && v) noexcept;

It is well acknowledged that these functions have sharp edges that are inherent in what the paper seeks to accomplish and that there is no solution that won't have sharp edges. However, since these interfaces are not intended to be expert only (non-expert programmers are expected to use them), I'd like to try to dull the edges somewhat.

Our August 24th discussion and prior mailing list discussion of the paper raised concerns that it may be important, from a core language perspective, that object type conversions performed using the proposed utilities be undone prior to object destruction in order to avoid the possibility of, for example, an object of type char8_t being destructed as an object of type char (particularly in constant evaluation). We have not yet concluded discussion regarding these concerns, but regardless, I think a scoped mechanism to limit the duration under which objects of one type are accessed as objects of another type would be beneficial. For example (using iterator+size interfaces to avoid questions regarding destruction of subobjects):

void f(const char*, std::size_t);
void g(const char8_t*, std::size_t);

void h(const char8_t *c8p, std::size_t n) {
  f(cast_utf_to<char>(c8p, n), n);
  g(c8p, n); // UB, c8p still points to converted-to-char objects.
}

I've been playing with an alternate RAII-based library interface that would help to avoid the problem above by enabling scoped access from a compatible type. The high level interfaces are:

template<typename T, typename R>
requires std::ranges::contiguous_range<R>
      && std::ranges::sized_range<R>
      && std::ranges::view<R>
class borrowed_object_view {
public:
  using borrowed_element_type = /* T-with-cv-qualifiers-from-R */;
  using borrowed_view_type = /* subrange of T-with-cv-qualifiers-from-R and extent of R */;

  constexpr operator borrowed_view_type() const;
  constexpr operator borrowed_element_type*() const;
};

template<typename T, typename R>
requires std::ranges::contiguous_range<R>
      && std::ranges::sized_range<R>
      && std::ranges::borrowed_range<R>
constexpr auto borrow_as(R &&r) -> borrowed_object_view<T, R>;

template<typename T, typename E>
constexpr auto borrow_as(E *p, std::size_t n) -> borrowed_object_view<T, std::span<E>>;

With these, the example above becomes:

void h(const char8_t *c8p, std::size_t n) {
  f(borrow_as<char>(c8p, n), n);
  g(c8p, n); // Ok.
}

A prototype implementation is available at https://godbolt.org/z/jE7qxTe7b and it passes tests using both gcc and clang. The prototype of the above functions and class (defined in <borrow>) implements the borrowing functionality using the following interfaces (defined in <underlying_type_access>) where underlying_type_ex_t extends std::underlying_type to specify underlying types for the various character types. Some basic tests are present (in example.cpp). The tests are annotated with a few cases that are considered UB (and these are not detected during constant evaluation).

template<typename T, typename E>
requires std::same_as<detail::underlying_type_ex_t<T>, detail::underlying_type_ex_t<E>>
struct underlying_type_access_handle {
  /* unspecified */
};

template<typename T, typename E>
requires std::same_as<detail::underlying_type_ex_t<T>, detail::underlying_type_ex_t<E>>
constexpr auto acquire_underlying_type_access(E *p, std::size_t n = 1) -> underlying_type_access_handle<T, E>;

template<typename T, typename E>
constexpr void release_underlying_type_access(underlying_type_access_handle<T, E> &);

For prototyping purposes, the acquire operation allocates and initializes new T's and the release operation deallocates them (after copying the T elements back over the E elements to preserve mutations when E is non-const). Such an implementation is poor quality from a performance standpoint (and therefore what we want to avoid), but serves as an interesting model for what we are trying to accomplish since it 1) works for both constant and non-constant evaluation, 2) avoids aliasing concerns, and 3) models what programmers have to do today to avoid UB and support constant evaluation.

In order to achieve performance goals, additional behavior needs to be ascribed to these functions. While it would remain UB for the objects to be accessed by the "wrong" type in between the acquire and release operations, I think we can resolve the aliasing concerns at the point of the calls with something like this:

  • For acquire_underlying_type_access:
    Behaves as though an object of type E of unknown provenance has been read (thus ensuring that the objects (of type E) in the range [p, p+n) have all been stored).
  • For release_underlying_type_access:
    Behaves as though an object of type T of unknown provenance has been read (thus ensuring that the objects (of type T) in the range [p, p+n) have all been stored) followed by a write to an object of type E of unknown provenance (thus ensuring that future reads of objects in the range [p, p+n) will force a load).

(It might be simpler to describe the semantics in terms of a copy, but the overlapping storage makes that difficult).

This doesn't address questions of how implementations would detect UB during constant evaluation.

These interfaces do not help to avoid UB in cases like these:

void j(const char*, const char*);
void k(const char8_t *c8p, std::size_t n) {
  j(borrow_as<char>(c8p, n), borrow_as<char>(c8p, n)); // UB, the same range is borrowed multiple times in the same full expression.
}

void m(const char8_t *c8p, std::size_t n) {
  const char *cp = borrow_as<char>(c8p, n);
  cp[0]; // UB, the borrowed objects have already been released.

  {
    const auto &borrowed = borrow_as<char>(c8p, n);
    const char* cp2 = borrowed;
    cp2[0]; // Ok.
    c8p[0]; // UB, the borrowed objects have not yet been released.
  }
  c8p[0]; // Ok.
}

Your thoughts welcome.


Thanks for your work Tom
This was covered in the paper and considered.

I think it is *less* safe - Or rather, it feels more safe than it is.
Such an interface cannot guarantee unique access, so, as your examples show, it is too easy to use such an object as a temporary that does not outlive the pointer obtained from it, to store the """borrowed""" pointer somewhere, etc.
The only benefit is that *if* used correctly, destruction of the buffer can happen correctly. But I think we can handle that case at its core.
There is no construct that could enforce unique access to a memory region.
If we can't make it safe, I don't know if there is value in making it look easier to use than it is.

Thank you for your comments.

I acknowledge the lifetime concerns and I agree that there is no mechanism to ensure consistent access by the right type. But there is a more fundamental concern that this approach addresses that the interfaces proposed in P2626R0 do not. Consider the following:

void n(char8_t *c8p, std::size_t n) {
  c8p[0] = '1';
  char *cp = use_as<char>(c8p, n);
  cp[0] = '2';
  ...
  c8p[0] == '2'; // True or false? What triggers reification with the write via cp[0]?
}

Note that a write via the converted type is not required in order for problems to occur in implementations that use TBAA:

void p(const char8_t *c8p, std::size_t n) {
  const char *cp = use_as<const char>(c8p, n);
  cp[0] == '1'; // True or false? What ensures the write to c8a[0] in q() has resulted in a store?
}
void q() {
  char8_t c8a[] = u8"text";
  c8a[0] = '1';
  p(c8a, sizeof(c8a));
}

The semantics I described above are intended to denote barriers that ensure that stores have completed such that future loads by the "other" type will load the "current" value. The lifetime concerns may suggest that a different syntax would be preferred; perhaps some form of block syntax:

void r(const char8_t *c8p, std::size_t n) {
  c8p[0]; // Ok.
  borrow(c8p, n) as const char *cp {
    cp[0]; // Ok.
    c8p[0]; // UB, the borrowed objects have not yet been released.
  }
  c8p[0]; // Ok.
}

(Note and apology to Jens, because I've done this in the past, he has corrected me before, and I forgot until now: I used discarded-value expression statements like c8p[0]; above and annotated them as UB because they "access" an object using the wrong type. The problem is that no access occurs, because no lvalue-to-rvalue conversion is applied in these cases. Please pretend that they are all volatile-qualified.)

Tom.


Thanks, 

corentin
 

Tom.
