Re: An alternative interface for P2626R0 (charN_t incremental adoption: Casting pointers of UTF character types)

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Wed, 14 Sep 2022 21:31:45 +0200
On Wed, Sep 14, 2022 at 8:02 PM Tom Honermann via SG16 <
sg16_at_[hidden]> wrote:

> I had hoped to send this days ago, but alas. I don't have expectations of
> this being read before today's meeting.
>
> P2626R0 <https://wg21.link/p2626r0> proposes a set of functions to cast
> between character types. These functions are listed below (the *see below*
> describes a return type that is a pointer to a type that preserves CV
> qualifiers from the input).
>
> template <class From>
> constexpr *see below* cast_as_utf_unchecked(From* ptr, size_t n) noexcept;
> template <class To, class From>
> constexpr *see below* cast_utf_to(From* ptr, size_t n) noexcept;
> template <class T>
> constexpr *see below* cast_as_utf_unchecked(basic_string_view<T> && v)
> noexcept;
> template <class To, class From>
> constexpr auto cast_utf_to(basic_string_view<From> && v) noexcept;
>
> It is well acknowledged that these functions have sharp edges that are
> inherent in what the paper seeks to accomplish and that there is no
> solution that won't have sharp edges. However, since these interfaces are
> not intended to be expert only (non-expert programmers are expected to use
> them), I'd like to try to dull the edges somewhat.
>
> Our August 24th discussion
> <https://github.com/sg16-unicode/sg16-meetings#august-24th-2022> and
> prior mailing list discussion
> <https://lists.isocpp.org/sg16/2022/08/3354.php> of the paper raised
> concerns that it may be important, from a core language perspective, that
> object type conversions performed using the proposed utilities be undone
> prior to object destruction in order to avoid the possibility of, for
> example, an object of type char8_t being destructed as an object of type
> char (particularly in constant evaluation). We have not yet concluded
> discussion regarding these concerns, but regardless, I think a scoped
> mechanism to limit the duration under which objects of one type are
> accessed as objects of another type would be beneficial. For example (using
> iterator+size interfaces to avoid questions regarding destruction of
> subobjects):
>
> void f(const char*, std::size_t);
> void g(const char8_t*, std::size_t);
> void h(const char8_t *c8p, std::size_t n) {
> f(cast_utf_to<char>(c8p, n), n);
> g(c8p, n); // UB, c8p still points to converted-to-char objects.
> }
>
> I've been playing with an alternate RAII-based library interface that
> would help to avoid the problem above by enabling scoped access from a
> compatible type. The high level interfaces are:
>
> template<typename T, typename R>
> requires std::ranges::contiguous_range<R>
> && std::ranges::sized_range<R>
> && std::ranges::view<R>
> class borrowed_object_view {
> public:
> using borrowed_element_type = /* T-with-cv-qualifiers-from-R */;
> using borrowed_view_type = /* subrange of T-with-cv-qualifiers-from-R and extent of R */;
>
> constexpr operator borrowed_view_type() const;
> constexpr operator borrowed_element_type*() const;
> };
>
> template<typename T, typename R>
> requires std::ranges::contiguous_range<R>
> && std::ranges::sized_range<R>
> && std::ranges::borrowed_range<R>
> constexpr auto borrow_as(R &&r) -> borrowed_object_view<T, R>;
>
> template<typename T, typename E>
> constexpr auto borrow_as(E *p, std::size_t n) -> borrowed_object_view<T,
> std::span<E>>;
>
> With these, the example above becomes:
>
> void h(const char8_t *c8p, std::size_t n) {
> f(borrow_as<char>(c8p, n), n);
> g(c8p, n); // Ok.
> }
>
> A prototype implementation is available at https://godbolt.org/z/jE7qxTe7b
> and it passes tests using both gcc and clang. The prototype of the above
> functions and class (defined in <borrow>) implements the borrowing
> functionality using the following interfaces (defined in
> <underlying_type_access>) where underlying_type_ex_t extends
> std::underlying_type to specify underlying types for the various
> character types. Some basic tests are present (in example.cpp). The tests
> are annotated with a few cases that are considered UB (and these are not
> detected during constant evaluation).
>
> template<typename T, typename E>
> requires std::same_as<detail::underlying_type_ex_t<T>,
> detail::underlying_type_ex_t<E>>
> struct underlying_type_access_handle {
> /* unspecified */
> };
>
> template<typename T, typename E>
> requires std::same_as<detail::underlying_type_ex_t<T>,
> detail::underlying_type_ex_t<E>>
> constexpr auto acquire_underlying_type_access(E *p, std::size_t n = 1) ->
> underlying_type_access_handle<T, E>;
>
> template<typename T, typename E>
> constexpr void release_underlying_type_access(underlying_type_access_handle<T,
> E> &);
>
> For prototyping purposes, the acquire operation allocates and initializes
> new T's and the release operation deallocates them (after copying the T
> elements back over the E elements to preserve mutations when E is
> non-const). Such an implementation is poor quality from a performance
> standpoint (and therefore what we want to avoid), but serves as an
> interesting model for what we are trying to accomplish since it 1) works
> for both constant and non-constant evaluation, 2) avoids aliasing concerns,
> and 3) models what programmers have to do today to avoid UB and support
> constant evaluation.
>
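The copy-based model described above can be sketched roughly as follows. This is not the prototype's code: `copy_borrow`, `u8char`, and the two `demo_*` functions are hypothetical names introduced for illustration, the sketch assumes runtime evaluation only (no constexpr support), and `u8char` falls back to `unsigned char` when `char8_t` is unavailable.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <type_traits>

// Stand-in so the sketch also builds before C++20, where char8_t is absent.
#ifdef __cpp_char8_t
using u8char = char8_t;
#else
using u8char = unsigned char;
#endif

// Hypothetical name: scoped, copy-based model of the acquire/release pair.
// T is the unqualified type to borrow as; E is the (possibly const) source type.
template <typename T, typename E>
class copy_borrow {
    static_assert(sizeof(T) == sizeof(E), "sketch assumes equally sized types");

public:
    using element_type = std::conditional_t<std::is_const_v<E>, const T, T>;

    // "Acquire": allocate n new T objects and initialize them from the E range.
    copy_borrow(E* p, std::size_t n)
        : src_(p), n_(n), buf_(std::make_unique<T[]>(n)) {
        std::transform(p, p + n, buf_.get(),
                       [](E e) { return static_cast<T>(e); });
    }

    // "Release": copy mutations back when E is non-const; the unique_ptr
    // member then deallocates the T objects.
    ~copy_borrow() {
        if constexpr (!std::is_const_v<E>) {
            std::transform(buf_.get(), buf_.get() + n_, src_,
                           [](T t) { return static_cast<E>(t); });
        }
    }

    copy_borrow(const copy_borrow&) = delete;
    copy_borrow& operator=(const copy_borrow&) = delete;

    element_type* get() const { return buf_.get(); }

private:
    E* src_;
    std::size_t n_;
    std::unique_ptr<T[]> buf_;
};

// Mutations made through the char copy are written back on release.
inline bool demo_write_back() {
    u8char text[3] = {0x61, 0x62, 0x63};  // "abc"
    {
        copy_borrow<char, u8char> b(text, 3);
        b.get()[0] = 'X';  // mutate through the char view
    }                      // release copies 'X' back over text[0]
    return text[0] == u8char('X') && text[1] == u8char('b');
}

// A const source is only read; nothing is copied back.
inline bool demo_const_source() {
    const u8char text[3] = {0x61, 0x62, 0x63};
    copy_borrow<char, const u8char> b(text, 3);
    const char* cp = b.get();
    return cp[0] == 'a' && cp[2] == 'c';
}
```

Because the `char` objects live in separate storage, no object is ever accessed through the "wrong" type, which is why this model sidesteps the aliasing questions at the cost of an allocation and two copies.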
> In order to achieve performance goals, additional behavior needs to be
> ascribed to these functions. While it would remain UB for the objects to be
> accessed by the "wrong" type in between the acquire and release operations,
> I think we can resolve the aliasing concerns at the point of the calls with
> something like this:
>
> - For acquire_underlying_type_access:
> Behaves as though an object of type E of unknown provenance has been
> read (thus ensuring that the objects (of type E) in the range [p, p+n)
> have all been stored).
> - For release_underlying_type_access:
> Behaves as though an object of type T of unknown provenance has been
> read (thus ensuring that the objects (of type T) in the range [p, p+n)
> have all been stored) followed by a write to an object of type E of
> unknown provenance (thus ensuring that future reads of objects in the
> range [p, p+n) will force a load).
>
> (It might be simpler to describe the semantics in terms of a copy, but the
> overlapping storage makes that difficult).
>
> This doesn't address questions of how implementations would detect UB
> during constant evaluation.
>
> These interfaces do not help to avoid UB in cases like these:
>
> void j(const char*, const char*);
> void k(const char8_t *c8p, std::size_t n) {
> // UB, the same range is borrowed multiple times in the same full expression.
> j(borrow_as<char>(c8p, n), borrow_as<char>(c8p, n));
> }
>
> void m(const char8_t *c8p, std::size_t n) {
> const char *cp = borrow_as<char>(c8p, n);
> cp[0]; // UB, the borrowed objects have already been released.
>
> {
> const auto &borrowed = borrow_as<char>(c8p, n);
> const char* cp2 = borrowed;
> cp2[0]; // Ok.
> c8p[0]; // UB, the borrowed objects have not yet been released.
> }
> c8p[0]; // Ok.
> }
>
> Your thoughts welcome.
>

Thanks for your work, Tom.
This was considered and is covered in the paper.

I think it is *less* safe. Or rather, it feels safer than it is.
Such an interface cannot guarantee unique access so, as your examples show,
it is too easy to misuse: create the borrowing object as a temporary that
does not outlive the pointer obtained from it, store the "borrowed" pointer
somewhere, etc.
The only benefit is that *if* used correctly, destruction of the buffer can
happen correctly. But I think we can handle that case at its core.
There is no construct that could enforce unique access to a memory region.
If we can't make it safe, I don't know if there is value in making it look
easier to use than it is.
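
To make the temporary problem concrete, here is a minimal sketch. It is not the prototype's `borrowed_object_view`: `toy_borrow`, `u8char`, `active_borrows`, and the two helper functions are hypothetical names, and the guard only counts live borrows instead of changing any object's type. The pointer is never dereferenced after the guard dies, so the sketch itself has no UB; it only shows that nothing stops the pointer from outliving the guard.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in so the sketch also builds before C++20, where char8_t is absent.
#ifdef __cpp_char8_t
using u8char = char8_t;
#else
using u8char = unsigned char;
#endif

// Number of toy_borrow guards currently alive (illustration only).
int active_borrows = 0;

// Hypothetical guard with the same implicit pointer conversion as the
// interface under discussion.
class toy_borrow {
public:
    toy_borrow(const u8char* p, std::size_t) : p_(p) { ++active_borrows; }
    ~toy_borrow() { --active_borrows; }
    toy_borrow(const toy_borrow&) = delete;
    toy_borrow& operator=(const toy_borrow&) = delete;

    // Any object may be read through char, so this cast is harmless here.
    operator const char*() const { return reinterpret_cast<const char*>(p_); }

private:
    const u8char* p_;
};

// Returns true when the pointer outlived the guard that produced it.
inline bool pointer_escapes_guard(const u8char* c8p, std::size_t n) {
    const char* cp = toy_borrow(c8p, n);  // temporary guard dies at the ';'
    return cp != nullptr && active_borrows == 0;
}

// Returns true when a named guard keeps the borrow active for its scope.
inline bool named_guard_stays_active(const u8char* c8p, std::size_t n) {
    toy_borrow borrowed(c8p, n);
    const char* cp = borrowed;
    return cp != nullptr && active_borrows == 1;
}
```

Both functions compile without a diagnostic, and only the second keeps the borrow alive while the pointer is usable. The type system does not distinguish the two.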

Thanks,

corentin


> Tom.

Received on 2022-09-14 19:31:59