Re: An alternative interface for P2626R0 (charN_t incremental adoption: Casting pointers of UTF character types)

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 14 Sep 2022 20:06:45 -0400
On 9/14/22 6:59 PM, Tom Honermann via SG16 wrote:
> On 9/14/22 3:31 PM, Corentin Jabot via SG16 wrote:
>>
>>
>> On Wed, Sep 14, 2022 at 8:02 PM Tom Honermann via SG16
>> <sg16_at_[hidden]> wrote:
>>
>> I had hoped to send this days ago, but alas. I don't have
>> expectations of this being read before today's meeting.
>>
>> P2626R0 <https://wg21.link/p2626r0> proposes a set of functions
>> to cast between character types. These functions are listed below
>> (the /see below/ describes a return type that is a pointer to a
>> type that preserves CV qualifiers from the input).
>>
>> template <class From>
>> constexpr /see below/ cast_as_utf_unchecked(From* ptr, size_t
>> n) noexcept;
>> template <class To, class From>
>> constexpr /see below/ cast_utf_to(From* ptr, size_t n) noexcept;
>> template <class T>
>> constexpr /see below/
>> cast_as_utf_unchecked(basic_string_view<T> && v) noexcept;
>> template <class To, class From>
>> constexpr auto cast_utf_to(basic_string_view<From> && v)
>> noexcept;
>>
>> It is well acknowledged that these functions have sharp edges
>> that are inherent in what the paper seeks to accomplish and that
>> there is no solution that won't have sharp edges. However, since
>> these interfaces are not intended to be expert only (non-expert
>> programmers are expected to use them), I'd like to try to dull
>> the edges somewhat.
>>
>> Our August 24th discussion
>> <https://github.com/sg16-unicode/sg16-meetings#august-24th-2022>
>> and prior mailing list discussion
>> <https://lists.isocpp.org/sg16/2022/08/3354.php> of the paper
>> raised concerns that it may be important, from a core language
>> perspective, that object type conversions performed using the
>> proposed utilities be undone prior to object destruction in order
>> to avoid the possibility of, for example, an object of type
>> char8_t being destructed as an object of type char (particularly
>> in constant evaluation). We have not yet concluded discussion
>> regarding these concerns, but regardless, I think a scoped
>> mechanism to limit the duration under which objects of one type
>> are accessed as objects of another type would be beneficial. For
>> example (using iterator+size interfaces to avoid questions
>> regarding destruction of subobjects):
>>
>> void f(const char*, std::size_t);
>> void g(const char8_t*, std::size_t);
>> void h(const char8_t *c8p, std::size_t n) {
>> f(cast_utf_to<char>(c8p, n), n);
>> g(c8p, n); // UB, c8p still points to converted-to-char
>> // objects.
>> }
>>
>> I've been playing with an alternate RAII-based library interface
>> that would help to avoid the problem above by enabling scoped
>> access from a compatible type. The high level interfaces are:
>>
>> template<typename T, typename R>
>> requires std::ranges::contiguous_range<R>
>> && std::ranges::sized_range<R>
>> && std::ranges::view<R>
>> class borrowed_object_view {
>> public:
>> using borrowed_element_type = /* T-with-cv-qualifiers-from-R */;
>> using borrowed_view_type = /* subrange of
>> T-with-cv-qualifiers-from-R and extent of R */;
>>
>> constexpr operator borrowed_view_type() const;
>> constexpr operator borrowed_element_type*() const;
>> };
>>
>> template<typename T, typename R>
>> requires std::ranges::contiguous_range<R>
>> && std::ranges::sized_range<R>
>> && std::ranges::borrowed_range<R>
>> constexpr auto borrow_as(R &&r) -> borrowed_object_view<T, R>;
>>
>> template<typename T, typename E>
>> constexpr auto borrow_as(E *p, std::size_t n) ->
>> borrowed_object_view<T, std::span<E>>;
>>
>> With these, the example above becomes:
>>
>> void h(const char8_t *c8p, std::size_t n) {
>> f(borrow_as<char>(c8p, n), n);
>> g(c8p, n); // Ok.
>> }
>>
>> A prototype implementation is available at
>> https://godbolt.org/z/jE7qxTe7b and it passes tests using both
>> gcc and clang. The prototype of the above functions and class
>> (defined in <borrow>) implements the borrowing functionality
>> using the following interfaces (defined in
>> <underlying_type_access>) where underlying_type_ex_t extends
>> std::underlying_type to specify underlying types for the various
>> character types. Some basic tests are present (in example.cpp).
>> The tests are annotated with a few cases that are considered UB
>> (and these are not detected during constant evaluation).
>>
>> template<typename T, typename E>
>> requires std::same_as<detail::underlying_type_ex_t<T>,
>> detail::underlying_type_ex_t<E>>
>> struct underlying_type_access_handle {
>> /* unspecified */
>> };
>>
>> template<typename T, typename E>
>> requires std::same_as<detail::underlying_type_ex_t<T>,
>> detail::underlying_type_ex_t<E>>
>> constexpr auto acquire_underlying_type_access(E *p,
>> std::size_t n = 1) -> underlying_type_access_handle<T, E>;
>>
>> template<typename T, typename E>
>> constexpr void
>> release_underlying_type_access(underlying_type_access_handle<T,
>> E> &);
>>
>> For prototyping purposes, the acquire operation allocates and
>> initializes new T's and the release operation deallocates them
>> (after copying the T elements back over the E elements to
>> preserve mutations when E is non-const). Such an implementation
>> is poor quality from a performance standpoint (and therefore what
>> we want to avoid), but serves as an interesting model for what we
>> are trying to accomplish since it 1) works for both constant and
>> non-constant evaluation, 2) avoids aliasing concerns, and 3)
>> models what programmers have to do today to avoid UB and support
>> constant evaluation.
>>
>> In order to achieve performance goals, additional behavior needs
>> to be ascribed to these functions. While it would remain UB for
>> the objects to be accessed by the "wrong" type in between the
>> acquire and release operations, I think we can resolve the
>> aliasing concerns at the point of the calls with something like this:
>>
>> * For acquire_underlying_type_access:
>> Behaves as though an object of type E of unknown provenance
>> has been read (thus ensuring that the objects (of type E) in
>> the range [p, p+n) have all been stored).
>> * For release_underlying_type_access:
>> Behaves as though an object of type T of unknown provenance
>> has been read (thus ensuring that the objects (of type T) in
>> the range [p, p+n) have all been stored) followed by a write to
>> an object of type E of unknown provenance (thus ensuring that
>> future reads of objects in the range [p, p+n) will force a load).
>>
>> (It might be simpler to describe the semantics in terms of a
>> copy, but the overlapping storage makes that difficult).
>>
>> This doesn't address questions of how implementations would
>> detect UB during constant evaluation.
>>
>> These interfaces do not help to avoid UB in cases like these:
>>
>> void j(const char*, const char*);
>> void k(const char8_t *c8p, std::size_t n) {
>> j(borrow_as<char>(c8p, n), borrow_as<char>(c8p, n)); // UB,
>> // the same range is borrowed multiple times in the same full
>> // expression.
>> }
>>
>> void m(const char8_t *c8p, std::size_t n) {
>> const char *cp = borrow_as<char>(c8p, n);
>> cp[0]; // UB, the borrowed objects have already been released.
>>
>> {
>> const auto &borrowed = borrow_as<char>(c8p, n);
>> const char* cp2 = borrowed;
>> cp2[0]; // Ok.
>> c8p[0]; // UB, the borrowed objects have not yet been released.
>> }
>> c8p[0]; // Ok.
>> }
>>
>> Your thoughts welcome.
>>
>>
>> Thanks for your work Tom
>> This was covered in the paper and considered.
>>
>> I think it is *less* safe - Or rather, it feels more safe than it is.
>> Such an interface cannot guarantee unique access so, as your examples
>> show, it is too easy to use such an object as a temporary that
>> doesn't outlive the lifetime of the pointer, store the """borrowed"""
>> pointer somewhere, etc.
>> The only benefit is that *if* used correctly, destruction of the
>> buffer can happen correctly. But I think we can handle that case at
>> its core.
>> There is no construct that could enforce unique access to a memory
>> region.
>> If we can't make it safe, I don't know if there is value in making it
>> look easier to use than it is.
>
> Thank you for your comments.
>
> I acknowledge the lifetime concerns and I agree that there is no
> mechanism to ensure consistent access by the right type. But there is
> a more fundamental concern that this approach addresses that the
> interfaces proposed in P2626R0 do not. Consider the following:
>
> void n(char8_t *c8p, std::size_t n) {
> c8p[0] = '1';
> char *cp = use_as<char>(c8p, n);
> cp[0] = '2';
> ...
> c8p[0] == '2'; // True or false? What triggers reification with
> // the write via cp[0]?
> }
>
> Note that a write via the converted type is not required in order for
> problems to occur in implementations that use TBAA:
>
> void p(const char8_t *c8p, std::size_t n) {
> const char *cp = use_as<const char>(c8p, n);
> cp[0] == '1'; // True or false? What ensures the write to c8a[0]
> // in q() has resulted in a store?
> }
> void q() {
> char8_t c8a[] = u8"text";
> c8a[0] = '1';
> p(c8a, sizeof(c8a));
> }
>
Since char aliases everything, substitute e.g., wchar_t and char16_t for
char and char8_t when considering those last two examples.
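
As a rough sketch of that substitution, the two examples can be written
against the copy-based model described for the prototype (acquire copies
into new T's, release copies them back when E is non-const). The name
scoped_copy_borrow and both functions below are hypothetical and are not
part of P2626R0 or the prototype. The element-wise static_cast is also an
assumption: it sidesteps the same-underlying-type requirement, since
wchar_t and char16_t differ in size on many platforms.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical helper for illustration only; it is not part of P2626R0 or
// of the prototype. It models a borrow by copying: the constructor copies
// the E elements into freshly allocated T elements, and the destructor
// copies them back over the E elements when E is non-const.
template <typename T, typename E>
class scoped_copy_borrow {
    using value_type = std::remove_cv_t<T>;

public:
    scoped_copy_borrow(E* p, std::size_t n) : src_(p), copy_(n) {
        for (std::size_t i = 0; i < n; ++i)
            copy_[i] = static_cast<value_type>(p[i]);
    }
    ~scoped_copy_borrow() {
        if constexpr (!std::is_const_v<E>) {
            for (std::size_t i = 0; i < copy_.size(); ++i)
                src_[i] = static_cast<E>(copy_[i]);
        }
    }
    scoped_copy_borrow(const scoped_copy_borrow&) = delete;
    scoped_copy_borrow& operator=(const scoped_copy_borrow&) = delete;

    T* data() { return copy_.data(); }
    std::size_t size() const { return copy_.size(); }

private:
    E* src_;
    std::vector<value_type> copy_;
};

// The write-through example with char16_t and wchar_t substituted.
char16_t write_via_borrow(char16_t* c16p, std::size_t n) {
    c16p[0] = u'1';
    {
        scoped_copy_borrow<wchar_t, char16_t> borrowed(c16p, n);
        borrowed.data()[0] = L'2';
    }  // The destructor copies the wchar_t elements back here.
    return c16p[0];  // Observes the value written through the borrow.
}

// The read-only example with the same substitution.
wchar_t read_via_borrow(const char16_t* c16p, std::size_t n) {
    scoped_copy_borrow<const wchar_t, const char16_t> borrowed(c16p, n);
    return borrowed.data()[0];
}
```

Each type only ever accesses its own objects here, so no aliasing
question arises. The cost is the allocation and the two copies, which is
what the proposed interfaces are trying to avoid.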

Tom.

> The semantics I described above are intended to denote barriers that
> ensure that stores have completed such that future loads by the
> "other" type will load the "current" value. The lifetime concerns may
> suggest that a different syntax would be preferred; perhaps some form
> of block syntax:
>
> void r(const char8_t *c8p, std::size_t n) {
> c8p[0]; // Ok.
> borrow(c8p, n) as const char *cp {
> cp[0]; // Ok.
> c8p[0]; // UB, the borrowed objects have not yet been released.
> }
> c8p[0]; // Ok.
> }
>
> (note and apology to Jens because I've done this in the past and he
> has corrected me before and I forgot until now: I used discarded-value
> expression statements like c8p[0]; above that I annotated as UB
> because they "access" an object using the wrong type. The problem is
> that no access occurs because no lvalue-to-rvalue conversion occurs
> for these cases. Please pretend that they are all volatile qualified).
>
> Tom.
>
>>
>> Thanks,
>>
>> corentin
>>
>> Tom.
>>
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>
>>
>

Received on 2022-09-15 00:06:47