sg16: Re: [SG16] Reinterpreting pointers of character types

From: Corentin <corentin.jabot_at_[hidden]>
Date: Fri, 29 Jan 2021 20:57:59 +0100

On Fri, Jan 29, 2021 at 8:00 PM Tom Honermann <tom_at_[hidden]> wrote:

> On 1/29/21 4:27 AM, Corentin wrote:
>
>
>
> On Fri, Jan 29, 2021 at 9:39 AM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>
>> On 29/01/2021 08.20, Tom Honermann wrote:
>> > On 1/28/21 1:57 PM, Jens Maurer via SG16 wrote:
>> >> On 28/01/2021 19.37, Corentin via SG16 wrote:
>> >>> On Thu, Jan 28, 2021 at 7:22 PM Peter Brett <pbrett_at_[hidden]
>> <mailto:pbrett_at_[hidden]>> wrote:
>> >>>
>> >>> I think the big problem here is trying to make it a template.
>> >>>
>> >>> Make it named. It’s literally not possible to use this correctly
>> in generic code.
>> >>>
>> >>>
>> >>> Question then is do we want to solve the issue for wchar_t?
>> >>> Because having the name of the encoding in the function kinda
>> precludes that - the sizeof(wchar_t) being platform dependent
>> >> You only get away with char* -> char8_t* because "char" has special
>> >> aliasing exceptions.
>> >>
>> >> You'll get the full set of aliasing concerns for
>> >> wchar_t* -> char16_t* or char32_t*
>> >
>> > I think what we're looking for is a portable solution for this ICU hack
>> <
>> https://github.com/unicode-org/icu/blob/master/icu4c/source/common/unicode/char16ptr.h#L30-L36>
>> (generalized to make it work for [unsigned] char* conversion to char8_t*);
>> the goal being to enable some form of explicit restricted pointer
>> interconvertibility between same sized/aligned types.
>>
>> Yes.
>>
>> > I don't understand the ICU hack sufficiently well to relate it to a
>> memory or object model. I'm also not sure that it actually works (though
>> it may suffice for the scenarios that are encountered in practice).
>> >
>> > Perhaps something like this would suffice.
>> >
>> > template<typename To, typename From>
>> > requires requires {
>> >     requires std::is_trivial_v<To>;
>> >     requires std::is_trivial_v<From>;
>> >     requires sizeof(To) == sizeof(From);
>> >     requires alignof(To) == alignof(From);
>> > }
>> > To* alias_barrier_cast(From *p) {
>> >     asm volatile("" : : "rm"(p) : "memory");
>> >     return reinterpret_cast<To*>(p);
>> > }
>>
>> From a C++ memory model perspective, there is no difference between,
>> say, char16_t and short or int: They form their own aliasing domain.
>> Converting the pointer with reinterpret_cast or something is NOT
>> the problem; the problem is accessing the data before and after.
>>
>
> The solution I'm thinking of is that the data can only be
> accessed through the returned pointer after the function call.
> Do you think that is more workable?
>
> I was thinking in terms of a similar model as well, but if a returned
> pointer is required, then existing interfaces that don't return such a
> pointer won't be usable.
>
Are there any?
However, I've been thinking since then that some libraries may access
(read) the memory through the original pointer, which would be undefined
and the user wouldn't be able to tell.

> I've been trying to construct a test case that demonstrates that
> reinterpret_cast doesn't suffice, but I have so far been unable to coerce
> an optimizer into taking advantage of anti-aliasing rules. I've tried gcc
> and clang with -O3 and -fstrict-aliasing, but it looks like if the
> optimizer's escape analysis finds that an address escapes, regardless of by
> which type, then it assumes the object may have been modified (unless
> declared const). I've been trying with various types, not just ones that
> involve the aliasing friendly types.
>
> If anyone can suggest or identify a test that demonstrates that
> reinterpret_cast doesn't suffice, that might be helpful for testing other
> solutions.
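
A minimal sketch of the usual shape for such a test, assuming nothing beyond
standard C++11: two pointer parameters of different non-char types, a store
through each, then a reload through the first. The name alias_probe is
hypothetical and not from this thread. Because char16_t and std::uint16_t
are distinct types, an optimizer is permitted to assume the second store
cannot change *units and to return the first stored value without reloading
it. Whether a given compiler actually does so has to be checked in its
generated code; no specific result is claimed here.

```cpp
#include <cstdint>

// Hypothetical probe (not from the thread). Type-based alias analysis lets a
// compiler assume `units` and `raw` never refer to the same object, because
// char16_t and std::uint16_t are distinct, non-char types.
int alias_probe(char16_t* units, std::uint16_t* raw) {
    *units = u'a';   // store through the char16_t lvalue
    *raw = 0x62;     // store through the uint16_t lvalue
    return *units;   // may be folded to 0x61 without reloading from memory
}
```

Calling it with two distinct objects is well defined and returns 0x61.
Calling it with a reinterpret_cast'ed pointer to the same storage is the
undefined case under discussion, so it can only be studied by reading the
generated code, not by relying on what the program prints.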
>
>
>
>>
>> There have been papers in the past that attempted to bless
>> regions of memory with a different data type (e.g. to deal with
>> mmapped file data); I think such a direction might be worthwhile
>> to investigate.
>
>
>> I certainly don't want to deal with a "solution" that covers
>> char8_t / char16_t / char32_t only, if the underlying concerns
>> are also applicable elsewhere.
>>
>
> Even if there was a use case for a generalized solution, we need to do
> something specific for this as we have additional
> preconditions, namely that the input is a well-formed sequence of utf code
> units.
>
> I'm not sure where this requirement comes from. I think the only solution
> we need is one that permits access to a [unsigned] char object via a
> char8_t lvalue (in limited cases) without provoking UB. Whether the data
> is well-formed UTF-8 is an orthogonal concern that is more in the domain of
> contracts.
>

Yes, there are two different concerns here, and I'm not sure how
intertwined they are.
Hopefully, we can agree that none of us want to walk back on char8_t's
design.
I think we (you) made the right call in C++ not to make it alias char or
make it otherwise implicitly convertible.

And I think we can also agree that the intent of char8_t, char16_t,
char32_t is to store UTF data.
We shouldn't weaken that by saying that we put no precondition whatsoever
on conversion from char to char8_t.
Libraries and users dealing in char8_t should be allowed to assume that
char8_t is supposed to hold UTF-8 data; otherwise char8_t is just a weird
way to spell uint8_t.

If that's true, then casting any random bytes to a sequence of char8_t
would be very surprising for users.
Therefore, I think it's reasonable to put that extra condition in. I don't
think we want to give the impression that char and char8_t are
interchangeable or represent the same platonic values.

To be clear, I absolutely agree this is a very different concern from
how the conversion can be made to work in the memory model.

The two concerns then are:

- How do we make this work?
- What interface do we present to the user, and with what preconditions?

I had a very interesting chat with Peter this afternoon where he argued
that we should only provide this kind of cast_from_utfN_unchecked function
if there is also a cast_from_utfN function which would do validation.
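
A minimal sketch of what such a pair could look like for UTF-8, under
stated assumptions: the names cast_from_utf8, cast_from_utf8_unchecked,
is_well_formed_utf8 and utf8_unit, the signatures, and the choice to report
failure with a null pointer are all hypothetical and not a proposed
interface. The sketch only covers the validation and interface side. It does
not solve the memory-model side: reading through the returned pointer is
still not blessed by the standard today.

```cpp
#include <cstddef>

// Assumption for portability of the sketch only: fall back to unsigned char
// when the compiler does not provide char8_t (pre-C++20 modes).
#if defined(__cpp_char8_t)
using utf8_unit = char8_t;
#else
using utf8_unit = unsigned char;
#endif

// Hypothetical helper: true if [p, p + n) is a well-formed UTF-8 sequence
// (no overlong forms, no surrogates, nothing above U+10FFFF).
inline bool is_well_formed_utf8(const char* p, std::size_t n) {
    const unsigned char* s = reinterpret_cast<const unsigned char*>(p);
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = s[i];
        if (b <= 0x7F) { ++i; continue; }        // ASCII
        std::size_t len = 0;
        unsigned char lo = 0x80, hi = 0xBF;      // allowed range of 2nd byte
        if (b >= 0xC2 && b <= 0xDF) {
            len = 2;
        } else if (b >= 0xE0 && b <= 0xEF) {
            len = 3;
            if (b == 0xE0) lo = 0xA0;            // reject overlong forms
            if (b == 0xED) hi = 0x9F;            // reject surrogates
        } else if (b >= 0xF0 && b <= 0xF4) {
            len = 4;
            if (b == 0xF0) lo = 0x90;            // reject overlong forms
            if (b == 0xF4) hi = 0x8F;            // reject > U+10FFFF
        } else {
            return false;                        // stray trail or invalid lead
        }
        if (n - i < len) return false;           // truncated sequence
        if (s[i + 1] < lo || s[i + 1] > hi) return false;
        for (std::size_t k = 2; k < len; ++k)
            if (s[i + k] < 0x80 || s[i + k] > 0xBF) return false;
        i += len;
    }
    return true;
}

// Hypothetical unchecked variant. Precondition: the input is well-formed
// UTF-8. Converting the pointer is fine; accessing the data through the
// result is the part the current rules do not permit.
inline const utf8_unit* cast_from_utf8_unchecked(const char* p) {
    return reinterpret_cast<const utf8_unit*>(p);
}

// Hypothetical checked variant: validates first, returns nullptr on failure.
inline const utf8_unit* cast_from_utf8(const char* p, std::size_t n) {
    return is_well_formed_utf8(p, n) ? cast_from_utf8_unchecked(p) : nullptr;
}
```

With this shape, cast_from_utf8("abc", 3) yields a non-null pointer, while
an overlong encoding such as "\xC0\x80" or a truncated sequence yields
nullptr. A real interface might prefer std::optional, an error code, or a
contract violation over a null return.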

The conversion we are talking about, while necessary, is a giant foot gun
and I think we should treat it as such.
It would be very easily misused by people who don't fully understand
Unicode or how C++ handles text.

> Tom.
>

Received on 2021-01-29 13:58:12