sg16: Re: [SG16] Reinterpreting pointers of character types

From: Tom Honermann <tom_at_[hidden]>
Date: Fri, 29 Jan 2021 14:00:34 -0500

On 1/29/21 3:39 AM, Jens Maurer wrote:
> On 29/01/2021 08.20, Tom Honermann wrote:
>> On 1/28/21 1:57 PM, Jens Maurer via SG16 wrote:
>>> On 28/01/2021 19.37, Corentin via SG16 wrote:
>>>> On Thu, Jan 28, 2021 at 7:22 PM Peter Brett <pbrett_at_[hidden]> wrote:
>>>>
>>>> I think the big problem here is trying to make it a template.
>>>>
>>>> Make it named. It’s literally not possible to use this correctly in generic code.
>>>>
>>>>
>>>> Question then is: do we want to solve the issue for wchar_t?
>>>> Because having the name of the encoding in the function kinda precludes that - sizeof(wchar_t) being platform dependent
>>> You only get away with char* -> char8_t* because "char" has special
>>> aliasing exceptions.
>>>
>>> You'll get the full set of aliasing concerns for
>>> wchar_t* -> char16_t* or char32_t*
>> I think what we're looking for is a portable solution for this ICU hack <https://github.com/unicode-org/icu/blob/master/icu4c/source/common/unicode/char16ptr.h#L30-L36> (generalized to make it work for [unsigned] char* conversion to char8_t*); the goal being to enable some form of explicit restricted pointer interconvertibility between same sized/aligned types.
> Yes.
>
>> I don't understand the ICU hack sufficiently well to relate it to a memory or object model. I'm also not sure that it actually works (though it may suffice for the scenarios that are encountered in practice).
>>
>> Perhaps something like this would suffice.
>>
>> template<typename To, typename From>
>>   requires requires {
>>     requires std::is_trivial_v<To>;
>>     requires std::is_trivial_v<From>;
>>     requires sizeof(To) == sizeof(From);
>>     requires alignof(To) == alignof(From);
>>   }
>> To* alias_barrier_cast(From *p) {
>>   asm volatile("" : : "rm"(p) : "memory");
>>   return reinterpret_cast<To*>(p);
>> }
> From a C++ memory model perspective, there is no difference between,
> say, char16_t and short or int: They form their own aliasing domain.
> Converting the pointer with reinterpret_cast or something is NOT
> the problem; the problem is accessing the data before and after.

Yes and, correct me if I'm mistaken, the concern is more the C++ object
model.

We want to address this:

    void new_school(const char8_t *pc8) {
       *pc8; // UB if pc8 does not point to a char8_t object.
    }
    void old_school() {
       const char *text = "UTF-8 encoded text";
       new_school(reinterpret_cast<const char8_t*>(text)); // Results in UB within new_school().
    }

The issues are more complicated than this, though, as I'll note in a
reply to Corentin.

>
> There have been papers in the past that attempted to bless
> regions of memory with a different data type (e.g. to deal with
> mmapped file data); I think such a direction might be worthwhile
> to investigate.
Are you thinking of P0137 <https://wg21.link/p0137> and related
std::launder() papers and P0593 <https://wg21.link/p0593>? Do you know
of other papers?
>
> I certainly don't want to deal with a "solution" that covers
> char8_t / char16_t / char32_t only, if the underlying concerns
> are also applicable elsewhere.

Likewise.

Tom.

Received on 2021-01-29 13:00:38