sg16: Re: [SG16] Reinterpreting pointers of character types

From: Corentin <corentin.jabot_at_[hidden]>
Date: Fri, 29 Jan 2021 10:27:16 +0100

On Fri, Jan 29, 2021 at 9:39 AM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:

> On 29/01/2021 08.20, Tom Honermann wrote:
> > On 1/28/21 1:57 PM, Jens Maurer via SG16 wrote:
> >> On 28/01/2021 19.37, Corentin via SG16 wrote:
> >>> On Thu, Jan 28, 2021 at 7:22 PM Peter Brett <pbrett_at_[hidden]> wrote:
> >>>
> >>> I think the big problem here is trying to make it a template.
> >>>
> >>> Make it named. It’s literally not possible to use this correctly
> >>> in generic code.
> >>>
> >>>
> >>> Question then is do we want to solve the issue for wchar_t?
> >>> Because having the name of the encoding in the function kinda
> >>> precludes that, since sizeof(wchar_t) is platform dependent.
> >> You only get away with char* -> char8_t* because "char" has special
> >> aliasing exceptions.
> >>
> >> You'll get the full set of aliasing concerns for
> >> wchar_t* -> char16_t* or char32_t*
> >
> > I think what we're looking for is a portable solution for this ICU hack
> > <https://github.com/unicode-org/icu/blob/master/icu4c/source/common/unicode/char16ptr.h#L30-L36>
> > (generalized to make it work for [unsigned] char* conversion to char8_t*);
> > the goal being to enable some form of explicit restricted pointer
> > interconvertibility between same sized/aligned types.
>
> Yes.
>
> > I don't understand the ICU hack sufficiently well to relate it to a
> > memory or object model. I'm also not sure that it actually works (though
> > it may suffice for the scenarios that are encountered in practice).
> >
> > Perhaps something like this would suffice.
> >
> > template<typename To, typename From>
> > requires requires {
> >     requires std::is_trivial_v<To>;
> >     requires std::is_trivial_v<From>;
> >     requires sizeof(To) == sizeof(From);
> >     requires alignof(To) == alignof(From);
> > }
> > To* alias_barrier_cast(From *p) {
> >     asm volatile("" : : "rm"(p) : "memory");
> >     return reinterpret_cast<To*>(p);
> > }
>
> From a C++ memory model perspective, there is no difference between,
> say, char16_t and short or int: They form their own aliasing domain.
> Converting the pointer with reinterpret_cast or something is NOT
> the problem; the problem is accessing the data before and after.
>

The solution I'm thinking of is that the data can only be
accessed through the returned pointer after the function call.
Do you think that is more workable?

>
> There have been papers in the past that attempted to bless
> regions of memory with a different data type (e.g. to deal with
> mmapped file data); I think such a direction might be worthwhile
> to investigate.

> I certainly don't want to deal with a "solution" that covers
> char8_t / char16_t / char32_t only, if the underlying concerns
> are also applicable elsewhere.
>

Even if there were a use case for a generalized solution, we would still
need something specific here, since we have an additional precondition:
the input must be a well-formed sequence of UTF code units.
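
To make that precondition concrete for the UTF-8 case, here is a rough
sketch of the check a caller would be asserting. The name
is_well_formed_utf8 is made up for illustration and is not part of any
proposal; the byte ranges follow the Unicode Standard's table of
well-formed UTF-8 byte sequences. It works on unsigned char so that it
does not itself depend on the aliasing question.

```cpp
#include <cstddef>

// Hypothetical helper (illustration only): returns true if the n bytes
// starting at s form a well-formed UTF-8 code unit sequence.
bool is_well_formed_utf8(const unsigned char* s, std::size_t n) {
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = s[i];
        if (lead <= 0x7F) {  // single-byte (ASCII) sequence
            ++i;
            continue;
        }

        std::size_t len = 0;                 // total length of this sequence
        unsigned char lo = 0x80, hi = 0xBF;  // allowed range of the 2nd byte
        if (lead >= 0xC2 && lead <= 0xDF) {
            len = 2;                         // C0 and C1 would be overlong
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            len = 3;
            if (lead == 0xE0) lo = 0xA0;     // reject overlong 3-byte forms
            if (lead == 0xED) hi = 0x9F;     // reject surrogates D800..DFFF
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            len = 4;
            if (lead == 0xF0) lo = 0x90;     // reject overlong 4-byte forms
            if (lead == 0xF4) hi = 0x8F;     // reject values above U+10FFFF
        } else {
            return false;                    // stray continuation or invalid lead
        }

        if (n - i < len) return false;       // truncated sequence
        if (s[i + 1] < lo || s[i + 1] > hi) return false;
        for (std::size_t k = 2; k < len; ++k) {
            if (s[i + k] < 0x80 || s[i + k] > 0xBF) return false;
        }
        i += len;
    }
    return true;
}
```

Whether such a check would be an actual runtime validation or only a
documented precondition is a separate design question; the sketch only
shows what "well-formed" means at the code unit level.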

>
> Jens
>

Received on 2021-01-29 03:27:29