We don't actually promise that an array of char{N}_t is in UTF-N, just that UTF-N is the associated encoding. It could easily still be complete nonsense. APIs that take a char8_t* or a std::u8string are asking for trouble if they have a well-formedness precondition. Dropped bytes are just far too common.

On Fri, Jan 29, 2021 at 4:27 AM Corentin via SG16 <sg16@lists.isocpp.org> wrote:


On Fri, Jan 29, 2021 at 9:39 AM Jens Maurer <Jens.Maurer@gmx.net> wrote:
On 29/01/2021 08.20, Tom Honermann wrote:
> On 1/28/21 1:57 PM, Jens Maurer via SG16 wrote:
>> On 28/01/2021 19.37, Corentin via SG16 wrote:
>>> On Thu, Jan 28, 2021 at 7:22 PM Peter Brett <pbrett@cadence.com> wrote:
>>>
>>>     I think the big problem here is trying to make it a template.
>>>
>>>     Make it named.  It’s literally not possible to use this correctly in generic code.
>>>
>>>
>>> Question then is do we want to solve the issue for wchar_t?
>>> Because having the name of the encoding in the function kinda precludes that, since sizeof(wchar_t) is platform-dependent.
>> You only get away with  char* -> char8_t* because "char" has special
>> aliasing exceptions.
>>
>> You'll get the full set of aliasing concerns for
>>   wchar_t* -> char16_t* or char32_t*
>
> I think what we're looking for is a portable solution for this ICU hack <https://github.com/unicode-org/icu/blob/master/icu4c/source/common/unicode/char16ptr.h#L30-L36> (generalized to make it work for [unsigned] char* conversion to char8_t*); the goal being to enable some form of explicit restricted pointer interconvertibility between same sized/aligned types.

Yes.

> I don't understand the ICU hack sufficiently well to relate it to a memory or object model.  I'm also not sure that it actually works (though it may suffice for the scenarios that are encountered in practice).
>
> Perhaps something like this would suffice.
>
>     template<typename To, typename From>
>     requires requires {
>         requires std::is_trivial_v<To>;
>         requires std::is_trivial_v<From>;
>         requires sizeof(To) == sizeof(From);
>         requires alignof(To) == alignof(From);
>     }
>     To* alias_barrier_cast(From *p) {
>         asm volatile("" : : "rm"(p) : "memory");
>         return reinterpret_cast<To*>(p);
>     }

From a C++ memory model perspective, there is no difference between,
say, char16_t and short or int: They form their own aliasing domain.
Converting the pointer with reinterpret_cast or something is NOT
the problem; the problem is accessing the data before and after.
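
For contrast, a minimal sketch of the one approach that stays inside today's aliasing rules: copy the bytes instead of reinterpreting them in place. This is not the zero-copy cast the thread is looking for, only an illustration of why the access, not the pointer conversion, is the issue. The helper name copy_as_char16 is hypothetical; std::uint_least16_t is used as the source type because it is the underlying type of char16_t.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical helper: copies `n` 16-bit code units stored as
// std::uint_least16_t into a std::u16string. The source is never read
// through a char16_t lvalue; std::memcpy copies the object representation,
// so no aliasing rule is violated.
inline std::u16string copy_as_char16(const std::uint_least16_t* src,
                                     std::size_t n) {
    static_assert(sizeof(std::uint_least16_t) == sizeof(char16_t),
                  "char16_t has the same size as its underlying type");
    std::u16string out(n, u'\0');
    if (n != 0) {
        std::memcpy(&out[0], src, n * sizeof(char16_t));
    }
    return out;
}
```

The cost is an allocation and a copy, which is what an in-place solution would need to avoid. Note that the copy says nothing about whether the resulting code units are well-formed UTF-16.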

The solution I'm thinking of is that the data can only be 
accessed through the returned pointer after the function call.
Do you think that is more workable?
 

There have been papers in the past that attempted to bless
regions of memory with a different data type (e.g. to deal with
mmapped file data); I think such a direction might be worthwhile
to investigate.

I certainly don't want to deal with a "solution" that covers
char8_t / char16_t / char32_t only, if the underlying concerns
are also applicable elsewhere.

Even if there were a use case for a generalized solution, we would still need to do something specific for this, as we have an additional
precondition, namely that the input is a well-formed sequence of UTF code units.
 

Jens
--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16