Date: Fri, 29 Jan 2021 14:56:23 -0500
We don't actually promise that an array of char{N}_t is in UTF-N. Just that
is the associated encoding. It could easily still be complete nonsense.
APIs that take a char8_t* or a std::u8string are asking for trouble if they
have a well-formed precondition. Dropped bytes are just far too common.
On Fri, Jan 29, 2021 at 4:27 AM Corentin via SG16 <sg16_at_[hidden]>
wrote:
>
>
> On Fri, Jan 29, 2021 at 9:39 AM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>
>> On 29/01/2021 08.20, Tom Honermann wrote:
>> > On 1/28/21 1:57 PM, Jens Maurer via SG16 wrote:
>> >> On 28/01/2021 19.37, Corentin via SG16 wrote:
>> >>> On Thu, Jan 28, 2021 at 7:22 PM Peter Brett <pbrett_at_[hidden]
>> <mailto:pbrett_at_[hidden]>> wrote:
>> >>>
>> >>> I think the big problem here is trying to make it a template.____
>> >>>
>> >>> __ __
>> >>>
>> >>> Make it named. It’s literally not possible to use this correctly
>> in generic code.
>> >>>
>> >>>
>> >>> Question then is do we want to solve the issue for wchar_t?
>> >>> Because having the name of the encoding in the function kinda
>> precludes that - the sizeof(wchar_t) being platform dependant
>> >> You only get away with char* -> char8_t* because "char" has special
>> >> aliasing exceptions.
>> >>
>> >> You'll get the full set of aliasing concerns for
>> >> wchar_t* -> char16_t* or char32_t*
>> >
>> > I think what we're looking for is a portable solution for this ICU hack
>> <
>> https://github.com/unicode-org/icu/blob/master/icu4c/source/common/unicode/char16ptr.h#L30-L36>
>> (generalized to make it work for [unsigned] char* conversion to char8_t*);
>> the goal being to enable some form of explicit restricted pointer
>> interconvertibility between same sized/aligned types.
>>
>> Yes.
>>
>> > I don't understand the ICU hack sufficiently well to relate it to a
>> memory or object model. I'm also not sure that it actually works (though
>> it may suffice for the scenarios that are encountered in practice).
>> >
>> > Perhaps something like this would suffice.
>> >
>> > template<typename To, typename From>
>> > requires requires {
>> > requires std::is_trivial_v<To>;
>> > requires std::is_trivial_v<From>;
>> > requires sizeof(To) == sizeof(From);
>> > requires alignof(To) == alignof(From);
>> > }
>> > To* alias_barrier_cast(From *p) {
>> > asm volatile("" : : "rm"(p) : "memory");
>> > return reinterpret_cast<To*>(p);
>> > }
>>
>> From a C++ memory model perspective, there is no difference between,
>> say, char16_t and short or int: They form their own aliasing domain.
>> Converting the pointer with reinterpret_cast or something is NOT
>> the problem; the problem is accessing the data before and after.
>>
>
> The solution I'm thinking of is that the data can only be
> accessed through the returned pointer after the function call.
> Do you think that is more workable?
>
>
>>
>> There have been papers in the past that attempted to bless
>> regions of memory with a different data type (e.g. to deal with
>> mmapped file data); I think such a direction might be worthwhile
>> to investigate.
>
>
>> I certainly don't want to deal with a "solution" that covers
>> char8_t / char16_t / char32_t only, if the underlying concerns
>> are also applicable elsewhere.
>>
>
> Even if there was a use case for a generalized solution, we need to do
> something specific for this as we have additional
> preconditions, namely that the input is a well-formed sequence of utf code
> units.
>
>
>>
>> Jens
>>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
is the associated encoding. It could easily still be complete nonsense.
APIs that take a char8_t* or a std::u8string are asking for trouble if they
have a well-formed precondition. Dropped bytes are just far too common.
On Fri, Jan 29, 2021 at 4:27 AM Corentin via SG16 <sg16_at_[hidden]>
wrote:
>
>
> On Fri, Jan 29, 2021 at 9:39 AM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>
>> On 29/01/2021 08.20, Tom Honermann wrote:
>> > On 1/28/21 1:57 PM, Jens Maurer via SG16 wrote:
>> >> On 28/01/2021 19.37, Corentin via SG16 wrote:
>> >>> On Thu, Jan 28, 2021 at 7:22 PM Peter Brett <pbrett_at_[hidden]
>> <mailto:pbrett_at_[hidden]>> wrote:
>> >>>
>> >>> I think the big problem here is trying to make it a template.____
>> >>>
>> >>> __ __
>> >>>
>> >>> Make it named. It’s literally not possible to use this correctly
>> in generic code.
>> >>>
>> >>>
>> >>> Question then is do we want to solve the issue for wchar_t?
>> >>> Because having the name of the encoding in the function kinda
>> precludes that - the sizeof(wchar_t) being platform dependant
>> >> You only get away with char* -> char8_t* because "char" has special
>> >> aliasing exceptions.
>> >>
>> >> You'll get the full set of aliasing concerns for
>> >> wchar_t* -> char16_t* or char32_t*
>> >
>> > I think what we're looking for is a portable solution for this ICU hack
>> <
>> https://github.com/unicode-org/icu/blob/master/icu4c/source/common/unicode/char16ptr.h#L30-L36>
>> (generalized to make it work for [unsigned] char* conversion to char8_t*);
>> the goal being to enable some form of explicit restricted pointer
>> interconvertibility between same sized/aligned types.
>>
>> Yes.
>>
>> > I don't understand the ICU hack sufficiently well to relate it to a
>> memory or object model. I'm also not sure that it actually works (though
>> it may suffice for the scenarios that are encountered in practice).
>> >
>> > Perhaps something like this would suffice.
>> >
>> > template<typename To, typename From>
>> > requires requires {
>> > requires std::is_trivial_v<To>;
>> > requires std::is_trivial_v<From>;
>> > requires sizeof(To) == sizeof(From);
>> > requires alignof(To) == alignof(From);
>> > }
>> > To* alias_barrier_cast(From *p) {
>> > asm volatile("" : : "rm"(p) : "memory");
>> > return reinterpret_cast<To*>(p);
>> > }
>>
>> From a C++ memory model perspective, there is no difference between,
>> say, char16_t and short or int: They form their own aliasing domain.
>> Converting the pointer with reinterpret_cast or something is NOT
>> the problem; the problem is accessing the data before and after.
>>
>
> The solution I'm thinking of is that the data can only be
> accessed through the returned pointer after the function call.
> Do you think that is more workable?
>
>
>>
>> There have been papers in the past that attempted to bless
>> regions of memory with a different data type (e.g. to deal with
>> mmapped file data); I think such a direction might be worthwhile
>> to investigate.
>
>
>> I certainly don't want to deal with a "solution" that covers
>> char8_t / char16_t / char32_t only, if the underlying concerns
>> are also applicable elsewhere.
>>
>
> Even if there was a use case for a generalized solution, we need to do
> something specific for this as we have additional
> preconditions, namely that the input is a well-formed sequence of utf code
> units.
>
>
>>
>> Jens
>>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
Received on 2021-01-29 13:56:37