Date: Thu, 25 Apr 2019 21:21:52 -0400
I agree that a scalar_value type would be useful for checked code, but I
think contracts are the right mechanism for stating the pre- and
postconditions. Exceptions, particularly in the middle of text processing,
would mostly insert unnecessary checks. Contracts would at least make
those checks optional.
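To make the contrast concrete, here is a minimal sketch of a checked
scalar value type. It is an illustration only, not the
std::unicode::scalar_value from the proposal: the class layout and the
names is_valid and make_scalar_value are assumptions, and assert stands
in for a contract precondition because it can likewise be compiled out.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical sketch of a strong type for Unicode scalar values.
class scalar_value {
public:
    // Scalar values are U+0000..U+D7FF and U+E000..U+10FFFF,
    // i.e. every code point except the surrogates.
    static constexpr bool is_valid(std::uint32_t v) noexcept {
        return v < 0xD800u || (v >= 0xE000u && v <= 0x10FFFFu);
    }

    // Precondition: is_valid(v). The assert models a contract check:
    // it is present in checked builds and removed under NDEBUG, so
    // trusted inner loops pay nothing for it.
    constexpr explicit scalar_value(std::uint32_t v) noexcept : value_(v) {
        assert(is_valid(v));
    }

    constexpr std::uint32_t value() const noexcept { return value_; }

private:
    std::uint32_t value_;
};

// Always-checked entry point for untrusted input. It reports failure
// through the return value instead of throwing.
inline std::optional<scalar_value> make_scalar_value(std::uint32_t v) noexcept {
    if (!scalar_value::is_valid(v)) {
        return std::nullopt;
    }
    return scalar_value{v};
}
```

Code that has already validated its input would use the constructor
directly; code at a trust boundary would go through make_scalar_value.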
I think Range and specializing on ContiguousRange will work, and span
should be a model of ContiguousRange.
Encode, Decode, and Transcode should really be in terms of std::byte
sequences, but with additional overloads for ergonomics.
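As a sketch of what "byte sequences plus ergonomic overloads" could look
like, the following encodes one scalar value as UTF-8. The name
encode_utf8 and both signatures are assumptions made for illustration,
not an interface from any proposal.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// Core overload, expressed in terms of std::byte.
// Precondition: cp is a Unicode scalar value (not checked here).
// Writes 1 to 4 bytes into out and returns how many were written.
inline std::size_t encode_utf8(std::uint32_t cp,
                               std::array<std::byte, 4>& out) noexcept {
    auto b = [](std::uint32_t x) { return static_cast<std::byte>(x); };
    if (cp < 0x80u) {
        out[0] = b(cp);
        return 1;
    }
    if (cp < 0x800u) {
        out[0] = b(0xC0u | (cp >> 6));
        out[1] = b(0x80u | (cp & 0x3Fu));
        return 2;
    }
    if (cp < 0x10000u) {
        out[0] = b(0xE0u | (cp >> 12));
        out[1] = b(0x80u | ((cp >> 6) & 0x3Fu));
        out[2] = b(0x80u | (cp & 0x3Fu));
        return 3;
    }
    out[0] = b(0xF0u | (cp >> 18));
    out[1] = b(0x80u | ((cp >> 12) & 0x3Fu));
    out[2] = b(0x80u | ((cp >> 6) & 0x3Fu));
    out[3] = b(0x80u | (cp & 0x3Fu));
    return 4;
}

// Ergonomic overload layered on the byte-based one.
inline std::string encode_utf8(std::uint32_t cp) {
    std::array<std::byte, 4> buf{};
    const std::size_t n = encode_utf8(cp, buf);
    std::string s;
    for (std::size_t i = 0; i < n; ++i) {
        s.push_back(static_cast<char>(std::to_integer<unsigned char>(buf[i])));
    }
    return s;
}
```

The std::string overload only adapts the result, so all encoding logic
stays in the byte-based function.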
On Thu, Apr 25, 2019 at 8:42 PM Lyberta <lyberta_at_[hidden]> wrote:
>
>
> Henri Sivonen:
> > At the turn of the year, I commented on Text_view on Slack, and Tom
> > Honermann asked me to write my comments up in long form to this
> > mailing list. I have now written my comments at
> > https://hsivonen.fi/non-unicode-in-cpp/ (also pasted below for on-list
> > quotability.) My apologies for it taking me this long to get this
> > written.
> I have a few comments:
>
> char32_t is too dumb. It can hold surrogate code points or values >
> 10FFFF. We are not C, we can use strong types that can't hold those
> values. I proposed std::unicode::scalar_value that will throw an
> exception on invalid value.
>
> UTF-8, UTF-16 and UTF-32 are equally easy to implement because they are
> on the code unit level. The rest of the code works on higher levels and
> is totally abstracted.
>
> Ranges are better than std::span.
>
> I think default error handling should be throwing exception. In fact, my
> design for std::unicode::scalar_value_sequence will ensure that given
> sequence is well formed inside constructor and all mutating operations,
> there will be an option to skip validation though.
>
> std::basic_string and std::basic_string_view are a temporary hack and
> will be replaced by std::unicode::code_unit_sequence in my proposal.
>
> See the first part of my design:
> https://github.com/Lyberta/cpp-unicode-fundamental
>
> I do agree that char and wchar_t must die, but I also think that
> char8_t, char16_t and char32_t are also hacks. We don't need dumb
> fundamental types when we can define strong user defined types.
>
> I hope to see [w]char[NN_t] will be completely removed by 2050.
>
> For now, I have std::unicode::utf32_code_unit and
> std::unicode::scalar_value as different types. Mostly because they are
> on the different level of abstraction. After GCC 9 release I'm gonna
> port my entire codebase to be char8_t based and then I'm gonna have
> proper usage experience with my proposed design.
>
> The way from code unit level to scalar value level is giving
> std::unicode::code_unit_sequence[_view] to constructor of
> std::unicode::scalar_value_sequence[_view]. It throws exception by
> default on ill formed sequence. But I expect most users to use std::text
> (that will be different in my proposal) and never go down to those low
> levels in practice.
>
> C++ constexpr can statically prevent programs from putting invalid
> values into UTF-32 buffer. That's the whole point of strong types.
> Forget about char32_t.
>
> Transcoding will be range-based as soon as I get ranges in libstdc++ so
> I can finish my design. All gains of std::span will be saved by having
> implementation defined overload that takes ContiguousRange.
>
> _______________________________________________
> SG16 Unicode mailing list
> Unicode_at_[hidden]
> http://www.open-std.org/mailman/listinfo/unicode
>
Received on 2019-04-26 03:22:06