Re: [SG16-Unicode] String views with strong code unit types

From: Steve Downey <sdowney_at_[hidden]>
Date: Tue, 4 Jun 2019 07:11:59 -0400
That literals aren't required to be well-formed is a subset of the larger
problem: char8_t data may have come from anywhere and can't be assumed to be
well-formed. Real-world text is frequently broken.
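
To make that concrete, here is a rough sketch of checking code units instead of trusting them. The names `is_well_formed_utf8`, `literal_is_well_formed`, and `check_examples` are hypothetical helpers made up for this message, not anything in the standard or in a proposal. The byte ranges follow the well-formed UTF-8 table in the Unicode Standard. The functions are templated on the character type so they accept `u8` literals whether those are `char` (C++17) or `char8_t` (C++20).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: returns true if the n code units starting at s
// form a well-formed UTF-8 sequence.
template <class CharT>
bool is_well_formed_utf8(const CharT* s, std::size_t n) {
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 <= 0x7F) { ++i; continue; }       // ASCII, one code unit

        std::size_t len = 0;
        unsigned char lo = 0x80, hi = 0xBF;      // allowed range of the 2nd byte
        if (b0 >= 0xC2 && b0 <= 0xDF) {
            len = 2;                             // 0xC0/0xC1 would be overlong
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            len = 3;
            if (b0 == 0xE0) lo = 0xA0;           // reject overlong forms
            if (b0 == 0xED) hi = 0x9F;           // reject surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            len = 4;
            if (b0 == 0xF0) lo = 0x90;           // reject overlong forms
            if (b0 == 0xF4) hi = 0x8F;           // reject values above U+10FFFF
        } else {
            return false;                        // stray trail byte or invalid lead
        }

        if (n - i < len) return false;           // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            const unsigned char min = (k == 1) ? lo : 0x80;
            const unsigned char max = (k == 1) ? hi : 0xBF;
            if (b < min || b > max) return false;
        }
        i += len;
    }
    return true;
}

// Hypothetical convenience overload for string literals; the terminating
// NUL is excluded from the check.
template <class CharT, std::size_t N>
bool literal_is_well_formed(const CharT (&lit)[N]) {
    return is_well_formed_utf8(lit, N - 1);
}

// A few literals written with \x escapes, as discussed in the quoted text.
inline void check_examples() {
    assert(literal_is_well_formed(u8"caf\xC3\xA9"));      // U+00E9, well-formed
    assert(!literal_is_well_formed(u8"\xC0\x80"));        // overlong encoding of NUL
    assert(!literal_is_well_formed(u8"\xED\xA0\x80"));    // encoded surrogate, WTF-8/CESU-8 style
    assert(!literal_is_well_formed(u8"\xE2\x82"));        // truncated 3-byte sequence
}
```

The same check applies to data read from a file or a socket: the type says which encoding is intended, not whether the contents are valid.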

On Tue, Jun 4, 2019, 06:27 JeanHeyd Meneide <phdofthehouse_at_[hidden]> wrote:

> On Tue, Jun 4, 2019 at 5:39 AM Lyberta <lyberta_at_[hidden]> wrote:
>> We can always modify the standard so that we get strong types via
>> compiler magic. I was thinking:
>> utf8'a' -> std::unicode::utf8_code_unit
>> utf16'a' -> std::unicode::utf16_code_unit
>> utf32'a' -> std::unicode::utf32_code_unit
>> utf8"a" -> std::unicode::utf8_code_unit_sequence_view
>> utf16"a" -> std::unicode::utf16_code_unit_sequence_view
>> utf32"a" -> std::unicode::utf32_code_unit_sequence_view
>> Well, that's future. I want something I can use now.
>> Also, does the standard require well formed sequences in literals?
> No, we lobbied specifically that you can insert "ill-formed" sequences
> (e.g., not perfectly well formed Unicode Scalar Values) into string
> literals. This is specifically to enable people who need literals of types
> that are not exactly conformant for various reasons (testing, or
> specifically creating WTF8/CESU8/etc. literals, and more).
> Granted, the only way you can do this is by writing `\x` values
> specifically in the string literal: it's a very strong signal that someone
> is doing something non-standard. That doesn't mean you can't assume
> char8_t, char16_t, and char32_t literals are well-formed: if someone's
> shoving in direct code unit values with backslash-X syntax, you have to
> assume they are a Very Smart Person Who Knows What They Are Getting
> Themselves Into.
> _______________________________________________
> SG16 Unicode mailing list
> Unicode_at_[hidden]
> http://www.open-std.org/mailman/listinfo/unicode

Received on 2019-06-04 13:12:14