Re: [SG16-Unicode] Strong code unit types

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 5 Dec 2018 09:15:17 -0500
On 12/5/18 8:05 AM, Steve Downey wrote:
> `codepoint` also, which is probably "just" a char32_t?

No, I think a type that isn't convertible from code unit types is
desirable. (I'm interpreting your response as implying that 'codepoint'
would just be a type alias of 'char32_t' as opposed to a distinct strong
type.)

Consider the std::isalnum example we discussed this week. The
problem was that it was being called with code unit values, but its
parameter type means something more like a code point. Code like the
following is well-formed and follows current recommendations for correct
use of std::isalnum, but is nevertheless incorrect for multibyte
encodings that reuse valid leading code unit values as trailing code
unit values (e.g., Shift-JIS).

#include <cctype>

void f(const char *s) {
   while (*s) {
     // Each call classifies a single code unit, not a decoded character.
     if (std::isalnum(static_cast<unsigned char>(*s++))) {
       ...
     }
   }
}
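
To make the failure concrete, here is a minimal sketch using the same
per-code-unit loop. The helper name count_alnum_bytes and the test string
are illustrative assumptions, not part of the original discussion. The
string holds the Shift-JIS encoding of U+3001 IDEOGRAPHIC COMMA, whose
trail byte has the same value as ASCII 'A'.

```cpp
#include <cctype>
#include <cstddef>

// Hypothetical helper: counts the bytes that std::isalnum accepts,
// using the same code-unit-at-a-time loop as f() above.
std::size_t count_alnum_bytes(const char *s) {
    std::size_t n = 0;
    while (*s) {
        if (std::isalnum(static_cast<unsigned char>(*s++))) {
            ++n;
        }
    }
    return n;
}

// Shift-JIS encoding of U+3001 IDEOGRAPHIC COMMA: lead byte 0x81,
// trail byte 0x41. The trail byte is also the encoding of ASCII 'A'.
const char ideographic_comma_sjis[] = "\x81\x41";

// In the default "C" locale, count_alnum_bytes(ideographic_comma_sjis)
// returns 1: the trail byte is misclassified as the letter 'A' even
// though the string contains a single punctuation character.
```

The loop never learns that 0x41 is a trail byte here, because nothing in
the type of its argument distinguishes a code unit from a character.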

Use of a distinct type for code points that is not implicitly
convertible from a code unit type prevents these kinds of problems.
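
As a rough sketch of what I have in mind (the names code_point and
is_ascii_alnum are hypothetical and the interface is only an assumption,
not a proposal), a distinct type with an explicit constructor rejects
accidental use of code units at compile time:

```cpp
#include <type_traits>

// Hypothetical strong type for code points; illustrative only.
class code_point {
public:
    constexpr code_point() = default;
    // Explicit: a caller has to spell out that a value is a code point.
    explicit constexpr code_point(char32_t v) : value_(v) {}
    constexpr char32_t value() const { return value_; }
private:
    char32_t value_ = 0;
};

// Hypothetical classification function that accepts only code points.
constexpr bool is_ascii_alnum(code_point cp) {
    return (cp.value() >= U'0' && cp.value() <= U'9') ||
           (cp.value() >= U'A' && cp.value() <= U'Z') ||
           (cp.value() >= U'a' && cp.value() <= U'z');
}

// Code unit types do not implicitly convert, so a call such as
// is_ascii_alnum(*s) with 'const char *s' does not compile.
static_assert(!std::is_convertible<char, code_point>::value,
              "char must not implicitly convert to code_point");
static_assert(!std::is_convertible<unsigned char, code_point>::value,
              "unsigned char must not implicitly convert to code_point");
static_assert(!std::is_convertible<char32_t, code_point>::value,
              "char32_t must not implicitly convert to code_point");
```

Explicit construction from a code unit is still possible in this sketch,
so it only guards against accidental misuse; a decoder would be the
intended source of code_point values.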

Tom.

>
> On Wed, Dec 5, 2018, 01:40 Tom Honermann <tom_at_[hidden]> wrote:
>
> On 12/4/18 11:17 PM, Lyberta wrote:
> > This is something that hit me recently. Why are we using fundamental
> > types for code units? CppCon 2018 is full of people saying that we
> > should migrate to strong types, that std::size_t should have been a
> > struct, etc.
> The primary reason for using fundamental types for code units is that
> those are the types used for character and string literals.
> >
> > I propose we add strong types for code units:
> >
> > * utf8_code_unit
> > * utf16_code_unit
> > * utf32_code_unit
> >
> > These will hold char8,16,32_t inside of them respectively but will
> > not allow the invalid values such as >245 for UTF-8, surrogates
> > and >0x10FFFF for UTF-32, etc.
> > This will guarantee that all code units are valid and will allow us
> > to write much faster code because we will never need to check for
> > invalid values.
>
> The downside of such validating types is the validation overhead.
>
> I am in favor of introducing strong types for code points.
>
> Tom.


Received on 2018-12-05 15:15:20