sg16: Re: [SG16-Unicode] code_unit_sequence and code_point

From: Lyberta <lyberta_at_[hidden]>
Date: Tue, 19 Jun 2018 20:19:00 +0000

keld_at_[hidden]:
> Is your code point advisory the same as codepoints in 10646/Unicode, also
> called characters in 10646?

Yes. A code point is an unsigned 32-bit integer with values in the
range 0-10FFFF (hexadecimal). Modern C and C++ have the type char32_t,
which is the most suitable for holding code points.
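
As a minimal sketch of that definition (the names max_code_point and
is_code_point are hypothetical, chosen for illustration only, and do not
come from any proposal), the range check could look like this:

```cpp
// Hypothetical names, for illustration only.

// Largest code point, per the 0-10FFFF range described above.
constexpr char32_t max_code_point = 0x10FFFF;

// True if the value lies in the code point range 0-10FFFF.
// char32_t is unsigned, so only the upper bound needs checking.
constexpr bool is_code_point(char32_t value)
{
    return value <= max_code_point;
}
```

Since char32_t can also hold values above 10FFFF, a check like this is
still needed wherever the data comes from an untrusted source.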

> And why not just treat these as 32-bit wchar-t?
> I believe this is what we do in C.

Because the wide execution character set is implementation-defined. So
far nobody has expressed interest in changing that, and Windows violates
the standard by having a 16-bit wchar_t, which is too narrow to hold
every code point.
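
To illustrate why the width matters, here is a small sketch (the helper
name can_hold_any_code_point is hypothetical) that asks at compile time
whether a character type can represent every value up to 10FFFF:

```cpp
#include <limits>

// Hypothetical helper, for illustration only.
// True if type T can represent every code point in 0-10FFFF.
template <typename T>
constexpr bool can_hold_any_code_point()
{
    return static_cast<unsigned long long>(std::numeric_limits<T>::max())
        >= 0x10FFFFull;
}

// char32_t is always wide enough; char16_t never is.
static_assert(can_hold_any_code_point<char32_t>(), "char32_t is wide enough");
static_assert(!can_hold_any_code_point<char16_t>(), "char16_t is too narrow");
```

The result of can_hold_any_code_point<wchar_t>() depends on the
implementation: it should be false where wchar_t is 16 bits wide and
true where it is 32 bits wide.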

> Then you can have functions converting to and from wchar-t.

Yes, except if you convert text to UTF-32 before processing it, you will
waste memory, and a lot of interfaces still expect char*. More
importantly, if you truly want to work with text, you usually need to
work on the layer above code points: grapheme clusters.
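
A rough sketch of the memory cost (the function names utf8_length,
utf8_size and utf32_size are hypothetical, and the input is assumed to
contain only valid code points) compares the storage needed for the same
text in UTF-8 and in UTF-32:

```cpp
#include <cstddef>

// Hypothetical helpers, for illustration only.

// Number of UTF-8 code units (bytes) needed to encode one code point.
// Assumes cp is a valid code point in 0-10FFFF.
constexpr std::size_t utf8_length(char32_t cp)
{
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Bytes needed to store a null-terminated code point sequence as UTF-8
// (terminator not counted).
constexpr std::size_t utf8_size(const char32_t* s)
{
    return *s ? utf8_length(*s) + utf8_size(s + 1) : 0;
}

// Bytes needed to store the same sequence as UTF-32
// (terminator not counted).
constexpr std::size_t utf32_size(const char32_t* s)
{
    return *s ? sizeof(char32_t) + utf32_size(s + 1) : 0;
}
```

For ASCII-only text such as U"hello", utf8_size gives 5 bytes while
utf32_size gives 5 * sizeof(char32_t) bytes.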

Received on 2018-06-19 22:19:36