sg16: Re: [SG16-Unicode] code_unit_sequence and code_point

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 20 Jun 2018 12:13:41 -0400

On 06/20/2018 05:34 AM, keld_at_[hidden] wrote:
> On Tue, Jun 19, 2018 at 09:52:05PM -0400, Tom Honermann wrote:
>> On 06/19/2018 04:19 PM, Lyberta wrote:
>>> keld_at_[hidden]:
>>>> Is your code point advisory the same as codepoints in 10646/Unicode, also
>>>> called characters in 10646?
>>> Yes. A code point is an unsigned 32-bit integer with values in the
>>> range of 0-10FFFF. Modern C and C++ have type char32_t which is most
>>> suitable for holding code points.
>>>
>>>> And why not just treat these as 32-bit wchar_t?
>>>> I believe this is what we do in C.
>>> Because the wide execution character set is implementation defined. So far
>>> nobody has expressed an opinion on changing that, and Windows violates the
>>> standard by having 16 bit wchar_t.
>> Technically, Windows doesn't violate the standard by having a 16-bit
>> wchar_t. It violates the standard by using a wide execution character
>> set that defines code points that do not fit in its (16-bit) wchar_t
>> type. We have an issue (https://github.com/sg16-unicode/sg16/issues/9)
>> to track modifying the standard to enable Microsoft's implementation to
>> be conforming.
> I believe that using a 16-bit wchar_t to handle UCS characters in a UTF-16 form is a violation of the
> C++ standard. You need to do some processing of surrogates, that is not portable to
> other platforms, and is against specs for wchar_t.

I think we are agreeing. Specifically, it violates [lex.ccon]p6
(http://eel.is/c++draft/lex.ccon#6) and [basic.fundamental]p5
(http://eel.is/c++draft/basic.fundamental#5)
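
To make the size constraint concrete, here is a minimal sketch that checks whether a single object of a given character type can hold a given code point. The helper name fits_in and the constant wchar_holds_all_code_points are hypothetical, chosen only for illustration; they are not part of the standard or of any SG16 proposal.

```cpp
#include <limits>

// Hypothetical helper (not from the standard): can a single object of
// type CharT represent the code point cp without truncation?
template <typename CharT>
constexpr bool fits_in(char32_t cp) {
    return static_cast<unsigned long long>(cp) <=
           static_cast<unsigned long long>(std::numeric_limits<CharT>::max());
}

// U+1F600 lies outside the BMP, so a 16-bit code unit type cannot hold it.
static_assert(!fits_in<char16_t>(U'\U0001F600'), "needs more than 16 bits");
static_assert(fits_in<char32_t>(U'\U0010FFFF'), "char32_t holds every code point");

// Implementation-defined result: true where wchar_t is 32 bits wide,
// false where it is 16 bits wide (as on Windows).
constexpr bool wchar_holds_all_code_points = fits_in<wchar_t>(U'\U0010FFFF');
```

On an implementation with a 16-bit wchar_t and a wide execution character set that covers all of Unicode, the last constant would be false, which is the mismatch discussed above.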

>
> I do not think this obsoletes wchar_t, it should not lead to obsoletion
> that some people use it wrongly.

I agree with this sentiment.

>
> Using a 16 bit wchar_t is ok if you restrict yourself to only a 16 bit subset of UCS.

I don't disagree, but for modern applications, limiting support to the
BMP is a pretty significant restriction. And modern applications need
to work on Windows and interact with the wchar_t based Win32 UTF-16 APIs.
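
For reference, the surrogate processing in question can be sketched as follows. This is only an illustration of the UTF-16 encoding arithmetic; the names Utf16Units, encode_utf16, is_high_surrogate, is_low_surrogate, and combine_surrogates are hypothetical, and the sketch assumes its input is a valid Unicode scalar value (no range or surrogate validation is done).

```cpp
// Result of encoding one code point: one or two UTF-16 code units.
struct Utf16Units {
    char16_t units[2];
    int count;
};

// Encode a code point as UTF-16. Assumes cp is a valid scalar value
// (at most 0x10FFFF and not itself in the surrogate range).
constexpr Utf16Units encode_utf16(char32_t cp) {
    return cp < 0x10000
        // BMP code point: a single code unit.
        ? Utf16Units{{static_cast<char16_t>(cp), 0}, 1}
        // Supplementary code point: high surrogate, then low surrogate.
        : Utf16Units{{static_cast<char16_t>(0xD800 + ((cp - 0x10000) >> 10)),
                      static_cast<char16_t>(0xDC00 + ((cp - 0x10000) & 0x3FF))},
                     2};
}

constexpr bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
constexpr bool is_low_surrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Combine a high/low surrogate pair back into the code point it encodes.
constexpr char32_t combine_surrogates(char16_t hi, char16_t lo) {
    return 0x10000 + ((static_cast<char32_t>(hi) - 0xD800) << 10)
                   + (static_cast<char32_t>(lo) - 0xDC00);
}
```

Code that treats each 16-bit wchar_t as a complete character skips this step, which is what restricts it to the BMP.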

>
> I am happy to have a specific type to handle code points that are defined to have
> UCS code point values. I just note that I think APIs to handle such a type would need to
> have exactly the same functionality as for handling wchar_t entities.

If I'm reading this correctly, it sounds like you are expressing a
preference that text interfaces should be consistently provided for
char, wchar_t, char16_t, char32_t (and char8_t). If so, I agree.

Tom.

Received on 2018-06-20 18:13:44