sg16: Re: [SG16-Unicode] code_unit_sequence and code_point

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 20 Jun 2018 13:48:57 -0400

On 06/20/2018 01:24 PM, keld_at_[hidden] wrote:
> On Wed, Jun 20, 2018 at 12:13:41PM -0400, Tom Honermann wrote:
>> On 06/20/2018 05:34 AM, keld_at_[hidden] wrote:
>>> On Tue, Jun 19, 2018 at 09:52:05PM -0400, Tom Honermann wrote:
>>>> On 06/19/2018 04:19 PM, Lyberta wrote:
>>>>> keld_at_[hidden]:
>>> Using a 16 bit wchar_t is ok if you restrict yourself to only a 16 bit
>>> subset of UCS.
>> I don't disagree, but for modern applications, limiting support to the
>> BMP is a pretty significant restriction. And modern applications need
>> to work on Windows and interact with the wchar_t based Win32 UTF-16 APIs.
> I agree that this is not the state of the art. But it once was, and I think it is the reason for
> Microsoft to use 16 bit for wchar_t.

I agree. When Microsoft chose wchar_t, the BMP was all there was, so a
16-bit wchar_t holding UCS-2 was a reasonable choice.

>
>>> I am happy to have a specific type to handle code points that are defined
>>> to have UCS code point values. I just note that I think APIs to handle such
>>> a type would need to have exactly the same functionality as for handling
>>> wchar_t entities.
>> If I'm reading this correctly, it sounds like you are expressing a
>> preference that text interfaces should be consistently provided for
>> char, wchar_t, char16_t, char32_t (and char8_t). If so, I agree.
> My thoughts were only wchar_t and char32_t. The other types would need another layer
> - they cannot generally hold a code point of the processing character type. So they
> cannot be used for portable programs that can be used everywhere.

I see. In text_view, I took the approach of defining a 'character'
class template that holds a code point and an association with a
character set. This enabled using a single builtin character or
integral type to store code point values for multiple character sets
while protecting against inadvertent use of code points from one
character set as code points of a different character set. In a
Unicode-only world, char32_t would suffice, but we don't live in a
Unicode-only world.

https://github.com/tahonermann/text_view/blob/master/include/text_view_detail/character.hpp#L20-L56
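To make the idea concrete, here is a minimal sketch of that approach. It is
not the text_view code linked above; the tag types, the aliases, and the
to_unicode() helper are hypothetical names made up for illustration. It only
shows how tying a code point to a character set at the type level lets the
same builtin storage type be reused while keeping the values from mixing.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical character set tags (illustrative names, not text_view's).
// Each tag names the builtin type used to store its code points.
struct unicode_character_set   { using code_point_type = char32_t; };
struct latin1_character_set    { using code_point_type = unsigned char; };
struct ebcdic037_character_set { using code_point_type = unsigned char; };

// A code point value associated, at the type level, with a character set.
template <typename CharacterSet>
class character {
public:
    using character_set_type = CharacterSet;
    using code_point_type = typename CharacterSet::code_point_type;

    constexpr character() = default;
    constexpr explicit character(code_point_type cp) : code_point_(cp) {}

    constexpr code_point_type code_point() const { return code_point_; }

    // Only characters of the same character set are comparable.
    friend constexpr bool operator==(character a, character b) {
        return a.code_point_ == b.code_point_;
    }

private:
    code_point_type code_point_{};
};

using unicode_char   = character<unicode_character_set>;
using latin1_char    = character<latin1_character_set>;
using ebcdic037_char = character<ebcdic037_character_set>;

// Crossing character sets has to be spelled out. ISO 8859-1 code points
// coincide with U+0000..U+00FF, so this particular mapping is the identity.
constexpr unicode_char to_unicode(latin1_char c) {
    return unicode_char{static_cast<char32_t>(c.code_point())};
}

// Both types store an unsigned char, yet they are distinct and do not
// convert into each other: 0xC1 is 'A' in EBCDIC 037 but U+00C1 in Latin-1.
static_assert(!std::is_same<latin1_char, ebcdic037_char>::value,
              "same storage type, different character types");
static_assert(!std::is_convertible<latin1_char, ebcdic037_char>::value,
              "no implicit mixing of character sets");
static_assert(!std::is_convertible<latin1_char, unicode_char>::value,
              "conversion to Unicode must be explicit");
```

With these definitions, to_unicode(latin1_char{0xC1}) yields a unicode_char
holding U+00C1, whereas passing an ebcdic037_char to to_unicode() fails to
compile rather than silently reinterpreting the value.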

I think we can design with an assumption that, in the absence of other
information, values of type char32_t hold Unicode code points.

>
> Most programs I work with are made for the global market, and IMHO, you
> should program for the global market.

+1 :)

Tom.

Received on 2018-06-20 19:48:59