Re: [SG16-Unicode] D1628R0 (Unicode character properties)

From: Lyberta <lyberta_at_[hidden]>
Date: Thu, 28 Mar 2019 08:33:00 +0000
> Yes. The longer the namespace, the more likely people are to write
> "using namespace std::unicode;", which defeats the purpose. We have a bad
> precedent with std::filesystem.
> Uni is sweet and short; I guess something like uncd would work too.
> It's not as much about the name as it is about the number of letters.

Uni is too ambiguous, uncd is better but very ugly. I have no problem
with std::filesystem.

>
>>
>> Unicode always uses the term "code point", not "codepoint":
>> https://www.unicode.org/glossary/#code_point
>>
>> So the name should be std::uni[code]::code_point.
>
>
> Bike-shedding, and while that might be true, is there any gain in
> information?

"Codepoint" feels very wrong, almost as wrong as strlen and the rest of
the C library.

>> In my experience, I never need the code point because surrogates are not
>> allowed in valid UTF. I only ever need unicode scalar values:
>> https://www.unicode.org/glossary/#unicode_scalar_value
>
>
>
> This API (and TR44) is defined in terms of code points;
> it's actually well behaved for all integers from 0 to 0xFFFFFFFF.

I guess, but do we really want our users to shove random integers in it?
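
To make the distinction concrete, here is a minimal sketch of the two ranges
being discussed. The function names are hypothetical and are not part of
D1628R0; the numeric bounds come from the Unicode glossary entries linked above.

```cpp
#include <cassert>

// Hypothetical helpers, for illustration only (not from D1628R0).

// A code point is any value in the Unicode codespace, 0 to 0x10FFFF.
constexpr bool is_code_point(char32_t v) noexcept {
    return v <= 0x10FFFF;
}

// Surrogate code points are reserved for UTF-16 and occupy 0xD800 to 0xDFFF.
constexpr bool is_surrogate(char32_t v) noexcept {
    return v >= 0xD800 && v <= 0xDFFF;
}

// A Unicode scalar value is any code point that is not a surrogate.
constexpr bool is_scalar_value(char32_t v) noexcept {
    return is_code_point(v) && !is_surrogate(v);
}

static_assert(is_scalar_value(U'A'), "ordinary characters are scalar values");
static_assert(is_code_point(0xD800) && !is_scalar_value(0xD800),
              "a surrogate is a code point but not a scalar value");
static_assert(!is_code_point(0x110000), "outside the codespace");

// A runtime guard a caller might place in front of a property lookup.
inline char32_t checked_scalar_value(char32_t v) {
    assert(is_scalar_value(v) && "not a Unicode scalar value");
    return v;
}
```

An API that is well behaved for every 32-bit integer does not need such a
guard; the sketch only shows what "random integers" would be filtered out if
we chose to restrict the input.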


> The whole reason I am using that codepoint type (which is more a
> __codepoint_hack type) here is to delete
> its use with char and wchar_t, which is nonsensical.
> In other words, a code point type is not part of this proposal.

That's why my design intended those functions to be member functions of
the code point (or scalar value) type. Since the constructor is explicit,
you can't shove char or wchar_t in there.

>
> The feedback I got is to just not care and just use uint32_t instead and
> let people
> shoot themselves in the foot.

What about systems where CHAR_BIT != 8, 16 or 32? std::uint32_t is
optional, do we want Unicode on such systems? I'm on the fence myself
between char32_t and std::uint_least32_t.
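
For reference, the properties that make those two candidates close can be
checked at compile time. This is only a sketch of guarantees the language
already gives; the overload set named which is hypothetical and exists purely
to show that the two types are distinct.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>
#include <type_traits>

// char32_t shares size, signedness and alignment with std::uint_least32_t,
// and both exist on every implementation, unlike std::uint32_t.
static_assert(sizeof(char32_t) == sizeof(std::uint_least32_t), "");
static_assert(alignof(char32_t) == alignof(std::uint_least32_t), "");
static_assert(std::is_unsigned<char32_t>::value, "");
static_assert(sizeof(char32_t) * CHAR_BIT >= 32, "");

// They are nevertheless different types, so they can be overloaded on.
static_assert(!std::is_same<char32_t, std::uint_least32_t>::value, "");

// Hypothetical overload set, for illustration only.
constexpr int which(char32_t) { return 1; }
constexpr int which(std::uint_least32_t) { return 2; }

inline void check_overloads() {
    assert(which(U'x') == 1);
    assert(which(std::uint_least32_t{120}) == 2);
}
```

So the choice is less about representation and more about whether the
parameter should say "character" or "integer" in the signature.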

>> I'm writing a competing proposal where I want to propose
>> std::unicode_code_point and std::unicode_scalar_value that have explicit
>> constructors from char32_t and explicit member function .value() to get
>> char32_t back. I think this is the only way forward. char8_t, char16_t
>> and char32_t are dumb types that have horrible names, we should only
>> use them as a transition mechanism.
>>
>
> In my experience, you will find that it is a very difficult and verbose api
> to use,
> especially that explicit value method.
> I do think char32_t is fine, as it was always supposed to be a code-point
> (or even a code unit which also happens to be a codepoint; it's really the
> most basic building block), which it is.
> I do not think scalar values are that important, as it is difficult to form
> something that is not a scalar value as soon as we have the right
> "unicode sandwich" model,
> where encodings or inputs that may produce non-scalar-value code points have
> to be decoded at the I/O boundary.
> Then your scalar value just becomes a contract that you can sprinkle
> everywhere.

Yes, a contract or invariant means a strong type, not a dumb char32_t.
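
As a sketch of what "invariant as a type" could mean in practice, the class
below can only be obtained through a checked factory, so every function that
receives one can rely on the value being a scalar value. All names are
hypothetical, and this uses a factory returning std::optional (C++17) rather
than the explicit constructor described earlier, which is just one possible
variation.

```cpp
#include <cassert>
#include <optional>

// Hypothetical type; the real interface of any proposal may differ.
class scalar_value {
public:
    // The only way in: returns an empty optional for values outside the
    // codespace or inside the surrogate range 0xD800 to 0xDFFF.
    static constexpr std::optional<scalar_value> from(char32_t v) noexcept {
        if (v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF)) {
            return std::nullopt;
        }
        return scalar_value(v);
    }

    constexpr char32_t value() const noexcept { return value_; }

private:
    constexpr explicit scalar_value(char32_t v) noexcept : value_(v) {}
    char32_t value_;
};

// Downstream code needs no re-validation: the type carries the invariant.
constexpr bool is_ascii(scalar_value s) noexcept {
    return s.value() < 0x80;
}

inline void check_boundary() {
    // Decode at the I/O boundary once, then pass the strong type around.
    auto ok = scalar_value::from(U'A');
    assert(ok.has_value() && is_ascii(*ok));
    assert(!scalar_value::from(0xD800).has_value());
}
```

With a bare char32_t parameter, is_ascii would have to either trust the caller
or repeat the range check itself.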


Received on 2019-03-28 09:33:12