Re: [SG16-Unicode] D1628R0 (Unicode character properties)

From: Corentin <corentin.jabot_at_[hidden]>
Date: Thu, 28 Mar 2019 10:00:55 +0100
On Thu, Mar 28, 2019, 9:33 AM Lyberta <lyberta_at_[hidden]> wrote:

> > Yes - The longer the namespace, the more likely people are to write "using
> > namespace std::unicode;"
> > which defeats the purpose - we have a bad precedent with std::filesystem.
> > Uni is sweet and short, I guess something like uncd would work too,
> > it's not as much about the name as it is about the number of letters
>
> Uni is too ambiguous, uncd is better but very ugly. I have no problem
> with std::filesystem.
>
> >
> >>
> >> Unicode always uses the term "code point", not "codepoint":
> >> https://www.unicode.org/glossary/#code_point
> >>
> >> So the name should be std::uni[code]::code_point.
> >
> >
> > Bike-shedding, and while that might be true, is there any gain in
> > information?
>
> "Codepoint" feels very wrong, almost as wrong as strlen and the rest of
> the C library.
>
> >> In my experience, I never need the code point because surrogates are not
> >> allowed in valid UTF. I only ever need unicode scalar values:
> >> https://www.unicode.org/glossary/#unicode_scalar_value
> >
> >
> >
> > This API (and TR44) is defined in terms of code points;
> > it's actually well behaved for all integers from 0 to 0xFFFFFFFF
>
> I guess, but do we really want our users to shove random integers in it?
>

Yes. I really want a wide contract there.
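
To make the wide-contract point concrete, here is a minimal sketch. It is not the interface from D1628R0; the namespace `uni_sketch` and both function names are hypothetical. It only shows what "well behaved for every 32-bit input" could mean for a query, with the `char` and `wchar_t` overloads deleted as discussed in this thread:

```cpp
#include <cassert>

namespace uni_sketch {  // hypothetical namespace, for illustration only

// Wide contract: every char32_t value is accepted. Values above 0x10FFFF
// are not an error; they simply yield false.
constexpr bool is_surrogate(char32_t cp) noexcept {
    return cp >= 0xD800u && cp <= 0xDFFFu;
}

// A Unicode scalar value is any code point in [0, 0x10FFFF] that is not
// a surrogate.
constexpr bool is_scalar_value(char32_t cp) noexcept {
    return cp <= 0x10FFFFu && !is_surrogate(cp);
}

// Reject char and wchar_t, whose values are not known to be code points.
bool is_surrogate(char) = delete;
bool is_surrogate(wchar_t) = delete;
bool is_scalar_value(char) = delete;
bool is_scalar_value(wchar_t) = delete;

}  // namespace uni_sketch
```

With this shape, `uni_sketch::is_scalar_value(U'A')` compiles, while `uni_sketch::is_scalar_value('A')` is rejected at compile time.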

>
> > The whole reason I am using that codepoint type (which is more of a
> > __codepoint_hack type) here is to delete
> > use with char and wchar_t, which is nonsensical.
> > In other words, a code point type is not part of this proposal.
>
> That's why my design intended those functions to be member functions of
> a code point (or scalar value) type. Since the constructor is explicit, you
> can't shove char or wchar_t in there
>

That gives the impression these types may have state or caching, which they
really shouldn't have. But otherwise yes, if your objects have a wide
contract all the way through - which they won't - having these methods in a
type is possible. I don't think we gain in usability though, especially since
it makes it harder to use these queries in ranges.
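
A rough sketch of the usability difference, under stated assumptions: the namespace `uni_ranges_sketch`, the `code_point` class, and all function names below are hypothetical and are taken from neither proposal. A free function can be handed to an algorithm as-is, whereas a member query on a type with an explicit constructor needs a wrapping lambda:

```cpp
#include <algorithm>
#include <cassert>
#include <string>

namespace uni_ranges_sketch {  // hypothetical names throughout

// Free-function query with a wide contract on char32_t.
constexpr bool is_surrogate(char32_t cp) noexcept {
    return cp >= 0xD800u && cp <= 0xDFFFu;
}

// Strong-type alternative: explicit constructor, query as a member function.
class code_point {
public:
    explicit constexpr code_point(char32_t v) noexcept : value_(v) {}
    constexpr bool is_surrogate() const noexcept {
        return value_ >= 0xD800u && value_ <= 0xDFFFu;
    }
private:
    char32_t value_;
};

// The free function is passed to the algorithm directly as the predicate.
inline bool any_surrogate_free(const std::u32string& s) {
    return std::any_of(s.begin(), s.end(), is_surrogate);
}

// The member version needs a lambda that constructs the type per element.
inline bool any_surrogate_member(const std::u32string& s) {
    return std::any_of(s.begin(), s.end(),
                       [](char32_t c) { return code_point{c}.is_surrogate(); });
}

}  // namespace uni_ranges_sketch
```

Both helpers compute the same answer; the difference is only in how much glue the call site needs.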


> >
> > The feedback I got is to just not care and just use uint32_t instead and
> > let people
> > shoot themselves in the foot.
>
> What about systems where CHAR_BIT != 8, 16 or 32? std::uint32_t is
> optional, do we want Unicode on such systems? I'm on the fence myself
> between char32_t and std::uint_least32_t.
>

Good point
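
As a small illustration of the portability concern: `std::uint32_t` is only provided when an exact 32-bit type exists, while `std::uint_least32_t` and `char32_t` are always available. The helper name `to_integer` below is hypothetical; the sketch only shows that either always-available type can hold any code point:

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// std::uint_least32_t always exists and has at least 32 bits.
static_assert(sizeof(std::uint_least32_t) * CHAR_BIT >= 32,
              "uint_least32_t must hold at least 32 bits");

// char32_t is a distinct type with the same size as uint_least32_t.
static_assert(sizeof(char32_t) == sizeof(std::uint_least32_t),
              "char32_t and uint_least32_t have the same size");

// Hypothetical helper: convert a code point to an integer without
// relying on the optional std::uint32_t.
constexpr std::uint_least32_t to_integer(char32_t cp) noexcept {
    return static_cast<std::uint_least32_t>(cp);
}
```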

> >> I'm writing a competing proposal where I want to propose
> >> std::unicode_code_point and std::unicode_scalar_value that have explicit
> >> constructors from char32_t and explicit member function .value() to get
> >> char32_t back. I think this is the only way forward. char8_t, char16_t
> >> and char32_t are dumb types that have horrible names, we should only
> >> use them as a transition mechanism.
> >>
> >
> > In my experience, you will find that it is a very difficult and verbose
> > API to use, especially that explicit value method.
> > I do think char32_t is fine as it was always supposed to be a code point
> > (or even a code unit which also happens to be a code point; it's really
> > the most basic building block), which it is.
> > I do not think scalar values are that important, as it is difficult to
> > form something that is not a scalar value as soon as we have the right
> > "Unicode sandwich" model,
> > where encodings or inputs that may produce non-scalar-value code points
> > have to be decoded at the I/O boundary;
> > then your scalar value just becomes a contract that you can sprinkle
> > everywhere.
>
> Yes, a contract or invariant means a strong type, not a dumb char32_t
>

TR 44 is purposefully dumb by design too.

Received on 2019-03-28 10:01:12