Re: [SG16-Unicode] D1628R0 (Unicode character properties)

From: Corentin <corentin.jabot_at_[hidden]>
Date: Thu, 28 Mar 2019 09:10:39 +0100
On Thu, 28 Mar 2019 at 08:49 Lyberta <lyberta_at_[hidden]> wrote:

> Corentin:
> > As requested by Tom, please find attached D1628R0, which will be discussed
> > during today's meeting ❕
> >
> > Feedback welcome :)
> Do we really want std::uni? std::unicode seems much better.

Yes. The longer the namespace, the more likely people are to write
"using namespace std::unicode;",
which defeats the purpose; we have a bad precedent with std::filesystem.
"uni" is short and sweet. I guess something like "uncd" would work too;
it's less about the name than about the number of letters.

> Unicode always uses the term "code point", not "codepoint":
> https://www.unicode.org/glossary/#code_point
> So the name should be std::uni[code]::code_point.

This is bike-shedding, and while that might be true, is there any gain in
information?

> In my experience, I never need the code point because surrogates are not
> allowed in valid UTF. I only ever need unicode scalar values:
> https://www.unicode.org/glossary/#unicode_scalar_value

This API (and TR44) is defined in terms of code points.
It is actually well behaved for all integers from 0 to 0xFFFFFFFF.
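
To illustrate what "well behaved for all integers" could mean, here is a
minimal C++17 sketch of a total query function. The names toy_category and
toy_general_category are hypothetical and not taken from D1628R0, and mapping
out-of-range values to "unassigned" is an assumption made for this example.

```cpp
#include <cstdint>

// Hypothetical, heavily simplified category set (not the proposal's API).
enum class toy_category { surrogate, unassigned, other };

// Total over every 32-bit input: never undefined, never throws.
constexpr toy_category toy_general_category(std::uint32_t cp) noexcept {
    // Assumption: values above U+10FFFF are reported as unassigned.
    if (cp > 0x10FFFF) return toy_category::unassigned;
    // Surrogate code points have General_Category Cs in the UCD.
    if (cp >= 0xD800 && cp <= 0xDFFF) return toy_category::surrogate;
    // A real implementation would consult the UCD tables here.
    return toy_category::other;
}
```

The point is only that surrogates and out-of-range integers get a defined
answer instead of being a precondition violation.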

The whole reason I am using that codepoint type (which is more of a
__codepoint_hack type) here is to delete
its use with char and wchar_t, which is nonsensical.
In other words, a code point type is not part of this proposal.
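
A rough C++17 sketch of that kind of hack type, for illustration only: the
names sketch::codepoint and sketch::is_ascii_letter are made up here and are
not the definitions used in D1628R0.

```cpp
#include <type_traits>

namespace sketch {

// Hypothetical wrapper whose only job is to reject char and wchar_t.
struct codepoint {
    constexpr codepoint(char32_t c) noexcept : value(c) {}
    codepoint(char) = delete;     // a char is a code unit in some unknown encoding
    codepoint(wchar_t) = delete;  // same problem for wchar_t
    char32_t value;
};

// Stand-in for a property query; only knows ASCII letters for brevity.
constexpr bool is_ascii_letter(codepoint cp) noexcept {
    return (cp.value >= U'A' && cp.value <= U'Z') ||
           (cp.value >= U'a' && cp.value <= U'z');
}

static_assert(std::is_constructible_v<codepoint, char32_t>);
static_assert(!std::is_constructible_v<codepoint, char>);
static_assert(!std::is_constructible_v<codepoint, wchar_t>);

} // namespace sketch
```

With this, sketch::is_ascii_letter(U'a') compiles, while
sketch::is_ascii_letter('a') and sketch::is_ascii_letter(L'a') are rejected
at compile time because they select a deleted constructor.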

The feedback I got is to not care, just use uint32_t instead, and
let people
shoot themselves in the foot.

> Hence I think using code point interfaces should be discouraged.
> I think constructing code points or scalar values from char8_t or
> char16_t makes no sense. They are at the different levels.
> I'm writing a competing proposal where I want to propose
> std::unicode_code_point and std::unicode_scalar_value that have explicit
> constructors from char32_t and explicit member function .value() to get
> char32_t back. I think this is the only way forward. char8_t, char16_t
> and char32_t are dumb types that have horrible names, we should only
> use them as a transition mechanism.

In my experience, you will find that it is a very difficult and verbose API
to use,
especially that explicit value method.
I do think char32_t is fine, as it was always supposed to be a code point
(or even a code unit which also happens to be a code point; it's really the
most basic building block), which it is.
I do not think scalar values are that important, as it is difficult to form
something that is not a scalar value once we have the right
"Unicode sandwich" model,
where encodings or inputs that may produce non-scalar-value code points have
to be decoded at the I/O boundary.
Then your scalar value just becomes a contract that you can sprinkle
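
To make the comparison concrete, here is a small C++17 sketch of the two
styles. Everything in namespace sketch is hypothetical: scalar_value only
approximates the explicit-constructor and .value() design described in the
quoted text, and the assert stands in for a contract.

```cpp
#include <cassert>

namespace sketch {

// Unicode scalar values: U+0000..U+D7FF and U+E000..U+10FFFF.
constexpr bool is_scalar_value(char32_t c) noexcept {
    return c < 0xD800 || (c > 0xDFFF && c <= 0x10FFFF);
}

// Style 1: strong type, explicit in both directions.
class scalar_value {
public:
    constexpr explicit scalar_value(char32_t c) : v_(c) {
        assert(is_scalar_value(c));
    }
    constexpr char32_t value() const noexcept { return v_; }
private:
    char32_t v_;
};

constexpr scalar_value to_ascii_upper(scalar_value s) {
    return (s.value() >= U'a' && s.value() <= U'z')
        ? scalar_value(static_cast<char32_t>(s.value() - 0x20))
        : s;
}

// Style 2: plain char32_t, with the scalar-value requirement as a contract.
constexpr char32_t to_ascii_upper(char32_t c) {
    assert(is_scalar_value(c));  // the "sprinkled" contract
    return (c >= U'a' && c <= U'z') ? static_cast<char32_t>(c - 0x20) : c;
}

} // namespace sketch
```

A caller of the first style writes
sketch::to_ascii_upper(sketch::scalar_value(U'a')).value(), while the second
is just sketch::to_ascii_upper(U'a').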

> I'm gonna try to finish the early draft of my proposal and after release
> of GCC 9 I'm gonna port my entire code base on its design so I will have
> usage experience with it.

Great !

> _______________________________________________
> SG16 Unicode mailing list
> Unicode_at_[hidden]
> http://www.open-std.org/mailman/listinfo/unicode

Received on 2019-03-28 09:10:53