SG16

Subject: Re: [SG16-Unicode] Namespaces
From: JeanHeyd Meneide (phdofthehouse_at_[hidden])
Date: 2019-04-12 18:06:58


I'm sure many people agree that UTF-16 was a mistake. I'm not sure how many
people agree that it deserves deprecation or removal.

On Fri, Apr 12, 2019 at 4:46 PM Steve Downey <sdowney_at_[hidden]> wrote:

> I'm not placing ECS and WECS?
>

Presumably, Execution Character Set and Wide Execution Character Set.

On Fri, Apr 12, 2019 at 3:00 PM Lyberta <lyberta_at_[hidden]> wrote:

> Previously it was suggested to focus on Unicode so I no longer propose
> std::text namespace but I think we should put Unicode into std::unicode.
>

I ultimately don't have a horse in the race: I'll stick the code wherever
the final bikeshed is built.

> Is there a proposal for those?
>

I am working on a proposal; I believe someone else might be working on a
proposal for it as well. There is also an in-progress implementation.
WIP Proposal: https://thephd.github.io/vendor/future_cxx/papers/d1629.html
WIP Implementation (will be moved to separate repository in a few months):
https://github.com/ThePhD/phd/tree/master/include/phd/text

> Maybe a bit offtopic but I don't think std::narrow_execution and
> std::wide_execution are good names. I think appending _character_set
> would make them less ambiguous.
>

I took a vote on narrow/wide vs. narrow_execution/wide_execution, but not
narrow_character_set/wide_character_set or
narrow_execution_character_set/wide_execution_character_set. I'm down for
making these names as ugly and unpalatable and unspell-able as possible,
because nobody should be using them ever without compelling reason (e.g.,
interop with old code).

> > Regarding earlier points on what the standard does provide: the standard
> > needs to provide encodings for all the encoding types that are
> (currently)
> > pushed out by the standard, and nothing more. This includes: std::utf8,
> > std::utf16, std::utf32, std::wide_execution, and std::narrow_execution.
>
> I agree but I want to stress that this would be a good idea to provide
> only minimal support for ECS and WECS (i.e. transcoding only) and just
> let users migrate to Unicode.
>
>
I agree. The entire Unicode library will only work with
unicode_code_point/scalar_value (char32_t or a strong typedef, whatever
people decide). However, to compensate for the fact that text stored in
many places will not be usable with this library directly, we need robust
transcoding (encode/decode) support. By default, encodings will:

1. pipe things from code_unit_t -> unicode_code_point_t;
2. (do all your work here);
3. then, pipe things from unicode_code_point_t -> code_unit_t
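
The three steps above can be sketched roughly as follows. This is only an
illustration: latin1, utf8, and transcode are hypothetical names invented
for this example and are not the API of the proposal or the implementation.
It assumes eager decoding into a std::u32string and uses ISO-8859-1 as the
stand-in legacy encoding because its bytes map directly to U+0000..U+00FF.

```cpp
#include <cassert>
#include <string>

// Hypothetical legacy encoding: ISO-8859-1 bytes map 1:1 to U+0000..U+00FF.
struct latin1 {
    static std::u32string decode(const std::string& in) {
        std::u32string out;
        for (char c : in)
            out.push_back(static_cast<char32_t>(static_cast<unsigned char>(c)));
        return out;
    }
};

// Hypothetical UTF-8 encoder; invalid scalar values become U+FFFD.
struct utf8 {
    static std::string encode(const std::u32string& in) {
        std::string out;
        auto put = [&out](char32_t v) { out.push_back(static_cast<char>(v)); };
        for (char32_t cp : in) {
            if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) cp = 0xFFFD;
            if (cp < 0x80) {
                put(cp);
            } else if (cp < 0x800) {
                put(0xC0 | (cp >> 6));
                put(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                put(0xE0 | (cp >> 12));
                put(0x80 | ((cp >> 6) & 0x3F));
                put(0x80 | (cp & 0x3F));
            } else {
                put(0xF0 | (cp >> 18));
                put(0x80 | ((cp >> 12) & 0x3F));
                put(0x80 | ((cp >> 6) & 0x3F));
                put(0x80 | (cp & 0x3F));
            }
        }
        return out;
    }
};

template <typename From, typename To, typename Work>
std::string transcode(const std::string& input, Work work) {
    std::u32string points = From::decode(input);  // 1. code units -> code points
    work(points);                                 // 2. do all your work here
    return To::encode(points);                    // 3. code points -> code units
}

// Usage: "café" in Latin-1 (é = 0xE9) comes out as UTF-8 (é = 0xC3 0xA9).
inline void example() {
    std::string out = transcode<latin1, utf8>("caf\xE9", [](std::u32string&) {});
    assert(out == "caf\xC3\xA9");
}
```

A lazy variant would decode and encode one code point at a time through
iterators instead of materializing the intermediate string.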

If you specify the inner bit to not be Unicode, the library should (and
will) loudly and noisily fail you for not providing Unicode it can use. But
maybe someone just wants ebcdic -> wide_ebcdic with some strange
non-Unicode intermediary encoding. That's fine too; it just won't work with
all of the Standard because it is Sufficiently Weird. Your encode/decode
will work, as will transcoding within that boundary, but transcoding
outside of it will not without some way to go from what you have to Unicode.
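
One way the "fail loudly" behavior could look is a compile-time check on the
encoding's intermediate type. This is a sketch under assumptions, not the
proposal's design: unicode_code_point, utf8_like, ebcdic_like,
decodes_to_unicode, and unicode_algorithm are all hypothetical names made up
for this example.

```cpp
#include <type_traits>

// Hypothetical strong typedef for a Unicode code point.
struct unicode_code_point { char32_t value; };

// Hypothetical encodings: each names the intermediate type it decodes to.
struct utf8_like   { using code_point = unicode_code_point; };
struct ebcdic_like { using code_point = unsigned char; };  // non-Unicode intermediary

// True only when the encoding's intermediate type is the Unicode code point.
template <typename Encoding>
constexpr bool decodes_to_unicode =
    std::is_same<typename Encoding::code_point, unicode_code_point>::value;

// Stand-in for any Unicode algorithm; refuses non-Unicode encodings at compile time.
template <typename Encoding>
constexpr int unicode_algorithm() {
    static_assert(decodes_to_unicode<Encoding>,
                  "this algorithm needs an encoding that decodes to Unicode code points");
    return 0;
}
```

With this shape, unicode_algorithm<utf8_like>() compiles, while
unicode_algorithm<ebcdic_like>() is rejected with the message above; encode
and decode on ebcdic_like itself would be unaffected.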

> > The standard should not vend any other encodings...
>
> Again, it's been suggested to provide full-fledged API to Unicode only.
>

I agree; the core of the library will be built on Unicode and Unicode
Algorithms that work on Unicode Code Points/Unicode Scalar Values. However,
there are far too many text encodings in the wild serving up production
data -- including obscene amounts of Financial and Government data -- that
is *not* in a Unicode Format of any kind. Telling these industries that
they will not be a part of the new world does not sound like a useful
business proposition; therefore, they will pay the cost of (lazy or eager)
transcoding as described above, and then use the Unicode Algorithms once
they transcode. (They can then optionally translate back down to whatever
they want; e.g., when they're sending it out of their program.)

Note that only the people who do not keep Unicode around will need to pay
the cost of transcoding. If your data is already Unicode-friendly, then the
standard and the interfaces we provide will support you fully. This means
that any hard-coded algorithms that are not templated on encoding /
decoding must be given a range of Unicode code points to work on (or
straight up take char8_t, char16_t, and char32_t, all of which are assumed
by compile-time convention to be valid Unicode).

ECS and WECS must be transcoded. (Or cast/handled in some similar manner.)
