sg16: Re: [SG16-Unicode] [Testing] Mail Messages

From: ThePhD <phdofthehouse_at_[hidden]>
Date: Tue, 10 Apr 2018 19:58:24 -0400

I think the issues and goals of the Unicode SG can be categorized in the
following ways.

Note that what I say is not binding or agreed upon. I have participated in
a few meetings and listened to everyone who has attended them (the meetings
have been going on for longer than I have been around). I also put forth
ideas of my own. But, this is in no way a normative or binding reference of
what we are doing and getting done.

Let's start with the *Issues.*

*Text Processing in the Standard Library is not good:*
- Archaic interfaces for an ASCII, EBCDIC, and other non-multibyte-encoding
world lead people to poorly handle text outside of that scope.[1]
- Many encodings and text transformations, Unicode or otherwise, are unable
to be handled by current library facilities[2]
- When people encounter the first two points, they often either take a deep
dive into ICU (maybe with Boost.Locale), which is not renowned for its
interfaces, or abandon their idea altogether.
- It's embarrassing that this is the best C++ has after decades of people
working outside of standardization to raise the bar on this.

Unicode is big, and so is text processing, so we're going to focus on two
big parts of it!

*The Big Goals:*
- How do we handle encoding differences, text equivalency (normalization,
collation), and processing (case folding, case mapping, title casing, and
more)? What about text segmentation (word breaks, line breaks, etc.)?
Iterators, free functions? These things are slowly beginning to form in our
meetings, and progress should be accelerated thanks to the start of this
mailing list.
- What does a string/text abstraction look like in the Bright, Beautiful
C++ Future? Is it a new string class? A bucket of free functions? A mix of
the two?
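
To make the text equivalency question concrete, here is a minimal sketch of
the gap that normalization is meant to close. The function names are made up
for illustration and are not from any proposal. The same user-perceived
character, "é", can be written as one code point or as two, and a plain
code-point comparison treats the two spellings as different strings:

```cpp
#include <cassert>
#include <string>

// "é" as a single precomposed code point, U+00E9.
inline std::u32string precomposed_e_acute() { return U"\u00E9"; }

// "é" as "e" followed by U+0301 COMBINING ACUTE ACCENT.
inline std::u32string decomposed_e_acute() { return U"e\u0301"; }

// Naive comparison: code point by code point, with no normalization step.
inline bool same_code_points(const std::u32string& a, const std::u32string& b) {
    return a == b;
}

// Hypothetical helper showing the mismatch that normalization has to resolve.
inline void check_equivalence_gap() {
    assert(precomposed_e_acute().size() == 1);
    assert(decomposed_e_acute().size() == 2);
    // Canonically equivalent text, yet unequal without normalization.
    assert(!same_code_points(precomposed_e_acute(), decomposed_e_acute()));
}
```

Whatever shape the eventual interface takes, it would need to let these two
spellings compare equal when the user asks for canonical equivalence.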

Those are the big goals. We have more or less found out that iterators are
amazing for encoding: 5 separate people (and more we are not tracking) have
implemented iterator-based encoding and decoding and it is a slam-dunk of
an abstraction. There are some fine details to tune up, but that is more or
less done. Now it's on to equivalency, processing, and
segmentation/breaking while we nitpick the fine details and function names
of an Encoding concept.
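
For anyone who has not seen the iterator approach, here is a rough sketch of
a decoding iterator over UTF-8 code units. It is not taken from any of the
implementations mentioned above: the names `utf8_decode_iterator` and
`decode_utf8` are invented for this example, and the error policy (emit
U+FFFD and skip one code unit on any malformed sequence) is just one
assumption among several reasonable choices.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical input iterator that adapts UTF-8 code units into code points.
class utf8_decode_iterator {
public:
    using iterator_category = std::input_iterator_tag;
    using value_type = char32_t;
    using difference_type = std::ptrdiff_t;
    using pointer = const char32_t*;
    using reference = char32_t;

    utf8_decode_iterator(const char* pos, const char* end) : pos_(pos), end_(end) {}

    char32_t operator*() const { return decode().code_point; }
    utf8_decode_iterator& operator++() { pos_ += decode().length; return *this; }
    utf8_decode_iterator operator++(int) { auto tmp = *this; ++*this; return tmp; }

    friend bool operator==(const utf8_decode_iterator& a, const utf8_decode_iterator& b) {
        return a.pos_ == b.pos_;
    }
    friend bool operator!=(const utf8_decode_iterator& a, const utf8_decode_iterator& b) {
        return !(a == b);
    }

private:
    struct result { char32_t code_point; std::size_t length; };

    // Decodes the sequence starting at pos_ without advancing.
    result decode() const {
        assert(pos_ != end_);  // dereferencing the end iterator is a bug
        const result invalid{char32_t{0xFFFD}, 1};  // assumed error policy
        const auto lead = static_cast<unsigned char>(*pos_);
        if (lead < 0x80) return {lead, 1};

        std::size_t length = 0;
        char32_t cp = 0;
        char32_t smallest = 0;  // smallest code point allowed for this length
        if ((lead & 0xE0) == 0xC0) { length = 2; cp = lead & 0x1F; smallest = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { length = 3; cp = lead & 0x0F; smallest = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { length = 4; cp = lead & 0x07; smallest = 0x10000; }
        else return invalid;  // stray continuation unit or invalid lead

        if (static_cast<std::size_t>(end_ - pos_) < length) return invalid;  // truncated
        for (std::size_t i = 1; i < length; ++i) {
            const auto unit = static_cast<unsigned char>(pos_[i]);
            if ((unit & 0xC0) != 0x80) return invalid;
            cp = (cp << 6) | (unit & 0x3F);
        }
        // Reject overlong forms, surrogates, and values past U+10FFFF.
        if (cp < smallest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return invalid;
        return {cp, length};
    }

    const char* pos_;
    const char* end_;
};

// Convenience wrapper: decode a whole byte string into code points.
inline std::u32string decode_utf8(const std::string& bytes) {
    const char* first = bytes.data();
    const char* last = first + bytes.size();
    return std::u32string(utf8_decode_iterator(first, last),
                          utf8_decode_iterator(last, last));
}
```

The appeal is that the iterator plugs into anything that already accepts an
iterator pair, so `decode_utf8` is a one-line range construction rather than
a hand-written loop.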

*Some short-term goals we're already getting started on:*
- Advance char8_t proposal[3] through the rest of EWG and LEWG. This will
allow us to -- at compile-time -- identify strings and string literals that
are meant to be utf8, separating us from the "narrow encoding" problem.
- Contacted Alisdair Meredith and other std-going members about updating
references to other standards in the C++ Specification. We are going to do
our best to track the latest and greatest Unicode Standard, starting with
updating Unicode References in the C++ standard to point towards ISO/IEC
10646:2017 and friends. This update will give us the flexibility to at least
use Unicode Version 10, if not 11 by the time of publication.
- Start answering the above Big Goals questions with some concrete
implementation and some serious theorycrafting / bikeshedding.

There is some good news: because C++ is late to the party, a lot of the
wrinkles in Unicode have been smoothed out fairly well. It's not an
entirely bug-free standard, but a lot of work has gone into making stable
versions of things like Normalization that would make now a good time to
standardize since there are guarantees about how certain algorithms will
behave, regardless of Unicode Version. These are great first things to
focus on and standardize and put implementation effort into.

*Some other things that are happening:*
- Bob Steagall is going to present his work in making transcoding iterators
blazing fast at C++Now in Aspen, Colorado, USA in a month. From the looks
of it his work should put codecvt to shame.
- Mark Zeren is looking into updating `std::string` and cleaning up its
interfaces and iterator guarantees in the Specification's Wording
- Zach Laine is working on what a `std2::string` (plus more) would look
like[4]
- libogonek, by R. Martinho Fernandes, is receiving some general updates[5]
- I am (hopefully in about a month) going to be seeing what it would take
to make Zach Laine's text abstractions more generic to support encodings
outside of the UTF-8/16/32 converting iterators he will ship, combining
ideas from my talk[6], Tom Honermann's text_view[7], and libogonek. I'll
likely be starting from libogonek as a basis.

As I understand it, we are not touching the standardization of regex.
Unicode-conformant regex is MASSIVE and takes a HUGE amount of effort. So
let's take care of the massive beef steak on our plates (from everything
listed above) and then go over to the enormous ham later, yeah?

All this being said, help is always welcome! There are a lot of things to
explore here, and maybe things we're not thinking about yet. We do need
some help bikeshedding and theory-crafting and implementing.

I apologize for the long e-mail, but I wanted to cover almost everything. I
feel like there might be a few things left out, but... that should help, I
think! See also: https://github.com/sg16-unicode/sg16-meetings

Sincerely,
JeanHeyd "ThePhD" Meneide (No I am not a Doctor.) (Yet! Working on it. :D)

[1] - See a proposal for trying to revamp ctype and all the traps the
person falls into while trying to make something that's Designed To Be
Deprecated:
https://groups.google.com/a/isocpp.org/forum/#!topic/std-proposals/Besva70LN3c
[2] -
https://stackoverflow.com/questions/17103925/how-well-is-unicode-supported-in-c11/17106065#17106065
[3] - http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0482r1.html
[4] - https://github.com/tzlaine/text
[5] - https://github.com/rmartinho/ogonek
[6] -
https://github.com/ThePhD/ThePhD.github.io/blob/master/presentations/unicode/2018.03.07%20-%20ThePhD%20-%20a%20rudimentary%20unicode%20abstraction.pdf
[7] - https://github.com/tahonermann/text_view

On Tue, Apr 10, 2018, 5:55 PM <keld_at_[hidden]> wrote:

> Hi
>
> Got it!
>
> What are the issues of the Unicode SG?
>
> Best regards
> Keld
>
> On Tue, Apr 10, 2018 at 04:56:43PM -0400, Deruupu Sutoomo wrote:
> > Testing, testing... is this on?
>
> > _______________________________________________
> > Unicode mailing list
> > Unicode_at_[hidden]
> > http://www.open-std.org/mailman/listinfo/unicode
>
>

Received on 2018-04-11 01:58:27