I think the issues and goals of the Unicode SG can be categorized in the following ways.

Note that what I say is not binding or agreed upon. I have participated in a few meetings and listened to everyone who has attended them (the meetings have been going on for longer than I have been around). I also put forth ideas of my own. But, this is in no way a normative or binding reference of what we are doing and getting done.

Let's start with the Issues.

Text Processing in the Standard Library is not good:

- Archaic interfaces for an ASCII, EBCDIC, and other non-multibyte-encoding world lead people to poorly handle text outside of that scope.[1]

- Many encodings and text transformations, Unicode or otherwise, cannot be handled by current library facilities.[2]

- When people encounter the first two points, they often either take a deep dive into ICU (maybe with Boost.Locale), which is not renowned for its interfaces, or abandon their idea altogether.

- It's embarrassing that this is the best C++ has after decades of people working outside of standardization to raise the bar on this.

Unicode is big, and so is text processing, so we're going to focus on two big parts of it!

The Big Goals:

- How do we handle encoding differences, text equivalency (normalization, collation), and processing (case folding, case mapping, title casing, and more)? What about text segmentation (word breaks, line breaks, etc.)? Iterators, free functions? These things are slowly beginning to form in our meetings, and progress should be accelerated thanks to the start of this mailing list.

- What does a string/text abstraction look like in the Bright, Beautiful C++ Future? Is it a new string class? A bucket of free functions? A mix of the two?

Those are the big goals. We have more or less found out that iterators are amazing for encoding: five separate people (and more we are not tracking) have implemented iterator-based encoding and decoding, and it is a slam-dunk of an abstraction. There are some fine details to tune up, but that is more or less done. Now it's on to equivalency, processing, and segmentation/breaking while we nitpick the fine details and function names of an Encoding concept.
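
To make the iterator point concrete, here is a minimal sketch of a decoding iterator. It is not any of the implementations mentioned above: the names `utf8_decode_iterator` and `decode_utf8` are made up for illustration, and replacing each offending byte with U+FFFD is just one simple error policy picked for brevity. It assumes C++17 for `std::string_view`.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <string_view>

// Hypothetical sketch: an input iterator that turns UTF-8 code units into
// code points on the fly. Ill-formed input yields U+FFFD, one byte at a time.
class utf8_decode_iterator {
public:
    using iterator_category = std::input_iterator_tag;
    using value_type = char32_t;
    using difference_type = std::ptrdiff_t;
    using pointer = const char32_t*;
    using reference = char32_t;

    utf8_decode_iterator(const char* pos, const char* end) : pos_(pos), end_(end) {}

    char32_t operator*() const {
        assert(pos_ != end_);  // dereferencing the end iterator is a bug
        return decode().code_point;
    }
    utf8_decode_iterator& operator++() {
        pos_ += decode().length;
        return *this;
    }
    utf8_decode_iterator operator++(int) {
        utf8_decode_iterator old = *this;
        ++*this;
        return old;
    }
    friend bool operator==(const utf8_decode_iterator& a, const utf8_decode_iterator& b) {
        return a.pos_ == b.pos_;
    }
    friend bool operator!=(const utf8_decode_iterator& a, const utf8_decode_iterator& b) {
        return !(a == b);
    }

private:
    struct decoded { char32_t code_point; std::size_t length; };

    // Decode the sequence starting at pos_ without moving the iterator.
    decoded decode() const {
        const decoded bad{0xFFFD, 1};
        const unsigned char lead = static_cast<unsigned char>(*pos_);
        if (lead < 0x80) return {lead, 1};  // ASCII fast path

        std::size_t length = 0;
        char32_t cp = 0;
        if ((lead & 0xE0) == 0xC0) { length = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { length = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { length = 4; cp = lead & 0x07; }
        else return bad;  // stray continuation byte or invalid lead byte

        if (static_cast<std::size_t>(end_ - pos_) < length) return bad;  // truncated
        for (std::size_t i = 1; i < length; ++i) {
            const unsigned char unit = static_cast<unsigned char>(pos_[i]);
            if ((unit & 0xC0) != 0x80) return bad;
            cp = (cp << 6) | (unit & 0x3F);
        }
        // Reject overlong forms, surrogates, and values past U+10FFFF.
        static constexpr char32_t smallest[] = {0, 0, 0x80, 0x800, 0x10000};
        const bool surrogate = cp >= 0xD800 && cp <= 0xDFFF;
        if (cp < smallest[length] || cp > 0x10FFFF || surrogate) return bad;
        return {cp, length};
    }

    const char* pos_;
    const char* end_;
};

// Convenience wrapper: decode a whole buffer by walking the iterator pair.
std::u32string decode_utf8(std::string_view bytes) {
    const char* first = bytes.data();
    const char* last = first + bytes.size();
    return std::u32string(utf8_decode_iterator(first, last),
                          utf8_decode_iterator(last, last));
}
```

Because the iterator only needs a pair of pointers, the same pair can feed any algorithm that accepts input iterators; swapping in a different encoding would mean swapping the `decode` step, which is roughly where an Encoding concept would plug in.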

Some short-term goals we're already getting started on:

- Advance the char8_t proposal[3] through the rest of EWG and LEWG. This will allow us to -- at compile-time -- identify strings and string literals that are meant to be UTF-8, separating us from the "narrow encoding" problem.

- Contacted Alisdair Meredith and other std-going members about updating references to other standards in the C++ Specification. We are going to do our best to track the latest and greatest Unicode Standard, starting with updating Unicode references in the C++ standard to point towards ISO/IEC 10646:2017 and friends. This update will give us the flexibility to at least use Unicode Version 10, if not 11 by the time of publication.

- Start answering the above Big Goals questions with some concrete implementation and some serious theorycrafting / bikeshedding.
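
As a small illustration of the char8_t item, here is a sketch of what "identify at compile-time" could look like. It assumes a compiler that implements the proposal and defines its `__cpp_char8_t` feature-test macro; `text_kind` and `classify` are hypothetical names invented for this example, not part of any proposal.

```cpp
// Hypothetical tag describing what we know about a literal's encoding.
enum class text_kind { narrow, utf8 };

// Ordinary literals: the encoding is whatever the implementation chose.
constexpr text_kind classify(const char*) { return text_kind::narrow; }

#if defined(__cpp_char8_t)
// With char8_t, u8"" literals have their own type and select this overload.
constexpr text_kind classify(const char8_t*) { return text_kind::utf8; }

static_assert(classify(u8"text") == text_kind::utf8,
              "u8 literals are identifiable at compile time");
#endif

static_assert(classify("text") == text_kind::narrow,
              "ordinary literals stay in the narrow-encoding bucket");
```

Without char8_t, `u8"text"` is an array of plain `char`, so overload resolution cannot tell it apart from an ordinary narrow literal; that is the gap the proposal closes.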

There is some good news: because C++ is late to the party, a lot of the wrinkles in Unicode have been smoothed out fairly well. It's not an entirely bug-free standard, but a lot of work has gone into making stable versions of things like Normalization. That makes now a good time to standardize, since there are guarantees about how certain algorithms will behave regardless of Unicode Version. These are great first things to focus on, standardize, and put implementation effort into.

Some other things that are happening:

- Bob Steagall is going to present his work on making transcoding iterators blazing fast at C++Now in Aspen, Colorado, USA in a month. From the looks of it, his work should put codecvt to shame.

- Mark Zeren is looking into updating `std::string` and cleaning up its interfaces and iterator guarantees in the Specification's wording.

- Zach Laine is working on what a `std2::string` (plus more) would look like[4]

- libogonek, by R. Martinho Fernandes, is receiving some general updates[5]

- I am (hopefully in about a month) going to be seeing what it would take to make Zach Laine's text abstractions more generic to support encodings outside of the UTF-8/16/32 converting iterators he will ship, combining ideas from my talk[6], Tom Honermann's text_view[7], and libogonek. I'll likely be starting from libogonek as a basis.

As I understand it, we are not touching the standardization of regex. Unicode-conformant regex is MASSIVE and takes a HUGE amount of effort. So let's take care of the massive beef steak on our plates (from everything listed above) and then go over to the enormous ham later, yeah?

All this being said, help is always welcome! There are a lot of things to explore here, and maybe things we're not thinking about yet. We do need some help bikeshedding and theorycrafting and implementing.

I apologize for the long e-mail, but I wanted to cover almost everything. I feel like there might be a few things left out, but... that should help, I think! See also: https://github.com/sg16-unicode/sg16-meetings

Sincerely,

JeanHeyd "ThePhD" Meneide (No I am not a Doctor.) (Yet! Working on it. :D)

[1] - See a proposal for trying to revamp ctype and all the traps the person falls into while trying to make something that's Designed To Be Deprecated: https://groups.google.com/a/isocpp.org/forum/#!topic/std-proposals/Besva70LN3c
[2] - https://stackoverflow.com/questions/17103925/how-well-is-unicode-supported-in-c11/17106065#17106065

[3] - http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0482r1.html
[4] - https://github.com/tzlaine/text
[5] - https://github.com/rmartinho/ogonek
[6] - https://github.com/ThePhD/ThePhD.github.io/blob/master/presentations/unicode/2018.03.07%20-%20ThePhD%20-%20a%20rudimentary%20unicode%20abstraction.pdf
[7] - https://github.com/tahonermann/text_view

On Tue, Apr 10, 2018, 5:55 PM <keld@keldix.com> wrote:

Hi

Got it!

What are the issues of the Unicode SG?

Best regards
Keld

On Tue, Apr 10, 2018 at 04:56:43PM -0400, Deruupu Sutoomo wrote:
> Testing, testing... is this on?

> _______________________________________________
> Unicode mailing list
> Unicode@isocpp.open-std.org
> http://www.open-std.org/mailman/listinfo/unicode