I think the issues and goals of the Unicode SG can be categorized in the following ways.
Note that what I say is not binding or agreed upon. I have participated in a few meetings and listened to everyone who has attended them (the meetings have been going on for longer than I have been around). I also put forth ideas of my own. But, this is in no way a normative or binding reference of what we are doing and getting done.
Let's start with the Issues.
Text Processing in the Standard Library is not good:
- Archaic interfaces for an ASCII, EBCDIC, and other non-multibyte-encoding world lead people to poorly handle text outside of that scope.[1]
- Many encodings and text transformations, Unicode or otherwise, are unable to be handled by current library facilities[2]
- When people encounter the first two points, they often either take a deep dive into ICU (maybe with Boost.Locale), which is not renowned for its interfaces, or abandon their idea altogether.
- It's embarrassing that this is the best C++ has after decades of people working outside of standardization to raise the bar on this.
Unicode is big, and so is text processing, so we're going to focus on two big parts of it!
The Big Goals:
- How do we handle encoding differences, text equivalency (normalization, collation), and processing (case folding, case mapping, title casing, and more)? What about text segmentation (word breaks, line breaks, etc.)? Iterators, free functions? These things are slowly beginning to form in our meetings, and progress should be accelerated thanks to the start of this mailing list.
- What does a string/text abstraction look like in the Bright, Beautiful C++ Future? Is it a new string class? A bucket of free functions? A mix of the two?
Those are the big goals. We have more or less found out that iterators are amazing for encoding: 5 separate people (and more we are not tracking) have implemented iterator-based encoding and decoding and it is a slam-dunk of an abstraction. There are some fine details to tune up, but that is more or less done. Now it's on to equivalency, processing, and segmentation/breaking while we nitpick the fine details and function names of an Encoding concept.
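To make the iterator idea concrete, here is a minimal sketch of a decoding iterator that walks UTF-8 bytes and yields code points. It is not any of the implementations mentioned above: the names `utf8_decode_iterator` and `decode_utf8` are hypothetical, and replacing ill-formed input with U+FFFD one byte at a time is just one possible error policy I am assuming for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <string_view>

// Hypothetical name: an input iterator that decodes UTF-8 bytes to code points.
// Assumed error policy: any ill-formed byte yields U+FFFD and consumes one byte.
class utf8_decode_iterator {
public:
    using iterator_category = std::input_iterator_tag;
    using value_type = char32_t;
    using difference_type = std::ptrdiff_t;
    using pointer = const char32_t*;
    using reference = char32_t;

    utf8_decode_iterator() = default;
    utf8_decode_iterator(const char* pos, const char* end) : pos_(pos), end_(end) { decode(); }

    char32_t operator*() const { return value_; }
    utf8_decode_iterator& operator++() {
        assert(pos_ != end_ && "cannot advance past the end");
        pos_ += len_;
        decode();
        return *this;
    }
    utf8_decode_iterator operator++(int) { auto tmp = *this; ++*this; return tmp; }

    friend bool operator==(const utf8_decode_iterator& a, const utf8_decode_iterator& b) {
        return a.pos_ == b.pos_;
    }
    friend bool operator!=(const utf8_decode_iterator& a, const utf8_decode_iterator& b) {
        return !(a == b);
    }

private:
    // Decode the sequence starting at pos_ into value_, and record its length.
    void decode() {
        value_ = 0xFFFD;  // pessimistic default: replacement character
        len_ = 1;         // ... consuming a single byte
        if (pos_ == end_) { len_ = 0; return; }

        const unsigned char b0 = static_cast<unsigned char>(*pos_);
        if (b0 < 0x80) { value_ = b0; return; }  // ASCII fast path

        std::size_t need = 0;  // number of continuation bytes expected
        char32_t cp = 0;       // code point being assembled
        char32_t min = 0;      // smallest value allowed (rejects overlong forms)
        if ((b0 & 0xE0) == 0xC0)      { need = 1; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { need = 2; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { need = 3; cp = b0 & 0x07; min = 0x10000; }
        else return;  // stray continuation byte or invalid lead byte

        if (static_cast<std::size_t>(end_ - pos_) <= need) return;  // truncated
        for (std::size_t i = 1; i <= need; ++i) {
            const unsigned char b = static_cast<unsigned char>(pos_[i]);
            if ((b & 0xC0) != 0x80) return;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong encodings, out-of-range values, and surrogates.
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return;

        value_ = cp;
        len_ = need + 1;
    }

    const char* pos_ = nullptr;
    const char* end_ = nullptr;
    char32_t value_ = 0xFFFD;
    std::size_t len_ = 0;
};

// Hypothetical convenience wrapper: decode a whole byte sequence at once.
inline std::u32string decode_utf8(std::string_view bytes) {
    const char* first = bytes.data();
    const char* last = first + bytes.size();
    return std::u32string(utf8_decode_iterator(first, last),
                          utf8_decode_iterator(last, last));
}
```

Because it is just an iterator pair, it plugs straight into existing algorithms and container constructors: `decode_utf8("\xE2\x82\xAC")` produces a one-element `std::u32string` holding U+20AC. That composability is a big part of why the abstraction feels like a slam dunk.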
Some short-term goals we're already getting started on:
- Advance char8_t proposal[3] through the rest of EWG and LEWG. This will allow us to -- at compile-time -- identify strings and string literals that are meant to be utf8, separating us from the "narrow encoding" problem.
- Contacted Alisdair Meredith and other std-going members about updating references to other standards in the C++ Specification. We are going to do our best to track the latest and greatest Unicode Standard, starting with updating Unicode References in the C++ standard to point towards ISO/IEC 10646:2017 and friends. This update will give us the flexibility to at least use Unicode Version 10, if not 11 by the time of publication.
- Start answering the above Big Goals questions with some concrete implementation and some serious theorycrafting / bikeshedding.
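As a rough illustration of what the char8_t item buys us, the sketch below uses plain overloading to tell a UTF-8 literal apart from a narrow one at compile time. It assumes a compiler that implements the char8_t proposal (detected here through the `__cpp_char8_t` feature-test macro); `encoding_of` is a hypothetical helper, not something from the proposal.

```cpp
#include <string_view>

// Hypothetical helper: report what the *type* of a string tells us about its encoding.
// A plain char string only promises "the narrow encoding", whatever that is at runtime.
constexpr std::string_view encoding_of(const char*) { return "narrow"; }

#if defined(__cpp_char8_t)
// With char8_t, u8"" literals get a distinct type, so this overload is chosen
// at compile time and the string is known to be UTF-8.
constexpr std::string_view encoding_of(const char8_t*) { return "UTF-8"; }
#endif
```

Without char8_t, `u8"text"` has the same type as `"text"`, so both calls land on the `const char*` overload and the UTF-8 intent is lost in the type system. With it, `encoding_of(u8"text")` resolves to the `const char8_t*` overload with no runtime check at all.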
There is some good news: because C++ is late to the party, a lot of the wrinkles in Unicode have been smoothed out fairly well. It's not an entirely bug-free standard, but a lot of work has gone into making stable versions of things like Normalization. That makes now a good time to standardize, since there are guarantees about how certain algorithms will behave regardless of Unicode Version. These are great first things to focus on, standardize, and put implementation effort into.
Some other things that are happening:
- Bob Steagall is going to present his work in making transcoding iterators blazing fast at C++Now in Aspen, Colorado, USA in a month. From the looks of it, his work should put codecvt to shame.
- Mark Zeren is looking into updating `std::string` and cleaning up its interfaces and iterator guarantees in the Specification's Wording.
- Zach Laine is working on what a `std2::string` (plus more) would look like[4]
- libogonek, by R. Martinho Fernandes, is receiving some general updates[5]
- I am (hopefully in about a month) going to be seeing what it would take to make Zach Laine's text abstractions more generic to support encodings outside of the UTF-8/16/32 converting iterators he will ship, combining ideas from my talk[6], Tom Honermann's text_view[7], and libogonek. I'll likely be starting from libogonek as a basis.
As I understand it, we are not touching the standardization of regex. Unicode-conformant regex is MASSIVE and takes a HUGE amount of effort. So let's take care of the massive beef steak on our plates (from everything listed above) and then go over to the enormous ham later, yeah?
All this being said, help is always welcome! There are a lot of things to explore here, and maybe things we're not thinking about yet. We do need some help bikeshedding and theory-crafting and implementing.
Sincerely,
JeanHeyd "ThePhD" Meneide (No I am not a Doctor.) (Yet! Working on it. :D)
[1] - See a proposal for trying to revamp ctype and all the traps the person falls into while trying to make something that's Designed To Be Deprecated:
https://groups.google.com/a/isocpp.org/forum/#!topic/std-proposals/Besva70LN3c
[2] - https://stackoverflow.com/questions/17103925/how-well-is-unicode-supported-in-c11/17106065#17106065