On Mar 9, 2022, at 12:46 PM, Tom Honermann <tom@honermann.net> wrote:

Thank you, Steven!

On 3/9/22 1:17 PM, Steven R. Loomis wrote:

Meta comment: Referencing the ICU module or class would be helpful rather than just the feature name.

I agree that would be helpful, but there doesn't seem to be an obvious location to link to within the Doxygen-generated docs. And, unfortunately, the links on https://icu.unicode.org/charts/comparison are all broken at the moment. Can you provide an example of where to link to?

It should take you to a page that directs you to https://unicode-org.github.io/icu/userguide/ - but you could definitely file a ticket to fix up the comparison page.

Instead of a link, what I meant is just to name the ICU class or module. UnicodeSet instead of “Sets of Unicode Code Points and Strings” (the C or C++ column of doxygen).

* Definitely consider icu4x and ecma402’s API, with the benefit of many additional years of trial and error. These APIs are in some places purposefully a subset of ICU.

Yes. As we dive into feature categories, we'll definitely want to do some compare and contrast work among these.

* Formatting: As Corentin said, these should probably be looked at in detail in comparison with std::locale / chrono.

* Locale identifier. I think the feature set here should prioritize BCP47 (Unicode Locale Identifier but with hyphen separators), and not POSIX ids. So en-US, not en_US, but provide ways to convert to other formats.

A bit of a tangent here, and I'll be showing my ignorance, but given a std::locale object, I would assume it is not generally possible to correlate it with a locale identifier. However, I would imagine that an OS locale (as presented by $LANG in POSIX environments or whatever interface is used on Windows) can be mapped to a locale identifier. Does that sound right?

std::locale objects are an implementation of locale data and algorithms, not just the identifier. ICU Locale objects are an identifier. What I mean is that a way to work with locale identifiers is needed. (ICU has a page which discusses the difference, at https://unicode-org.github.io/icu/userguide/locale/#the-locale-concept )

An OS locale should be mapped to a locale identifier. Right now ICU has a lot of platform specific code to do this.
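To make the mapping concrete, here is a minimal C++ sketch of only the simplest part of that job: turning a POSIX-style name (such as the value of $LANG) into a hyphen-separated tag. The name posix_to_bcp47 is hypothetical (it is not an ICU or standard library API), and mapping "C"/"POSIX" to "und" is an assumption of this sketch. Real platform code also has to deal with codesets, modifiers, and Windows locale names; here the codeset and modifier are simply dropped.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper, not an ICU or standard API.
// Converts a POSIX-style locale name such as "en_US.UTF-8" into a
// hyphen-separated tag such as "en-US". The codeset (".UTF-8") and the
// modifier ("@euro") are dropped rather than translated.
std::string posix_to_bcp47(std::string_view posix_name) {
    // Assumption of this sketch: names with no language map to "und",
    // the BCP 47 subtag for an undetermined language.
    if (posix_name.empty() || posix_name == "C" || posix_name == "POSIX") {
        return "und";
    }
    // Keep only the part before the first '.' or '@'.
    // If neither is present, find_first_of returns npos and substr keeps everything.
    const std::size_t end = posix_name.find_first_of(".@");
    std::string tag(posix_name.substr(0, end));
    // POSIX separates language and territory with '_'; BCP 47 uses '-'.
    for (char& c : tag) {
        if (c == '_') {
            c = '-';
        }
    }
    return tag;
}
```

For example, posix_to_bcp47("en_US.UTF-8") yields "en-US". The sketch does no validation, so a malformed input produces a malformed tag.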

I didn’t mention, but the identifiers are most helpful when they are singletons. Then you can say “is locale A == locale B” rapidly.
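One way to get that property is to intern the identifiers, so that equal tags share one stored string and comparison is a pointer comparison. The following is only a sketch under that assumption: the LocaleId class and its intern function are hypothetical names, not ICU's Locale API, and the pool is not thread-safe as written.

```cpp
#include <string>
#include <unordered_set>

// Hypothetical interned locale identifier, not an ICU or standard type.
// Each distinct tag is stored once in a process-wide pool, and a LocaleId
// holds only a pointer to that pooled string.
class LocaleId {
public:
    // Returns the identifier for `tag`, adding it to the pool on first use.
    // Not thread-safe: a real implementation would need synchronization.
    static LocaleId intern(const std::string& tag) {
        // Element addresses in std::unordered_set stay valid across rehashing.
        static std::unordered_set<std::string> pool;
        return LocaleId(&*pool.insert(tag).first);
    }

    const std::string& tag() const { return *tag_; }

    // Equality is a pointer comparison; no string comparison is needed.
    friend bool operator==(LocaleId a, LocaleId b) { return a.tag_ == b.tag_; }
    friend bool operator!=(LocaleId a, LocaleId b) { return a.tag_ != b.tag_; }

private:
    explicit LocaleId(const std::string* tag) : tag_(tag) {}
    const std::string* tag_;
};
```

The trade-off is that pooled tags are never freed in this sketch, and tags that differ only in case or separator would need to be canonicalized before interning to compare equal.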

* Maps: these are mostly for ICU’s internal use, though they could be useful for other users’ data.

* “Sets of Unicode Code Points and Strings”: This is actually used in a number of processes including implementing other Unicode features. I’m assuming this refers to UnicodeSet. This is analogous to the sets available in Perl Compatible Regular Expressions, allowing the following operations:

• Does a given string contain any characters matching “[:Deva:]” (i.e. Devanagari)? Does it consist ONLY of “[:Deva:]”?

• If the Adangme language has an expected repertoire of “[a á ã b d e é ɛ {ɛ\u0301} {ɛ\u0303} f g h i í ĩ j k l m n o ɔ {ɔ\u0301} {ɔ\u0303} p s t u v w y z]”, how does the set of a certain string relate to that repertoire?

• Related, a certain font may have a certain repertoire, and this can be compared to scripts or certain languages.

• I found this snippet that relates POSIX locale categories with UnicodeSet. (It’s used for exporting CLDR data to POSIX format, but might demonstrate some use cases)

{ "upper", "[:Uppercase:]" },
{ "lower", "[:Lowercase:]" },
{ "alpha", "[[:Alphabetic:]-[[:Uppercase:][:Lowercase:]]]" },
{ "space", "[:Whitespace:]" },
{ "cntrl", "[:Control:]" },
{ "graph", "[^[:Whitespace:][:Control:][:Format:][:Surrogate:][:Unassigned:]]" },
{ "print", "[^[:Control:][:Format:][:Surrogate:][:Unassigned:]]" },
{ "punct", "[:Punctuation:]" },
{ "digit", "[0-9]" },
{ "xdigit", "[0-9 a-f A-F]" },
{ "blank", "[[:Whitespace:]-[\\u000A-\\u000D \\u0085 [:Line_Separator:][:Paragraph_Separator:]]]" } };

In any event, this could be considered as an alternate way to construct a std::set<char32_t>

Note {ɛ\u0301} represents ɛ́ (ɛ + ´)
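As a rough illustration of the questions in the bullets above, here is a small C++ sketch using std::set<char32_t>. All names in it (CodePointSet, make_range, contains_some, contains_all, outside) are hypothetical and only loosely modeled on the UnicodeSet operations; this is not ICU code. Two caveats: a real “[:Deva:]” is driven by the Script property, which a plain code point range such as U+0900..U+097F (the Devanagari block) only approximates, and a set of char32_t cannot hold multi-code-point strings such as {ɛ\u0301}.

```cpp
#include <set>
#include <string>

// Hypothetical stand-in for a set of code points (no string elements).
using CodePointSet = std::set<char32_t>;

// Builds the set of all code points in the inclusive range [first, last].
// Precondition: first <= last.
CodePointSet make_range(char32_t first, char32_t last) {
    CodePointSet result;
    for (char32_t c = first; ; ++c) {
        result.insert(c);
        if (c == last) {
            break;  // stop here so ++c can never run past `last`
        }
    }
    return result;
}

// True if at least one code point of `text` is in `set`.
bool contains_some(const CodePointSet& set, const std::u32string& text) {
    for (char32_t c : text) {
        if (set.count(c) != 0) {
            return true;
        }
    }
    return false;
}

// True if every code point of `text` is in `set` (true for empty text).
bool contains_all(const CodePointSet& set, const std::u32string& text) {
    for (char32_t c : text) {
        if (set.count(c) == 0) {
            return false;
        }
    }
    return true;
}

// The code points of `text` that fall outside `set`, e.g. outside a
// language's expected repertoire or a font's coverage.
CodePointSet outside(const CodePointSet& set, const std::u32string& text) {
    CodePointSet result;
    for (char32_t c : text) {
        if (set.count(c) == 0) {
            result.insert(c);
        }
    }
    return result;
}
```

With make_range(0x0900, 0x097F) as the approximation of Devanagari, contains_some answers "does the string contain any such characters", contains_all answers "does it consist only of them", and outside reports what falls outside the repertoire. A std::set of individual code points is used here for clarity only; it is far less compact than a range-based representation for large properties.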

* Unicode Text Compression:

SCSU has been stabilized: https://www.unicode.org/reports/tr6/

> SCSU defines a compact encoding, which is sometimes useful. However, Unicode text is much more commonly stored and transmitted in UTF-8 which is less compact (except for ASCII), much simpler, and does not present any security issues. For longer texts, general-purpose compression is effective and common. Therefore, there is no need to develop this report any further.

BOCU-1 was withdrawn as a UTS -> https://www.unicode.org/reports/tr40/

* Index Characters - motivation here is for UI display of tabbed entries, such as a personal address book. It’s related to collation.

* Arabic shaping - This has to do with converting Arabic text into preformatted form. It may be too specialized for a general library operation.

* Complex Text Layout - removed from ICU, “use Harfbuzz instead”

* Paragraph Layout - Used by some. Depends on Harfbuzz and on ICU services.

Thank you for all of those notes!

Welcome…

Tom.

Steven

--

Steven R. Loomis

Code Hive Tx, LLC

https://codehivetx.us

On Mar 9, 2022, at 4:01 AM, Corentin Jabot via SG16 <sg16@lists.isocpp.org> wrote:

A few random comments that may be useful

A relatively new project, icu4x, aims to provide similar features to ICU while rethinking some of the fundamental design decisions (for example, ICU has a UTF16-first interface, which isn't optimal) https://github.com/unicode-org/icu4x

Currency/Number/Date/Time formatting in ICU/Unicode/CLDR are significantly different from what std::locale can offer, and would deserve further consideration in that it is not "already provided by chrono"

In addition to ICU, ecma402 is worth considering - https://tc39.es/ecma402/ and https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl

Case mapping seems to be missing from the document, and may be worth considering

A lot of algorithms (casing, clusterization) have both locale-dependent and locale-independent variants

A locale identifier object is certainly a prerequisite for further unicode locale work - https://unicode.org/reports/tr35/tr35.html#Unicode_locale_identifier https://unicode-org.github.io/icu4x-docs/doc/icu_locid/struct.Locale.html

Regards,

Corentin

On Tue, Mar 8, 2022 at 10:07 PM Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:

This is your friendly reminder that this telecon is taking place tomorrow.
If you haven't yet, please review the linked Google Doc. Please add comments; especially with regard to any feature sets that you would like to discuss during the telecon.

Tom.

On 3/7/22 12:23 AM, Tom Honermann via SG16 wrote:

SG16 will hold a telecon on Wednesday, March 9th at 19:30 UTC (timezone conversion).
The agenda is:

ICU features to consider for C++26

During our last telecon, Jens suggested the possibility of a roadmap towards providing support for the ICU feature set in the C++ standard. To that end, I put together a Google Doc that lists categories of features that ICU provides. The doc contains a table in which I have pre-populated indications of which features I think might be reasonable for standardization in C++26. This is intended to be less of a roadmap and more a list of features for which papers are encouraged and that we would like to spend time on. Please feel free to edit the doc to add comments or challenge my yes, no, and maybe indications. The feature list is derived from the documented ICU Module List. It may be useful to peruse the ICU Feature Comparison Chart for additional features to add (I haven't done so yet due to time limitations). It is likely that I have misinterpreted what is provided by some of the feature categories so if you see something that surprises you, please speak up!

Tom.

--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16
