Date: Wed, 9 Mar 2022 12:17:03 -0600
Meta comment: Referencing the ICU module or class would be helpful rather than just the feature name.
* Definitely consider icu4x and ecma402’s API, with the benefit of many additional years of trial and error. These APIs are in some places purposefully a subset of ICU.
* Formatting: As Corentin said, these should probably be looked at in detail in comparison with std::locale / chrono.
* Locale identifier: I think the feature set here should prioritize BCP47 (Unicode Locale Identifier, but with hyphen separators), and not POSIX ids. So en-US, not en_US, but provide ways to convert to other formats.
* Maps: these are mostly for ICU’s internal use, though they could be useful for other users’ data.
* “Sets of Unicode Code Points and Strings”: This is actually used in a number of processes including implementing other Unicode features. I’m assuming this refers to UnicodeSet. This is analogous to the sets available in Perl Compatible Regular Expressions, allowing the following operations:
• Does a given string contain any characters matching “[:Deva:]” (i.e. Devanagari)? Does it consist ONLY of “[:Deva:]”?
• If the Adangme language has an expected repertoire of “[a á ã b d e é ɛ {ɛ\u0301} {ɛ\u0303} f g h i í ĩ j k l m n o ɔ {ɔ\u0301} {ɔ\u0303} p s t u v w y z]”, how does the set of characters in a given string relate to that repertoire?
• Related, a certain font may have a certain repertoire, and this can be compared to scripts or certain languages.
• I found this snippet that relates POSIX locale categories with UnicodeSet. (It’s used for exporting CLDR data to POSIX format, but might demonstrate some use cases)
{ "upper", "[:Uppercase:]" },
{ "lower", "[:Lowercase:]" },
{ "alpha", "[[:Alphabetic:]-[[:Uppercase:][:Lowercase:]]]" },
{ "space", "[:Whitespace:]" },
{ "cntrl", "[:Control:]" },
{ "graph", "[^[:Whitespace:][:Control:][:Format:][:Surrogate:][:Unassigned:]]" },
{ "print", "[^[:Control:][:Format:][:Surrogate:][:Unassigned:]]" },
{ "punct", "[:Punctuation:]" },
{ "digit", "[0-9]" },
{ "xdigit", "[0-9 a-f A-F]" },
{ "blank", "[[:Whitespace:]-[\\u000A-\\u000D \\u0085 [:Line_Separator:][:Paragraph_Separator:]]]" } };
In any event, this could be considered as an alternate way to construct a std::set<char32_t>.
Note: {ɛ\u0301} represents ɛ́ (ɛ + ´, i.e. U+025B followed by the combining acute accent U+0301).
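To make the "contains any" / "contains only" questions above concrete, here is a rough sketch over a plain std::set<char32_t>. The names CodePointSet, make_range, contains_any and contains_only are made up for illustration and are not ICU or standard library APIs. The Devanagari block range U+0900..U+097F is only an approximation of the script property “[:Deva:]”, and multi-code-point strings such as {ɛ\u0301} are not handled at all.

```cpp
#include <algorithm>
#include <set>
#include <string>

// Hypothetical stand-in for a UnicodeSet-like type: code points only,
// no multi-code-point strings and no property lookups.
using CodePointSet = std::set<char32_t>;

// Build a set from an inclusive code point range.
// Assumes last is below the maximum char32_t value.
CodePointSet make_range(char32_t first, char32_t last) {
    CodePointSet s;
    for (char32_t c = first; c <= last; ++c) {
        s.insert(c);
    }
    return s;
}

// True if at least one code point of text is in set.
bool contains_any(const CodePointSet& set, const std::u32string& text) {
    return std::any_of(text.begin(), text.end(),
                       [&](char32_t c) { return set.count(c) != 0; });
}

// True if every code point of text is in set (vacuously true for empty text).
bool contains_only(const CodePointSet& set, const std::u32string& text) {
    return std::all_of(text.begin(), text.end(),
                       [&](char32_t c) { return set.count(c) != 0; });
}
```

A caller could build the Devanagari block with make_range(0x0900, 0x097F) and then ask both questions of the same string. A real facility would want pattern parsing and property data, which is exactly what this sketch leaves out.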
* Unicode Text Compression:
SCSU has been stabilized: https://www.unicode.org/reports/tr6/
> SCSU defines a compact encoding, which is sometimes useful. However, Unicode text is much more commonly stored and transmitted in UTF-8 which is less compact (except for ASCII), much simpler, and does not present any security issues. For longer texts, general-purpose compression is effective and common. Therefore, there is no need to develop this report any further.
BOCU-1 was withdrawn as a UTS -> https://www.unicode.org/reports/tr40/
* Index Characters - motivation here is for UI display of tabbed entries, such as a personal address book. It’s related to collation.
* Arabic shaping - This has to do with converting Arabic text into preformatted form. It may be too specialized for a general library operation.
* Complex Text Layout - removed from ICU, “use Harfbuzz instead”
* Paragraph Layout - Used by some. Depends on Harfbuzz and on ICU services.
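On the locale identifier point above (en-US vs. en_US), a conversion helper might look roughly like the following. This is a deliberately naive sketch: the function names are hypothetical, it only swaps separators and drops a POSIX ".codeset" / "@modifier" suffix, and it does not validate subtags, map modifiers to script subtags, or treat the "C" / "POSIX" locales specially.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: "en-US" -> "en_US".
// Only swaps separators; the result is not guaranteed to name an
// installed POSIX locale.
std::string bcp47_to_posix_like(std::string tag) {
    std::replace(tag.begin(), tag.end(), '-', '_');
    return tag;
}

// Hypothetical helper: "en_US.UTF-8" -> "en-US".
// Drops any ".codeset" and "@modifier" suffix, then swaps separators.
// The modifier is discarded rather than mapped to a subtag.
std::string posix_to_bcp47_like(std::string id) {
    const auto cut = id.find_first_of(".@");
    if (cut != std::string::npos) {
        id.erase(cut);
    }
    std::replace(id.begin(), id.end(), '_', '-');
    return id;
}
```

Anything beyond these simple cases needs the data and rules from UTS #35, which is one argument for a proper locale identifier type rather than string manipulation.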
Steven
--
Steven R. Loomis
Code Hive Tx, LLC
https://codehivetx.us

> On Mar 9, 2022, at 4:01 AM, Corentin Jabot via SG16 <sg16_at_[hidden]> wrote:
>
> A few random comments that may be useful
> A relatively new project, icu4x - aims to provide similar features to ICU while rethinking some of the fundamental design decisions (for example, ICU has a UTF16-first interface, which isn't optimal) https://github.com/unicode-org/icu4x
> Currency/Number/Date/Time formatting in ICU/Unicode/CLDR are significantly different from what std::locale can offer, and would deserve further consideration in that is is not "already provided by chrono"
> In addition to ICU, ecma402 is worth considering - https://tc39.es/ecma402/ and https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl
> Case mapping seems to be missing from the document, and may be worth considering
> A lot of algorithms (casing, clusterization) have both locale dependant, and locale independent algorithms
> A locale identifier object is certainly a prerequisite for further unicode locale work - https://unicode.org/reports/tr35/tr35.html#Unicode_locale_identifier https://unicode-org.github.io/icu4x-docs/doc/icu_locid/struct.Locale.html
>
> Regards,
> Corentin
>
> On Tue, Mar 8, 2022 at 10:07 PM Tom Honermann via SG16 <sg16_at_[hidden]> wrote:
> This is your friendly reminder that this telecon is taking place tomorrow.
>
> If you haven't yet, please review the linked Google Doc <https://docs.google.com/document/d/1f-CLhYZIf_L0q1QBEqe2sVHyAofGx8Akt_xJKDGhcgA/edit?usp=sharing>. Please add comments; especially with regard to any feature sets that you would like to discuss during the telecon.
>
> Tom.
>
> On 3/7/22 12:23 AM, Tom Honermann via SG16 wrote:
>> SG16 will hold a telecon on Wednesday, March 9th at 19:30 UTC (timezone conversion <https://www.timeanddate.com/worldclock/converter.html?iso=20220309T193000&p1=1440&p2=tz_pst&p3=tz_mst&p4=tz_cst&p5=tz_est&p6=tz_cet>).
>>
>> The agenda is:
>>
>> ICU features to consider for C++26
>> During our last telecon <https://github.com/sg16-unicode/sg16-meetings#february-23rd-2022>, Jens suggested the possibility of a roadmap towards providing support for the ICU feature set in the C++ standard. To that end, I put together a Google Doc <https://docs.google.com/document/d/1f-CLhYZIf_L0q1QBEqe2sVHyAofGx8Akt_xJKDGhcgA/edit?usp=sharing> that lists categories of features that ICU provides. The doc contains a table in which I have pre-populated indications of which features I think might be reasonable for standardization in C++26. This is intended to be less of a roadmap and more a list of features for which papers are encouraged and that we would like to spend time on. Please feel free to edit the doc to add comments or challenge my yes, no, and maybe indications. The feature list is derived from the documented ICU Module List <https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/#Module>. It may be useful to peruse the ICU Feature Comparison Chart <https://icu.unicode.org/charts/comparison> for additional features to add (I haven't done so yet due to time limitations). It is likely that I have misinterpreted what is provided by some of the feature categories so if you see something that surprises you, please speak up!
>>
>> Tom.
>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
Received on 2022-03-09 18:17:08