ISOCPP sg16 List: Re: Agenda for the 2022-03-09 SG16 telecon

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 9 Mar 2022 13:46:59 -0500

Thank you, Steven!

On 3/9/22 1:17 PM, Steven R. Loomis wrote:
> Meta comment: Referencing the ICU module or class would be helpful
> rather than just the feature name.
I agree that would be helpful, but there doesn't seem to be an obvious
location to link to within the Doxygen generated docs. And,
unfortunately, the links on https://icu.unicode.org/charts/comparison are
all broken at the moment. Can you provide an example of where to link to?
>
> * Definitely consider icu4x and ecma402’s API, with the benefit of
> many additional years of trial and error. These APIs are in some
> places purposefully a subset of ICU.
Yes. As we dive into feature categories, we'll definitely want to do
some compare and contrast work among these.
>
> * Formatting: As Corentin said, these should probably be looked at in
> detail in comparison with std::locale / chrono.
>
> * Locale identifier. I think the feature set here should prioritize
> BCP47 (Unicode Locale Identifier but with hyphen separators), and not
> POSIX ids. So en-US, not en_US, but provide ways to convert to other
> formats.
A bit of a tangent here, and I'll be showing my ignorance, but given a
std::locale object, I would assume it is not generally possible to
correlate it with a locale identifier. However, I would imagine that an
OS locale (as presented by $LANG in POSIX environments or whatever
interface is used on Windows) can be mapped to a locale identifier. Does
that sound right?
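
For the POSIX case, such a mapping might look roughly like the minimal
sketch below. The function name posix_to_bcp47 is hypothetical (it is not
an ICU or standard library API), the codeset and modifier suffixes are
simply dropped rather than mapped, and mapping "C"/"POSIX" to the
undetermined tag "und" is an assumption made for illustration only.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of ICU or the C++ standard library.
// Converts a POSIX-style locale name such as "en_US.UTF-8" into a
// BCP 47-style tag such as "en-US".
std::string posix_to_bcp47(const std::string& posix) {
    // Drop the codeset (".UTF-8") and modifier ("@euro") suffixes, if any.
    // find_first_of returns npos when neither is present, which keeps
    // the whole string.
    std::string tag = posix.substr(0, posix.find_first_of(".@"));

    // Assumption: treat the "C"/"POSIX" locales (and an empty name) as
    // the BCP 47 undetermined language "und".
    if (tag.empty() || tag == "C" || tag == "POSIX") {
        return "und";
    }

    // POSIX separates language and territory with '_'; BCP 47 uses '-'.
    for (char& c : tag) {
        if (c == '_') {
            c = '-';
        }
    }
    return tag;
}
```

With this sketch, posix_to_bcp47("en_US.UTF-8") yields "en-US". A real
implementation would also need to validate subtags and decide what to do
with modifiers.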
>
> * Maps: these are mostly for ICU’s internal use, though could be
> useful for other users’ data.
>
> * “Sets of Unicode Code Points and Strings”: This is actually used in
> a number of processes including implementing other Unicode features.
> I’m assuming this refers to UnicodeSet. This is analogous to the
> sets available in Perl Compatible Regular Expressions, allowing the
> following operations:
> • Does a given string contain any characters matching “[:Deva:]”
> (i.e. Devanagari)? Does it ONLY Consist of “[:Deva:]” ?
> • If the Adangme language has an expected repertoire of "[a á ã b d
> e é ɛ {ɛ\u0301} {ɛ\u0303} f g h i í ĩ j k l m n o ɔ {ɔ\u0301}
> {ɔ\u0303} p s t u v w y z]” , how does the set of a certain string
> relate to that repertoire?
> • Related, a certain font may have a certain repertoire, and this
> can be compared to scripts or certain languages.
>
> • I found this snippet that relates POSIX locale categories with
> UnicodeSet. (It’s used for exporting CLDR data to POSIX format, but
> might demonstrate some use cases)
>
> { "upper", "[:Uppercase:]" },
> { "lower", "[:Lowercase:]" },
> { "alpha", "[[:Alphabetic:]-[[:Uppercase:][:Lowercase:]]]" },
> { "space", "[:Whitespace:]" },
> { "cntrl", "[:Control:]" },
> { "graph", "[^[:Whitespace:][:Control:][:Format:][:Surrogate:][:Unassigned:]]" },
> { "print", "[^[:Control:][:Format:][:Surrogate:][:Unassigned:]]" },
> { "punct", "[:Punctuation:]" },
> { "digit", "[0-9]" },
> { "xdigit", "[0-9 a-f A-F]" },
> { "blank", "[[:Whitespace:]-[\\u000A-\\u000D \\u0085
> [:Line_Separator:][:Paragraph_Separator:]]]" } };
>
> In any event, this could be considered as an alternate way to
> construct a std::set<char32_t>
>
> Note {ɛ\u0301} represents ɛ́ (ɛ + ´)
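
To illustrate the "contains any" / "contains only" questions above with
a plain std::set<char32_t>, here is a minimal sketch. The names
CodePointSet, make_range, contains_any, and contains_only are
hypothetical and are not ICU's UnicodeSet API. The Devanagari block
range U+0900..U+097F is used as a rough stand-in for "[:Deva:]"; the
actual script property is not identical to that block. Multi-code-point
strings such as {ɛ\u0301} are not modeled.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical alias: a set of Unicode code points.
using CodePointSet = std::set<char32_t>;

// Build a set from an inclusive code point range.
CodePointSet make_range(char32_t first, char32_t last) {
    CodePointSet s;
    for (char32_t c = first; c <= last; ++c) {
        s.insert(c);
    }
    return s;
}

// True if at least one code point of text is in the set.
bool contains_any(const CodePointSet& set, const std::u32string& text) {
    for (char32_t c : text) {
        if (set.count(c) != 0) {
            return true;
        }
    }
    return false;
}

// True if every code point of text is in the set
// (vacuously true for empty text).
bool contains_only(const CodePointSet& set, const std::u32string& text) {
    for (char32_t c : text) {
        if (set.count(c) == 0) {
            return false;
        }
    }
    return true;
}
```

For example, with deva = make_range(0x0900, 0x097F), the text
U"abc \u0915" satisfies contains_any but not contains_only.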
>
> * Unicode Text Compression:
> SCSU has been stabilized: https://www.unicode.org/reports/tr6/
> > SCSU defines a compact encoding, which is sometimes useful.
> > However, Unicode text is much more commonly stored and transmitted in
> > UTF-8 which is less compact (except for ASCII), much simpler, and does
> > not present any security issues. For longer texts, general-purpose
> > compression is effective and common. Therefore, there is no need to
> > develop this report any further.
>
> BOCU-1 was withdrawn as a UTS -> https://www.unicode.org/reports/tr40/
>
> * Index Characters - motivation here is for UI display of tabbed
> entries, such as a personal address book. It’s related to collation.
>
> * Arabic shaping - This has to do with converting Arabic text into
> preformatted form. It may be too specialized for a general library
> operation.
>
> * Complex Text Layout - removed from ICU, “use Harfbuzz instead”
>
> * Paragraph Layout - Used by some. Depends on Harfbuzz and on ICU
> services.

Thank you for all of those notes!

Tom.

>
> Steven
>
> --
> Steven R. Loomis
> Code Hive Tx, LLC
> https://codehivetx.us
>
>
>
>> On Mar 9, 2022, at 4:01 AM, Corentin Jabot via SG16
>> <sg16_at_[hidden]> wrote:
>>
>> A few random comments that may be useful
>>
>> * A relatively new project, icu4x - aims to provide similar
>> features to ICU while rethinking some of the fundamental design
>> decisions (for example, ICU has a UTF16-first interface, which
>> isn't optimal) https://github.com/unicode-org/icu4x
>> * Currency/Number/Date/Time formatting in ICU/Unicode/CLDR are
>> significantly different from what std::locale can offer, and
>> would deserve further consideration in that it is not "already
>> provided by chrono"
>> * In addition to ICU, ecma402 is worth considering -
>> https://tc39.es/ecma402/ and
>> https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl
>>
>> * Case mapping seems to be missing from the document, and may be
>> worth considering
>> * A lot of algorithms (casing, clusterization) have both
>> locale-dependent and locale-independent variants
>> * A locale identifier object is certainly a prerequisite for
>> further unicode locale work -
>> https://unicode.org/reports/tr35/tr35.html#Unicode_locale_identifier
>> https://unicode-org.github.io/icu4x-docs/doc/icu_locid/struct.Locale.html
>>
>>
>>
>> Regards,
>> Corentin
>>
>> On Tue, Mar 8, 2022 at 10:07 PM Tom Honermann via SG16
>> <sg16_at_[hidden]> wrote:
>>
>> This is your friendly reminder that this telecon is taking place
>> tomorrow.
>>
>> If you haven't yet, please review the linked Google Doc
>> <https://docs.google.com/document/d/1f-CLhYZIf_L0q1QBEqe2sVHyAofGx8Akt_xJKDGhcgA/edit?usp=sharing>.
>> Please add comments; especially with regard to any feature sets
>> that you would like to discuss during the telecon.
>>
>> Tom.
>>
>> On 3/7/22 12:23 AM, Tom Honermann via SG16 wrote:
>>>
>>> SG16 will hold a telecon on Wednesday, March 9th at 19:30 UTC
>>> (timezone conversion
>>> <https://www.timeanddate.com/worldclock/converter.html?iso=20220309T193000&p1=1440&p2=tz_pst&p3=tz_mst&p4=tz_cst&p5=tz_est&p6=tz_cet>).
>>>
>>> The agenda is:
>>>
>>> * ICU features to consider for C++26
>>>
>>> During our last telecon
>>> <https://github.com/sg16-unicode/sg16-meetings#february-23rd-2022>,
>>> Jens suggested the possibility of a roadmap towards providing
>>> support for the ICU feature set in the C++ standard. To that
>>> end, I put together a Google Doc
>>> <https://docs.google.com/document/d/1f-CLhYZIf_L0q1QBEqe2sVHyAofGx8Akt_xJKDGhcgA/edit?usp=sharing>
>>> that lists categories of features that ICU provides. The doc
>>> contains a table in which I have pre-populated indications of
>>> which features I think might be reasonable for standardization
>>> in C++26. This is intended to be less of a roadmap and more a
>>> list of features for which papers are encouraged and that we
>>> would like to spend time on. Please feel free to edit the doc to
>>> add comments or challenge my yes, no, and maybe indications. The
>>> feature list is derived from the documented ICU Module List
>>> <https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/#Module>.
>>> It may be useful to peruse the ICU Feature Comparison Chart
>>> <https://icu.unicode.org/charts/comparison> for additional
>>> features to add (I haven't done so yet due to time limitations).
>>> It is likely that I have misinterpreted what is provided by some
>>> of the feature categories so if you see something that surprises
>>> you, please speak up!
>>>
>>> Tom.
>>>
>>>
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>

Received on 2022-03-09 18:47:05