Re: Unicode event: Overview of Internationalization and Unicode Projects

From: Tom Honermann <tom_at_[hidden]>
Date: Sat, 1 Oct 2022 11:09:16 -0400
A summary and notes from the event follow. Please note that I did not
attempt to capture everything.

The recorded event will be made available at
https://www.youtube.com/channel/UCQNrSepJnz8BjWT7lrKH9Tw. I will send an
update when/if I become aware that the recorded event is available.

Several themes were echoed across the presentations.

The first is that the Unicode Consortium (UC) manages three main
projects. The Unicode Standard
<https://www.unicode.org/versions/latest/> provides a specification of
characters, scripts, encodings, character properties, algorithms, and
more. The CLDR <https://cldr.unicode.org/> provides a specification of
languages, locales, regions, and cultural conventions. ICU
<https://icu.unicode.org/>, and now ICU4X <https://icu4x.unicode.org>,
provide portable libraries that enable applications to provide support
for internationalization, localization, and much more. Each project
builds on top of the previous ones.

The second theme is that the UC invites involvement. Each of the project
presentations explained how to get involved and what opportunities are
available. It was noted that opportunities are not limited to those with
deep language experience! There are opportunities for translators,
linguists, researchers, PMs, technical writers, UI designers, and of
course, programmers!


  An Introduction to Internationalization (i18n) - Addison Phillips,
  Internationalization Engineer

Addison's presentation, as its title suggests, provided an introduction
to topics of internationalization. Included was discussion of
differences between written languages, graphs of language use, examples
of collation and other cultural differences, and a definition of
internationalization and localization. Message formatting was also
discussed. The presentation ended with a set of commitments programmers
are encouraged to adhere to when writing software. Those are:

 1. Use i18n best practices.
 2. Use Unicode.
 3. Use locales.
 4. Use resources (e.g., language resource bundles).
 5. Use message formatting APIs.


  Overview of the Unicode Consortium: History and Future - Mark Davis,
  Cofounder and President

Mark provided an overview of the three main UC projects listed above,
the organization of the UC, a timeline of significant events that
contributed to the development of Unicode and the current status of the
UC, and ongoing work we can expect to see more of in the future.

Some of the mentioned timeline events included:

 1. The invention of writing systems ~3400 BC.
 2. The standardization of ASCII in 1963.
 3. The standardization of Unicode in 1991.
 4. The introduction of Unicode character properties in 1995.
 5. The CLDR in 2003.
 6. The Adopt-a-Character program in 2015, along with funding for
    digitally disadvantaged languages.
 7. The inclusion of ICU as a UC project in 2016.

The UC is organized around the UC projects; the following committees and
sub-committees were discussed.

  * The UTC committee and its subcommittees:
      o Scripts
      o CJK and Unihan
      o Emoji
      o Properties and Algorithms
          + The Source Code Ad Hoc Group (SCWG)
      o Editorial
  * The CLDR committee and its subcommittees:
      o Message Formatting
      o Keyboards
      o Person names
  * The ICU committee and its subcommittees:
      o ICU4X

Future work we can expect to see coming out of the UC includes:

  * Work related to keyboards.
  * Specifications for formatting of person names.
  * Enhanced message formatting APIs.
  * Enhanced support for measurement units.
  * Enhanced grammatical support for dates, times, measurement units,
    and more.
  * Other unmentioned work.


  Scripts and Character Encoding - Deborah Anderson, Chair of the Script
  Ad Hoc Committee

Deborah explained the role of the UTC subcommittees: to study and review
proposals and to make recommendations to the UTC. Much of the
presentation focused on the work of the Script Ad Hoc committee:

  * The committee is composed of 10-15 experts.
  * It can take several years to review and iterate on a new script
    proposal to make it ready for standardization. For example, it took
    six years from the date of a first proposal to produce the final
    proposal for support of the Hanifi Rohingya script.
  * Support for fonts and locales comes after adoption of a new script
    and might require several more years to be made widely available.

Deborah highlighted the importance of work to adopt and improve script
support:

  * Script standardization makes it possible to digitally preserve
    historic texts.
  * Lack of script support limits the ability of that script's users to
    participate in the digital revolution.


  The Common Locale Data Repository (CLDR) - Mark Davis and Annemarie
  Apple, Chair and Vice Chair of the CLDR Committee

Mark and Annemarie explained that CLDR data is provided with all major
operating systems.

Example capabilities that the CLDR provides were demonstrated. These
included:

  * Query normalization; searches for case-insensitive or
    accent-insensitive matches.
  * Number and duration formatting.
  * Date formatting.
  * Unit formatting with conversion to units appropriate for a given locale.
  * Relative time formatting.
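
As a rough illustration of the number and date formatting capabilities
listed above, here is a minimal TypeScript sketch. It was not shown at
the event. It assumes the standard ECMAScript Intl API, which common
JavaScript engines implement on top of ICU and CLDR data; the helper
names formatNumber and formatDate are made up for this example.

```typescript
// Illustrative only: formatNumber and formatDate are hypothetical helpers,
// not part of the CLDR or ICU. They wrap the standard ECMAScript Intl API.
const amount = 1234567.891;
// 28 September 2022, the date of the event (months are zero-based).
const eventDate = new Date(Date.UTC(2022, 8, 28));

function formatNumber(locale: string, value: number): string {
  return new Intl.NumberFormat(locale).format(value);
}

function formatDate(locale: string, date: Date): string {
  // Pin the time zone so the result does not depend on the host machine.
  return new Intl.DateTimeFormat(locale, { timeZone: "UTC" }).format(date);
}

for (const locale of ["en-US", "de-DE"]) {
  console.log(locale, formatNumber(locale, amount), formatDate(locale, eventDate));
}
// en-US 1,234,567.891 9/28/2022
// de-DE 1.234.567,891 28.9.2022
```

The same value yields different grouping separators, decimal separators,
and field orders purely as a function of the locale identifier; the
application code contains no locale-specific logic.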

A language survey tool is used to help build the CLDR data. The tool
enables language researchers to produce a consensus-driven specification
of cultural expectations for a given locale. Note that multiple locales
may use the same language, but have different expectations for how the
language is written. Examples of what the tool can be used to specify
include:

  * What characters are used.
  * How time durations are written.
  * What measurement units are used and, for gendered languages, their
    associated gender and how they compose.
  * Both positive and negative examples of cultural expectations.

The CLDR itself consists of data in a structured format, specifications
for how to use that data, and release overviews.

Language and locale support is characterized by how fully the CLDR
supports it. There are four categories:

  * Modern: 95 languages, 366 locales; suitable for full UI i18n.
  * Moderate: 6 languages, 11 locales; suitable for document content.
  * Basic: 29 languages, 43 locales; suitable for locale selection.
  * Other: 183 languages; in development; these correspond to digitally
    disadvantaged languages.

The CLDR exists to protect investment in written languages, prioritize
language support improvements, provide interoperability, and acknowledge
digitally disadvantaged languages.

There are opportunities to contribute to the CLDR for translators,
linguists, language researchers, project managers, tech writers, UI
designers, and programmers.


  International Components for Unicode (ICU) - Markus Scherer, Chair of
  ICU Committee

Markus provided a demonstration of locale-dependent collation using the
ICU online tool available at
https://icu4c-demos.unicode.org/icu-bin/collation.html. The
demonstration showed how the order in which a list of names is presented
changes based on locale selection.
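
The effect can be reproduced in a few lines. The following is a minimal
sketch, not the demo that was shown: it assumes the standard ECMAScript
Intl.Collator API (backed by ICU in common JavaScript engines), and the
sample names and the sortForLocale helper are made up for illustration.

```typescript
// Illustrative only: the names and sortForLocale are hypothetical.
const names = ["Zeller", "Ärlig", "Adams"];

function sortForLocale(locale: string, items: string[]): string[] {
  // Copy first so the caller's array is left untouched.
  return items.slice().sort(new Intl.Collator(locale).compare);
}

// German sorts "Ä" together with "A".
console.log(sortForLocale("de", names).join(", "));
// Adams, Ärlig, Zeller

// Swedish treats "Ä" as a separate letter that sorts after "Z".
console.log(sortForLocale("sv", names).join(", "));
// Adams, Zeller, Ärlig
```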

Major benefits of ICU include:

  * Stable APIs.
  * Rich services; ICU is intended to be an all-in-one i18n library.
  * High performance.
  * Support for many platforms.

ICU originated in the 1990s. As such, like all long-lived products, it
has acquired technical debt and its interfaces reflect the design
principles customary at the time.

ICU4X is intended to provide more modern interfaces. ICU4X is not
intended to replace ICU.


  Bringing Internationalization to More Programming Languages and
  Resource-Constrained Environments (ICU4X) - Shane Carr, Chair of ICU4X
  Subcommittee

Shane explained that there is a need to provide i18n support for more
programming languages, for smaller devices, and for client-side
frameworks, where ICU is not always a good fit.

ICU4X is written in Rust and designed to be lightweight, portable, and
secure. Benefits of these goals were described as:

  * Lightweight: A small binary size, low memory usage, and low CPU
    usage are needed for many applications. Locale data can be
    reduced to what is minimally required for a given application.
  * Portable: Programming language bindings are provided for C, C++,
    JavaScript, and TypeScript. Plugins can provide support for
    additional programming languages.
  * Secure: Rust provides memory safety guarantees, thereby preventing
    significant classes of programmer errors and security issues.

Some key decisions contributed to the ICU4X effort:

  * A new library is needed due to the large engineering effort that
    would be required to re-design and adapt ICU for the goals that
    ICU4X was created to pursue.
  * ICU4X allows locale data to be consumed from multiple providers.
    This allows it to be used as a polyfill solution; for example, it
    can consume locale data provided with a platform, but also consume
    locale data provided with an application to add support for locales
    not yet supported by the platform. Likewise, locale data can be
    constrained to a subset needed for a particular application.
  * The ICU4X data files are designed for forward and backward
    compatibility so that they can be shared across ICU4X versions. The
    data files can be upgraded at run-time without having to reload the
    library.
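
The multiple-provider idea can be sketched independently of ICU4X. The
following TypeScript is a conceptual illustration only; it does not use
or mirror the ICU4X API, and every name in it (LocaleData, DataProvider,
tableProvider, chainProviders) is hypothetical. It shows a lookup that
tries platform data first and falls back to data bundled with the
application.

```typescript
// Conceptual sketch only. None of these names come from ICU4X.
type LocaleData = { decimalSeparator: string };
type DataProvider = (locale: string) => LocaleData | undefined;

// Wrap a fixed table of locale data as a provider.
function tableProvider(table: { [locale: string]: LocaleData }): DataProvider {
  return (locale) =>
    Object.prototype.hasOwnProperty.call(table, locale) ? table[locale] : undefined;
}

// Try each provider in order and return the first hit.
function chainProviders(providers: DataProvider[]): DataProvider {
  return (locale) => {
    for (const provider of providers) {
      const data = provider(locale);
      if (data !== undefined) {
        return data;
      }
    }
    return undefined;
  };
}

// Assumption: the platform ships "en" and "de"; the application adds "sv".
const platformData = tableProvider({
  en: { decimalSeparator: "." },
  de: { decimalSeparator: "," },
});
const bundledData = tableProvider({
  sv: { decimalSeparator: "," },
});

const lookup = chainProviders([platformData, bundledData]);
console.log(lookup("en")); // served by the platform table
console.log(lookup("sv")); // served by the application's bundled table
console.log(lookup("xx")); // undefined: no provider has this locale
```

Because the application only depends on the chained lookup, data for a
locale can move from the bundled table to the platform table later
without any change to the calling code.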

Version 1.0 was just released.

An online demo that illustrates fixed decimal formatting, date and time
formatting, and word segmentation was shown.


  Q & A

Mark Davis participated in a Q & A session following the presentations.
At one point, he was asked what he is most proud of. He answered, "the
idea of Unicode" and explained that, before Unicode, the proliferation
of code pages produced a disaster with effects that can still be seen to
this day. A truth that we in SG16 know all too well!

Tom.

On 9/12/22 4:45 PM, Tom Honermann via SG16 wrote:
>
> The Unicode Consortium will be hosting an ~2 hour free online event on
> Wednesday, September 28th, 2022 at 16:30 UTC (timezone conversion
> <https://www.timeanddate.com/worldclock/converter.html?iso=20220928T163000&p1=1440&p2=tz_pdt&p3=tz_mdt&p4=tz_cdt&p5=tz_edt&p6=tz_cest>).
>
> The topics and speakers include:
>
> 1. An Introduction to Internationalization (i18n) - Addison Phillips,
> Internationalization Engineer
> 2. Overview of the Unicode Consortium: History and Future - Mark
> Davis, Cofounder and President
> 3. Scripts and Character Encoding - Deborah Anderson, Chair of the
> Script Ad Hoc Committee
> 4. The Common Locale Data Repository (CLDR) - Mark Davis and
> Annemarie Apple, Chair and Vice Chair of the CLDR Committee
> 5. International Components for Unicode (ICU) - Markus Scherer, Chair
> of ICU Committee
> 6. Bringing Internationalization to More Programming Languages and
> Resource-Constrained Environments (ICU4X) - Shane Carr, Chair of
> ICU4X Subcommittee
>
> Additional details and registration information are available here
> <https://us06web.zoom.us/webinar/register/WN_ViDf3YFyS7WiAXnHYp88kw>.
>
> Tom.
>
>

Received on 2022-10-01 15:09:19