sg16: Re: [SG16] New Unicode Working Group: Message Formatting

From: Tom Honermann <tom_at_[hidden]>
Date: Fri, 24 Jan 2020 00:32:28 -0500

On 1/14/20 5:56 PM, Steven R. Loomis via SG16 wrote:
>
>> On Jan 13, 2020, at 5:46 PM, Corentin Jabot
>> <corentinjabot_at_[hidden]> wrote:
>>
>
>>
>> C++20 (which is on course to be approved next month) will provide a
>> new feature named std::format, derived from the popular fmt
>> library (https://fmt.dev/), which is itself heavily inspired by, and
>> shares the syntax of, Python's format function.
>>
>> std::format("Hello {}", "World") -> "Hello World";
>> std::format("{2} + {1} = {0}", 3, 1.0, 2) -> "2 + 1 = 3";
>>
>> Of interest to Unicode and localization:
>>
>> * For now this function is mostly byte based, in that it is
>> encoding agnostic.
>> * However we made the interesting decision that padding is based on
>> display width (which is fuzzily specified), as we realized the
>> primary use case for padding was the creation of console interface
>>
>
> Display width is complex. Unicode’s East Asian Width is often used for
> character width, but there’s more to it than that (see
> <https://www.unicode.org/reports/tr11/#Scope>). See for example
> https://github.com/nodejs/node/blob/b0a762157793b0d9143eaa7c270da91932f2a64f/src/node_i18n.cc#L729 in
> Node.js, which goes beyond wcwidth and similar functions that do not
> reflect many terminal emulators’ behavior.

Thank you for sharing this, Steven.

The std::format facility standardized for C++20 via P0645
<https://wg21.link/p0645>, as it is expected to be modified by
P1868 <https://wg21.link/P1868>, specifies display width in terms of a
hard-coded set of Unicode code points. See the wording section
<http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1868r1.html#wording>.
The set of code points and associated display widths were taken from
Markus Kuhn's wcswidth() implementation
<https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c>. We know that the
current list of code points is incomplete. For example, no code points
are assigned a width of 0, and handling of outliers like U+FDFD {ARABIC
LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM} is absent.
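
To illustrate what a hard-coded set of code points means in practice,
here is a minimal sketch of such a width estimate. The names
code_point_range, wide_ranges, and estimated_width are hypothetical, and
the range table is my recollection of the P1868 wording, so please check
it against the paper before relying on it. The string overload simply
sums per code point; it does not do the extended grapheme cluster
segmentation that the proposed wording calls for.

```cpp
// Sketch only: the table below is assumed to match the ranges listed in
// the P1868 wording; verify against the paper before use.
struct code_point_range {
    char32_t first;
    char32_t last;
};

// Code point ranges given an estimated display width of 2.
constexpr code_point_range wide_ranges[] = {
    {0x1100, 0x115F},   {0x2329, 0x232A},   {0x2E80, 0x303E},
    {0x3040, 0xA4CF},   {0xAC00, 0xD7A3},   {0xF900, 0xFAFF},
    {0xFE10, 0xFE19},   {0xFE30, 0xFE6F},   {0xFF00, 0xFF60},
    {0xFFE0, 0xFFE6},   {0x1F300, 0x1F64F}, {0x1F900, 0x1F9FF},
    {0x20000, 0x2FFFD}, {0x30000, 0x3FFFD},
};

// Estimated width of one code point: 2 if it falls in a listed range,
// otherwise 1. Note that nothing ever maps to 0, so combining marks and
// other zero-width code points are over-counted.
constexpr int estimated_width(char32_t cp) {
    for (const auto& r : wide_ranges) {
        if (cp >= r.first && cp <= r.last) {
            return 2;
        }
    }
    return 1;
}

// Simplified string estimate over null-terminated UTF-32 text: sums the
// width of every code point, with no grapheme cluster segmentation.
constexpr int estimated_width(const char32_t* s) {
    int width = 0;
    for (; *s != U'\0'; ++s) {
        width += estimated_width(*s);
    }
    return width;
}
```

With this sketch, estimated_width(U"a\u4E16") evaluates to 3 (one narrow
plus one wide code point), while both U+0301 {COMBINING ACUTE ACCENT} and
U+FDFD come out as 1, which shows the two gaps mentioned above.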

I'm not sufficiently educated to evaluate the relative merits of Markus
Kuhn's implementation and the implementation in your Node.js link
above. If you have more information to share on the subject, it would
be appreciated.

Wouldn't it be nice if Unicode were to offer an
Extended-Grapheme-Cluster-width-in-monospace-font algorithm? :)

Tom.

Received on 2020-01-23 23:44:14