Re: [SG16-Unicode] [isocpp-lib] New issue: Are std::format field widths code units, code points, or something else?

From: Victor Zverovich <victor.zverovich_at_[hidden]>
Date: Sat, 7 Sep 2019 19:44:46 -0700
> Is field width measured in code units, code points, or something else?

I think the main consideration here is that width should be
locale-independent by default for consistency with the rest of
std::format's design. If we can say that width is measured in grapheme
clusters or code points based on the execution encoding (or whatever the
standardese term is) without querying the locale, then I suggest doing so. I
have a slight preference for grapheme clusters, since those correspond to
user-perceived characters, but I only have implementation experience with
code points (this is what both the fmt library and Python do).

Cheers,
Victor

On Sat, Sep 7, 2019 at 5:13 PM Tom Honermann via Lib <lib_at_[hidden]>
wrote:

> [format.string.std]p7 <http://eel.is/c++draft/format#string.std-7> states:
>
> The *positive-integer* in *width* is a decimal integer defining the
> minimum field width. If *width* is not specified, there is no minimum
> field width, and the field width is determined based on the content of the
> field.
>
> Is field width measured in code units, code points, or something else?
>
> Consider the following example assuming a UTF-8 locale:
>
> std::format("{}", "\xC3\x81");     // U+00C1 { LATIN CAPITAL LETTER A WITH ACUTE }
> std::format("{}", "\x41\xCC\x81"); // U+0041 U+0301 { LATIN CAPITAL LETTER A } { COMBINING ACUTE ACCENT }
>
> In both cases, the arguments encode the same user-perceived character
> (Á). The first uses two UTF-8 code units to encode a single code point
> that represents a single glyph using a composed Unicode normalization
> form. The second uses three code units to encode two code points that
> represent the same glyph using a decomposed Unicode normalization form.
>
> How is the field width determined? If measured in code units, the first
> has a width of 2 and the second of 3. If measured in code points, the
> first has a width of 1 and the second of 2. If measured in grapheme
> clusters, both have a width of 1. Is the determination locale dependent?
>
> *Proposed resolution:*
>
> Field widths are measured in code units and are not locale dependent.
> Modify [format.string.std]p7 <http://eel.is/c++draft/format#string.std-7>
> as follows:
>
> The *positive-integer* in *width* is a decimal integer defining the
> minimum field width. If *width* is not specified, there is no minimum
> field width, and the field width is determined based on the content of the
> field. *Field width is measured in code units. Each byte of a multibyte
> character contributes to the field width.*
>
> (*code unit* is not formally defined in the standard. Most uses occur in
> UTF-8 and UTF-16 specific contexts, but [lex.ext]p5
> <http://eel.is/c++draft/lex.ext#5> uses it in an encoding agnostic
> context.)
>
> Tom.
> _______________________________________________
> Lib mailing list
> Lib_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib
> Link to this post: http://lists.isocpp.org/lib/2019/09/13440.php
>

Received on 2019-09-08 04:45:00