Re: [SG16-Unicode] [isocpp-lib] New issue: Are std::format field widths code units, code points, or something else?

From: Tom Honermann <tom_at_[hidden]>
Date: Sat, 7 Sep 2019 23:30:48 -0400
On 9/7/19 10:44 PM, Victor Zverovich wrote:
> > Is field width measured in code units, code points, or something else?
>
> I think the main consideration here is that width should be
> locale-independent by default for consistency with the rest of
> std::format's design.
I agree with that goal, but...
> If we can say that width is measured in grapheme clusters or code
> points based on the execution encoding (or whatever the standardese
> term) without querying the locale then I suggest doing so.
I don't know how to do that. As I noted in my response to Zach, if code
units aren't used, then behavior would have to differ between LANG=C and
LANG=C.UTF-8.
> I have slight preference for grapheme clusters since those correspond
> to user-perceived characters, but only have implementation experience
> with code points (this is what both the fmt library and Python do).

I would definitely vote for EGCs over code points. I think code points
are probably the worst of the options, since they make the results
dependent on Unicode normalization form.

Tom.

>
> Cheers,
> Victor
>
> On Sat, Sep 7, 2019 at 5:13 PM Tom Honermann via Lib
> <lib_at_[hidden] <mailto:lib_at_[hidden]>> wrote:
>
> [format.string.std]p7 <http://eel.is/c++draft/format#string.std-7>
> states:
>
>> The /positive-integer/ in /width/ is a decimal integer defining
>> the minimum field width. If /width/ is not specified, there is
>> no minimum field width, and the field width is determined based
>> on the content of the field.
>>
> Is field width measured in code units, code points, or something else?
>
> Consider the following example assuming a UTF-8 locale:
>
> std::format("{}", "\xC3\x81");     // U+00C1 { LATIN CAPITAL LETTER
> A WITH ACUTE }
> std::format("{}", "\x41\xCC\x81"); // U+0041 U+0301 { LATIN
> CAPITAL LETTER A } { COMBINING ACUTE ACCENT }
>
> In both cases, the arguments encode the same user-perceived
> character (Á). The first uses two UTF-8 code units to encode a
> single code point that represents a single glyph using a composed
> Unicode normalization form. The second uses three code units to
> encode two code points that represent the same glyph using a
> decomposed Unicode normalization form.
>
> How is the field width determined? If measured in code units, the
> first has a width of 2 and the second of 3. If measured in code
> points, the first has a width of 1 and the second of 2. If
> measured in grapheme clusters, both have a width of 1. Is the
> determination locale dependent?
>
> *Proposed resolution:*
>
> Field widths are measured in code units and are not locale
> dependent. Modify [format.string.std]p7
> <http://eel.is/c++draft/format#string.std-7> as follows:
>
>> The /positive-integer/ in /width/ is a decimal integer defining
>> the minimum field width. If /width/ is not specified, there is
>> no minimum field width, and the field width is determined based
>> on the content of the field. *Field width is measured in code
>> units. Each byte of a multibyte character contributes to the
>> field width.*
>>
> (/code unit/ is not formally defined in the standard. Most uses
> occur in UTF-8 and UTF-16 specific contexts, but [lex.ext]p5
> <http://eel.is/c++draft/lex.ext#5> uses it in an encoding agnostic
> context.)
>
> Tom.
>
> _______________________________________________
> Lib mailing list
> Lib_at_[hidden] <mailto:Lib_at_[hidden]>
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib
> Link to this post: http://lists.isocpp.org/lib/2019/09/13440.php
>


Received on 2019-09-08 05:30:54