Re: [SG16-Unicode] [isocpp-lib] New issue: Are std::format field widths code units, code points, or something else?

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 11 Sep 2019 16:06:39 -0400
On 9/11/19 3:32 PM, Marshall Clow wrote:
> On Sat, Sep 7, 2019 at 5:13 PM Tom Honermann via Lib
> <lib_at_[hidden]> wrote:
>
> [format.string.std]p7 <http://eel.is/c++draft/format#string.std-7>
> states:
>
>> The /positive-integer/ in /width/ is a decimal integer defining
>> the minimum field width. If /width/ is not specified, there is
>> no minimum field width, and the field width is determined based
>> on the content of the field.
>>
> Is field width measured in code units, code points, or something else?
>
> Consider the following example assuming a UTF-8 locale:
>
> // U+00C1 { LATIN CAPITAL LETTER A WITH ACUTE }
> std::format("{}", "\xC3\x81");
> // U+0041 U+0301 { LATIN CAPITAL LETTER A } { COMBINING ACUTE ACCENT }
> std::format("{}", "\x41\xCC\x81");
>
> In both cases, the arguments encode the same user-perceived
> character (Á). The first uses two UTF-8 code units to encode a
> single code point that represents a single glyph using a composed
> Unicode normalization form. The second uses three code units to
> encode two code points that represent the same glyph using a
> decomposed Unicode normalization form.
>
> How is the field width determined? If measured in code units, the
> first has a width of 2 and the second of 3. If measured in code
> points, the first has a width of 1 and the second of 2. If
> measured in grapheme clusters, both have a width of 1. Is the
> determination locale dependent?
>
>
>
> (Coming late to the party)
> Let's ask a different question.
>
> std::string s = "/* some content */";
> std::ostringstream oss;
> oss << std::setw(22) << s;
> std::string result1 = oss.str();
> std::string result2 = std::format("{:22}", s);
>
> What can we say about the contents of "result1" and "result2"?
> Are they the same? Does it matter what the contents of `s` are?

Excellent questions.

I really want them to be the same (at least by default; additional
opt-in support for locale/encoding-sensitive alignment strikes me as
potentially reasonable, assuming identification of compelling use cases).
I don't think the contents of `s` should matter (without additional opt-in).
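
To make the numbers in the quoted example concrete, here is a minimal
sketch, assuming well-formed UTF-8 input. The helper names
(count_code_units, count_code_points, pad_with_setw) are hypothetical and
exist only for illustration. Grapheme cluster counting is not attempted,
since it needs Unicode segmentation data. The last helper shows what the
iostream side of Marshall's example does today: std::setw pads based on
the number of chars in the string, i.e. code units.

```cpp
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

// Number of UTF-8 code units: simply the number of chars in the string.
std::size_t count_code_units(const std::string& s) {
    return s.size();
}

// Number of code points, ASSUMING well-formed UTF-8: count every byte
// that is not a continuation byte (continuation bytes are 10xxxxxx).
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// Pad via std::setw with the default fill (space) and default alignment
// (padding inserted on the left), as in the ostringstream example.
std::string pad_with_setw(const std::string& s, int width) {
    std::ostringstream oss;
    oss << std::setw(width) << s;
    return oss.str();
}
```

With these helpers, "\xC3\x81" measures 2 code units and 1 code point,
while "\x41\xCC\x81" measures 3 code units and 2 code points. With a
width of 4, pad_with_setw inserts two spaces before the first string but
only one before the second, even though both encode the same
user-perceived character.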

Tom.

>
> -- Marshall



Received on 2019-09-11 22:06:42