sg16: Re: [SG16-Unicode] [isocpp-lib] New issue: Are std::format field widths code units, code points, or something else?

From: Corentin <corentin.jabot_at_[hidden]>
Date: Sun, 8 Sep 2019 08:08:25 +0200

On Sun, Sep 8, 2019, 5:30 AM Tom Honermann via Lib <lib_at_[hidden]>
wrote:

> On 9/7/19 10:44 PM, Victor Zverovich wrote:
>
> > Is field width measured in code units, code points, or something else?
>
> I think the main consideration here is that width should be
> locale-independent by default for consistency with the rest of
> std::format's design.
>
> I agree with that goal, but...
>
> If we can say that width is measured in grapheme clusters or code points
> based on the execution encoding (or whatever the standardese term) without
> querying the locale then I suggest doing so.
>
> I don't know how to do that. From my response to Zach, if code units
> aren't used, then behavior should be different for LANG=C vs LANG=C.UTF-8.
>
> I have a slight preference for grapheme clusters since those correspond to
> user-perceived characters, but only have implementation experience with
> code points (this is what both the fmt library and Python do).
>
> I would definitely vote for EGCs over code points. I think code points
> are probably the worst of the options since it makes the results dependent
> on Unicode normalization form.
>

I disagree. Code units are the worst option. For me, anything involving code
units is a big red flag. I agree that EGCs are the best option. They don't
drag in the locale, though they might be a bit involved for implementers in
C++20.
I think specifying EGCs for Unicode text and code points otherwise
wouldn't be too difficult. Implementation might be a bit challenging on
some platforms in the C++20 time frame, but they could fall back to code
points in the meantime. Not perfect, but I think we need a good long-term
solution rather than a bad short-term one.

> Tom.
>
>
> Cheers,
> Victor
>
> On Sat, Sep 7, 2019 at 5:13 PM Tom Honermann via Lib <lib_at_[hidden]>
> wrote:
>
>> [format.string.std]p7 <http://eel.is/c++draft/format#string.std-7>
>> states:
>>
>> The *positive-integer* in *width* is a decimal integer defining the
>> minimum field width. If *width* is not specified, there is no minimum
>> field width, and the field width is determined based on the content of the
>> field.
>>
>> Is field width measured in code units, code points, or something else?
>>
>> Consider the following example assuming a UTF-8 locale:
>>
>> // U+00C1 { LATIN CAPITAL LETTER A WITH ACUTE }
>> std::format("{}", "\xC3\x81");
>> // U+0041 U+0301 { LATIN CAPITAL LETTER A } { COMBINING ACUTE ACCENT }
>> std::format("{}", "\x41\xCC\x81");
>>
>> In both cases, the arguments encode the same user-perceived character
>> (Á). The first uses two UTF-8 code units to encode a single code point
>> that represents a single glyph using a composed Unicode normalization
>> form. The second uses three code units to encode two code points that
>> represent the same glyph using a decomposed Unicode normalization form.
>>
>> How is the field width determined? If measured in code units, the first
>> has a width of 2 and the second of 3. If measured in code points, the
>> first has a width of 1 and the second of 2. If measured in grapheme
>> clusters, both have a width of 1. Is the determination locale dependent?
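
For illustration only (this is not from the thread), the three counts above can be reproduced with a small sketch. The names `Widths`, `measure_utf8`, and `measure_demo` are hypothetical, the input is assumed to be well-formed UTF-8, and the cluster count is a deliberate simplification that only treats U+0300..U+036F (Combining Diacritical Marks) as extending the previous cluster. Real EGC segmentation follows the rules in UAX #29 and needs Unicode property tables.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Three candidate width measures for one UTF-8 string.
struct Widths {
    std::size_t code_units;       // bytes in the UTF-8 encoding
    std::size_t code_points;      // decoded Unicode scalar values
    std::size_t approx_clusters;  // simplified grapheme cluster count
};

// Assumes well-formed UTF-8; no validation is performed.
inline Widths measure_utf8(std::string_view s) {
    Widths w{s.size(), 0, 0};
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        // Sequence length and payload bits are given by the lead byte.
        std::size_t len = 1;
        char32_t cp = lead;
        if ((lead >> 5) == 0x6)       { len = 2; cp = lead & 0x1F; }
        else if ((lead >> 4) == 0xE)  { len = 3; cp = lead & 0x0F; }
        else if ((lead >> 3) == 0x1E) { len = 4; cp = lead & 0x07; }
        // Fold in six payload bits from each continuation byte.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        }
        ++w.code_points;
        // Simplification: only this one block of combining marks extends a cluster.
        const bool combining = cp >= 0x0300 && cp <= 0x036F;
        if (!combining || w.approx_clusters == 0) {
            ++w.approx_clusters;
        }
        i += len;
    }
    return w;
}

// Checks the two strings from the example above.
inline void measure_demo() {
    const Widths composed = measure_utf8("\xC3\x81");        // U+00C1
    const Widths decomposed = measure_utf8("\x41\xCC\x81");  // U+0041 U+0301
    assert(composed.code_units == 2 && decomposed.code_units == 3);
    assert(composed.code_points == 1 && decomposed.code_points == 2);
    assert(composed.approx_clusters == 1 && decomposed.approx_clusters == 1);
}
```

Only the last measure gives the same answer for both normalization forms.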
>>
>> *Proposed resolution:*
>>
>> Field widths are measured in code units and are not locale dependent.
>> Modify [format.string.std]p7 <http://eel.is/c++draft/format#string.std-7>
>> as follows:
>>
>> The *positive-integer* in *width* is a decimal integer defining the
>> minimum field width. If *width* is not specified, there is no minimum
>> field width, and the field width is determined based on the content of the
>> field. *Field width is measured in code units. Each byte of a
>> multibyte character contributes to the field width.*
>>
>> (*code unit* is not formally defined in the standard. Most uses occur
>> in UTF-8 and UTF-16 specific contexts, but [lex.ext]p5
>> <http://eel.is/c++draft/lex.ext#5> uses it in an encoding agnostic
>> context.)
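
As a rough sketch of what the proposed wording would imply (not taken from any implementation), padding to a minimum width measured in code units could look like the following. `pad_by_code_units` and `pad_demo` are hypothetical names, and the left alignment with a space fill is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Pads `field` on the right with spaces until it holds at least
// `min_width` code units (bytes). Longer fields are returned unchanged.
inline std::string pad_by_code_units(std::string_view field, std::size_t min_width) {
    std::string out(field);
    if (out.size() < min_width) {
        out.append(min_width - out.size(), ' ');
    }
    return out;
}

// With a minimum width of 4, the two encodings of the same
// user-perceived character receive different amounts of padding.
inline void pad_demo() {
    assert(pad_by_code_units("\xC3\x81", 4) == "\xC3\x81  ");          // 2 spaces
    assert(pad_by_code_units("\x41\xCC\x81", 4) == "\x41\xCC\x81 ");   // 1 space
}
```

Under this measure, the composed form would occupy three display columns and the decomposed form two on a typical UTF-8 terminal, even though both are padded to four code units.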
>>
>> Tom.
>> _______________________________________________
>> Lib mailing list
>> Lib_at_[hidden]
>> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib
>> Link to this post: http://lists.isocpp.org/lib/2019/09/13440.php
>>
>
> _______________________________________________
> Lib mailing list
> Lib_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib
> Link to this post: http://lists.isocpp.org/lib/2019/09/13446.php
>

Received on 2019-09-08 08:08:38