Date: Sat, 7 Sep 2019 20:39:05 -0700
> if code units aren't used, then behavior should be different for LANG=C
> vs LANG=C.UTF-8.
In that case I agree with your proposed resolution of using code units:
all of std::format is locale-independent by default by design, and it would
be very unfortunate to break this property by making the output depend on
the global locale (or on the locale passed to some overloads).

In future Unicode overloads we'll (hopefully) be able to do better and use
grapheme clusters, or at least code points, because the encoding is static.
We could also introduce a separate format specifier for locale-specific
formatting of string arguments, similar to the ones we already have for
numbers, and only query the locale if the user explicitly requests it.
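
For concreteness, a minimal sketch of that analogy (assuming a C++20
<format> implementation and an installed en_US.UTF-8 locale; the string
form in the final comment is purely hypothetical, not part of any wording):

#include <format>
#include <iostream>
#include <locale>

int main() {
    // Locale-independent by default:
    std::cout << std::format("{}", 1234567) << '\n';        // "1234567"

    // Locale-specific grouping only when explicitly requested with 'L'
    // (assumes the named locale is installed on the system):
    std::locale loc("en_US.UTF-8");
    std::cout << std::format(loc, "{:L}", 1234567) << '\n'; // "1,234,567"

    // Hypothetical string analogue (not valid today): locale-aware
    // formatting of a string argument only when explicitly requested.
    // std::cout << std::format(loc, "{:L}", "\xC3\x81") << '\n';
}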
- Victor
On Sat, Sep 7, 2019 at 8:30 PM Tom Honermann <tom_at_[hidden]> wrote:
> On 9/7/19 10:44 PM, Victor Zverovich wrote:
>
> > Is field width measured in code units, code points, or something else?
>
> I think the main consideration here is that width should be
> locale-independent by default for consistency with the rest of
> std::format's design.
>
> I agree with that goal, but...
>
> If we can say that width is measured in grapheme clusters or code points
> based on the execution encoding (or whatever the standardese term is)
> without querying the locale, then I suggest doing so.
>
> I don't know how to do that. As I noted in my response to Zach, if code
> units aren't used, then the behavior would differ between LANG=C and
> LANG=C.UTF-8.
>
> I have a slight preference for grapheme clusters since those correspond to
> user-perceived characters, but I only have implementation experience with
> code points (this is what both the fmt library and Python do).
>
> I would definitely vote for EGCs over code points. I think code points
> are probably the worst of the options since they make the results dependent
> on Unicode normalization form.
>
> Tom.
>
>
> Cheers,
> Victor
>
> On Sat, Sep 7, 2019 at 5:13 PM Tom Honermann via Lib <lib_at_[hidden]>
> wrote:
>
>> [format.string.std]p7 <http://eel.is/c++draft/format#string.std-7>
>> states:
>>
>> The *positive-integer* in *width* is a decimal integer defining the
>> minimum field width. If *width* is not specified, there is no minimum
>> field width, and the field width is determined based on the content of the
>> field.
>>
>> Is field width measured in code units, code points, or something else?
>>
>> Consider the following example assuming a UTF-8 locale:
>>
>> std::format("{}", "\xC3\x81");     // U+00C1 { LATIN CAPITAL LETTER A WITH ACUTE }
>> std::format("{}", "\x41\xCC\x81"); // U+0041 U+0301 { LATIN CAPITAL LETTER A } { COMBINING ACUTE ACCENT }
>>
>> In both cases, the arguments encode the same user-perceived character
>> (Á). The first uses two UTF-8 code units to encode a single code point
>> that represents a single glyph using a composed Unicode normalization
>> form. The second uses three code units to encode two code points that
>> represent the same glyph using a decomposed Unicode normalization form.
>>
>> How is the field width determined? If measured in code units, the first
>> has a width of 2 and the second of 3. If measured in code points, the
>> first has a width of 1 and the second of 2. If measured in grapheme
>> clusters, both have a width of 1. Is the determination locale dependent?
>>
>> *Proposed resolution:*
>>
>> Field widths are measured in code units and are not locale dependent.
>> Modify [format.string.std]p7 <http://eel.is/c++draft/format#string.std-7>
>> as follows:
>>
>> The *positive-integer* in *width* is a decimal integer defining the
>> minimum field width. If *width* is not specified, there is no minimum
>> field width, and the field width is determined based on the content of the
>> field. *Field width is measured in code units. Each byte of a
>> multibyte character contributes to the field width.*
>>
>> (*code unit* is not formally defined in the standard. Most uses occur
>> in UTF-8- and UTF-16-specific contexts, but [lex.ext]p5
>> <http://eel.is/c++draft/lex.ext#5> uses it in an encoding-agnostic
>> context.)
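
For illustration, a minimal sketch of what the proposed wording means for
padding, assuming a UTF-8 literal encoding and an implementation that
measures width in code units as proposed here:

#include <format>
#include <iostream>

int main() {
    // "\xC3\x81" occupies 2 code units, so a minimum width of 4 adds 2 fill
    // characters, even though the string displays as one character (Á).
    std::cout << std::format("[{:4}]", "\xC3\x81") << '\n';      // "[Á  ]"

    // "\x41\xCC\x81" occupies 3 code units, so only 1 fill character is added.
    std::cout << std::format("[{:4}]", "\x41\xCC\x81") << '\n';  // "[Á ]"
}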
>>
>> Tom.
>> _______________________________________________
>> Lib mailing list
>> Lib_at_[hidden]
>> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib
>> Link to this post: http://lists.isocpp.org/lib/2019/09/13440.php
>>
>
>
Received on 2019-09-08 05:39:18