Date: Sat, 7 Sep 2019 20:13:12 -0400
[format.string.std]p7 <http://eel.is/c++draft/format#string.std-7> states:
> The /positive-integer/ in /width/ is a decimal integer defining the
> minimum field width. If /width/ is not specified, there is no minimum
> field width, and the field width is determined based on the content of
> the field.
>
Is field width measured in code units, code points, or something else?
Consider the following example assuming a UTF-8 locale:
std::format("{}", "\xC3\x81"); // U+00C1{ LATIN CAPITAL LETTER A
WITH ACUTE }
std::format("{}", "\x41\xCC\x81"); // U+0041 U+0301 { LATIN CAPITAL
LETTER A } { COMBINING ACUTE ACCENT }
In both cases, the arguments encode the same user-perceived character
(Á). The first uses two UTF-8 code units to encode a single code point
that represents a single glyph using a composed Unicode normalization
form. The second uses three code units to encode two code points that
represent the same glyph using a decomposed Unicode normalization form.
How is the field width determined? If measured in code units, the first
has a width of 2 and the second of 3. If measured in code points, the
first has a width of 1 and the second of 2. If measured in grapheme
clusters, both have a width of 1. Is the determination locale dependent?
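To make the possibilities concrete, the following sketch requests a
minimum field width of 3 (the width of 3 and the surrounding brackets
are chosen only for illustration; strings use left alignment and space
fill by default). The comments show how much fill each interpretation
would produce:

  #include <format>
  #include <iostream>

  int main() {
      // Two code units, one code point, one grapheme cluster.
      std::cout << std::format("[{:3}]", "\xC3\x81") << '\n';
      // code units:        measured width 2 -> one fill character
      // code points:       measured width 1 -> two fill characters
      // grapheme clusters: measured width 1 -> two fill characters

      // Three code units, two code points, one grapheme cluster.
      std::cout << std::format("[{:3}]", "\x41\xCC\x81") << '\n';
      // code units:        measured width 3 -> no fill characters
      // code points:       measured width 2 -> one fill character
      // grapheme clusters: measured width 1 -> two fill characters
  }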
*Proposed resolution:*
Field widths are measured in code units and are not locale dependent.
Modify [format.string.std]p7
<http://eel.is/c++draft/format#string.std-7> as follows:
> The /positive-integer/ in /width/ is a decimal integer defining the
> minimum field width. If /width/ is not specified, there is no minimum
> field width, and the field width is determined based on the content of
> the field. *Field width is measured in code units. Each byte of a
> multibyte character contributes to the field width.*
>
(/code unit/ is not formally defined in the standard. Most uses occur
in UTF-8- and UTF-16-specific contexts, but [lex.ext]p5
<http://eel.is/c++draft/lex.ext#5> uses it in an encoding-agnostic context.)
Tom.