Hi Tom,

What a very helpful discussion – thank you!

I wouldn’t object to wording that gives implementers the freedom to restore alignment by overfilling/underfilling at other points in the formatted output. My view: definitely do not mandate it, but perhaps avoid explicitly banning it.

I would also be interested in wording that permits width estimation in non-Unicode but mandates it for Unicode. I feel there have been too many things recently that carve out special cases for UTF-8 when we could have given a more general permission with only a small amount of extra work.

Best regards,

Peter

From: SG16 <sg16-bounces@lists.isocpp.org> On Behalf Of Tom Honermann via SG16
Sent: 01 December 2021 05:03
To: SG16 <sg16@lists.isocpp.org>
Cc: Tom Honermann <tom@honermann.net>
Subject: [SG16] Proposed resolution for LWG3639: Handling of fill character width is underspecified in std::format

The following is in preparation for the SG16 telecon scheduled for tomorrow.

LWG3639 tracks how implementations should handle fill characters that have an estimated width other than 1 (the current proposed resolution for LWG3576 limits fill characters to those encodable as a single code point in the ordinary literal encoding). The issue discussion records three possible ways to resolve the issue:

1. s == "🤡🤡🤡🤡42": use the estimated display width, correctly displayed on compatible terminals.
2. s == "🤡🤡🤡🤡🤡🤡🤡🤡42": assume the display width of 1, incorrectly displayed.
3. Require the fill character to have the estimated width of 1.

Discussion:

Assuming that we elect to allow characters with an estimated width other than 1 to be used as fill characters (i.e., we reject option 3 above), I think the normative guidance offered in [format.string.std]p11 suffices to direct implementations to choose between options 1 and 2; option 1 should be used for Unicode encodings, and option 2 otherwise. This effectively limits the options under consideration to just two.

If characters that have estimated widths other than 1 are not permitted as fill characters, then whether a program is correct or not may depend on what encoding is used for the ordinary literal encoding. For example, Shift-JIS supports encoding katakana as either half-width or full-width (e.g., ｶ U+FF76 {HALFWIDTH KATAKANA LETTER KA} or カ U+30AB {KATAKANA LETTER KA}). Since Shift-JIS is not a Unicode encoding, an implementation may assign all such characters an estimated width of 1, but the full-width variants would have an estimated width of 2 for a Unicode encoding. I have no perspective on whether such full-width characters would ever be desirable as fill characters and thus prefer not to prohibit them without strong cause.
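
To make the half-width/full-width difference concrete, here is a minimal sketch of width estimation for a Unicode encoding. The function name estimated_width is hypothetical, and the range table is written from my recollection of the C++20 wording in [format.string.std]p11, so please treat it as an assumption to check against the standard rather than as normative text.

```cpp
// Hypothetical helper, not a standard library function.
// Code point ranges assumed to have an estimated width of 2, as recalled
// from C++20 [format.string.std]p11 (an assumption; verify before relying on it).
struct WideRange { char32_t first; char32_t last; };

static const WideRange wide_ranges[] = {
    {0x1100, 0x115F},   {0x2329, 0x232A},   {0x2E80, 0x303E},
    {0x3040, 0xA4CF},   {0xAC00, 0xD7A3},   {0xF900, 0xFAFF},
    {0xFE10, 0xFE19},   {0xFE30, 0xFE6F},   {0xFF00, 0xFF60},
    {0xFFE0, 0xFFE6},   {0x1F300, 0x1F64F}, {0x1F900, 0x1F9FF},
    {0x20000, 0x2FFFD}, {0x30000, 0x3FFFD},
};

// Returns 2 for code points inside one of the ranges above, otherwise 1.
int estimated_width(char32_t cp) {
    for (const WideRange& r : wide_ranges) {
        if (cp >= r.first && cp <= r.last) {
            return 2;
        }
    }
    return 1;
}
```

Under that table, U+FF76 (half-width KA) falls outside every range and gets width 1, while U+30AB (full-width KA) and U+1F921 (the clown face used above) each get width 2.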

If the estimated width of the fill character is greater than 1, then alignment to the end of the field might not be possible. For example:
std::format("{:🤡>4}", 123);
There are a number of options available to handle such cases:

Underfill the available space leaving the field misaligned.
Overfill the available space leaving the field misaligned.
Substitute the default fill character (that has an estimated width of 1) once the available space is reduced to less than the estimated width of the requested fill character.
Throw an exception.
UB.
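
The difference between the first two options comes down to how the fill count is rounded. A minimal sketch, using hypothetical helper names and assuming the caller has already computed the columns left over after the formatted value and that the fill width is at least 1:

```cpp
#include <cstddef>

// Hypothetical helpers, not part of any standard library.
// available:  columns remaining in the field after the formatted value.
// fill_width: estimated width of the fill character (assumed >= 1).

// Underfill: the largest count whose total width does not exceed `available`.
std::size_t underfill_count(std::size_t available, std::size_t fill_width) {
    return available / fill_width;
}

// Overfill: the smallest count whose total width covers `available`.
std::size_t overfill_count(std::size_t available, std::size_t fill_width) {
    return (available + fill_width - 1) / fill_width;
}
```

For the example above (field width 4, value "123", fill of estimated width 2) there is 1 column available, so underfilling inserts no fill characters (total width 3) and overfilling inserts one (total width 5). The two counts agree whenever the available space is a multiple of the fill width.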

There are also two related wording omissions:

Table [tab.format.align] doesn't specify how alignment is achieved for the '<' and '>' options (the wording doesn't state to insert fill characters as it does for the '^' option).
It is not specified what happens when alignment to the end of the field is requested, but the width of the formatted value exceeds the field width thereby making such alignment impossible (it is presumably intended that the available space is overflowed resulting in misalignment; truncation, throwing an exception, or UB are alternate options).
std::format("{:X>1}", 9999);

In some cases, it would be possible for field alignment to be restored by tracking underfill and overfill counts and underfilling or overfilling later fields. That ... is probably not a good idea.

Proposal:

Allow characters that have an estimated width other than 1 to be used as fill characters.
Underfill the available field space when inserting an additional fill character would otherwise lead to overfill.
Do not substitute a default fill character, throw an exception, or specify UB when alignment is not possible. This is based on the assumption that misalignment is preferred over the other possibilities.
Overflow the available field space when the formatted value exceeds the field width. This is based on the assumption that outputting all available data is preferred over the other possibilities.
Do not try to cleverly count underfill and overfill and adjust later fields.

Proposed Resolution:

The wording below is intended to address LWG3639 and to supersede the current proposed resolution for LWG3576.

Change [format.string.std]p1:

[...]

The syntax of format specifications is as follows:

[...]

fill:

any ~~character~~member of the translation character set ([lex.charset]) other than ~~{~~U+007B LEFT CURLY BRACKET or ~~}~~U+007D RIGHT CURLY BRACKET

[...]

Change [format.string.std]p2:

The fill character is the character denoted by the fill specifier or, if the fill specifier is absent, U+0020 SPACE.

[Note 2: ~~The fill character can be any character other than { or }.~~ The presence of a ~~fill character~~fill specifier is signaled by the character following it, which must be one of the alignment options. If the second character of std-format-spec is not a valid alignment option, then it is assumed that ~~both the fill character and the alignment option are~~the fill-and-align specifier is absent. — end note]

Change [format.string.std]p3:

The align specifier applies to all argument types. The meaning of the various alignment options is as specified in Table 62.

[Example 1:
char c = 120;
string s0 = format("{:6}", 42);           // value of s0 is "    42"
string s1 = format("{:6}", 'x');          // value of s1 is "x     "
string s2 = format("{:*<6}", 'x');        // value of s2 is "x*****"
string s3 = format("{:*>6}", 'x');        // value of s3 is "*****x"
string s4 = format("{:*^6}", 'x');        // value of s4 is "**x***"
string s5 = format("{:6d}", c);           // value of s5 is "   120"
string s6 = format("{:6}", true);         // value of s6 is "true  "
string s7 = format("{:*>6}", "12345678"); // value of s7 is "12345678"
— end example]

[Note 3: Unless a minimum field width is defined, the field width is determined by the size of the content and the alignment option has no effect. — end note]

[Note 4: If the width of the formatting argument value exceeds the field width, then the alignment option has no effect. — end note]

[Note 5: It may not be possible to exactly align the formatting argument value within the available space when the fill character has an estimated width other than 1. — end note]

Table 62: Meaning of align options [tab:format.align]

| Option | Meaning |
| --- | --- |
| < | Forces the field to be aligned to the start of the available space by inserting n fill characters after the formatting argument value where n is the number of fill characters needed to most closely align to the field width without exceeding it. This is the default for non-arithmetic types, charT, and bool, unless an integer presentation type is specified. |
| > | Forces the field to be aligned to the end of the available space by inserting n fill characters before the formatting argument value where n is the number of fill characters needed to most closely align to the field width without exceeding it. This is the default for arithmetic types other than charT and bool or when an integer presentation type is specified. |
| ^ | Forces the field to be centered within the available space by inserting ⌊n/2⌋ fill characters before and ⌈n/2⌉ fill characters after the formatting argument value, where n is the total number of fill characters ~~to insert~~needed to most closely align to the field width without exceeding it. |
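
For illustration only, the table wording can be sketched as a small padding routine. The names repeat and pad are hypothetical, the caller is assumed to supply the estimated widths, and fill_width is assumed to be at least 1; this is not how any particular implementation is written.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: concatenates n copies of s.
std::string repeat(const std::string& s, std::size_t n) {
    std::string out;
    for (std::size_t i = 0; i < n; ++i) {
        out += s;
    }
    return out;
}

// Hypothetical sketch of the proposed alignment rules.
// value/value_width: formatted argument and its estimated width.
// fill/fill_width:   encoded fill character and its estimated width (>= 1).
// align:             '<', '>' or '^'.
std::string pad(const std::string& value, std::size_t value_width,
                std::size_t field_width, char align,
                const std::string& fill, std::size_t fill_width) {
    // The value fills or exceeds the field: emit it whole, no fill (Note 4).
    if (value_width >= field_width) {
        return value;
    }
    // n: most fill characters that fit without exceeding the field width.
    const std::size_t n = (field_width - value_width) / fill_width;
    switch (align) {
    case '<':
        return value + repeat(fill, n);
    case '>':
        return repeat(fill, n) + value;
    default:  // '^': floor(n/2) before, ceil(n/2) after
        return repeat(fill, n / 2) + value + repeat(fill, n - n / 2);
    }
}
```

With a fill of estimated width 2, a field width of 10 and the value "42", this yields four fill characters before the value (option 1 in the issue); with a field width of 4 and the value "123" it yields no fill at all, leaving the field one column short as described in Note 5.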

Tom.