Dear Unicoders,

The "discovery" of space-like double-width characters is interesting, but I don't think it changes anything. We already knew that there are many characters with width > 1 that could potentially be used as separators. However, being usable as a separator doesn't automatically make a character eligible to be a fill, by any stretch of the imagination. The design of std::format and the other facilities it is based on doesn't meaningfully work with such characters. To understand why, let's look at the types supported by std::format and observe that for pretty much all of them the argument width is not guaranteed to be a multiple of 2:

bool: printed as true and false, and the width of the latter (5) is not a multiple of 2; only the localized format can potentially be meaningful
numeric types, pointers: printed using Arabic numerals, cannot be meaningfully restricted to multiples of 2 even in the localized format
character types: char cannot be meaningfully restricted to multiples of 2
string types: can potentially be restricted
chrono types: cannot be meaningfully restricted to multiples of 2 except for potentially some locales (although the response from the actual users was negative!)

As we can see, pretty much the only case where a double-width fill makes sense is when the arguments are restricted to a small subset of inputs in a localized environment. Although we could invent some handling for a mix of double- and single-width inputs, Tom's analysis clearly showed that none of the solutions is satisfactory (they are basically hacks).
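
To make the parity problem concrete, here is a minimal sketch of the arithmetic involved. It is not how std::format is specified or implemented; PadResult and pad_with_fill are hypothetical names introduced only for illustration, and widths are counted in estimated columns.

```cpp
#include <cstddef>

// Hypothetical helper, not part of std::format: reports how many fill
// characters fit into the padding gap and how many columns remain uncovered.
struct PadResult {
    std::size_t fill_count;  // number of fill characters that fit in the gap
    std::size_t leftover;    // columns the fill cannot cover
};

constexpr PadResult pad_with_fill(std::size_t content_width,
                                  std::size_t target_width,
                                  std::size_t fill_width) {
    // No padding is needed (or possible) when the content already reaches
    // the target width, or when the fill occupies no columns.
    if (fill_width == 0 || content_width >= target_width) return {0, 0};
    const std::size_t gap = target_width - content_width;
    return {gap / fill_width, gap % fill_width};
}
```

With a target width of 10 and a fill of width 2, "true" (width 4) pads cleanly with three fills, while "false" (width 5) leaves one column that no double-width fill can cover. Any handling of that leftover column is one of the hacks mentioned above.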

All of this suggests that this functionality doesn't belong in std::format but, at most, in some localization facility that works exclusively with double-width inputs. Such a facility could potentially use std::format or formatter specializations as part of its implementation, though.

The fact that it would introduce a nonzero penalty for the common case, violating the "don't pay for what you don't use" principle, is also worrying. Fill, width and alignment have to be supported by all formatters, and therefore even seemingly small changes can have significant costs. It's particularly worrying to pay for something which is clearly broken for most inputs.

Therefore I continue to be strongly opposed to introducing this novel design by committee into std::format, to the extent that I'm actually willing to write a paper arguing against it, even though that would be a huge waste of time that distracts us from doing things that actually matter.


On Thu, Dec 2, 2021 at 8:30 AM Corentin via SG16 <> wrote:

At yesterday's telecon, I think I heard 2 arguments:
  • Double width codepoints are useful for some use cases and cultures and should be supported
  • Double width codepoints cannot be aligned properly, and I do not care about them, they should be ill-formed.
I noted that, while we can, in all cases, specify a predictable and stable sequence of codepoints in the output, we cannot and do not guarantee a visual alignment.
Indeed, we should consider that
  • Character width isn't specified by Unicode, and our definition of it can be summarized as: full-width characters are double width, everything else is 1.
  • Under that definition, 𒈙 Cuneiform Sign Lugal Opposing Lugal (which occupies no less than 6 columns in my IDE) counts as one.
  • ﷽ Arabic Ligature Bismillah Ar-Rahman Ar-Raheem counts as one width, despite being long enough to align perfectly with "static_cast"
  • ☹ White Frowning Face Emoji counts as one in the standard, but 🙂 Slightly Smiling Face counts as 2, and so, if we standardize only one fill character, we would only standardize sadness.
  • Indeed, in the absence of a proper Unicode property for character width, we have no choice but to resort to imprecise ranges of codepoints.
  • Alignment is entirely dependent on the existence of a suitable monospace glyph, which isn't and cannot be guaranteed.
  • Parameters longer than the alignment specifiers are never aligned
  • We do not make aligning with ZWJ, ZWSP, BELL, or Set Transmit State ill-formed, and I'd argue that if we allow these, we have no grounds to make the potted plant emoji ill-formed.
And so, the status quo is that we make some effort to align, and it might be off in some cases, even by quite a lot.
And this is fine. Nowhere in the standard do we promise a pixel-perfect alignment on an 8K HDR screen, and I think Donald Knuth would not hold the C++ standard responsible for not doing proper typesetting in terminals.

In that sense, supporting estimated width greater than 1 would not be a deviation from the status quo. Maybe we can add a note that alignment only works in some cases.
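
The width rule summarized above can be sketched as a simple range lookup. This is only an illustration, not the standard's wording: Range, wide_ranges and estimated_width are hypothetical names, and the codepoint ranges are the ones I recall from the C++20 width estimation rule, so treat the table as an assumption (later revisions may differ).

```cpp
// Hypothetical sketch of a range-based width estimate.
// Assumption: the ranges below mirror the C++20 rule as recalled;
// verify them against the actual wording before relying on them.
struct Range {
    char32_t first;
    char32_t last;
};

constexpr Range wide_ranges[] = {
    {0x1100, 0x115F},   {0x2329, 0x232A},   {0x2E80, 0x303E},
    {0x3040, 0xA4CF},   {0xAC00, 0xD7A3},   {0xF900, 0xFAFF},
    {0xFE10, 0xFE19},   {0xFE30, 0xFE6F},   {0xFF00, 0xFF60},
    {0xFFE0, 0xFFE6},   {0x1F300, 0x1F64F}, {0x1F900, 0x1F9FF},
    {0x20000, 0x2FFFD}, {0x30000, 0x3FFFD},
};

// Returns 2 for codepoints inside a listed range and 1 for everything else,
// regardless of how many columns a terminal or IDE actually uses.
constexpr int estimated_width(char32_t cp) {
    for (const Range& r : wide_ranges) {
        if (cp >= r.first && cp <= r.last) return 2;
    }
    return 1;
}
```

Under this table, U+2639 (☹), U+12219 (𒈙) and U+FDFD (﷽) all get width 1, while U+1F642 (🙂) gets width 2, which matches the examples listed above.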

So for me the arguments "it should be ill-formed because it cannot align perfectly" and "it should be ill-formed because I don't see a use case" fall flat.
We cannot reach perfection, but that shouldn't stop us from settling for good enough.

And Peter Brett's motivation of scenarios in which every codepoint is double width is very, very strong.

I'm sympathetic to Zach's argument that the sequence of codepoints should be predictable for testing purposes - so it should be fully specified, with the caveat that we should not guarantee stability of the estimated width.

I'm also sympathetic with Victor's argument that we should provide a resolution for which there exists an implementation and no performance concern.

I think a good path forward would be to
  • Gain implementation experience with option 1, as drafted by Tom.
  • Draft a note that explains very clearly that alignment is not guaranteed if the formatted output contains characters represented by the output device as anything but 1 column wide monospace glyphs.
  • Adopt the revised resolution proposed by Tom, modulo wording tweaks.


SG16 mailing list