sg16: [SG16] More Ruminations about fill characters and alignment (LWG3639)

From: Corentin <corentin.jabot_at_[hidden]>
Date: Thu, 2 Dec 2021 17:29:42 +0100

Hello,

At yesterday's telecon, I think I heard two arguments:

   - Double width codepoints are useful for some use cases and cultures and
   should be supported
   - Double width codepoints cannot be aligned properly, and I do not care
   about them, they should be ill-formed.

I noted that, while we can, in all cases, specify a predictable and stable
sequence of codepoints in the output, we cannot and do not guarantee a
visual alignment.
Indeed, we should consider that

   - Character width isn't specified by Unicode, and our definition of it
   can be summarized as: full-width characters have width 2, everything
   else has width 1.
   - Under that definition, 𒈙 Cuneiform Sign Lugal Opposing Lugal (which
   occupies no less than 6 columns in my IDE) counts as one.
   - ﷽ Arabic Ligature Bismillah Ar-Rahman Ar-Raheem counts as one width,
   despite being long enough to align perfectly with "static_cast"
   - ☹ White Frowning Face Emoji counts as one in the standard, but 🙂
   Slightly Smiling Face counts as 2, and so, if we standardize only one fill
   character, we would only standardize sadness.
   - Indeed, in the absence of a proper Unicode property for
   character-width, we have no choice but to resort to imprecise ranges of
   codepoints.
   - Alignment is entirely dependent on the existence of a suitable
   monospace glyph, which isn't and cannot be guaranteed.
   - Parameters longer than the alignment specifiers are never aligned.
   - We do not make ZWJ, ZWSP, BELL, or Set Transmit State ill-formed,
   and I'd argue that if we allow these, we have no grounds to make the potted
   plant emoji ill-formed.
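
To make the "imprecise ranges of codepoints" point concrete, here is a
minimal sketch of such an estimate. The name estimated_width and the table
layout are mine, and the ranges are the width-2 ranges as I recall them from
the C++20 wording in [format.string.std]; treat the table as an assumption
and check it against the standard before relying on it.

```cpp
// Hypothetical helper, not an API from any standard library.
struct cp_range { char32_t first, last; };

// ASSUMPTION: width-2 ranges as recalled from C++20 [format.string.std].
inline constexpr cp_range wide_ranges[] = {
    {0x1100, 0x115F},   {0x2329, 0x232A},   {0x2E80, 0x303E},
    {0x3040, 0xA4CF},   {0xAC00, 0xD7A3},   {0xF900, 0xFAFF},
    {0xFE10, 0xFE19},   {0xFE30, 0xFE6F},   {0xFF00, 0xFF60},
    {0xFFE0, 0xFFE6},   {0x1F300, 0x1F64F}, {0x1F900, 0x1F9FF},
    {0x20000, 0x2FFFD}, {0x30000, 0x3FFFD},
};

// Returns 2 for code points inside one of the ranges above, 1 otherwise.
constexpr int estimated_width(char32_t cp) {
    for (const cp_range& r : wide_ranges) {
        if (cp >= r.first && cp <= r.last) return 2;
    }
    return 1;
}

// The examples from the list above, under this table:
static_assert(estimated_width(0x12219) == 1);  // Cuneiform Sign Lugal Opposing Lugal
static_assert(estimated_width(0xFDFD)  == 1);  // Arabic Ligature Bismillah
static_assert(estimated_width(0x2639)  == 1);  // White Frowning Face
static_assert(estimated_width(0x1F642) == 2);  // Slightly Smiling Face
```

Nothing in this function knows what a terminal or an IDE will actually
draw, which is the whole point.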

And so, the status quo is that we make some effort to align, and it might
be off in some cases, even by quite a lot.
And this is fine. Nowhere in the standard do we promise a pixel perfect
alignment on an 8K HDR screen, and I think Donald Knuth would not hold the
C++ standard responsible for not doing proper typesetting in terminals.

In that sense, supporting estimated width greater than 1 would not be
a deviation from the status quo. Maybe we can add a note that
alignment only works in some cases.
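
As a rough illustration of what padding with a fill character of estimated
width greater than 1 could look like, here is a sketch of the arithmetic.
The names fill_plan and plan_fill are hypothetical, and this is not the
wording of any proposed resolution; in particular, what to do with leftover
columns is a policy choice that the sketch only reports.

```cpp
// Hypothetical helper: how many fill code points fit in the padding,
// and how many columns cannot be covered by a whole fill character.
struct fill_plan {
    int count;     // number of fill code points to emit
    int leftover;  // columns left uncovered (policy decides what happens)
};

constexpr fill_plan plan_fill(int field_width, int content_width,
                              int fill_width) {
    // Content as wide as or wider than the field is never padded.
    if (content_width >= field_width) return {0, 0};
    const int gap = field_width - content_width;
    // Defensive: a non-positive fill width cannot cover anything.
    if (fill_width <= 0) return {0, gap};
    return {gap / fill_width, gap % fill_width};
}

// Width-1 fill: the status quo, 6 columns need 6 fill characters.
static_assert(plan_fill(10, 4, 1).count == 6);
// Width-2 fill, even gap: 6 columns need 3 fill characters.
static_assert(plan_fill(10, 4, 2).count == 3);
static_assert(plan_fill(10, 4, 2).leftover == 0);
// Width-2 fill, odd gap: one column cannot be covered.
static_assert(plan_fill(9, 4, 2).count == 2);
static_assert(plan_fill(9, 4, 2).leftover == 1);
```

The sequence of code points this produces is fully predictable even when
the rendered result is not aligned.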

So for me the arguments: "It should be ill-formed because it cannot align
perfectly" or "it should be ill-formed because I don't see a use case" fall
flat.
We cannot reach perfection, but that shouldn't stop us from settling for
good enough.

And Peter Brett's motivation of scenarios in which every codepoint is
double width is very, very strong.

I'm sympathetic to Zach's argument that the sequence of codepoints should
be predictable for testing purposes - so it should be fully specified, with
the caveat that we should not guarantee stability of the estimated width.

I'm also sympathetic to Victor's argument that we should provide a
resolution for which there exists an implementation and no performance
concern.

I think a good path forward would be to

   - Gain implementation experience with option 1, as drafted by Tom.
   - Draft a note that explains very clearly that alignment is not
   guaranteed if the formatted output contains characters represented by the
   output device as anything but 1 column wide monospace glyphs.
   - Adopt the revised resolution proposed by Tom, modulo wording tweaks.

Corentin

Received on 2021-12-02 10:29:55