sg16: Re: [SG16] More Ruminations about fill characters and alignment (LWG3639)

From: Victor Zverovich <victor.zverovich_at_[hidden]>
Date: Sat, 18 Dec 2021 09:07:11 -0800

This will add nontrivial overhead to width/fill/alignment handling (there
are some binary size/performance trade-offs, but none of this is free). To
avoid penalizing the common case, we'll probably have to split out
double-width handling and also deal with edge cases such as an odd number
of columns to fill. Even without this, fill/width/alignment is already one
of the biggest contributors to formatter bloat for built-in and string
types. With ranges there will be an explosion in the number of formatter
specializations, and we should be very careful about adding more complexity
in this space. And with ranges, as with most other types, this
functionality will be useless (actually somewhat harmful, if you consider
how easy it is to misuse it and get something that seems to work but
doesn't), because range output cannot be meaningfully restricted to double
width (even if range elements can!).
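
To make the odd-space edge case concrete, here is a rough sketch. It is not
how std::format or any existing implementation handles fill: pad_right, the
padded struct, and the caller-supplied column widths are all hypothetical
names and assumptions made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical result type: the padded text plus the estimated number of
// columns it occupies.
struct padded {
    std::string text;
    std::size_t columns_used;
};

// Hypothetical helper (not part of std::format): left-aligns `content` in
// a field of `field_width` columns by appending whole copies of `fill`.
// The caller supplies the estimated column widths of content and fill.
padded pad_right(std::string_view content, std::size_t content_width,
                 std::string_view fill, std::size_t fill_width,
                 std::size_t field_width) {
    padded result{std::string(content), content_width};
    if (fill_width == 0 || content_width >= field_width) {
        return result;  // nothing to pad
    }
    const std::size_t gap = field_width - content_width;
    // Only whole fill characters fit; with fill_width == 2 and an odd gap,
    // one column is left over and the field is underfilled.
    const std::size_t copies = gap / fill_width;
    for (std::size_t i = 0; i < copies; ++i) {
        result.text += fill;
    }
    result.columns_used += copies * fill_width;
    return result;
}
```

With a single-width fill, pad_right("true", 4, "*", 1, 8) fills all 8
columns. With a double-width fill such as U+3000 (passed as UTF-8 with
fill_width 2), pad_right("false", 5, "\xE3\x80\x80", 2, 8) can fit only one
fill character and ends up occupying 7 columns, one short of the requested
width. Even this toy version needs a division and a leftover-column policy
that the single-width path never has to consider.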

Cheers,
Victor

On Sun, Dec 12, 2021 at 9:54 AM Corentin <corentin.jabot_at_[hidden]> wrote:

>
>
> On Sun, Dec 12, 2021 at 3:55 PM Victor Zverovich <
> victor.zverovich_at_[hidden]> wrote:
>
>> Dear Unicoders,
>>
>> The "discovery" of space-like double-width characters is interesting, but
>> I don't think it changes anything. We already knew that there are many
>> characters with width > 1 that could potentially be used as separators.
>> However, being used as a separator doesn't automatically make a character
>> eligible to be a fill, by any stretch of the imagination. The design of
>> std::format and the other facilities it is based on doesn't work
>> meaningfully with such characters. To understand why, let's look at the
>> types supported by std::format and observe that pretty much none of the
>> arguments is guaranteed to have a width that is a multiple of 2:
>>
>> - bool: printed as true and false, with the width of the latter not
>> being a multiple of 2; only localized format can potentially be
>> meaningful
>> - numeric types, pointers: using Arabic numerals, cannot be meaningfully
>> restricted to multiples of 2 even in localized format
>> - character types: char cannot be meaningfully restricted to multiples
>> of 2
>> - string types: can potentially be restricted
>> - chrono types: cannot be meaningfully restricted to multiples of 2
>> except for potentially some locales (although the response from the
>> actual users was negative!)
>>
>> As we can see, pretty much the only case where a double-width fill makes
>> sense is when the arguments are restricted to a small subset of inputs in
>> a localized environment. Although we could invent some handling of a mix
>> of double- and single-width inputs, as Tom's analysis clearly showed,
>> none of the solutions is satisfactory (they are basically hacks).
>>
>> All of this suggests that this functionality doesn't belong in
>> std::format but, at most, in some localization facility that works
>> exclusively with double-width inputs. It could potentially use
>> std::format or formatter specializations as part of its implementation,
>> though.
>>
>> The fact that it would introduce a nonzero penalty for the common case,
>> violating the "don't pay for what you don't use" principle, is also
>> worrying. Fill,
>> width and alignment have to be supported by all formatters and therefore
>> even seemingly small changes can have significant costs. It's particularly
>> worrying to pay for something which is clearly broken for most inputs.
>>
>> Therefore I continue to be strongly opposed to introducing this novel
>> design by committee to std::format to the extent that I'm actually willing
>> to write a paper arguing about not doing this even though it's a huge waste
>> of time that distracts us from doing things that actually matter.
>>
>
> I really do not think you need to write a paper (and I would hate to force
> you to do more work).
> However, can you explain in a few words what the implementation
> challenges/cost are?
> Storing the width of the codepoint or something else?
>
> To be clear, we agree that the scenarios in which it is useful would be
> limited, and we are not asking for heroics (underfilling in these cases is
> perfectly acceptable).
>
> Thanks a lot,
>
> Corentin
>
>
>>
>> Cheers,
>> Victor
>>
>> On Thu, Dec 2, 2021 at 8:30 AM Corentin via SG16 <sg16_at_[hidden]>
>> wrote:
>>
>>> Hello,
>>>
>>> - At yesterday's telecon, I think I heard 2 arguments
>>>
>>> - Double width codepoints are useful for some use cases and cultures
>>> and should be supported
>>> - Double width codepoints cannot be aligned properly, and I do not
>>> care about them, they should be ill-formed.
>>>
>>> I noted that, while we can, in all cases, specify a predictable and
>>> stable sequence of codepoints in the output, we cannot and do not
>>> guarantee a visual alignment.
>>> Indeed, we should consider that
>>>
>>> - Character width isn't specified by Unicode and our definition of
>>> them can be summarized as: full width characters are double width,
>>> everything else is 1.
>>> - Under that definition, 𒈙 Cuneiform Sign Lugal Opposing Lugal
>>> (which occupies no less than 6 columns in my IDE) counts as one.
>>> - ﷽ Arabic Ligature Bismillah Ar-Rahman Ar-Raheem counts as one
>>> width, despite being long enough to align perfectly with "static_cast"
>>> - ☹ White Frowning Face Emoji counts as one in the standard, but 🙂
>>> Slightly Smiling Face counts as 2, and so, if we standardize only one fill
>>> character, we would only standardize sadness.
>>> - Indeed, in the absence of a proper Unicode property for
>>> character width, we have no choice but to resort to imprecise ranges of
>>> codepoints.
>>> - Alignment is entirely dependent on the existence of a suitable
>>> monospace glyph, which isn't and cannot be guaranteed.
>>> - Parameters longer than the alignment specifiers are never aligned
>>> - We do not make aligning with ZWJ, ZWSP, BELL, or Set Transmit State
>>> ill-formed, and I'd argue that if we allow these, we have no grounds to
>>> make the potted plant emoji ill-formed.
>>>
>>> And so, the status quo is that we make some effort to align, and it
>>> might be off in some cases, even by quite a lot.
>>> And this is fine. Nowhere in the standard do we promise a pixel-perfect
>>> alignment on an 8K HDR screen, and I think Donald Knuth would not hold
>>> the C++ standard responsible for not doing proper typesetting in
>>> terminals.
>>>
>>> In that sense, supporting estimated width greater than 1 would not be
>>> a deviation from the status quo. Maybe we can add a note that
>>> alignment only works in some cases.
>>>
>>> So for me the arguments "It should be ill-formed because it cannot
>>> align perfectly" or "it should be ill-formed because I don't see a use
>>> case" fall flat.
>>> We cannot reach perfection, but that shouldn't stop us from settling for
>>> good enough.
>>>
>>> And Peter Brett's motivation of scenarios in which every codepoint is
>>> double width is very, very strong.
>>>
>>> I'm sympathetic to Zach's argument that the sequence of codepoints
>>> should be predictable for testing purposes - so it should be fully
>>> specified, with the caveat that we should not guarantee stability of the
>>> estimated width.
>>>
>>> I'm also sympathetic with Victor's argument that we should provide a
>>> resolution for which there exists an implementation and no performance
>>> concern.
>>>
>>> I think a good path forward would be to
>>>
>>> - Gain implementation experience with option 1, as drafted by Tom.
>>> - Draft a note that explains very clearly that alignment is not
>>> guaranteed if the formatted output contains characters represented by the
>>> output device as anything but 1 column wide monospace glyphs.
>>> - Adopt the revised resolution proposed by Tom, modulo wording
>>> tweaks.
>>>
>>>
>>> Corentin
>>>
>>> --
>>> SG16 mailing list
>>> SG16_at_[hidden]
>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>
>>

Received on 2021-12-18 11:07:24