Date: Sun, 12 Dec 2021 06:55:17 -0800
Dear Unicoders,
The "discovery" of space-like double width characters is interesting but I
don't think it changes anything. We already knew that there are many
characters with width > 1 that could potentially be used as separators.
However, being usable as a separator doesn't automatically make a character
eligible to be a fill, by any stretch of imagination. The design of
std::format and of the other facilities it is based on doesn't meaningfully
work with such characters. To understand why, let's look at the types
supported by std::format and observe that pretty much none of them produce
output whose width is reliably a multiple of 2:
- bool: printed as true or false, and the width of the latter is not a
  multiple of 2; only a localized format can potentially be meaningful
- numeric types, pointers: printed using Arabic numerals, so they cannot be
  meaningfully restricted to multiples of 2 even in a localized format
- character types: char cannot be meaningfully restricted to multiples of 2
- string types: can potentially be restricted
- chrono types: cannot be meaningfully restricted to multiples of 2 except
  potentially in some locales (although the response from the actual users
  was negative!)
As we can see, pretty much the only case where a double-width fill makes
sense is when the arguments are restricted to a small subset of inputs in a
localized environment. Although we could invent some handling of a mix of
double- and single-width inputs, Tom's analysis clearly showed that none of
the solutions is satisfactory (they are basically hacks).
All of this suggests that this functionality doesn't belong in std::format
but, at most, in some localization facility that works exclusively with
double-width inputs. Such a facility could potentially use std::format or
formatter specializations as part of its implementation, though.
The fact that it would introduce a nonzero penalty for the common case,
violating the "don't pay for what you don't use" principle, is also
worrying. Fill, width and alignment have to be supported by all formatters,
and therefore even seemingly small changes can have significant costs. It's
particularly worrying to pay for something that is clearly broken for most
inputs.
Therefore I continue to be strongly opposed to introducing this novel
design by committee into std::format, to the extent that I'm actually
willing to write a paper arguing against doing this, even though it's a huge
waste of time that distracts us from doing things that actually matter.
Cheers,
Victor
On Thu, Dec 2, 2021 at 8:30 AM Corentin via SG16 <sg16_at_[hidden]>
wrote:
> Hello,
>
> - At yesterday's telecon, I think I heard 2 arguments
>
> - Double width codepoints are useful for some use cases and cultures
> and should be supported
> - Double width codepoints cannot be aligned properly, and I do not
> care about them, they should be ill-formed.
>
> I noted that, while we can, in all cases, specify a predictable and stable
> sequence of codepoints in the output, we cannot and do not guarantee a
> visual alignment.
> Indeed, we should consider that
>
> - Character width isn't specified by Unicode, and our definition of
> it can be summarized as: full width characters are double width,
> everything else is 1.
> - Under that definition, 𒈙 Cuneiform Sign Lugal Opposing Lugal (which
> occupies no less than 6 columns in my IDE) counts as one.
> - ﷽ Arabic Ligature Bismillah Ar-Rahman Ar-Raheem counts as one width,
> despite being long enough to align perfectly with "static_cast"
> - ☹ White Frowning Face Emoji counts as one in the standard, but 🙂
> Slightly Smiling Face counts as 2, and so, if we standardize only one fill
> character, we would only standardize sadness.
> - Indeed, in the absence of a proper Unicode property for
> character width, we have no choice but to resort to imprecise ranges of
> codepoints.
> - Alignment is entirely dependent on the existence of a suitable
> monospace glyph, which isn't and cannot be guaranteed.
> - Parameters longer than the alignment specifiers are never aligned
> - We do not make ZWJ, ZWSP, BELL, or Set Transmit State
> ill-formed, and I'd argue that if we allow these, we have no grounds to
> make the potted plant emoji ill-formed.
>
> And so, the status quo is that we make some effort to align, and it might
> be off in some cases, even by quite a lot.
> And this is fine. Nowhere in the standard do we promise a pixel perfect
> alignment on a 8K HDR screen, and I think Donald Knuth would not hold the
> C++ standard responsible for not doing proper typesetting in terminals.
>
> In that sense, supporting estimated width greater than 1 would not be
> a deviation from the status quo. Maybe we can add a note that
> alignment only works in some cases.
>
> So for me the arguments : "It should be ill-formed because it cannot
> align perfectly" or "it should be ill-formed because I don't see a use
> case" fall flat.
> We cannot reach perfection, but that shouldn't stop us from settling for
> good enough.
>
> And Peter Brett's motivation of scenarios in which every codepoint is
> double width is very, very strong.
>
> I'm sympathetic to Zach's argument that the sequence of codepoints should
> be predictable for testing purposes - so it should be fully specified, with
> the caveat that we should not guarantee stability of the estimated width.
>
> I'm also sympathetic to Victor's argument that we should provide a
> resolution for which there exists an implementation and no performance
> concern.
>
> I think a good path forward would be to
>
> - Gain implementation experience with option 1, as drafted by Tom.
> - Draft a note that explains very clearly that alignment is not
> guaranteed if the formatted output contains characters represented by the
> output device as anything but 1 column wide monospace glyphs.
> - Adopt the revised resolution proposed by Tom, modulo wording tweaks.
>
>
> Corentin
>
>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
Received on 2021-12-12 08:55:33