I've added this to my list of future
agenda items.
I think wording may be missing regarding when the fill "character"
is actually used. The only relevant wording I can find is
[format.string.std] table 59, and there only for the case where "^"
is used to center text in a field. Actually, that wording doesn't
technically state what "characters" are to be used to perform the
fill; it only states how many are to be inserted.
[format.string.std]p3 contains an example that demonstrates the
intent, but I don't see any other normative wording.
I think we can avoid specifying that a
fill character must be exactly one grapheme cluster or that it
must have an estimated width of 1 by borrowing some of the wording
from
[format.string.std]p14.
We can state something like, "For a string in a Unicode encoding,
the sequence of extended grapheme clusters in the fill character
are repeatedly inserted one whole extended grapheme cluster at a
time until the field width would be exceeded. [ Note: if
the fill character contains an extended grapheme cluster with an
estimated width greater than 1, then the field may not be padded
to the full width - end note ]." This would allow format
specifiers like "{:~-^10}" that, given an argument of "text", would
format the field as "~-~text~-~".
Basically, I guess I'm arguing against trying to restrict what is
a valid fill character; we can avoid the need for diagnostics by
doing so.
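To make the suggested behavior concrete, here is a rough C++ sketch of
the padding rule. It is not proposed wording or any implementation's
code. FillCluster, make_padding, and pad_center are hypothetical names,
and the fill is assumed to be already split into extended grapheme
clusters with known estimated widths. Each side of a centered field
restarts the fill sequence, which matches the "~-~text~-~" result above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical: one extended grapheme cluster of the fill, with its
// encoded text and its estimated display width.
struct FillCluster {
    std::string text;
    std::size_t width;
};

// Insert whole clusters, cycling through `fill`, and stop as soon as the
// next cluster would exceed `budget` columns.
std::string make_padding(const std::vector<FillCluster>& fill,
                         std::size_t budget) {
    assert(!fill.empty());
    std::string out;
    std::size_t used = 0;
    for (std::size_t i = 0;; i = (i + 1) % fill.size()) {
        const FillCluster& c = fill[i];
        // Simplification: a zero-width cluster also ends the padding,
        // so the loop always terminates.
        if (c.width == 0 || used + c.width > budget) break;
        out += c.text;
        used += c.width;
    }
    return out;
}

// Center `content` (estimated width `content_width`) in `field_width`
// columns: floor(n/2) columns before and ceil(n/2) columns after.
std::string pad_center(const std::string& content, std::size_t content_width,
                       std::size_t field_width,
                       const std::vector<FillCluster>& fill) {
    if (content_width >= field_width) return content;
    const std::size_t n = field_width - content_width;
    const std::size_t left = n / 2;
    const std::size_t right = n - left;
    return make_padding(fill, left) + content + make_padding(fill, right);
}
```

With a fill of "~" and "-" (width 1 each), "text", and a field width of
10, this yields "~-~text~-~". With a single width-2 fill cluster and a
field width of 9, only one cluster fits on each side, so the field ends
up one column short, which is the situation the note describes.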
Another tangent: [format.string.std]p11 may have a wording issue.
It states, "... implementations should estimate the width of a
string as the sum of estimated widths of the first code points in
its extended grapheme clusters...". As is, that seems ambiguous as
to whether the sum is based on the first code point of each EGC or
whether each EGC contributes its first code points (some undefined
sequence of 0 or more initial code points) towards the sum. I think
the intent would be better expressed with "... implementations
should estimate the width of a string as the sum of estimated
widths of the first code point in each extended grapheme
cluster...".
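The reading I believe is intended (one code point per EGC) can be
sketched as follows. This is only an illustration under assumptions:
the string is taken to be already segmented into extended grapheme
clusters, code_point_width and string_width are hypothetical names, and
the wide ranges listed are a small illustrative subset rather than the
full list in the standard.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Simplified width estimate for one code point. Assumption: only a few
// wide ranges are checked here; everything else counts as width 1.
constexpr std::size_t code_point_width(char32_t cp) {
    if ((cp >= 0x1100 && cp <= 0x115F) ||
        (cp >= 0x3040 && cp <= 0xA4CF) ||
        (cp >= 0xAC00 && cp <= 0xD7A3) ||
        (cp >= 0x1F300 && cp <= 0x1F64F)) {
        return 2;
    }
    return 1;
}

// Sum the estimated width of the FIRST code point of EACH cluster.
// Later code points in a cluster (e.g. combining marks) add nothing.
std::size_t string_width(const std::vector<std::u32string>& clusters) {
    std::size_t total = 0;
    for (const std::u32string& egc : clusters) {
        if (!egc.empty()) total += code_point_width(egc.front());
    }
    return total;
}
```

For example, the two clusters "e" + U+0301 and U+754C contribute 1 and
2 respectively, for an estimated width of 3; the combining mark in the
first cluster is not counted.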
Tom.
On 8/9/21 4:04 PM, Steve Downey via
SG16 wrote:
A specific case of the general confusion between
"character" and `char`. It's broken for any multi-byte encoding,
not just Unicode.
However, I suspect that grapheme clusters might be a rabbit hole.
Checking whether a sequence is a single grapheme cluster is
difficult, and IIRC the definition might change?
Hello,
I wanted to bring this new LWG issue to your attention.
The author asks whether the fill character of
std::format is
- a code unit
- a code point
- a grapheme cluster
This might be an ABI-breaking change, and implementations
apparently already disagree.
My gut feeling is that it needs to be at least a
code point.
I do not know whether allowing a grapheme cluster raises any
implementation or performance concerns. There is definitely
some motivation, especially for non-NFC format strings.
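As a small illustration of why the three candidates differ, here is a
sketch that counts UTF-8 code units and code points. The helper names
count_code_units and count_code_points are hypothetical, and well-formed
UTF-8 input is assumed. Counting grapheme clusters is left out because
it needs the Unicode segmentation rules.

```cpp
#include <cstddef>
#include <string_view>

// Number of UTF-8 code units (bytes) in the string.
constexpr std::size_t count_code_units(std::string_view s) {
    return s.size();
}

// Number of code points, assuming well-formed UTF-8: count every byte
// that is not a continuation byte (bit pattern 10xxxxxx).
constexpr std::size_t count_code_points(std::string_view s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        if ((b & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

The precomposed (NFC) form of e-acute, "\xC3\xA9", is 2 code units and
1 code point. The decomposed form, "e\xCC\x81", is 3 code units and 2
code points, yet both are a single grapheme cluster. A fill limited to
one code point could therefore hold the first form but not the second.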
This sort of issue illustrates my point that using the
term "character" in the standard can be problematic!
Thanks,
Have a great week,
Corentin
--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16