Date: Tue, 10 Aug 2021 00:04:29 -0400
I've added this to my list of future agenda items.
I think wording may be missing regarding when the fill "character" is
actually used. The only relevant wording I can find is
[format.string.std] table 59 <http://eel.is/c++draft/tab:format.align>,
and there only for the case where "^" is used to center text in a
field. Actually, that wording doesn't technically state what
"characters" are to be used to perform the fill; it only states how many
are to be inserted. [format.string.std]p3
<http://eel.is/c++draft/format#string.std-3> contains an example that
demonstrates the intent, but I don't see any other normative wording.
I think we can avoid specifying that a fill character must be exactly
one grapheme cluster or that it must have an estimated width of 1 by
borrowing some of the wording from [format.string.std]p14
<http://eel.is/c++draft/format#string.std-14>. We can state something
like, "For a string in a Unicode encoding, the sequence of extended
grapheme clusters in the fill character is repeatedly inserted one
whole extended grapheme cluster at a time until the field width would be
exceeded. [ /Note/: if the fill character contains an extended grapheme
cluster with an estimated width greater than 1, then the field may not
be padded to the full width - /end note/ ]." This would allow format
specifiers like "{:~-^10}" that, given an argument of "text", would format
the field as "~-~text~-~". Basically, I guess I'm arguing against trying
to restrict what is a valid fill character; we can avoid the need for
diagnostics by doing so.
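To make the suggested behavior concrete, here is a rough sketch of the
padding rule in C++. It is not what std::format does today, and the names
(FillCluster, make_padding, pad_center) are made up for illustration. It
assumes the fill has already been split into extended grapheme clusters
with estimated widths, and that each padding region restarts from the
first cluster of the fill, which is consistent with the "~-~text~-~"
example above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One extended grapheme cluster of the fill (hypothetical type).
// Segmentation and width estimation are assumed to happen elsewhere.
struct FillCluster {
    std::string units;   // the cluster's encoded code units
    std::size_t width;   // its estimated display width
};

// Insert whole clusters, cycling through the fill sequence, and stop as
// soon as the next cluster would exceed the requested number of columns.
std::string make_padding(const std::vector<FillCluster>& fill,
                         std::size_t columns) {
    assert(!fill.empty());
    std::string out;
    std::size_t used = 0;
    for (std::size_t i = 0; ; i = (i + 1) % fill.size()) {
        const FillCluster& c = fill[i];
        // A zero-width cluster would never make progress, so stop there too.
        if (c.width == 0 || used + c.width > columns) break;
        out += c.units;
        used += c.width;
    }
    return out;
}

// Center `text` (whose estimated width is `text_width`) in `field_width`
// columns: floor(n/2) columns of fill before, ceil(n/2) after.
std::string pad_center(const std::string& text, std::size_t text_width,
                       const std::vector<FillCluster>& fill,
                       std::size_t field_width) {
    if (text_width >= field_width) return text;
    const std::size_t total = field_width - text_width;
    const std::size_t left = total / 2;
    const std::size_t right = total - left;
    return make_padding(fill, left) + text + make_padding(fill, right);
}
```

With a fill of {"~", 1}, {"-", 1}, pad_center("text", 4, fill, 10) yields
"~-~text~-~". With a single width-2 cluster and a field width of 9, the
right-hand side has 3 columns to fill but only one cluster fits, so the
field ends up one column short, which is the case the note describes.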
Another tangent: [format.string.std]p11
<http://eel.is/c++draft/format#string.std-11> may have a wording issue.
It states, "... implementations should estimate the width of a string as
the sum of estimated widths of the *first* *code points* in its extended
grapheme clusters...". As is, that seems ambiguous as to whether the
sum is based on the first code point of each EGC or whether each EGC
contributes its first code points (some undefined sequence of 0 or more
initial code points) towards the sum. I think the intent would be
better expressed with "... implementations should estimate the width of
a string as the sum of estimated widths of the first code point in each
extended grapheme cluster..."
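For reference, the reading I believe is intended (first code point of each
EGC) could be sketched as below. This assumes the string has already been
segmented into extended grapheme clusters; Cluster, code_point_width, and
estimate_width are hypothetical names, and the table of width-2 ranges is
deliberately abbreviated rather than the full list from the draft.

```cpp
#include <cstddef>
#include <vector>

// One extended grapheme cluster as a sequence of code points
// (hypothetical representation; segmentation is assumed done elsewhere).
using Cluster = std::vector<char32_t>;

// Estimated width of a single code point: 2 if it falls in a "wide"
// range, otherwise 1. Abbreviated, illustrative subset of ranges only.
std::size_t code_point_width(char32_t cp) {
    struct Range { char32_t first, last; };
    static constexpr Range wide[] = {
        {0x1100, 0x115F},    // Hangul Jamo
        {0xAC00, 0xD7A3},    // Hangul syllables
        {0x1F300, 0x1F64F},  // pictographs and emoticons
        {0x20000, 0x2FFFD},  // supplementary ideographs
    };
    for (const Range& r : wide) {
        if (cp >= r.first && cp <= r.last) return 2;
    }
    return 1;
}

// Sum the estimated width of the FIRST code point of EACH cluster;
// any further code points in a cluster contribute nothing.
std::size_t estimate_width(const std::vector<Cluster>& clusters) {
    std::size_t total = 0;
    for (const Cluster& c : clusters) {
        if (!c.empty()) total += code_point_width(c.front());
    }
    return total;
}
```

Under this reading, a cluster made of 'e' followed by U+0301 counts as
width 1, because only the 'e' is consulted.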
Tom.
On 8/9/21 4:04 PM, Steve Downey via SG16 wrote:
> A specific case of the general confusion between "character" and
> `char`. It's broken for any multi-byte encoding, not just Unicode.
>
> However, I suspect that grapheme cluster might be a rabbit hole.
> Checking whether a sequence is one is difficult, and IIRC might change?
>
> On Mon, Aug 9, 2021 at 11:30 AM Corentin via SG16
> <sg16_at_[hidden] <mailto:sg16_at_[hidden]>> wrote:
>
> Hello,
>
> I wanted to bring this new LWG issue to your attention.
> https://cplusplus.github.io/LWG/issue3576
> <https://cplusplus.github.io/LWG/issue3576>
>
> The author asks whether the fill character of std::format is
>
> * a code unit
> * a code point
> * a grapheme cluster
>
> This might be an ABI-breaking change, and implementations apparently
> already disagree.
>
> My gut feeling is that it needs to at least be a code point.
> I do not know if there are any concerns with allowing a grapheme
> in terms of implementation or performance. There is definitely
> some motivation, especially for non-NFC format strings.
>
> This sort of issue illustrates my point that using the term
> character in the standard can be problematic!
>
> Thanks,
> Have a great week,
>
> Corentin
>
Received on 2021-08-09 23:04:32