I've added this to my list of future agenda items.

I think wording may be missing regarding when the fill "character" is actually used. The only relevant wording I can find is [format.string.std] table 59, and there only for the case where "^" is used to center text in a field. Actually, that wording doesn't technically state what "characters" are to be used to perform the fill; it only states how many are to be inserted. [format.string.std]p3 contains an example that demonstrates the intent, but I don't see any other normative wording.

I think we can avoid specifying that a fill character must be exactly one grapheme cluster or that it must have an estimated width of 1 by borrowing some of the wording from [format.string.std]p14. We can state something like, "For a string in a Unicode encoding, the sequence of extended grapheme clusters in the fill character is repeatedly inserted, one whole extended grapheme cluster at a time, until the field width would be exceeded. [ Note: if the fill character contains an extended grapheme cluster with an estimated width greater than 1, then the field may not be padded to the full width - end note ]." This would allow format specifiers like "{:~-^10}" that, given an argument of "text", would format the field as "~-~text~-~". Basically, I guess I'm arguing against trying to restrict what is a valid fill character; by not restricting it, we avoid the need for diagnostics.
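To make the suggested behavior concrete, here is a rough sketch of the padding rule, not an implementation of std::format. It assumes the fill has already been split into extended grapheme clusters with known estimated widths; the names FillCluster, make_padding, and center are hypothetical, and restarting the fill sequence on each side of the field is also an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One pre-segmented extended grapheme cluster of the fill.
// Segmentation and width estimation are assumed to happen elsewhere.
struct FillCluster {
    std::string bytes;  // encoded text of the cluster
    std::size_t width;  // estimated display width
};

// Appends whole clusters from `fill`, cycling through them in order,
// until adding the next one would exceed `columns`.
std::string make_padding(const std::vector<FillCluster>& fill,
                         std::size_t columns) {
    assert(!fill.empty());
    std::string out;
    std::size_t used = 0;
    for (std::size_t i = 0;; i = (i + 1) % fill.size()) {
        const FillCluster& c = fill[i];
        // Stop when the next cluster does not fit. A zero-width cluster
        // also stops the loop (a simplification to guarantee termination).
        if (c.width == 0 || used + c.width > columns) break;
        out += c.bytes;
        used += c.width;
    }
    return out;
}

// Centers `text` (estimated width `text_width`) in `field_width` columns,
// putting floor(n/2) columns before and ceil(n/2) after, as "^" does.
std::string center(const std::string& text, std::size_t text_width,
                   const std::vector<FillCluster>& fill,
                   std::size_t field_width) {
    if (text_width >= field_width) return text;
    const std::size_t total = field_width - text_width;
    const std::size_t left = total / 2;
    const std::size_t right = total - left;
    return make_padding(fill, left) + text + make_padding(fill, right);
}
```

With the fill clusters "~" and "-" (width 1 each), centering "text" in 10 columns yields "~-~text~-~". With a single width-2 fill cluster and an odd number of padding columns, the result falls short of the field width, which is the case the suggested note describes.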

Another tangent: [format.string.std]p11 may have a wording issue. It states, "... implementations should estimate the width of a string as the sum of estimated widths of the first code points in its extended grapheme clusters...". As is, that seems ambiguous as to whether the sum is based on the first code point of each EGC or whether each EGC contributes its first code points (some undefined sequence of 0 or more initial code points) towards the sum. I think the intent would be better expressed with "... implementations should estimate the width of a string as the sum of estimated widths of the first code points in ~~its~~ each of its extended grapheme clusters..."
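For what it's worth, the reading I believe is intended (one code point, the first, per EGC) could be sketched as below. This assumes the string is already segmented into EGCs; code_point_width and estimated_width are hypothetical names, and the wide ranges are an abbreviated, illustrative subset rather than the full list in the standard.

```cpp
#include <vector>

// Estimated width of one code point: 2 for a few wide ranges chosen for
// illustration (an abbreviated, assumed subset), otherwise 1.
constexpr int code_point_width(char32_t cp) {
    return ((cp >= 0x1100 && cp <= 0x115F) ||
            (cp >= 0x3040 && cp <= 0xA4CF) ||
            (cp >= 0xAC00 && cp <= 0xD7A3) ||
            (cp >= 0x1F300 && cp <= 0x1F64F))
               ? 2
               : 1;
}

// Each inner vector holds the code points of one pre-segmented EGC.
// Only the FIRST code point of each cluster contributes to the sum.
int estimated_width(const std::vector<std::vector<char32_t>>& clusters) {
    int total = 0;
    for (const auto& cluster : clusters) {
        if (!cluster.empty()) total += code_point_width(cluster.front());
    }
    return total;
}
```

Under this reading, a cluster such as U+0065 U+0301 contributes only the width of U+0065; the combining mark adds nothing.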

Tom.

On 8/9/21 4:04 PM, Steve Downey via SG16 wrote:

A specific case of the general confusion between "character" and `char`. It's broken for any multi-byte encoding, not just Unicode.

However, I suspect that grapheme clusters might be a rabbit hole. Checking whether a sequence is one is difficult, and IIRC the rules might change?

On Mon, Aug 9, 2021 at 11:30 AM Corentin via SG16 <sg16@lists.isocpp.org> wrote:

Hello,

I wanted to bring this new LWG issue to your attention.

https://cplusplus.github.io/LWG/issue3576

The author asks whether the fill character of std::format is

a code unit

a code point

a grapheme cluster
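As a small illustration of how the first two interpretations diverge, here is a sketch that counts UTF-8 code units and code points in a string. It assumes well-formed UTF-8 input; the function names are hypothetical. Counting grapheme clusters needs the Unicode segmentation rules and is not attempted here.

```cpp
#include <cstddef>
#include <string_view>

// In UTF-8, each byte is one code unit.
constexpr std::size_t count_code_units(std::string_view s) {
    return s.size();
}

// Counts code points by skipping continuation bytes (10xxxxxx).
// Assumes the input is well-formed UTF-8.
constexpr std::size_t count_code_points(std::string_view s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For "e" followed by U+0301 COMBINING ACUTE ACCENT (bytes 65 CC 81), this gives 3 code units and 2 code points, while a reader sees a single grapheme cluster, so each of the three interpretations gives a different answer for the same fill.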

This might be an ABI-breaking change, and implementations apparently already disagree.

My gut feeling is that it needs to be at least a code point.

I do not know if there are any concerns with allowing a grapheme in terms of implementation or performance. There is definitely some motivation, especially for non-NFC format strings.

This sort of issue illustrates my point that using the term character in the standard can be problematic!

Thanks,

Have a great week,

Corentin

--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16