The wording in this context is Unicode specific; I agree we should not use UCS scalar value in a general sense.

Change in 22.14.2.2 [format.string.std] paragraph 3:

A couple of comments on P2572R0

The 🤡 (U+1F921 CLOWN FACE) emoji has an estimated width of 2. The examples above that include that character illustrate the effect of the estimated width when that character is used as a fill character as opposed to when it is used as a formatting argument.

I had trouble understanding that; I would suggest including the information that 🤡 is of width 2 directly within the example or in a separate note directly below the example.

I'm not sure I'm following; why would relocating the note improve comprehension? Perhaps the note would read better as two paragraphs as below?

[ Note 3: If the width option is absent, then the field width is the estimated width of the formatted argument and the alignment option has no effect. If the estimated width of the formatted argument matches or exceeds the field width, then both the alignment and width options have no effect.

The width of any fill character is assumed to be 1. The 🤡 (U+1F921 CLOWN FACE) emoji has an estimated width of 2. The examples above that include that character illustrate the effect of the estimated width when that character is used as a fill character as opposed to when it is used as a formatting argument. — end note ]

Because there are two completely different statements here.

The width of any fill character is assumed to be 1.

This is a general statement; I could even argue that it could be normative rather than a note. It's important.

It's followed by a description of the example, in a way that makes it look like we are still making general statements.

If you don't want to attach that part to the note, you could maybe rewrite

The width of any fill character is assumed to be 1. The 🤡 (U+1F921 CLOWN FACE) emoji has an estimated width of 2. The examples above that include that character illustrate the effect of the estimated width when that character is used as a fill character as opposed to when it is used as a formatting argument.

as

The width of any fill character is assumed to be 1. The examples above that include that character illustrate the effect of the estimated width when that character is used as a fill character as opposed to when it is used as a formatting argument. The 🤡 (U+1F921 CLOWN FACE) emoji has an estimated width of 2.

At least that makes it clear that the statement about the emoji is related to the example.

It's a detail, but I had to read that paragraph to understand why we were suddenly talking about a clown.

But my preference would be:

The align ~~specifier~~option applies to all argument types. The meaning of the various alignment options is as specified in Table 64

The width of any fill codepoint is 1.

[ Note 3: If the width option is absent, then the field width is the estimated width of the formatted argument and the alignment option has no effect. If the estimated width of the formatted argument matches or exceeds the field width, then both the alignment and width options have no effect.]

[ Example ]

[Note 4: The examples above that include that character illustrate the effect of the estimated width when that character is used as a fill character as opposed to when it is used as a formatting argument. The 🤡 (U+1F921 CLOWN FACE) emoji has an estimated width of 2. ]

" The examples above that include that character illustrate the effect of the estimated width when that character is used as a fill character as opposed to when it is used as a formatting argument." is probably superfluous and could be omitted.

Hmm, I think that note is helpful for explaining the rather subtle reason that format("{:🤡^6}", "x") and format("{:*^6}", "🤡🤡🤡") produce the results that they do.
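
To make the arithmetic behind those two results concrete, here is a minimal sketch of centered padding under the rule that each fill character counts as 1 column. It does not call std::format. `center_with_fill` is a hypothetical helper, and the estimated width of the argument is passed in by the caller rather than computed. The split of floor(n/2) fill characters before and ceil(n/2) after is the center-alignment rule from the alignment table.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of any library): centers `arg` in a field of
// `field_width` columns. `arg_width` is the caller-supplied estimated width of
// `arg`. Every copy of `fill` is counted as 1 column, whatever it renders as.
std::string center_with_fill(const std::string& fill, const std::string& arg,
                             std::size_t arg_width, std::size_t field_width) {
    if (arg_width >= field_width) {
        return arg;  // width and alignment options have no effect
    }
    const std::size_t padding = field_width - arg_width;
    const std::size_t before = padding / 2;      // floor(n / 2)
    const std::size_t after = padding - before;  // ceil(n / 2)

    std::string out;
    for (std::size_t i = 0; i < before; ++i) out += fill;
    out += arg;
    for (std::size_t i = 0; i < after; ++i) out += fill;
    return out;
}
```

With U+1F921 as the fill, "x" (width 1), and a field width of 6, this yields two clowns before and three after, even though the result renders wider than 6 columns. With "*" as the fill and three clowns as the argument (estimated width 3 × 2 = 6), no padding is added at all.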

Fair enough!

Do we need examples with extended grapheme clusters?

i.e., 🐻‍❄️ has 3 code points, width 1, but its use as a fill option is invalid

🐻‍❄️ is actually 4 code points (U+1F43B, U+200D, U+2744, U+FE0F) and has a width of 2 since its first code point is U+1F43B per [format.string.std]p11.

Yes, sorry: 4 code points, width 2.
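
For anyone who wants to check the count, here is a rough sketch that decodes the UTF-8 bytes of that sequence into code points. `decode_utf8` and `first_code_point_is_wide` are hypothetical names. The decoder assumes well-formed input and does no validation. The width check lists only two of the wide ranges, so it is an assumption-laden subset and not the full table from [format.string.std].

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Minimal UTF-8 decoder. Assumes well-formed input; performs no validation.
std::vector<char32_t> decode_utf8(const std::string& s) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 1;
        char32_t cp = lead;
        if (lead >= 0xF0)      { len = 4; cp = lead & 0x07; }
        else if (lead >= 0xE0) { len = 3; cp = lead & 0x0F; }
        else if (lead >= 0xC0) { len = 2; cp = lead & 0x1F; }
        // Fold in the 6 payload bits of each continuation byte.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Partial check only: two of the ranges whose code points have width 2.
bool first_code_point_is_wide(char32_t cp) {
    return (cp >= 0x1F300 && cp <= 0x1F64F) || (cp >= 0x1F900 && cp <= 0x1F9FF);
}
```

Decoding the bytes F0 9F 90 BB E2 80 8D E2 9D 84 EF B8 8F gives U+1F43B, U+200D, U+2744, U+FE0F, and the first of those falls in the U+1F300 to U+1F64F range.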

There are such examples in the prose. I'm ambivalent with regard to whether we add such an example in the wording. If we do, I'm assuming something like the following would be what you want? (I substituted an EGC consisting of 2 code points with an estimated width of 1).

string s9 = format("{:é^6}", "x"); // ill-formed; é is U+0065, U+0301.
string s10 = format("{:*^6}", "ééé"); // value of s10 is "*ééé**"

Yes, exactly mirroring the clown.

It would illustrate both that each EGC in a string argument is counted as one unit and that an EGC is ill-formed as a fill character.
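
A small sketch of the second point, assuming the only requirement being checked is "exactly one code point" (it ignores the other restrictions on the fill, such as excluding braces). `count_code_points` and `is_single_code_point` are hypothetical names, and the input is assumed to be well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Counts code points by counting every byte that is not a UTF-8 continuation
// byte (10xxxxxx). Assumes well-formed UTF-8; performs no validation.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Hypothetical check: a candidate fill must be exactly one code point.
bool is_single_code_point(const std::string& utf8) {
    return count_code_points(utf8) == 1;
}
```

Under this check "*" and U+1F921 pass, while "e" followed by U+0301 (bytes 65 CC 81) counts as two code points and is rejected, even though it is a single grapheme cluster.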

I would, however, strongly prefer an emoji sequence, so as not to bring normalization confusion into the mix.

Thanks,

Corentin

The rest looks great to me

Thank you.

Tom.

[1] Unicode Standard 14.0 - 2.7 Unicode Strings

Whenever such strings are specified to be in a particular Unicode encoding form—even
one with the same code unit size—the string must not violate the requirements of that
encoding form. For example, isolated surrogates in a Unicode 16-bit string are not allowed
when that string is specified to be well-formed UTF-16. A number of techniques are available for dealing with an isolated surrogate, such as omitting it, converting it into U+FFFD
replacement character to produce well-formed UTF-16, or simply halting the processing of the string with an error. (See Section 3.9, Unicode Encoding Forms.)
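
As an illustration of the second technique mentioned in that passage (converting an isolated surrogate into U+FFFD), here is a minimal sketch over a sequence of 16-bit code units. `replace_isolated_surrogates` is a hypothetical name, and this is only one of the options the quoted text lists, not a recommendation for what the wording should require.

```cpp
#include <cstddef>
#include <string>

// Returns a copy of `in` where every isolated surrogate code unit is replaced
// by U+FFFD. Properly paired surrogates are copied through unchanged.
std::u16string replace_isolated_surrogates(const std::u16string& in) {
    std::u16string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char16_t u = in[i];
        const bool is_high = u >= 0xD800 && u <= 0xDBFF;
        const bool is_low = u >= 0xDC00 && u <= 0xDFFF;
        const bool next_is_low = i + 1 < in.size() &&
                                 in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF;
        if (is_high && next_is_low) {
            out.push_back(u);        // well-formed pair: keep both units
            out.push_back(in[++i]);
        } else if (is_high || is_low) {
            out.push_back(u'\uFFFD');  // isolated surrogate
        } else {
            out.push_back(u);
        }
    }
    return out;
}
```

The output is the same length as the input in code units, since each isolated surrogate is replaced one-for-one.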