ISOCPP sg16 List: Re: P2572R0 std::format() fill character allowances (Proposed resolution for LWG issues 3576 and 3639)

From: Tom Honermann <tom_at_[hidden]>
Date: Fri, 24 Jun 2022 18:04:40 -0400

On 6/24/22 3:10 AM, Corentin wrote:
> Hello,
> A couple of comments on P2572R0
Thank you, Corentin.
>
> > Change in 22.14.2.2 [format.string.std] paragraph 11
> <http://eel.is/c++draft/format.string.std#11>:
>
> The intent is not clear to me here.
> I don't think there is a precondition that Unicode-encoded strings
> are well-formed in format (and if there were, that change would not be
> necessary, as there can be no code points that are not scalar values in
> a well-formed sequence), and if we want to enforce well-formedness I'd
> rather that it be stated that way.
>
> On the other hand, if we do not intend to ensure well-formedness, we
> should be mindful that if we allow ill-formed sequences, then the
> standard practice is to replace isolated surrogates with
> � (U+FFFD), whose width is definitely 1 [1], and so the changes in
> 22.14.2.2 require, in my mind, further discussion and a broader
> solution. I would keep that a separate issue.
> Unless you want to clearly state "the width of surrogates is
> undefined" or something like that.
>
> Besides, I also don't think it would be useful to replace "code point"
> with "scalar value" in places where there is already a precondition or
> no possibility of isolated surrogates, as "scalar value", besides being
> a mouthful, is only applicable to Unicode (unlike "code point"), and in
> places where well-formedness is desirable, "Precondition: foo is a
> well-formed sequence in the bar encoding" would take care of scalar
> values.

This sounds like something that would be appropriate to discuss in LWG.
This change is intended to be a drive-by correction and specifically not
a design change. If LWG wishes to reject that change, that won't affect
the rest of the paper.

I'm not aware of any wording that describes what happens if an
ill-formed Unicode string is provided as the format string; that seems
like UB by omission and implementations appear to agree. I agree that
any change to perform character substitution should be addressed in a
separate paper.

The wording in this context is Unicode specific; I agree we should not
use UCS scalar value in a general sense.
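
For readers less familiar with the distinction, here is a minimal sketch
(an illustration only, not wording from the paper; the function name is
hypothetical). A Unicode scalar value is any code point other than the
surrogate code points U+D800 through U+DFFF:

```cpp
// Hypothetical helper, not part of P2572R0 or the standard library.
// A Unicode scalar value is any code point in [0, 0x10FFFF] that is
// not a surrogate code point (U+D800..U+DFFF).
constexpr bool is_scalar_value(char32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}
```

In a well-formed UTF-8, UTF-16, or UTF-32 sequence every decoded code
point satisfies this predicate, which is why the distinction only
matters for ill-formed input.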

>
>
> Change in 22.14.2.2 [format.string.std] paragraph 3
> <http://eel.is/c++draft/format.string.std#3>:
>
> The 🤡 (U+1F921 CLOWN FACE) emoji has an estimated width of 2. The
> examples above that include that character illustrate the effect of
> the estimated width when that character is used as a fill character as
> opposed to when it is used as a formatting argument.
>
> I had trouble understanding that, I would suggest including the
> information that 🤡 is of width 2 directly within the example or in a
> separate note directly below the example.

I'm not sure I'm following; why would relocating the note improve
comprehension? Perhaps the note would read better as two paragraphs as
below?

[ /Note 3/: If the /width/ option is absent, then the field width is the
estimated width of the formatted argument and the alignment option has
no effect. If the estimated width of the formatted argument matches or
exceeds the field width, then both the alignment and width options have
no effect.

The width of any fill character is assumed to be 1. The 🤡 (U+1F921
CLOWN FACE) emoji has an estimated width of 2. The examples above that
include that character illustrate the effect of the estimated width when
that character is used as a fill character as opposed to when it is used
as a formatting argument. — /end note/ ]

>
> " The examples above that include that character illustrate the effect
> of the estimated width when that character is used as a fill character
> as opposed to when it is used as a formatting argument." is probably
> superfluous and could be omitted.
Hmm, I think that note is helpful for explaining the rather subtle
reason that format("{:🤡^6}", "x") and format("{:*^6}", "🤡🤡🤡")
produce the results that they do.
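
To make the arithmetic concrete, here is a rough sketch of the centering
rule as I read it. It is not an implementation of std::format;
center_fill and its parameters are hypothetical, and the caller supplies
the estimated width of the argument rather than the code computing it.
Each fill character is counted as width 1 regardless of what it is.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: center `arg` in a field of `field_width` columns.
// `arg_width` is the estimated width of `arg`, supplied by the caller.
// `fill` is the encoded fill character; it is always counted as width 1.
std::string center_fill(std::string_view arg, std::size_t arg_width,
                        std::string_view fill, std::size_t field_width) {
    // If the argument already fills the field, width and alignment
    // have no effect.
    if (arg_width >= field_width) return std::string(arg);

    std::size_t padding = field_width - arg_width;
    std::size_t before = padding / 2;       // floor(n/2) fill characters
    std::size_t after = padding - before;   // ceil(n/2) fill characters

    std::string out;
    for (std::size_t i = 0; i < before; ++i) out += fill;
    out += arg;
    for (std::size_t i = 0; i < after; ++i) out += fill;
    return out;
}
```

With "x" (estimated width 1) and 🤡 as the fill, the field of 6 needs
five fill characters, split 2 before and 3 after, even though the result
occupies more than 6 columns on screen. With "🤡🤡🤡" (estimated width
6) as the argument, no fill is inserted at all.
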
>
> Do we need examples with extended grapheme clusters?
> i.e., 🐻‍❄️ has 3 code points, width 1, but its use as fill-option is invalid

🐻‍❄️ is actually 4 code points (U+1F43B, U+200D, U+2744, U+FE0F) and
has a width of 2 since its first code point is U+1F43B per
[format.string.std]p11 <http://eel.is/c++draft/format.string.std#11>.
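
The code point count is easy to check with a small decoder. The sketch
below is only an illustration: decode_utf8 is a hypothetical helper, it
assumes well-formed UTF-8 input and does no validation, and any literal
passed to it assumes the ordinary literal encoding is UTF-8.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper: decode UTF-8 into code points.
// Assumes the input is well-formed; no validation is performed.
std::vector<char32_t> decode_utf8(std::string_view s) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        // Sequence length is determined by the lead byte.
        std::size_t len = lead < 0x80        ? 1
                        : (lead >> 5) == 0x6 ? 2
                        : (lead >> 4) == 0xE ? 3
                                             : 4;
        // Payload bits of the lead byte.
        char32_t cp = len == 1 ? lead : lead & (0xFF >> (len + 1));
        // Each continuation byte contributes 6 more bits.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Decoding "\U0001F43B\u200D\u2744\uFE0F" yields four code points, the
first of which is U+1F43B.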

There are such examples in the prose. I'm ambivalent with regard to
whether we add such an example in the wording. If we do, I'm assuming
something like the following would be what you want? (I substituted an
EGC consisting of 2 code points with an estimated width of 1).

string s9 = format("{:é^6}", "x"); // ill-formed; é is U+0065, U+0301.
string s10 = format("{:*^6}", "ééé"); // value of s10 is "*ééé**"

>
> The rest looks great to me

Thank you.

Tom.

>
> Thanks,
> Corentin
>
>
> [1] Unicode Standard 14.0 - 2.7 Unicode Strings
>
> Whenever such strings are specified to be in a particular Unicode
> encoding form—even
> one with the same code unit size—the string must not violate the
> requirements of that
> encoding form. For example, isolated surrogates in a Unicode
> 16-bit string are not allowed
> when that string is specified to be well-formed UTF-16. A number
> of techniques are available for dealing with an isolated
> surrogate, such as omitting it, converting it into U+FFFD
> replacement character to produce well-formed UTF-16, or simply
> halting the processing of the string with an error. (See Section
> 3.9, Unicode Encoding Forms.)
>

Received on 2022-06-24 22:04:43