ISOCPP sg16 List: Re: P2572R0 std::format() fill character allowances (Proposed resolution for LWG issues 3576 and 3639)

From: Tom Honermann <tom_at_[hidden]>
Date: Fri, 1 Jul 2022 15:27:56 -0400

Apologies for the delay responding; I've been on vacation!

On 6/25/22 4:55 AM, Corentin wrote:
>
>
> On Sat, Jun 25, 2022 at 12:04 AM Tom Honermann <tom_at_[hidden]> wrote:
>
> On 6/24/22 3:10 AM, Corentin wrote:
>> Hello,
>> A couple of comments on P2572R0
> Thank you, Corentin.
>>
>> > Change in 22.14.2.2 [format.string.std] paragraph 11
>> <http://eel.is/c++draft/format.string.std#11>:
>>
>> The intent is not clear to me here.
>> I don't think there is a precondition that Unicode encoded
>> strings are well-formed in format (and if there was, that change
>> is not necessary as there can be no code points that are not
>> scalar values in a well-formed sequence), and if we want to
>> enforce well-formedness I'd rather that it be stated that way.
>>
>> On the other hand, if we do not intend to ensure well-formedness,
>> we should be mindful that
>> if we allow ill-formed sequences, then the standard practice is
>> to replace isolated surrogates by
>> � whose width is definitively 1 [1], and so the changes in
>> 22.14.2.2 require, in my mind, further discussion and a broader
>> solution. I would keep that a separate issue.
>> Unless you want to clearly state "the width of surrogates is
>> undefined" or something like that.
>>
>> Besides, I also don't think it would be useful to replace code
>> point by scalar value in places where there is already a
>> precondition or no possibility to have isolated surrogates, as
>> scalar value, besides being a mouthful, is only applicable to
>> Unicode (unlike code point), and in places where well-formedness
>> is desirable, "Precondition foo is a well-formed sequence in the
>> bar encoding" would take care of scalar values.
>
> This sounds like something that would be appropriate to discuss in
> LWG. This change is intended to be a drive-by correction and
> specifically not a design change. If LWG wishes to reject that
> change, that won't affect the rest of the paper.
>
> I'm not aware of any wording that describes what happens if an
> ill-formed Unicode string is provided as the format string; that
> seems like UB by omission and implementations appear to agree. I
> agree that any change to perform character substitution should be
> addressed in a separate paper.
>
> I'm not sure that talking about scalar value here makes it clear that
> passing non-well formed text is a precondition violation.
The change is not intended to add such a precondition.
> As no such precondition seems to exist, and as we seem to agree that
> your proposed change does introduce one.

I don't think we agree on that. As I said, I think passing an ill-formed
Unicode string is currently UB by omission. Regardless,
[format.string.std]p11 <http://eel.is/c++draft/format.string.std#11> states:

    For a string in a Unicode encoding, implementations should estimate
    the width of a string as the sum of estimated widths of the first
    code points in its extended grapheme clusters. The extended grapheme
    clusters of a string are defined by UAX #29. ...

It isn't clear to me that UAX #29 <https://unicode.org/reports/tr29> is
intended to define extended grapheme clusters for code point sequences
that contain surrogate code points. UAX #29 does not use UCS scalar
value terminology, but the document reads as though it is intended to
treat all code points as "characters". The words "lone" and "surrogate"
do not appear in it. It may be worth trying to clarify this with the UTC.
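
For concreteness, here is a minimal sketch of what "contains a lone
surrogate" means for a UTF-16 code unit sequence. The function name
is_well_formed_utf16 is hypothetical and C++17 is assumed; this only
illustrates the well-formedness question and is not proposed wording
or anything an implementation of std::format is required to do.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: returns true when every surrogate code unit in
// `s` is part of a high/low pair, i.e. `s` is well-formed UTF-16.
constexpr bool is_well_formed_utf16(std::u16string_view s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            // High surrogate: must be followed by a low surrogate.
            if (i + 1 == s.size()) return false;
            const char16_t next = s[i + 1];
            if (next < 0xDC00 || next > 0xDFFF) return false;
            ++i;  // skip the low surrogate we just validated
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            // Low surrogate with no preceding high surrogate.
            return false;
        }
    }
    return true;
}

// U+1F921 encodes as a surrogate pair, so it is well-formed.
static_assert(is_well_formed_utf16(u"\U0001F921"));
```

A sequence for which this returns false is the case where it is unclear
what UAX #29 segmentation, and therefore width estimation, should do.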

> I would prefer to see explicitly something like "The width of a
> unicode string that is not well-formed is undefined" or something like
> that. it covers both lone surrogate and what happens if you cannot
> form a codepoint for the purposes of width estimation.

I think that would be fine, but I don't want to do that as part of this
paper. I suggest filing a separate LWG issue for that.

At any rate, since this change seems to be contentious, I'll just remove
it in the next revision.

Tom.

>
>
> The wording in this context is Unicode specific; I agree we should
> not use UCS scalar value in a general sense.
>
>>
>>
>> Change in 22.14.2.2 [format.string.std] paragraph 3
>> <http://eel.is/c++draft/format.string.std#3>:
>>
>> The 🤡 (U+1F921 CLOWN FACE) emoji has an estimated width of 2.
>> The examples above that include that character illustrate the
>> effect of the estimated width when that character is used as a
>> fill character as opposed to when it is used as a formatting
>> argument.
>>
>> I had trouble understanding that, I would suggest including the
>> information that 🤡 is of width 2 directly within the example or
>> in a separate note directly below the example.
>
> I'm not sure I'm following; why would relocating the note improve
> comprehension? Perhaps the note would read better as two
> paragraphs as below?
>
> [ /Note 3/: If the /width/ option is absent, then the field width
> is the estimated width of the formatted argument and the alignment
> option has no effect. If the estimated width of the formatted
> argument matches or exceeds the field width, then both the
> alignment and width options have no effect.
>
> The width of any fill character is assumed to be 1. The 🤡
> (U+1F921 CLOWN FACE) emoji has an estimated width of 2. The
> examples above that include that character illustrate the effect
> of the estimated width when that character is used as a fill
> character as opposed to when it is used as a formatting argument.
> — /end note/ ]
>
>
> Because there are two completely different statements here.
>
> The width of any fill character is assumed to be 1.
>
>
> This is a general statement - I could even argue that it could be
> normative rather than a note. It's important
>
> It's followed by a description of the example, in a way that
> makes it look like we are still making general statements.
> If you don't want to attach that part to the note you could maybe rewrite
>
> The width of any fill character is assumed to be 1. The 🤡 (U+1F921
> CLOWN FACE) emoji has an estimated width of 2. The examples above that
> include that character illustrate the effect of the estimated width
> when that character is used as a fill character as opposed to when it
> is used as a formatting argument.
>
> The width of any fill character is assumed to be 1. The examples above
> that include that character illustrate the effect of the estimated
> width when that character is used as a fill character as opposed to
> when it is used as a formatting argument. The 🤡 (U+1F921 CLOWN FACE)
> emoji has an estimated width of 2.
>
> At least that makes it clear that the statement about the emoji is
> related to the example.
> It's a detail, but I had to read that paragraph to understand why we
> were suddenly talking about a clown.
>
> But my preference would be
>
> The /align/ option applies to all argument types. The meaning
> of the various alignment options is as specified in Table 64
> <http://eel.is/c++draft/format.string.std#tab:format.align>
> The width of any fill codepoint is 1.
> [ Note 3: If the width option is absent, then the field width is the
> estimated width of the formatted argument and the alignment option has
> no effect. If the estimated width of the formatted argument matches or
> exceeds the field width, then both the alignment and width options
> have no effect.]
>
> [ Example ]
>
> [Note 4: The examples above that include that character illustrate
> the effect of the estimated width when that character is used as a
> fill character as opposed to when it is used as a formatting argument.
> The 🤡 (U+1F921 CLOWN FACE) emoji has an estimated width of 2. ]
>
>
>
>>
>> " The examples above that include that character illustrate the
>> effect of the estimated width when that character is used as a
>> fill character as opposed to when it is used as a formatting
>> argument." is probably superfluous and could be omitted.
> Hmm, I think that note is helpful for explaining the rather subtle
> reason that format("{:🤡^6}", "x") and format("{:*^6}", "🤡🤡🤡")
> produce the results that they do.
>
>
> Fair enough!
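
The arithmetic behind those two results can be sketched as follows. This
is a simplified model under stated assumptions, not how any standard
library implements std::format: pad_center and kClown are hypothetical
names, the caller supplies the estimated width of the argument, and each
fill character is counted as width 1 regardless of how wide it renders.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// UTF-8 encoding of U+1F921 CLOWN FACE (estimated width 2 as an argument).
const std::string kClown = "\xF0\x9F\xA4\xA1";

// Hypothetical model of the '^' (center) alignment option.
// `arg_width` is the estimated width of `arg`; `fill` is one encoded
// fill character, which is always counted as width 1.
std::string pad_center(const std::string& arg, std::size_t arg_width,
                       const std::string& fill, std::size_t field_width) {
    assert(!fill.empty());
    // If the argument already fills the field, width and alignment
    // have no effect.
    if (arg_width >= field_width) return arg;
    const std::size_t n = field_width - arg_width;  // fill characters needed
    std::string out;
    for (std::size_t i = 0; i < n / 2; ++i) out += fill;      // floor(n/2) before
    out += arg;
    for (std::size_t i = 0; i < n - n / 2; ++i) out += fill;  // ceil(n/2) after
    return out;
}
```

With this model, pad_center("x", 1, kClown, 6) inserts five clowns (two
before, three after) because each fill counts as 1, while
pad_center(kClown + kClown + kClown, 6, "*", 6) inserts nothing because
the argument's estimated width is already 3 * 2 = 6.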
>
>>
>> Do we need examples with extended grapheme clusters?
>> i.e., 🐻‍❄️ has 3 codepoints, width 1, but its use as fill-option is
>> invalid
>
> 🐻‍❄️ is actually 4 code points (U+1F43B, U+200D, U+2744, U+FE0F)
> and has a width of 2 since its first code point is U+1F43B per
> [format.string.std]p11 <http://eel.is/c++draft/format.string.std#11>.
>
> yes, sorry 4 codepoints, width 2.
>
> There are such examples in the prose. I'm ambivalent with regard
> to whether we add such an example in the wording. If we do, I'm
> assuming something like the following would be what you want? (I
> substituted an EGC consisting of 2 code points with an estimated
> width of 1).
>
> string s9 = format("{:é^6}", "x"); // ill-formed; é is U+0065, U+0301.
> string s10 = format("{:*^6}", "ééé"); // value of s10 is "*ééé**"
>
>
> Yes, exactly mirroring the clown.
> It would illustrate both EGC for the string arguments being counted as
> one and show that they are ill-formed as a fill argument.
> I would however strongly prefer an emoji sequence, not to bring
> normalization confusion in the mix.
> Thanks,
> Corentin
>
>
>>
>> The rest looks great to me
>
> Thank you.
>
> Tom.
>
>>
>> Thanks,
>> Corentin
>>
>>
>> [1] Unicode Standard 14.0 - 2.7 Unicode Strings
>>
>> Whenever such strings are specified to be in a particular
>> Unicode encoding form—even
>> one with the same code unit size—the string must not violate
>> the requirements of that
>> encoding form. For example, isolated surrogates in a Unicode
>> 16-bit string are not allowed
>> when that string is specified to be well-formed UTF-16. A
>> number of techniques are available for dealing with an
>> isolated surrogate, such as omitting it, converting it into
>> U+FFFD
>> replacement character to produce well-formed UTF-16, or
>> simply halting the processing of the string with an error.
>> (See Section 3.9, Unicode Encoding Forms.)
>>

Received on 2022-07-01 19:28:00