Date: Fri, 24 Jun 2022 09:10:55 +0200
Hello,
A couple of comments on P2572R0
> Change in 22.14.2.2 [format.string.std] paragraph 11
<http://eel.is/c++draft/format.string.std#11>:
The intent is not clear to me here.
I don't think there is a precondition that Unicode-encoded strings are
well-formed in format (and if there were, that change would not be necessary,
as there can be no code points that are not scalar values in a well-formed
sequence), and if we want to enforce well-formedness, I'd rather that it be
stated that way.
On the other hand, if we do not intend to ensure well-formedness, we should
be mindful that, if we allow ill-formed sequences, the standard practice is
to replace isolated surrogates with � (U+FFFD REPLACEMENT CHARACTER), whose
width is definitely 1 [1], and so the changes in 22.14.2.2 require, in my
mind, further discussion and a broader solution. I would keep that as a
separate issue.
Unless you want to clearly state "the width of surrogates is undefined" or
something like that.
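To make the replacement practice mentioned above concrete, here is a minimal sketch for UTF-16 input. The function name replace_isolated_surrogates is hypothetical; it is not taken from P2572R0 or from the standard library, and it only illustrates one of the techniques listed in [1].

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, for illustration only: returns a copy of `in` in which
// every isolated surrogate code unit is replaced by U+FFFD REPLACEMENT
// CHARACTER, so that the result is well-formed UTF-16.
std::u16string replace_isolated_surrogates(const std::u16string& in) {
    auto is_high = [](char16_t c) { return c >= 0xD800 && c <= 0xDBFF; };
    auto is_low  = [](char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; };

    std::u16string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char16_t c = in[i];
        if (is_high(c) && i + 1 < in.size() && is_low(in[i + 1])) {
            // Well-formed surrogate pair: copy both code units unchanged.
            out.push_back(c);
            out.push_back(in[++i]);
        } else if (is_high(c) || is_low(c)) {
            // Isolated surrogate: substitute U+FFFD.
            out.push_back(char16_t(0xFFFD));
        } else {
            // Ordinary BMP code unit.
            out.push_back(c);
        }
    }
    return out;
}
```

With such a step applied before width estimation, a surrogate never reaches the width computation, which is why the width of U+FFFD matters more than the width of the surrogate itself.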
Besides, I also don't think it would be useful to replace "code point" with
"scalar value" in places where there is already a precondition or no
possibility of isolated surrogates, as "scalar value", besides being a
mouthful, is only applicable to Unicode (unlike "code point"), and in places
where well-formedness is desirable, "Precondition: foo is a well-formed
sequence in the bar encoding" would take care of scalar values.
> Change in 22.14.2.2 [format.string.std] paragraph 3
<http://eel.is/c++draft/format.string.std#3>:
> The 🤡 (U+1F921 CLOWN FACE) emoji has an estimated width of 2. The examples
> above that include that character illustrate the effect of the estimated
> width when that character is used as a fill character as opposed to when it
> is used as a formatting argument.
I had trouble understanding that. I would suggest including the information
that 🤡 has width 2 directly within the example or in a separate note
directly below the example.
"The examples above that include that character illustrate the effect of
the estimated width when that character is used as a fill character as
opposed to when it is used as a formatting argument." is probably
superfluous and could be omitted.
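For readers who want to see where the width of 2 comes from, here is a minimal sketch of a width estimate. The name estimated_width is hypothetical, and the table is only an illustrative subset of the width-2 ranges as I recall them from the [format.string.std] wording; the draft linked above is the authority for the full list.

```cpp
#include <cassert>

struct code_point_range {
    char32_t first;
    char32_t last;
};

// Illustrative subset only (an assumption, not the complete normative table).
constexpr code_point_range wide_ranges[] = {
    {0x1100, 0x115F},    // Hangul Jamo initial consonants
    {0x1F300, 0x1F64F},  // pictographs and emoticons
    {0x1F900, 0x1F9FF},  // supplemental symbols, includes U+1F921
};

// Hypothetical helper: returns 2 for code points in the table above, else 1.
constexpr int estimated_width(char32_t cp) {
    for (const code_point_range& r : wide_ranges) {
        if (cp >= r.first && cp <= r.last) {
            return 2;
        }
    }
    return 1;
}
```

Under this sketch, U+1F921 CLOWN FACE falls in a width-2 range, while U+FFFD falls in none of them and gets width 1.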
Do we need examples with extended grapheme clusters?
i.e., 🐻❄️ has 3 code points, width 1, but its use as a fill option is invalid.
The rest looks great to me.
Thanks,
Corentin
[1] Unicode Standard 14.0, Section 2.7, Unicode Strings:
> Whenever such strings are specified to be in a particular Unicode encoding
> form—even one with the same code unit size—the string must not violate the
> requirements of that encoding form. For example, isolated surrogates in a
> Unicode 16-bit string are not allowed when that string is specified to be
> well-formed UTF-16. A number of techniques are available for dealing with
> an isolated surrogate, such as omitting it, converting it into U+FFFD
> replacement character to produce well-formed UTF-16, or simply halting the
> processing of the string with an error. (See Section 3.9, Unicode Encoding
> Forms.)
Received on 2022-06-24 07:11:07