Re: Suggested wording change for non-Unicode cases in P2286R7: Formatting Ranges

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Wed, 27 Apr 2022 00:48:59 +0200
Thanks Tom,
A few comments below.

On Tue, Apr 26, 2022 at 10:31 PM Tom Honermann via SG16 <
sg16_at_[hidden]> wrote:

> The proposed wording for [format.string.escaped]p4 in P2286R7: Formatting
> Ranges
> <https://wiki.edg.com/pub/Wg21telecons2022/LibraryWorkingGroup/p2286r7.html>
> currently states:
> The escaped character and escaped string representations of a character or
> string in a non-Unicode encoding is unspecified.
> I would like this to be better specified to ensure implementations behave
> consistently.
> The wording below is suggested as a replacement for
> [format.string.escaped]p2-p4 (link to p2
> <https://wiki.edg.com/pub/Wg21telecons2022/LibraryWorkingGroup/p2286r7.html#pnum_12>)
> and is intended to cover both the Unicode and non-Unicode cases.
> The escaped string *E* representation of a string *S* is constructed by
> encoding a sequence of characters in the associated character encoding
> *CE* for charT ([lex.string.literal]) as follows:
> - *E* is initialized with U+0022 QUOTATION MARK (").
> - For each code unit sequence *X* in *S* that either encodes a single
> character or that is a sequence of ill-formed code units:
I think we should try to avoid treating ill-formed code units as part
of sequences, because that raises the question of where such a sequence
begins and ends.
I think the initial wording is clearer: reading in order, we get either one
valid UCS scalar value or one invalid code unit.
Also, "character" is ill-defined; "code point" works, though.
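
To make the "one valid scalar value or one invalid code unit" reading concrete, here is a minimal sketch assuming the encoding is UTF-8. The names `Step` and `next_step` are hypothetical and do not come from P2286R7 or any library. The point it illustrates is that a failed decode always consumes exactly one code unit, so no boundary for an ill-formed "sequence" ever has to be defined.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical result type: one scalar value, or one ill-formed code unit.
struct Step {
    bool valid;          // true: `value` is a Unicode scalar value
    std::uint32_t value; // the scalar value, or the raw code unit if !valid
    std::size_t length;  // code units consumed (always 1 if !valid)
};

// Decode a single step of UTF-8 starting at s[pos] (assumption: UTF-8 input).
inline Step next_step(std::string_view s, std::size_t pos) {
    assert(pos < s.size());
    const auto unit = [&](std::size_t i) {
        return static_cast<unsigned char>(s[i]);
    };
    const unsigned char b0 = unit(pos);
    const Step ill_formed{false, b0, 1};

    if (b0 < 0x80) return {true, b0, 1};  // ASCII

    std::size_t len = 0;
    std::uint32_t cp = 0;
    std::uint32_t lowest = 0;  // smallest value allowed for this length
    if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; lowest = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; lowest = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; lowest = 0x10000; }
    else return ill_formed;  // stray continuation byte or invalid lead byte

    if (s.size() - pos < len) return ill_formed;  // truncated sequence
    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char b = unit(pos + i);
        if ((b & 0xC0) != 0x80) return ill_formed;  // not a continuation byte
        cp = (cp << 6) | (b & 0x3Fu);
    }
    // Reject overlong forms, surrogates, and values beyond U+10FFFF.
    if (cp < lowest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return ill_formed;
    return {true, cp, len};
}
```

A caller would advance by `length` after each step, so every ill-formed code unit is reported individually.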

> - If *X* encodes a single character *C*, then:
> - If *C* is in the table below, then its corresponding
> two-character escape sequence is appended to *E*.
> <insert table here>
> - Otherwise, if *C* is not U+0020 SPACE and
> - *CE* is a Unicode encoding and C corresponds to a UCS
> scalar value whose Unicode property General_Category has a
> value in the groups Separator (Z) or Other (C), as described
> by table 12 of UAX#44, or
> - *CE* is not a Unicode encoding and C is one of an
> implementation-defined set of separator or non-printable character
Maybe a simpler way would be to define an implementation-defined mapping
TO-UCS(CE) and then use UCS in the rest of the paragraph. The behavior is
isomorphic, although it gives implementers more guidance
not to be imaginative with which character they consider

> - then the sequence \u{*simple-hexadecimal-digit-sequence*} is
> appended to *E* where *simple-hexadecimal-digit-sequence* is the
> code point value of *C* formatted as-if by a standard format
> specifier ([[format.string.std]]) of "{x}".
> - Otherwise, *C* is appended to *E*.
> - Otherwise *X* is a sequence of ill-formed code units. For each
> code unit *U*, the sequence \x{*simple-hexadecimal-digit-sequence*}
> is appended to *E* where *simple-hexadecimal-digit-sequence* is the
> code point value of *U* formatted as-if by a standard format
> specifier ([[format.string.std]]) of "{x}".
I prefer the original formulation here. In particular, what do you mean by
"code point value"? An invalid code unit does not map to a
code point. It is, however, a value itself.
But "where *simple-hexadecimal-digit-sequence* is *U* formatted as-if by a
standard format specifier ([[format.string.std]]) of "{x}"" is also less
clear than the original wording, IMO.
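
As an illustration of the distinction, here is a small sketch of the two escape forms. The helper names `to_hex`, `escape_scalar`, and `escape_code_unit` are hypothetical, and the sketch assumes the intended output is lowercase hexadecimal with no prefix and no zero padding, which is what the `x` presentation type of `std::format` produces for integers. The `\u{...}` form takes a scalar value, while the `\x{...}` form takes the raw value of the code unit itself, with no code point involved.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Lowercase hexadecimal, no prefix, no zero padding (assumed output format).
inline std::string to_hex(std::uint32_t v) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    do {
        out.insert(out.begin(), digits[v & 0xF]);
        v >>= 4;
    } while (v != 0);
    return out;
}

// Escape for a valid scalar value that must not be printed as-is.
inline std::string escape_scalar(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);  // precondition: within the Unicode code space
    return "\\u{" + to_hex(cp) + "}";
}

// Escape for one ill-formed code unit: its own value is formatted.
inline std::string escape_code_unit(unsigned char u) {
    return "\\x{" + to_hex(u) + "}";
}
```

For example, U+200B ZERO WIDTH SPACE would be written with `escape_scalar`, while a stray UTF-8 continuation byte such as 0x80 would go through `escape_code_unit`.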

I hope that helps,

> - U+0022 QUOTATION MARK (") is appended to *E*.
> Please offer your thoughts, I would like to discuss this in tomorrow's
> SG16 meeting.
> Tom.
