On Fri, Apr 29, 2022 at 4:20 PM Victor Zverovich via SG16 <sg16@lists.isocpp.org> wrote:

The format string in

formatted as-if by a format string ([format.string.general]) of "\\u\{{x}\}"

is wrong because { and } should be escaped by doubling, not with '\'. Moreover, as I commented in the meeting, I think the old wording that didn't use format strings was clearer.

- Victor

On Wed, Apr 27, 2022 at 11:59 AM Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:

Updated wording intended to address concerns raised by Corentin and Hubert is below. Changes include:

Added wording to address stateful character encodings.

Revised wording to better align with wording in [lex.string]p10.

Replaced the incorrect use of "code point" with "code unit" where pointed out by Corentin.

Replaced the {simple-hexadecimal-digit-sequence} formulation, which relied on a standard format specifier alone, with a complete format string.

The escaped string E representation of a string S is constructed by encoding a sequence of characters in the associated character encoding CE for charT ([lex.string.literal]) as follows:

U+0022 QUOTATION MARK (") is appended to E.

Each code unit sequence X in S that either encodes a single character or an encoding state transition, or that is a sequence of ill-formed code units, is processed in order as follows:

If X encodes a single character C, then:

If C is one of the UCS scalar values in table X, then the corresponding escape sequence is appended to E.
<insert table X here>

Otherwise, if C is not U+0020 SPACE and

CE is a Unicode encoding and C corresponds to a UCS scalar value whose Unicode property General_Category has a value in the groups Separator (Z) or Other (C), as described by table 12 of UAX#44, or

CE is not a Unicode encoding and C is one of an implementation-defined set of separator or non-printable characters

then the UCS scalar value corresponding to C is appended to E formatted as-if by a format string ([format.string.general]) of "\\u\{{x}\}". When encoding a stateful character encoding, these additions should have no effect on encoding state.

Otherwise, C is appended to E.

Otherwise, if X encodes a state transition, the effect on E is unspecified.

Otherwise X is a sequence of ill-formed code units. Each code unit U is appended to E in order formatted as-if by a format string ([format.string.general]) of "\\x\{{x}\}". When encoding a stateful character encoding, these additions should have no effect on encoding state.

U+0022 QUOTATION MARK (") is appended to E.

When encoding a stateful character encoding, implementations should first initialize E to the initial encoding state. Each subsequent addition to E should begin with the final encoding state of the prior addition. E should be returned to the initial encoding state after the final quotation mark is appended.

Tom.
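
The steps in the wording above can be sketched for one deliberately narrow case. This is not an implementation of the proposal. It assumes a stateless, ASCII-only encoding, so encoding state never comes into play. It assumes that the implementation-defined set of non-printable characters is the C0 controls plus DEL. It treats every code unit above 0x7F as ill-formed. Table X is not reproduced in this thread, so the five entries below are placeholders chosen for illustration. The names hex and escape_ascii are hypothetical.

```cpp
#include <cassert>
#include <charconv>
#include <string>
#include <string_view>
#include <system_error>

// Lowercase hexadecimal without padding, the same shape that a {:x}
// replacement field produces.
std::string hex(unsigned value) {
    char buf[16];
    auto res = std::to_chars(buf, buf + sizeof buf, value, 16);
    assert(res.ec == std::errc{});
    return std::string(buf, res.ptr);
}

// Hypothetical sketch of the escaped string representation for an
// ASCII-only, stateless encoding (assumptions are listed in the lead-in).
std::string escape_ascii(std::string_view s) {
    std::string e = "\"";                      // opening quotation mark
    for (unsigned char u : s) {
        // Assumed stand-in for table X, which this thread leaves out.
        if (u == '\t')      e += "\\t";
        else if (u == '\n') e += "\\n";
        else if (u == '\r') e += "\\r";
        else if (u == '"')  e += "\\\"";
        else if (u == '\\') e += "\\\\";
        // Assumption: a code unit outside ASCII is ill-formed here.
        else if (u > 0x7F)  e += "\\x{" + hex(u) + "}";
        // Assumed non-printable set: C0 controls and DEL. SPACE (0x20)
        // is not in the set, so it falls through unchanged.
        else if (u < 0x20 || u == 0x7F) e += "\\u{" + hex(u) + "}";
        else e += static_cast<char>(u);        // everything else as-is
    }
    e += '"';                                  // closing quotation mark
    return e;
}
```

Under these assumptions, escape_ascii("a\tb") produces "a\tb" with a literal backslash and t between the quotes, and a lone 0xFF code unit produces "\x{ff}". A Unicode implementation would additionally have to decode code unit sequences and consult General_Category, which this sketch does not attempt.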

On 4/26/22 4:31 PM, Tom Honermann via SG16 wrote:

The proposed wording for [format.string.escaped]p4 in P2286R7: Formatting Ranges currently states:

The escaped character and escaped string representations of a character or string in a non-Unicode encoding is unspecified.
I would like this to be better specified to ensure implementations behave consistently.
The wording below is suggested as a replacement for [format.string.escaped]p2-p4 and is intended to cover both the Unicode and non-Unicode cases.

The escaped string E representation of a string S is constructed by encoding a sequence of characters in the associated character encoding CE for charT ([lex.string.literal]) as follows:

E is initialized with U+0022 QUOTATION MARK (").

For each code unit sequence X in S that either encodes a single character or that is a sequence of ill-formed code units:

If X encodes a single character C, then:

If C is in the table below, then its corresponding two-character escape sequence is appended to E.
<insert table here>

Otherwise, if C is not U+0020 SPACE and

CE is a Unicode encoding and C corresponds to a UCS scalar value whose Unicode property General_Category has a value in the groups Separator (Z) or Other (C), as described by table 12 of UAX#44, or

CE is not a Unicode encoding and C is one of an implementation-defined set of separator or non-printable characters

then the sequence \u{simple-hexadecimal-digit-sequence} is appended to E where simple-hexadecimal-digit-sequence is the code point value of C formatted as-if by a standard format specifier ([format.string.std]) of "{x}".

Otherwise, C is appended to E.

Otherwise X is a sequence of ill-formed code units. For each code unit U, the sequence \x{simple-hexadecimal-digit-sequence} is appended to E where simple-hexadecimal-digit-sequence is the code point value of U formatted as-if by a standard format specifier ([format.string.std]) of "{x}".

U+0022 QUOTATION MARK (") is appended to E.

Please offer your thoughts, I would like to discuss this in tomorrow's SG16 meeting.

Tom.

--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16
