Thanks Tom,

A few comments below.

On Tue, Apr 26, 2022 at 10:31 PM Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:

The proposed wording for [format.string.escaped]p4 in P2286R7: Formatting Ranges currently states:

The escaped character and escaped string representations of a character or string in a non-Unicode encoding is unspecified.
I would like this to be better specified to ensure implementations behave consistently.
The wording below is suggested as a replacement for [format.string.escaped]p2-p4 (link to p2) and is intended to cover both the Unicode and non-Unicode cases.

The escaped string E representation of a string S is constructed by encoding a sequence of characters in the associated character encoding CE for charT ([lex.string.literal]) as follows:

E is initialized with U+0022 QUOTATION MARK (").

For each code unit sequence X in S that either encodes a single character or that is a sequence of ill-formed code units:

I think we should avoid treating ill-formed code units as part of sequences, because that raises the question of where each such sequence begins and ends.

I think the initial wording is clearer: reading in order, we get either one valid UCS scalar value or one invalid code unit.

Also, "character" is ill-defined; "code point" works, though.
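
To make that reading model concrete, here is a minimal sketch of a single decoding step. It assumes C++17 and UTF-8 as the encoding; the names `Step`, `unit` and `decode_one` are hypothetical and are not part of any proposed wording. Each call yields either one scalar value or exactly one invalid code unit, so there is never a multi-unit ill-formed sequence whose boundaries would need defining.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Result of one decoding step (hypothetical type, for illustration only).
struct Step {
    bool valid;           // true: `value` is a scalar value; false: a raw code unit
    std::uint32_t value;  // scalar value, or the value of the invalid code unit
    std::size_t length;   // code units consumed (1 whenever !valid, 0 on empty input)
};

// View a char as an unsigned code unit value, whatever the signedness of char.
constexpr std::uint32_t unit(char c) { return static_cast<unsigned char>(c); }

// Decode one step from the front of a UTF-8 code unit sequence.
constexpr Step decode_one(std::string_view s) {
    if (s.empty()) return {false, 0, 0};
    const std::uint32_t b0 = unit(s[0]);
    if (b0 < 0x80) return {true, b0, 1};  // ASCII

    std::size_t len = 0;
    std::uint32_t cp = 0, min_cp = 0;
    if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
    else return {false, b0, 1};   // stray continuation or invalid lead unit

    if (s.size() < len) return {false, b0, 1};  // truncated sequence
    for (std::size_t i = 1; i < len; ++i) {
        const std::uint32_t b = unit(s[i]);
        if ((b & 0xC0) != 0x80) return {false, b0, 1};  // not a continuation unit
        cp = (cp << 6) | (b & 0x3F);
    }
    // Reject overlong forms, surrogates, and values beyond U+10FFFF.
    if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return {false, b0, 1};
    return {true, cp, len};
}
```

On failure the sketch consumes only the first code unit and lets the next call deal with whatever follows, which is one possible way to get the "one invalid code unit at a time" behavior.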

If X encodes a single character C, then:

If C is in the table below, then its corresponding two-character escape sequence is appended to E.
<insert table here>

Otherwise, if C is not U+0020 SPACE and

CE is a Unicode encoding and C corresponds to a UCS scalar value whose Unicode property General_Category has a value in the groups Separator (Z) or Other (C), as described by table 12 of UAX#44, or

CE is not a Unicode encoding and C is one of an implementation-defined set of separator or non-printable character

Maybe a simpler way would be to define an implementation-defined mapping TO-UCS(CE), and then use UCS in the rest of the paragraph. The behavior is isomorphic, but it gives implementers more guidance not to be imaginative about which characters they consider printable.
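
As a rough illustration of the TO-UCS idea, the sketch below assumes the non-Unicode encoding is ISO/IEC 8859-1, whose code unit values map directly onto U+0000..U+00FF. The function names are hypothetical, and the category check is a hand-written stand-in that only covers that range; a real implementation would consult the General_Category data from the Unicode Character Database. Characters handled by the two-character escape table are assumed to have been dealt with before this check.

```cpp
// Hypothetical TO-UCS mapping for ISO/IEC 8859-1: each code unit value is
// also its UCS code point.
constexpr char32_t to_ucs_latin1(unsigned char code_unit) { return code_unit; }

// Stand-in for "General_Category is in Separator (Z) or Other (C)".
// Only meaningful for U+0000..U+00FF (an assumption of this sketch).
constexpr bool is_separator_or_other(char32_t c) {
    return c <= 0x1F || (c >= 0x7F && c <= 0x9F)  // Cc: C0 controls, DEL, C1 controls
        || c == 0x20 || c == 0xA0                 // Zs: SPACE, NO-BREAK SPACE
        || c == 0xAD;                             // Cf: SOFT HYPHEN
}

// The escaping decision, written once in terms of UCS. The same predicate
// then serves the Unicode and the non-Unicode case.
constexpr bool needs_u_escape(char32_t c) {
    return c != U' ' && is_separator_or_other(c);
}
```

With this shape, the only implementation-defined part is the mapping itself; the decision about what is printable is taken from Unicode.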

then the sequence \u{simple-hexadecimal-digit-sequence} is appended to E where simple-hexadecimal-digit-sequence is the code point value of C formatted as-if by a standard format specifier ([[format.string.std]]) of "{x}".

Otherwise, C is appended to E.

Otherwise X is a sequence of ill-formed code units. For each code unit U, the sequence \x{simple-hexadecimal-digit-sequence} is appended to E where simple-hexadecimal-digit-sequence is the code point value of U formatted as-if by a standard format specifier ([[format.string.std]]) of "{x}".

I prefer the original formulation here. In particular, what do you mean by "code point value"? An invalid code unit does not map to a code point. It is, however, a value in itself.

But the alternative, "where simple-hexadecimal-digit-sequence is U formatted as-if by a standard format specifier ([[format.string.std]]) of "{x}"", is also less clear than the original wording, imo.
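
For what it is worth, the distinction can be sketched as follows. This assumes that "{x}" in the quoted wording means the `x` presentation type (lowercase hexadecimal, no prefix, no leading zeros); the helper names are hypothetical. The \x{...} form takes the value of the code unit itself, while the \u{...} form takes a code point.

```cpp
#include <cstdint>
#include <string>

// Lowercase hexadecimal without prefix or leading zeros (assumed to match
// what the `x` presentation type produces for an unsigned integer).
inline std::string to_hex(std::uint32_t v) {
    const char digits[] = "0123456789abcdef";
    std::string out;
    do {
        out.insert(out.begin(), digits[v & 0xF]);
        v >>= 4;
    } while (v != 0);
    return out;
}

// Escape for an ill-formed code unit: the operand is the unit's own value.
// Taking unsigned char avoids a negative value when char is signed.
inline std::string escape_code_unit(unsigned char code_unit) {
    return "\\x{" + to_hex(code_unit) + "}";
}

// Escape for a code point that must not be emitted as-is.
inline std::string escape_code_point(char32_t cp) {
    return "\\u{" + to_hex(static_cast<std::uint32_t>(cp)) + "}";
}
```

Under these assumptions, `escape_code_unit(0xFF)` yields `\x{ff}` and `escape_code_point(0x200B)` yields `\u{200b}`.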

I hope that helps,

Corentin

U+0022 QUOTATION MARK (") is appended to E.

Please offer your thoughts, I would like to discuss this in tomorrow's SG16 meeting.

Tom.

--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16