On Tue, Apr 26, 2022 at 4:31 PM Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:

The proposed wording for [format.string.escaped]p4 in P2286R7: Formatting Ranges currently states:

The escaped character and escaped string representations of a character or string in a non-Unicode encoding is unspecified.
I would like this to be better specified to ensure implementations behave consistently.

The wording below is suggested as a replacement for [format.string.escaped]p2-p4 and is intended to cover both the Unicode and non-Unicode cases.

The escaped string representation E of a string S is constructed by encoding a sequence of characters in the associated character encoding CE for charT ([lex.string.literal]) as follows:

E is initialized with U+0022 QUOTATION MARK (").

For each code unit sequence X in S that either encodes a single character or that is a sequence of ill-formed code units:

If X encodes a single character C, then:

If C is in the table below, then its corresponding two-character escape sequence is appended to E.
<insert table here>

Otherwise, if C is not U+0020 SPACE and

CE is a Unicode encoding and C corresponds to a UCS scalar value whose Unicode property General_Category has a value in the groups Separator (Z) or Other (C), as described by table 12 of UAX#44, or

CE is not a Unicode encoding and C is one of an implementation-defined set of separator or non-printable characters

I would prefer more emphasis on the "an" (which is much different from "the" in this context): "a set, implementation-defined for this purpose, of [ ... ]".

What should Windows implementations do with codepages that overlay graphic characters with control characters?

then the sequence \u{simple-hexadecimal-digit-sequence} is appended to E, where simple-hexadecimal-digit-sequence is the code point value of C formatted as if by a standard format specifier ([format.string.std]) of "{:x}".

Okay: This requires conversion-to-Unicode to be implemented in the runtime library only for the limited set of space and non-printable characters.

Otherwise, C is appended to E.

Otherwise, X is a sequence of ill-formed code units. For each code unit U in X, the sequence \x{simple-hexadecimal-digit-sequence} is appended to E, where simple-hexadecimal-digit-sequence is the code unit value of U formatted as if by a standard format specifier ([format.string.std]) of "{:x}".

U+0022 QUOTATION MARK (") is appended to E.
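
For discussion purposes, the steps above can be sketched roughly as follows for the case where CE is UTF-8. This is only an illustration under stated assumptions, not proposed wording or any implementation's behavior: the names escape_string, decode_utf8, append_hex, and is_separator_or_other are hypothetical; the escape table is assumed to be the five entries from P2286 (tab, line feed, carriage return, quotation mark, backslash), since the table itself is not reproduced in this message; and the General_Category check is a deliberately incomplete stand-in that covers only Cc, Zl, Zp, and the Zs characters (it omits Cf, Cs, Co, and Cn).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical, incomplete stand-in for a General_Category lookup (groups Z and C).
bool is_separator_or_other(char32_t c) {
    if (c <= 0x1F || (c >= 0x7F && c <= 0x9F)) return true;      // Cc
    if (c == 0xA0 || c == 0x1680 || (c >= 0x2000 && c <= 0x200A) ||
        c == 0x202F || c == 0x205F || c == 0x3000) return true;  // Zs except U+0020
    return c == 0x2028 || c == 0x2029;                           // Zl, Zp
}

// Decodes one UTF-8 sequence starting at s[i]. Returns its length in code
// units, or 0 if the code units at that position are ill-formed.
std::size_t decode_utf8(std::string_view s, std::size_t i, char32_t& cp) {
    auto byte = [&](std::size_t k) { return static_cast<unsigned char>(s[k]); };
    unsigned char b0 = byte(i);
    std::size_t len = 0;
    char32_t min = 0;
    if (b0 < 0x80) { cp = b0; return 1; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
    else return 0;                                  // stray continuation or invalid lead
    if (i + len > s.size()) return 0;               // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
        if ((byte(i + k) & 0xC0) != 0x80) return 0; // missing continuation unit
        cp = (cp << 6) | (byte(i + k) & 0x3F);
    }
    // Reject overlong forms, surrogates, and values above U+10FFFF.
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    return len;
}

// Appends prefix{hex}, e.g. \u{a0}, using lowercase digits without padding.
void append_hex(std::string& e, const char* prefix, std::uint32_t v) {
    char buf[16];
    std::snprintf(buf, sizeof buf, "%s{%x}", prefix, static_cast<unsigned>(v));
    e += buf;
}

std::string escape_string(std::string_view s) {
    std::string e = "\"";                           // opening U+0022
    for (std::size_t i = 0; i < s.size();) {
        char32_t c = 0;
        std::size_t len = decode_utf8(s, i, c);
        if (len == 0) {                             // ill-formed: escape one code unit
            append_hex(e, "\\x", static_cast<unsigned char>(s[i]));
            ++i;
            continue;
        }
        switch (c) {                                // assumed escape table
            case U'\t': e += "\\t";  break;
            case U'\n': e += "\\n";  break;
            case U'\r': e += "\\r";  break;
            case U'"':  e += "\\\""; break;
            case U'\\': e += "\\\\"; break;
            default:
                if (c != U' ' && is_separator_or_other(c))
                    append_hex(e, "\\u", c);
                else
                    e.append(s.substr(i, len));     // copy the character unchanged
        }
        i += len;
    }
    e += '"';                                       // closing U+0022
    return e;
}
```

Under these assumptions, a tab becomes \t, U+00A0 NO-BREAK SPACE becomes \u{a0}, and a lone 0xFF code unit becomes \x{ff}, while U+0020 SPACE and ordinary graphic characters are copied through unchanged. A non-Unicode CE would replace is_separator_or_other with the implementation-defined set and would need a conversion to Unicode only for the characters in that set.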

Please offer your thoughts; I would like to discuss this in tomorrow's SG16 meeting.

The above may actively prevent translation-as-source of escaped strings from reproducing the same code unit sequence in cases of stateful encodings for strings due to "unnecessary" shift sequences. I believe the most likely practical resolution applied by implementations would be to choose to apply an encoding that is different from the literal/wide literal encoding (but is in a base "family"). That is, shift sequences will be considered to be a sequence of characters in their own right and the processing operates as if the initial shift state is active.

I believe the above wording does not accommodate that solution (and that solution should be allowed).

Tom.

--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16