Thanks, Hubert.

On 4/26/22 8:39 PM, Hubert Tong wrote:
On Tue, Apr 26, 2022 at 4:31 PM Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:

The proposed wording for [format.string.escaped]p4 in P2286R7: Formatting Ranges currently states:

The escaped character and escaped string representations of a character or string in a non-Unicode encoding is unspecified.
I would like this to be better specified to ensure implementations behave consistently.

The wording below is suggested as a replacement for [format.string.escaped]p2-p4 and is intended to cover both the Unicode and non-Unicode cases.

The escaped string representation E of a string S is constructed by encoding a sequence of characters in the associated character encoding CE for charT ([lex.string.literal]) as follows:

  • E is initialized with U+0022 QUOTATION MARK (").
  • For each code unit sequence X in S that either encodes a single character or that is a sequence of ill-formed code units:
    • If X encodes a single character C, then:
      • If C is in the table below, then its corresponding two-character escape sequence is appended to E.
        <insert table here>
      • Otherwise, if C is not U+0020 SPACE and
        • CE is a Unicode encoding and C corresponds to a UCS scalar value whose Unicode property General_Category has a value in the groups Separator (Z) or Other (C), as described by table 12 of UAX#44, or
        • CE is not a Unicode encoding and C is one of an implementation-defined set of separator or non-printable characters
I would prefer more emphasis on the "an" (which is much different from "the" in this context): "a set, implementation-defined for this purpose, of [ ... ]".
Could you elaborate a bit? I would expect there to be one set for each distinct encoding. Is the concern wanting to ensure that this set is not related to the set of separator or non-printable characters perhaps used elsewhere in the standard? e.g., for isspace()?

What should Windows implementations do with codepages that overlay graphic characters with control characters?
I'm not sure how this is relevant. The string is interpreted using the literal encoding corresponding to charT, so there should be no ambiguity.
      • then the sequence \u{simple-hexadecimal-digit-sequence} is appended to E, where simple-hexadecimal-digit-sequence is the code point value of C formatted as if by a standard format specifier ([format.string.std]) of "{:x}".
Okay: This requires conversion-to-Unicode to be implemented in the runtime library only for the limited set of space and non-printable characters.
Yes, that sounds right.

      • Otherwise, C is appended to E.
    • Otherwise, X is a sequence of ill-formed code units. For each code unit U in X, the sequence \x{simple-hexadecimal-digit-sequence} is appended to E, where simple-hexadecimal-digit-sequence is the code unit value of U formatted as if by a standard format specifier ([format.string.std]) of "{:x}".
  • U+0022 QUOTATION MARK (") is appended to E.
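
A rough C++ sketch of the steps above may help when comparing readings of the wording. It is only an illustration under several assumptions, not proposed implementation: CE is taken to be UTF-8 with charT = char; the escape table is elided in the wording, so the entries used here (tab, line feed, carriage return, quotation mark, backslash) are assumed; is_separator_or_other is a hypothetical and deliberately incomplete stand-in for the General_Category (Z or C) test; each ill-formed code unit is handled on its own; and to_hex produces lowercase hexadecimal without padding, which is how the format specifier in the wording is read here. All function names are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Lowercase hexadecimal without leading zeros.
std::string to_hex(std::uint32_t v) {
    const char digits[] = "0123456789abcdef";
    std::string out;
    do {
        out.insert(out.begin(), digits[v % 16]);
        v /= 16;
    } while (v != 0);
    return out;
}

// Decodes one UTF-8 sequence at the start of s into cp. Returns the number of
// code units consumed, or 0 if no well-formed sequence starts at s[0].
std::size_t decode_utf8(std::string_view s, std::uint32_t& cp) {
    assert(!s.empty());
    auto byte = [&](std::size_t i) { return static_cast<unsigned char>(s[i]); };
    const unsigned char b0 = byte(0);
    std::size_t len = 0;
    std::uint32_t lowest = 0;  // smallest code point allowed for this length
    if (b0 < 0x80) { cp = b0; return 1; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; lowest = 0x80;    cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; lowest = 0x800;   cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; lowest = 0x10000; cp = b0 & 0x07; }
    else return 0;  // stray continuation byte or invalid lead byte
    if (s.size() < len) return 0;  // truncated sequence
    for (std::size_t i = 1; i < len; ++i) {
        if ((byte(i) & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (byte(i) & 0x3Fu);
    }
    // Reject overlong forms, surrogates, and values beyond U+10FFFF.
    if (cp < lowest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    return len;
}

// HYPOTHETICAL placeholder: covers only a handful of Z/C code points.
// A real implementation would consult Unicode General_Category data.
bool is_separator_or_other(std::uint32_t cp) {
    return cp <= 0x20 || (cp >= 0x7F && cp <= 0xA0) || cp == 0xAD ||
           (cp >= 0x200B && cp <= 0x200F) || cp == 0x2028 || cp == 0x2029;
}

std::string escape_string(std::string_view s) {
    std::string e = "\"";  // E is initialized with U+0022
    std::size_t i = 0;
    while (i < s.size()) {
        std::uint32_t cp = 0;
        const std::size_t len = decode_utf8(s.substr(i), cp);
        if (len == 0) {
            // Ill-formed code unit: \x{...} with the code unit value.
            e += "\\x{" + to_hex(static_cast<unsigned char>(s[i])) + "}";
            ++i;
            continue;
        }
        switch (cp) {
            // Assumed table entries; the table itself is elided in the wording.
            case 0x09: e += "\\t";  break;
            case 0x0A: e += "\\n";  break;
            case 0x0D: e += "\\r";  break;
            case 0x22: e += "\\\""; break;
            case 0x5C: e += "\\\\"; break;
            default:
                if (cp != 0x20 && is_separator_or_other(cp)) {
                    e += "\\u{" + to_hex(cp) + "}";  // code point value
                } else {
                    e.append(s.data() + i, len);     // C is appended unchanged
                }
        }
        i += len;
    }
    e += '"';  // closing U+0022
    return e;
}
```

Treating every ill-formed code unit individually sidesteps the question of how a "sequence of ill-formed code units" is delimited; since each code unit is escaped separately either way, the output should be the same for UTF-8, but that is a property of this sketch rather than something the wording states.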

Please offer your thoughts; I would like to discuss this in tomorrow's SG16 meeting.

The above may actively prevent translation-as-source of escaped strings from reproducing the same code unit sequence in cases of stateful encodings for strings due to "unnecessary" shift sequences. I believe the most likely practical resolution applied by implementations would be to choose to apply an encoding that is different from the literal/wide literal encoding (but is in a base "family"). That is, shift sequences will be considered to be a sequence of characters in their own right and the processing operates as if the initial shift state is active.
That matches my expectations as well. I'll follow up with suggested wording to address this and some of Corentin's concerns.

I believe the above wording does not accommodate that solution (and that solution should be allowed).

I agree it should be allowed.

Tom.