On Fri, Apr 29, 2022 at 4:20 PM Victor Zverovich via SG16 <sg16@lists.isocpp.org> wrote:
The format string in

  formatted as-if by a format string ([format.string.general]) of "\\u\{{x}\}"

is wrong because { and } should be escaped by doubling, not via '\'. Moreover, as commented in the meeting, I think the old wording, which didn't use format strings, was clearer.

+1 
 
- Victor


On Wed, Apr 27, 2022 at 11:59 AM Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:

Updated wording intended to address concerns raised by Corentin and Hubert is below. Changes include:

  • Added wording to address stateful character encodings.
  • Revised wording to better align with wording in [lex.string]p10.
  • Replaced the incorrect use of "code point" with "code unit" where pointed out by Corentin.
  • Replaced the {simple-hexadecimal-digit-sequence} formulation with a format string rather than just a standard format specifier.

The escaped string E representation of a string S is constructed by encoding a sequence of characters in the associated character encoding CE for charT ([lex.string.literal]) as follows:

  • U+0022 QUOTATION MARK (") is appended to E.
  • Each code unit sequence X in S that either encodes a single character or encoding state transition or that is a sequence of ill-formed code units is processed in order as follows:
    • If X encodes a single character C, then:
      • If C is one of the UCS scalar values in table X, then the corresponding escape sequence is appended to E.
        <insert table X here>
      • Otherwise, if C is not U+0020 SPACE and
        • CE is a Unicode encoding and C corresponds to a UCS scalar value whose Unicode property General_Category has a value in the groups Separator (Z) or Other (C), as described by table 12 of UAX#44, or
        • CE is not a Unicode encoding and C is one of an implementation-defined set of separator or non-printable characters
      • then the UCS scalar value corresponding to C is appended to E formatted as-if by a format string ([format.string.general]) of "\\u\{{x}\}". When encoding a stateful character encoding, these additions should have no effect on encoding state.
      • Otherwise, C is appended to E.
    • Otherwise, if X encodes a state transition, the effect on E is unspecified.
    • Otherwise X is a sequence of ill-formed code units. Each code unit U is appended to E in order formatted as-if by a format string ([format.string.general]) of "\\x\{{x}\}". When encoding a stateful character encoding, these additions should have no effect on encoding state.
  • U+0022 QUOTATION MARK (") is appended to E.

When encoding a stateful character encoding, implementations should first initialize E to the initial encoding state. Each subsequent addition to E should begin with the final encoding state of the prior addition. E should be returned to the initial encoding state after the final quotation mark is appended.
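
To make the steps above concrete, here is a rough sketch for the UTF-8 case only. It is not an implementation of the proposed wording: the contents of table X are assumed to be tab, line feed, carriage return, quotation mark, and backslash; is_separator_or_other is a deliberately incomplete stand-in for a real General_Category lookup; and all function names are hypothetical. UTF-8 is stateless, so the state transition case does not arise here.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Decodes one UTF-8 sequence starting at s[i]. Returns its length (1-4) and
// stores the scalar value in cp, or returns 0 if s[i] starts an ill-formed sequence.
std::size_t decode_utf8(const std::string& s, std::size_t i, char32_t& cp) {
    auto byte = [&](std::size_t k) { return static_cast<unsigned char>(s[k]); };
    unsigned char b0 = byte(i);
    std::size_t len = 0;
    char32_t min_cp = 0;  // smallest value allowed for this length (rejects overlong forms)
    if (b0 < 0x80) { cp = b0; return 1; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
    else return 0;
    if (i + len > s.size()) return 0;  // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
        if ((byte(i + k) & 0xC0) != 0x80) return 0;  // not a continuation byte
        cp = (cp << 6) | (byte(i + k) & 0x3F);
    }
    if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    return len;
}

// ASSUMPTION: stand-in for "General_Category in Separator (Z) or Other (C)".
// Covers only C0/C1 controls, U+00A0, U+2028, and U+2029.
bool is_separator_or_other(char32_t c) {
    return c <= 0x1F || (c >= 0x7F && c <= 0x9F) ||
           c == 0xA0 || c == 0x2028 || c == 0x2029;
}

// Appends prefix{hex} to out, e.g. \u{85} or \x{ff}.
void append_hex(std::string& out, const char* prefix, std::uint32_t v) {
    char buf[16];
    std::snprintf(buf, sizeof buf, "%s{%lx}", prefix, static_cast<unsigned long>(v));
    out += buf;
}

std::string escape_utf8(const std::string& s) {
    std::string e = "\"";  // opening quotation mark
    for (std::size_t i = 0; i < s.size();) {
        char32_t c = 0;
        std::size_t len = decode_utf8(s, i, c);
        if (len == 0) {  // ill-formed: escape this code unit and resynchronize
            append_hex(e, "\\x", static_cast<unsigned char>(s[i]));
            ++i;
            continue;
        }
        switch (c) {  // ASSUMPTION: these five entries stand in for table X
            case U'\t': e += "\\t"; break;
            case U'\n': e += "\\n"; break;
            case U'\r': e += "\\r"; break;
            case U'"':  e += "\\\""; break;
            case U'\\': e += "\\\\"; break;
            default:
                if (c != U' ' && is_separator_or_other(c)) append_hex(e, "\\u", c);
                else e.append(s, i, len);  // copy the code units unchanged
        }
        i += len;
    }
    e += '"';  // closing quotation mark
    return e;
}
```

Escaping ill-formed input one code unit at a time is a simplification; it produces the same output as grouping the code units first, since each one is escaped individually either way.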

Tom.

On 4/26/22 4:31 PM, Tom Honermann via SG16 wrote:

The proposed wording for [format.string.escaped]p4 in P2286R7: Formatting Ranges currently states:

  The escaped character and escaped string representations of a character or string in a non-Unicode encoding is unspecified.

I would like this to be better specified to ensure implementations behave consistently.

The wording below is suggested as a replacement for [format.string.escaped]p2-p4 and is intended to cover both the Unicode and non-Unicode cases.

The escaped string E representation of a string S is constructed by encoding a sequence of characters in the associated character encoding CE for charT ([lex.string.literal]) as follows:

  • E is initialized with U+0022 QUOTATION MARK (").
  • For each code unit sequence X in S that either encodes a single character or that is a sequence of ill-formed code units:
    • If X encodes a single character C, then:
      • If C is in the table below, then its corresponding two-character escape sequence is appended to E.
        <insert table here>
      • Otherwise, if C is not U+0020 SPACE and
        • CE is a Unicode encoding and C corresponds to a UCS scalar value whose Unicode property General_Category has a value in the groups Separator (Z) or Other (C), as described by table 12 of UAX#44, or
        • CE is not a Unicode encoding and C is one of an implementation-defined set of separator or non-printable characters
      • then the sequence \u{simple-hexadecimal-digit-sequence} is appended to E where simple-hexadecimal-digit-sequence is the code point value of C formatted as-if by a standard format specifier ([format.string.std]) of "{x}".
      • Otherwise, C is appended to E.
    • Otherwise X is a sequence of ill-formed code units. For each code unit U, the sequence \x{simple-hexadecimal-digit-sequence} is appended to E where simple-hexadecimal-digit-sequence is the code point value of U formatted as-if by a standard format specifier ([format.string.std]) of "{x}".
  • U+0022 QUOTATION MARK (") is appended to E.
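
One detail an implementation of the \x{...} step would need to get right is the signedness of the code unit. The sketch below assumes the value of U is meant to be its non-negative code unit value; the helper name escape_code_unit is hypothetical:

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: renders one ill-formed code unit as \x{hex}.
std::string escape_code_unit(char u) {
    char buf[16];
    // Convert through unsigned char first. Where plain char is signed, passing
    // a value such as '\xFF' directly would be sign-extended and could print
    // as "ffffffff" instead of "ff".
    std::snprintf(buf, sizeof buf, "\\x{%x}",
                  static_cast<unsigned>(static_cast<unsigned char>(u)));
    return buf;
}
```

With this conversion, escape_code_unit('\xFF') yields \x{ff} whether or not char is signed.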

Please offer your thoughts; I would like to discuss this in tomorrow's SG16 meeting.

Tom.


--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16