ISOCPP sg16 List: Re: Suggested wording change for non-Unicode cases in P2286R7: Formatting Ranges

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 27 Apr 2022 14:59:48 -0400

Updated wording intended to address concerns raised by Corentin and
Hubert is below. Changes include:

  * Added wording to address stateful character encodings.
  * Revised wording to better align with wording in [lex.string]p10
    <http://eel.is/c++draft/lex.string#10>.
  * Replaced the incorrect use of "code point" with "code unit" where
    pointed out by Corentin.
  * Replaced the {simple-hexadecimal-digit-sequence} formulation, which
    relied on a standard format specifier alone, with a complete format
    string.

The escaped string /E/ representation of a string /S/ is constructed by
encoding a sequence of characters in the associated character encoding
/CE/ for charT ([lex.string.literal]) as follows:

  * U+0022 QUOTATION MARK (") is appended to /E/.
  * Each code unit sequence /X/ in /S/ that either encodes a single
    character or encoding state transition or that is a sequence of
    ill-formed code units is processed in order as follows:
      o If /X/ encodes a single character /C/, then:
          + If /C/ is one of the UCS scalar values in table X, then the
            corresponding escape sequence is appended to /E/.
            <insert table X here>
          + Otherwise, if /C/ is not U+0020 SPACE and
              # /CE/ is a Unicode encoding and /C/ corresponds to a UCS
                scalar value whose Unicode property General_Category has
                a value in the groups Separator (Z) or Other (C), as
                described by table 12 of UAX#44, or
              # /CE/ is not a Unicode encoding and /C/ is one of an
                implementation-defined set of separator or non-printable
                characters
            then the UCS scalar value corresponding to /C/ is appended
            to /E/ formatted as-if by a format string
            ([format.string.general]
            <http://eel.is/c++draft/format.string.general>) of
            "\\u\{{x}\}". When encoding a stateful character encoding,
            these additions should have no effect on encoding state.
          + Otherwise, /C/ is appended to /E/.
      o Otherwise, if /X/ encodes a state transition, the effect on /E/
        is unspecified.
      o Otherwise /X/ is a sequence of ill-formed code units. Each code
        unit /U/ is appended to /E/ in order formatted as-if by a format
        string ([format.string.general]
        <http://eel.is/c++draft/format.string.general>) of "\\x\{{x}\}".
        When encoding a stateful character encoding, these additions
        should have no effect on encoding state.
  * U+0022 QUOTATION MARK (") is appended to /E/.
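
For illustration only (this is not proposed wording), the steps above can
be sketched for the simple case where /S/ is UTF-32, so that every code
unit is either a UCS scalar value or an ill-formed code unit. Several
things here are assumptions rather than part of the wording: the contents
of table X (taken to be tab, line feed, carriage return, quotation mark,
and backslash), the hypothetical predicate is_separator_or_other (a
deliberately incomplete stand-in for the General_Category test or for an
implementation-defined set), and the reading of the two format strings as
producing \u{...} and \x{...} with lowercase hexadecimal digits, which
std::format would spell as "\\u{{{:x}}}" and "\\x{{{:x}}}".

```cpp
#include <cstdio>
#include <string>

// Lowercase hex digits with no prefix or padding, the same digits the "x"
// presentation type of std::format yields for an unsigned integer.
inline std::u32string hex_digits(unsigned long value) {
    char buf[2 * sizeof(unsigned long) + 1];
    int n = std::snprintf(buf, sizeof buf, "%lx", value);
    return std::u32string(buf, buf + n);
}

// ASSUMPTION: contents of "table X", which the message does not list.
inline const char32_t* table_x_escape(char32_t c) {
    switch (c) {
        case U'\t': return U"\\t";
        case U'\n': return U"\\n";
        case U'\r': return U"\\r";
        case U'"':  return U"\\\"";
        case U'\\': return U"\\\\";
        default:    return nullptr;
    }
}

// HYPOTHETICAL stand-in for the General_Category test. It covers only the
// C0/C1 controls and DEL (Cc) plus U+00A0, U+2028 and U+2029 (Z groups);
// a real implementation would consult the Unicode Character Database.
inline bool is_separator_or_other(char32_t c) {
    return c < 0x20 || (c >= 0x7F && c <= 0x9F) ||
           c == 0xA0 || c == 0x2028 || c == 0x2029;
}

// A UTF-32 code unit is well-formed only if it is a UCS scalar value.
inline bool is_scalar_value(char32_t c) {
    return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}

inline std::u32string escape_string(const std::u32string& s) {
    std::u32string e;
    e += U'"';                                   // opening quotation mark
    for (char32_t x : s) {
        if (!is_scalar_value(x)) {               // ill-formed code unit
            e += U"\\x{";
            e += hex_digits(x);
            e += U'}';
        } else if (const char32_t* esc = table_x_escape(x)) {
            e += esc;                            // table X escape sequence
        } else if (x != U' ' && is_separator_or_other(x)) {
            e += U"\\u{";
            e += hex_digits(x);
            e += U'}';
        } else {
            e += x;                              // appended unchanged
        }
    }
    e += U'"';                                   // closing quotation mark
    return e;
}
```

With these assumptions, a string holding a, U+00A0, and a lone surrogate
code unit 0xD800 would be escaped as "a\u{a0}\x{d800}". Stateful encodings
and multi-unit ill-formed sequences do not arise in UTF-32 and are not
modelled here.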

When encoding a stateful character encoding, implementations should
first initialize /E/ to the initial encoding state. Each subsequent
addition to /E/ should begin with the final encoding state of the prior
addition. /E/ should be returned to the initial encoding state after the
final quotation mark is appended.
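
As a rough sketch of what this paragraph asks for (again, not proposed
wording), consider a toy two-state encoding modelled on ISO-2022-JP, where
ESC ( B selects ASCII and ESC $ B selects JIS X 0208. The StatefulSink
class and its member names are hypothetical; the sketch only shows the
bookkeeping: start in the initial state, let each addition begin in the
state the previous one ended in, emit escapes without changing the state
seen by later additions, and return to the initial state at the end.

```cpp
#include <string>

enum class ShiftState { Ascii, Jis0208 };   // Ascii is the initial state

// HYPOTHETICAL helper that accumulates E while tracking the shift state.
class StatefulSink {
public:
    // Append bytes that are meant to be interpreted in state `needed`.
    void append(ShiftState needed, const std::string& bytes) {
        switch_to(needed);
        out_ += bytes;
    }

    // Append an ASCII-only escape such as \u{...} or \x{...}, then restore
    // the prior state so the addition has no net effect on encoding state.
    void append_escape(const std::string& ascii_escape) {
        const ShiftState saved = state_;
        switch_to(ShiftState::Ascii);
        out_ += ascii_escape;
        switch_to(saved);
    }

    // Return E to the initial encoding state and hand back the result.
    std::string finish() {
        switch_to(ShiftState::Ascii);
        return out_;
    }

private:
    // Emit a shift sequence only when the state actually changes.
    void switch_to(ShiftState s) {
        if (s == state_) return;
        out_ += (s == ShiftState::Ascii) ? "\x1b(B" : "\x1b$B";
        state_ = s;
    }

    ShiftState state_ = ShiftState::Ascii;
    std::string out_;
};
```

This sketch restores the saved state eagerly, so it can emit a shift
sequence that is immediately undone by the next addition. An
implementation could defer the restoring shift until it is known to be
needed; the resulting /E/ would decode to the same characters.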

Tom.

On 4/26/22 4:31 PM, Tom Honermann via SG16 wrote:
>
> The proposed wording for [format.string.escaped]p4 in P2286R7:
> Formatting Ranges
> <https://wiki.edg.com/pub/Wg21telecons2022/LibraryWorkingGroup/p2286r7.html>
> currently states:
>
>> The escaped character and escaped string representations of a
>> character or string in a non-Unicode encoding is unspecified.
> I would like this to be better specified to ensure implementations
> behave consistently.
>
> The wording below is suggested as a replacement for
> [format.string.escaped]p2-p4 (link to p2
> <https://wiki.edg.com/pub/Wg21telecons2022/LibraryWorkingGroup/p2286r7.html#pnum_12>)
> and is intended to cover both the Unicode and non-Unicode cases.
>
> The escaped string /E/ representation of a string /S/ is constructed
> by encoding a sequence of characters in the associated character
> encoding /CE/ for charT ([lex.string.literal]) as follows:
>
>   * /E/ is initialized with U+0022 QUOTATION MARK (").
>   * For each code unit sequence /X/ in /S/ that either encodes a
>     single character or that is a sequence of ill-formed code units:
>       o If /X/ encodes a single character /C/, then:
>           + If /C/ is in the table below, then its corresponding
>             two-character escape sequence is appended to /E/.
>             <insert table here>
>           + Otherwise, if /C/ is not U+0020 SPACE and
>               # /CE/ is a Unicode encoding and /C/ corresponds to a
>                 UCS scalar value whose Unicode property
>                 General_Category has a value in the groups Separator
>                 (Z) or Other (C), as described by table 12 of UAX#44,
>                 or
>               # /CE/ is not a Unicode encoding and /C/ is one of an
>                 implementation-defined set of separator or
>                 non-printable characters
>             then the sequence \u{/simple-hexadecimal-digit-sequence/}
>             is appended to /E/ where
>             /simple-hexadecimal-digit-sequence/ is the code point
>             value of /C/ formatted as-if by a standard format
>             specifier ([format.string.std]) of "{x}".
>           + Otherwise, /C/ is appended to /E/.
>       o Otherwise /X/ is a sequence of ill-formed code units. For each
>         code unit /U/, the sequence
>         \x{/simple-hexadecimal-digit-sequence/} is appended to /E/
>         where /simple-hexadecimal-digit-sequence/ is the code point
>         value of /U/ formatted as-if by a standard format specifier
>         ([format.string.std]) of "{x}".
>   * U+0022 QUOTATION MARK (") is appended to /E/.
>
> Please offer your thoughts, I would like to discuss this in tomorrow's
> SG16 meeting.
>
> Tom.
>
>

Received on 2022-04-27 18:59:52