ISOCPP sg16 List: Re: Suggested wording change for non-Unicode cases in P2286R7: Formatting Ranges

From: Victor Zverovich <victor.zverovich_at_[hidden]>
Date: Fri, 29 Apr 2022 07:20:20 -0700

The format string in

formatted as-if by a format string ([format.string.general]
<http://eel.is/c++draft/format.string.general>) of "\\u\{{x}\}"

is wrong because { and } must be escaped by doubling ("{{" and "}}"), not
with '\'. Moreover, as I commented in the meeting, I think the old wording
that didn't use format strings was clearer.

- Victor

On Wed, Apr 27, 2022 at 11:59 AM Tom Honermann via SG16 <
sg16_at_[hidden]> wrote:

> Updated wording intended to address concerns raised by Corentin and Hubert
> is below. Changes include:
>
> - Added wording to address stateful character encodings.
> - Revised wording to better align with wording in [lex.string]p10
> <http://eel.is/c++draft/lex.string#10>.
> - Replaced the incorrect use of "code point" with "code unit" where
> pointed out by Corentin.
> - Replaced the {simple-hexadecimal-digit-sequence} formulation with a
> format string rather than just a standard format specifier.
>
> The escaped string *E* representation of a string *S* is constructed by
> encoding a sequence of characters in the associated character encoding
> *CE* for charT ([lex.string.literal]) as follows:
>
> - U+0022 QUOTATION MARK (") is appended to *E*.
> - Each code unit sequence *X* in *S* that either encodes a single
>   character or encoding state transition or that is a sequence of
>   ill-formed code units is processed in order as follows:
>   - If *X* encodes a single character *C*, then:
>     - If *C* is one of the UCS scalar values in table X, then the
>       corresponding escape sequence is appended to *E*.
>       <insert table X here>
>     - Otherwise, if *C* is not U+0020 SPACE and
>       - *CE* is a Unicode encoding and *C* corresponds to a UCS
>         scalar value whose Unicode property General_Category has a
>         value in the groups Separator (Z) or Other (C), as described
>         by table 12 of UAX#44, or
>       - *CE* is not a Unicode encoding and *C* is one of an
>         implementation-defined set of separator or non-printable
>         characters
>
>       then the UCS scalar value corresponding to *C* is appended to
>       *E* formatted as-if by a format string ([format.string.general]
>       <http://eel.is/c++draft/format.string.general>) of "\\u\{{x}\}".
>       When encoding a stateful character encoding, these additions
>       should have no effect on encoding state.
>     - Otherwise, *C* is appended to *E*.
>   - Otherwise, if *X* encodes a state transition, the effect on *E*
>     is unspecified.
>   - Otherwise *X* is a sequence of ill-formed code units. Each code
>     unit *U* is appended to *E* in order formatted as-if by a format
>     string ([format.string.general]
>     <http://eel.is/c++draft/format.string.general>) of "\\x\{{x}\}".
>     When encoding a stateful character encoding, these additions
>     should have no effect on encoding state.
> - U+0022 QUOTATION MARK (") is appended to *E*.
>
> When encoding a stateful character encoding, implementations should first
> initialize *E* to the initial encoding state. Each subsequent addition to
> *E* should begin with the final encoding state of the prior addition. *E*
> should be returned to the initial encoding state after the final quotation
> mark is appended.
>
> Tom.
>
> On 4/26/22 4:31 PM, Tom Honermann via SG16 wrote:
>
> The proposed wording for [format.string.escaped]p4 in P2286R7: Formatting
> Ranges
> <https://wiki.edg.com/pub/Wg21telecons2022/LibraryWorkingGroup/p2286r7.html>
> currently states:
>
> The escaped character and escaped string representations of a character or
> string in a non-Unicode encoding is unspecified.
>
> I would like this to be better specified to ensure implementations behave
> consistently.
>
> The wording below is suggested as a replacement for
> [format.string.escaped]p2-p4 (link to p2
> <https://wiki.edg.com/pub/Wg21telecons2022/LibraryWorkingGroup/p2286r7.html#pnum_12>)
> and is intended to cover both the Unicode and non-Unicode cases.
>
> The escaped string *E* representation of a string *S* is constructed by
> encoding a sequence of characters in the associated character encoding
> *CE* for charT ([lex.string.literal]) as follows:
>
> - *E* is initialized with U+0022 QUOTATION MARK (").
> - For each code unit sequence *X* in *S* that either encodes a single
>   character or that is a sequence of ill-formed code units:
>   - If *X* encodes a single character *C*, then:
>     - If *C* is in the table below, then its corresponding
>       two-character escape sequence is appended to *E*.
>       <insert table here>
>     - Otherwise, if *C* is not U+0020 SPACE and
>       - *CE* is a Unicode encoding and *C* corresponds to a UCS
>         scalar value whose Unicode property General_Category has a
>         value in the groups Separator (Z) or Other (C), as described
>         by table 12 of UAX#44, or
>       - *CE* is not a Unicode encoding and *C* is one of an
>         implementation-defined set of separator or non-printable
>         characters
>
>       then the sequence \u{*simple-hexadecimal-digit-sequence*} is
>       appended to *E* where *simple-hexadecimal-digit-sequence* is the
>       code point value of *C* formatted as-if by a standard format
>       specifier ([format.string.std]) of "{x}".
>     - Otherwise, *C* is appended to *E*.
>   - Otherwise *X* is a sequence of ill-formed code units. For each
>     code unit *U*, the sequence \x{*simple-hexadecimal-digit-sequence*}
>     is appended to *E* where *simple-hexadecimal-digit-sequence* is the
>     code point value of *U* formatted as-if by a standard format
>     specifier ([format.string.std]) of "{x}".
> - U+0022 QUOTATION MARK (") is appended to *E*.
>
> Please offer your thoughts, I would like to discuss this in tomorrow's
> SG16 meeting.
>
> Tom.
>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>

Received on 2022-04-29 14:20:32