Re: Suggested wording change for non-Unicode cases in P2286R7: Formatting Ranges

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Fri, 29 Apr 2022 16:54:57 +0200
On Fri, Apr 29, 2022 at 4:20 PM Victor Zverovich via SG16 <
sg16_at_[hidden]> wrote:

> The format string in
>
> formatted as-if by a format string ([format.string.general]
> <http://eel.is/c++draft/format.string.general>) of "\\u\{{x}\}"
>
> is wrong because { and } should be escaped by doubling not via '\'.
> Moreover, as commented in the meeting I think the old wording that didn't
> use format strings was clearer.
>

+1
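
To make the escaping point concrete, here is a minimal sketch of the output the wording is after: the text \u{ followed by lowercase hexadecimal digits and a closing }. The helper name escape_scalar is hypothetical and not part of P2286. It uses std::to_chars so that it builds as C++17; the comment shows what I believe the equivalent std::format string would be, with the literal braces doubled.

```cpp
#include <cassert>
#include <charconv>
#include <string>

// Hypothetical helper (not from the proposal): returns "\u{<hex>}" for a
// UCS scalar value, using lowercase hex digits and no leading zeros.
// With std::format (C++20) the same result would come from the format
// string "\\u{{{:x}}}": "{{" and "}}" are literal braces, "{:x}" is the
// replacement field.
std::string escape_scalar(char32_t value) {
    char digits[8];  // a 32-bit value needs at most 8 hex digits
    auto result = std::to_chars(digits, digits + sizeof digits,
                                static_cast<unsigned long>(value), 16);
    std::string out = "\\u{";
    out.append(digits, result.ptr);
    out += '}';
    return out;
}

// Small usage check; U+200B ZERO WIDTH SPACE has General_Category Cf.
void escape_scalar_examples() {
    assert(escape_scalar(0x200B) == "\\u{200b}");
    assert(escape_scalar(0x7F) == "\\u{7f}");
}
```

Writing the braces as \{ and \} would not work in a format string, since the backslash has no special meaning there.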


> - Victor
>
>
> On Wed, Apr 27, 2022 at 11:59 AM Tom Honermann via SG16 <
> sg16_at_[hidden]> wrote:
>
>> Updated wording intended to address concerns raised by Corentin and
>> Hubert is below. Changes include:
>>
>> - Added wording to address stateful character encodings.
>> - Revised wording to better align with wording in [lex.string]p10
>> <http://eel.is/c++draft/lex.string#10>.
>> - Replaced the incorrect use of "code point" with "code unit" where
>> pointed out by Corentin.
>> - Replaced the {simple-hexadecimal-digit-sequence} formulation with a
>> format string rather than just a standard format specifier.
>>
>> The escaped string *E* representation of a string *S* is constructed by
>> encoding a sequence of characters in the associated character encoding
>> *CE* for charT ([lex.string.literal]) as follows:
>>
>> - U+0022 QUOTATION MARK (") is appended to *E*.
>> - Each code unit sequence *X* in *S* that either encodes a single
>> character or encoding state transition or that is a sequence of ill-formed
>> code units is processed in order as follows:
>> - If *X* encodes a single character *C*, then:
>> - If *C* is one of the UCS scalar values in table X, then the
>> corresponding escape sequence is appended to *E*.
>> <insert table X here>
>> - Otherwise, if *C* is not U+0020 SPACE and
>> - *CE* is a Unicode encoding and C corresponds to a UCS
>> scalar value whose Unicode property General_Category has a
>> value in the groups Separator (Z) or Other (C), as described
>> by table 12 of UAX#44, or
>> - *CE* is not a Unicode encoding and C is one of an
>> implementation-defined set of separator or non-printable characters
>> - then the UCS scalar value corresponding to *C* is appended to
>> *E* formatted as-if by a format string ([format.string.general]
>> <http://eel.is/c++draft/format.string.general>) of "\\u\{{x}\}".
>> When encoding a stateful character encoding, these additions should have no
>> effect on encoding state.
>> - Otherwise, *C* is appended to *E*.
>> - Otherwise, if *X* encodes a state transition, the effect on *E*
>> is unspecified.
>> - Otherwise *X* is a sequence of ill-formed code units. Each code
>> unit *U* is appended to *E* in order formatted as-if by a format
>> string ([format.string.general]
>> <http://eel.is/c++draft/format.string.general>) of "\\x\{{x}\}".
>> When encoding a stateful character encoding, these additions should have no
>> effect on encoding state.
>> - U+0022 QUOTATION MARK (") is appended to *E*.
>>
>> When encoding a stateful character encoding, implementations should first
>> initialize *E* to the initial encoding state. Each subsequent addition
>> to *E* should begin with the final encoding state of the prior addition.
>> *E* should be returned to the initial encoding state after the final
>> quotation mark is appended.
>>
>> Tom.
>> On 4/26/22 4:31 PM, Tom Honermann via SG16 wrote:
>>
>> The proposed wording for [format.string.escaped]p4 in P2286R7:
>> Formatting Ranges
>> <https://wiki.edg.com/pub/Wg21telecons2022/LibraryWorkingGroup/p2286r7.html>
>> currently states:
>>
>> The escaped character and escaped string representations of a character
>> or string in a non-Unicode encoding is unspecified.
>>
>> I would like this to be better specified to ensure implementations behave
>> consistently.
>>
>> The wording below is suggested as a replacement for
>> [format.string.escaped]p2-p4 (link to p2
>> <https://wiki.edg.com/pub/Wg21telecons2022/LibraryWorkingGroup/p2286r7.html#pnum_12>)
>> and is intended to cover both the Unicode and non-Unicode cases.
>>
>> The escaped string *E* representation of a string *S* is constructed by
>> encoding a sequence of characters in the associated character encoding
>> *CE* for charT ([lex.string.literal]) as follows:
>>
>> - *E* is initialized with U+0022 QUOTATION MARK (").
>> - For each code unit sequence *X* in *S* that either encodes a single
>> character or that is a sequence of ill-formed code units:
>> - If *X* encodes a single character *C*, then:
>> - If *C* is in the table below, then its corresponding
>> two-character escape sequence is appended to *E*.
>> <insert table here>
>> - Otherwise, if *C* is not U+0020 SPACE and
>> - *CE* is a Unicode encoding and C corresponds to a UCS
>> scalar value whose Unicode property General_Category has a
>> value in the groups Separator (Z) or Other (C), as described
>> by table 12 of UAX#44, or
>> - *CE* is not a Unicode encoding and C is one of an
>> implementation-defined set of separator or non-printable characters
>> - then the sequence \u{*simple-hexadecimal-digit-sequence*} is
>> appended to *E* where *simple-hexadecimal-digit-sequence* is
>> the code point value of *C* formatted as-if by a standard
>> format specifier ([format.string.std]) of "{x}".
>> - Otherwise, *C* is appended to *E*.
>> - Otherwise *X* is a sequence of ill-formed code units. For
>> each code unit *U*, the sequence \x{
>> *simple-hexadecimal-digit-sequence*} is appended to *E* where
>> *simple-hexadecimal-digit-sequence* is the code point value of *U*
>> formatted as-if by a standard format specifier ([format.string.std]) of
>> "{x}".
>> - U+0022 QUOTATION MARK (") is appended to *E*.
>>
>> Please offer your thoughts, I would like to discuss this in tomorrow's
>> SG16 meeting.
>>
>> Tom.
>>
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>
>
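
For reference, the steps in the quoted wording can be sketched for the simplest non-Unicode case. This is only an illustration under stated assumptions, not the proposed specification: it assumes the encoding is plain ASCII (so it is stateless), treats every byte at or above 0x80 as an ill-formed code unit, picks the control characters 0x00-0x1F and 0x7F as the implementation-defined non-printable set, and assumes the escape table holds \t, \n, \r, \" and \\. The names escaped_string and append_hex_escape are hypothetical.

```cpp
#include <cassert>
#include <charconv>
#include <string>
#include <string_view>

// Appends prefix, the lowercase hex digits of value, and "}" to out.
void append_hex_escape(std::string& out, std::string_view prefix, unsigned value) {
    char digits[8];  // an unsigned 32-bit value needs at most 8 hex digits
    auto result = std::to_chars(digits, digits + sizeof digits, value, 16);
    out += prefix;
    out.append(digits, result.ptr);
    out += '}';
}

// Sketch of the escaped string representation for a char string.
// ASSUMPTIONS (not from the wording): the encoding is plain ASCII, bytes
// >= 0x80 are ill-formed code units, and the non-printable set is the
// control characters 0x00-0x1F and 0x7F.
std::string escaped_string(std::string_view s) {
    std::string e = "\"";  // opening U+0022 QUOTATION MARK
    for (unsigned char c : s) {
        switch (c) {
            // Assumed contents of the escape table.
            case '\t': e += "\\t";  continue;
            case '\n': e += "\\n";  continue;
            case '\r': e += "\\r";  continue;
            case '"':  e += "\\\""; continue;
            case '\\': e += "\\\\"; continue;
            default: break;
        }
        if (c >= 0x80) {
            append_hex_escape(e, "\\x{", c);  // ill-formed code unit
        } else if (c < 0x20 || c == 0x7F) {
            append_hex_escape(e, "\\u{", c);  // non-printable character
        } else {
            e += static_cast<char>(c);        // printable, including space
        }
    }
    e += '"';  // closing U+0022 QUOTATION MARK
    return e;
}
```

A stateful encoding would additionally need the shift-state handling described in the wording, which this sketch leaves out entirely.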

Received on 2022-04-29 14:55:09