ISOCPP sg16 List: Re: Suggested wording change for non-Unicode cases in P2286R7: Formatting Ranges

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Fri, 29 Apr 2022 16:56:43 +0200

On 29/04/2022 16.20, Victor Zverovich via SG16 wrote:
> The format string in
>
> formatted as-if by a format string ([format.string.general] <http://eel.is/c++draft/format.string.general>) of "\\u\{{x}\}"
>
> is wrong because { and } should be escaped by doubling, not via '\'. Moreover, as I commented in the meeting, I think the old wording that didn't use format strings was clearer.

Agreed with the latter part.

Also, the string as given needs string-literal interpretation,
which may or may not be obvious.

Jens

> - Victor
>
>
> On Wed, Apr 27, 2022 at 11:59 AM Tom Honermann via SG16 <sg16_at_[hidden]> wrote:
>
> Updated wording intended to address concerns raised by Corentin and Hubert is below. Changes include:
>
> * Added wording to address stateful character encodings.
> * Revised wording to better align with wording in [lex.string]p10 <http://eel.is/c++draft/lex.string#10>.
> * Replaced the incorrect use of "code point" with "code unit" where pointed out by Corentin.
> * Replaced the {simple-hexadecimal-digit-sequence} formulation with a format string rather than just a standard format specifier.
>
> The escaped string /E/ representation of a string /S/ is constructed by encoding a sequence of characters in the associated character encoding /CE/ for charT ([lex.string.literal]) as follows:
>
> * U+0022 QUOTATION MARK (") is appended to /E/.
> * Each code unit sequence /X/ in /S/ that either encodes a single character or encoding state transition or that is a sequence of ill-formed code units is processed in order as follows:
>     o If /X/ encodes a single character /C/, then:
>         + If /C/ is one of the UCS scalar values in table X, then the corresponding escape sequence is appended to /E/.
>           <insert table X here>
>         + Otherwise, if /C/ is not U+0020 SPACE and
>             # /CE/ is a Unicode encoding and /C/ corresponds to a UCS scalar value whose Unicode property General_Category has a value in the groups Separator (Z) or Other (C), as described by table 12 of UAX#44, or
>             # /CE/ is not a Unicode encoding and /C/ is one of an implementation-defined set of separator or non-printable characters
>           then the UCS scalar value corresponding to /C/ is appended to /E/ formatted as-if by a format string ([format.string.general] <http://eel.is/c++draft/format.string.general>) of "\\u\{{x}\}". When encoding a stateful character encoding, these additions should have no effect on encoding state.
>         + Otherwise, /C/ is appended to /E/.
>     o Otherwise, if /X/ encodes a state transition, the effect on /E/ is unspecified.
>     o Otherwise /X/ is a sequence of ill-formed code units. Each code unit /U/ is appended to /E/ in order formatted as-if by a format string ([format.string.general] <http://eel.is/c++draft/format.string.general>) of "\\x\{{x}\}". When encoding a stateful character encoding, these additions should have no effect on encoding state.
> * U+0022 QUOTATION MARK (") is appended to /E/.
>
> When encoding a stateful character encoding, implementations should first initialize /E/ to the initial encoding state. Each subsequent addition to /E/ should begin with the final encoding state of the prior addition. /E/ should be returned to the initial encoding state after the final quotation mark is appended.
>
> Tom.
>
> On 4/26/22 4:31 PM, Tom Honermann via SG16 wrote:
>>
>> The proposed wording for [format.string.escaped]p4 in P2286R7: Formatting Ranges <https://wiki.edg.com/pub/Wg21telecons2022/LibraryWorkingGroup/p2286r7.html> currently states:
>>
>>> The escaped character and escaped string representations of a character or string in a non-Unicode encoding is unspecified.
>> I would like this to be better specified to ensure implementations behave consistently.
>>
>> The wording below is suggested as a replacement for [format.string.escaped]p2-p4 (link to p2 <https://wiki.edg.com/pub/Wg21telecons2022/LibraryWorkingGroup/p2286r7.html#pnum_12>) and is intended to cover both the Unicode and non-Unicode cases.
>>
>> The escaped string /E/ representation of a string /S/ is constructed by encoding a sequence of characters in the associated character encoding /CE/ for charT ([lex.string.literal]) as follows:
>>
>> * /E/ is initialized with U+0022 QUOTATION MARK (").
>> * For each code unit sequence /X/ in /S/ that either encodes a single character or that is a sequence of ill-formed code units:
>>     o If /X/ encodes a single character /C/, then:
>>         + If /C/ is in the table below, then its corresponding two-character escape sequence is appended to /E/.
>>           <insert table here>
>>         + Otherwise, if /C/ is not U+0020 SPACE and
>>             # /CE/ is a Unicode encoding and /C/ corresponds to a UCS scalar value whose Unicode property General_Category has a value in the groups Separator (Z) or Other (C), as described by table 12 of UAX#44, or
>>             # /CE/ is not a Unicode encoding and /C/ is one of an implementation-defined set of separator or non-printable characters
>>           then the sequence \u{/simple-hexadecimal-digit-sequence/} is appended to /E/ where /simple-hexadecimal-digit-sequence/ is the code point value of /C/ formatted as-if by a standard format specifier ([format.string.std]) of "{x}".
>>         + Otherwise, /C/ is appended to /E/.
>>     o Otherwise /X/ is a sequence of ill-formed code units. For each code unit /U/, the sequence \x{/simple-hexadecimal-digit-sequence/} is appended to /E/ where /simple-hexadecimal-digit-sequence/ is the code point value of /U/ formatted as-if by a standard format specifier ([format.string.std]) of "{x}".
>> * U+0022 QUOTATION MARK (") is appended to /E/.
>>
>> Please offer your thoughts; I would like to discuss this in tomorrow's SG16 meeting.
>>
>> Tom.
>>
>>

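The escaping procedure in the quoted wording can be sketched for one narrow case. This is an illustration under stated assumptions, not proposed wording. It assumes the associated encoding is US-ASCII, which is neither a Unicode encoding nor stateful, so every code unit below 0x80 encodes one character and every code unit at or above 0x80 is treated as ill-formed. The table entries (\t, \n, \r, \", \\) are assumed to match those of P2286R7, and the "implementation-defined set of separator or non-printable characters" is assumed to be the ASCII control characters. The names append_hex_escape and escaped_string are hypothetical. Hex digits are produced with std::to_chars so that the sketch does not depend on the format string under discussion.

```cpp
#include <cassert>
#include <charconv>
#include <string>
#include <string_view>
#include <system_error>

// Hypothetical helper: appends prefix{hex} to e, e.g. \u{7f} or \x{80}.
void append_hex_escape(std::string& e, std::string_view prefix, unsigned value) {
    char buf[8];  // an unsigned 32-bit value needs at most 8 hex digits
    auto res = std::to_chars(buf, buf + sizeof buf, value, 16);
    assert(res.ec == std::errc{});
    e += prefix;
    e += '{';
    e.append(buf, res.ptr);  // lowercase digits, no leading zeros
    e += '}';
}

// Hypothetical sketch of the escaped string E for a string S,
// under the assumption that the encoding is US-ASCII.
std::string escaped_string(std::string_view s) {
    std::string e = "\"";                      // opening QUOTATION MARK
    for (unsigned char u : s) {
        switch (u) {                           // assumed table entries
        case '\t': e += "\\t";  continue;
        case '\n': e += "\\n";  continue;
        case '\r': e += "\\r";  continue;
        case '"':  e += "\\\""; continue;
        case '\\': e += "\\\\"; continue;
        }
        if (u >= 0x80) {
            append_hex_escape(e, "\\x", u);    // ill-formed code unit
        } else if (u < 0x20 || u == 0x7f) {
            append_hex_escape(e, "\\u", u);    // assumed non-printable set
        } else {
            e += static_cast<char>(u);         // appended unchanged (includes SPACE)
        }
    }
    e += '"';                                  // closing QUOTATION MARK
    return e;
}
```

A Unicode encoding would additionally need decoding of multi-unit sequences and a General_Category lookup, and a stateful encoding would need the state handling described in the wording; neither is attempted here.
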
Received on 2022-04-29 14:56:49