Re: Suggested wording change for non-Unicode cases in P2286R7: Formatting Ranges

From: Tom Honermann <tom_at_[hidden]>
Date: Fri, 29 Apr 2022 11:47:04 -0400
On 4/29/22 10:56 AM, Jens Maurer wrote:
> On 29/04/2022 16.20, Victor Zverovich via SG16 wrote:
>> The format string in
>>
>> formatted as-if by a format string ([format.string.general] <http://eel.is/c++draft/format.string.general>) of "\\u\{{x}\}"
>>
>> is wrong because { and } should be escaped by doubling not via '\'. Moreover, as commented in the meeting I think the old wording that didn't use format strings was clearer.
> Agreed with the latter part.
Thank you. That is at least three people reporting they found the prior
wording for the hex formatting to be more clear, so I'll restore that.
>
> Also, the string as given needs string-literal interpretation,
> which may or may not be obvious.

Yes, I had thought about that, but hoped it was clear enough. Clearly it
wasn't. Thank you.

Tom.

>
> Jens
>
>
>
>> - Victor
>>
>>
>> On Wed, Apr 27, 2022 at 11:59 AM Tom Honermann via SG16 <sg16_at_[hidden]> wrote:
>>
>> Updated wording intended to address concerns raised by Corentin and Hubert is below. Changes include:
>>
>> * Added wording to address stateful character encodings.
>> * Revised wording to better align with wording in [lex.string]p10 <http://eel.is/c++draft/lex.string#10>.
>> * Replaced the incorrect use of "code point" with "code unit" where pointed out by Corentin.
>> * Replaced the {simple-hexadecimal-digit-sequence} formulation with a format string rather than just a standard format specifier.
>>
>> The escaped string /E/ representation of a string /S/ is constructed by encoding a sequence of characters in the associated character encoding /CE/ for charT ([lex.string.literal]) as follows:
>>
>> * U+0022 QUOTATION MARK (") is appended to /E/.
>> * Each code unit sequence /X/ in /S/ that either encodes a single character or encoding state transition or that is a sequence of ill-formed code units is processed in order as follows:
>>   o If /X/ encodes a single character /C/, then:
>>     + If /C/ is one of the UCS scalar values in table X, then the corresponding escape sequence is appended to /E/.
>>       <insert table X here>
>>     + Otherwise, if /C/ is not U+0020 SPACE and
>>       # /CE/ is a Unicode encoding and C corresponds to a UCS scalar value whose Unicode property General_Category has a value in the groups Separator (Z) or Other (C), as described by table 12 of UAX#44, or
>>       # /CE/ is not a Unicode encoding and C is one of an implementation-defined set of separator or non-printable characters
>>       then the UCS scalar value corresponding to /C/ is appended to /E/ formatted as-if by a format string ([format.string.general] <http://eel.is/c++draft/format.string.general>) of "\\u\{{x}\}". When encoding a stateful character encoding, these additions should have no effect on encoding state.
>>     + Otherwise, /C/ is appended to /E/.
>>   o Otherwise, if /X/ encodes a state transition, the effect on /E/ is unspecified.
>>   o Otherwise /X/ is a sequence of ill-formed code units. Each code unit /U/ is appended to /E/ in order formatted as-if by a format string ([format.string.general] <http://eel.is/c++draft/format.string.general>) of "\\x\{{x}\}". When encoding a stateful character encoding, these additions should have no effect on encoding state.
>> * U+0022 QUOTATION MARK (") is appended to /E/.
>>
>> When encoding a stateful character encoding, implementations should first initialize /E/ to the initial encoding state. Each subsequent addition to /E/ should begin with the final encoding state of the prior addition. /E/ should be returned to the initial encoding state after the final quotation mark is appended.
>>
>> Tom.
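
The construction quoted above can be sketched for the non-Unicode case roughly as follows. This is only an illustration under stated assumptions: charT is char, /CE/ is plain ASCII (stateless, one code unit per character), the implementation-defined set of non-printable characters is taken to be the C0 controls plus DEL, and table X is assumed to contain \t, \n, \r, \" and \\ (the table itself is not reproduced in this message). Bytes with the high bit set are not valid ASCII, so the sketch treats them as ill-formed code units. The names append_hex_escape and escape_ascii_string are hypothetical.

```cpp
#include <charconv>
#include <string>
#include <string_view>

// Appends prefix{hex} to out, e.g. \u{7f} or \x{80} (lowercase hex digits).
inline void append_hex_escape(std::string& out, std::string_view prefix,
                              unsigned value) {
    char buf[8];
    auto res = std::to_chars(buf, buf + sizeof buf, value, 16);
    out += prefix;
    out += '{';
    out.append(buf, res.ptr);
    out += '}';
}

// Sketch of the escaped string construction, assuming an ASCII encoding.
inline std::string escape_ascii_string(std::string_view s) {
    std::string e;
    e += '"';                                  // opening quotation mark
    for (unsigned char u : s) {
        switch (u) {                           // assumed contents of table X
        case '\t': e += "\\t";  continue;
        case '\n': e += "\\n";  continue;
        case '\r': e += "\\r";  continue;
        case '"':  e += "\\\""; continue;
        case '\\': e += "\\\\"; continue;
        }
        if (u >= 0x80) {
            append_hex_escape(e, "\\x", u);    // ill-formed code unit
        } else if (u < 0x20 || u == 0x7f) {
            append_hex_escape(e, "\\u", u);    // assumed non-printable set
        } else {
            e += static_cast<char>(u);         // includes U+0020 SPACE
        }
    }
    e += '"';                                  // closing quotation mark
    return e;
}
```

Stateful encodings are deliberately out of scope here; handling them would require tracking the encoding state across additions as described in the final paragraph of the wording.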
>>
>> On 4/26/22 4:31 PM, Tom Honermann via SG16 wrote:
>>> The proposed wording for [format.string.escaped]p4 in P2286R7: Formatting Ranges <https://wiki.edg.com/pub/Wg21telecons2022/LibraryWorkingGroup/p2286r7.html> currently states:
>>>
>>>> The escaped character and escaped string representations of a character or string in a non-Unicode encoding is unspecified.
>>> I would like this to be better specified to ensure implementations behave consistently.
>>>
>>> The wording below is suggested as a replacement for [format.string.escaped]p2-p4 (link to p2 <https://wiki.edg.com/pub/Wg21telecons2022/LibraryWorkingGroup/p2286r7.html#pnum_12>) and is intended to cover both the Unicode and non-Unicode cases.
>>>
>>> The escaped string /E/ representation of a string /S/ is constructed by encoding a sequence of characters in the associated character encoding /CE/ for charT ([lex.string.literal]) as follows:
>>>
>>> * /E/ is initialized with U+0022 QUOTATION MARK (").
>>> * For each code unit sequence /X/ in /S/ that either encodes a single character or that is a sequence of ill-formed code units:
>>>   o If /X/ encodes a single character /C/, then:
>>>     + If /C/ is in the table below, then its corresponding two-character escape sequence is appended to /E/.
>>>       <insert table here>
>>>     + Otherwise, if /C/ is not U+0020 SPACE and
>>>       # /CE/ is a Unicode encoding and C corresponds to a UCS scalar value whose Unicode property General_Category has a value in the groups Separator (Z) or Other (C), as described by table 12 of UAX#44, or
>>>       # /CE/ is not a Unicode encoding and C is one of an implementation-defined set of separator or non-printable characters
>>>       then the sequence \u{/simple-hexadecimal-digit-sequence/} is appended to /E/ where /simple-hexadecimal-digit-sequence/ is the code point value of /C/ formatted as-if by a standard format specifier ([[format.string.std]]) of "{x}".
>>>     + Otherwise, /C/ is appended to /E/.
>>>   o Otherwise /X/ is a sequence of ill-formed code units. For each code unit /U/, the sequence \x{/simple-hexadecimal-digit-sequence/} is appended to /E/ where /simple-hexadecimal-digit-sequence/ is the code point value of /U/ formatted as-if by a standard format specifier ([[format.string.std]]) of "{x}".
>>> * U+0022 QUOTATION MARK (") is appended to /E/.
>>>
>>> Please offer your thoughts, I would like to discuss this in tomorrow's SG16 meeting.
>>>
>>> Tom.
>>>
>>>
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>
>>

Received on 2022-04-29 15:47:08