Date: Wed, 27 Apr 2022 14:59:48 -0400
Updated wording intended to address concerns raised by Corentin and
Hubert is below. Changes include:
* Added wording to address stateful character encodings.
* Revised wording to better align with wording in [lex.string]p10
<http://eel.is/c++draft/lex.string#10>.
* Replaced the incorrect use of "code point" with "code unit" where
pointed out by Corentin.
* Replaced the {simple-hexadecimal-digit-sequence} formulation with one
that uses a format string rather than just a standard format specifier.
The escaped string /E/ representation of a string /S/ is constructed by
encoding a sequence of characters in the associated character encoding
/CE/ for charT ([lex.string.literal]) as follows:
* U+0022 QUOTATION MARK (") is appended to /E/.
* Each code unit sequence /X/ in /S/ that either encodes a single
character or encoding state transition or that is a sequence of
ill-formed code units is processed in order as follows:
o If /X/ encodes a single character /C/, then:
+ If /C/ is one of the UCS scalar values in table X, then the
corresponding escape sequence is appended to /E/.
<insert table X here>
+ Otherwise, if /C/ is not U+0020 SPACE and
# /CE/ is a Unicode encoding and /C/ corresponds to a UCS
scalar value whose Unicode property General_Category has
a value in the groups Separator (Z) or Other (C), as
described by table 12 of UAX#44, or
# /CE/ is not a Unicode encoding and /C/ is one of an
implementation-defined set of separator or non-printable
characters
+ then the UCS scalar value corresponding to /C/ is appended
to /E/ formatted as if by a format string
([format.string.general]
<http://eel.is/c++draft/format.string.general>) of
"\\u\{{x}\}". When encoding a stateful character encoding,
these additions should have no effect on encoding state.
+ Otherwise, /C/ is appended to /E/.
o Otherwise, if /X/ encodes a state transition, the effect on /E/
is unspecified.
o Otherwise /X/ is a sequence of ill-formed code units. Each code
unit /U/ is appended to /E/ in order formatted as if by a format
string ([format.string.general]
<http://eel.is/c++draft/format.string.general>) of "\\x\{{x}\}".
When encoding a stateful character encoding, these additions
should have no effect on encoding state.
* U+0022 QUOTATION MARK (") is appended to /E/.
When encoding a stateful character encoding, implementations should
first initialize /E/ to the initial encoding state. Each subsequent
addition to /E/ should begin with the final encoding state of the prior
addition. /E/ should be returned to the initial encoding state after the
final quotation mark is appended.
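
For illustration only, the steps above can be sketched in C++ roughly as
follows. This is a minimal sketch, not proposed wording or a reference
implementation, and it makes several assumptions: /CE/ is taken to be UTF-8
(so encoding state transitions never occur), all function names are
hypothetical, the entries standing in for table X are assumed because the
table is not reproduced in this message, the General_Category test is a
deliberately incomplete stand-in, and std::snprintf with %x stands in for
the hexadecimal formatting that the wording expresses via a format string.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical stand-in for "table X". The table is elided in the wording,
// so these entries are ASSUMED for illustration only.
static const char* table_escape(char32_t c) {
    switch (c) {
        case U'\t': return "\\t";
        case U'\n': return "\\n";
        case U'\r': return "\\r";
        case U'"':  return "\\\"";
        case U'\\': return "\\\\";
        default:    return nullptr;
    }
}

// Hypothetical and deliberately INCOMPLETE stand-in for the test
// "General_Category is in Separator (Z) or Other (C)". A real
// implementation would consult Unicode character data.
static bool is_separator_or_other(char32_t c) {
    return c < 0x20 || (c >= 0x7F && c <= 0xA0) || c == 0x2028 || c == 0x2029;
}

// Appends prefix{hex}, e.g. \u{1b} or \x{ff}, using lowercase hex digits.
static void append_hex(std::string& e, const char* prefix, unsigned long v) {
    char buf[32];
    std::snprintf(buf, sizeof buf, "%s{%lx}", prefix, v);
    e += buf;
}

// Decodes one UTF-8 sequence starting at s[i]. Returns its length in code
// units, or 0 if the code units at that position are ill-formed.
static std::size_t decode_utf8(std::string_view s, std::size_t i, char32_t& out) {
    auto byte = [&](std::size_t k) { return static_cast<unsigned char>(s[k]); };
    unsigned char b0 = byte(i);
    std::size_t len = 0;
    char32_t cp = 0, min = 0;
    if (b0 < 0x80) { out = b0; return 1; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
    else return 0;                                   // stray continuation or invalid lead
    if (i + len > s.size()) return 0;                // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
        if ((byte(i + k) & 0xC0) != 0x80) return 0;  // not a continuation unit
        cp = (cp << 6) | (byte(i + k) & 0x3F);
    }
    // Reject overlong forms, out-of-range values, and surrogates.
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    out = cp;
    return len;
}

// Builds the escaped string E for S, assuming S is (possibly ill-formed) UTF-8.
std::string escape_string(std::string_view s) {
    std::string e = "\"";                            // opening quotation mark
    for (std::size_t i = 0; i < s.size();) {
        char32_t c = 0;
        std::size_t len = decode_utf8(s, i, c);
        if (len == 0) {                              // ill-formed code unit U
            append_hex(e, "\\x", static_cast<unsigned char>(s[i]));
            ++i;
            continue;
        }
        if (const char* esc = table_escape(c)) {
            e += esc;                                // table X escape sequence
        } else if (c != U' ' && is_separator_or_other(c)) {
            append_hex(e, "\\u", c);                 // scalar value escape
        } else {
            e.append(s.substr(i, len));              // C is appended unchanged
        }
        i += len;
    }
    e += '"';                                        // closing quotation mark
    return e;
}
```

With these assumptions, a tab is emitted as \t, an ESC character (U+001B) as
\u{1b}, and a lone 0xFF code unit as \x{ff}, while printable characters
(including U+0020 SPACE) pass through unchanged. The sketch escapes
ill-formed input one code unit at a time, which yields the same output as
escaping each code unit of a longer ill-formed sequence in order. Stateful
encodings are not modelled.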
Tom.
On 4/26/22 4:31 PM, Tom Honermann via SG16 wrote:
>
> The proposed wording for [format.string.escaped]p4 in P2286R7:
> Formatting Ranges
> <https://wiki.edg.com/pub/Wg21telecons2022/LibraryWorkingGroup/p2286r7.html>
> currently states:
>
>> The escaped character and escaped string representations of a
>> character or string in a non-Unicode encoding is unspecified.
> I would like this to be better specified to ensure implementations
> behave consistently.
>
> The wording below is suggested as a replacement for
> [format.string.escaped]p2-p4 (link to p2
> <https://wiki.edg.com/pub/Wg21telecons2022/LibraryWorkingGroup/p2286r7.html#pnum_12>)
> and is intended to cover both the Unicode and non-Unicode cases.
>
> The escaped string /E/ representation of a string /S/ is constructed
> by encoding a sequence of characters in the associated character
> encoding /CE/ for charT ([lex.string.literal]) as follows:
>
> * /E/ is initialized with U+0022 QUOTATION MARK (").
> * For each code unit sequence /X/ in /S/ that either encodes a
> single character or that is a sequence of ill-formed code units:
> o If /X/ encodes a single character /C/, then:
> + If /C/ is in the table below, then its corresponding
> two-character escape sequence is appended to /E/.
> <insert table here>
> + Otherwise, if /C/ is not U+0020 SPACE and
> # /CE/ is a Unicode encoding and C corresponds to a UCS
> scalar value whose Unicode property General_Category
> has a value in the groups Separator (Z) or Other (C),
> as described by table 12 of UAX#44, or
> # /CE/ is not a Unicode encoding and C is one of an
> implementation-defined set of separator or
> non-printable characters
> + then the sequence \u{/simple-hexadecimal-digit-sequence/}
> is appended to /E/ where
> /simple-hexadecimal-digit-sequence/ is the code point
> value of /C/ formatted as-if by a standard format
> specifier ([[format.string.std]]) of "{x}".
> + Otherwise, /C/ is appended to /E/.
> o Otherwise /X/ is a sequence of ill-formed code units. For each
> code unit /U/, the sequence
> \x{/simple-hexadecimal-digit-sequence/} is appended to /E/
> where /simple-hexadecimal-digit-sequence/ is the code point
> value of /U/ formatted as-if by a standard format specifier
> ([[format.string.std]]) of "{x}".
> * U+0022 QUOTATION MARK (") is appended to /E/.
>
> Please offer your thoughts, I would like to discuss this in tomorrow's
> SG16 meeting.
>
> Tom.
>
>
Received on 2022-04-27 18:59:52