Re: Suggested wording change for non-Unicode cases in P2286R7: Formatting Ranges

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 27 Apr 2022 13:47:51 -0400

Thanks, Hubert.

On 4/26/22 8:39 PM, Hubert Tong wrote:
> On Tue, Apr 26, 2022 at 4:31 PM Tom Honermann via SG16
> <sg16_at_[hidden]> wrote:
> The proposed wording for [format.string.escaped]p4 in P2286R7:
> Formatting Ranges
> <https://wiki.edg.com/pub/Wg21telecons2022/LibraryWorkingGroup/p2286r7.html>
> currently states:
>> The escaped character and escaped string representations of a
>> character or string in a non-Unicode encoding is unspecified.
> I would like this to be better specified to ensure implementations
> behave consistently.
> The wording below is suggested as a replacement for
> [format.string.escaped]p2-p4 (link to p2
> <https://wiki.edg.com/pub/Wg21telecons2022/LibraryWorkingGroup/p2286r7.html#pnum_12>)
> and is intended to cover both the Unicode and non-Unicode cases.
> The escaped string /E/ representation of a string /S/ is
> constructed by encoding a sequence of characters in the associated
> character encoding /CE/ for charT ([lex.string.literal]) as follows:
> * /E/ is initialized with U+0022 QUOTATION MARK (").
> * For each code unit sequence /X/ in /S/ that either encodes a
>   single character or that is a sequence of ill-formed code units:
>   o If /X/ encodes a single character /C/, then:
>     + If /C/ is in the table below, then its corresponding
>       two-character escape sequence is appended to /E/.
>       <insert table here>
>     + Otherwise, if /C/ is not U+0020 SPACE and
>       # /CE/ is a Unicode encoding and C corresponds to a
>         UCS scalar value whose Unicode property
>         General_Category has a value in the groups
>         Separator (Z) or Other (C), as described by table
>         12 of UAX#44, or
>       # /CE/ is not a Unicode encoding and C is one of an
>         implementation-defined set of separator or
>         non-printable characters
> I would prefer more emphasis on the "an" (which is much different from
> "the" in this context): "a set, implementation-defined for this
> purpose, of [ ... ]".
Could you elaborate a bit? I would expect there to be one set for each
distinct encoding. Is the concern that this set should not be tied to
the sets of separator or non-printable characters that may be used
elsewhere in the standard, e.g., by isspace()?
> What should Windows implementations do with codepages that overlay
> graphic characters with control characters?
I'm not sure how this is relevant. The string is interpreted using the
literal encoding corresponding to charT, so there should be no ambiguity.
>       then the sequence
>       \u{/simple-hexadecimal-digit-sequence/} is appended to
>       /E/ where /simple-hexadecimal-digit-sequence/ is the
>       code point value of /C/ formatted as-if by a standard
>       format specifier ([[format.string.std]]) of "{x}".
> Okay: This requires conversion-to-Unicode to be implemented in the
> runtime library only for the limited set of space and non-printable
> characters.
Yes, that sounds right.
>     + Otherwise, /C/ is appended to /E/.
>   o Otherwise /X/ is a sequence of ill-formed code units. For
>     each code unit /U/, the sequence
>     \x{/simple-hexadecimal-digit-sequence/} is appended to /E/
>     where /simple-hexadecimal-digit-sequence/ is the code
>     point value of /U/ formatted as-if by a standard format
>     specifier ([[format.string.std]]) of "{x}".
> * U+0022 QUOTATION MARK (") is appended to /E/.
> Please offer your thoughts, I would like to discuss this in
> tomorrow's SG16 meeting.
> The above may actively prevent translation-as-source of escaped
> strings from reproducing the same code unit sequence in cases of
> stateful encodings for strings due to "unnecessary" shift sequences. I
> believe the most likely practical resolution applied by
> implementations would be to choose to apply an encoding that is
> different from the literal/wide literal encoding (but is in a base
> "family"). That is, shift sequences will be considered to be a
> sequence of characters in their own right and the processing operates
> as if the initial shift state is active.
That matches my expectations as well. I'll follow up with suggested
wording to address this and some of Corentin's concerns.
> I believe the above wording does not accommodate that solution (and
> that solution should be allowed).

I agree it should be allowed.
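
For reference, the escaping steps quoted above can be sketched roughly
as follows for the simplest non-Unicode case. This is only an
illustration under loud assumptions, not proposed wording and not any
implementation's behavior: /CE/ is assumed to be plain ASCII (so every
code unit below 0x80 is one character and every code unit at or above
0x80 is treated as ill-formed), the implementation-defined set is
assumed to be the C0 controls plus DEL, the table entries are assumed
to be the ones from the paper (tab, line feed, carriage return,
quotation mark, backslash), and "{x}" is assumed to mean lowercase
hexadecimal without leading zeros. The names escape_string,
in_escape_set, and hex are hypothetical. Stateful encodings and shift
sequences are not handled at all.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Assumption: the "implementation-defined set of separator or
// non-printable characters" is the C0 controls plus DEL.
bool in_escape_set(unsigned char c) {
    return c < 0x20 || c == 0x7F;
}

// Hypothetical helper: lowercase hex without leading zeros.
std::string hex(unsigned value) {
    char buf[16];
    std::snprintf(buf, sizeof buf, "%x", value);
    return buf;
}

// Sketch of the escaped string /E/ for a string /S/, assuming /CE/ is ASCII.
std::string escape_string(const std::string& s) {
    std::string e = "\"";                      // E starts with U+0022
    for (unsigned char u : s) {
        if (u >= 0x80) {                       // ill-formed in ASCII (assumption)
            e += "\\x{" + hex(u) + "}";
            continue;
        }
        switch (u) {                           // two-character escapes (assumed table)
        case '\t': e += "\\t";  break;
        case '\n': e += "\\n";  break;
        case '\r': e += "\\r";  break;
        case '"':  e += "\\\""; break;
        case '\\': e += "\\\\"; break;
        default:
            if (u != ' ' && in_escape_set(u))
                e += "\\u{" + hex(u) + "}";    // code point value in hex
            else
                e += static_cast<char>(u);     // C is appended unchanged
        }
    }
    e += '"';                                  // closing U+0022
    return e;
}
```

With these assumptions, a string holding a, TAB, b is rendered as
"a\tb" (with a literal backslash and t), a lone 0x01 code unit as
"\u{1}", and a lone 0xFF code unit as "\x{ff}".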


> Tom.

Received on 2022-04-27 17:47:52