ISOCPP sg16 List: Re: Suggested wording change for non-Unicode cases in P2286R7: Formatting Ranges

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Tue, 26 Apr 2022 20:39:52 -0400

On Tue, Apr 26, 2022 at 4:31 PM Tom Honermann via SG16 <
sg16_at_[hidden]> wrote:

> The proposed wording for [format.string.escaped]p4 in P2286R7: Formatting
> Ranges
> <https://wiki.edg.com/pub/Wg21telecons2022/LibraryWorkingGroup/p2286r7.html>
> currently states:
>
> The escaped character and escaped string representations of a character or
> string in a non-Unicode encoding is unspecified.
>
> I would like this to be better specified to ensure implementations behave
> consistently.
>
> The wording below is suggested as a replacement for
> [format.string.escaped]p2-p4 (link to p2
> <https://wiki.edg.com/pub/Wg21telecons2022/LibraryWorkingGroup/p2286r7.html#pnum_12>)
> and is intended to cover both the Unicode and non-Unicode cases.
>
> The escaped string *E* representation of a string *S* is constructed by
> encoding a sequence of characters in the associated character encoding
> *CE* for charT ([lex.string.literal]) as follows:
>
> - *E* is initialized with U+0022 QUOTATION MARK (").
> - For each code unit sequence *X* in *S* that either encodes a single
>   character or that is a sequence of ill-formed code units:
>   - If *X* encodes a single character *C*, then:
>     - If *C* is in the table below, then its corresponding
>       two-character escape sequence is appended to *E*.
>       <insert table here>
>     - Otherwise, if *C* is not U+0020 SPACE and
>       - *CE* is a Unicode encoding and *C* corresponds to a UCS
>         scalar value whose Unicode property General_Category has a
>         value in the groups Separator (Z) or Other (C), as described
>         by table 12 of UAX#44, or
>       - *CE* is not a Unicode encoding and *C* is one of an
>         implementation-defined set of separator or non-printable characters
>

I would prefer more emphasis on the "an" (which is much different from
"the" in this context): "a set, implementation-defined for this purpose, of
[ ... ]".

What should Windows implementations do with codepages that overlay graphic
characters with control characters?

>
>       then the sequence \u{*simple-hexadecimal-digit-sequence*} is
>       appended to *E* where *simple-hexadecimal-digit-sequence* is the
>       code point value of *C* formatted as-if by a standard format
>       specifier ([format.string.std]) of "{x}".
>

Okay: This requires conversion-to-Unicode to be implemented in the runtime
library only for the limited set of space and non-printable characters.

>
>     - Otherwise, *C* is appended to *E*.
>   - Otherwise *X* is a sequence of ill-formed code units. For each
>     code unit *U*, the sequence \x{*simple-hexadecimal-digit-sequence*}
>     is appended to *E* where *simple-hexadecimal-digit-sequence* is the
>     code point value of *U* formatted as-if by a standard format
>     specifier ([format.string.std]) of "{x}".
> - U+0022 QUOTATION MARK (") is appended to *E*.
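
For reference while reading the steps above, here is a minimal sketch of how they might look for a narrow string. It is not the proposed implementation. It assumes, purely for illustration, that *CE* is plain ASCII (so every byte at or above 0x80 counts as an ill-formed code unit and each ASCII code point equals its UCS value), that the "implementation-defined set" is the ASCII control characters, and that the elided table holds the five entries shown. The names `escape_string_ascii` and `append_hex_escape` are hypothetical.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical helper: appends "\u{...}" or "\x{...}" using lowercase hex
// digits, which is what a "{:x}"-style format specifier would produce.
static void append_hex_escape(std::string& out, char kind, unsigned value) {
    assert(kind == 'u' || kind == 'x');
    char buf[16];
    std::snprintf(buf, sizeof buf, "\\%c{%x}", kind, value);
    out += buf;
}

// Sketch of the quoted procedure, ASSUMING CE is plain ASCII.
std::string escape_string_ascii(std::string_view s) {
    std::string e = "\"";                      // E starts with QUOTATION MARK
    for (unsigned char u : s) {
        if (u >= 0x80) {                       // assumed ill-formed in ASCII
            append_hex_escape(e, 'x', u);
            continue;
        }
        switch (u) {                           // assumed contents of the table
        case '\t': e += "\\t";  break;
        case '\n': e += "\\n";  break;
        case '\r': e += "\\r";  break;
        case '"':  e += "\\\""; break;
        case '\\': e += "\\\\"; break;
        default:
            if (u < 0x20 || u == 0x7F)         // assumed non-printable set
                append_hex_escape(e, 'u', u);
            else
                e += static_cast<char>(u);     // otherwise C is appended as-is
        }
    }
    e += '"';                                  // closing QUOTATION MARK
    return e;
}
```

Under these assumptions, U+0020 SPACE passes through unescaped, a byte such as 0x01 becomes `\u{1}`, and a byte such as 0xFF becomes `\x{ff}`.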
>
> Please offer your thoughts, I would like to discuss this in tomorrow's
> SG16 meeting.
>
The above may actively prevent translation-as-source of escaped strings
from reproducing the same code unit sequence in cases of stateful encodings
for strings due to "unnecessary" shift sequences. I believe the most likely
practical resolution is for implementations to apply an encoding that
differs from the literal/wide literal encoding (but is in a base "family").
That is, shift sequences would be treated as a sequence of characters in
their own right, and the processing would operate as if the initial shift
state is active.

I believe the above wording does not accommodate that solution (and that
solution should be allowed).
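
As a rough illustration of the concern, the sketch below assumes an ISO-2022-JP-like encoding in which the bytes ESC ( B select ASCII, so that this sequence is redundant when ASCII is already the active (initial) shift state. Both function names are hypothetical, and the code handles only this one shift sequence. The first function models a character-level view, in which the redundant shift contributes no character. The second models the resolution described above, in which ESC is treated as a character in its own right.

```cpp
#include <string>
#include <string_view>

// Character-level view (assumed): while in the initial ASCII shift state,
// ESC ( B re-selects ASCII and yields no character, so it disappears.
std::string characters_only(std::string_view s) {
    constexpr std::string_view to_ascii = "\x1b(B";
    std::string out;
    while (!s.empty()) {
        if (s.substr(0, to_ascii.size()) == to_ascii) {
            s.remove_prefix(to_ascii.size());  // redundant shift is dropped
            continue;
        }
        out += s.front();
        s.remove_prefix(1);
    }
    return out;
}

// Code-unit-level view: ESC is escaped like any other non-printable
// character, and the bytes that follow are processed as ordinary characters.
std::string shift_preserving(std::string_view s) {
    std::string out;
    for (unsigned char u : s) {
        if (u == 0x1b)
            out += "\\u{1b}";
        else
            out += static_cast<char>(u);
    }
    return out;
}
```

With the input ESC ( B A, the character-level view yields the same text as for a plain A, so the original code unit sequence cannot be recovered from it, whereas the shift-preserving view keeps the two inputs distinct.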

> Tom.

Received on 2022-04-27 00:40:23