Re: Suggested wording change for non-Unicode cases in P2286R7: Formatting Ranges

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Wed, 27 Apr 2022 00:48:59 +0200
Thanks, Tom.
A few comments below.

On Tue, Apr 26, 2022 at 10:31 PM Tom Honermann via SG16 <
sg16_at_[hidden]> wrote:

> The proposed wording for [format.string.escaped]p4 in P2286R7: Formatting
> Ranges
> <https://wiki.edg.com/pub/Wg21telecons2022/LibraryWorkingGroup/p2286r7.html>
> currently states:
>
> The escaped character and escaped string representations of a character or
> string in a non-Unicode encoding is unspecified.
>
> I would like this to be better specified to ensure implementations behave
> consistently.
>
> The wording below is suggested as a replacement for
> [format.string.escaped]p2-p4 (link to p2
> <https://wiki.edg.com/pub/Wg21telecons2022/LibraryWorkingGroup/p2286r7.html#pnum_12>)
> and is intended to cover both the Unicode and non-Unicode cases.
>
> The escaped string *E* representation of a string *S* is constructed by
> encoding a sequence of characters in the associated character encoding
> *CE* for charT ([lex.string.literal]) as follows:
>
> - *E* is initialized with U+0022 QUOTATION MARK (").
> - For each code unit sequence *X* in *S* that either encodes a single
> character or that is a sequence of ill-formed code units:
>

I think we should avoid treating ill-formed code units as part of
sequences, because that raises the question of where such a sequence
begins and ends.
I think the initial wording is clearer: reading in order, we consume either
one valid UCS scalar value or one invalid code unit.
Also, "character" is ill-defined; "code point" works, though.
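
To make the "one valid scalar value or one invalid code unit" reading concrete, here is a minimal sketch for the UTF-8 case. The names Step and next_step are hypothetical and appear in neither the paper nor the proposed wording; the sketch only illustrates that an ill-formed code unit is always consumed on its own, so no boundary question arises.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Result of one decoding step (hypothetical type, for illustration only).
struct Step {
    bool valid;           // true: `value` is a Unicode scalar value
    std::uint32_t value;  // scalar value, or the raw code unit if !valid
    std::size_t length;   // code units consumed; always 1 if !valid
};

// Decode one step from the front of a UTF-8 string.
// Precondition: s is not empty.
inline Step next_step(std::string_view s) {
    assert(!s.empty());
    auto byte = [&](std::size_t i) { return static_cast<unsigned char>(s[i]); };
    const unsigned char b0 = byte(0);
    const Step bad{false, b0, 1};  // consume exactly one invalid code unit
    if (b0 < 0x80) return {true, b0, 1};

    std::size_t len = 0;
    std::uint32_t cp = 0;
    std::uint32_t min = 0;  // smallest value allowed for this length
    if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
    else return bad;  // stray continuation byte or invalid lead byte

    if (s.size() < len) return bad;  // truncated sequence
    for (std::size_t i = 1; i < len; ++i) {
        if ((byte(i) & 0xC0) != 0x80) return bad;  // not a continuation byte
        cp = (cp << 6) | (byte(i) & 0x3Fu);
    }
    // Reject overlong forms, surrogates, and values above U+10FFFF.
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return bad;
    return {true, cp, len};
}
```

A caller would loop, advancing by length after each step, and escape either the scalar value or the single code unit.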

>
> - If *X* encodes a single character *C*, then:
> - If *C* is in the table below, then its corresponding
> two-character escape sequence is appended to *E*.
> <insert table here>
> - Otherwise, if *C* is not U+0020 SPACE and
> - *CE* is a Unicode encoding and C corresponds to a UCS
> scalar value whose Unicode property General_Category has a
> value in the groups Separator (Z) or Other (C), as described
> by table 12 of UAX#44, or
> - *CE* is not a Unicode encoding and C is one of an
> implementation-defined set of separator or non-printable character
>

Maybe a simpler way would be to define an implementation-defined mapping
TO-UCS(CE) and then use UCS in the rest of the paragraph. The behavior is
equivalent, but it gives implementers more guidance not to be
imaginative about which characters they consider printable.
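
As a rough sketch of the TO-UCS idea, assume CE is ISO/IEC 8859-1, whose code units map to U+0000..U+00FF by identity. The names to_ucs_latin1 and needs_escape are hypothetical. The predicate hard-codes the General_Category Z and C members in that range only; a real implementation would consult full Unicode property data.

```cpp
// Assumption: CE is ISO/IEC 8859-1, so TO-UCS is the identity mapping
// from a code unit to the scalar value with the same number.
constexpr char32_t to_ucs_latin1(unsigned char code_unit) {
    return code_unit;
}

// True if c (which must be <= U+00FF in this sketch) has General_Category
// in the groups Separator (Z) or Other (C) and is not U+0020 SPACE.
// In this range that means: Cc U+0000..U+001F and U+007F..U+009F,
// Zs U+00A0 NO-BREAK SPACE, and Cf U+00AD SOFT HYPHEN.
constexpr bool needs_escape(char32_t c) {
    return c < 0x20 || (c >= 0x7F && c <= 0xA0) || c == 0xAD;
}
```

With such a mapping, the non-Unicode branch no longer needs its own implementation-defined set: the same UCS-based rule decides what gets escaped.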

>
> - then the sequence \u{*simple-hexadecimal-digit-sequence*} is
> appended to *E* where *simple-hexadecimal-digit-sequence* is the
> code point value of *C* formatted as-if by a standard format
> specifier ([[format.string.std]]) of "{x}".
> - Otherwise, *C* is appended to *E*.
> - Otherwise *X* is a sequence of ill-formed code units. For each
> code unit *U*, the sequence \x{*simple-hexadecimal-digit-sequence*}
> is appended to *E* where *simple-hexadecimal-digit-sequence* is the
> code point value of *U* formatted as-if by a standard format
> specifier ([[format.string.std]]) of "{x}".
>

I prefer the original formulation here. In particular, what do you mean by
"code point value"? An invalid code unit does not map to a
code point. It is, however, a value itself.
But 'where *simple-hexadecimal-digit-sequence* is *U* formatted as-if by a
standard format specifier ([[format.string.std]]) of "{x}"' is also less
clear than the original wording, IMO.
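
For illustration, here is a minimal sketch of appending such an escape, where the code unit is formatted directly as a value with no code point involved. The helper name append_hex_escape is hypothetical. It uses std::to_chars with base 16, which yields lowercase hexadecimal digits without leading zeros; the assumption is that this matches the output the wording intends.

```cpp
#include <cassert>
#include <charconv>
#include <cstdint>
#include <string>
#include <system_error>

// Append \<prefix>{<hex digits>} to `out`: prefix 'u' for a scalar value,
// prefix 'x' for the numeric value of an ill-formed code unit.
inline void append_hex_escape(std::string& out, char prefix, std::uint32_t value) {
    char buf[8];  // a 32-bit value needs at most 8 hexadecimal digits
    const auto res = std::to_chars(buf, buf + sizeof buf, value, 16);
    assert(res.ec == std::errc{});  // cannot fail with a buffer this size
    out += '\\';
    out += prefix;
    out += '{';
    out.append(buf, res.ptr);  // digits written by to_chars
    out += '}';
}
```

For example, an invalid code unit 0xFF would be appended as \x{ff}, and U+200B as \u{200b}.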


I hope that helps,
Corentin

>
> - U+0022 QUOTATION MARK (") is appended to *E*.
>
> Please offer your thoughts, I would like to discuss this in tomorrow's
> SG16 meeting.
>
> Tom.
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>

Received on 2022-04-26 22:49:11