Thank you, Corentin.
Thanks Tom,
A few comments below
On Tue, Apr 26, 2022 at 10:31 PM Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:
The proposed wording for [format.string.escaped]p4 in P2286R7: Formatting Ranges currently states:
The escaped character and escaped string representations of a character or string in a non-Unicode encoding is unspecified.
I would like this to be better specified to ensure implementations behave consistently.
The wording below is suggested as a replacement for [format.string.escaped]p2-p4 (link to p2) and is intended to cover both the Unicode and non-Unicode cases.
The escaped string E representation of a string S is constructed by encoding a sequence of characters in the associated character encoding CE for charT ([lex.string.literal]) as follows:
- E is initialized with U+0022 QUOTATION MARK (").
- For each code unit sequence X in S that either encodes a single character or that is a sequence of ill-formed code units:
I think we should try to avoid considering ill-formed code units as part of sequences because it raises the question of where the sequence boundaries fall.
We have an open concern about handling such boundary concerns with an intent to defer to the WHATWG Encoding Standard, but we don't have wording for that yet.
Perhaps I'm overthinking this, but since an individual code unit may not be known to be invalid in isolation, I think it makes more sense to treat a code unit sequence as being invalid.
I think the initial wording is clearer. We can read, in order, either one valid UCS scalar value or one invalid code unit.
I agree it is, but we use it all over the place. Do we currently use "code point" outside of a Unicode context anywhere?
Also, "character" is ill-defined; "code point" works though.
- If X encodes a single character C, then:
- If C is in the table below, then its corresponding two-character escape sequence is appended to E.
<insert table here>
- Otherwise, if C is not U+0020 SPACE and
- CE is a Unicode encoding and C corresponds to a UCS scalar value whose Unicode property General_Category has a value in the groups Separator (Z) or Other (C), as described by table 12 of UAX#44, or
- CE is not a Unicode encoding and C is one of an implementation-defined set of separator or non-printable characters
Maybe a simpler way would be to define an implementation-defined mapping TO-UCS(CE) and then use UCS in the rest of the paragraph. The behavior is isomorphic, although it gives more guidance to implementers not to be imaginative about which characters they consider printable.
I had originally suggested something like that in another email thread, but got pushback; there was a desire not to force implementors to become aware of Unicode properties.
- then the sequence \u{simple-hexadecimal-digit-sequence} is appended to E where simple-hexadecimal-digit-sequence is the code point value of C formatted as-if by a standard format specifier ([format.string.std]) of "{x}".
- Otherwise, C is appended to E.
- Otherwise X is a sequence of ill-formed code units. For each code unit U, the sequence \x{simple-hexadecimal-digit-sequence} is appended to E where simple-hexadecimal-digit-sequence is the code point value of U formatted as-if by a standard format specifier ([format.string.std]) of "{x}".
I prefer the original formulation here. In particular, what do you mean by "code point value"? An invalid code unit does not have a mapping to codepoint. It is, however, a value itself.
Oops, that should have said "code unit value". Thanks for spotting that.
Tom.
But "where simple-hexadecimal-digit-sequence is U formatted as-if by a standard format specifier ([format.string.std]) of "{x}"" is also less clear than the original wording imo.
I hope that helps,
Corentin
- U+0022 QUOTATION MARK (") is appended to E.
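The steps above can be sketched in C++ for the case where CE is UTF-8. This is only an illustration under several assumptions, not the proposed wording or any implementation's behavior: the names `escape_utf8`, `decode_one`, `to_hex` and `is_separator_or_other` are hypothetical; the two-character escape table (left as a placeholder in the message) is assumed to cover tab, line feed, carriage return, quotation mark and backslash; the General_Category check is a small hand-written subset rather than a real UAX#44 lookup; and an ill-formed code unit is handled by escaping it and resuming decoding at the next code unit, which is one possible boundary policy among those discussed.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Lowercase hexadecimal without padding (what the wording's hex formatting
// is taken to mean in this sketch).
std::string to_hex(std::uint32_t v) {
    const char* digits = "0123456789abcdef";
    std::string out;
    do { out.insert(out.begin(), digits[v % 16]); v /= 16; } while (v != 0);
    return out;
}

// Decodes one UTF-8 scalar value starting at s[i]. Returns the number of code
// units consumed, or 0 if s[i] does not begin a well-formed sequence.
std::size_t decode_one(std::string_view s, std::size_t i, char32_t& cp) {
    auto byte = [&](std::size_t k) { return static_cast<unsigned char>(s[k]); };
    unsigned char b0 = byte(i);
    if (b0 < 0x80) { cp = b0; return 1; }
    std::size_t len = 0;
    char32_t min = 0;
    if (b0 >= 0xC2 && b0 <= 0xDF)      { len = 2; min = 0x80;    cp = b0 & 0x1F; }
    else if (b0 >= 0xE0 && b0 <= 0xEF) { len = 3; min = 0x800;   cp = b0 & 0x0F; }
    else if (b0 >= 0xF0 && b0 <= 0xF4) { len = 4; min = 0x10000; cp = b0 & 0x07; }
    else return 0;                       // stray continuation or invalid lead
    if (i + len > s.size()) return 0;    // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
        unsigned char b = byte(i + k);
        if ((b & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (b & 0x3F);
    }
    // Reject overlong forms, surrogates, and values above U+10FFFF.
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    return len;
}

// ASSUMPTION: a deliberately incomplete stand-in for "General_Category in
// Separator (Z) or Other (C)". A real implementation would consult UCD data.
bool is_separator_or_other(char32_t c) {
    return c < 0x20 || (c >= 0x7F && c <= 0x9F)  // Cc
        || c == 0x20 || c == 0xA0                // Zs (partial)
        || c == 0x2028 || c == 0x2029;           // Zl, Zp
}

std::string escape_utf8(std::string_view s) {
    std::string e = "\"";                        // opening quotation mark
    for (std::size_t i = 0; i < s.size();) {
        char32_t c = 0;
        std::size_t len = decode_one(s, i, c);
        if (len == 0) {
            // Ill-formed code unit: emit its code unit value as \x{...}.
            e += "\\x{" + to_hex(static_cast<unsigned char>(s[i])) + "}";
            ++i;
            continue;
        }
        switch (c) {                             // ASSUMED escape table entries
            case U'\t': e += "\\t";  break;
            case U'\n': e += "\\n";  break;
            case U'\r': e += "\\r";  break;
            case U'"':  e += "\\\""; break;
            case U'\\': e += "\\\\"; break;
            default:
                if (c != U' ' && is_separator_or_other(c))
                    e += "\\u{" + to_hex(c) + "}";
                else
                    e.append(s.substr(i, len));  // append C unchanged
        }
        i += len;
    }
    e += '"';                                    // closing quotation mark
    return e;
}
```

Because each ill-formed code unit is escaped individually, whether ill-formed input is described as a sequence or as single code units mostly affects where decoding resumes, not the shape of the \x{...} escapes themselves.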
Please offer your thoughts, I would like to discuss this in tomorrow's SG16 meeting.
Tom.