ISOCPP sg16 List: Re: Suggested wording change for non-Unicode cases in P2286R7: Formatting Ranges

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 27 Apr 2022 13:27:05 -0400

Thank you, Corentin.

On 4/26/22 6:48 PM, Corentin Jabot wrote:
> Thanks Tom,
> A few comments below
>
> On Tue, Apr 26, 2022 at 10:31 PM Tom Honermann via SG16
> <sg16_at_[hidden]> wrote:
>
> The proposed wording for [format.string.escaped]p4 in P2286R7:
> Formatting Ranges
> <https://wiki.edg.com/pub/Wg21telecons2022/LibraryWorkingGroup/p2286r7.html>
> currently states:
>
>> The escaped character and escaped string representations of a
>> character or string in a non-Unicode encoding is unspecified.
> I would like this to be better specified to ensure implementations
> behave consistently.
>
> The wording below is suggested as a replacement for
> [format.string.escaped]p2-p4 (link to p2
> <https://wiki.edg.com/pub/Wg21telecons2022/LibraryWorkingGroup/p2286r7.html#pnum_12>)
> and is intended to cover both the Unicode and non-Unicode cases.
>
> The escaped string /E/ representation of a string /S/ is
> constructed by encoding a sequence of characters in the associated
> character encoding /CE/ for charT ([lex.string.literal]) as follows:
>
> * /E/ is initialized with U+0022 QUOTATION MARK (").
> * For each code unit sequence /X/ in /S/ that either encodes a
> single character or that is a sequence of ill-formed code units:
>
> I think we should try to avoid considering ill-formed code units as
> part of sequences because it begs the question of the boundary condition.

We have an open issue about handling such boundary conditions, with the
intent to defer to the WHATWG Encoding Standard
<https://encoding.spec.whatwg.org/>, but we don't have wording for that yet.

> I think the initial wording is clearer. We can read in order either
> one valid UCS or one invalid code unit.
Perhaps I'm overthinking this, but since an individual code unit may
not be known to be invalid in isolation, I think it makes more sense to
treat a code unit sequence as being invalid.
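To illustrate with UTF-8 (a minimal sketch, not proposed wording): the
lead byte 0xE2 starts a well-formed sequence when followed by 0x82 0xAC,
but is ill-formed when followed by 'A'. The helper name
well_formed_utf8_length is hypothetical; the byte ranges are assumed to
follow the well-formed UTF-8 table in the Unicode Standard.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper: returns the length, in code units, of the well-formed
// UTF-8 sequence at the start of `s`, or 0 if `s` does not start with one.
inline std::size_t well_formed_utf8_length(std::string_view s) {
    if (s.empty()) return 0;
    auto byte = [&](std::size_t i) { return static_cast<unsigned char>(s[i]); };
    const unsigned char b0 = byte(0);
    if (b0 < 0x80) return 1;                 // single ASCII code unit

    std::size_t len = 0;
    unsigned char lo = 0x80, hi = 0xBF;      // allowed range of the second byte
    if (b0 >= 0xC2 && b0 <= 0xDF) {
        len = 2;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
        len = 3;
        if (b0 == 0xE0) lo = 0xA0;           // rejects overlong encodings
        if (b0 == 0xED) hi = 0x9F;           // rejects surrogate code points
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        len = 4;
        if (b0 == 0xF0) lo = 0x90;           // rejects overlong encodings
        if (b0 == 0xF4) hi = 0x8F;           // rejects values above U+10FFFF
    } else {
        return 0;                            // stray continuation or invalid lead
    }
    if (s.size() < len) return 0;            // truncated sequence
    if (byte(1) < lo || byte(1) > hi) return 0;
    for (std::size_t i = 2; i < len; ++i)
        if (byte(i) < 0x80 || byte(i) > 0xBF) return 0;
    return len;
}

// The same lead byte is fine in one context and ill-formed in another.
inline void check_examples() {
    assert(well_formed_utf8_length("\xE2\x82\xAC") == 3);  // U+20AC EURO SIGN
    assert(well_formed_utf8_length("\xE2" "AB") == 0);     // 0xE2 followed by 'A'
}
```

Whether 0xE2 is valid can only be decided by looking at the code units
that follow it, which is why the wording question is about sequences.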
> Also, character is ill-defined - codepoint works though.
I agree it is, but we use it all over the place. Do we currently use
"code point" outside of a Unicode context anywhere?
>
> o If /X/ encodes a single character /C/, then:
> + If /C/ is in the table below, then its corresponding
> two-character escape sequence is appended to /E/.
> <insert table here>
> + Otherwise, if /C/ is not U+0020 SPACE and
> # /CE/ is a Unicode encoding and C corresponds to a
> UCS scalar value whose Unicode property
> General_Category has a value in the groups
> Separator (Z) or Other (C), as described by table
> 12 of UAX#44, or
> # /CE/ is not a Unicode encoding and C is one of an
> implementation-defined set of separator or
> non-printable characters
>
> Maybe a simpler way would be to define an implementation-defined
> mapping TO-UCS(CE), and then use ucs in the rest of the paragraph, the
> behavior is isomorphic, although it gives more guidance
> to implementers not to be imaginative with which character they
> consider printable.

I had originally suggested something like that in another email thread,
but got pushback; there was a desire not to force implementors to
become aware of Unicode properties.

> + then the sequence
> \u{/simple-hexadecimal-digit-sequence/} is appended to
> /E/ where /simple-hexadecimal-digit-sequence/ is the
> code point value of /C/ formatted as-if by a standard
> format specifier ([[format.string.std]]) of "{x}".
> + Otherwise, /C/ is appended to /E/.
> o Otherwise /X/ is a sequence of ill-formed code units. For
> each code unit /U/, the sequence
> \x{/simple-hexadecimal-digit-sequence/} is appended to /E/
> where /simple-hexadecimal-digit-sequence/ is the code
> point value of /U/ formatted as-if by a standard format
> specifier ([[format.string.std]]) of "{x}".
>
> I prefer the original formulation here. In particular, what do you
> mean by "code point value"? An invalid code unit does not have a
> mapping to codepoint. It is, however, a value itself.

Oops, that should have said "code unit value". Thanks for spotting that.

Tom.

> But "where /simple-hexadecimal-digit-sequence/ is /U/ formatted as-if
> by a standard format specifier ([[format.string.std]]) of "{x}"." is
> also less clear than the original wording imo
>
> I hope that helps,
> Corentin
>
> * U+0022 QUOTATION MARK (") is appended to /E/.
>
> Please offer your thoughts, I would like to discuss this in
> tomorrow's SG16 meeting.
>
> Tom.
>
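For reference, the escaping steps discussed in this thread could be
sketched roughly as follows for a UTF-8 encoded char string. This is an
illustration under several assumptions, not proposed wording: the names
in namespace sketch are hypothetical; the escape table is not reproduced
in the thread, so the sketch assumes entries for tab, line feed, carriage
return, quotation mark, and backslash; is_separator_or_other is a
stand-in that lists only a handful of code points rather than consulting
Unicode data; and ill-formed input is consumed one code unit at a time,
which is only one possible answer to the boundary question raised above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

namespace sketch {

// Appends `v` as lowercase hexadecimal digits without padding.
inline void append_hex(std::string& out, std::uint32_t v) {
    char buf[8];
    int n = 0;
    do { buf[n++] = "0123456789abcdef"[v & 0xF]; v >>= 4; } while (v != 0);
    while (n > 0) out += buf[--n];
}

// Decodes one well-formed UTF-8 sequence at s[pos]. Returns its length and
// stores the scalar value in `cp`, or returns 0 if the input is ill-formed.
inline std::size_t decode_utf8(std::string_view s, std::size_t pos, char32_t& cp) {
    assert(pos < s.size());
    auto byte = [&](std::size_t i) { return static_cast<unsigned char>(s[pos + i]); };
    const unsigned char b0 = byte(0);
    if (b0 < 0x80) { cp = b0; return 1; }
    std::size_t len = 0;
    unsigned char lo = 0x80, hi = 0xBF;  // allowed range of the second byte
    if (b0 >= 0xC2 && b0 <= 0xDF) { len = 2; cp = b0 & 0x1F; }
    else if (b0 >= 0xE0 && b0 <= 0xEF) {
        len = 3; cp = b0 & 0x0F;
        if (b0 == 0xE0) lo = 0xA0;
        if (b0 == 0xED) hi = 0x9F;
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        len = 4; cp = b0 & 0x07;
        if (b0 == 0xF0) lo = 0x90;
        if (b0 == 0xF4) hi = 0x8F;
    } else return 0;
    if (s.size() - pos < len) return 0;
    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char b = byte(i);
        if (b < (i == 1 ? lo : 0x80) || b > (i == 1 ? hi : 0xBF)) return 0;
        cp = (cp << 6) | (b & 0x3F);
    }
    return len;
}

// Stand-in for "General_Category is Separator (Z) or Other (C)". Only a few
// code points are listed here; a real implementation needs Unicode data.
inline bool is_separator_or_other(char32_t c) {
    return c < 0x20 || (c >= 0x7F && c <= 0x9F)   // Cc control characters
        || c == 0x20 || c == 0xA0                 // two Zs samples
        || c == 0x2028 || c == 0x2029;            // Zl and Zp
}

inline std::string escaped(std::string_view s) {
    std::string e = "\"";                         // opening QUOTATION MARK
    for (std::size_t pos = 0; pos < s.size();) {
        char32_t c = 0;
        const std::size_t len = decode_utf8(s, pos, c);
        if (len == 0) {                           // ill-formed: emit the code unit value
            e += "\\x{";
            append_hex(e, static_cast<unsigned char>(s[pos]));
            e += '}';
            ++pos;
            continue;
        }
        switch (c) {
        case U'\t': e += "\\t";  break;
        case U'\n': e += "\\n";  break;
        case U'\r': e += "\\r";  break;
        case U'"':  e += "\\\""; break;
        case U'\\': e += "\\\\"; break;
        default:
            if (c != U' ' && is_separator_or_other(c)) {
                e += "\\u{"; append_hex(e, c); e += '}';
            } else {
                e.append(s.data() + pos, len);    // copy the code units unchanged
            }
        }
        pos += len;
    }
    e += '"';                                     // closing QUOTATION MARK
    return e;
}

}  // namespace sketch
```

With this sketch, an input consisting of the single byte 0xFF is rendered
as "\x{ff}" (the code unit value), while U+0001 is rendered as "\u{1}"
(the code point value), which is the distinction corrected above.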

Received on 2022-04-27 17:27:09