ISOCPP sg16 List: Re: Suggested wording change for non-Unicode cases in P2286R7: Formatting Ranges

From: Barry Revzin <barry.revzin_at_[hidden]>
Date: Tue, 10 May 2022 20:31:54 -0500

On Tue, May 10, 2022 at 1:31 PM Tom Honermann <tom_at_[hidden]> wrote:

> On 5/9/22 7:34 PM, Barry Revzin wrote:
>
>
>
> On Mon, May 9, 2022 at 4:14 PM Tom Honermann <tom_at_[hidden]> wrote:
>
>> On 5/8/22 4:04 PM, Barry Revzin via SG16 wrote:
>>
>>
>>
>> On Sun, May 8, 2022 at 9:22 AM Victor Zverovich <
>> victor.zverovich_at_[hidden]> wrote:
>>
>>> > One thing I noticed is that the wording about Grapheme_Extend is gone.
>>> I didn't know what this meant before, so I don't know now if this is a good
>>> removal or a bad removal.
>>>
>>> I don't recall any requests for removing it and think that it should be
>>> reintroduced.
>>>
>>> - Victor
>>>
>>> On Wed, May 4, 2022 at 10:44 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>>>
>>>> On 05/05/2022 04.08, Barry Revzin wrote:
>>>> > I think I have applied this. Here's the rendered version:
>>>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
>>>>
>>>> > How does this look?
>>>>
>>>> p2.2
>>>>
>>>> For each code sequence X in S that either encodes a single character or
>>>> encoding state transition or that is a sequence of ill-formed code units is
>>>> processed in order as follows:
>>>>
>>>> That feels like bad English grammar to me.
>>>>
>>>> Why "encoding", yet there is an "encodes" before that?
>>>> Why "either" and there are three things that don't
>>>> exactly correspond grammatically?
>>>>
>>>> Maybe make a bulleted sub-list with the three items
>>>> so that the structure is clear.
>>>>
>>>> "If C is one of the UCS scalar values the table below,"
>>>>
>>>> add "in"
>>>>
>>>> better clarify: "the two characters shown as the
>>>> corresponding escape sequence are appended to E"
>>>>
>>>>
>>>> after p2.3.4, p2.5
>>>>
>>>> "simple-hexadecimal-digit-sequence"
>>>>
>>>> I would not re-use lexing grammar for a local placeholder,
>>>> just say \u{/hex-digit-sequence/} or so.
>>>>
>>>>
>>>> p2.5
>>>>
>>>> "Otherwise, X is a sequence of ill-formed code units. Each"
>>>>
>>>> -> "Otherwise (X is a sequence of ill-formed code units), each code
>>>> unit ..."
>>>>
>>>>
>>>> "U+0027 APOSTROPHE is escaped as \' while U+0022 QUOTATION MARK is left
>>>> unchanged."
>>>>
>>>> Can we rephrase that to avoid "is escaped as"? We were on such a good
>>>> track to just append characters and avoid any judgment calls.
>>>>
>>>> suggestion "
>>>> - for each character U+0027 APOSTROPHE in S, the two characters \' are
>>>> appended to E
>>>> - U+0022 QUOTATION MARK is left unchanged"
>>>>
>>>>
>>>> Jens
>>>>
>>>
>> Thanks Jens and Victor! I did my best to apply the suggested changes:
>>
>>
>> - Updated rendered wording:
>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
>> - New diff:
>> https://github.com/brevzin/cpp_proposals/commit/3d93043f5c296810d7e18b11d5b7083143554309
>>
>>
>> Hopefully, this gradient is slowly descending to the correct solution :-)
>>
>> Thanks, Barry. This appears to have incorporated the parts of my prior
>> suggestions that did not have opposition, so just minor issues noted below.
>>
>> Discussion at the last meeting
>> <https://github.com/sg16-unicode/sg16-meetings#april-27th-2022> revealed
>> that we're failing to specify the encoding used to interpret *S*. Change
>> p2 as follows: (perhaps substitute "as described below" for "as follows")
>>
>> The escaped string *E* representation of a string *S* is constructed by
>> encoding a sequence of characters as follows. The associated character
>> encoding *CE* for charT ([lex.string.literal]
>> <http://eel.is/c++draft/tab:lex.string.literal>) is used both to
>> interpret *S* and to construct *E*.
>>
>> In p2.2, "code sequence" -> "code unit sequence".
>>
>> In p2.3.4 and p2.5, I don't think we should re-use the
>> *hexadecimal-digit* grammar term here. Just say, "hexadecimal digits".
>>
>> Add the following note to p2.4 to address a request by Hubert:
>>
>> Otherwise, if *X* encodes a state transition, the effect on *E* is
>> unspecified.* [ Note: the intent is that a state transition be
>> represented in E such that its original code unit sequence can be
>> reconstructed - end note ]*
>>
>> Hubert pointed out during the last meeting that we should not be trying
>> to interpret state transitions for stateful encodings as I had previously
>> been trying to do. I think we can now simplify p2.5:
>>
>> Otherwise (*X* is a sequence of ill-formed code units), each code unit
>> *U* is appended to *E* in order as the sequence *\x{hex-digit-sequence}*,
>> where *hex-digit-sequence* is the shortest hexadecimal representation of
>> *U* using lower-case hexadecimal digits. When encoding a stateful
>> character encoding, these additions should have no effect on encoding state.
>>
>> In p3, we now need to drop "in a Unicode encoding". I think the result
>> should also produce a string, not a character.
>>
>> The escaped string representation of a character *C* is equivalent to
>> the escaped string representation of a string of *C*, except that:
>>
>> p4 should be removed now.
>>
>> The escaped character and escaped string representations of a character
>> or string in a non-Unicode encoding is unspecified.
>>
>> Hubert, the wording does not explicitly address your request to be able
>> to specify spacing and separator characters as a set of encoding agnostic
>> code point values. I think the existing wording suffices to meet your goals
>> since an implementation can document a method of identifying the set of
>> escaped characters by, for example, specifying characters in EBCDIC 1047
>> and describing how to map those to other code pages. If you don't agree,
>> could you suggest how the wording might be updated to better address your
>> concern?
>>
>> Tom.
>>
>
> Thanks, Tom! I applied these changes. The diff can be found here:
> https://github.com/brevzin/cpp_proposals/commit/6745d72f8c002b7ce8811f0c6aeb5591cff97d54
>
> Thanks, Barry. This looks good to me modulo Hubert's additional tweak.
>
> One last thing I noticed. The example section has this:
>
> string s4 = format("[{:?}]", string("\0 \n \t \x02 \x1b", 9));
> // s4 has value: [\u{0} \n \t \u{2} \u{1b}]
>
> That example depends on the encoding being ASCII-based in order for the
> \x02 and \x1b escapes to be interpreted as characters \u{2} and \u{1b}.
> Similarly, s5 and s6 have UTF-8 dependencies. Perhaps we should add a
> comment?
>
> string s0 = format("[{}]", "h\tllo");
> // s0 has value: [h llo]
> string s1 = format("[{:?}]", "h\tllo");
> // s1 has value: ["h\tllo"]
> string s2 = format("[{:?}]", "Спасибо, Виктор ♥!");
> // s2 has value: ["Спасибо, Виктор ♥!"]
> string s3 = format("[{:?}] [{:?}]", '\'', '"');
> // s3 has value: ['\'', '"']
> *// The following examples assume use of the UTF-8 encoding.*
> string s4 = format("[{:?}]", string("\0 \n \t \x02 \x1b", 9));
> // s4 has value: [\u{0} \n \t \u{2} \u{1b}]
> string s5 = format("[{:?}]", "\xc3\x28"); // invalid UTF-8
> // s5 has value: ["\x{c3}\x{28}"]
> string s6 = format("[{:?}]", "🤷🏻‍♂️");
> // s6 has value: ["🤷🏻\u{200d}♂\u{fe0f}"]
>
> I never got around to translating "Спасибо, Виктор ♥!" until now. Very
> nice :)
>
> Tom
>
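
For readers following along, here is a rough sketch of the escaping steps discussed in the quoted text: the table-driven escapes, the \u{...} form for non-printable characters, and the \x{...} form from p2.5 using the shortest lower-case hexadecimal digits. It is not the P2286 algorithm or any library's implementation. The names hex_escape and escape_body are hypothetical, and the sketch assumes an ASCII-based encoding, does no multi-byte decoding and no Unicode property lookups, and leaves the surrounding quotation marks to the caller.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper (not from P2286): renders `value` as \x{...} or \u{...}
// using the shortest lower-case hexadecimal representation.
std::string hex_escape(char prefix, unsigned value) {
    assert(prefix == 'x' || prefix == 'u');
    static constexpr char digits[] = "0123456789abcdef";
    std::string hex;
    do {
        hex.insert(hex.begin(), digits[value % 16]);
        value /= 16;
    } while (value != 0);
    return std::string("\\") + prefix + "{" + hex + "}";
}

// Hypothetical, simplified escaper for the contents of a string.
// Assumptions: ASCII-based encoding; bytes >= 0x80 are not decoded and are
// emitted as \x{...} purely because this sketch cannot interpret them.
std::string escape_body(std::string_view s) {
    std::string e;
    for (char ch : s) {
        const unsigned char u = static_cast<unsigned char>(ch);
        switch (u) {
            case '\t': e += "\\t";  break;
            case '\n': e += "\\n";  break;
            case '\r': e += "\\r";  break;
            case '"':  e += "\\\""; break;
            case '\\': e += "\\\\"; break;
            default:
                if (u >= 0x80) {
                    e += hex_escape('x', u);   // undecoded code unit
                } else if (u < 0x20 || u == 0x7f) {
                    e += hex_escape('u', u);   // ASCII control character
                } else {
                    e += ch;                   // printable ASCII, including space
                }
        }
    }
    return e;
}
```

Under these assumptions, passing the s4 input string("\0 \n \t \x02 \x1b", 9) to escape_body yields the same escapes shown in the s4 example, which only holds because the sketch treats the input as ASCII-based.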

Applied Hubert's change and added this comment:
https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
Thanks!

The decreasing rate of requested changes is encouraging!

Barry

Received on 2022-05-11 01:32:07