ISOCPP sg16 List: Re: Suggested wording change for non-Unicode cases in P2286R7: Formatting Ranges

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Fri, 13 May 2022 22:48:11 -0400

On Fri, May 13, 2022 at 8:55 PM Tom Honermann via SG16 <
sg16_at_[hidden]> wrote:

> Thanks for the update, Barry. No concerns from me!
>

Thanks for the heads up, Barry. Looks okay to me too.

>
> Tom.
>
> On May 13, 2022, at 8:04 PM, Barry Revzin <barry.revzin_at_[hidden]> wrote:
>
>
> Thank you for making all these iterations!
>
> LWG re-affirmed this paper today, making one change. The wording you all
> provided me had a note:
>
> [ *Note*: the intent is that a state transition be represented in *E*
> such that the original code unit sequence of *S* can be reconstructed
> - *end note* ]
>
> which LWG wanted to elevate into recommended practice:
>
> *Recommended Practice*: a state transition should be represented in *E*
> such that the original code unit sequence of *S* can be reconstructed.
>
> Same words, just slightly more intentional about the intent. I hope that's
> okay with everybody. (diff:
> https://github.com/brevzin/cpp_proposals/commit/fc263d0be55e189a6f98996a7cb06f2f87f82bfd,
> rendered:
> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_21
> )
>
> Thanks again,
>
> Barry
>
> On Thu, May 12, 2022 at 8:56 AM Tom Honermann <tom_at_[hidden]> wrote:
>
>> Ship it!
>>
>> Thank you for sticking with us through all these iterations!
>>
>> Tom.
>> On 5/11/22 9:44 PM, Barry Revzin wrote:
>>
>> Done!
>>
>> Barry "Ship it?" Revzin
>>
>> On Wed, May 11, 2022 at 3:36 PM Tom Honermann <tom_at_[hidden]> wrote:
>>
>>> Hi, Barry. We discussed in today's SG16 meeting and identified one last
>>> minor change to make. We then polled forwarding the paper to LWG with
>>> unanimous consent so this is definitely the last change!
>>>
>>> In 2.3.1, substitute "character" for "UCS scalar value" in the first
>>> sentence and in the table header.
>>>
>>> If *C* is one of the *characters* in the table below,
>>> then the two characters shown as the corresponding escape sequence are
>>> appended to *E*:
>>>
>>> *character*                    escape sequence
>>> U+0009 CHARACTER TABULATION    \t
>>> U+000A LINE FEED               \n
>>> U+000D CARRIAGE RETURN         \r
>>> U+0022 QUOTATION MARK          \"
>>> U+005C REVERSE SOLIDUS         \\
>>>
>>> Tom.
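
The table above can be read as a simple lookup. The following is a minimal sketch, not wording or code from P2286: the function name `simple_escape` is hypothetical, and it assumes the character has already been decoded to a `char32_t` value.

```cpp
#include <optional>
#include <string_view>

// Hypothetical helper (not part of P2286 or the standard library):
// returns the two-character escape sequence for the characters listed
// in the table, or std::nullopt for any other character.
constexpr std::optional<std::string_view> simple_escape(char32_t c) {
    switch (c) {
        case U'\t': return std::string_view{"\\t"};   // U+0009 CHARACTER TABULATION
        case U'\n': return std::string_view{"\\n"};   // U+000A LINE FEED
        case U'\r': return std::string_view{"\\r"};   // U+000D CARRIAGE RETURN
        case U'"':  return std::string_view{"\\\""};  // U+0022 QUOTATION MARK
        case U'\\': return std::string_view{"\\\\"};  // U+005C REVERSE SOLIDUS
        default:    return std::nullopt;              // handled by later rules
    }
}
```

A caller would append the returned two characters to *E* when a value is present and fall through to the remaining rules otherwise.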
>>> On 5/11/22 12:56 PM, Tom Honermann via SG16 wrote:
>>>
>>> I have a weak preference for "character" given that the wording is
>>> intended to address Unicode and non-Unicode behavior. I don't think we have
>>> any normative uses of "code point" at present.
>>>
>>> The definition of "code point" we have via our normative reference to
>>> ISO/IEC 10646 is: "value in the UCS codespace". That doesn't really work
>>> for the non-Unicode case and, regardless, would include surrogate code
>>> points which I don't think we want in this context.
>>>
>>> Tom.
>>> On 5/11/22 12:24 PM, Victor Zverovich via SG16 wrote:
>>>
>>> Thanks Tom and others for revising the wording. The latest version of
>>> the escaping section looks good to me with only one minor question: is it
>>> clear that "character" in
>>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_14
>>> means a code point or shall we use the term code point instead?
>>>
>>> Cheers,
>>> Victor
>>>
>>> On Tue, May 10, 2022 at 6:32 PM Barry Revzin <barry.revzin_at_[hidden]>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Tue, May 10, 2022 at 1:31 PM Tom Honermann <tom_at_[hidden]>
>>>> wrote:
>>>>
>>>>> On 5/9/22 7:34 PM, Barry Revzin wrote:
>>>>>
>>>>>
>>>>>
>>>>> On Mon, May 9, 2022 at 4:14 PM Tom Honermann <tom_at_[hidden]>
>>>>> wrote:
>>>>>
>>>>>> On 5/8/22 4:04 PM, Barry Revzin via SG16 wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sun, May 8, 2022 at 9:22 AM Victor Zverovich <
>>>>>> victor.zverovich_at_[hidden]> wrote:
>>>>>>
>>>>>>> > One thing I noticed is that the wording about Grapheme_Extend is
>>>>>>> gone. I didn't know what this meant before, so I don't know now if this is
>>>>>>> a good removal or a bad removal.
>>>>>>>
>>>>>>> I don't recall any requests for removing it and think that it should
>>>>>>> be reintroduced.
>>>>>>>
>>>>>>> - Victor
>>>>>>>
>>>>>>> On Wed, May 4, 2022 at 10:44 PM Jens Maurer <Jens.Maurer_at_[hidden]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> On 05/05/2022 04.08, Barry Revzin wrote:
>>>>>>>> > I think I have applied this. Here's the rendered version:
>>>>>>>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
>>>>>>>>
>>>>>>>> > How does this look?
>>>>>>>>
>>>>>>>> p2.2
>>>>>>>>
>>>>>>>> For each code sequence X in S that either encodes a single
>>>>>>>> character or encoding state transition or that is a sequence of ill-formed
>>>>>>>> code units is processed in order as follows:
>>>>>>>>
>>>>>>>> That feels like bad English grammar to me.
>>>>>>>>
>>>>>>>> Why "encoding", yet there is an "encodes" before that?
>>>>>>>> Why "either" and there are three things that don't
>>>>>>>> exactly correspond grammatically?
>>>>>>>>
>>>>>>>> Maybe make a bulleted sub-list with the three items
>>>>>>>> so that the structure is clear.
>>>>>>>>
>>>>>>>> "If C is one of the UCS scalar values the table below,"
>>>>>>>>
>>>>>>>> add "in"
>>>>>>>>
>>>>>>>> better clarify: "the two characters shown as the
>>>>>>>> corresponding escape sequence are appended to E"
>>>>>>>>
>>>>>>>>
>>>>>>>> after p2.3.4, p2.5
>>>>>>>>
>>>>>>>> "simple-hexadecimal-digit-sequence"
>>>>>>>>
>>>>>>>> I would not re-use lexing grammar for a local placeholder,
>>>>>>>> just say \u{/hex-digit-sequence/} or so.
>>>>>>>>
>>>>>>>>
>>>>>>>> p2.5
>>>>>>>>
>>>>>>>> "Otherwise, X is a sequence of ill-formed code units. Each"
>>>>>>>>
>>>>>>>> -> "Otherwise (X is a sequence of ill-formed code units), each code
>>>>>>>> unit ..."
>>>>>>>>
>>>>>>>>
>>>>>>>> "U+0027 APOSTROPHE is escaped as \' while U+0022 QUOTATION MARK is
>>>>>>>> left unchanged."
>>>>>>>>
>>>>>>>> Can we rephrase that to avoid "is escaped as"? We were on such a good
>>>>>>>> track to just append characters and avoid any judgment calls.
>>>>>>>>
>>>>>>>> suggestion "
>>>>>>>> - for each character U+0027 APOSTROPHE in S, the two characters \'
>>>>>>>> are appended to E
>>>>>>>> - U+0022 QUOTATION MARK is left unchanged"
>>>>>>>>
>>>>>>>>
>>>>>>>> Jens
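
The difference between the string and character cases that Jens's suggestion describes can be sketched as follows. This is an illustration only: `Kind` and `append_quote_aware` are hypothetical names, and the sketch covers just the two quote characters, not the other escaping rules.

```cpp
#include <string>

// Hypothetical names, used only for this illustration.
enum class Kind { String, Character };

// Appends c to e, applying only the quote-related rules:
// - in a string, U+0022 QUOTATION MARK becomes \" and U+0027 is unchanged;
// - in a character, U+0027 APOSTROPHE becomes \' and U+0022 is unchanged.
void append_quote_aware(std::string& e, char c, Kind kind) {
    if (c == '"' && kind == Kind::String) {
        e += "\\\"";
    } else if (c == '\'' && kind == Kind::Character) {
        e += "\\'";
    } else {
        e += c;
    }
}
```

Under these assumptions, an apostrophe passes through unchanged in a string but gains a backslash in a character, and the quotation mark behaves the other way around.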
>>>>>>>>
>>>>>>>
>>>>>> Thanks Jens and Victor! I did my best to apply the suggested changes:
>>>>>>
>>>>>>
>>>>>> - Updated rendered wording:
>>>>>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
>>>>>> - New diff:
>>>>>> https://github.com/brevzin/cpp_proposals/commit/3d93043f5c296810d7e18b11d5b7083143554309
>>>>>>
>>>>>>
>>>>>> Hopefully, this gradient is slowly descending to the correct solution
>>>>>> :-)
>>>>>>
>>>>>> Thanks, Barry. This appears to have incorporated the parts of my
>>>>>> prior suggestions that did not have opposition, so just minor issues noted
>>>>>> below.
>>>>>>
>>>>>> Discussion at the last meeting
>>>>>> <https://github.com/sg16-unicode/sg16-meetings#april-27th-2022>
>>>>>> revealed that we're failing to specify the encoding used to interpret
>>>>>> *S*. Change p2 as follows: (perhaps substitute "as described below"
>>>>>> for "as follows")
>>>>>>
>>>>>> The escaped string *E* representation of a string *S* is constructed
>>>>>> by encoding a sequence of characters *as follows.* *T*he associated
>>>>>> character encoding *CE* for charT ([lex.string.literal]
>>>>>> <http://eel.is/c++draft/tab:lex.string.literal>) *is used both to
>>>>>> interpret S and to construct E.*
>>>>>>
>>>>>> In p2.2, "code sequence" -> "code unit sequence".
>>>>>>
>>>>>> In p2.3.4 and p2.5, I don't think we should re-use the
>>>>>> *hexadecimal-digit* grammar term here. Just say, "hexadecimal
>>>>>> digits".
>>>>>>
>>>>>> Add the following note to p2.4 to address a request by Hubert:
>>>>>>
>>>>>> Otherwise, if *X* encodes a state transition, the effect on *E* is
>>>>>> unspecified.* [ Note: the intent is that a state transition be
>>>>>> represented in E such that its original code unit sequence can be
>>>>>> reconstructed - end note ]*
>>>>>>
>>>>>> Hubert pointed out during the last meeting that we should not be
>>>>>> trying to interpret state transitions for stateful encodings as I had
>>>>>> previously been trying to do. I think we can now simplify p2.5:
>>>>>>
>>>>>> Otherwise (*X* is a sequence of ill-formed code units), each code
>>>>>> unit *U* is appended to *E* in order as the sequence
>>>>>> *\x{hex-digit-sequence}*, where *hex-digit-sequence* is the shortest
>>>>>> hexadecimal representation of *U* using lower-case hexadecimal
>>>>>> digits. When encoding a stateful character encoding, these additions
>>>>>> should have no effect on encoding state.
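
The simplified p2.5 rule above could be sketched as below. This is a hedged illustration rather than the paper's implementation: the name `append_ill_formed_unit` is hypothetical, and it assumes 8-bit code units (for example, `char` with UTF-8). `std::to_chars` with base 16 produces lower-case digits without leading zeros, which matches "shortest hexadecimal representation".

```cpp
#include <charconv>
#include <string>

// Hypothetical helper: appends one ill-formed code unit u to e as
// \x{hex-digit-sequence}, using the shortest lower-case hexadecimal form.
// Assumes 8-bit code units; wider code units would need a larger buffer.
void append_ill_formed_unit(std::string& e, unsigned char u) {
    char buf[2];  // an 8-bit value needs at most two hexadecimal digits
    const auto res = std::to_chars(buf, buf + sizeof buf,
                                   static_cast<unsigned>(u), 16);
    e += "\\x{";
    e.append(buf, res.ptr);  // res.ptr points one past the last digit written
    e += '}';
}
```

Applied to the two bytes 0xc3 and 0x28 in order, this yields `\x{c3}\x{28}`.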
>>>>>>
>>>>>> In p3, we now need to drop "in a Unicode encoding". I think the
>>>>>> result should also produce a string, not a character.
>>>>>>
>>>>>> The escaped *string* representation of a character *C* is equivalent
>>>>>> to the escaped string representation of a string of *C*, except that:
>>>>>>
>>>>>> p4 should be removed now.
>>>>>>
>>>>>> The escaped character and escaped string representations of a
>>>>>> character or string in a non-Unicode encoding is unspecified.
>>>>>>
>>>>>> Hubert, the wording does not explicitly address your request to be
>>>>>> able to specify spacing and separator characters as a set of encoding
>>>>>> agnostic code point values. I think the existing wording suffices to meet
>>>>>> your goals since an implementation can document a method of identifying the
>>>>>> set of escaped characters by, for example, specifying characters in EBCDIC
>>>>>> 1047 and describing how to map those to other code pages. If you don't
>>>>>> agree, could you suggest how the wording might be updated to better address
>>>>>> your concern?
>>>>>>
>>>>>> Tom.
>>>>>>
>>>>>
>>>>> Thanks, Tom! I applied these changes. The diff can be found here:
>>>>> https://github.com/brevzin/cpp_proposals/commit/6745d72f8c002b7ce8811f0c6aeb5591cff97d54
>>>>>
>>>>> Thanks, Barry. This looks good to me modulo Hubert's additional tweak.
>>>>>
>>>>> One last thing I noticed. The example section has this:
>>>>>
>>>>> string s4 = format("[{:?}]", string("\0 \n \t \x02 \x1b", 9));
>>>>> // s4 has value: ["\u{0} \n \t \u{2} \u{1b}"]
>>>>>
>>>>> That example depends on the encoding being ASCII-based in order for
>>>>> the \x02 and \x1b escapes to be interpreted as characters \u{2} and
>>>>> \u{1b}. Similarly, s5 and s6 have UTF-8 dependencies. Perhaps we
>>>>> should add a comment?
>>>>>
>>>>> string s0 = format("[{}]", "h\tllo");  // s0 has value: [h llo]
>>>>> string s1 = format("[{:?}]", "h\tllo");  // s1 has value: ["h\tllo"]
>>>>> string s2 = format("[{:?}]", "Спасибо, Виктор ♥!");  // s2 has value: ["Спасибо, Виктор ♥!"]
>>>>> string s3 = format("[{:?}, {:?}]", '\'', '"');  // s3 has value: ['\'', '"']
>>>>> *// The following examples assume use of the UTF-8 encoding.*
>>>>> string s4 = format("[{:?}]", string("\0 \n \t \x02 \x1b", 9));
>>>>> // s4 has value: ["\u{0} \n \t \u{2} \u{1b}"]
>>>>> string s5 = format("[{:?}]", "\xc3\x28");  // invalid UTF-8, s5 has value: ["\x{c3}\x{28}"]
>>>>> string s6 = format("[{:?}]", "🤷🏻‍♂️");  // s6 has value: ["🤷🏻\u{200d}♂\u{fe0f}"]
>>>>>
>>>>> I never got around to translating "Спасибо, Виктор ♥!" until now. Very
>>>>> nice :)
>>>>>
>>>>> Tom
>>>>>
>>>>
>>>> Applied Hubert's change and added this comment:
>>>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
>>>> Thanks!
>>>>
>>>> The decreasing rate of requested changes is encouraging!
>>>>
>>>> Barry
>>>>
>>>
>>>
>>> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>

Received on 2022-05-14 02:48:42