ISOCPP sg16 List: Re: Suggested wording change for non-Unicode cases in P2286R7: Formatting Ranges

From: Victor Zverovich <victor.zverovich_at_[hidden]>
Date: Wed, 11 May 2022 09:24:03 -0700

Thanks Tom and others for revising the wording. The latest version of the
escaping section looks good to me with only one minor question: is it clear
that "character" in
https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_14
means a code point or shall we use the term code point instead?

Cheers,
Victor

On Tue, May 10, 2022 at 6:32 PM Barry Revzin <barry.revzin_at_[hidden]> wrote:

>
>
> On Tue, May 10, 2022 at 1:31 PM Tom Honermann <tom_at_[hidden]> wrote:
>
>> On 5/9/22 7:34 PM, Barry Revzin wrote:
>>
>>
>>
>> On Mon, May 9, 2022 at 4:14 PM Tom Honermann <tom_at_[hidden]> wrote:
>>
>>> On 5/8/22 4:04 PM, Barry Revzin via SG16 wrote:
>>>
>>>
>>>
>>> On Sun, May 8, 2022 at 9:22 AM Victor Zverovich <
>>> victor.zverovich_at_[hidden]> wrote:
>>>
>>>> > One thing I noticed is that the wording about Grapheme_Extend is
>>>> gone. I didn't know what this meant before, so I don't know now if this is
>>>> a good removal or a bad removal.
>>>>
>>>> I don't recall any requests for removing it and think that it should be
>>>> reintroduced.
>>>>
>>>> - Victor
>>>>
>>>> On Wed, May 4, 2022 at 10:44 PM Jens Maurer <Jens.Maurer_at_[hidden]>
>>>> wrote:
>>>>
>>>>> On 05/05/2022 04.08, Barry Revzin wrote:
>>>>> > I think I have applied this. Here's the rendered version:
>>>>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
>>>>>
>>>>> > How does this look?
>>>>>
>>>>> p2.2
>>>>>
>>>>> For each code sequence X in S that either encodes a single character
>>>>> or encoding state transition or that is a sequence of ill-formed code units
>>>>> is processed in order as follows:
>>>>>
>>>>> That feels like bad English grammar to me.
>>>>>
>>>>> Why "encoding", yet there is an "encodes" before that?
>>>>> Why "either" and there are three things that don't
>>>>> exactly correspond grammatically?
>>>>>
>>>>> Maybe make a bulleted sub-list with the three items
>>>>> so that the structure is clear.
>>>>>
>>>>> "If C is one of the UCS scalar values the table below,"
>>>>>
>>>>> add "in"
>>>>>
>>>>> better clarify: "the two characters shown as the
>>>>> corresponding escape sequence are appended to E"
>>>>>
>>>>>
>>>>> after p2.3.4, p2.5
>>>>>
>>>>> "simple-hexadecimal-digit-sequence"
>>>>>
>>>>> I would not re-use lexing grammar for a local placeholder,
>>>>> just say \u{/hex-digit-sequence/} or so.
>>>>>
>>>>>
>>>>> p2.5
>>>>>
>>>>> "Otherwise, X is a sequence of ill-formed code units. Each"
>>>>>
>>>>> -> "Otherwise (X is a sequence of ill-formed code units), each code
>>>>> unit ..."
>>>>>
>>>>>
>>>>> "U+0027 APOSTROPHE is escaped as \' while U+0022 QUOTATION MARK is
>>>>> left unchanged."
>>>>>
>>>>> Can we rephrase that to avoid "is escaped as"? We were on such a good
>>>>> track to just append characters and avoid any judgment calls.
>>>>>
>>>>> suggestion "
>>>>> - for each character U+0027 APOSTROPHE in S, the two characters \'
>>>>> are appended to E
>>>>> - U+0022 QUOTATION MARK is left unchanged"
>>>>>
>>>>>
>>>>> Jens
>>>>>
>>>>
>>> Thanks Jens and Victor! I did my best to apply the suggested changes:
>>>
>>>
>>> - Updated rendered wording:
>>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
>>> - New diff:
>>> https://github.com/brevzin/cpp_proposals/commit/3d93043f5c296810d7e18b11d5b7083143554309
>>>
>>>
>>> Hopefully, this gradient is slowly descending to the correct solution
>>> :-)
>>>
>>> Thanks, Barry. This appears to have incorporated the parts of my prior
>>> suggestions that did not have opposition, so just minor issues noted below.
>>>
>>> Discussion at the last meeting
>>> <https://github.com/sg16-unicode/sg16-meetings#april-27th-2022>
>>> revealed that we're failing to specify the encoding used to interpret
>>> *S*. Change p2 as follows: (perhaps substitute "as described below" for
>>> "as follows")
>>>
>>> The escaped string *E* representation of a string *S* is constructed by
>>> encoding a sequence of characters as follows. The associated character
>>> encoding *CE* for charT ([lex.string.literal]
>>> <http://eel.is/c++draft/tab:lex.string.literal>) is used both to
>>> interpret *S* and to construct *E*.
>>>
>>> In p2.2, "code sequence" -> "code unit sequence".
>>>
>>> In p2.3.4 and p2.5, I don't think we should re-use the
>>> *hexadecimal-digit* grammar term here. Just say, "hexadecimal digits".
>>>
>>> Add the following note to p2.4 to address a request by Hubert:
>>>
>>> Otherwise, if *X* encodes a state transition, the effect on *E* is
>>> unspecified.* [ Note: the intent is that a state transition be
>>> represented in E such that its original code unit sequence can be
>>> reconstructed - end note ]*
>>>
>>> Hubert pointed out during the last meeting that we should not be trying
>>> to interpret state transitions for stateful encodings as I had previously
>>> been trying to do. I think we can now simplify p2.5:
>>>
>>> Otherwise (*X* is a sequence of ill-formed code units), each code unit
>>> *U* is appended to *E* in order as the sequence *\x{hex-digit-sequence}*,
>>> where *hex-digit-sequence* is the shortest hexadecimal representation
>>> of *U* using lower-case hexadecimal digits. When encoding a stateful
>>> character encoding, these additions should have no effect on encoding state.
>>>
>>> In p3, we now need to drop "in a Unicode encoding". I think the result
>>> should also produce a string, not a character.
>>>
>>> The escaped string representation of a character *C* is equivalent to
>>> the escaped string representation of a string of *C*, except that:
>>>
>>> p4 should be removed now.
>>>
>>> The escaped character and escaped string representations of a character
>>> or string in a non-Unicode encoding is unspecified.
>>>
>>> Hubert, the wording does not explicitly address your request to be able
>>> to specify spacing and separator characters as a set of encoding agnostic
>>> code point values. I think the existing wording suffices to meet your goals
>>> since an implementation can document a method of identifying the set of
>>> escaped characters by, for example, specifying characters in EBCDIC 1047
>>> and describing how to map those to other code pages. If you don't agree,
>>> could you suggest how the wording might be updated to better address your
>>> concern?
>>>
>>> Tom.
>>>
>>
>> Thanks, Tom! I applied these changes. The diff can be found here:
>> https://github.com/brevzin/cpp_proposals/commit/6745d72f8c002b7ce8811f0c6aeb5591cff97d54
>>
>> Thanks, Barry. This looks good to me modulo Hubert's additional tweak.
>>
>> One last thing I noticed. The example section has this:
>>
>> string s4 = format("[{:?}]", string("\0 \n \t \x02 \x1b", 9));
>> // s4 has value: ["\u{0} \n \t \u{2} \u{1b}"]
>>
>> That example depends on the encoding being ASCII-based in order for the
>> \x02 and \x1b escapes to be interpreted as characters \u{2} and \u{1b}.
>> Similarly, s5 and s6 have UTF-8 dependencies. Perhaps we should add a
>> comment?
>>
>> string s0 = format("[{}]", "h\tllo");
>> // s0 has value: [h llo]
>> string s1 = format("[{:?}]", "h\tllo");
>> // s1 has value: ["h\tllo"]
>> string s2 = format("[{:?}]", "Спасибо, Виктор ♥!");
>> // s2 has value: ["Спасибо, Виктор ♥!"]
>> string s3 = format("[{:?}, {:?}]", '\'', '"');
>> // s3 has value: ['\'', '"']
>> *// The following examples assume use of the UTF-8 encoding.*
>> string s4 = format("[{:?}]", string("\0 \n \t \x02 \x1b", 9));
>> // s4 has value: ["\u{0} \n \t \u{2} \u{1b}"]
>> string s5 = format("[{:?}]", "\xc3\x28"); // invalid UTF-8
>> // s5 has value: ["\x{c3}\x{28}"]
>> string s6 = format("[{:?}]", "🤷🏻‍♂️");
>> // s6 has value: ["🤷🏻\u{200d}♂\u{fe0f}"]
>>
>> I never got around to translating "Спасибо, Виктор ♥!" until now. Very
>> nice :)
>>
>> Tom
>>
>
> Applied Hubert's change and added this comment:
> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
> Thanks!
>
> The decreasing rate of requested changes is encouraging!
>
> Barry
>
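
For readers following the thread, the escaping rules discussed above (the table of two-character escapes, \u{...} for non-printable characters, and \x{...} for ill-formed code units using the shortest lower-case hexadecimal representation) can be sketched roughly as below. This is only an illustration, not the proposed wording or any library's implementation: it assumes the associated encoding is plain 7-bit ASCII, so every code unit at or above 0x80 is treated as ill-formed, and the names shortest_hex and escape_ascii are hypothetical.

```cpp
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical helper: shortest lower-case hexadecimal representation
// of a value (no leading zeros), as described for p2.3.4 and p2.5.
std::string shortest_hex(unsigned value) {
    char buf[16];
    std::snprintf(buf, sizeof buf, "%x", value);
    return buf;
}

// Hypothetical helper: escaped string representation of s.
// ASSUMPTION: the encoding is 7-bit ASCII, so each code unit is either
// a single character or (if >= 0x80) an ill-formed code unit.
std::string escape_ascii(std::string_view s) {
    std::string e = "\"";                      // opening quotation mark
    for (unsigned char u : s) {
        switch (u) {
        case '\t': e += "\\t";  break;         // two-character escapes
        case '\n': e += "\\n";  break;
        case '\r': e += "\\r";  break;
        case '"':  e += "\\\""; break;
        case '\\': e += "\\\\"; break;
        default:
            if (u >= 0x80) {
                // Ill-formed under the ASCII assumption: \x{hex}.
                e += "\\x{" + shortest_hex(u) + "}";
            } else if (u < 0x20 || u == 0x7f) {
                // Non-printable control character: \u{hex}.
                e += "\\u{" + shortest_hex(u) + "}";
            } else {
                // Printable character (including U+0027): copied as-is.
                e += static_cast<char>(u);
            }
        }
    }
    e += '"';                                  // closing quotation mark
    return e;
}
```

Under these assumptions, escape_ascii("h\tllo") yields "h\tllo" with the tab shown as the two characters \t, mirroring the s1 example. A real implementation would decode according to the actual encoding (for example UTF-8) and consult Unicode properties rather than the ASCII ranges hard-coded here.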

Received on 2022-05-11 16:24:15