ISOCPP sg16 List: Re: Suggested wording change for non-Unicode cases in P2286R7: Formatting Ranges

From: Tom Honermann <tom_at_[hidden]>
Date: Thu, 12 May 2022 09:56:47 -0400

Ship it!

Thank you for sticking with us through all these iterations!

Tom.

On 5/11/22 9:44 PM, Barry Revzin wrote:
> Done!
>
> Barry "Ship it?" Revzin
>
> On Wed, May 11, 2022 at 3:36 PM Tom Honermann <tom_at_[hidden]> wrote:
>
> Hi, Barry. We discussed in today's SG16 meeting and identified one
> last minor change to make. We then polled forwarding the paper to
> LWG with unanimous consent so this is definitely the last change!
>
> In 2.3.1, substitute "character" for "UCS scalar value" in the
> first sentence and in the table header.
>
> If /C/ is one of the _characters_ in the
> table below, then the two characters shown as the
> corresponding escape sequence are appended to /E/:
>
> _character_                     escape sequence
> U+0009 CHARACTER TABULATION     |\t|
> U+000A LINE FEED                |\n|
> U+000D CARRIAGE RETURN          |\r|
> U+0022 QUOTATION MARK           |\"|
> U+005C REVERSE SOLIDUS          |\\|
>
> Tom.
>
> On 5/11/22 12:56 PM, Tom Honermann via SG16 wrote:
>>
>> I have a weak preference for "character" given that the wording
>> is intended to address Unicode and non-Unicode behavior. I don't
>> think we have any normative uses of "code point" at present.
>>
>> The definition of "code point" we have via our normative
>> reference to ISO/IEC 10646 is: "value in the UCS codespace". That
>> doesn't really work for the non-Unicode case and, regardless,
>> would include surrogate code points which I don't think we want
>> in this context.
>>
>> Tom.
>>
>> On 5/11/22 12:24 PM, Victor Zverovich via SG16 wrote:
>>> Thanks Tom and others for revising the wording. The latest
>>> version of the escaping section looks good to me with only one
>>> minor question: is it clear that "character" in
>>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_14
>>> means a code point or shall we use the term code point instead?
>>>
>>> Cheers,
>>> Victor
>>>
>>> On Tue, May 10, 2022 at 6:32 PM Barry Revzin
>>> <barry.revzin_at_[hidden]> wrote:
>>>
>>>
>>>
>>> On Tue, May 10, 2022 at 1:31 PM Tom Honermann
>>> <tom_at_[hidden]> wrote:
>>>
>>> On 5/9/22 7:34 PM, Barry Revzin wrote:
>>>>
>>>>
>>>> On Mon, May 9, 2022 at 4:14 PM Tom Honermann
>>>> <tom_at_[hidden]> wrote:
>>>>
>>>> On 5/8/22 4:04 PM, Barry Revzin via SG16 wrote:
>>>>>
>>>>>
>>>>> On Sun, May 8, 2022 at 9:22 AM Victor Zverovich
>>>>> <victor.zverovich_at_[hidden]> wrote:
>>>>>
>>>>> > One thing I noticed is that the wording
>>>>> about Grapheme_Extend is gone. I didn't know
>>>>> what this meant before, so I don't know now if
>>>>> this is a good removal or a bad removal.
>>>>>
>>>>> I don't recall any requests for removing it
>>>>> and think that it should be reintroduced.
>>>>>
>>>>> - Victor
>>>>>
>>>>> On Wed, May 4, 2022 at 10:44 PM Jens Maurer
>>>>> <Jens.Maurer_at_[hidden]> wrote:
>>>>>
>>>>> On 05/05/2022 04.08, Barry Revzin wrote:
>>>>> > I think I have applied this. Here's the
>>>>> rendered version:
>>>>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
>>>>>
>>>>> > How does this look?
>>>>>
>>>>> p2.2
>>>>>
>>>>> For each code sequence X in S that either
>>>>> encodes a single character or encoding
>>>>> state transition or that is a sequence of
>>>>> ill-formed code units is processed in
>>>>> order as follows:
>>>>>
>>>>> That feels like bad English grammar to me.
>>>>>
>>>>> Why "encoding", yet there is an "encodes"
>>>>> before that?
>>>>> Why "either" and there are three things
>>>>> that don't
>>>>> exactly correspond grammatically?
>>>>>
>>>>> Maybe make a bulleted sub-list with the
>>>>> three items
>>>>> so that the structure is clear.
>>>>>
>>>>> "If C is one of the UCS scalar values the
>>>>> table below,"
>>>>>
>>>>> add "in"
>>>>>
>>>>> better clarify: "the two characters shown
>>>>> as the
>>>>> corresponding escape sequence are appended
>>>>> to E"
>>>>>
>>>>>
>>>>> after p2.3.4, p2.5
>>>>>
>>>>> "simple-hexadecimal-digit-sequence"
>>>>>
>>>>> I would not re-use lexing grammar for a
>>>>> local placeholder,
>>>>> just say \u{/hex-digit-sequence/} or so.
>>>>>
>>>>>
>>>>> p2.5
>>>>>
>>>>> "Otherwise, X is a sequence of ill-formed
>>>>> code units. Each"
>>>>>
>>>>> -> "Otherwise (X is a sequence of
>>>>> ill-formed code units), each code unit ..."
>>>>>
>>>>>
>>>>> "U+0027 APOSTROPHE is escaped as \' while
>>>>> U+0022 QUOTATION MARK is left unchanged."
>>>>>
>>>>> Can we rephrase that to avoid "is escaped
>>>>> as"? We were on such a good
>>>>> track to just append characters and avoid
>>>>> any judgment calls.
>>>>>
>>>>> suggestion "
>>>>> - for each character U+0027 APOSTROPHE in
>>>>> S, the two characters \' are appended to E
>>>>> - U+0022 QUOTATION MARK is left unchanged"
>>>>>
>>>>>
>>>>> Jens
>>>>>
>>>>>
>>>>> Thanks Jens and Victor! I did my best to apply the
>>>>> suggested changes:
>>>>>
>>>>> * Updated rendered wording:
>>>>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
>>>>> * New diff:
>>>>> https://github.com/brevzin/cpp_proposals/commit/3d93043f5c296810d7e18b11d5b7083143554309
>>>>>
>>>>>
>>>>> Hopefully, this gradient is slowly descending to
>>>>> the correct solution :-)
>>>>
>>>> Thanks, Barry. This appears to have incorporated
>>>> the parts of my prior suggestions that did not have
>>>> opposition, so just minor issues noted below.
>>>>
>>>> Discussion at the last meeting
>>>> <https://github.com/sg16-unicode/sg16-meetings#april-27th-2022>
>>>> revealed that we're failing to specify the encoding
>>>> used to interpret /S/. Change p2 as follows:
>>>> (perhaps substitute "as described below" for "as
>>>> follows")
>>>>
>>>> The escaped string /E/ representation of a
>>>> string /S/ is constructed by encoding a
>>>> sequence of characters _as follows._ _T_he
>>>> associated character encoding /CE/ for charT
>>>> ([lex.string.literal]
>>>> <http://eel.is/c++draft/tab:lex.string.literal>)
>>>> _is used both to interpret /S/ and to
>>>> construct /E/._
>>>>
>>>> In p2.2, "code sequence" -> "code unit sequence".
>>>>
>>>> In p2.3.4 and p2.5, I don't think we should re-use
>>>> the /hexadecimal-digit/ grammar term here. Just
>>>> say, "hexadecimal digits".
>>>>
>>>> Add the following note to p2.4 to address a request
>>>> by Hubert:
>>>>
>>>> Otherwise, if /X/ encodes a state transition,
>>>> the effect on /E/ is unspecified. _[ /Note:/ the
>>>> intent is that a state transition be
>>>> represented in /E/ such that its original code
>>>> unit sequence can be reconstructed /- end note/ ]_
>>>>
>>>> Hubert pointed out during the last meeting that we
>>>> should not be trying to interpret state transitions
>>>> for stateful encodings as I had previously been
>>>> trying to do. I think we can now simplify p2.5:
>>>>
>>>> Otherwise (/X/ is a sequence of ill-formed code
>>>> units), each code unit /U/ is appended to /E/
>>>> in order as the sequence
>>>> /\x{hex-digit-sequence}/, where
>>>> /hex-digit-sequence/ is the shortest
>>>> hexadecimal representation of /U/ using
>>>> lower-case hexadecimal digits. When encoding a
>>>> stateful character encoding, these additions
>>>> should have no effect on encoding state.
>>>>
>>>> In p3, we now need to drop "in a Unicode encoding".
>>>> I think the result should also produce a string,
>>>> not a character.
>>>>
>>>> The escaped _string_ representation of a
>>>> character /C/ is equivalent to the escaped
>>>> string representation of a string of /C/,
>>>> except that:
>>>>
>>>> p4 should be removed now.
>>>>
>>>> The escaped character and escaped string
>>>> representations of a character or string in a
>>>> non-Unicode encoding is unspecified.
>>>>
>>>> Hubert, the wording does not explicitly address
>>>> your request to be able to specify spacing and
>>>> separator characters as a set of encoding agnostic
>>>> code point values. I think the existing wording
>>>> suffices to meet your goals since an implementation
>>>> can document a method of identifying the set of
>>>> escaped characters by, for example, specifying
>>>> characters in EBCDIC 1047 and describing how to map
>>>> those to other code pages. If you don't agree,
>>>> could you suggest how the wording might be updated
>>>> to better address your concern?
>>>>
>>>> Tom.
>>>>
>>>>
>>>> Thanks, Tom! I applied these changes. The diff can be
>>>> found here:
>>>> https://github.com/brevzin/cpp_proposals/commit/6745d72f8c002b7ce8811f0c6aeb5591cff97d54
>>>
>>> Thanks, Barry. This looks good to me modulo Hubert's
>>> additional tweak.
>>>
>>> One last thing I noticed. The example section has this:
>>>
>>> string s4 = format("[{:?}]", string("\0 \n \t \x02 \x1b", 9));
>>> // s4 has value: ["\u{0} \n \t \u{2} \u{1b}"]
>>>
>>> That example depends on the encoding being ASCII-based
>>> in order for the \x02 and \x1b escapes to be interpreted
>>> as characters \u{2} and \u{1b}. Similarly, s5 and s6
>>> have UTF-8 dependencies. Perhaps we should add a comment?
>>>
>>> string s0 = format("[{}]", "h\tllo");                // s0 has value: [h llo]
>>> string s1 = format("[{:?}]", "h\tllo");              // s1 has value: ["h\tllo"]
>>> string s2 = format("[{:?}]", "Спасибо, Виктор ♥!");  // s2 has value: ["Спасибо, Виктор ♥!"]
>>> string s3 = format("[{:?}, {:?}]", '\'', '"');       // s3 has value: ['\'', '"']
>>> _// The following examples assume use of the UTF-8 encoding._
>>> string s4 = format("[{:?}]", string("\0 \n \t \x02 \x1b", 9));
>>>                                                      // s4 has value: ["\u{0} \n \t \u{2} \u{1b}"]
>>> string s5 = format("[{:?}]", "\xc3\x28");            // invalid UTF-8
>>>                                                      // s5 has value: ["\x{c3}("]
>>> string s6 = format("[{:?}]", "🤷🏻‍♂️");                 // s6 has value: ["🤷🏻\u{200d}♂\u{fe0f}"]
>>>
>>> I never got around to translating "Спасибо, Виктор ♥!"
>>> until now. Very nice :)
>>>
>>> Tom
>>>
>>>
>>> Applied Hubert's change and added this comment:
>>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
>>> Thanks!
>>>
>>> The decreasing rate of requested changes is encouraging!
>>>
>>> Barry
>>>
>>>
>>

Received on 2022-05-12 13:56:48
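
Appended illustration (not part of the thread): the escaping steps discussed
above can be sketched roughly as follows. This is a minimal sketch under
loud assumptions, not the paper's wording or any library's implementation.
The names escape_ascii, lower_hex and check_examples are hypothetical. The
string is assumed to be plain ASCII, so every byte from 0x80 to 0xff is
treated as an ill-formed code unit, and the set of characters written as
\u{...} is simplified to the ASCII control characters.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper: shortest lower-case hexadecimal representation of v.
std::string lower_hex(unsigned v) {
    const char* digits = "0123456789abcdef";
    std::string out;
    do {
        out.insert(out.begin(), digits[v % 16]);
        v /= 16;
    } while (v != 0);
    return out;
}

// Hypothetical sketch of the escaped string representation.
// Assumption: the input is interpreted as plain ASCII.
std::string escape_ascii(std::string_view s) {
    std::string e = "\"";
    for (unsigned char c : s) {
        switch (c) {
        // The five characters from the table in 2.3.1.
        case '\t': e += "\\t";  break;
        case '\n': e += "\\n";  break;
        case '\r': e += "\\r";  break;
        case '"':  e += "\\\""; break;
        case '\\': e += "\\\\"; break;
        default:
            if (c >= 0x80) {
                // Ill-formed code unit under the ASCII assumption.
                e += "\\x{" + lower_hex(c) + "}";
            } else if (c < 0x20 || c == 0x7f) {
                // Simplified stand-in for the non-printable rule.
                e += "\\u{" + lower_hex(c) + "}";
            } else {
                e += static_cast<char>(c);
            }
        }
    }
    e += '"';
    return e;
}

// Small self-check mirroring the s1, s4 and s5 inputs from the thread.
void check_examples() {
    assert(escape_ascii("h\tllo") == "\"h\\tllo\"");
    assert(escape_ascii(std::string_view("\0 \n \t \x02 \x1b", 9))
           == "\"\\u{0} \\n \\t \\u{2} \\u{1b}\"");
    assert(escape_ascii("\xc3\x28") == "\"\\x{c3}(\"");
}
```

A real implementation would decode the string in its associated character
encoding first. With UTF-8, for example, 0xc3 is only ill-formed because
the byte after it is not a continuation byte, which this ASCII-only sketch
does not model.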