Re: Suggested wording change for non-Unicode cases in P2286R7: Formatting Ranges

From: Barry Revzin <barry.revzin_at_[hidden]>
Date: Fri, 13 May 2022 19:03:47 -0500
Thank you for making all these iterations!

LWG re-affirmed this paper today, making one change. The wording you all
provided me had a note:

[ *Note*: the intent is that a state transition be represented in `$E$`
such that the original code unit sequence of `$S$` can be reconstructed
-*end note* ]

which LWG wanted to elevate into recommended practice:

*Recommended Practice*: a state transition should be represented in `$E$`
such that the original code unit sequence of `$S$` can be reconstructed.

Same words, just slightly more intentional about the intent. I hope that's
okay with everybody. (diff:
https://github.com/brevzin/cpp_proposals/commit/fc263d0be55e189a6f98996a7cb06f2f87f82bfd,
rendered:
https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_21
)

Thanks again,

Barry

On Thu, May 12, 2022 at 8:56 AM Tom Honermann <tom_at_[hidden]> wrote:

> Ship it!
>
> Thank you for sticking with us through all these iterations!
>
> Tom.
> On 5/11/22 9:44 PM, Barry Revzin wrote:
>
> Done!
>
> Barry "Ship it?" Revzin
>
> On Wed, May 11, 2022 at 3:36 PM Tom Honermann <tom_at_[hidden]> wrote:
>
>> Hi, Barry. We discussed in today's SG16 meeting and identified one last
>> minor change to make. We then polled forwarding the paper to LWG with
>> unanimous consent so this is definitely the last change!
>>
>> In 2.3.1, substitute "character" for "UCS scalar value" in the first
>> sentence and in the table header.
>>
>> If *C* is one of the [-UCS scalar values-] *characters* in the table below,
>> then the two characters shown as the corresponding escape sequence are
>> appended to *E*:
>>
>> [-UCS scalar value-] *character*   escape sequence
>> U+0009 CHARACTER TABULATION        \t
>> U+000A LINE FEED                   \n
>> U+000D CARRIAGE RETURN             \r
>> U+0022 QUOTATION MARK              \"
>> U+005C REVERSE SOLIDUS             \\
>>
>> Tom.
>> On 5/11/22 12:56 PM, Tom Honermann via SG16 wrote:
>>
>> I have a weak preference for "character" given that the wording is
>> intended to address Unicode and non-Unicode behavior. I don't think we have
>> any normative uses of "code point" at present.
>>
>> The definition of "code point" we have via our normative reference to
>> ISO/IEC 10646 is: "value in the UCS codespace". That doesn't really work
>> for the non-Unicode case and, regardless, would include surrogate code
>> points which I don't think we want in this context.
>>
>> Tom.
>> On 5/11/22 12:24 PM, Victor Zverovich via SG16 wrote:
>>
>> Thanks Tom and others for revising the wording. The latest version of the
>> escaping section looks good to me with only one minor question: is it clear
>> that "character" in
>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_14
>> means a code point or shall we use the term code point instead?
>>
>> Cheers,
>> Victor
>>
>> On Tue, May 10, 2022 at 6:32 PM Barry Revzin <barry.revzin_at_[hidden]>
>> wrote:
>>
>>>
>>>
>>> On Tue, May 10, 2022 at 1:31 PM Tom Honermann <tom_at_[hidden]> wrote:
>>>
>>>> On 5/9/22 7:34 PM, Barry Revzin wrote:
>>>>
>>>>
>>>>
>>>> On Mon, May 9, 2022 at 4:14 PM Tom Honermann <tom_at_[hidden]> wrote:
>>>>
>>>>> On 5/8/22 4:04 PM, Barry Revzin via SG16 wrote:
>>>>>
>>>>>
>>>>>
>>>>> On Sun, May 8, 2022 at 9:22 AM Victor Zverovich <
>>>>> victor.zverovich_at_[hidden]> wrote:
>>>>>
>>>>>> > One thing I noticed is that the wording about Grapheme_Extend is
>>>>>> gone. I didn't know what this meant before, so I don't know now if this is
>>>>>> a good removal or a bad removal.
>>>>>>
>>>>>> I don't recall any requests for removing it and think that it should
>>>>>> be reintroduced.
>>>>>>
>>>>>> - Victor
>>>>>>
>>>>>> On Wed, May 4, 2022 at 10:44 PM Jens Maurer <Jens.Maurer_at_[hidden]>
>>>>>> wrote:
>>>>>>
>>>>>>> On 05/05/2022 04.08, Barry Revzin wrote:
>>>>>>> > I think I have applied this. Here's the rendered version:
>>>>>>> > https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
>>>>>>>
>>>>>>> > How does this look?
>>>>>>>
>>>>>>> p2.2
>>>>>>>
>>>>>>> For each code sequence X in S that either encodes a single character
>>>>>>> or encoding state transition or that is a sequence of ill-formed code units
>>>>>>> is processed in order as follows:
>>>>>>>
>>>>>>> That feels like bad English grammar to me.
>>>>>>>
>>>>>>> Why "encoding", yet there is an "encodes" before that?
>>>>>>> Why "either" and there are three things that don't
>>>>>>> exactly correspond grammatically?
>>>>>>>
>>>>>>> Maybe make a bulleted sub-list with the three items
>>>>>>> so that the structure is clear.
>>>>>>>
>>>>>>> "If C is one of the UCS scalar values the table below,"
>>>>>>>
>>>>>>> add "in"
>>>>>>>
>>>>>>> better clarify: "the two characters shown as the
>>>>>>> corresponding escape sequence are appended to E"
>>>>>>>
>>>>>>>
>>>>>>> after p2.3.4, p2.5
>>>>>>>
>>>>>>> "simple-hexadecimal-digit-sequence"
>>>>>>>
>>>>>>> I would not re-use lexing grammar for a local placeholder,
>>>>>>> just say \u{/hex-digit-sequence/} or so.
>>>>>>>
>>>>>>>
>>>>>>> p2.5
>>>>>>>
>>>>>>> "Otherwise, X is a sequence of ill-formed code units. Each"
>>>>>>>
>>>>>>> -> "Otherwise (X is a sequence of ill-formed code units), each code
>>>>>>> unit ..."
>>>>>>>
>>>>>>>
>>>>>>> "U+0027 APOSTROPHE is escaped as \' while U+0022 QUOTATION MARK is
>>>>>>> left unchanged."
>>>>>>>
>>>>>>> Can we rephrase that to avoid "is escaped as"? We were on such a
>>>>>>> good
>>>>>>> track to just append characters and avoid any judgment calls.
>>>>>>>
>>>>>>> suggestion "
>>>>>>> - for each character U+0027 APOSTROPHE in S, the two characters \'
>>>>>>> are appended to E
>>>>>>> - U+0022 QUOTATION MARK is left unchanged"
>>>>>>>
>>>>>>>
>>>>>>> Jens
>>>>>>>
>>>>>>
>>>>> Thanks Jens and Victor! I did my best to apply the suggested changes:
>>>>>
>>>>>
>>>>> - Updated rendered wording:
>>>>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
>>>>> - New diff:
>>>>> https://github.com/brevzin/cpp_proposals/commit/3d93043f5c296810d7e18b11d5b7083143554309
>>>>>
>>>>>
>>>>> Hopefully, this gradient is slowly descending to the correct solution
>>>>> :-)
>>>>>
>>>>> Thanks, Barry. This appears to have incorporated the parts of my prior
>>>>> suggestions that did not have opposition, so just minor issues noted below.
>>>>>
>>>>> Discussion at the last meeting
>>>>> <https://github.com/sg16-unicode/sg16-meetings#april-27th-2022>
>>>>> revealed that we're failing to specify the encoding used to interpret
>>>>> *S*. Change p2 as follows: (perhaps substitute "as described below"
>>>>> for "as follows")
>>>>>
>>>>> The escaped string *E* representation of a string *S* is constructed
>>>>> by encoding a sequence of characters *as follows.* [-in t-]*T*he
>>>>> associated character encoding *CE* for charT ([lex.string.literal]
>>>>> <http://eel.is/c++draft/tab:lex.string.literal>) [-as follows:-] *is used
>>>>> both to interpret S and to construct E.*
>>>>>
>>>>> In p2.2, "code sequence" -> "code unit sequence".
>>>>>
>>>>> In p2.3.4 and p2.5, I don't think we should re-use the
>>>>> *hexadecimal-digit* grammar term here. Just say, "hexadecimal digits".
>>>>>
>>>>> Add the following note to p2.4 to address a request by Hubert:
>>>>>
>>>>> Otherwise, if *X* encodes a state transition, the effect on *E* is
>>>>> unspecified.* [ Note: the intent is that a state transition be
>>>>> represented in E such that its original code unit sequence can be
>>>>> reconstructed - end note ]*
>>>>>
>>>>> Hubert pointed out during the last meeting that we should not be
>>>>> trying to interpret state transitions for stateful encodings as I had
>>>>> previously been trying to do. I think we can now simplify p2.5:
>>>>>
>>>>> Otherwise (*X* is a sequence of ill-formed code units), each code
>>>>> unit *U* is appended to *E* in order as the sequence
>>>>> *\x{hex-digit-sequence}*, where *hex-digit-sequence* is the shortest
>>>>> hexadecimal representation of *U* using lower-case hexadecimal digits.
>>>>> When encoding a stateful character encoding, these additions should have no
>>>>> effect on encoding state.
>>>>>
>>>>> In p3, we now need to drop "in a Unicode encoding". I think the result
>>>>> should also produce a string, not a character.
>>>>>
>>>>> The escaped [-character-] *string* representation of a character *C*
>>>>> [-in a Unicode encoding-] is equivalent to the escaped string
>>>>> representation of a string of *C*, except that:
>>>>>
>>>>> p4 should be removed now.
>>>>>
>>>>> The escaped character and escaped string representations of a
>>>>> character or string in a non-Unicode encoding is unspecified.
>>>>>
>>>>> Hubert, the wording does not explicitly address your request to be
>>>>> able to specify spacing and separator characters as a set of encoding
>>>>> agnostic code point values. I think the existing wording suffices to meet
>>>>> your goals since an implementation can document a method of identifying the
>>>>> set of escaped characters by, for example, specifying characters in EBCDIC
>>>>> 1047 and describing how to map those to other code pages. If you don't
>>>>> agree, could you suggest how the wording might be updated to better address
>>>>> your concern?
>>>>>
>>>>> Tom.
>>>>>
>>>>
>>>> Thanks, Tom! I applied these changes. The diff can be found here:
>>>> https://github.com/brevzin/cpp_proposals/commit/6745d72f8c002b7ce8811f0c6aeb5591cff97d54
>>>>
>>>> Thanks, Barry. This looks good to me modulo Hubert's additional tweak.
>>>>
>>>> One last thing I noticed. The example section has this:
>>>>
>>>> string s4 = format("[{:?}]", string("\0 \n \t \x02 \x1b", 9));
>>>> // s4 has value: [\u{0} \n \t \u{2} \u{1b}]
>>>>
>>>> That example depends on the encoding being ASCII-based in order for the
>>>> \x02 and \x1b escapes to be interpreted as characters \u{2} and \u{1b}.
>>>> Similarly, s5 and s6 have UTF-8 dependencies. Perhaps we should add a
>>>> comment?
>>>>
>>>> string s0 = format("[{}]", "h\tllo");   // s0 has value: [h llo]
>>>> string s1 = format("[{:?}]", "h\tllo"); // s1 has value: ["h\tllo"]
>>>> string s2 = format("[{:?}]", "Спасибо, Виктор ♥!");
>>>> // s2 has value: ["Спасибо, Виктор ♥!"]
>>>> string s3 = format("[{:?}, {:?}]", '\'', '"'); // s3 has value: ['\'', '"']
>>>> *// The following examples assume use of the UTF-8 encoding.*
>>>> string s4 = format("[{:?}]", string("\0 \n \t \x02 \x1b", 9));
>>>> // s4 has value: [\u{0} \n \t \u{2} \u{1b}]
>>>> string s5 = format("[{:?}]", "\xc3\x28"); // invalid UTF-8
>>>> // s5 has value: ["\x{c3}\x{28}"]
>>>> string s6 = format("[{:?}]", "🤷🏻‍♂️");
>>>> // s6 has value: ["🤷🏻\u{200d}♂\u{fe0f}"]
>>>>
>>>> I never got around to translating "Спасибо, Виктор ♥!" until now. Very
>>>> nice :)
>>>>
>>>> Tom
>>>>
>>>
>>> Applied Hubert's change and added this comment:
>>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
>>> Thanks!
>>>
>>> The decreasing rate of requested changes is encouraging!
>>>
>>> Barry
>>>
>>
>>
>>
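
For readers following the thread, here is a rough sketch of the escaping steps discussed above: the two-character escapes from the table, \u{...} for other non-printable characters, \x{...} for ill-formed code units, and the apostrophe/quotation-mark difference between the string and character forms. It is not the wording from the paper and not how any standard library implements std::format. To stay short it assumes the encoding is plain ASCII, so every code unit >= 0x80 is treated as ill-formed and only the ASCII control characters are treated as non-printable. The names append_hex, append_escaped, escape_string_ascii and escape_char_ascii are made up for this illustration.

```cpp
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical helper: appends \u{...} or \x{...} using the shortest
// lower-case hexadecimal representation of value.
void append_hex(std::string& e, char kind, unsigned value) {
    char buf[16];
    std::snprintf(buf, sizeof buf, "\\%c{%x}", kind, value);
    e += buf;
}

// Appends the escaped form of one code unit to e.
// quote is the delimiter that gets a backslash: '"' for the string form,
// '\'' for the character form. The other quote is left unchanged.
// ASSUMPTION: ASCII encoding, so code units >= 0x80 count as ill-formed.
void append_escaped(std::string& e, unsigned char c, char quote) {
    switch (c) {
    case '\t': e += "\\t"; return;
    case '\n': e += "\\n"; return;
    case '\r': e += "\\r"; return;
    case '\\': e += "\\\\"; return;
    default: break;
    }
    if (c == static_cast<unsigned char>(quote)) {
        e += '\\';
        e += quote;
    } else if (c >= 0x80) {
        append_hex(e, 'x', c);   // ill-formed code unit
    } else if (c < 0x20 || c == 0x7f) {
        append_hex(e, 'u', c);   // ASCII control character
    } else {
        e += static_cast<char>(c);
    }
}

// Escaped string form: wrapped in quotation marks.
std::string escape_string_ascii(std::string_view s) {
    std::string e = "\"";
    for (unsigned char c : s) append_escaped(e, c, '"');
    e += '"';
    return e;
}

// Escaped character form: wrapped in apostrophes.
std::string escape_char_ascii(char c) {
    std::string e = "'";
    append_escaped(e, static_cast<unsigned char>(c), '\'');
    e += '\'';
    return e;
}
```

A real implementation would decode according to the associated character encoding for charT and consult Unicode properties (General_Category, Grapheme_Extend) rather than the ASCII range checks used here.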

Received on 2022-05-14 00:04:00