sg16

Re: Suggested wording change for non-Unicode cases in P2286R7: Formatting Ranges

From: Barry Revzin <barry.revzin_at_[hidden]>
Date: Wed, 11 May 2022 20:44:47 -0500
Done!

Barry "Ship it?" Revzin

On Wed, May 11, 2022 at 3:36 PM Tom Honermann <tom_at_[hidden]> wrote:

> Hi, Barry. We discussed in today's SG16 meeting and identified one last
> minor change to make. We then polled forwarding the paper to LWG with
> unanimous consent so this is definitely the last change!
>
> In 2.3.1, substitute "character" for "UCS scalar value" in the first
> sentence and in the table header.
>
> If *C* is one of the *characters* in the table below,
> then the two characters shown as the corresponding escape sequence are
> appended to *E*:
>
> *character*                    escape sequence
> U+0009 CHARACTER TABULATION    \t
> U+000A LINE FEED               \n
> U+000D CARRIAGE RETURN         \r
> U+0022 QUOTATION MARK          \"
> U+005C REVERSE SOLIDUS         \\
>
> Tom.
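
The table lookup described above can be sketched as follows. This is only an illustration, not wording or code from the paper: `simple_escape` and `check_simple_escape` are hypothetical names. The character literals are written in the ordinary literal encoding, so the sketch does not hard-code Unicode code point values.

```cpp
#include <cassert>
#include <optional>
#include <string_view>

// Hypothetical helper (not from P2286): returns the two-character escape
// sequence for the five characters in the table, or nullopt for any other
// character.
constexpr std::optional<std::string_view> simple_escape(char c) {
    switch (c) {
        case '\t': return "\\t";   // CHARACTER TABULATION
        case '\n': return "\\n";   // LINE FEED
        case '\r': return "\\r";   // CARRIAGE RETURN
        case '"':  return "\\\"";  // QUOTATION MARK
        case '\\': return "\\\\";  // REVERSE SOLIDUS
        default:   return std::nullopt;
    }
}

// Small usage check for the sketch above.
inline void check_simple_escape() {
    assert(*simple_escape('\t') == "\\t");
    assert(*simple_escape('\\') == "\\\\");
    assert(!simple_escape('a').has_value());
}
```

Characters outside the table fall through to the remaining rules of the escaping section, which this sketch does not attempt to cover.
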
> On 5/11/22 12:56 PM, Tom Honermann via SG16 wrote:
>
> I have a weak preference for "character" given that the wording is
> intended to address Unicode and non-Unicode behavior. I don't think we have
> any normative uses of "code point" at present.
>
> The definition of "code point" we have via our normative reference to
> ISO/IEC 10646 is: "value in the UCS codespace". That doesn't really work
> for the non-Unicode case and, regardless, would include surrogate code
> points which I don't think we want in this context.
>
> Tom.
> On 5/11/22 12:24 PM, Victor Zverovich via SG16 wrote:
>
> Thanks Tom and others for revising the wording. The latest version of the
> escaping section looks good to me with only one minor question: is it clear
> that "character" in
> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_14
> means a code point or shall we use the term code point instead?
>
> Cheers,
> Victor
>
> On Tue, May 10, 2022 at 6:32 PM Barry Revzin <barry.revzin_at_[hidden]>
> wrote:
>
>>
>>
>> On Tue, May 10, 2022 at 1:31 PM Tom Honermann <tom_at_[hidden]> wrote:
>>
>>> On 5/9/22 7:34 PM, Barry Revzin wrote:
>>>
>>>
>>>
>>> On Mon, May 9, 2022 at 4:14 PM Tom Honermann <tom_at_[hidden]> wrote:
>>>
>>>> On 5/8/22 4:04 PM, Barry Revzin via SG16 wrote:
>>>>
>>>>
>>>>
>>>> On Sun, May 8, 2022 at 9:22 AM Victor Zverovich <
>>>> victor.zverovich_at_[hidden]> wrote:
>>>>
>>>>> > One thing I noticed is that the wording about Grapheme_Extend is
>>>>> gone. I didn't know what this meant before, so I don't know now if this is
>>>>> a good removal or a bad removal.
>>>>>
>>>>> I don't recall any requests for removing it and think that it should
>>>>> be reintroduced.
>>>>>
>>>>> - Victor
>>>>>
>>>>> On Wed, May 4, 2022 at 10:44 PM Jens Maurer <Jens.Maurer_at_[hidden]>
>>>>> wrote:
>>>>>
>>>>>> On 05/05/2022 04.08, Barry Revzin wrote:
>>>>>> > I think I have applied this. Here's the rendered version:
>>>>>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
>>>>>>
>>>>>> > How does this look?
>>>>>>
>>>>>> p2.2
>>>>>>
>>>>>> For each code sequence X in S that either encodes a single character
>>>>>> or encoding state transition or that is a sequence of ill-formed code units
>>>>>> is processed in order as follows:
>>>>>>
>>>>>> That feels like bad English grammar to me.
>>>>>>
>>>>>> Why "encoding", yet there is an "encodes" before that?
>>>>>> Why "either" and there are three things that don't
>>>>>> exactly correspond grammatically?
>>>>>>
>>>>>> Maybe make a bulleted sub-list with the three items
>>>>>> so that the structure is clear.
>>>>>>
>>>>>> "If C is one of the UCS scalar values the table below,"
>>>>>>
>>>>>> add "in"
>>>>>>
>>>>>> better clarify: "the two characters shown as the
>>>>>> corresponding escape sequence are appended to E"
>>>>>>
>>>>>>
>>>>>> after p2.3.4, p2.5
>>>>>>
>>>>>> "simple-hexadecimal-digit-sequence"
>>>>>>
>>>>>> I would not re-use lexing grammar for a local placeholder,
>>>>>> just say \u{/hex-digit-sequence/} or so.
>>>>>>
>>>>>>
>>>>>> p2.5
>>>>>>
>>>>>> "Otherwise, X is a sequence of ill-formed code units. Each"
>>>>>>
>>>>>> -> "Otherwise (X is a sequence of ill-formed code units), each code
>>>>>> unit ..."
>>>>>>
>>>>>>
>>>>>> "U+0027 APOSTROPHE is escaped as \' while U+0022 QUOTATION MARK is
>>>>>> left unchanged."
>>>>>>
>>>>>> Can we rephrase that to avoid "is escaped as"? We were on such a good
>>>>>> track to just append characters and avoid any judgment calls.
>>>>>>
>>>>>> suggestion "
>>>>>> - for each character U+0027 APOSTROPHE in S, the two characters \'
>>>>>> are appended to E
>>>>>> - U+0022 QUOTATION MARK is left unchanged"
>>>>>>
>>>>>>
>>>>>> Jens
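
The difference between the escaped character and escaped string forms that Jens's suggested bullets describe might look like this in code. It is a minimal sketch under stated assumptions: `escape_quote` and its `as_character` flag are hypothetical, and only the two quote characters are handled.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not from the paper). When `as_character` is true the
// escaped character form is produced: APOSTROPHE becomes \' and QUOTATION
// MARK is left unchanged. Otherwise the escaped string form is produced:
// QUOTATION MARK becomes \" and APOSTROPHE is left unchanged.
inline std::string escape_quote(char c, bool as_character) {
    if (as_character && c == '\'') return "\\'";
    if (!as_character && c == '"') return "\\\"";
    return std::string(1, c);
}

// Small usage check for the sketch above.
inline void check_escape_quote() {
    assert(escape_quote('\'', true) == "\\'");
    assert(escape_quote('"', true) == "\"");
    assert(escape_quote('"', false) == "\\\"");
    assert(escape_quote('\'', false) == "'");
}
```
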
>>>>>>
>>>>>
>>>> Thanks Jens and Victor! I did my best to apply the suggested changes:
>>>>
>>>>
>>>> - Updated rendered wording:
>>>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
>>>> - New diff:
>>>> https://github.com/brevzin/cpp_proposals/commit/3d93043f5c296810d7e18b11d5b7083143554309
>>>>
>>>>
>>>> Hopefully, this gradient is slowly descending to the correct solution
>>>> :-)
>>>>
>>>> Thanks, Barry. This appears to have incorporated the parts of my prior
>>>> suggestions that did not have opposition, so just minor issues noted below.
>>>>
>>>> Discussion at the last meeting
>>>> <https://github.com/sg16-unicode/sg16-meetings#april-27th-2022>
>>>> revealed that we're failing to specify the encoding used to interpret
>>>> *S*. Change p2 as follows: (perhaps substitute "as described below"
>>>> for "as follows")
>>>>
>>>> The escaped string *E* representation of a string *S* is constructed
>>>> by encoding a sequence of characters *as follows.* *The* associated
>>>> character encoding *CE* for charT ([lex.string.literal]
>>>> <http://eel.is/c++draft/tab:lex.string.literal>) *is used both to
>>>> interpret S and to construct E.*
>>>>
>>>> In p2.2, "code sequence" -> "code unit sequence".
>>>>
>>>> In p2.3.4 and p2.5, I don't think we should re-use the
>>>> *hexadecimal-digit* grammar term here. Just say, "hexadecimal digits".
>>>>
>>>> Add the following note to p2.4 to address a request by Hubert:
>>>>
>>>> Otherwise, if *X* encodes a state transition, the effect on *E* is
>>>> unspecified. *[ Note: the intent is that a state transition be
>>>> represented in E such that its original code unit sequence can be
>>>> reconstructed - end note ]*
>>>>
>>>> Hubert pointed out during the last meeting that we should not be trying
>>>> to interpret state transitions for stateful encodings as I had previously
>>>> been trying to do. I think we can now simplify p2.5:
>>>>
>>>> Otherwise (*X* is a sequence of ill-formed code units), each code unit
>>>> *U* is appended to *E* in order as the sequence
>>>> *\x{hex-digit-sequence}*, where *hex-digit-sequence* is the shortest
>>>> hexadecimal representation of *U* using lower-case hexadecimal digits.
>>>> When encoding a stateful character encoding, these additions should have no
>>>> effect on encoding state.
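
A rough sketch of the p2.5 rule for ill-formed code units, assuming 8-bit `char` code units and ignoring stateful encodings entirely. The names `append_hex_escape`, `escape_ill_formed`, and `check_hex_escape` are hypothetical and do not come from the paper.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper (not from the paper): appends \x{hex-digit-sequence}
// for one code unit, using the shortest lower-case hexadecimal form.
inline void append_hex_escape(std::string& out, unsigned char code_unit) {
    constexpr char digits[] = "0123456789abcdef";
    out += "\\x{";
    if (code_unit >= 0x10) {
        out += digits[code_unit >> 4];  // high nibble only when non-zero
    }
    out += digits[code_unit & 0x0F];
    out += '}';
}

// Escapes every code unit of an ill-formed sequence, in order.
inline std::string escape_ill_formed(std::string_view units) {
    std::string out;
    for (char u : units) {
        append_hex_escape(out, static_cast<unsigned char>(u));
    }
    return out;
}

// Small usage check for the sketch above.
inline void check_hex_escape() {
    assert(escape_ill_formed("\xc3\x28") == "\\x{c3}\\x{28}");
    assert(escape_ill_formed(std::string_view("\x02", 1)) == "\\x{2}");
}
```

Deciding which code units actually form an ill-formed sequence depends on the associated character encoding and is left out here; the sketch only shows the formatting of each unit once that decision has been made.
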
>>>>
>>>> In p3, we now need to drop "in a Unicode encoding". I think the result
>>>> should also produce a string, not a character.
>>>>
>>>> The escaped *string* representation of a character *C* is equivalent
>>>> to the escaped string representation of a string of *C*, except that:
>>>>
>>>> p4 should be removed now.
>>>>
>>>> The escaped character and escaped string representations of a character
>>>> or string in a non-Unicode encoding is unspecified.
>>>>
>>>> Hubert, the wording does not explicitly address your request to be able
>>>> to specify spacing and separator characters as a set of encoding agnostic
>>>> code point values. I think the existing wording suffices to meet your goals
>>>> since an implementation can document a method of identifying the set of
>>>> escaped characters by, for example, specifying characters in EBCDIC 1047
>>>> and describing how to map those to other code pages. If you don't agree,
>>>> could you suggest how the wording might be updated to better address your
>>>> concern?
>>>>
>>>> Tom.
>>>>
>>>
>>> Thanks, Tom! I applied these changes. The diff can be found here:
>>> https://github.com/brevzin/cpp_proposals/commit/6745d72f8c002b7ce8811f0c6aeb5591cff97d54
>>>
>>> Thanks, Barry. This looks good to me modulo Hubert's additional tweak.
>>>
>>> One last thing I noticed. The example section has this:
>>>
>>> string s4 = format("[{:?}]", string("\0 \n \t \x02 \x1b", 9));
>>> // s4 has value: [\u{0} \n \t \u{2} \u{1b}]
>>>
>>> That example depends on the encoding being ASCII-based in order for the
>>> \x02 and \x1b escapes to be interpreted as characters \u{2} and \u{1b}.
>>> Similarly, s5 and s6 have UTF-8 dependencies. Perhaps we should add a
>>> comment?
>>>
>>> string s0 = format("[{}]", "h\tllo");
>>> // s0 has value: [h llo]
>>> string s1 = format("[{:?}]", "h\tllo");
>>> // s1 has value: ["h\tllo"]
>>> string s2 = format("[{:?}]", "Спасибо, Виктор ♥!");
>>> // s2 has value: ["Спасибо, Виктор ♥!"]
>>> string s3 = format("[{:?}] [{:?}]", '\'', '"');
>>> // s3 has value: ['\'', '"']
>>> *// The following examples assume use of the UTF-8 encoding.*
>>> string s4 = format("[{:?}]", string("\0 \n \t \x02 \x1b", 9));
>>> // s4 has value: [\u{0} \n \t \u{2} \u{1b}]
>>> string s5 = format("[{:?}]", "\xc3\x28"); // invalid UTF-8
>>> // s5 has value: ["\x{c3}\x{28}"]
>>> string s6 = format("[{:?}]", "🤷🏻‍♂️");
>>> // s6 has value: ["🤷🏻\u{200d}♂\u{fe0f}"]
>>>
>>> I never got around to translating "Спасибо, Виктор ♥!" until now. Very
>>> nice :)
>>>
>>> Tom
>>>
>>
>> Applied Hubert's change and added this comment:
>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
>> Thanks!
>>
>> The decreasing rate of requested changes is encouraging!
>>
>> Barry
>>
>
>
>

Received on 2022-05-12 01:44:59