ISOCPP sg16 List: Re: Suggested wording change for non-Unicode cases in P2286R7: Formatting Ranges

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 10 May 2022 14:31:38 -0400

On 5/9/22 7:34 PM, Barry Revzin wrote:
>
>
> On Mon, May 9, 2022 at 4:14 PM Tom Honermann <tom_at_[hidden]> wrote:
>
> On 5/8/22 4:04 PM, Barry Revzin via SG16 wrote:
>>
>>
>> On Sun, May 8, 2022 at 9:22 AM Victor Zverovich
>> <victor.zverovich_at_[hidden]> wrote:
>>
>> > One thing I noticed is that the wording about
>> Grapheme_Extend is gone. I didn't know what this meant
>> before, so I don't know now if this is a good removal or a
>> bad removal.
>>
>> I don't recall any requests for removing it and think that it
>> should be reintroduced.
>>
>> - Victor
>>
>> On Wed, May 4, 2022 at 10:44 PM Jens Maurer
>> <Jens.Maurer_at_[hidden]> wrote:
>>
>> On 05/05/2022 04.08, Barry Revzin wrote:
>> > I think I have applied this. Here's the rendered
>> version:
>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
>>
>> > How does this look?
>>
>> p2.2
>>
>> For each code sequence X in S that either encodes a
>> single character or encoding state transition or that is
>> a sequence of ill-formed code units is processed in order
>> as follows:
>>
>> That feels like bad English grammar to me.
>>
>> Why "encoding", yet there is an "encodes" before that?
>> Why "either" and there are three things that don't
>> exactly correspond grammatically?
>>
>> Maybe make a bulleted sub-list with the three items
>> so that the structure is clear.
>>
>> "If C is one of the UCS scalar values the table below,"
>>
>> add "in"
>>
>> better clarify: "the two characters shown as the
>> corresponding escape sequence are appended to E"
>>
>>
>> after p2.3.4, p2.5
>>
>> "simple-hexadecimal-digit-sequence"
>>
>> I would not re-use lexing grammar for a local placeholder,
>> just say \u{/hex-digit-sequence/} or so.
>>
>>
>> p2.5
>>
>> "Otherwise, X is a sequence of ill-formed code units. Each"
>>
>> -> "Otherwise (X is a sequence of ill-formed code units),
>> each code unit ..."
>>
>>
>> "U+0027 APOSTROPHE is escaped as \' while U+0022
>> QUOTATION MARK is left unchanged."
>>
>> Can we rephrase that to avoid "is escaped as"? We were
>> on such a good
>> track to just append characters and avoid any judgment calls.
>>
>> suggestion "
>> - for each character U+0027 APOSTROPHE in S, the two
>> characters \' are appended to E
>> - U+0022 QUOTATION MARK is left unchanged"
>>
>>
>> Jens
>>
>>
>> Thanks Jens and Victor! I did my best to apply the suggested changes:
>>
>> * Updated rendered wording:
>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
>> * New diff:
>> https://github.com/brevzin/cpp_proposals/commit/3d93043f5c296810d7e18b11d5b7083143554309
>>
>>
>> Hopefully, this gradient is slowly descending to the correct
>> solution :-)
>
> Thanks, Barry. This appears to have incorporated the parts of my
> prior suggestions that did not have opposition, so just minor
> issues noted below.
>
> Discussion at the last meeting
> <https://github.com/sg16-unicode/sg16-meetings#april-27th-2022>
> revealed that we're failing to specify the encoding used to
> interpret /S/. Change p2 as follows: (perhaps substitute "as
> described below" for "as follows")
>
> The escaped string /E/ representation of a string /S/ is
> constructed by encoding a sequence of characters_as
> follows._in t_T_he associated character encoding /CE/ for
> charT ([lex.string.literal]
> <http://eel.is/c++draft/tab:lex.string.literal>)as follows:_is
> used both to interpret /S/ and to construct /E/._
>
> In p2.2, "code sequence" -> "code unit sequence".
>
> In p2.3.4 and p2.5, I don't think we should re-use the
> /hexadecimal-digit/ grammar term here. Just say, "hexadecimal digits".
>
> Add the following note to p2.4 to address a request by Hubert:
>
> Otherwise, if /X/ encodes a state transition, the effect on
> /E/ is unspecified._[ /Note:/ the intent is that a state
> transition be represented in /E/ such that its original code
> unit sequence can be reconstructed /- end note/ ]_
>
> Hubert pointed out during the last meeting that we should not be
> trying to interpret state transitions for stateful encodings as I
> had previously been trying to do. I think we can now simplify p2.5:
>
> Otherwise (/X/ is a sequence of ill-formed code units), each
> code unit /U/ is appended to /E/ in order as the sequence
> /\x{hex-digit-sequence}/, where /hex-digit-sequence/ is the
> shortest hexadecimal representation of /U/ using lower-case
> hexadecimal digits. When encoding a stateful character
> encoding, these additions should have no effect on encoding state.
>
> In p3, we now need to drop "in a Unicode encoding". I think the
> result should also produce a string, not a character.
>
> The escaped character_string_ representation of a character
> /C/ in a Unicode encoding is equivalent to the escaped string
> representation of a string of /C/, except that:
>
> p4 should be removed now.
>
> The escaped character and escaped string representations of a
> character or string in a non-Unicode encoding is unspecified.
>
> Hubert, the wording does not explicitly address your request to be
> able to specify spacing and separator characters as a set of
> encoding agnostic code point values. I think the existing wording
> suffices to meet your goals since an implementation can document a
> method of identifying the set of escaped characters by, for
> example, specifying characters in EBCDIC 1047 and describing how
> to map those to other code pages. If you don't agree, could you
> suggest how the wording might be updated to better address your
> concern?
>
> Tom.
>
>
> Thanks, Tom! I applied these changes. The diff can be found here:
> https://github.com/brevzin/cpp_proposals/commit/6745d72f8c002b7ce8811f0c6aeb5591cff97d54

Thanks, Barry. This looks good to me modulo Hubert's additional tweak.

One last thing I noticed. The example section has this:

    string s4 = format("[{:?}]", string("\0 \n \t \x02 \x1b", 9));
    // s4 has value [\u{0} \n \t \u{2} \u{1b}]

That example depends on the encoding being ASCII-based in order for the
\x02 and \x1b escapes to be interpreted as characters \u{2} and \u{1b}.
Similarly, s5 and s6 have UTF-8 dependencies. Perhaps we should add a
comment?

    string s0 = format("[{}]", "h\tllo"); // s0 has value: [h llo]
    string s1 = format("[{:?}]", "h\tllo"); // s1 has value: ["h\tllo"]
    string s2 = format("[{:?}]", "Спасибо, Виктор ♥!"); // s2 has value: ["Спасибо, Виктор ♥!"]
    string s3 = format("[{:?}] [{:?}]", '\'', '"'); // s3 has value: ['\'', '"']
    _// The following examples assume use of the UTF-8 encoding._
    string s4 = format("[{:?}]", string("\0 \n \t \x02 \x1b", 9));
    // s4 has value [\u{0} \n \t \u{2} \u{1b}]
    string s5 = format("[{:?}]", "\xc3\x28"); // invalid UTF-8
    // s5 has value: ["\x{c3}\x{28}"]
    string s6 = format("[{:?}]", "🤷🏻‍♂️"); // s6 has value: ["🤷🏻\u{200d}♂\u{fe0f}"]
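
For illustration, the simplified p2.5 rule quoted earlier (each ill-formed
code unit is appended as \x{hex-digit-sequence}, using the shortest
lower-case hexadecimal form) could be sketched roughly as follows. This is
only a sketch, not the P2286 implementation: escape_code_unit and
escape_ill_formed are hypothetical names, and the code assumes the caller
has already classified the code units as ill-formed.

```cpp
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical helper (not from P2286): renders one code unit as
// \x{hex-digit-sequence}. "%x" yields the shortest lower-case hex form.
std::string escape_code_unit(unsigned char u) {
    char buf[8];  // "\x{ff}" plus the terminator fits for 8-bit code units
    std::snprintf(buf, sizeof buf, "\\x{%x}", static_cast<unsigned>(u));
    return buf;
}

// Hypothetical helper: applies the rule, in order, to every code unit of a
// sequence that is assumed to have been classified as ill-formed already.
std::string escape_ill_formed(std::string_view units) {
    std::string out;
    for (char c : units) {
        out += escape_code_unit(static_cast<unsigned char>(c));
    }
    return out;
}
```

Under these assumptions, passing the two code units of "\xc3\x28" to
escape_ill_formed yields \x{c3}\x{28}, which matches the s5 comment above.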

I never got around to translating "Спасибо, Виктор ♥!" until now. Very
nice :)

Tom.

>
> Barry

Received on 2022-05-10 18:31:40