Re: Suggested wording change for non-Unicode cases in P2286R7: Formatting Ranges

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 11 May 2022 16:36:10 -0400
Hi, Barry. We discussed this in today's SG16 meeting and identified one
last minor change to make. We then polled forwarding the paper to LWG
and got unanimous consent, so this is definitely the last change!

In 2.3.1, substitute "character" for "UCS scalar value" in the first
sentence and in the table header.

    If /C/ is one of the UCS scalar values _characters_ in the table
    below, then the two characters shown as the corresponding escape
    sequence are appended to /E/:

    UCS scalar value _character_    escape sequence
    U+0009 CHARACTER TABULATION     |\t|
    U+000A LINE FEED                |\n|
    U+000D CARRIAGE RETURN          |\r|
    U+0022 QUOTATION MARK           |\"|
    U+005C REVERSE SOLIDUS          |\\|
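
For readers following along, the table lookup above can be sketched in C++ roughly as follows. This is only an illustration, not wording or code from P2286: the names `table_escape` and `append_table_escape` are hypothetical, and the sketch assumes that E is held in a `std::string` and that the character has already been decoded to a `char32_t`.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper (not from P2286): returns the two-character escape
// sequence for the five characters in the table, or nullopt otherwise.
constexpr std::optional<std::string_view> table_escape(char32_t c) {
    switch (c) {
        case U'\t': return std::string_view{"\\t"};   // U+0009 CHARACTER TABULATION
        case U'\n': return std::string_view{"\\n"};   // U+000A LINE FEED
        case U'\r': return std::string_view{"\\r"};   // U+000D CARRIAGE RETURN
        case U'"':  return std::string_view{"\\\""};  // U+0022 QUOTATION MARK
        case U'\\': return std::string_view{"\\\\"};  // U+005C REVERSE SOLIDUS
        default:    return std::nullopt;
    }
}

// Appends the escape sequence for c to e and returns true, or leaves e
// unchanged and returns false when c is not in the table.
bool append_table_escape(std::string& e, char32_t c) {
    if (auto esc = table_escape(c)) {
        e += *esc;
        return true;
    }
    return false;
}
```

A caller would fall through to the remaining rules (printable characters, `\u{...}`, and so on) whenever `append_table_escape` returns false.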

Tom.

On 5/11/22 12:56 PM, Tom Honermann via SG16 wrote:
>
> I have a weak preference for "character" given that the wording is
> intended to address Unicode and non-Unicode behavior. I don't think we
> have any normative uses of "code point" at present.
>
> The definition of "code point" we have via our normative reference to
> ISO/IEC 10646 is: "value in the UCS codespace". That doesn't really
> work for the non-Unicode case and, regardless, would include surrogate
> code points which I don't think we want in this context.
>
> Tom.
>
> On 5/11/22 12:24 PM, Victor Zverovich via SG16 wrote:
>> Thanks Tom and others for revising the wording. The latest version of
>> the escaping section looks good to me with only one minor question:
>> is it clear that "character" in
>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_14
>> means a code point or shall we use the term code point instead?
>>
>> Cheers,
>> Victor
>>
>> On Tue, May 10, 2022 at 6:32 PM Barry Revzin <barry.revzin_at_[hidden]>
>> wrote:
>>
>>
>>
>> On Tue, May 10, 2022 at 1:31 PM Tom Honermann <tom_at_[hidden]>
>> wrote:
>>
>> On 5/9/22 7:34 PM, Barry Revzin wrote:
>>>
>>>
>>> On Mon, May 9, 2022 at 4:14 PM Tom Honermann
>>> <tom_at_[hidden]> wrote:
>>>
>>> On 5/8/22 4:04 PM, Barry Revzin via SG16 wrote:
>>>>
>>>>
>>>> On Sun, May 8, 2022 at 9:22 AM Victor Zverovich
>>>> <victor.zverovich_at_[hidden]> wrote:
>>>>
>>>> > One thing I noticed is that the wording about
>>>> Grapheme_Extend is gone. I didn't know what this
>>>> meant before, so I don't know now if this is a good
>>>> removal or a bad removal.
>>>>
>>>> I don't recall any requests for removing it and
>>>> think that it should be reintroduced.
>>>>
>>>> - Victor
>>>>
>>>> On Wed, May 4, 2022 at 10:44 PM Jens Maurer
>>>> <Jens.Maurer_at_[hidden]> wrote:
>>>>
>>>> On 05/05/2022 04.08, Barry Revzin wrote:
>>>> > I think I have applied this. Here's the
>>>> rendered version:
>>>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
>>>> <https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12>
>>>>
>>>> > How does this look?
>>>>
>>>> p2.2
>>>>
>>>> For each code sequence X in S that either
>>>> encodes a single character or encoding state
>>>> transition or that is a sequence of ill-formed
>>>> code units is processed in order as follows:
>>>>
>>>> That feels like bad English grammar to me.
>>>>
>>>> Why "encoding", yet there is an "encodes"
>>>> before that?
>>>> Why "either" and there are three things that don't
>>>> exactly correspond grammatically?
>>>>
>>>> Maybe make a bulleted sub-list with the three items
>>>> so that the structure is clear.
>>>>
>>>> "If C is one of the UCS scalar values the table
>>>> below,"
>>>>
>>>> add "in"
>>>>
>>>> better clarify: "the two characters shown as the
>>>> corresponding escape sequence are appended to E"
>>>>
>>>>
>>>> after p2.3.4, p2.5
>>>>
>>>> "simple-hexadecimal-digit-sequence"
>>>>
>>>> I would not re-use lexing grammar for a local
>>>> placeholder,
>>>> just say \u{/hex-digit-sequence/} or so.
>>>>
>>>>
>>>> p2.5
>>>>
>>>> "Otherwise, X is a sequence of ill-formed code
>>>> units. Each"
>>>>
>>>> -> "Otherwise (X is a sequence of ill-formed
>>>> code units), each code unit ..."
>>>>
>>>>
>>>> "U+0027 APOSTROPHE is escaped as \' while
>>>> U+0022 QUOTATION MARK is left unchanged."
>>>>
>>>> Can we rephrase that to avoid "is escaped as"?
>>>> We were on such a good
>>>> track to just append characters and avoid any
>>>> judgment calls.
>>>>
>>>> suggestion "
>>>> - for each character U+0027 APOSTROPHE in S,
>>>> the two characters \' are appended to E
>>>> - U+0022 QUOTATION MARK is left unchanged"
>>>>
>>>>
>>>> Jens
>>>>
>>>>
>>>> Thanks Jens and Victor! I did my best to apply the
>>>> suggested changes:
>>>>
>>>> * Updated rendered wording:
>>>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
>>>> * New diff:
>>>> https://github.com/brevzin/cpp_proposals/commit/3d93043f5c296810d7e18b11d5b7083143554309
>>>>
>>>>
>>>> Hopefully, this gradient is slowly descending to the
>>>> correct solution :-)
>>>
>>> Thanks, Barry. This appears to have incorporated the
>>> parts of my prior suggestions that did not have
>>> opposition, so just minor issues noted below.
>>>
>>> Discussion at the last meeting
>>> <https://github.com/sg16-unicode/sg16-meetings#april-27th-2022>
>>> revealed that we're failing to specify the encoding used
>>> to interpret /S/. Change p2 as follows: (perhaps
>>> substitute "as described below" for "as follows")
>>>
>>> The escaped string /E/ representation of a string
>>> /S/ is constructed by encoding a sequence of
>>> characters _as follows._ in t_T_he associated
>>> character encoding /CE/ for charT
>>> ([lex.string.literal]
>>> <http://eel.is/c++draft/tab:lex.string.literal>) as
>>> follows: _is used both to interpret /S/ and to
>>> construct /E/._
>>>
>>> In p2.2, "code sequence" -> "code unit sequence".
>>>
>>> In p2.3.4 and p2.5, I don't think we should re-use the
>>> /hexadecimal-digit/ grammar term here. Just say,
>>> "hexadecimal digits".
>>>
>>> Add the following note to p2.4 to address a request by
>>> Hubert:
>>>
>>> Otherwise, if /X/ encodes a state transition, the
>>> effect on /E/ is unspecified. _[ /Note:/ the intent
>>> is that a state transition be represented in /E/
>>> such that its original code unit sequence can be
>>> reconstructed /- end note/ ]_
>>>
>>> Hubert pointed out during the last meeting that we
>>> should not be trying to interpret state transitions for
>>> stateful encodings as I had previously been trying to
>>> do. I think we can now simplify p2.5:
>>>
>>> Otherwise (/X/ is a sequence of ill-formed code
>>> units), each code unit /U/ is appended to /E/ in
>>> order as the sequence /\x{hex-digit-sequence}/,
>>> where /hex-digit-sequence/ is the shortest
>>> hexadecimal representation of /U/ using lower-case
>>> hexadecimal digits. When encoding a stateful
>>> character encoding, these additions should have no
>>> effect on encoding state.
>>>
>>> In p3, we now need to drop "in a Unicode encoding". I
>>> think the result should also produce a string, not a
>>> character.
>>>
>>> The escaped character _string_ representation of a
>>> character /C/ in a Unicode encoding is equivalent to
>>> the escaped string representation of a string of
>>> /C/, except that:
>>>
>>> p4 should be removed now.
>>>
>>> The escaped character and escaped string
>>> representations of a character or string in a
>>> non-Unicode encoding is unspecified.
>>>
>>> Hubert, the wording does not explicitly address your
>>> request to be able to specify spacing and separator
>>> characters as a set of encoding agnostic code point
>>> values. I think the existing wording suffices to meet
>>> your goals since an implementation can document a method
>>> of identifying the set of escaped characters by, for
>>> example, specifying characters in EBCDIC 1047 and
>>> describing how to map those to other code pages. If you
>>> don't agree, could you suggest how the wording might be
>>> updated to better address your concern?
>>>
>>> Tom.
>>>
>>>
>>> Thanks, Tom! I applied these changes. The diff can be found
>>> here:
>>> https://github.com/brevzin/cpp_proposals/commit/6745d72f8c002b7ce8811f0c6aeb5591cff97d54
>>
>> Thanks, Barry. This looks good to me modulo Hubert's
>> additional tweak.
>>
>> One last thing I noticed. The example section has this:
>>
>>     string s4 = format("[{:?}]", string("\0 \n \t \x02 \x1b", 9));
>>     // s4 has value [\u{0} \n \t \u{2} \u{1b}]
>>
>> That example depends on the encoding being ASCII-based in
>> order for the \x02 and \x1b escapes to be interpreted as
>> characters \u{2} and \u{1b}. Similarly, s5 and s6 have UTF-8
>> dependencies. Perhaps we should add a comment?
>>
>>     string s0 = format("[{}]", "h\tllo");
>>     // s0 has value: [h llo]
>>     string s1 = format("[{:?}]", "h\tllo");
>>     // s1 has value: ["h\tllo"]
>>     string s2 = format("[{:?}]", "Спасибо, Виктор ♥!");
>>     // s2 has value: ["Спасибо, Виктор ♥!"]
>>     string s3 = format("[{:?}] [{:?}]", '\'', '"');
>>     // s3 has value: ['\'', '"']
>>     _// The following examples assume use of the UTF-8 encoding._
>>     string s4 = format("[{:?}]", string("\0 \n \t \x02 \x1b", 9));
>>     // s4 has value [\u{0} \n \t \u{2} \u{1b}]
>>     string s5 = format("[{:?}]", "\xc3\x28"); // invalid UTF-8
>>     // s5 has value: ["\x{c3}\x{28}"]
>>     string s6 = format("[{:?}]", "🤷🏻‍♂️");
>>     // s6 has value: ["🤷🏻\u{200d}♂\u{fe0f}"]
>>
>> I never got around to translating "Спасибо, Виктор ♥!" until
>> now. Very nice :)
>>
>> Tom
>>
>>
>> Applied Hubert's change and added this comment:
>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
>> Thanks!
>>
>> The decreasing rate of requested changes is encouraging!
>>
>> Barry
>>
>>
>
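
The p2.5 rule quoted above (each ill-formed code unit is appended as `\x{hex-digit-sequence}`, using the shortest lower-case hexadecimal representation) can be sketched as below. This is a minimal illustration under stated assumptions, not the paper's or any library's implementation: the names `escape_code_unit` and `escape_ill_formed` are hypothetical, and code units are assumed to be 8 bits wide, as with `char` strings in UTF-8.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: formats one 8-bit code unit as \x{...} using the
// shortest lower-case hexadecimal representation (no leading zero digit).
std::string escape_code_unit(unsigned char u) {
    static constexpr char digits[] = "0123456789abcdef";
    std::string out = "\\x{";
    if (u >= 0x10) {
        out += digits[u >> 4];  // high nibble only when it is non-zero
    }
    out += digits[u & 0xf];     // low nibble is always emitted
    out += '}';
    return out;
}

// Hypothetical helper: applies escape_code_unit to every code unit of a
// sequence that the caller has already classified as ill-formed.
std::string escape_ill_formed(std::string_view x) {
    std::string e;
    for (unsigned char u : x) {
        e += escape_code_unit(u);
    }
    return e;
}
```

With these definitions, the two code units `0xc3 0x28` from the s5 example yield `\x{c3}\x{28}`, matching the value shown in the thread.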

Received on 2022-05-11 20:36:12