sg16

Re: Suggested wording change for non-Unicode cases in P2286R7: Formatting Ranges

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 11 May 2022 12:56:03 -0400
I have a weak preference for "character" given that the wording is
intended to address Unicode and non-Unicode behavior. I don't think we
have any normative uses of "code point" at present.

The definition of "code point" we have via our normative reference to
ISO/IEC 10646 is: "value in the UCS codespace". That doesn't really work
for the non-Unicode case and, regardless, would include surrogate code
points which I don't think we want in this context.

Tom.
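
The distinction drawn above between "code point" and the values wanted here can be sketched in C++. This is only an illustration: the function names are hypothetical and appear neither in the paper nor in the standard. It assumes the ISO/IEC 10646 codespace of U+0000 through U+10FFFF and the surrogate range U+D800 through U+DFFF.

```cpp
#include <cstdint>

// Any value in the UCS codespace counts as a code point.
constexpr bool is_code_point(std::uint32_t v) { return v <= 0x10FFFF; }

// Surrogate code points are reserved for UTF-16 and never encode a character alone.
constexpr bool is_surrogate(std::uint32_t v) { return v >= 0xD800 && v <= 0xDFFF; }

// A scalar value is a code point that is not a surrogate.
constexpr bool is_scalar_value(std::uint32_t v) {
    return is_code_point(v) && !is_surrogate(v);
}

// U+D800 is a code point but not a scalar value.
static_assert(is_code_point(0xD800) && !is_scalar_value(0xD800));
// U+2665 BLACK HEART SUIT is both.
static_assert(is_code_point(0x2665) && is_scalar_value(0x2665));
// 0x110000 lies outside the codespace.
static_assert(!is_code_point(0x110000));
```

Using "code point" in the wording would admit the values for which `is_surrogate` is true, which is the concern raised above.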

On 5/11/22 12:24 PM, Victor Zverovich via SG16 wrote:
> Thanks Tom and others for revising the wording. The latest version of
> the escaping section looks good to me with only one minor question: is
> it clear that "character" in
> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_14
> means a code point or shall we use the term code point instead?
>
> Cheers,
> Victor
>
> On Tue, May 10, 2022 at 6:32 PM Barry Revzin <barry.revzin_at_[hidden]>
> wrote:
>
>
>
> On Tue, May 10, 2022 at 1:31 PM Tom Honermann <tom_at_[hidden]>
> wrote:
>
> On 5/9/22 7:34 PM, Barry Revzin wrote:
>>
>>
>> On Mon, May 9, 2022 at 4:14 PM Tom Honermann
>> <tom_at_[hidden]> wrote:
>>
>> On 5/8/22 4:04 PM, Barry Revzin via SG16 wrote:
>>>
>>>
>>> On Sun, May 8, 2022 at 9:22 AM Victor Zverovich
>>> <victor.zverovich_at_[hidden]> wrote:
>>>
>>> > One thing I noticed is that the wording about
>>> Grapheme_Extend is gone. I didn't know what this
>>> meant before, so I don't know now if this is a good
>>> removal or a bad removal.
>>>
>>> I don't recall any requests for removing it and
>>> think that it should be reintroduced.
>>>
>>> - Victor
>>>
>>> On Wed, May 4, 2022 at 10:44 PM Jens Maurer
>>> <Jens.Maurer_at_[hidden]> wrote:
>>>
>>> On 05/05/2022 04.08, Barry Revzin wrote:
>>> > I think I have applied this. Here's the
>>> rendered version:
>>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
>>>
>>> > How does this look?
>>>
>>> p2.2
>>>
>>> For each code sequence X in S that either
>>> encodes a single character or encoding state
>>> transition or that is a sequence of ill-formed
>>> code units is processed in order as follows:
>>>
>>> That feels like bad English grammar to me.
>>>
>>> Why "encoding", yet there is an "encodes" before
>>> that?
>>> Why "either" and there are three things that don't
>>> exactly correspond grammatically?
>>>
>>> Maybe make a bulleted sub-list with the three items
>>> so that the structure is clear.
>>>
>>> "If C is one of the UCS scalar values the table
>>> below,"
>>>
>>> add "in"
>>>
>>> better clarify: "the two characters shown as the
>>> corresponding escape sequence are appended to E"
>>>
>>>
>>> after p2.3.4, p2.5
>>>
>>> "simple-hexadecimal-digit-sequence"
>>>
>>> I would not re-use lexing grammar for a local
>>> placeholder,
>>> just say \u{/hex-digit-sequence/} or so.
>>>
>>>
>>> p2.5
>>>
>>> "Otherwise, X is a sequence of ill-formed code
>>> units. Each"
>>>
>>> -> "Otherwise (X is a sequence of ill-formed
>>> code units), each code unit ..."
>>>
>>>
>>> "U+0027 APOSTROPHE is escaped as \' while U+0022
>>> QUOTATION MARK is left unchanged."
>>>
>>> Can we rephrase that to avoid "is escaped as"?
>>> We were on such a good
>>> track to just append characters and avoid any
>>> judgment calls.
>>>
>>> suggestion "
>>> - for each character U+0027 APOSTROPHE in S,
>>> the two characters \' are appended to E
>>> - U+0022 QUOTATION MARK is left unchanged"
>>>
>>>
>>> Jens
>>>
>>>
>>> Thanks Jens and Victor! I did my best to apply the
>>> suggested changes:
>>>
>>> * Updated rendered wording:
>>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
>>> * New diff:
>>> https://github.com/brevzin/cpp_proposals/commit/3d93043f5c296810d7e18b11d5b7083143554309
>>>
>>>
>>> Hopefully, this gradient is slowly descending to the
>>> correct solution :-)
>>
>> Thanks, Barry. This appears to have incorporated the
>> parts of my prior suggestions that did not have
>> opposition, so just minor issues noted below.
>>
>> Discussion at the last meeting
>> <https://github.com/sg16-unicode/sg16-meetings#april-27th-2022>
>> revealed that we're failing to specify the encoding used
>> to interpret /S/. Change p2 as follows: (perhaps
>> substitute "as described below" for "as follows")
>>
>> The escaped string /E/ representation of a string /S/
>> is constructed by encoding a sequence of
>> characters _as follows._ The associated character
>> encoding /CE/ for charT ([lex.string.literal]
>> <http://eel.is/c++draft/tab:lex.string.literal>)
>> _is used both to interpret /S/ and to construct /E/._
>>
>> In p2.2, "code sequence" -> "code unit sequence".
>>
>> In p2.3.4 and p2.5, I don't think we should re-use the
>> /hexadecimal-digit/ grammar term here. Just say,
>> "hexadecimal digits".
>>
>> Add the following note to p2.4 to address a request by
>> Hubert:
>>
>> Otherwise, if /X/ encodes a state transition, the
>> effect on /E/ is unspecified. _[ /Note:/ the intent is
>> that a state transition be represented in /E/ such
>> that its original code unit sequence can be
>> reconstructed /- end note/ ]_
>>
>> Hubert pointed out during the last meeting that we should
>> not be trying to interpret state transitions for stateful
>> encodings as I had previously been trying to do. I think
>> we can now simplify p2.5:
>>
>> Otherwise (/X/ is a sequence of ill-formed code
>> units), each code unit /U/ is appended to /E/ in
>> order as the sequence /\x{hex-digit-sequence}/, where
>> /hex-digit-sequence/ is the shortest hexadecimal
>> representation of /U/ using lower-case hexadecimal
>> digits. When encoding a stateful character encoding,
>> these additions should have no effect on encoding state.
>>
>> In p3, we now need to drop "in a Unicode encoding". I
>> think the result should also produce a string, not a
>> character.
>>
>> The escaped _string_ representation of a character /C/
>> is equivalent to the escaped string representation of
>> a string of /C/, except that:
>>
>> p4 should be removed now.
>>
>> The escaped character and escaped string
>> representations of a character or string in a
>> non-Unicode encoding is unspecified.
>>
>> Hubert, the wording does not explicitly address your
>> request to be able to specify spacing and separator
>> characters as a set of encoding agnostic code point
>> values. I think the existing wording suffices to meet
>> your goals since an implementation can document a method
>> of identifying the set of escaped characters by, for
>> example, specifying characters in EBCDIC 1047 and
>> describing how to map those to other code pages. If you
>> don't agree, could you suggest how the wording might be
>> updated to better address your concern?
>>
>> Tom.
>>
>>
>> Thanks, Tom! I applied these changes. The diff can be found
>> here:
>> https://github.com/brevzin/cpp_proposals/commit/6745d72f8c002b7ce8811f0c6aeb5591cff97d54
>
> Thanks, Barry. This looks good to me modulo Hubert's
> additional tweak.
>
> One last thing I noticed. The example section has this:
>
> string s4 = format("[{:?}]", string("\0 \n \t \x02 \x1b", 9));
> // s4 has value [\u{0} \n \t \u{2} \u{1b}]
>
> That example depends on the encoding being ASCII-based in
> order for the \x02 and \x1b escapes to be interpreted as
> characters \u{2} and \u{1b}. Similarly, s5 and s6 have UTF-8
> dependencies. Perhaps we should add a comment?
>
> string s0 = format("[{}]", "h\tllo");
> // s0 has value: [h llo]
> string s1 = format("[{:?}]", "h\tllo");
> // s1 has value: ["h\tllo"]
> string s2 = format("[{:?}]", "Спасибо, Виктор ♥!");
> // s2 has value: ["Спасибо, Виктор ♥!"]
> string s3 = format("[{:?}] [{:?}]", '\'', '"');
> // s3 has value: ['\'', '"']
> _// The following examples assume use of the UTF-8 encoding._
> string s4 = format("[{:?}]", string("\0 \n \t \x02 \x1b", 9));
> // s4 has value [\u{0} \n \t \u{2} \u{1b}]
> string s5 = format("[{:?}]", "\xc3\x28");
> // invalid UTF-8
> // s5 has value: ["\x{c3}\x{28}"]
> string s6 = format("[{:?}]", "🤷🏻‍♂️");
> // s6 has value: ["🤷🏻\u{200d}♂\u{fe0f}"]
>
> I never got around to translating "Спасибо, Виктор ♥!" until
> now. Very nice :)
>
> Tom
>
>
> Applied Hubert's change and added this comment:
> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
> Thanks!
>
> The decreasing rate of requested changes is encouraging!
>
> Barry
>
>
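
The escaping steps discussed in this thread (named escapes, `\u{...}` for other escaped characters, `\x{...}` for ill-formed code units) can be sketched roughly as follows. This is a simplified illustration, not the paper's wording or any library's implementation. It assumes the string is UTF-8; `well_formed_length` and `escape_string` are hypothetical names. Among well-formed characters it escapes only ASCII control characters, and it passes all other well-formed sequences through unchanged, so it omits the Unicode property checks (such as Grapheme_Extend) that the wording relies on.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <string_view>

// Length of the well-formed UTF-8 sequence starting at s[i], or 0 if ill-formed.
std::size_t well_formed_length(std::string_view s, std::size_t i) {
    auto byte = [&](std::size_t k) -> unsigned {
        return k < s.size() ? static_cast<unsigned char>(s[k]) : 0x100u;
    };
    auto cont = [&](std::size_t k, unsigned lo = 0x80, unsigned hi = 0xBF) {
        unsigned b = byte(k);
        return b >= lo && b <= hi;
    };
    unsigned b0 = byte(i);
    if (b0 < 0x80) return 1;
    if (b0 >= 0xC2 && b0 <= 0xDF) return cont(i + 1) ? 2 : 0;
    if (b0 == 0xE0) return cont(i + 1, 0xA0, 0xBF) && cont(i + 2) ? 3 : 0;
    if (b0 == 0xED) return cont(i + 1, 0x80, 0x9F) && cont(i + 2) ? 3 : 0;
    if (b0 >= 0xE1 && b0 <= 0xEF) return cont(i + 1) && cont(i + 2) ? 3 : 0;
    if (b0 == 0xF0) return cont(i + 1, 0x90, 0xBF) && cont(i + 2) && cont(i + 3) ? 4 : 0;
    if (b0 >= 0xF1 && b0 <= 0xF3) return cont(i + 1) && cont(i + 2) && cont(i + 3) ? 4 : 0;
    if (b0 == 0xF4) return cont(i + 1, 0x80, 0x8F) && cont(i + 2) && cont(i + 3) ? 4 : 0;
    return 0;  // stray continuation byte or invalid lead byte
}

// Builds a quoted, escaped representation E of S (simplified; see the assumptions above).
std::string escape_string(std::string_view s) {
    std::string e = "\"";
    // Appends prefix{hex}, using the shortest lower-case hexadecimal form of v.
    auto hex = [&](const char* prefix, unsigned v) {
        char buf[16];
        std::snprintf(buf, sizeof buf, "%s{%x}", prefix, v);
        e += buf;
    };
    for (std::size_t i = 0; i < s.size();) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t n = well_formed_length(s, i);
        if (n == 0) { hex("\\x", c); ++i; continue; }              // ill-formed code unit
        if (n > 1) { e.append(s.data() + i, n); i += n; continue; } // non-ASCII: copied as is
        switch (c) {
            case '\t': e += "\\t"; break;
            case '\n': e += "\\n"; break;
            case '\r': e += "\\r"; break;
            case '"':  e += "\\\""; break;
            case '\\': e += "\\\\"; break;
            default:
                if (c < 0x20 || c == 0x7F) hex("\\u", c);  // remaining ASCII controls
                else e += static_cast<char>(c);
        }
        ++i;
    }
    e += '"';
    return e;
}
```

With the s4 input from the examples, `escape_string(std::string("\0 \n \t \x02 \x1b", 9))` yields `"\u{0} \n \t \u{2} \u{1b}"`, quotation marks included. For the s5 input `"\xc3\x28"`, this sketch treats only the lead byte as ill-formed and yields `"\x{c3}("`, because the byte 0x28 is itself a well-formed code unit sequence.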

Received on 2022-05-11 16:56:06