ISOCPP sg16 List: Re: Suggested wording change for non-Unicode cases in P2286R7: Formatting Ranges

From: Barry Revzin <barry.revzin_at_[hidden]>
Date: Mon, 9 May 2022 18:34:53 -0500

On Mon, May 9, 2022 at 4:14 PM Tom Honermann <tom_at_[hidden]> wrote:

> On 5/8/22 4:04 PM, Barry Revzin via SG16 wrote:
>
>
>
> On Sun, May 8, 2022 at 9:22 AM Victor Zverovich <
> victor.zverovich_at_[hidden]> wrote:
>
>> > One thing I noticed is that the wording about Grapheme_Extend is gone.
>> I didn't know what this meant before, so I don't know now if this is a good
>> removal or a bad removal.
>>
>> I don't recall any requests for removing it and think that it should be
>> reintroduced.
>>
>> - Victor
>>
>> On Wed, May 4, 2022 at 10:44 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>>
>>> On 05/05/2022 04.08, Barry Revzin wrote:
>>> > I think I have applied this. Here's the rendered version:
>>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
>>>
>>> > How does this look?
>>>
>>> p2.2
>>>
>>> For each code sequence X in S that either encodes a single character or
>>> encoding state transition or that is a sequence of ill-formed code units is
>>> processed in order as follows:
>>>
>>> That feels like bad English grammar to me.
>>>
>>> Why "encoding", yet there is an "encodes" before that?
>>> Why "either" and there are three things that don't
>>> exactly correspond grammatically?
>>>
>>> Maybe make a bulleted sub-list with the three items
>>> so that the structure is clear.
>>>
>>> "If C is one of the UCS scalar values the table below,"
>>>
>>> add "in"
>>>
>>> better clarify: "the two characters shown as the
>>> corresponding escape sequence are appended to E"
>>>
>>>
>>> after p2.3.4, p2.5
>>>
>>> "simple-hexadecimal-digit-sequence"
>>>
>>> I would not re-use lexing grammar for a local placeholder,
>>> just say \u{/hex-digit-sequence/} or so.
>>>
>>>
>>> p2.5
>>>
>>> "Otherwise, X is a sequence of ill-formed code units. Each"
>>>
>>> -> "Otherwise (X is a sequence of ill-formed code units), each code unit
>>> ..."
>>>
>>>
>>> "U+0027 APOSTROPHE is escaped as \' while U+0022 QUOTATION MARK is left
>>> unchanged."
>>>
>>> Can we rephrase that to avoid "is escaped as"? We were on such a good
>>> track to just append characters and avoid any judgment calls.
>>>
>>> suggestion "
>>> - for each character U+0027 APOSTROPHE in S, the two characters \' are
>>> appended to E
>>> - U+0022 QUOTATION MARK is left unchanged"
>>>
>>>
>>> Jens
>>>
>>
> Thanks Jens and Victor! I did my best to apply the suggested changes:
>
>
> - Updated rendered wording:
> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
> - New diff:
> https://github.com/brevzin/cpp_proposals/commit/3d93043f5c296810d7e18b11d5b7083143554309
>
>
> Hopefully, this gradient is slowly descending to the correct solution :-)
>
> Thanks, Barry. This appears to have incorporated the parts of my prior
> suggestions that did not have opposition, so just minor issues noted below.
>
> Discussion at the last meeting
> <https://github.com/sg16-unicode/sg16-meetings#april-27th-2022> revealed
> that we're failing to specify the encoding used to interpret *S*. Change
> p2 as follows: (perhaps substitute "as described below" for "as follows")
>
> The escaped string *E* representation of a string *S* is constructed by
> encoding a sequence of characters as follows. The associated character
> encoding *CE* for charT ([lex.string.literal]
> <http://eel.is/c++draft/tab:lex.string.literal>) is used both to
> interpret *S* and to construct *E*.
>
> In p2.2, "code sequence" -> "code unit sequence".
>
> In p2.3.4 and p2.5, I don't think we should re-use the *hexadecimal-digit*
> grammar term here. Just say, "hexadecimal digits".
>
> Add the following note to p2.4 to address a request by Hubert:
>
> Otherwise, if *X* encodes a state transition, the effect on *E* is
> unspecified.* [ Note: the intent is that a state transition be
> represented in E such that its original code unit sequence can be
> reconstructed - end note ]*
>
> Hubert pointed out during the last meeting that we should not be trying to
> interpret state transitions for stateful encodings as I had previously been
> trying to do. I think we can now simplify p2.5:
>
> Otherwise (*X* is a sequence of ill-formed code units), each code unit *U*
> is appended to *E* in order as the sequence *\x{hex-digit-sequence}*,
> where *hex-digit-sequence* is the shortest hexadecimal representation of
> *U* using lower-case hexadecimal digits. When encoding a stateful
> character encoding, these additions should have no effect on encoding state.
>
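
For illustration only (this is not proposed wording): a minimal sketch of
the simplified p2.5 rule quoted above, assuming 8-bit code units held in a
std::string. The helper names escape_code_unit and escape_ill_formed are
hypothetical and do not appear in P2286; encoding-state handling is not
modeled.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: render one code unit U as \x{hex-digit-sequence},
// using the shortest lower-case hexadecimal representation of U.
std::string escape_code_unit(std::uint32_t u) {
    static constexpr char digits[] = "0123456789abcdef";
    std::string hex;
    do {
        // Prepend the least significant nibble, then shift it out.
        hex.insert(hex.begin(), digits[u & 0xF]);
        u >>= 4;
    } while (u != 0);
    return "\\x{" + hex + "}";
}

// Hypothetical helper: X is assumed to be a sequence of ill-formed code
// units; each unit is appended to E in order.
std::string escape_ill_formed(const std::string& x) {
    std::string e;
    for (unsigned char u : x) {
        e += escape_code_unit(u);
    }
    return e;
}
```

Under these assumptions, the two code units 0xff 0xfe would be appended as
\x{ff}\x{fe}, and a code unit 0x0a as \x{a} (no leading zero).
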
> In p3, we now need to drop "in a Unicode encoding". I think the result
> should also produce a string, not a character.
>
> The escaped string representation of a character *C* is equivalent to the
> escaped string representation of a string of *C*, except that:
>
> p4 should be removed now.
>
> The escaped character and escaped string representations of a character or
> string in a non-Unicode encoding is unspecified.
>
> Hubert, the wording does not explicitly address your request to be able to
> specify spacing and separator characters as a set of encoding agnostic code
> point values. I think the existing wording suffices to meet your goals
> since an implementation can document a method of identifying the set of
> escaped characters by, for example, specifying characters in EBCDIC 1047
> and describing how to map those to other code pages. If you don't agree,
> could you suggest how the wording might be updated to better address your
> concern?
>
> Tom.
>

Thanks, Tom! I applied these changes. The diff can be found here:
https://github.com/brevzin/cpp_proposals/commit/6745d72f8c002b7ce8811f0c6aeb5591cff97d54

Barry

Received on 2022-05-09 23:35:06