ISOCPP sg16 List: Re: Suggested wording change for non-Unicode cases in P2286R7: Formatting Ranges

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Mon, 16 May 2022 13:16:53 -0400

On Mon, May 16, 2022 at 1:11 PM Tom Honermann <tom_at_[hidden]> wrote:

> On 5/14/22 9:11 PM, Hubert Tong wrote:
>
> On Sat, May 14, 2022 at 6:08 PM Tom Honermann <tom_at_[hidden]> wrote:
>
>> On 5/14/22 8:17 AM, Corentin Jabot wrote:
>>
>> Hey.
>> Thanks for the work Barry.
>>
>> I'm still concerned: how long are we going to keep using the term
>> character incorrectly and in contexts in which its meaning is ambiguous?
>>
>> Chair hat on: We did discuss this usage during the last telecon
>> <https://github.com/sg16-unicode/sg16-meetings#may-11th-2022> and
>> consensus was for this direction though I have no doubt that stronger
>> consensus could be found with adoption of new terms.
>>
>> Chair hat off ...
>>
>> I don't agree that this wording uses "character" incorrectly, but I do
>> agree that the use here is as ambiguous as usage elsewhere throughout the
>> standard.
>>
>> If we want to clean up our use of "character" (and I think we would all
>> like us to), then I think we need a paper that analyzes how it is currently
>> used and how many terms are needed to replace it. We could then identify
>> terms to fit to those uses. Unfortunately, such terms will likely have to
>> be distinct from what ISO/IEC 10646 provides since many of those terms are
>> defined in Unicode specific terms.
>>
>> Do we have precedent for the use of the term state-transition? (It's not
>> an industry term to the best of my knowledge.)
>>
>> I'm not aware of any other uses of this term in the standard. I'll defer
>> to Hubert whether "state-transition" is an acceptable term of art or
>> whether there is another term that would be preferred.
>>
>
> The preferred term of art would be "shift sequence"; however, instead of
> saying "encodes a shift sequence", we should probably say "is a shift
> sequence".
>
> Ok, thanks, Hubert. Here are the changes I think are then desired (based
> on https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html
> which I think is still the most recent revision).
>
LGTM. Thanks!

> In [format.string.escaped]p2.2:
>
> For each code unit sequence *X* in *S* that either encodes a single
> character, encodes a state transition*is a shift sequence*, or is a
> sequence of ill-formed code units, processing is in order as follows:
>
> In [format.string.escaped]p2.4:
>
> Otherwise, if *X* encodes a state transition*is a shift sequence*, the
> effect on *E* and further decoding of *S* is unspecified.
>
> *Recommended Practice*: a state transition*shift sequence* should be
> represented in *E* such that the original code unit sequence of *S* can
> be reconstructed.
>
> Barry, I know I had said we were done, but ... are you ok making these
> changes? The LWG chairs should of course be made aware of the additional
> changes so they can decide if they want LWG to re-re-review again.
>
> Tom.
>
>
>> In all, I'm afraid I had a preference for the original "unspecified"
>> wording, as it's now still unspecified in practice (there is a
>> recommended practice without implementation experience, which doesn't seem
>> to be much better), and it's using terms that are both imprecise and at the
>> same time force implementers' hands toward undesirable implementations.
>>
>> The recommended practice is only applicable to implementors that support
>> stateful encodings and was requested by the one participating implementor
>> that is most likely to be impacted by such encodings. I don't see anyone's
>> hands being forced. Note that the entire relevant paragraph is:
>>
>> - Otherwise, if *X* encodes a state transition, the effect on *E* and
>> further decoding of *S* is unspecified.
>> * Recommended Practice*: a state transition should be represented in *E*
>> such that the original code unit sequence of *S* can be reconstructed.
>>
>> i.e., it is not clear to me that preserving shift state in the
>> escaped string is a requirement or something implementers will want to do
>> in all cases, and in particular, I would expect an escaped string to be
>> the same regardless of the encoding in a high-quality implementation.
>>
> In a case where escaped strings are "the same" regardless of encoding,
> some input strings that are encoded differently from each other can, in
> stateful encodings, map to the same escaped string. There simply
> are nuances to the input string apart from the sequence of coded characters.
>
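To make the information loss concrete, here is a minimal sketch. It assumes
an ISO-2022-JP-like stateful encoding and recognizes only two of its shift
sequences (ESC ( B selects ASCII, ESC $ B selects JIS X 0208). The names
kShiftSequences and drop_shift_sequences are hypothetical and are not taken
from the paper; the sketch only shows that an escaping scheme which discards
shift sequences cannot distinguish inputs that differ solely in redundant
shifts.

```cpp
#include <string>
#include <string_view>

// Assumption: only these two ISO-2022-JP shift sequences are recognized.
// ESC ( B selects ASCII; ESC $ B selects JIS X 0208.
constexpr std::string_view kShiftSequences[] = {"\x1b(B", "\x1b$B"};

// Hypothetical helper: copies s to the result, dropping every recognized
// shift sequence and keeping all other code units unchanged.
std::string drop_shift_sequences(std::string_view s) {
    std::string out;
    while (!s.empty()) {
        bool matched = false;
        for (std::string_view seq : kShiftSequences) {
            if (s.substr(0, seq.size()) == seq) {
                s.remove_prefix(seq.size());  // skip the shift sequence
                matched = true;
                break;
            }
        }
        if (!matched) {
            out += s.front();                 // ordinary code unit
            s.remove_prefix(1);
        }
    }
    return out;
}
```

With this helper, the inputs "AB" and "A" ESC ( B "B" both produce "AB", so
the original code unit sequence cannot be reconstructed from the result.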
> I understand though if the perceived problem is that there is a trade-off
> between "human readability" and "accuracy for debugging purposes" that the
> design does not acknowledge (we only have the one escaping mechanism being
> introduced, and perhaps for both intents).
>
>> While I tend to agree with your characterization of a high-quality
>> implementation, "characters" that contribute solely to change in state are
>> particular to stateful encodings, so not generally applicable. If we didn't
>> specify weaker requirements for them, then I would expect them to fall into
>> the implementation-defined set of non-printable characters and be rendered
>> as \u{xx} sequences.
>>
> If we didn't specify weaker requirements for them, then we'd also
> introduce (non-escaped) shift sequences into the escaped string, which then
> makes encoding extra shift sequences or shift sequences omitted at the end
> of the string quite unworkable. Keep in mind that shift sequences can be
> multiple code units and attempts to interpret them as characters may
> involve characters in the initial shift state: trying to emit them as
> escaped characters would then possibly cause extra shift sequences in both
> the encoding of the escaped string and in attempts to translate the escaped
> string as a string literal.
>
>> Tom.
>>
>>
>> (I understand that LWG already decided on that (sorry for not following),
>> so it might land on my pile of NB comments.)
>>
>> Thanks,
>>
>> Corentin
>>
>> On Sat, May 14, 2022 at 4:48 AM Hubert Tong via SG16 <
>> sg16_at_[hidden]> wrote:
>>
>>> On Fri, May 13, 2022 at 8:55 PM Tom Honermann via SG16 <
>>> sg16_at_[hidden]> wrote:
>>>
>>>> Thanks for the update, Barry. No concerns from me!
>>>>
>>>
>>> Thanks for the heads up, Barry. Looks okay to me too.
>>>
>>>
>>>>
>>>> Tom.
>>>>
>>>> On May 13, 2022, at 8:04 PM, Barry Revzin <barry.revzin_at_[hidden]>
>>>> wrote:
>>>>
>>>>
>>>> Thank you for making all these iterations!
>>>>
>>>> LWG re-affirmed this paper today, making one change. The wording you
>>>> all provided me had a note:
>>>>
>>>> [ *Note*: the intent is that a state transition be represented in *E*
>>>> such that the original code unit sequence of *S* can be reconstructed
>>>> - *end note* ]
>>>>
>>>> which LWG wanted to elevate into recommended practice:
>>>>
>>>> *Recommended Practice*: a state transition should be represented in
>>>> *E* such that the original code unit sequence of *S* can be
>>>> reconstructed.
>>>>
>>>> Same words, just slightly more intentional about the intent. I hope
>>>> that's okay with everybody. (diff:
>>>> https://github.com/brevzin/cpp_proposals/commit/fc263d0be55e189a6f98996a7cb06f2f87f82bfd,
>>>> rendered:
>>>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_21
>>>> )
>>>>
>>>> Thanks again,
>>>>
>>>> Barry
>>>>
>>>> On Thu, May 12, 2022 at 8:56 AM Tom Honermann <tom_at_[hidden]>
>>>> wrote:
>>>>
>>>>> Ship it!
>>>>>
>>>>> Thank you for sticking with us through all these iterations!
>>>>>
>>>>> Tom.
>>>>> On 5/11/22 9:44 PM, Barry Revzin wrote:
>>>>>
>>>>> Done!
>>>>>
>>>>> Barry "Ship it?" Revzin
>>>>>
>>>>> On Wed, May 11, 2022 at 3:36 PM Tom Honermann <tom_at_[hidden]>
>>>>> wrote:
>>>>>
>>>>>> Hi, Barry. We discussed in today's SG16 meeting and identified one
>>>>>> last minor change to make. We then polled forwarding the paper to LWG with
>>>>>> unanimous consent so this is definitely the last change!
>>>>>>
>>>>>> In 2.3.1, substitute "character" for "UCS scalar value" in the first
>>>>>> sentence and in the table header.
>>>>>>
>>>>>> If *C* is one of the UCS scalar values*characters* in the table
>>>>>> below, then the two characters shown as the corresponding escape sequence
>>>>>> are appended to *E*:
>>>>>> UCS scalar value*character*
>>>>>> escape sequence
>>>>>> U+0009 CHARACTER TABULATION \t
>>>>>> U+000A LINE FEED \n
>>>>>> U+000D CARRIAGE RETURN \r
>>>>>> U+0022 QUOTATION MARK \"
>>>>>> U+005C REVERSE SOLIDUS \\
>>>>>>
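The table lookup described above can be sketched as follows. This is only an
illustration of that single step, not the full escaping algorithm: it ignores
non-printable characters, shift sequences, and ill-formed code units. The
names table_escape and escape_with_table are hypothetical, and the sketch
assumes the string is in the ordinary literal encoding so that the character
literals below denote the characters listed in the table.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: returns the two-character escape sequence for the
// five characters in the table, or an empty view for any other character.
constexpr std::string_view table_escape(char c) {
    switch (c) {
        case '\t': return "\\t";   // U+0009 CHARACTER TABULATION
        case '\n': return "\\n";   // U+000A LINE FEED
        case '\r': return "\\r";   // U+000D CARRIAGE RETURN
        case '"':  return "\\\"";  // U+0022 QUOTATION MARK
        case '\\': return "\\\\";  // U+005C REVERSE SOLIDUS
        default:   return {};
    }
}

// Hypothetical helper: applies only the table step to every character of s.
std::string escape_with_table(std::string_view s) {
    std::string e;
    for (char c : s) {
        if (auto esc = table_escape(c); !esc.empty()) {
            e += esc;  // append the two-character escape sequence
        } else {
            e += c;    // all other characters are copied unchanged here
        }
    }
    return e;
}
```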
>>>>>> Tom.
>>>>>> On 5/11/22 12:56 PM, Tom Honermann via SG16 wrote:
>>>>>>
>>>>>> I have a weak preference for "character" given that the wording is
>>>>>> intended to address Unicode and non-Unicode behavior. I don't think we have
>>>>>> any normative uses of "code point" at present.
>>>>>>
>>>>>> The definition of "code point" we have via our normative reference to
>>>>>> ISO/IEC 10646 is: "value in the UCS codespace". That doesn't really work
>>>>>> for the non-Unicode case and, regardless, would include surrogate code
>>>>>> points which I don't think we want in this context.
>>>>>>
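The distinction between "code point" and "scalar value" mentioned above can
be written down directly. A minimal sketch with hypothetical function names:
every value in the UCS codespace is a code point, while the scalar values
exclude the surrogate range.

```cpp
// A code point is any value in the UCS codespace, 0 through 0x10FFFF.
constexpr bool is_code_point(char32_t c) { return c <= 0x10FFFF; }

// Surrogate code points occupy U+D800 through U+DFFF.
constexpr bool is_surrogate(char32_t c) { return c >= 0xD800 && c <= 0xDFFF; }

// A scalar value is any code point that is not a surrogate.
constexpr bool is_scalar_value(char32_t c) {
    return is_code_point(c) && !is_surrogate(c);
}

static_assert(is_code_point(0xD800) && !is_scalar_value(0xD800));
static_assert(is_scalar_value(U'A'));
```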
>>>>>> Tom.
>>>>>> On 5/11/22 12:24 PM, Victor Zverovich via SG16 wrote:
>>>>>>
>>>>>> Thanks Tom and others for revising the wording. The latest version of
>>>>>> the escaping section looks good to me with only one minor question: is it
>>>>>> clear that "character" in
>>>>>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_14
>>>>>> means a code point or shall we use the term code point instead?
>>>>>>
>>>>>> Cheers,
>>>>>> Victor
>>>>>>
>>>>>> On Tue, May 10, 2022 at 6:32 PM Barry Revzin <barry.revzin_at_[hidden]>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, May 10, 2022 at 1:31 PM Tom Honermann <tom_at_[hidden]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> On 5/9/22 7:34 PM, Barry Revzin wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, May 9, 2022 at 4:14 PM Tom Honermann <tom_at_[hidden]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> On 5/8/22 4:04 PM, Barry Revzin via SG16 wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sun, May 8, 2022 at 9:22 AM Victor Zverovich <
>>>>>>>>> victor.zverovich_at_[hidden]> wrote:
>>>>>>>>>
>>>>>>>>>> > One thing I noticed is that the wording about Grapheme_Extend
>>>>>>>>>> is gone. I didn't know what this meant before, so I don't know now if this
>>>>>>>>>> is a good removal or a bad removal.
>>>>>>>>>>
>>>>>>>>>> I don't recall any requests for removing it and think that it
>>>>>>>>>> should be reintroduced.
>>>>>>>>>>
>>>>>>>>>> - Victor
>>>>>>>>>>
>>>>>>>>>> On Wed, May 4, 2022 at 10:44 PM Jens Maurer <Jens.Maurer_at_[hidden]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> On 05/05/2022 04.08, Barry Revzin wrote:
>>>>>>>>>>> > I think I have applied this. Here's the rendered version:
>>>>>>>>>>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
>>>>>>>>>>> <
>>>>>>>>>>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
>>>>>>>>>>> >
>>>>>>>>>>>
>>>>>>>>>>> > How does this look?
>>>>>>>>>>>
>>>>>>>>>>> p2.2
>>>>>>>>>>>
>>>>>>>>>>> For each code sequence X in S that either encodes a single
>>>>>>>>>>> character or encoding state transition or that is a sequence of ill-formed
>>>>>>>>>>> code units is processed in order as follows:
>>>>>>>>>>>
>>>>>>>>>>> That feels like bad English grammar to me.
>>>>>>>>>>>
>>>>>>>>>>> Why "encoding", yet there is an "encodes" before that?
>>>>>>>>>>> Why "either" and there are three things that don't
>>>>>>>>>>> exactly correspond grammatically?
>>>>>>>>>>>
>>>>>>>>>>> Maybe make a bulleted sub-list with the three items
>>>>>>>>>>> so that the structure is clear.
>>>>>>>>>>>
>>>>>>>>>>> "If C is one of the UCS scalar values the table below,"
>>>>>>>>>>>
>>>>>>>>>>> add "in"
>>>>>>>>>>>
>>>>>>>>>>> better clarify: "the two characters shown as the
>>>>>>>>>>> corresponding escape sequence are appended to E"
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> after p2.3.4, p2.5
>>>>>>>>>>>
>>>>>>>>>>> "simple-hexadecimal-digit-sequence"
>>>>>>>>>>>
>>>>>>>>>>> I would not re-use lexing grammar for a local placeholder,
>>>>>>>>>>> just say \u{/hex-digit-sequence/} or so.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> p2.5
>>>>>>>>>>>
>>>>>>>>>>> "Otherwise, X is a sequence of ill-formed code units. Each"
>>>>>>>>>>>
>>>>>>>>>>> -> "Otherwise (X is a sequence of ill-formed code units), each
>>>>>>>>>>> code unit ..."
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> "U+0027 APOSTROPHE is escaped as \' while U+0022 QUOTATION MARK
>>>>>>>>>>> is left unchanged."
>>>>>>>>>>>
>>>>>>>>>>> Can we rephrase that to avoid "is escaped as"? We were on such
>>>>>>>>>>> a good
>>>>>>>>>>> track to just append characters and avoid any judgment calls.
>>>>>>>>>>>
>>>>>>>>>>> suggestion "
>>>>>>>>>>> - for each character U+0027 APOSTROPHE in S, the two characters
>>>>>>>>>>> \' are appended to E
>>>>>>>>>>> - U+0022 QUOTATION MARK is left unchanged"
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Jens
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> Thanks Jens and Victor! I did my best to apply the suggested
>>>>>>>>> changes:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> - Updated rendered wording:
>>>>>>>>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
>>>>>>>>> - New diff:
>>>>>>>>> https://github.com/brevzin/cpp_proposals/commit/3d93043f5c296810d7e18b11d5b7083143554309
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hopefully, this gradient is slowly descending to the correct
>>>>>>>>> solution :-)
>>>>>>>>>
>>>>>>>>> Thanks, Barry. This appears to have incorporated the parts of my
>>>>>>>>> prior suggestions that did not have opposition, so just minor issues noted
>>>>>>>>> below.
>>>>>>>>>
>>>>>>>>> Discussion at the last meeting
>>>>>>>>> <https://github.com/sg16-unicode/sg16-meetings#april-27th-2022>
>>>>>>>>> revealed that we're failing to specify the encoding used to interpret
>>>>>>>>> *S*. Change p2 as follows: (perhaps substitute "as described
>>>>>>>>> below" for "as follows")
>>>>>>>>>
>>>>>>>>> The escaped string *E* representation of a string *S* is
>>>>>>>>> constructed by encoding a sequence of characters *as follows.* in
>>>>>>>>> t*T*he associated character encoding *CE* for charT (
>>>>>>>>> [lex.string.literal]
>>>>>>>>> <http://eel.is/c++draft/tab:lex.string.literal>) as follows:* is
>>>>>>>>> used both to interpret S and to construct E.*
>>>>>>>>>
>>>>>>>>> In p2.2, "code sequence" -> "code unit sequence".
>>>>>>>>>
>>>>>>>>> In p2.3.4 and p2.5, I don't think we should re-use the
>>>>>>>>> *hexadecimal-digit* grammar term here. Just say, "hexadecimal
>>>>>>>>> digits".
>>>>>>>>>
>>>>>>>>> Add the following note to p2.4 to address a request by Hubert:
>>>>>>>>>
>>>>>>>>> Otherwise, if *X* encodes a state transition, the effect on *E*
>>>>>>>>> is unspecified.* [ Note: the intent is that a state transition be
>>>>>>>>> represented in E such that its original code unit sequence can be
>>>>>>>>> reconstructed - end note ]*
>>>>>>>>>
>>>>>>>>> Hubert pointed out during the last meeting that we should not be
>>>>>>>>> trying to interpret state transitions for stateful encodings as I had
>>>>>>>>> previously been trying to do. I think we can now simplify p2.5:
>>>>>>>>>
>>>>>>>>> Otherwise (*X* is a sequence of ill-formed code units), each code
>>>>>>>>> unit *U* is appended to *E* in order as the sequence
>>>>>>>>> *\x{hex-digit-sequence}*, where *hex-digit-sequence* is the
>>>>>>>>> shortest hexadecimal representation of *U* using lower-case
>>>>>>>>> hexadecimal digits. When encoding a stateful character encoding,
>>>>>>>>> these additions should have no effect on encoding state.
>>>>>>>>>
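The formatting of a single ill-formed code unit described in p2.5 can be
sketched like this. The function name append_hex_escape is hypothetical, and
the sketch assumes 8-bit code units; deciding which code units of *S* are
ill-formed is a separate step that is not shown.

```cpp
#include <string>

// Hypothetical helper: appends \x{hex-digit-sequence} for one code unit u,
// using the shortest lower-case hexadecimal representation of u.
void append_hex_escape(std::string& e, unsigned char u) {
    static constexpr char digits[] = "0123456789abcdef";
    e += "\\x{";
    if (u >= 0x10) {
        e += digits[u >> 4];  // high nibble, omitted when it would be a leading zero
    }
    e += digits[u & 0xF];     // low nibble, always present
    e += '}';
}
```

For example, the code unit 0xc3 is appended as \x{c3} and the code unit 0x0a
as \x{a}.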
>>>>>>>>> In p3, we now need to drop "in a Unicode encoding". I think the
>>>>>>>>> result should also produce a string, not a character.
>>>>>>>>>
>>>>>>>>> The escaped character*string* representation of a character *C* in
>>>>>>>>> a Unicode encoding is equivalent to the escaped string
>>>>>>>>> representation of a string of *C*, except that:
>>>>>>>>>
>>>>>>>>> p4 should be removed now.
>>>>>>>>>
>>>>>>>>> The escaped character and escaped string representations of a
>>>>>>>>> character or string in a non-Unicode encoding is unspecified.
>>>>>>>>>
>>>>>>>>> Hubert, the wording does not explicitly address your request to be
>>>>>>>>> able to specify spacing and separator characters as a set of encoding
>>>>>>>>> agnostic code point values. I think the existing wording suffices to meet
>>>>>>>>> your goals since an implementation can document a method of identifying the
>>>>>>>>> set of escaped characters by, for example, specifying characters in EBCDIC
>>>>>>>>> 1047 and describing how to map those to other code pages. If you don't
>>>>>>>>> agree, could you suggest how the wording might be updated to better address
>>>>>>>>> your concern?
>>>>>>>>>
>>>>>>>>> Tom.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks, Tom! I applied these changes. The diff can be found here:
>>>>>>>> https://github.com/brevzin/cpp_proposals/commit/6745d72f8c002b7ce8811f0c6aeb5591cff97d54
>>>>>>>>
>>>>>>>> Thanks, Barry. This looks good to me modulo Hubert's additional
>>>>>>>> tweak.
>>>>>>>>
>>>>>>>> One last thing I noticed. The example section has this:
>>>>>>>>
>>>>>>>> string s4 = format("[{:?}]", string("\0 \n \t \x02 \x1b", 9));
>>>>>>>> // s4 has value: ["\u{0} \n \t \u{2} \u{1b}"]
>>>>>>>>
>>>>>>>> That example depends on the encoding being ASCII-based in order for
>>>>>>>> the \x02 and \x1b escapes to be interpreted as characters \u{2}
>>>>>>>> and \u{1b}. Similarly, s5 and s6 have UTF-8 dependencies. Perhaps
>>>>>>>> we should add a comment?
>>>>>>>>
>>>>>>>> string s0 = format("[{}]", "h\tllo");
>>>>>>>> // s0 has value: [h    llo]
>>>>>>>> string s1 = format("[{:?}]", "h\tllo");
>>>>>>>> // s1 has value: ["h\tllo"]
>>>>>>>> string s2 = format("[{:?}]", "Спасибо, Виктор ♥!");
>>>>>>>> // s2 has value: ["Спасибо, Виктор ♥!"]
>>>>>>>> string s3 = format("[{:?}, {:?}]", '\'', '"');
>>>>>>>> // s3 has value: ['\'', '"']
>>>>>>>> *// The following examples assume use of the UTF-8 encoding.*
>>>>>>>> string s4 = format("[{:?}]", string("\0 \n \t \x02 \x1b", 9));
>>>>>>>> // s4 has value: ["\u{0} \n \t \u{2} \u{1b}"]
>>>>>>>> string s5 = format("[{:?}]", "\xc3\x28");
>>>>>>>> // invalid UTF-8; s5 has value: ["\x{c3}\x{28}"]
>>>>>>>> string s6 = format("[{:?}]", "🤷🏻‍♂️");
>>>>>>>> // s6 has value: ["🤷🏻\u{200d}♂\u{fe0f}"]
>>>>>>>>
>>>>>>>> I never got around to translating "Спасибо, Виктор ♥!" until now.
>>>>>>>> Very nice :)
>>>>>>>>
>>>>>>>> Tom
>>>>>>>>
>>>>>>>
>>>>>>> Applied Hubert's change and added this comment:
>>>>>>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
>>>>>>> Thanks!
>>>>>>>
>>>>>>> The decreasing rate of requested changes is encouraging!
>>>>>>>
>>>>>>> Barry
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>> SG16 mailing list
>>>> SG16_at_[hidden]
>>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>>
>>>
>>

Received on 2022-05-16 17:17:22