ISOCPP sg16 List: Re: Suggested wording change for non-Unicode cases in P2286R7: Formatting Ranges

From: Tom Honermann <tom_at_[hidden]>
Date: Mon, 16 May 2022 13:11:00 -0400

On 5/14/22 9:11 PM, Hubert Tong wrote:
> On Sat, May 14, 2022 at 6:08 PM Tom Honermann <tom_at_[hidden]> wrote:
>
> On 5/14/22 8:17 AM, Corentin Jabot wrote:
>> Hey.
>> Thanks for the work Barry.
>>
>> I'm still concerned: how long are we going to keep using the term
>> "character" incorrectly and in contexts in which its meaning is
>> ambiguous?
>
> Chair hat on: We did discuss this usage during the last telecon
> <https://github.com/sg16-unicode/sg16-meetings#may-11th-2022> and
> consensus was for this direction though I have no doubt that
> stronger consensus could be found with adoption of new terms.
>
> Chair hat off ...
>
> I don't agree that this wording uses "character" incorrectly, but
> I do agree that the use here is as ambiguous as usage elsewhere
> throughout the standard.
>
> If we want to clean up our use of "character" (and I think we
> would all like us to), then I think we need a paper that analyzes
> how it is currently used and how many terms are needed to replace
> it. We could then identify terms to fit to those uses.
> Unfortunately, such terms will likely have to be distinct from
> what ISO/IEC 10646 provides since many of those terms are defined
> in Unicode specific terms.
>
>> Do we have precedent for the use of the term state-transition?
>> (it's not an industry term to the best of my knowledge).
> I'm not aware of any other uses of this term in the standard. I'll
> defer to Hubert whether "state-transition" is an acceptable term
> of art or whether there is another term that would be preferred.
>
>
> The preferred term of art would be "shift sequence"; however, instead
> of saying "encodes a shift sequence", we should probably say "is a
> shift sequence".

Ok, thanks, Hubert. Here are the changes I think are then desired (based
on https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html
which I think is still the most recent revision).

In [format.string.escaped]p2.2:

    For each code unit sequence /X/ in /S/ that either encodes a single
    character, [-encodes a state transition-] _is a shift sequence_, or is a
    sequence of ill-formed code units, processing is in order as follows:

In [format.string.escaped]p2.4:

    Otherwise, if /X/ [-encodes a state transition-] _is a shift sequence_,
    the effect on /E/ and further decoding of /S/ is unspecified.

    /Recommended Practice/: a [-state transition-] _shift sequence_ should be
    represented in /E/ such that the original code unit sequence of /S/
    can be reconstructed.
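
To make the recommended practice concrete, here is a minimal sketch of one possible way to represent shift sequences so that the original code units can be reconstructed. It is not proposed wording and does not describe any known implementation. It assumes a toy ISO-2022-JP-like encoding with only two shift sequences (ESC $ B and ESC ( B), it emits every code unit of a shift sequence, and every code unit seen while outside the initial shift state, as \x{...}, and the names is_shift_sequence, append_code_unit, escape_sketch, and unescape_sketch are hypothetical.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical helper: true if one of the two shift sequences of the toy
// encoding (ESC $ B or ESC ( B) starts at position i.
bool is_shift_sequence(std::string_view s, std::size_t i) {
    std::string_view t = s.substr(i, 3);
    return t == "\x1b$B" || t == "\x1b(B";
}

// Appends one code unit as \x{hex} using lower-case hexadecimal digits.
void append_code_unit(std::string& out, unsigned char u) {
    char buf[16];
    std::snprintf(buf, sizeof buf, "\\x{%x}", static_cast<unsigned>(u));
    out += buf;
}

// Sketch of an escaping routine for the toy encoding (an assumption, not
// the behavior required by the wording, which leaves this unspecified).
std::string escape_sketch(std::string_view s) {
    std::string e = "\"";
    bool shifted = false;  // true while outside the initial shift state
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_shift_sequence(s, i)) {
            shifted = (s[i + 1] == '$');
            for (std::size_t k = 0; k < 3; ++k)
                append_code_unit(e, static_cast<unsigned char>(s[i + k]));
            i += 2;  // the loop increment skips the third code unit
            continue;
        }
        if (shifted) {  // simplification: keep E in the initial shift state
            append_code_unit(e, static_cast<unsigned char>(s[i]));
            continue;
        }
        switch (s[i]) {
            case '\t': e += "\\t"; break;
            case '\n': e += "\\n"; break;
            case '\r': e += "\\r"; break;
            case '"':  e += "\\\""; break;
            case '\\': e += "\\\\"; break;
            default:   e += s[i]; break;
        }
    }
    e += '"';
    return e;
}

// Inverse of escape_sketch for strings it produced: recovers the original
// code unit sequence, shift sequences included.
std::string unescape_sketch(std::string_view e) {
    std::string s;
    // Start at 1 and stop early to skip the surrounding quotation marks.
    for (std::size_t i = 1; i + 1 < e.size(); ++i) {
        if (e[i] != '\\') { s += e[i]; continue; }
        char c = e[++i];
        if (c == 'x') {
            std::size_t close = e.find('}', i);
            std::string digits(e.substr(i + 2, close - i - 2));
            s += static_cast<char>(std::stoi(digits, nullptr, 16));
            i = close;
        } else if (c == 't') { s += '\t'; }
        else if (c == 'n') { s += '\n'; }
        else if (c == 'r') { s += '\r'; }
        else { s += c; }  // an escaped quotation mark or reverse solidus
    }
    return s;
}
```

Under these assumptions, the code units a, ESC $ B, 0x30 0x21, ESC ( B, b escape to "a\x{1b}\x{24}\x{42}\x{30}\x{21}\x{1b}\x{28}\x{42}b", and unescape_sketch maps that back to the original code unit sequence. Hex-escaping everything outside the initial shift state trades readability for exact reconstruction, which is only one of the choices the unspecified behavior permits.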

Barry, I know I had said we were done, but ... are you ok making these
changes? The LWG chairs should of course be made aware of the additional
changes so they can decide if they want LWG to re-re-review again.

Tom.

>>
>> In all, I'm afraid I had a preference for the original
>> "unspecified" wording, as it's now still unspecified in practice
>> (there is a recommended practice without implementation
>> experience, which doesn't seem to be much better), and it's using
>> terms that are both imprecise and at the same time force
>> implementers' hands toward undesirable implementations.
>
> The recommended practice is only applicable to implementors that
> support stateful encodings and was requested by the one
> participating implementor that is most likely to be impacted by
> such encodings. I don't see anyone's hands being forced. Note that
> the entire relevant paragraph is:
>
> * Otherwise, if /X/ encodes a state transition, the effect on
> /E/ and further decoding of /S/ is unspecified./
> Recommended Practice/: a state transition should be
> represented in /E/ such that the original code unit sequence
> of /S/ can be reconstructed.
>
>> i.e., it is not clear to me that preserving shift state in the
>> escaped string is a requirement or something implementers will
>> want to do in all cases, and in particular, I would expect an
>> escaped string to be the same regardless of the encoding in a
>> high quality implementation
>
> In a case where escaped strings are "the same" regardless of encoding,
> some input strings that are encoded differently from each other, can,
> in stateful encodings, otherwise map to the same escaped string. There
> simply are nuances to the input string apart from the sequence of
> coded characters.
>
> I understand though if the perceived problem is that there is a
> trade-off between "human readability" and "accuracy for debugging
> purposes" that the design does not acknowledge (we only have the one
> escaping mechanism being introduced, and perhaps for both intents).
>
> While I tend to agree with your characterization of a high-quality
> implementation, "characters" that contribute solely to change in
> state are particular to stateful encodings, so not generally
> applicable. If we didn't specify weaker requirements for them,
> then I would expect them to fall into the implementation-defined
> set of non-printable characters and be rendered as \u{xx} sequences.
>
> If we didn't specify weaker requirements for them, then we'd also
> introduce (non-escaped) shift sequences into the escaped string, which
> then makes encoding extra shift sequences or shift sequences omitted
> at the end of the string quite unworkable. Keep in mind that shift
> sequences can be multiple code units and attempts to interpret them as
> characters may involve characters in the initial shift state: trying
> to emit them as escaped characters would then possibly cause extra
> shift sequences in both the encoding of the escaped string and in
> attempts to translate the escaped string as a string literal.
>
> Tom.
>
>>
>> (I understand that LWG already decided on that (sorry for not
>> following) so, it might land on my pile of NB comments)
>>
>> Thanks,
>>
>> Corentin
>>
>> On Sat, May 14, 2022 at 4:48 AM Hubert Tong via SG16
>> <sg16_at_[hidden]> wrote:
>>
>> On Fri, May 13, 2022 at 8:55 PM Tom Honermann via SG16
>> <sg16_at_[hidden]> wrote:
>>
>> Thanks for the update, Barry. No concerns from me!
>>
>>
>> Thanks for the heads up, Barry. Looks okay to me too.
>>
>>
>> Tom.
>>
>>> On May 13, 2022, at 8:04 PM, Barry Revzin
>>> <barry.revzin_at_[hidden]> wrote:
>>>
>>>
>>> Thank you for making all these iterations!
>>>
>>> LWG re-affirmed this paper today, making one change. The
>>> wording you all provided me had a note:
>>>
>>> [ *Note*: the intent is that a state transition be
>>> represented in `$E$` such that the original code unit
>>> sequence of `$S$` can be reconstructed -*end note* ]
>>>
>>> which LWG wanted to elevate into recommended practice:
>>>
>>> *Recommended Practice*: a state transition should be
>>> represented in `$E$` such that the original code unit
>>> sequence of `$S$` can be reconstructed.
>>>
>>> Same words, just slightly more intentional about the
>>> intent. I hope that's okay with everybody. (diff:
>>> https://github.com/brevzin/cpp_proposals/commit/fc263d0be55e189a6f98996a7cb06f2f87f82bfd,
>>> rendered:
>>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_21)
>>>
>>> Thanks again,
>>>
>>> Barry
>>>
>>> On Thu, May 12, 2022 at 8:56 AM Tom Honermann
>>> <tom_at_[hidden]> wrote:
>>>
>>> Ship it!
>>>
>>> Thank you for sticking with us through all these
>>> iterations!
>>>
>>> Tom.
>>>
>>> On 5/11/22 9:44 PM, Barry Revzin wrote:
>>>> Done!
>>>>
>>>> Barry "Ship it?" Revzin
>>>>
>>>> On Wed, May 11, 2022 at 3:36 PM Tom Honermann
>>>> <tom_at_[hidden]> wrote:
>>>>
>>>> Hi, Barry. We discussed in today's SG16 meeting
>>>> and identified one last minor change to make.
>>>> We then polled forwarding the paper to LWG with
>>>> unanimous consent so this is definitely the
>>>> last change!
>>>>
>>>> In 2.3.1, substitute "character" for "UCS
>>>> scalar value" in the first sentence and in the
>>>> table header.
>>>>
>>>> If /C/ is one of the [-UCS scalar values-]
>>>> _characters_ in the table below, then
>>>> the two characters shown as the
>>>> corresponding escape sequence are appended
>>>> to /E/:
>>>>
>>>> [-UCS scalar value-] _character_    escape sequence
>>>> U+0009 CHARACTER TABULATION         |\t|
>>>> U+000A LINE FEED                    |\n|
>>>> U+000D CARRIAGE RETURN              |\r|
>>>> U+0022 QUOTATION MARK               |\"|
>>>> U+005C REVERSE SOLIDUS              |\\|
>>>>
>>>> Tom.
>>>>
>>>> On 5/11/22 12:56 PM, Tom Honermann via SG16 wrote:
>>>>>
>>>>> I have a weak preference for "character" given
>>>>> that the wording is intended to address
>>>>> Unicode and non-Unicode behavior. I don't
>>>>> think we have any normative uses of "code
>>>>> point" at present.
>>>>>
>>>>> The definition of "code point" we have via our
>>>>> normative reference to ISO/IEC 10646 is:
>>>>> "value in the UCS codespace". That doesn't
>>>>> really work for the non-Unicode case and,
>>>>> regardless, would include surrogate code
>>>>> points which I don't think we want in this
>>>>> context.
>>>>>
>>>>> Tom.
>>>>>
>>>>> On 5/11/22 12:24 PM, Victor Zverovich via SG16
>>>>> wrote:
>>>>>> Thanks Tom and others for revising the
>>>>>> wording. The latest version of the escaping
>>>>>> section looks good to me with only one minor
>>>>>> question: is it clear that "character" in
>>>>>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_14
>>>>>> means a code point or shall we use the term
>>>>>> code point instead?
>>>>>>
>>>>>> Cheers,
>>>>>> Victor
>>>>>>
>>>>>> On Tue, May 10, 2022 at 6:32 PM Barry Revzin
>>>>>> <barry.revzin_at_[hidden]> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, May 10, 2022 at 1:31 PM Tom
>>>>>> Honermann <tom_at_[hidden]> wrote:
>>>>>>
>>>>>> On 5/9/22 7:34 PM, Barry Revzin wrote:
>>>>>>>
>>>>>>>
>>>>>>> On Mon, May 9, 2022 at 4:14 PM Tom
>>>>>>> Honermann <tom_at_[hidden]> wrote:
>>>>>>>
>>>>>>> On 5/8/22 4:04 PM, Barry Revzin
>>>>>>> via SG16 wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sun, May 8, 2022 at 9:22 AM
>>>>>>>> Victor Zverovich
>>>>>>>> <victor.zverovich_at_[hidden]> wrote:
>>>>>>>>
>>>>>>>> > One thing I noticed is
>>>>>>>> that the wording about
>>>>>>>> Grapheme_Extend is gone. I
>>>>>>>> didn't know what this meant
>>>>>>>> before, so I don't know now
>>>>>>>> if this is a good removal
>>>>>>>> or a bad removal.
>>>>>>>>
>>>>>>>> I don't recall any requests
>>>>>>>> for removing it and think
>>>>>>>> that it should be
>>>>>>>> reintroduced.
>>>>>>>>
>>>>>>>> - Victor
>>>>>>>>
>>>>>>>> On Wed, May 4, 2022 at
>>>>>>>> 10:44 PM Jens Maurer
>>>>>>>> <Jens.Maurer_at_[hidden]> wrote:
>>>>>>>>
>>>>>>>> On 05/05/2022 04.08,
>>>>>>>> Barry Revzin wrote:
>>>>>>>> > I think I have
>>>>>>>> applied this. Here's
>>>>>>>> the rendered version:
>>>>>>>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
>>>>>>>>
>>>>>>>> > How does this look?
>>>>>>>>
>>>>>>>> p2.2
>>>>>>>>
>>>>>>>> For each code sequence
>>>>>>>> X in S that either
>>>>>>>> encodes a single
>>>>>>>> character or encoding
>>>>>>>> state transition or
>>>>>>>> that is a sequence of
>>>>>>>> ill-formed code units
>>>>>>>> is processed in order
>>>>>>>> as follows:
>>>>>>>>
>>>>>>>> That feels like bad
>>>>>>>> English grammar to me.
>>>>>>>>
>>>>>>>> Why "encoding", yet
>>>>>>>> there is an "encodes"
>>>>>>>> before that?
>>>>>>>> Why "either" and there
>>>>>>>> are three things that don't
>>>>>>>> exactly correspond
>>>>>>>> grammatically?
>>>>>>>>
>>>>>>>> Maybe make a bulleted
>>>>>>>> sub-list with the three
>>>>>>>> items
>>>>>>>> so that the structure
>>>>>>>> is clear.
>>>>>>>>
>>>>>>>> "If C is one of the UCS
>>>>>>>> scalar values the table
>>>>>>>> below,"
>>>>>>>>
>>>>>>>> add "in"
>>>>>>>>
>>>>>>>> better clarify: "the
>>>>>>>> two characters shown as the
>>>>>>>> corresponding escape
>>>>>>>> sequence are appended to E"
>>>>>>>>
>>>>>>>>
>>>>>>>> after p2.3.4, p2.5
>>>>>>>>
>>>>>>>> "simple-hexadecimal-digit-sequence"
>>>>>>>>
>>>>>>>> I would not re-use
>>>>>>>> lexing grammar for a
>>>>>>>> local placeholder,
>>>>>>>> just say
>>>>>>>> \u{/hex-digit-sequence/}
>>>>>>>> or so.
>>>>>>>>
>>>>>>>>
>>>>>>>> p2.5
>>>>>>>>
>>>>>>>> "Otherwise, X is a
>>>>>>>> sequence of ill-formed
>>>>>>>> code units. Each"
>>>>>>>>
>>>>>>>> -> "Otherwise (X is a
>>>>>>>> sequence of ill-formed
>>>>>>>> code units), each code
>>>>>>>> unit ..."
>>>>>>>>
>>>>>>>>
>>>>>>>> "U+0027 APOSTROPHE is
>>>>>>>> escaped as \' while
>>>>>>>> U+0022 QUOTATION MARK
>>>>>>>> is left unchanged."
>>>>>>>>
>>>>>>>> Can we rephrase that to
>>>>>>>> avoid "is escaped as"?
>>>>>>>> We were on such a good
>>>>>>>> track to just append
>>>>>>>> characters and avoid
>>>>>>>> any judgment calls.
>>>>>>>>
>>>>>>>> suggestion "
>>>>>>>> - for each character
>>>>>>>> U+0027 APOSTROPHE in S,
>>>>>>>> the two characters \'
>>>>>>>> are appended to E
>>>>>>>> - U+0022 QUOTATION
>>>>>>>> MARK is left unchanged"
>>>>>>>>
>>>>>>>>
>>>>>>>> Jens
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks Jens and Victor! I did
>>>>>>>> my best to apply the suggested
>>>>>>>> changes:
>>>>>>>>
>>>>>>>> * Updated rendered wording:
>>>>>>>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
>>>>>>>> * New diff:
>>>>>>>> https://github.com/brevzin/cpp_proposals/commit/3d93043f5c296810d7e18b11d5b7083143554309
>>>>>>>>
>>>>>>>>
>>>>>>>> Hopefully, this gradient is
>>>>>>>> slowly descending to the
>>>>>>>> correct solution :-)
>>>>>>>
>>>>>>> Thanks, Barry. This appears to
>>>>>>> have incorporated the parts of
>>>>>>> my prior suggestions that did
>>>>>>> not have opposition, so just
>>>>>>> minor issues noted below.
>>>>>>>
>>>>>>> Discussion at the last meeting
>>>>>>> <https://github.com/sg16-unicode/sg16-meetings#april-27th-2022>
>>>>>>> revealed that we're failing to
>>>>>>> specify the encoding used to
>>>>>>> interpret /S/. Change p2 as
>>>>>>> follows: (perhaps substitute "as
>>>>>>> described below" for "as follows")
>>>>>>>
>>>>>>> The escaped string /E/
>>>>>>> representation of a string
>>>>>>> /S/ is constructed by
>>>>>>> encoding a sequence of
>>>>>>> characters _as follows._
>>>>>>> [-in the-] _The_ associated
>>>>>>> character encoding /CE/ for
>>>>>>> charT ([lex.string.literal]
>>>>>>> <http://eel.is/c++draft/tab:lex.string.literal>)
>>>>>>> [-as follows:-] _is used both
>>>>>>> to interpret /S/ and to
>>>>>>> construct /E/._
>>>>>>>
>>>>>>> In p2.2, "code sequence" ->
>>>>>>> "code unit sequence".
>>>>>>>
>>>>>>> In p2.3.4 and p2.5, I don't
>>>>>>> think we should re-use the
>>>>>>> /hexadecimal-digit/ grammar term
>>>>>>> here. Just say, "hexadecimal
>>>>>>> digits".
>>>>>>>
>>>>>>> Add the following note to p2.4
>>>>>>> to address a request by Hubert:
>>>>>>>
>>>>>>> Otherwise, if /X/ encodes a
>>>>>>> state transition, the effect
>>>>>>> on /E/ is unspecified. _[
>>>>>>> /Note:/ the intent is that a
>>>>>>> state transition be
>>>>>>> represented in /E/ such that
>>>>>>> its original code unit
>>>>>>> sequence can be
>>>>>>> reconstructed /- end note/ ]_
>>>>>>>
>>>>>>> Hubert pointed out during the
>>>>>>> last meeting that we should not
>>>>>>> be trying to interpret state
>>>>>>> transitions for stateful
>>>>>>> encodings as I had previously
>>>>>>> been trying to do. I think we
>>>>>>> can now simplify p2.5:
>>>>>>>
>>>>>>> Otherwise (/X/ is a sequence
>>>>>>> of ill-formed code units),
>>>>>>> each code unit /U/ is
>>>>>>> appended to /E/ in order as
>>>>>>> the sequence
>>>>>>> /\x{hex-digit-sequence}/,
>>>>>>> where /hex-digit-sequence/
>>>>>>> is the shortest hexadecimal
>>>>>>> representation of /U/ using
>>>>>>> lower-case hexadecimal
>>>>>>> digits. [-When encoding a
>>>>>>> stateful character encoding,
>>>>>>> these additions should have
>>>>>>> no effect on encoding state.-]
>>>>>>>
>>>>>>> In p3, we now need to drop "in a
>>>>>>> Unicode encoding". I think the
>>>>>>> result should also produce a
>>>>>>> string, not a character.
>>>>>>>
>>>>>>> The escaped [-character-]
>>>>>>> _string_ representation of a
>>>>>>> character /C/ [-in a Unicode
>>>>>>> encoding-] is equivalent to
>>>>>>> the escaped string
>>>>>>> representation of a string
>>>>>>> of /C/, except that:
>>>>>>>
>>>>>>> p4 should be removed now.
>>>>>>>
>>>>>>> The escaped character and
>>>>>>> escaped string
>>>>>>> representations of a
>>>>>>> character or string in a
>>>>>>> non-Unicode encoding is
>>>>>>> unspecified.
>>>>>>>
>>>>>>> Hubert, the wording does not
>>>>>>> explicitly address your request
>>>>>>> to be able to specify spacing
>>>>>>> and separator characters as a
>>>>>>> set of encoding agnostic code
>>>>>>> point values. I think the
>>>>>>> existing wording suffices to
>>>>>>> meet your goals since an
>>>>>>> implementation can document a
>>>>>>> method of identifying the set of
>>>>>>> escaped characters by, for
>>>>>>> example, specifying characters
>>>>>>> in EBCDIC 1047 and describing
>>>>>>> how to map those to other code
>>>>>>> pages. If you don't agree, could
>>>>>>> you suggest how the wording
>>>>>>> might be updated to better
>>>>>>> address your concern?
>>>>>>>
>>>>>>> Tom.
>>>>>>>
>>>>>>>
>>>>>>> Thanks, Tom! I applied these
>>>>>>> changes. The diff can be found here:
>>>>>>> https://github.com/brevzin/cpp_proposals/commit/6745d72f8c002b7ce8811f0c6aeb5591cff97d54
>>>>>>
>>>>>> Thanks, Barry. This looks good to me
>>>>>> modulo Hubert's additional tweak.
>>>>>>
>>>>>> One last thing I noticed. The example
>>>>>> section has this:
>>>>>>
>>>>>> string s4 = format("[{:?}]",
>>>>>> string("\0 \n \t \x02 \x1b", 9));
>>>>>>
>>>>>> // s4 has value [\u{0} \n \t
>>>>>> \u{2} \u{1b}]
>>>>>>
>>>>>> That example depends on the encoding
>>>>>> being ASCII-based in order for the
>>>>>> \x02 and \x1b escapes to be
>>>>>> interpreted as characters \u{2} and
>>>>>> \u{1b}. Similarly, s5 and s6 have
>>>>>> UTF-8 dependencies. Perhaps we should
>>>>>> add a comment?
>>>>>>
>>>>>> string s0 = format("[{}]", "h\tllo");
>>>>>> // s0 has value: [h llo]
>>>>>> string s1 = format("[{:?}]", "h\tllo");
>>>>>> // s1 has value: ["h\tllo"]
>>>>>> string s2 = format("[{:?}]", "Спасибо, Виктор ♥!");
>>>>>> // s2 has value: ["Спасибо, Виктор ♥!"]
>>>>>> string s3 = format("[{:?}, {:?}]", '\'', '"');
>>>>>> // s3 has value: ['\'', '"']
>>>>>> _// The following examples assume use of the UTF-8 encoding._
>>>>>> string s4 = format("[{:?}]", string("\0 \n \t \x02 \x1b", 9));
>>>>>> // s4 has value [\u{0} \n \t \u{2} \u{1b}]
>>>>>> string s5 = format("[{:?}]", "\xc3\x28"); // invalid UTF-8
>>>>>> // s5 has value: ["\x{c3}\x{28}"]
>>>>>> string s6 = format("[{:?}]",
>>>>>> "🤷🏻‍♂️"); //
>>>>>> s6 has value:
>>>>>> ["🤷🏻\u{200d}♂\u{fe0f}"]
>>>>>>
>>>>>> I never got around to translating
>>>>>> "Спасибо, Виктор ♥!" until now. Very
>>>>>> nice :)
>>>>>>
>>>>>> Tom
>>>>>>
>>>>>>
>>>>>> Applied Hubert's change and added this
>>>>>> comment:
>>>>>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
>>>>>> Thanks!
>>>>>>
>>>>>> The decreasing rate of requested changes
>>>>>> is encouraging!
>>>>>>
>>>>>> Barry
>>>>>>
>>>>>>
>>>>>
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>

Received on 2022-05-16 17:11:06