ISOCPP sg16 List: Re: Suggested wording change for non-Unicode cases in P2286R7: Formatting Ranges

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 10 May 2022 13:58:10 -0400

On 5/10/22 1:09 PM, Hubert Tong wrote:
> On Mon, May 9, 2022 at 5:14 PM Tom Honermann via SG16
> <sg16_at_[hidden]> wrote:
>
> On 5/8/22 4:04 PM, Barry Revzin via SG16 wrote:
>>
>>
>> On Sun, May 8, 2022 at 9:22 AM Victor Zverovich
>> <victor.zverovich_at_[hidden]> wrote:
>>
>> > One thing I noticed is that the wording about
>> Grapheme_Extend is gone. I didn't know what this meant
>> before, so I don't know now if this is a good removal or a
>> bad removal.
>>
>> I don't recall any requests for removing it and think that it
>> should be reintroduced.
>>
>> - Victor
>>
>> On Wed, May 4, 2022 at 10:44 PM Jens Maurer
>> <Jens.Maurer_at_[hidden]> wrote:
>>
>> On 05/05/2022 04.08, Barry Revzin wrote:
>> > I think I have applied this. Here's the rendered
>> version:
>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
>>
>> > How does this look?
>>
>> p2.2
>>
>> For each code sequence X in S that either encodes a
>> single character or encoding state transition or that is
>> a sequence of ill-formed code units is processed in order
>> as follows:
>>
>> That feels like bad English grammar to me.
>>
>> Why "encoding", yet there is an "encodes" before that?
>> Why "either" and there are three things that don't
>> exactly correspond grammatically?
>>
>> Maybe make a bulleted sub-list with the three items
>> so that the structure is clear.
>>
>> "If C is one of the UCS scalar values the table below,"
>>
>> add "in"
>>
>> better clarify: "the two characters shown as the
>> corresponding escape sequence are appended to E"
>>
>>
>> after p2.3.4, p2.5
>>
>> "simple-hexadecimal-digit-sequence"
>>
>> I would not re-use lexing grammar for a local placeholder,
>> just say \u{/hex-digit-sequence/} or so.
>>
>>
>> p2.5
>>
>> "Otherwise, X is a sequence of ill-formed code units. Each"
>>
>> -> "Otherwise (X is a sequence of ill-formed code units),
>> each code unit ..."
>>
>>
>> "U+0027 APOSTROPHE is escaped as \' while U+0022
>> QUOTATION MARK is left unchanged."
>>
>> Can we rephrase that to avoid "is escaped as"? We were on such a
>> good track to just append characters and avoid any judgment calls.
>>
>> suggestion "
>> - for each character U+0027 APOSTROPHE in S, the two
>> characters \' are appended to E
>> - U+0022 QUOTATION MARK is left unchanged"
>>
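As a minimal sketch of the suggested rule, assuming a narrow `char` string in an ASCII-compatible encoding: the helper name `escape_quotes_for_char` is hypothetical and does not come from P2286. It applies only the two bullets above and copies every other character through untouched.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper, not from P2286: applies only the two rules in
// the suggestion above (assumes an ASCII-compatible narrow encoding).
std::string escape_quotes_for_char(std::string_view s) {
    std::string e;
    for (char c : s) {
        if (c == '\'') {
            e += "\\'";  // U+0027 APOSTROPHE: append the two characters \'
        } else {
            e += c;      // everything else, including U+0022, is unchanged
        }
    }
    return e;
}
```

Phrasing the rule as "append these characters to E" maps directly onto the `e +=` calls, which is the property the suggestion is trying to preserve.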
>>
>> Jens
>>
>>
>> Thanks Jens and Victor! I did my best to apply the suggested changes:
>>
>> * Updated rendered wording:
>> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
>> * New diff:
>> https://github.com/brevzin/cpp_proposals/commit/3d93043f5c296810d7e18b11d5b7083143554309
>>
>>
>> Hopefully, this gradient is slowly descending to the correct
>> solution :-)
>
> Thanks, Barry. This appears to have incorporated the parts of my
> prior suggestions that did not have opposition, so just minor
> issues noted below.
>
> Discussion at the last meeting
> <https://github.com/sg16-unicode/sg16-meetings#april-27th-2022>
> revealed that we're failing to specify the encoding used to
> interpret /S/. Change p2 as follows: (perhaps substitute "as
> described below" for "as follows")
>
> The escaped string /E/ representation of a string /S/ is
> constructed by encoding a sequence of characters<del> in the</del><ins> as
> follows. The</ins> associated character encoding /CE/ for
> charT ([lex.string.literal]
> <http://eel.is/c++draft/tab:lex.string.literal>)<del> as follows:</del><ins> is
> used both to interpret /S/ and to construct /E/.</ins>
>
> In p2.2, "code sequence" -> "code unit sequence".
>
> In p2.3.4 and p2.5, I don't think we should re-use the
> /hexadecimal-digit/ grammar term here. Just say, "hexadecimal digits".
>
> Add the following note to p2.4 to address a request by Hubert:
>
> Otherwise, if /X/ encodes a state transition, the effect on
> /E/ is unspecified.<ins> [ /Note:/ the intent is that a state
> transition be represented in /E/ such that its original code
> unit sequence can be reconstructed /- end note/ ]</ins>
>
> I think this needs to be:
>
> Otherwise, if X encodes a state transition, the effect on E<ins> and
> further decoding of S</ins> is unspecified. [ Note: the intent is that
> a state transition be represented in E such that<del> its</del><ins>
> the</ins> original code unit sequence<ins> of S</ins> can be
> reconstructed -end note ]

That update looks fine to me.

> The issue being that I am not aware of widespread implementation
> experience indicating that observing the state transition in the
> decoding is a "win".
> Indeed, problems can already be anticipated.
>
> For example, how would a /lack/ of return to the initial encoding
> state prior to \0 be represented?
>
> Regardless of what the Core language wording recommends about managing
> state transitions in encoding of string literals, there may be
> external requirements that cause an implementation to encode returns
> to the initial shift state prior to numeric escapes. Once E is placed
> into a non-initial shift state, there may simply be no way to follow
> these rules without introducing additional shift states not originally
> present.

Agreed. I had thought about adding wording to state that implementations
should insert state transitions in /E/ as needed, but then decided that
doing so is already implied by specifying the encoding used to construct
/E/; the implementation should follow the rules of the encoding and
insert state transitions as directed.

Tom.

>
> Hubert pointed out during the last meeting that we should not be
> trying to interpret state transitions for stateful encodings as I
> had previously been trying to do. I think we can now simplify p2.5:
>
> Otherwise (/X/ is a sequence of ill-formed code units), each
> code unit /U/ is appended to /E/ in order as the sequence
> /\x{hex-digit-sequence}/, where /hex-digit-sequence/ is the
> shortest hexadecimal representation of /U/ using lower-case
> hexadecimal digits.<del> When encoding a stateful character
> encoding, these additions should have no effect on encoding state.</del>
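A rough sketch of the p2.5 rule for ill-formed code units, assuming 8-bit code units held in a `std::string`: the helper names `escape_ill_formed_unit` and `escape_ill_formed` are hypothetical and are not part of the proposed wording. `%x` is used because it produces lower-case hexadecimal digits with no leading zeros, which matches "shortest hexadecimal representation".

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper, not from P2286: renders one ill-formed code unit
// as \x{...} using the shortest lower-case hexadecimal form.
// Assumes 8-bit code units.
std::string escape_ill_formed_unit(unsigned char u) {
    char digits[3];  // at most two hex digits for an 8-bit unit, plus NUL
    std::snprintf(digits, sizeof digits, "%x", static_cast<unsigned>(u));
    return std::string("\\x{") + digits + "}";
}

// Appends the escaped form of each code unit to E, in order.
std::string escape_ill_formed(const std::string& units) {
    std::string e;
    for (unsigned char u : units) {
        e += escape_ill_formed_unit(u);
    }
    return e;
}
```

Under these assumptions, the code unit 0x80 becomes `\x{80}` and 0x0a becomes `\x{a}`.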
>
> In p3, we now need to drop "in a Unicode encoding". I think the
> result should also produce a string, not a character.
>
> The escaped <del>character</del><ins>string</ins> representation of a character
> /C/<del> in a Unicode encoding</del> is equivalent to the escaped string
> representation of a string of /C/, except that:
>
> p4 should be removed now.
>
> <del>The escaped character and escaped string representations of a
> character or string in a non-Unicode encoding is unspecified.</del>
>
> Hubert, the wording does not explicitly address your request to be
> able to specify spacing and separator characters as a set of
> encoding agnostic code point values. I think the existing wording
> suffices to meet your goals since an implementation can document a
> method of identifying the set of escaped characters by, for
> example, specifying characters in EBCDIC 1047 and describing how
> to map those to other code pages. If you don't agree, could you
> suggest how the wording might be updated to better address your
> concern?
>
> I think that interpretation works.
>
> Tom.
>
>>
>> Barry
>>

Received on 2022-05-10 17:58:14