Re: Suggested wording change for non-Unicode cases in P2286R7: Formatting Ranges

From: Tom Honermann <tom_at_[hidden]>
Date: Mon, 9 May 2022 17:14:41 -0400
On 5/8/22 4:04 PM, Barry Revzin via SG16 wrote:
> On Sun, May 8, 2022 at 9:22 AM Victor Zverovich
> <victor.zverovich_at_[hidden]> wrote:
> > One thing I noticed is that the wording about Grapheme_Extend is
> gone. I didn't know what this meant before, so I don't know now if
> this is a good removal or a bad removal.
> I don't recall any requests for removing it and think that it
> should be reintroduced.
> - Victor
> On Wed, May 4, 2022 at 10:44 PM Jens Maurer <Jens.Maurer_at_[hidden]>
> wrote:
> On 05/05/2022 04.08, Barry Revzin wrote:
> > I think I have applied this. Here's the rendered version:
> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
> > How does this look?
> p2.2
> For each code sequence X in S that either encodes a single
> character or encoding state transition or that is a sequence
> of ill-formed code units is processed in order as follows:
> That feels like bad English grammar to me.
> Why "encoding", yet there is an "encodes" before that?
> Why "either" and there are three things that don't
> exactly correspond grammatically?
> Maybe make a bulleted sub-list with the three items
> so that the structure is clear.
> "If C is one of the UCS scalar values the table below,"
> add "in"
> better clarify: "the two characters shown as the
> corresponding escape sequence are appended to E"
> after p2.3.4, p2.5
> "simple-hexadecimal-digit-sequence"
> I would not re-use lexing grammar for a local placeholder,
> just say \u{/hex-digit-sequence/} or so.
> p2.5
> "Otherwise, X is a sequence of ill-formed code units. Each"
> -> "Otherwise (X is a sequence of ill-formed code units), each
> code unit ..."
> "U+0027 APOSTROPHE is escaped as \' while U+0022 QUOTATION
> MARK is left unchanged."
> Can we rephrase that to avoid "is escaped as"? We were on such a good
> track to just append characters and avoid any judgment calls.
> suggestion "
> - for each character U+0027 APOSTROPHE in S, the two
> characters \' are appended to E
> - U+0022 QUOTATION MARK is left unchanged"
> Jens
> Thanks Jens and Victor! I did my best to apply the suggested changes:
> * Updated rendered wording:
> https://brevzin.github.io/cpp_proposals/2286_fmt_ranges/p2286r8.html#pnum_12
> * New diff:
> https://github.com/brevzin/cpp_proposals/commit/3d93043f5c296810d7e18b11d5b7083143554309
> Hopefully, this gradient is slowly descending to the correct solution :-)

Thanks, Barry. This appears to have incorporated the parts of my prior
suggestions that did not have opposition, so just minor issues noted below.

Discussion at the last meeting
<https://github.com/sg16-unicode/sg16-meetings#april-27th-2022> revealed
that we're failing to specify the encoding used to interpret /S/. Change
p2 as follows: (perhaps substitute "as described below" for "as follows")

    The escaped string /E/ representation of a string /S/ is constructed
    by encoding a sequence of characters _as follows._ [-in t-]_T_he
    associated character encoding /CE/ for charT ([lex.string.literal]
    <http://eel.is/c++draft/tab:lex.string.literal>) [-as follows:-] _is
    used both to interpret /S/ and to construct /E/._

In p2.2, "code sequence" -> "code unit sequence".

In p2.3.4 and p2.5, I don't think we should re-use the
/hexadecimal-digit/ grammar term here. Just say, "hexadecimal digits".

Add the following note to p2.4 to address a request by Hubert:

    Otherwise, if /X/ encodes a state transition, the effect on /E/ is
    unspecified. _[ /Note:/ the intent is that a state transition be
    represented in /E/ such that its original code unit sequence can be
    reconstructed /- end note/ ]_

Hubert pointed out during the last meeting that we should not be trying
to interpret state transitions for stateful encodings as I had
previously been trying to do. I think we can now simplify p2.5:

    Otherwise (/X/ is a sequence of ill-formed code units), each code
    unit /U/ is appended to /E/ in order as the sequence
    /\x{hex-digit-sequence}/, where /hex-digit-sequence/ is the shortest
    hexadecimal representation of /U/ using lower-case hexadecimal
    digits. [-When encoding a stateful character encoding, these
    additions should have no effect on encoding state.-]
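
To make the intent of the simplified p2.5 concrete, here is a rough sketch (an illustration only, not proposed wording and not any implementation's actual code). It appends each code unit of an ill-formed sequence to E as \x{...}, using the shortest lower-case hexadecimal form. The names shortest_lower_hex and append_ill_formed are hypothetical, and the sketch assumes charT is char with code units treated as unsigned char values.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper (not from the paper): shortest lower-case
// hexadecimal representation of a code unit value.
std::string shortest_lower_hex(unsigned value) {
    static constexpr char digits[] = "0123456789abcdef";
    std::string out;
    do {
        // Prepend the least significant digit until no digits remain.
        out.insert(out.begin(), digits[value % 16]);
        value /= 16;
    } while (value != 0);
    return out;
}

// Hypothetical helper (not from the paper): append each code unit U of
// an ill-formed sequence X to E, in order, as \x{hex-digit-sequence}.
// Assumes charT is char; each code unit is read as an unsigned value.
void append_ill_formed(std::string& e, std::string_view x) {
    for (char c : x) {
        e += "\\x{";
        e += shortest_lower_hex(static_cast<unsigned char>(c));
        e += '}';
    }
}
```

For the two code units 0xC3 0x28 this appends \x{c3}\x{28}. A single code unit 0x05 is appended as \x{5} rather than \x{05}, because the shortest representation is requested.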

In p3, we now need to drop "in a Unicode encoding". I think the result
should also produce a string, not a character.

    The escaped [-character-]_string_ representation of a character /C/
    [-in a Unicode encoding-] is equivalent to the escaped string
    representation of a string of /C/, except that:

p4 should be removed now.

    The escaped character and escaped string representations of a
    character or string in a non-Unicode encoding is unspecified.

Hubert, the wording does not explicitly address your request to be able
to specify spacing and separator characters as a set of
encoding-agnostic code point values. I think the existing wording suffices to
meet your goals since an implementation can document a method of
identifying the set of escaped characters by, for example, specifying
characters in EBCDIC 1047 and describing how to map those to other code
pages. If you don't agree, could you suggest how the wording might be
updated to better address your concern?


> Barry

Received on 2022-05-09 21:14:43