Agenda for the 2023-10-11 SG16 telecon

From: Tom Honermann <tom_at_[hidden]>
Date: Sat, 7 Oct 2023 23:20:08 -0400
SG16 will hold a telecon on Wednesday, October 11th, at 19:30 UTC
(timezone conversion
<https://www.timeanddate.com/worldclock/converter.html?iso=20231011T193000&p1=1440&p2=tz_pt&p3=tz_mt&p4=tz_ct&p5=tz_et&p6=tz_cest>).

The agenda follows.

  * P1729R2: Text Parsing <https://wg21.link/p1729r2>:
      o Continue review.

We made good progress reviewing this paper during the 2023-09-27 meeting
<https://github.com/sg16-unicode/sg16-meetings/tree/master#september-27th-2023>
and I expect we'll complete review in this meeting. Since wording is not
yet available, we won't poll forwarding this paper, but may poll support
for the paper and approval of the design as encouragement for LEWG to
review the design before a large investment is made in wording. We will,
of course, review again once wording is available.

One item I would like to discuss is that the proposed functionality
allows for a single code unit to be scanned and produced as a char (or
wchar_t) value. What does that imply for the following example (assume
that the ordinary literal encoding is UTF-8)?

    // U+12345 is 0xF0 0x92 0x8D 0x85 in UTF-8
    std::scan<char, std::string>("\u{12345}", "{}{}");

The scan of the char value presumably consumes the 0xF0 code unit, so the
scan of the std::string value then begins at the 0x92 trailing code unit.
That presumably results in a scan error, substitution of a replacement
character in the scanned string, or a sequence of ill-formed code units
scanned into the string. All of these seem undesirable, particularly if
the programmer's expectation was that a member of the basic character set
would be scanned for the char value.
There are at least a couple of alternative options that we can consider:

 1. Scan of a single char or wchar_t value is only considered successful
    if the value read corresponds to a character that is encoded as a
    single code unit. E.g., scan of a leading or surrogate code unit is
    an error.
 2. Scan of a single char or wchar_t value consumes the full code unit
    sequence for the encoded character, the first code unit is used as
    the scanned value, and the remaining code units are discarded.
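
To make the difference between the two options concrete, here is a
minimal sketch for UTF-8 input only. It is not taken from P1729R2 and
does not use std::scan; the names utf8_sequence_length, char_scan_result,
scan_char_strict, and scan_char_consume are hypothetical and exist only
for this illustration. The sketch classifies the lead code unit and does
not validate trailing code units, which a real implementation would
presumably need to do.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper: number of code units in the UTF-8 sequence that
// starts with `lead`, or 0 if `lead` cannot start a sequence (a trailing
// code unit or an invalid lead such as 0xC0, 0xC1, or 0xF5..0xFF).
constexpr std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;                   // single code unit
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 0;
}

// Hypothetical result type: the scanned char and the unconsumed input.
struct char_scan_result {
    char value;
    std::string_view rest;
};

// Option 1: succeed only if the next character is a single code unit.
std::optional<char_scan_result> scan_char_strict(std::string_view in) {
    if (in.empty()) return std::nullopt;
    if (utf8_sequence_length(static_cast<unsigned char>(in[0])) != 1)
        return std::nullopt;                     // lead or trailing code unit
    return char_scan_result{in[0], in.substr(1)};
}

// Option 2: consume the whole code unit sequence, keep the first code
// unit as the scanned value, and discard the remaining code units.
std::optional<char_scan_result> scan_char_consume(std::string_view in) {
    if (in.empty()) return std::nullopt;
    const std::size_t n =
        utf8_sequence_length(static_cast<unsigned char>(in[0]));
    if (n == 0 || n > in.size()) return std::nullopt;  // ill-formed or truncated
    return char_scan_result{in[0], in.substr(n)};
}
```

Given the four code units for U+12345 followed by "A", scan_char_strict
reports failure, while scan_char_consume yields 0xF0 as the char value and
leaves "A" for the next scan. Under option 2 the scanned char is not a
character on its own, which is part of what I would like us to discuss.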

I reviewed the current list of papers with an SG16 label
<https://github.com/cplusplus/papers/issues?q=is%3Aissue+is%3Aopen+label%3Asg16+-label%3Aneeds-revision>,
but am refraining from adding any more to the agenda for this meeting.
The timing looks right for revisiting the following papers at the next
meeting, assuming author availability.

  * P2749: Down with "character" <https://wg21.link/p2749>
  * P2626: charN_t incremental adoption: Casting pointers of UTF
    character types <https://wg21.link/p2626>

Tom.

Received on 2023-10-08 03:20:09