ISOCPP sg16 List: Re: Agenda for the 2023-10-11 SG16 telecon

From: Elias Kosunen <isocpp_at_[hidden]>
Date: Mon, 9 Oct 2023 23:54:10 +0300

Hi SG16,

The paper has been updated since the last telecon, and most of the
comments from that telecon should now be addressed.

A draft is in the paper system, under D1729R3:
https://isocpp.org/files/papers/D1729R3.html

On 10/8/23 06:20, Tom Honermann via SG16 wrote:
>
> SG16 will hold a telecon on Wednesday, October 11th, at 19:30 UTC
> (timezone conversion
> <https://www.timeanddate.com/worldclock/converter.html?iso=20231011T193000&p1=1440&p2=tz_pt&p3=tz_mt&p4=tz_ct&p5=tz_et&p6=tz_cest>).
>
> The agenda follows.
>
> * P1729R2: Text Parsing <https://wg21.link/p1729r2>:
> o Continue review.
>
> We made good progress reviewing this paper during the 2023-09-27
> meeting
> <https://github.com/sg16-unicode/sg16-meetings/tree/master#september-27th-2023>
> and I expect we'll complete review in this meeting. Since wording is
> not yet available, we won't poll forwarding this paper, but may poll
> support for the paper and approval of the design as encouragement for
> LEWG to review the design before a large investment is made in
> wording. We will, of course, review again once wording is available.
>
> One item I would like to discuss is that the proposed functionality
> allows for a single code unit to be scanned and produced as a char (or
> wchar_t) value. What does that imply for the following example (assume
> that the ordinary literal encoding is UTF-8)?
>
> // U+12345 is 0xF0 0x92 0x8D 0x85 in UTF-8
> std::scan<char, std::string>("\u{12345}", "{}{}");
>
> The scan of the char value presumably consumes the 0xF0 code unit such
> that the scan of the std::string value then begins scanning at the
> 0x92 trailing code unit which presumably results in a scan error,
> substitution of a replacement character in the scanned string, or a
> sequence of ill-formed code units scanned into the string, all of
> which seem undesirable; particularly if the programmer's expectation
> was that a member of the basic character set would be scanned for the
> char value. There are at least a couple of alternative options that we
> can consider:
>
> 1. Scan of a single char or wchar_t value is only considered
> successful if the value read corresponds to a character that is
> encoded as a single code unit. E.g., scan of a leading or
> surrogate code unit is an error.
> 2. Scan of a single char or wchar_t value consumes the full code unit
> sequence for the encoded character, the first code unit is used as
> the scanned value, and the remaining code units are discarded.
>
As proposed, the code above would scan '\xF0' into the char and
"\x92\x8D\x85" into the string, while also potentially invoking
erroneous behavior (if we end up going down that route). In my opinion,
this reflects the "garbage in, garbage out" discussion that we had in
the last telecon.
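
To make the as-proposed outcome concrete, here is a minimal sketch that
models it without using the proposed API. The helper
split_first_code_unit and the split_result type are hypothetical names
made up for illustration; they are not part of P1729. The input is
spelled as explicit bytes so that it does not depend on the ordinary
literal encoding.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical type for illustration only (not from P1729).
struct split_result {
    char first;        // the single code unit read as a char
    std::string rest;  // everything after that code unit
};

// Hypothetical helper: models "read one code unit as a char, then read
// the remainder as a string". Returns nullopt on empty input.
std::optional<split_result> split_first_code_unit(std::string_view input) {
    if (input.empty()) {
        return std::nullopt;
    }
    return split_result{input.front(), std::string(input.substr(1))};
}
```

For the input "\xF0\x92\x8D\x85", first holds the 0xF0 code unit and
rest holds the three trailing code units, which are not well-formed
UTF-8 by themselves.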

Personally, I'm not convinced that we want reading a char to do anything
other than read the next code unit in the input. That said, it could be
argued that this is an out-of-range error, similar to what can be
encountered when reading an integer: if the char doesn't encode a code
point by itself, return an out-of-range error.
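
A rough sketch of that out-of-range reading, assuming UTF-8 input: a
code unit encodes a code point on its own only when it is below 0x80.
The names char_scan_status, char_scan_result, and scan_single_unit_char
are hypothetical and chosen only for this example; nothing here is the
paper's API or its error model.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical status codes for illustration only.
enum class char_scan_status { ok, end_of_input, out_of_range };

struct char_scan_result {
    char_scan_status status;
    char value;            // meaningful only when status == ok
    std::size_t consumed;  // number of code units consumed
};

// Hypothetical helper: succeeds only if the next UTF-8 code unit is a
// complete code point by itself (i.e. it is in the range 0x00-0x7F).
char_scan_result scan_single_unit_char(std::string_view input) {
    if (input.empty()) {
        return {char_scan_status::end_of_input, '\0', 0};
    }
    const auto unit = static_cast<unsigned char>(input.front());
    if (unit >= 0x80) {
        // Lead or continuation code unit: report an error, consume nothing.
        return {char_scan_status::out_of_range, '\0', 0};
    }
    return {char_scan_status::ok, input.front(), 1};
}
```

Under this sketch, the U+12345 input fails at the 0xF0 code unit and
nothing is consumed, so the following string scan is never handed the
trailing code units.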

This may deserve further thought.
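
For comparison, the second alternative in the quoted message (consume
the whole code unit sequence, keep the first code unit, discard the
rest) could look roughly like the sketch below. It assumes UTF-8 input,
derives the sequence length from the lead byte only, and does not
validate the continuation bytes. utf8_sequence_length,
discarding_scan_result, and scan_char_discarding_trail are hypothetical
names, not anything from P1729.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper: sequence length implied by a UTF-8 lead byte.
// Stray continuation bytes and invalid lead bytes are treated as length 1.
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

struct discarding_scan_result {
    char value;            // first code unit of the sequence
    std::size_t consumed;  // code units consumed, including discarded ones
};

// Hypothetical helper: keeps the first code unit and skips the rest of
// the sequence, clamped to the available input.
discarding_scan_result scan_char_discarding_trail(std::string_view input) {
    if (input.empty()) {
        return {'\0', 0};
    }
    const auto lead = static_cast<unsigned char>(input.front());
    const std::size_t length =
        std::min(utf8_sequence_length(lead), input.size());
    return {input.front(), length};
}
```

Here the string scan would resume after the full four code units of
U+12345, but the scanned char would still be the lone 0xF0 code unit.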

> I reviewed the current list of papers with an SG16 label
> <https://github.com/cplusplus/papers/issues?q=is%3Aissue+is%3Aopen+label%3Asg16+-label%3Aneeds-revision>,
> but am refraining from adding any more to the agenda for this meeting.
> The timing looks right for revisiting the following papers for the
> following meeting assuming author availability.
>
> * P2749: Down with "character" <https://wg21.link/p2749>
> * P2626: charN_t incremental adoption: Casting pointers of UTF
> character types <https://wg21.link/p2626>
>
> Tom.
>
>
- Elias

Received on 2023-10-09 20:54:16