Re: Agenda for the 2023-10-11 SG16 telecon

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 10 Oct 2023 16:32:03 -0400
On 10/9/23 4:54 PM, Elias Kosunen via SG16 wrote:
>
> Hi SG16,
>
> There are updates to the paper since the last telecon. The majority of
> the comments from that telecon should now be addressed.
>
> A draft is in the paper system, under D1729R3:
> https://isocpp.org/files/papers/D1729R3.html
>
Excellent, thank you!

> On 10/8/23 06:20, Tom Honermann via SG16 wrote:
>>
>> SG16 will hold a telecon on Wednesday, October 11th, at 19:30 UTC
>> (timezone conversion
>> <https://www.timeanddate.com/worldclock/converter.html?iso=20231011T193000&p1=1440&p2=tz_pt&p3=tz_mt&p4=tz_ct&p5=tz_et&p6=tz_cest>).
>>
>> The agenda follows.
>>
>>   * P1729R2: Text Parsing <https://wg21.link/p1729r2>:
>>       o Continue review.
>>
>> We made good progress reviewing this paper during the 2023-09-27
>> meeting
>> <https://github.com/sg16-unicode/sg16-meetings/tree/master#september-27th-2023>
>> and I expect we'll complete review in this meeting. Since wording is
>> not yet available, we won't poll forwarding this paper, but may poll
>> support for the paper and approval of the design as encouragement for
>> LEWG to review the design before a large investment is made in
>> wording. We will, of course, review again once wording is available.
>>
>> One item I would like to discuss is that the proposed functionality
>> allows for a single code unit to be scanned and produced as a char
>> (or wchar_t) value. What does that imply for the following example
>> (assume that the ordinary literal encoding is UTF-8)?
>>
>> // U+12345 is 0xF0 0x92 0x8D 0x85 in UTF-8
>> std::scan<char, std::string>("\u{12345}", "{}{}");
>>
>> The scan of the char value presumably consumes the 0xF0 code unit
>> such that the scan of the std::string value then begins scanning at
>> the 0x92 trailing code unit which presumably results in a scan error,
>> substitution of a replacement character in the scanned string, or a
>> sequence of ill-formed code units scanned into the string, all of
>> which seem undesirable; particularly if the programmer's expectation
>> was that a member of the basic character set would be scanned for the
>> char value. There are at least a couple of alternative options that
>> we can consider:
>>
>>  1. Scan of a single char or wchar_t value is only considered
>>     successful if the value read corresponds to a character that is
>>     encoded as a single code unit. E.g., scan of a leading or
>>     surrogate code unit is an error.
>>  2. Scan of a single char or wchar_t value consumes the full code
>>     unit sequence for the encoded character, the first code unit is
>>     used as the scanned value, and the remaining code units are
>>     discarded.
>>
> As proposed, the behavior of the code above is that the scanned char
> would contain '\xF0', and the string "\x92\x8D\x85", while also
> potentially invoking erroneous behavior (if we end up going down that
> route). In my opinion, this reflects the "garbage in, garbage out"
> discussion that we had in the last telecon.
>
> Personally, I'm not convinced that we want reading a char to have any
> other behavior than just reading the next code unit in the input.
> Although, it could be argued that this is an out-of-range error,
> similar to something that can be encountered when reading an integer:
> if the char doesn't encode a code point, return an out-of-range error.
>
> This may deserve further thought.
>
I don't think we place any restrictions on std::format with regard to
formatting of individual char values. The following is accepted and
probably useful at times; the ability to produce byte-precise output is
a feature.

    std::format("{}", '\x80');

Should we think of std::scan in the same way? Is byte-precise scanning a
desirable feature? Almost certainly. I find myself thinking that it is
probably useful to have a different specifier to opt in to such behavior
in this case though. Perhaps "{:?}".
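
To make the three behaviors under discussion concrete, here is a rough
sketch that emulates them by hand for UTF-8 input. It does not use
std::scan; all function names are hypothetical and come from neither the
paper nor any library. As a simplification, the "string" part is just
whatever input remains, and trailing code units are not validated.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <utility>

// All names below are hypothetical; none of them come from P1729.
using CharAndRest = std::pair<char, std::string>;

// Length of the UTF-8 sequence introduced by lead byte b, or 0 if b
// cannot start a sequence (for example, a trailing byte such as 0x92).
std::size_t utf8_sequence_length(unsigned char b) {
    if (b < 0x80) return 1;
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 0;
}

// As proposed: the char is simply the next code unit.
std::optional<CharAndRest> scan_code_unit(std::string_view in) {
    if (in.empty()) return std::nullopt;
    return CharAndRest{in[0], std::string(in.substr(1))};
}

// Option 1: succeed only if the character is a single code unit.
std::optional<CharAndRest> scan_single_unit_only(std::string_view in) {
    if (in.empty()) return std::nullopt;
    if (utf8_sequence_length(static_cast<unsigned char>(in[0])) != 1)
        return std::nullopt;
    return CharAndRest{in[0], std::string(in.substr(1))};
}

// Option 2: consume the full sequence, keep only its first code unit.
std::optional<CharAndRest> scan_discard_trailing(std::string_view in) {
    if (in.empty()) return std::nullopt;
    std::size_t n = utf8_sequence_length(static_cast<unsigned char>(in[0]));
    if (n == 0 || n > in.size()) return std::nullopt;
    return CharAndRest{in[0], std::string(in.substr(n))};
}
```

For the input "\xF0\x92\x8D\x85" (U+12345), scan_code_unit yields '\xF0'
and the ill-formed remainder "\x92\x8D\x85", scan_single_unit_only
reports failure, and scan_discard_trailing yields '\xF0' with an empty
remainder.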

Tom.

>> I reviewed the current list of papers with an SG16 label
>> <https://github.com/cplusplus/papers/issues?q=is%3Aissue+is%3Aopen+label%3Asg16+-label%3Aneeds-revision>,
>> but am refraining from adding any more to the agenda for this
>> meeting. The timing looks right for revisiting the following papers
>> for the following meeting assuming author availability.
>>
>>   * P2749: Down with "character" <https://wg21.link/p2749>
>>   * P2626: charN_t incremental adoption: Casting pointers of UTF
>>     character types <https://wg21.link/p2626>
>>
>> Tom.
>>
>>
> - Elias
>

Received on 2023-10-10 20:32:06