On 10/9/23 4:54 PM, Elias Kosunen via SG16 wrote:

Hi SG16,

There are updates to the paper since the last telecon. The majority of the comments from that telecon should now be addressed.

A draft is in the paper system, under D1729R3: https://isocpp.org/files/papers/D1729R3.html

Excellent, thank you!

On 10/8/23 06:20, Tom Honermann via SG16 wrote:

SG16 will hold a telecon on Wednesday, October 11th, at 19:30 UTC (timezone conversion).

The agenda follows.

P1729R2: Text Parsing:

Continue review.

We made good progress reviewing this paper during the 2023-09-27 meeting and I expect we'll complete review in this meeting. Since wording is not yet available, we won't poll forwarding this paper, but may poll support for the paper and approval of the design as encouragement for LEWG to review the design before a large investment is made in wording. We will, of course, review again once wording is available.

One item I would like to discuss is that the proposed functionality allows for a single code unit to be scanned and produced as a char (or wchar_t) value. What does that imply for the following example (assume that the ordinary literal encoding is UTF-8)?

// U+12345 is 0xF0 0x92 0x8D 0x85 in UTF-8
std::scan<char, std::string>("\u{12345}", "{}{}");

The scan of the char value presumably consumes the 0xF0 code unit, so the scan of the std::string value then begins at the 0x92 trailing code unit. That presumably results in a scan error, substitution of a replacement character in the scanned string, or a sequence of ill-formed code units scanned into the string. All of these seem undesirable, particularly if the programmer's expectation was that a member of the basic character set would be scanned for the char value. There are at least a couple of alternative options that we can consider:

Scan of a single char or wchar_t value is only considered successful if the value read corresponds to a character that is encoded as a single code unit. E.g., a scan of a leading or surrogate code unit is an error.

Scan of a single char or wchar_t value consumes the full code unit sequence for the encoded character, the first code unit is used as the scanned value, and the remaining code units are discarded.
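To make the two options concrete, here is a rough sketch of each for char input, assuming UTF-8. The names utf8_sequence_length, char_scan_result, scan_char_strict, and scan_char_consume_sequence are made up for illustration and are not part of P1729. The lead-byte classification is deliberately simplified: it does not validate continuation code units or reject overlong forms.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper: length of the UTF-8 sequence introduced by a lead
// code unit, or 0 if the unit is a continuation (or otherwise not a lead).
// Simplified: no validation of continuation units or overlong encodings.
constexpr std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // single code unit (ASCII range)
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // 10xxxxxx or invalid
}

// Hypothetical result type: the scanned char plus the unconsumed input.
struct char_scan_result {
    char value;
    std::string_view rest;
};

// Option 1 (sketch): succeed only if the next character is encoded as a
// single code unit; a leading or trailing code unit is an error.
std::optional<char_scan_result> scan_char_strict(std::string_view in) {
    if (in.empty()) return std::nullopt;
    if (utf8_sequence_length(static_cast<unsigned char>(in.front())) != 1)
        return std::nullopt;
    return char_scan_result{in.front(), in.substr(1)};
}

// Option 2 (sketch): consume the whole code unit sequence, report the
// first code unit as the value, and discard the remaining code units.
std::optional<char_scan_result> scan_char_consume_sequence(std::string_view in) {
    if (in.empty()) return std::nullopt;
    const std::size_t len =
        utf8_sequence_length(static_cast<unsigned char>(in.front()));
    if (len == 0 || len > in.size()) return std::nullopt;
    return char_scan_result{in.front(), in.substr(len)};
}
```

With the input from the example above, the first sketch reports an error, while the second yields the 0xF0 code unit and leaves nothing ill-formed for the subsequent std::string scan.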

As proposed, the behavior of the code above is that the scanned char would contain '\xF0', and the string "\x92\x8D\x85", while also potentially invoking erroneous behavior (if we end up going down that route). In my opinion, this reflects the "garbage in, garbage out" discussion that we had in the last telecon.
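For illustration only, the split described above can be mimicked without the proposed API. The function scan_char_then_string below is a hypothetical stand-in, not std::scan: it takes exactly one code unit for the char and hands everything after it to the string, assuming the input contains no whitespace.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical stand-in for scanning "{}{}" into <char, std::string>.
// It reads one code unit as the char, with no regard for whether that
// unit starts a multi-unit sequence, and copies the rest into the string.
std::pair<char, std::string> scan_char_then_string(std::string_view input) {
    assert(!input.empty());  // this sketch has no error reporting
    const char first = input.front();
    std::string rest(input.substr(1));
    return {first, std::move(rest)};
}
```

Called with the UTF-8 encoding of U+12345 ("\xF0\x92\x8D\x85"), this returns '\xF0' and a three-byte string that begins with a trailing code unit, so the string is not valid UTF-8 on its own.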

Personally, I'm not convinced that we want reading a char to have any other behavior than just reading the next code unit in the input. It could be argued, though, that this is an out-of-range error, similar to what can be encountered when reading an integer: if the char doesn't encode a code point, return an out-of-range error.

This may deserve further thought.

I don't think we place any restrictions on std::format with regard to formatting of individual char values. The following is accepted and probably useful at times; the ability to produce byte-precise output is a feature.

std::format("{}", '\x80');

Should we think of std::scan in the same way? Is byte-precise scanning a desirable feature? Almost certainly. I find myself thinking, though, that it is probably useful to have a different specifier to opt in to such behavior in this case. Perhaps "{:?}".

Tom.

I reviewed the current list of papers with an SG16 label, but am refraining from adding any more to the agenda for this meeting. The timing looks right for revisiting the following papers for the following meeting assuming author availability.

P2749: Down with "character"

P2626: charN_t incremental adoption: Casting pointers of UTF character types

Tom.

- Elias