On 5/3/19 7:44 PM, JeanHeyd Meneide wrote:
Note that c8rtomb is actually under-specified in the current C and C++ standards: that is what DR 488 was for. It was fixed by Philipp K. Krause's N2040 (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2040.htm), which was applied to C2x, albeit I forget whether the fix was also applied to the c32rtomb functions.

Well, c8rtomb is definitely under-specified in current C standards since it isn't defined there at all :)

When drafting the wording for c8rtomb for C++, I did incorporate updates from N2040.  P0482R6 contains the following note:

Drafting note: The wording for mbrtoc8 and c8rtomb is derived from wording for mbrtoc16 and c16rtomb in C18 (WG14 N2176), augmented by changes suggested in WG14 N2040 for WG14 DR488 to properly account for UTF-8 being a variable length encoding, and lightly edited for formatting style. The author was reluctant to stray from the existing C wording for related functions despite a belief that considerable improvements to the wording would be possible.

With regard to:

In the case that nothing is stored, use the return value of 0 as a marker that the current character is valid but the mbstate has been modified and that you may be working with a multi-byte sequence, and that you need to feed more input into c8rtomb with the same mbstate_t.

I think this is consistent with the current wording, though the wording is not explicit about this case.

With a return value of 0, you can sanity-check the implementation by doing mbsinit(&my_mb_state) and checking if it does NOT return the "I am still in the initial stateless sequence" value after claiming a return value of 0 (the mbstate_t object should be modified since it should be storing part of the accumulated multi-byte sequence).
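The check described above can be sketched without relying on any particular libc, since c8rtomb is not yet widely available. The following is only an illustration of the protocol: toy_state, toy_is_initial, toy_seq_len, toy_c8rtomb and toy_demo are hypothetical names (stand-ins for mbstate_t, mbsinit and c8rtomb), the "multibyte" encoding is assumed to be UTF-8 itself, and validation is deliberately simplified (overlong forms and surrogates are not rejected).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for mbstate_t: a partially received UTF-8 sequence. */
typedef struct {
    unsigned char buf[4]; /* code units accumulated so far */
    unsigned char have;   /* number of code units stored in buf */
    unsigned char need;   /* total length of the sequence, 0 when idle */
} toy_state;

/* Analog of mbsinit(): nonzero when no partial sequence is pending. */
static int toy_is_initial(const toy_state *st) { return st->have == 0; }

/* Sequence length implied by a lead byte, or 0 if it cannot start one. */
static unsigned char toy_seq_len(unsigned char lead) {
    if (lead < 0x80) return 1;
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 0;
}

/* Toy analog of c8rtomb(). Returns the number of bytes stored in out
   (0 while a sequence is still being accumulated), or (size_t)-1 on
   invalid input. out must have room for 4 bytes. */
static size_t toy_c8rtomb(char *out, unsigned char c8, toy_state *st) {
    if (st->have == 0) {
        st->need = toy_seq_len(c8);
        if (st->need == 0) return (size_t)-1;  /* state stays initial */
    } else if ((c8 & 0xC0) != 0x80) {
        st->have = 0;                          /* not a continuation byte */
        st->need = 0;
        return (size_t)-1;
    }
    st->buf[st->have++] = c8;
    if (st->have < st->need) return 0;         /* nothing stored yet */
    size_t n = st->need;
    memcpy(out, st->buf, n);
    st->have = 0;                              /* back to the initial state */
    st->need = 0;
    return n;
}

/* Feeds U+20AC (E2 82 AC) one code unit at a time and applies the
   sanity check: a return of 0 must leave the state non-initial. */
static size_t toy_demo(char out[4]) {
    static const unsigned char euro[3] = {0xE2, 0x82, 0xAC};
    toy_state st = {{0}, 0, 0};
    size_t r = 0;
    for (size_t i = 0; i < 3; ++i) {
        r = toy_c8rtomb(out, euro[i], &st);
        if (r == 0) assert(!toy_is_initial(&st));
        else        assert(toy_is_initial(&st));
    }
    return r;
}
```

With a real implementation the same check would presumably be written with mbsinit(&my_mb_state) directly after each c8rtomb call that returns 0.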

To be honest with you, the whole situation is a bit awful and, what's worse, there are no string versions of any of these functions for fast, efficient processing (c8srtombs/mbsrtoc8s, c16srtombs/mbsrtoc16s, c32srtombs/mbsrtoc32s): they are just straight up missing. The latter two pairs in that list are being fixed by Philipp K. Krause's N2282 (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2282.htm). You should write to the C and/or C++ representatives in your country (or, really, anyone who's listening) and tell them that we need these for fast, competitive implementations that hope to hold a candle to the proper Unicode conversion utilities employed around the world. (One of the points of pushback on that paper is "waiting for implementation experience and feedback", I think?)

I don't know how Tom feels about jumping the gun and writing c8srtombs/mbsrtoc8s for the C++ standard before their friends (c16srtombs/mbsrtoc16s, c32srtombs/mbsrtoc32s) are accepted into the C standard, but I would highly encourage us to do that, because one-by-one code point processing is a mistake for efficient processing. In days gone by, the C Committee added mbsrtowcs and other multiple-code-point functions to the C standard for exactly this reason. Why the C Committee is now about to wait and make the same mistake is something I do not quite understand.
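To make the gap concrete: without string versions, callers have to write the loop themselves on top of the one-code-point functions. Below is a rough sketch of such a wrapper over the standard c32rtomb. The name sketch_c32stombs and its signature are assumptions made for illustration only; it is not the interface proposed in N2282, and a per-character loop like this is exactly the slow path a real string function could avoid.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper, not a standard function. Converts the
   null-terminated char32_t string src to the locale's multibyte
   encoding by calling c32rtomb() once per code point. Writes at most
   len bytes to dst, including the terminating null. Returns the number
   of bytes written excluding the null, or (size_t)-1 on an encoding
   error or if dst is too small. */
static size_t sketch_c32stombs(char *dst, const char32_t *src,
                               size_t len, mbstate_t *ps) {
    char tmp[MB_LEN_MAX];
    size_t total = 0;
    for (;;) {
        size_t n = c32rtomb(tmp, *src, ps);
        if (n == (size_t)-1) return (size_t)-1;  /* encoding error */
        if (n > len - total) return (size_t)-1;  /* out of space */
        memcpy(dst + total, tmp, n);
        if (*src == 0) return total + n - 1;     /* null was stored */
        total += n;
        ++src;
    }
}
```

A caller would zero-initialize an mbstate_t (for example with memset), then call sketch_c32stombs(out, U"hi", sizeof out, &state). Non-ASCII input only converts if the current locale's encoding can represent it.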

Philipp, do you perhaps know the history of how C came to have the UTF code-unit-at-a-time conversion functions (e.g., c16rtomb(), mbrtoc16()), but not the UTF string-at-a-time analogs of mbsrtowcs() and wcsrtombs()?


Maybe it's just a matter of being loud and vocal enough to the Committee and its representatives to have it put in?