Date: Fri, 3 May 2019 19:44:48 -0400
I have not written a conversion for this per se. I have used the c32
functions (mbrtoc32/c32rtomb) specifically to roundtrip the conversion
through Unicode code points.
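For illustration, here is a minimal sketch of that kind of roundtrip,
assuming a C11 <uchar.h>. The helper name roundtrip_via_c32 and its
buffer-size convention are made up for this example; the encoding on both
sides is whatever the current locale says it is.

```c
#include <assert.h>
#include <limits.h>   /* MB_LEN_MAX */
#include <string.h>
#include <uchar.h>    /* mbrtoc32, c32rtomb, char32_t, mbstate_t */

/* Hypothetical helper (not a standard function): decodes src one code
   point at a time with mbrtoc32 and re-encodes each code point with
   c32rtomb into dst. Returns the number of bytes written (excluding the
   terminator), or (size_t)-1 on a conversion error or lack of space. */
size_t roundtrip_via_c32(const char *src, char *dst, size_t dst_size)
{
    assert(src != NULL && dst != NULL);

    mbstate_t in_state, out_state;
    memset(&in_state, 0, sizeof in_state);    /* initial conversion state */
    memset(&out_state, 0, sizeof out_state);

    size_t remaining = strlen(src);
    size_t written = 0;

    while (remaining > 0) {
        char32_t cp;
        size_t n = mbrtoc32(&cp, src, remaining, &in_state);
        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;                /* invalid or truncated input */
        if (n == 0)
            break;                            /* decoded a null character */
        if (n == (size_t)-3)
            n = 0;                            /* output without consuming input */

        char tmp[MB_LEN_MAX];
        size_t produced = c32rtomb(tmp, cp, &out_state);
        if (produced == (size_t)-1)
            return (size_t)-1;                /* code point not encodable */
        if (written + produced >= dst_size)
            return (size_t)-1;                /* keep room for the terminator */

        memcpy(dst + written, tmp, produced);
        written += produced;
        src += n;
        remaining -= n;
    }

    if (written >= dst_size)
        return (size_t)-1;
    dst[written] = '\0';
    return written;
}
```

Calling roundtrip_via_c32("abc", out, sizeof out) with a 16-byte out buffer
should hand back the same three bytes; anything beyond ASCII depends on the
locale selected with setlocale.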
Note that c8rtomb is actually under-specified in the current C and C++
standards. That is what DR 488 was for; it was fixed by Philipp K. Krause's
N2040 (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2040.htm), which was
applied to C2x, although I forget whether it was also applied to the
c32rtomb functions.
When nothing is stored, treat the return value of 0 as a marker that the
current code unit is valid, that the mbstate_t has been modified because
you may be in the middle of a multi-byte sequence, and that you need to
feed more input into c8rtomb with the same mbstate_t.
With a return value of 0, you can sanity-check the implementation by calling
mbsinit(&my_mb_state) and checking that it does NOT return the "I am still
in the initial state" value (nonzero) after claiming a return value of 0.
The mbstate_t object should have been modified, since it should be storing
part of the accumulated multi-byte sequence.
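A rough sketch of that sanity check follows. c8rtomb may not be available
on a given toolchain, so this uses C11's c16rtomb with a lone UTF-16 high
surrogate as a stand-in for "a valid code unit that cannot produce output
yet"; that substitution, and the helper name zero_return_leaves_state, are
assumptions of this example. Whether an implementation actually behaves
this way is exactly the under-specified part.

```c
#include <assert.h>
#include <limits.h>   /* MB_LEN_MAX */
#include <string.h>
#include <uchar.h>    /* c16rtomb, char16_t */
#include <wchar.h>    /* mbsinit */

/* Hypothetical check (not a standard function): feeds only a high
   surrogate to c16rtomb and reports whether the implementation
   (a) returned 0, i.e. stored nothing, and
   (b) left the mbstate_t in a non-initial state.
   Returns 1 if both hold, 0 otherwise. */
int zero_return_leaves_state(char16_t high_surrogate)
{
    assert(high_surrogate >= 0xD800 && high_surrogate <= 0xDBFF);

    mbstate_t state;
    memset(&state, 0, sizeof state);          /* initial conversion state */
    char buf[MB_LEN_MAX];

    size_t r = c16rtomb(buf, high_surrogate, &state);
    if (r != 0)
        return 0;                             /* output or an error was reported */

    /* mbsinit returns nonzero only for the initial state, so a partial
       sequence held in `state` should make it return 0 here. */
    return mbsinit(&state) == 0;
}
```

If zero_return_leaves_state(0xD83D) comes back 0, the implementation either
reported an error for the lone surrogate or claimed success without
recording anything, and the next call with the same mbstate_t is unlikely
to produce the right bytes.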
To be honest with you, the whole situation is a bit awful and, what's
worse, there are no string versions of any of these functions for
fast, efficient processing (c8srtombs/mbsrtoc8s, c16srtombs/mbsrtoc16s,
c32srtombs/mbsrtoc32s): they are just straight up missing. The latter 2 in
that list are being fixed by Philipp K. Krause's N2282 (
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2282.htm) -- you should
write to your C and/or C++ representatives in your country (or, really,
anyone who's listening) and tell them that we need these for fast,
competitive implementations that hope to hold a candle to proper Unicode
conversion utilities employed around the world. (One of the points of pushback
surrounding that paper is "waiting for implementation experience and
feedback", I think?) I don't know how Tom feels about jumping the gun and
writing c8srtombs/mbsrtoc8s for the C++ standard before its friends (
c16srtombs/mbsrtoc16s, c32srtombs/mbsrtoc32s) are accepted into the C
standard, but I would highly encourage that to be a thing we do because
one-by-one code point processing is a mistake if you care about speed. In
days gone by, the C Committee added mbsrtowcs and other multiple-code point
functions to the C standard for a reason (this reason); why the C standard
is about to wait on this and make the same mistake again is something I do
not quite understand.
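To make the gap concrete, here is roughly what everyone has to hand-roll
today. The name my_c32stombs and its signature are invented for this
sketch; they are not taken from N2282 or from any standard. Note that it
still calls c32rtomb once per code point, so it only shows the shape of a
string-level interface. A real implementation behind such an interface
would be free to convert whole runs at once, which is the point.

```c
#include <assert.h>
#include <limits.h>   /* MB_LEN_MAX */
#include <string.h>
#include <uchar.h>    /* c32rtomb, char32_t, mbstate_t */

/* Hypothetical string-level wrapper (NOT a standard function). Converts up
   to `count` code points from src into dst, which holds dst_size bytes.
   Returns the number of bytes written, or (size_t)-1 if a code point
   cannot be encoded. *converted receives the number of code points that
   were fully converted. No terminator is appended. */
size_t my_c32stombs(char *dst, size_t dst_size,
                    const char32_t *src, size_t count,
                    mbstate_t *state, size_t *converted)
{
    assert(dst != NULL && src != NULL && state != NULL && converted != NULL);

    size_t written = 0;
    size_t i = 0;

    for (; i < count; ++i) {
        char tmp[MB_LEN_MAX];
        mbstate_t saved = *state;             /* so we can undo a partial step */
        size_t n = c32rtomb(tmp, src[i], state);
        if (n == (size_t)-1) {
            *converted = i;
            return (size_t)-1;                /* unencodable code point */
        }
        if (n > dst_size - written) {
            *state = saved;                   /* out of room: roll back and stop */
            break;
        }
        memcpy(dst + written, tmp, n);
        written += n;
    }

    *converted = i;
    return written;
}
```

A caller passes a zero-initialized mbstate_t, then checks *converted
against count to find out whether the output buffer filled up before the
input ran out.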
Maybe it's just a matter of being loud and vocal enough to the Committee
and its representatives to have it put in?
Received on 2019-05-04 01:45:00