Re: [SG16-Unicode] Convert between std::u8string and std::string

From: JeanHeyd Meneide <phdofthehouse_at_[hidden]>
Date: Fri, 3 May 2019 19:44:48 -0400
I have not written a conversion for this per se. I have instead used the
char32_t functions (mbrtoc32 and c32rtomb) to round-trip the conversion
through Unicode code points.
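
A rough sketch of that round trip, in case it helps. It assumes a hosted
implementation with <cuchar>, and that the narrow strings are in the
encoding of the current C locale. The helper names narrow_to_code_points
and code_points_to_narrow are made up for illustration; this is not code
from any proposal.

```cpp
#include <climits>    // MB_LEN_MAX
#include <cstddef>    // std::size_t
#include <cuchar>     // std::mbrtoc32, std::c32rtomb
#include <cwchar>     // std::mbstate_t
#include <stdexcept>
#include <string>

// Hypothetical helper: decode a locale-encoded narrow string into code points.
std::u32string narrow_to_code_points(const std::string& in) {
    std::u32string out;
    std::mbstate_t state{};  // zero-initialized == initial conversion state
    const char* p = in.data();
    std::size_t left = in.size();
    while (left > 0) {
        char32_t cp = 0;
        std::size_t n = std::mbrtoc32(&cp, p, left, &state);
        if (n == static_cast<std::size_t>(-1) || n == static_cast<std::size_t>(-2))
            throw std::runtime_error("invalid or truncated multibyte sequence");
        if (n == static_cast<std::size_t>(-3)) {
            out.push_back(cp);  // output from earlier input; nothing consumed
            continue;
        }
        if (n == 0) n = 1;      // assumption: an embedded null occupies one byte
        out.push_back(cp);
        p += n;
        left -= n;
    }
    return out;
}

// Hypothetical helper: encode code points back into the locale's narrow encoding.
std::string code_points_to_narrow(const std::u32string& in) {
    std::string out;
    std::mbstate_t state{};
    char buf[MB_LEN_MAX];
    for (char32_t cp : in) {
        std::size_t n = std::c32rtomb(buf, cp, &state);
        if (n == static_cast<std::size_t>(-1))
            throw std::runtime_error("code point not representable in this locale");
        out.append(buf, n);
    }
    return out;
}
```

Going from the code points to a std::u8string (or back) is then a plain
UTF-8 encode or decode step that does not depend on the locale. Note that
this processes one code point per call, which is exactly the performance
problem discussed further down.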

Note that c8rtomb is actually under-specified in the current C and C++
standards. That is what the fix for DR 488, Philipp K. Krause's N2040 (
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2040.htm), applied to
C2x, was for, although I forget whether it was also applied to the
c32rtomb functions.

In the case that nothing is stored, treat the return value of 0 as a marker
that the current code unit is valid, that the mbstate_t has been modified
because you may be in the middle of a multi-byte sequence, and that you
need to feed more input into c8rtomb with the same mbstate_t.

With a return value of 0, you can sanity-check the implementation by
calling mbsinit(&my_mb_state) and checking that it does NOT report the
initial conversion state (the mbstate_t object should have been modified,
since it should be storing part of the accumulated multi-byte sequence).
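
The check above can be sketched as a small classifier over the return
value and the state. StepStatus and classify_step are hypothetical names
for illustration; the only library call used is std::mbsinit, which
returns nonzero when the state is the initial conversion state.

```cpp
#include <cstddef>  // std::size_t
#include <cwchar>   // std::mbstate_t, std::mbsinit

// Hypothetical classification of one cNrtomb-style call.
enum class StepStatus {
    Error,      // the call returned (size_t)-1
    Stored,     // one or more bytes were written to the output buffer
    Pending,    // nothing written, state holds a partial sequence: feed more input
    Suspicious  // nothing written, yet the state still looks initial
};

// ret is the value returned by the conversion call; state is the mbstate_t
// that was passed to that same call.
StepStatus classify_step(std::size_t ret, const std::mbstate_t& state) {
    if (ret == static_cast<std::size_t>(-1)) return StepStatus::Error;
    if (ret != 0) return StepStatus::Stored;
    // ret == 0: the state should no longer be the initial state.
    return std::mbsinit(&state) != 0 ? StepStatus::Suspicious
                                     : StepStatus::Pending;
}
```

After each call of the form n = c8rtomb(buf, c8, &my_mb_state), passing n
and my_mb_state to classify_step tells you whether to append bytes, keep
feeding code units, or distrust the implementation.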

To be honest with you, the whole situation is a bit awful and, what's
worse, there are no string versions of any of these functions for fast,
efficient processing (c8srtombs/mbsrtoc8s, c16srtombs/mbsrtoc16s,
c32srtombs/mbsrtoc32s): they are just straight up missing. The latter two
pairs in that list are being fixed by Philipp K. Krause's N2282 (
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2282.htm). You should
write to your C and/or C++ representatives in your country (or, really,
anyone who's listening) and tell them that we need these for fast,
competitive implementations that hope to hold a candle to the proper
Unicode conversion utilities employed around the world. (One of the
pushbacks surrounding that paper is "waiting for implementation experience
and feedback", I think?) I don't know how Tom feels about jumping the gun
and writing c8srtombs/mbsrtoc8s for the C++ standard before its friends (
c16srtombs/mbsrtoc16s, c32srtombs/mbsrtoc32s) are accepted into the C
standard, but I would highly encourage us to do that, because one-by-one
code point processing is a mistake if you want efficiency. In days gone
by, the C Committee added mbsrtowcs and other multiple-code-point functions
to the C standard for exactly this reason. Why the C standard is now about
to wait and make the same mistake again is something I do not quite
understand.

Maybe it's just a matter of being loud and vocal enough to the Committee
and its representatives to have it put in?

Received on 2019-05-04 01:45:00