SG16

Subject: Re: [SG16-Unicode] Convert between std::u8string and std::string
From: Tom Honermann (tom_at_[hidden])
Date: 2019-05-05 22:32:56


On 5/3/19 7:44 PM, JeanHeyd Meneide wrote:
> Note that c8rtomb is actually under-specified in the current C and C++
> standards: that is what DR 488 was for, fixed by Philipp K. Krause's
> N2040 (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2040.htm) and
> applied to C2x, although I forget whether the fix was also applied to
> the c32rtomb functions.

Well, c8rtomb is definitely under-specified in current C standards since
it isn't defined there at all :)

When drafting the wording for c8rtomb for C++, I did incorporate updates
from N2040.  P0482R6 contains the following note:

> /Drafting note: The wording for mbrtoc8 and c8rtomb is derived from
> wording for mbrtoc16 and c16rtomb in C18 (WG14 N2176
> <http://www.open-std.org/jtc1/sc22/wg14/www/abq/c17_updated_proposed_fdis.pdf>),
> augmented by changes suggested in WG14 N2040
> <http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2040.htm> for WG14
> DR488
> <http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2059.htm#dr_488> to
> properly account for UTF-8 being a variable length encoding, and
> lightly edited for formatting style. The author was reluctant to stray
> from the existing C wording for related functions despite a belief
> that considerable improvements to the wording would be possible. /

With regard to:

>
> In the case that nothing is stored, use the return value of 0 as a
> marker that the current character is valid but the mbstate has been
> modified and that you may be working with a multi-byte sequence, and
> that you need to feed more input into c8rtomb with the same mbstate_t.

I think this is consistent with the current wording, though the wording
is not explicit about this case.

>
> With a return value of 0, you can sanity-check the implementation by
> doing mbsinit(&my_mb_state) and checking if it does NOT return the "I
> am still in the initial stateless sequence" value after claiming a
> return value of 0 (the mbstate_t object should be modified since it
> should be storing part of the accumulated multi-byte sequence).
>
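
To make the "return 0, state modified" behaviour concrete, here is a
minimal toy model of it. It is NOT the standard c8rtomb: ToyState,
toy_is_initial and toy_c8rtomb are hypothetical names standing in for
mbstate_t, mbsinit and c8rtomb, the output encoding is assumed to be
UTF-8, and validation is simplified (overlong forms and surrogates are
not rejected).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for mbstate_t: the partially accumulated sequence.
struct ToyState {
    unsigned char buf[4] = {};  // UTF-8 code units gathered so far
    int have = 0;               // number of code units stored in buf
    int need = 0;               // continuation code units still expected
};

// Hypothetical analogue of mbsinit(): true when no sequence is in progress.
bool toy_is_initial(const ToyState& st) { return st.need == 0; }

// Hypothetical analogue of c8rtomb(): consumes one UTF-8 code unit.
// Returns the number of bytes written to out, 0 if more input is needed
// (nothing is written, but the state is updated), or (size_t)-1 on error.
std::size_t toy_c8rtomb(char* out, unsigned char c8, ToyState& st) {
    if (st.need == 0) {  // expecting a lead code unit
        if (c8 < 0x80) { out[0] = static_cast<char>(c8); return 1; }
        if (c8 >= 0xC2 && c8 <= 0xDF) st.need = 1;
        else if (c8 >= 0xE0 && c8 <= 0xEF) st.need = 2;
        else if (c8 >= 0xF0 && c8 <= 0xF4) st.need = 3;
        else return static_cast<std::size_t>(-1);
        st.buf[0] = c8;
        st.have = 1;
        return 0;  // valid so far, nothing stored in out
    }
    if ((c8 & 0xC0) != 0x80) {  // not a continuation code unit
        st = ToyState{};
        return static_cast<std::size_t>(-1);
    }
    st.buf[st.have++] = c8;
    if (--st.need > 0) return 0;  // still mid-sequence
    std::size_t n = static_cast<std::size_t>(st.have);
    for (std::size_t i = 0; i < n; ++i) out[i] = static_cast<char>(st.buf[i]);
    st = ToyState{};  // sequence complete: back to the initial state
    return n;
}
```

Feeding the three code units of U+20AC (0xE2 0x82 0xAC) returns 0, 0,
then 3, and toy_is_initial reports false after each 0 and true after
the 3, which is the sanity check described in the quoted text.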
> To be honest with you, the whole situation is a bit awful and --
> what's worse -- there are no string versions of any of these
> functions for fast, efficient processing (c8srtombs/mbsrtoc8s,
> c16srtombs/mbsrtoc16s, c32srtombs/mbsrtoc32s): they are just straight
> up missing. The latter 2 in that list are being fixed by Philipp K.
> Krause's N2282
> (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2282.htm) -- you
> should write to your C and/or C++ representatives in your country (or,
> really, anyone who's listening) and tell them that we need these for
> fast, competitive implementations that hope to hold a candle to proper
> Unicode conversion utilities employed around the world. (One of the
> points of pushback surrounding that paper is "waiting for
> implementation experience and feedback", I think?) I don't know how
> Tom feels about
> jumping the gun and writing c8srtombs/mbsrtoc8s for the C++ standard
> before its friends ( c16srtombs/mbsrtoc16s, c32srtombs/mbsrtoc32s) are
> accepted into the C standard, but I would highly encourage that to be
> a thing we do because one-by-one code point processing is a mistake
> for efficient processing. In days gone by, the C Committee added
> mbsrtowcs and other multiple-code-point functions to the C standard
> for a reason (this reason); why the C standard is about to wait on it
> to make the same mistake is something I do not quite understand.
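
For illustration, this is roughly what callers have to write today in
the absence of a string-at-a-time function: a loop over the existing
code-unit-at-a-time c16rtomb. The helper name utf16_to_multibyte is
hypothetical (it is not a proposed or standard function), and the
result depends on the current C locale; only ASCII input is certain to
convert in the default "C" locale.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cuchar>
#include <cwchar>
#include <string>

// Hypothetical helper: converts a UTF-16 string to the locale's multibyte
// encoding one code unit at a time. Returns false on a conversion error or
// if the input ends in the middle of a surrogate pair.
bool utf16_to_multibyte(const std::u16string& in, std::string& out) {
    std::mbstate_t st{};   // initial conversion state
    char buf[MB_LEN_MAX];  // large enough for any single multibyte character
    out.clear();
    for (char16_t c : in) {
        std::size_t n = std::c16rtomb(buf, c, &st);
        if (n == static_cast<std::size_t>(-1)) return false;  // not representable
        out.append(buf, n);  // n may be 0 while a surrogate pair is pending
    }
    return std::mbsinit(&st) != 0;  // false if a sequence was left incomplete
}
```

Every code unit costs a separate library call and a state check, which
is the per-character overhead a string-level interface could avoid.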

Philipp, do you perhaps know the history of how C came to have the UTF
code-unit-at-a-time conversion functions (e.g., c16rtomb(), mbrtoc16()),
but not the UTF string-at-a-time analogs of mbsrtowcs() and wcsrtombs()?

Tom.

>
> Maybe it's just a matter of being loud and vocal enough to the
> Committee and its representatives to have it put in?
>



SG16 list run by herb.sutter at gmail.com