C++ Logo

sg16

Advanced search

Re: Proposal for low-level Unicode decoding and encoding

From: Tom Honermann <tom_at_[hidden]>
Date: Fri, 7 Oct 2022 16:43:28 -0400
Hi, Dimitrij.

To expand a bit on Peter's response; see the ztd.text API reference
documentation <https://ztdtext.readthedocs.io/en/latest/api.html> and,
in particular, the descriptions of the basic_text_view
<https://ztdtext.readthedocs.io/en/latest/api/views/basic_text_view.html>,
decode_view
<https://ztdtext.readthedocs.io/en/latest/api/views/decode_view.html>,
encode_view
<https://ztdtext.readthedocs.io/en/latest/api/views/encode_view.html>,
and transcode_view
<https://ztdtext.readthedocs.io/en/latest/api/views/transcode_view.html>
class templates.

You may also find these aging proposals of interest:

  * P1629: Standard Text Encoding <https://wg21.link/p1629>
  * P0244: Text_view: A C++ concepts and range based character encoding
    and code point enumeration library <https://wg21.link/p0244>
      o See https://github.com/tahonermann/text_view for an
        implementation of this proposal.

These works all seek to build encoding support on top of the iterator
and range abstractions that comprise and underlie much of the C++
standard library.

Though there are similarities between JeanHeyd's ztd.text work and my
P0244/text_view work, they are independent works. JeanHeyd and I both
started independently playing with these ideas around the same time and
came to some very similar conclusions. JeanHeyd also has given talks on
some of his work. See https://www.youtube.com/watch?v=BdUipluIf1E for
example; I highly recommend watching this talk.

Tom.

On 10/7/22 3:03 AM, Peter Brett via SG16 wrote:
>
> Hi Dimitrij,
>
>
> Thank you for sending your proposal to SG16.
>
> There is a proposed, similar facility currently being proposed for the
> C standard library through the WG14 process. See
> https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2595.pdf
>
> Are there any of the use-cases you envisage for your proposed library
> interfaces that would not be addressed by this C library?
>
> In addition, there is a prototype library for higher-level transcoding
> facilities being developed in the expectation that it will at some
> point become a C++ standard interface.
> https://ztdtext.readthedocs.io/en/latest/index.html This likewise does
> not attempt to add any container types and instead provide a generic
> and flexible facility.
>
> What advantages/disadvantages would your proposed interface have
> relative to ztd.text?
>
> Best regards,
>
> Peter
>
> *From:*SG16 <sg16-bounces_at_[hidden]> *On Behalf Of *Dimitrij
> Mijoski via SG16
> *Sent:* 06 October 2022 22:13
> *To:* SG16 <sg16_at_[hidden]>
> *Cc:* Dimitrij Mijoski <dim.mj.p_at_[hidden]>
> *Subject:* [SG16] Proposal for low-level Unicode decoding and encoding
>
> (This proposal is also available on Github
> https://github.com/dimztimz/unicode-proposal
> <https://urldefense.com/v3/__https:/github.com/dimztimz/unicode-proposal__;!!EHscmS1ygiU1lA!GwhoLGdbeSzHb3A93Ki08g8ETLD2tzi5Aa5MMlRgYYXVhwilGQSHeAZNW1NAK0GcKQc1O_6rAZg35_o$>
> if it is hard to read from the email).
>
> I propose that in the library we should standardize low-level
> facilities for transformations between Unicode encodings. Low-level
> means that the proposal generally does not introduce new types. It
> mostly contains functions and function templates that only work with
> existing ranges, strings, iterators and indexes. It is only for UTF-8,
> UTF-16 and UTF-32 and nothing else.
>
> My opinion is that this should be done before standardizing high-level
> facilities (that introduce new types) like |std::text| or Unicode
> iterators.
>
> The functions will fall in four categories:
>
> 1. *Functions that decode/encode only one code point.* ICU has this
> functionality in utf8.h
> <https://urldefense.com/v3/__https:/unicode-org.github.io/icu-docs/apidoc/released/icu4c/utf8_8h.html__;!!EHscmS1ygiU1lA!GwhoLGdbeSzHb3A93Ki08g8ETLD2tzi5Aa5MMlRgYYXVhwilGQSHeAZNW1NAK0GcKQc1O_6rrUnRf0U$>
> and utf16.h
> <https://urldefense.com/v3/__https:/unicode-org.github.io/icu-docs/apidoc/released/icu4c/utf16_8h.html__;!!EHscmS1ygiU1lA!GwhoLGdbeSzHb3A93Ki08g8ETLD2tzi5Aa5MMlRgYYXVhwilGQSHeAZNW1NAK0GcKQc1O_6rvrU3vYs$>.
> but it is implemented as macros that are not type-safe.
> 2. *Functions that take the full input and write to a fixed-size
> range.* ICU has a lot of such functions. Some examples are in
> header ustring.h
> <https://urldefense.com/v3/__https:/unicode-org.github.io/icu-docs/apidoc/released/icu4c/ustring_8h.html__;!!EHscmS1ygiU1lA!GwhoLGdbeSzHb3A93Ki08g8ETLD2tzi5Aa5MMlRgYYXVhwilGQSHeAZNW1NAK0GcKQc1O_6rVQNOyTA$>,
> for example u_strToUTF8
> <https://urldefense.com/v3/__https:/unicode-org.github.io/icu-docs/apidoc/released/icu4c/ustring_8h.html*a69430352fe5439927f48b98b209939d7__;Iw!!EHscmS1ygiU1lA!GwhoLGdbeSzHb3A93Ki08g8ETLD2tzi5Aa5MMlRgYYXVhwilGQSHeAZNW1NAK0GcKQc1O_6r7xCDVTQ$>
> or u_strFromUTF8
> <https://urldefense.com/v3/__https:/unicode-org.github.io/icu-docs/apidoc/released/icu4c/ustring_8h.html*a706408344cfd478cac2a7413c954aba6__;Iw!!EHscmS1ygiU1lA!GwhoLGdbeSzHb3A93Ki08g8ETLD2tzi5Aa5MMlRgYYXVhwilGQSHeAZNW1NAK0GcKQc1O_6rVeK87N0$>.
> If the output does not fit in the given output range they return
> error and return the size needed for the complete output.
> 3. *Functions that take the full input and write to a resizable
> range.* Example in ICU is icu::UnicodeString::fromUTF8
> <https://urldefense.com/v3/__https:/unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1UnicodeString.html*a71c230712cdace1eefe4b2497e964788__;Iw!!EHscmS1ygiU1lA!GwhoLGdbeSzHb3A93Ki08g8ETLD2tzi5Aa5MMlRgYYXVhwilGQSHeAZNW1NAK0GcKQc1O_6rmVCEEDc$>.
> 4. *Functions that take input in chunks and write in chunks.* They
> keep a state between calls. Examples are ucnv_fromUnicode
> <https://urldefense.com/v3/__https:/unicode-org.github.io/icu-docs/apidoc/released/icu4c/ucnv_8h.html*aa820d3bc3942522eb31bdb5b8ae73727__;Iw!!EHscmS1ygiU1lA!GwhoLGdbeSzHb3A93Ki08g8ETLD2tzi5Aa5MMlRgYYXVhwilGQSHeAZNW1NAK0GcKQc1O_6rCYMFW1c$>
> and ucnv_toUnicode
> <https://urldefense.com/v3/__https:/unicode-org.github.io/icu-docs/apidoc/released/icu4c/ucnv_8h.html*a9451f05be7b1b75832d5ec55b4e6d67f__;Iw!!EHscmS1ygiU1lA!GwhoLGdbeSzHb3A93Ki08g8ETLD2tzi5Aa5MMlRgYYXVhwilGQSHeAZNW1NAK0GcKQc1O_6r7haV8kM$>.
> |std::codecvt| is also such an API, but it is hard to use. The
> problem with |codecvt| comes from the return value |partial|. If
> that is returned, one has to do additional complicated checks to
> see if the input has an incomplete sequence at the end of the
> chunk or the output chunk has no more space. Additionally,
> |codevt| may or may not save the few incomplete input bytes at the
> end into the state, the standard makes no guarantees.
>
> I will now focus on the first group of functions.
>
>
> Functions that decode/encode only one code point
>
> This group contains functions that do the following operations:
>
> * decode the next code point
> * decode the previous code point
> * move the index/iterator to the next code point without decoding it
> * move the index/iterator to the previous code point without decoding it
> * adjust index/iterator to point the start of the code point
> * encode a code point into some range
> * truncate the end of a range (move end iterator) if it contains an
> incomplete sequence of the last code point (incomplete means the
> sequence seen so far is valid, the rest if its bytes maybe are in
> a different chunk).
>
> See ICU header utf8.h
> <https://urldefense.com/v3/__https:/unicode-org.github.io/icu-docs/apidoc/released/icu4c/utf8_8h.html__;!!EHscmS1ygiU1lA!GwhoLGdbeSzHb3A93Ki08g8ETLD2tzi5Aa5MMlRgYYXVhwilGQSHeAZNW1NAK0GcKQc1O_6rrUnRf0U$>
> for all details.
>
> Almost all operations come in two flavors, one that can accept invalid
> sequences and another that must have valid sequences. We will consider
> these flavors as separate operations.
>
> For each operation there should be at least 2 functions (or function
> templates).
>
> 1. One that takes a string (or more generally, a range) and index and
> returns a new index.
> 2. One that takes a pair of iterators and returns an updated iterator.
>
> I will now show a few examples how such functions can be defined and
> implemented using ICU.
>
> #include <string>
> #include <string_view>
> #include <tuple>
> #include <unicode/utf8.h>
> using namespace std;
> // In the bellow functions consider Str, Index, Iter as template type parameters
> using Str = string;
> using Index = Str::size_type;
> using Iter = Str::const_iterator;
> using OutIt = Str::iterator;
> /* =========== OPERATION: U8_NEXT =============== */
> constexpr char32_t cp_error = -1;
> auto u8_advance_i(const Str& s, Index& i) -> char32_t
> {
> char32_t cp;
> auto sz = size(s);
> U8_NEXT(s, i, sz, cp);
> return cp;
> }
> auto u8_next_i(const Str& s, Index i) -> std::pair<Index, char32_t>
> {
> auto cp = u8_advance_i(s, i);
> return {i, cp};
> }
> auto u8_advance_it(Iter& first, Iter last) -> char32_t
> {
> auto i = Iter::difference_type();
> auto sz = distance(first, last);
> char32_t cp;
> U8_NEXT(first, i, sz, cp);
> advance(first, i);
> return cp;
> }
> auto u8_next_it(Iter first, Iter last) -> std::pair<Iter, char32_t>
> {
> auto cpe = u8_advance_it(first, last);
> return {first, cpe};
> }
> /* =========== OPERATION: U8_NEXT_UNSAFE =============== */
> auto valid_u8_advance_i(const Str& s, Index& i) -> char32_t
> {
> char32_t c;
> U8_NEXT_UNSAFE(s, i, c);
> return c;
> }
> auto valid_u8_next_i(const Str& s, Index i) -> std::pair<Index, char32_t>
> {
> auto c = valid_u8_advance_i(s, i);
> return {i, c};
> }
> auto valid_u8_advance_it(Iter& first) -> char32_t
> {
> auto i = Iter::difference_type();
> char32_t c;
> U8_NEXT_UNSAFE(first, i, c);
> return c;
> }
> auto valid_u8_next_it(Iter first) -> std::pair<Iter, char32_t>
> {
> auto c = valid_u8_advance_it(first);
> return {first, c};
> }
> /* =========== OPERATION: U8_APPEND =============== */
> // starts writing at s[i]. Checks for space except for the first element s[i],
> // i < size(s) is precondition.
> auto encode_advance_u8(char32_t CP, Str& s, Index& i) -> bool
> {
> auto sz = size(s);
> auto err = false;
> U8_APPEND(s, i, sz, CP, err);
> return !err;
> }
> auto encode_u8(char32_t CP, Str& s, Index i) -> std::pair<Index, bool>
> {
> auto ok = encode_advance_u8(CP, s, i);
> return {i, ok};
> }
> // Writes to out, can check for space except for the first element i.e.
> // out < last is precondition.
> auto encode_u8(char32_t CP, OutIt out, OutIt last) -> std::pair<OutIt, bool>
> {
> auto i = OutIt::difference_type();
> auto sz = distance(out, last);
> auto err = false;
> U8_APPEND(out, i, sz, CP, err);
> return {out + i, !err};
> }
> // writes to out, can not check for space
> template <class OutIt>
> auto encode_u8(char32_t CP, OutIt out) -> std::pair<OutIt, bool>
> {
> auto i = typename iterator_traits<OutIt>::difference_type();
> auto sz = i + 4;
> auto err = false;
> U8_APPEND(out, i, sz, CP, err);
> return {out + i, !err};
> }
> /* =========== OPERATION: U8_APPEND_UNSAFE =============== */
> auto encode_valid_cp_u8(char32_t CP, Str& s, Index i) -> Index
> {
> U8_APPEND_UNSAFE(s, i, CP);
> return i;
> }
> template <class OutIt>
> auto encode_valid_cp_u8(char32_t CP, OutIt out) -> OutIt
> {
> auto i = typename iterator_traits<OutIt>::difference_type();
> U8_APPEND_UNSAFE(out, i, CP);
> return out + i;
> }
> /* =========== USAGE EXAMPLES =============== */
> void u8_next_usage(Str& s)
> {
> // u8_advance_i, index is inout parameter
> for (size_t i = 0; i != size(s);) {
> auto cp = u8_advance_i(s, i);
> // process cp
> }
> for (size_t i = 0; i != size(s);) {
> auto j = i;
> auto cp = u8_advance_i(s, j);
> auto cp_size = j - i;
> // process cp
> i = j;
> }
> for (size_t j = 0; j != size(s);) {
> auto i = j;
> auto cp = u8_advance_i(s, j);
> auto cp_size = j - i;
> // process cp
> }
> for (size_t i = 0, j = 0; i != size(s); i = j) {
> auto cp = u8_advance_i(s, j);
> auto cp_size = j - i;
> // process cp
> }
> // u8_next_i, index in return value
> for (size_t i = 0; i != size(s);) {
> char32_t cp;
> std::tie(i, cp) = u8_next_i(s, i);
> // process cp
> }
> for (size_t i = 0; i != size(s);) {
> auto [j, cp] = u8_next_i(s, i);
> auto cp_size = j - i;
> // process cp
> i = j;
> }
> for (size_t i = 0, j = 0; i != size(s); i = j) {
> auto [jj, cp] = u8_next_i(s, i);
> j = jj;
> auto cp_size = j - i;
> // process cp
> }
> for (size_t i = 0, j = 0; i != size(s); i = j) {
> char32_t cp;
> std::tie(j, cp) = u8_next_i(s, i);
> auto cp_size = j - i;
> // process cp
> }
> }
> auto find_cp_faster(const string& s, char32_t cp) -> size_t
> {
> char enc_cp[4];
> auto [it, ok] = encode_u8(cp, enc_cp);
> if (!ok)
> return s.npos;
> auto sv = string_view(enc_cp, it - enc_cp);
> return s.find(sv);
> }
> auto find_cp_slower(const string& s, char32_t cp) -> size_t
> {
> for (size_t i = 0, j = 0; i != size(s); i = j) {
> auto [jj, dec_cp] = u8_next_i(s, i);
> j = jj;
> if (dec_cp == cp_error)
> continue;
> else if (dec_cp == cp)
> return i;
> }
> return s.npos;
> }
>
> Looking at the above functions, I can ask few questions:
>
> 1. How should the error be reported in |u8_next|? These are the
> options I have considered:
>
> 1. Use |int32_t| as a type for code point in |u8_next| and return
> a negative value, exactly the same as ICU.
> 2. Use |char32_t| as a type for code point in |u8_next| and
> return unspecified high value above the Unicode range, i.e.
> casting the negative int from ICU to the unsigned |char32_t|.
> 3. Use a separate variable with its own type (some enum like
> |std::errc|).
> 4. Abstract the error in lightweight "sum" type
> |code_point_or_error| that resembles |std::expected| or
> |std::optional| but that can be implemented with just a single
> |char32_t| or with |int32_t|. |std::expected| and
> |std::optional| must use additional bool internally.
> 5. Use a single sentinel value for error, signed -1 or unsigned
> 0xFFFFFFFF. I noticed the exact implementation of |U8_NEXT| in
> ICU and it relies on |U8_INTERNAL_NEXT_OR_SUB| and uses
> exactly -1 to signal error, and not any other negative value.
> The sentinel can be even a special type with an overloaded
> |operator==|.
> 6. Use a slightly more type-safe variant of number 5, define an
> enum as a strong typedef for |char32_t|.
>
> 7.enum class code_point_or_error: char32_t {
> 8. error = char32_t(-1)
> };
>
> 2. For now I decided for the solution number 5, to use sentinel
> value. It goes with the idea of being low-level and not
> introducing new types.
> 3. Should the size of the encoded sequence of the code point be
> returned? Probably not. In the examples above I just subtract
> indexes to get that information.
> 4. Should we use return values or out-parameters? For the code point
> and error definitely use return value, but for the index or
> iterator both APIs make sense. Those parameters are actually
> in-out not just out. The standard library rarely uses
> out-parameters. Of all the algorithms with iterators, only
> |std::advance()| uses that.
> 5. Should the functions with string and index be generic or should
> they work only with |string_view|? Probably they should be generic
> as there are multiple character types that can be used for one
> encoding, e.g. |char| and |char8_t| for UTF-8. Then we can ask
> should they be generic for any range or just for
> |basic_string_view|? My vote is fully generic, as that way they
> will work with plain old arrays with known size too.
>
>

Received on 2022-10-07 20:43:33