Re: Considerations for Unicode algorithms

From: Corentin <corentin.jabot_at_[hidden]>
Date: Tue, 31 Jan 2023 19:38:28 +0100
On Tue, Jan 31, 2023 at 7:01 PM Zach Laine <whatwasthataddress_at_[hidden]> wrote:

> Even though we discussed this offline, I'll repeat it here for everyone
> else.
> The idea is that users should not be required to copy data to make the
> data conform to a particular type in the interface. For instance,
> making char-users copy their data to char8_t, or making char8_t users
> copy data to char. Since the algorithm does not care what the input
> type is, that should be reflected in the interface; it should accept
> any 8-bit integral type.

Fortunately, a "treat this as char8_t" function, whether it is a magic
core-sanctioned reinterpret_cast-like feature or a simple view that just
runs bit_cast over its input, gets optimized away: by construction in the
first case, or as a matter of QOI in the second.

> The right question is not "Who cares about vector<byte>?" The right
> question is "If the algorithm can process iterators from vector<byte>
> with no code change to the algorithm, why shouldn't it?" That is, why
> should that weirdo using vector<byte> have to copy her weird data?

Because forcing users to confirm intent, especially for something that
requires more domain-specific knowledge or confidence than people tend to
have, reduces the opportunities for bugs.

This is especially true on platforms where UTF-8 is not the default, and on
which codepoint_view(char*) is likely to produce mojibake, fail, or
otherwise misbehave, leaving users confused (if only because it will work
on some platforms/environments but not others).
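
A small sketch of that failure mode: the same text, "été", is the bytes C3 A9 74 C3 A9 in UTF-8 but E9 74 E9 in Latin-1, and the Latin-1 bytes are not a well-formed UTF-8 sequence. The function looks_like_utf8 below is hypothetical and only checks lead and continuation byte structure; it does not reject overlong forms, surrogates, or values above U+10FFFF.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical structural check: every lead byte must be followed by the
// number of continuation bytes (10xxxxxx) that it announces.
constexpr bool looks_like_utf8(std::string_view s) {
    for (std::size_t i = 0; i < s.size();) {
        const auto b = static_cast<unsigned char>(s[i]);
        const std::size_t len = b < 0x80          ? 1   // 0xxxxxxx
                              : (b >> 5) == 0x06  ? 2   // 110xxxxx
                              : (b >> 4) == 0x0E  ? 3   // 1110xxxx
                              : (b >> 3) == 0x1E  ? 4   // 11110xxx
                                                  : 0;  // stray or invalid lead
        if (len == 0 || i + len > s.size()) return false;
        for (std::size_t k = 1; k < len; ++k) {
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80)
                return false;
        }
        i += len;
    }
    return true;
}
```

Here the byte E9 reads as the lead of a three-byte sequence, but the next byte, 't', is not a continuation byte, so a decoder handed the Latin-1 char* has to fail or substitute a replacement character.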

Just because we can accept anything (I agree with that) does not mean we

> Zach
> On Tue, Jan 31, 2023 at 2:25 AM Peter Brett <pbrett_at_[hidden]> wrote:
> >
> > Hi Zach,
> >
> > Doesn't this add a lot of complexity? I really would like to understand
> > the rationale/motivation for this level of generality, with some examples
> > of code that is significantly improved by them.
> >
> > For example, I am struggling to envisage a situation in which I'd find
> > it useful to do sentence break iteration on a std::vector<byte> without any
> > intermediate decoding step.
> >
> > Best regards,
> >
> > Peter
> >
> > -----Original Message-----
> > From: SG16 <sg16-bounces_at_[hidden]> On Behalf Of Zach Laine via SG16
> > Sent: 30 January 2023 21:33
> > To: Corentin <corentin.jabot_at_[hidden]>
> > Cc: Zach Laine <whatwasthataddress_at_[hidden]>; SG16 <sg16_at_[hidden]>
> > Subject: Re: [SG16] Considerations for Unicode algorithms
> >
> > Also, I think the algorithms should be generic. They should not work
> > only with char32_t, or only with int, etc. Users should be free to
> > use char8_t, char, unsigned char, etc., for UTF-8. std::byte if
> > you're nasty.
> >

Received on 2023-01-31 18:38:41