On Tue, Jan 31, 2023 at 7:01 PM Zach Laine <whatwasthataddress@gmail.com> wrote:
Even though we discussed this offline, I'll repeat it here for everyone else.

The idea is that users should not be required to copy data to make the
data conform to a particular type in the interface.  For instance,
making char-users copy their data to char8_t, or making char8_t users
copy data to char.  Since the algorithm does not care what the input
type is, that should be reflected in the interface; it should accept
any 8-bit integral type.

Agreed.
Fortunately, a "treat this as char8_t" function, whether it is a magic core-sanctioned reinterpret_cast-like feature
or a simple view that just runs bit_cast over its input, gets optimized away: by construction in the first case, or as QOI in the second.
 
The right question is not "Who cares about vector<byte>?"  The right
question is "If the algorithm can process iterators from vector<byte>
with no code change to the algorithm, why shouldn't it?"  That is, why
should that weirdo using vector<byte> have to copy her weird data?

Because forcing users to confirm intent, especially for something that requires more domain-specific
knowledge or confidence than people usually have, reduces the opportunities for bugs.

This is especially true on platforms where UTF-8 is not the default, and on which codepoint_view(char*) is likely to produce mojibake, fail, or otherwise misbehave,
and to leave users confused (if only because it will work on some platforms/environments but not others).

Just because we can accept anything (I agree with that) does not mean we should!
 
Zach

On Tue, Jan 31, 2023 at 2:25 AM Peter Brett <pbrett@cadence.com> wrote:
>
> Hi Zach,
>
> Doesn't this add a lot of complexity?  I really would like to understand the rationale/motivation for this level of generality, with some examples of code that is significantly improved by them.
>
> For example, I am struggling to envisage a situation in which I'd find it useful to do sentence break iteration on a std::vector<byte> without any intermediate decoding step.
>
> Best regards,
>
>                  Peter
>
> -----Original Message-----
> From: SG16 <sg16-bounces@lists.isocpp.org> On Behalf Of Zach Laine via SG16
> Sent: 30 January 2023 21:33
> To: Corentin <corentin.jabot@gmail.com>
> Cc: Zach Laine <whatwasthataddress@gmail.com>; SG16 <sg16@lists.isocpp.org>
> Subject: Re: [SG16] Considerations for Unicode algorithms
>
> Also, I think the algorithms should be generic.  They should not work
> only with char32_t, or only with int, etc.  Users should be free to
> use char8_t, char, unsigned char, etc., for UTF-8.  std::byte if
> you're nasty.
>