ISOCPP sg16 List: Re: Performance requirements for Unicode views/types/algorithms

From: Peter Bindels <peterbindels_at_[hidden]>
Date: Wed, 1 Mar 2023 16:38:50 +0100

> I'd tend to answer: nobody sane chooses C++ to solve problems unless (i)
> they are forced to by a legacy codebase (ii) they need high performance.

I'm pretty sure I am not forced to use it, and I don't need high
performance (but won't reject it). I appreciate high efficiency, which is
the other side of the same coin as high performance... but that still
doesn't quite make it my reason for picking C++. And I believe I am not
insane. That last one is probably the most disputable in the list, but even
so.

C++ is not just a language that we use because it's fast or because it is
forced on us. It's a language that does a few things right that no other
language I know of does. It gives you *all of* control, abstraction and
portability at once.

> This is why I find WG21 working on high level abstractions somewhat
> misses the point for much of today's C++ users.

Many high level abstractions are welcomed with open arms. I see code bases
I work on become a lot cleaner now that std::filesystem has percolated down
to old enough versions for us to use it. The same goes for lambdas and
other similar constructs: what they express was possible before, but much
harder to write.

Coroutines would do the same, if the high level abstractions for them were
available.

In the context of text processing, there are two large groups of use
cases. I would describe them as quantitative use - transforming a large
number of documents in some way, like normalizing or re-encoding - and
qualitative use - analyzing a relatively small document (<1MB of text) for
word splits, emoji combinations and other formatting operations.

The first of these wants fast, efficient operations, so that it can run
on less hardware or serve more users. The second needs working
abstractions that make it easy to get the operations right, with
performance good enough that the cost is not noticeable.
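
To make the "quantitative" case concrete, here is a minimal sketch of an
eager re-encoding step: a scalar UTF-8 to UTF-32 conversion over a whole
buffer. The function name utf8_to_utf32 and the choice to reject the whole
input on the first ill-formed sequence are assumptions made for
illustration; this is not a proposed standard interface, and it is the
plain scalar baseline, not the SIMD approach that simdutf takes.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: eagerly decode a whole UTF-8 buffer into UTF-32.
// Returns std::nullopt on the first ill-formed sequence (truncated input,
// bad continuation byte, overlong form, surrogate, or value > U+10FFFF).
std::optional<std::u32string> utf8_to_utf32(std::string_view in) {
    std::u32string out;
    out.reserve(in.size());  // upper bound: one code point per input byte

    std::size_t i = 0;
    while (i < in.size()) {
        const auto b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;      // code point being assembled
        std::size_t len = 0;  // total bytes in this sequence
        char32_t lowest = 0;  // smallest value allowed for this length

        if (b0 < 0x80) {
            cp = b0; len = 1; lowest = 0;
        } else if ((b0 & 0xE0) == 0xC0) {
            cp = b0 & 0x1F; len = 2; lowest = 0x80;
        } else if ((b0 & 0xF0) == 0xE0) {
            cp = b0 & 0x0F; len = 3; lowest = 0x800;
        } else if ((b0 & 0xF8) == 0xF0) {
            cp = b0 & 0x07; len = 4; lowest = 0x10000;
        } else {
            return std::nullopt;  // stray continuation or invalid lead byte
        }

        if (in.size() - i < len) return std::nullopt;  // truncated sequence

        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt;  // not 10xxxxxx
            cp = (cp << 6) | (b & 0x3F);
        }

        const bool overlong = cp < lowest;
        const bool surrogate = cp >= 0xD800 && cp <= 0xDFFF;
        if (overlong || surrogate || cp > 0x10FFFF) return std::nullopt;

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Because the function sees the whole buffer at once, an implementation is
free to process it in wide chunks, which is harder to arrange when code
points are pulled one at a time through an iterator.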

In the triangle of safe, fast and easy, we should sit on the safe and easy
line, and then bend it as far towards fast as it will go.

Regards,
Peter

On Wed, Mar 1, 2023 at 4:14 PM Niall Douglas via SG16 <sg16_at_[hidden]>
wrote:

> On 28/02/2023 18:43, Jens Maurer wrote:
>
> >> I really wish SIMD had better support for UTF-8, only AVX-512 enables a
> >> decent fraction of main memory bandwidth
> >> (https://github.com/simdutf/simdutf).
> >
> > Thanks for the pointer. I was looking for a comparison like that.
> >
> > So, this means we do leave 5-10x performance on the table if we
> > go for an interface that can deliver ICU-level performance (only).
> >
> > Sadness engulfs me.
> >
> >> I'd like to see as much of that
> >> performance passed through by the standard library as possible, even if
> >> it makes the API non-STL-like.
> >
> > So, it seems we need an idea how to employ SIMD with a ranges-based
> > interface, or we go for eager transcoding algorithms (possibly
> > in addition to the ranges-based ones).
>
> I like to hold up <charconv> as the right design choice we ought to use
> going forwards:
>
> - Yes atoi() can parse numbers.
>
> - Yes strtol() can parse numbers.
>
> - Yes sscanf() can parse numbers.
>
> - Yes iostreams can parse numbers.
>
> One would have thought that number parsing were a done deal with such a
> menu before us. However, none of the above were particularly fast, and
> some were downright slow. This is mainly due to historical reasons,
> especially around the required use and modification of global state.
>
> Thus <charconv> was born, and it can be orders of magnitude faster than
> any of the functions above because it was designed with the benefits of
> hindsight and an understanding of how recent CPUs work.
>
> I'd ask the same design thinking for UTF-8: a low level maximum
> performance API (ideally based on existing standard practice) and then
> there is the escape hatch out of the slower higher level APIs for those
> that need such a thing.
>
> There is always the argument that "why do we need such high performance
> given XXX?"
>
> I'd tend to answer: nobody sane chooses C++ to solve problems unless (i)
> they are forced to by a legacy codebase (ii) they need high performance.
>
> This is why I find WG21 working on high level abstractions somewhat
> misses the point for much of today's C++ users. They generally want more
> performance before they want new high level abstractions. I appreciate
> that is not a popular thing to say around WG21 folk, still ...
>
> Niall
>
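
For readers who have not used <charconv>, the shape of the API held up in
the quoted message can be sketched as follows. std::from_chars (C++17) is
locale-independent, does not allocate and does not throw; it reports
errors through the returned std::errc instead of global state. The wrapper
name parse_int and its "the whole input must be consumed" policy are
assumptions made for this illustration, not part of the standard library.

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper: parse the entire string as a base-10 int.
// Returns std::nullopt if the text is empty, is not a number, has
// trailing characters, or does not fit in an int.
std::optional<int> parse_int(std::string_view text) {
    int value = 0;
    const char* first = text.data();
    const char* last = text.data() + text.size();

    // from_chars works on a raw [first, last) character range and returns
    // the stop position plus an error code; nothing global is modified.
    const auto [ptr, ec] = std::from_chars(first, last, value);

    if (ec != std::errc{} || ptr != last) return std::nullopt;
    return value;
}
```

Note that, unlike strtol(), std::from_chars does not skip leading
whitespace and does not accept a leading '+', so " 42" is rejected by this
helper.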

Received on 2023-03-01 15:39:03