ISOCPP sg16 List: Re: Performance requirements for Unicode views/types/algorithms

From: Niall Douglas <s_sourceforge_at_[hidden]>
Date: Thu, 02 Mar 2023 14:59:16 +0000

On 01/03/2023 20:43, Thiago Macieira via SG16 wrote:
> On Tuesday, 28 February 2023 07:18:07 PST Niall Douglas via SG16 wrote:
>> I really wish SIMD had better support for UTF-8, only AVX-512 enables a
>> decent fraction of main memory bandwidth
>> (https://github.com/simdutf/simdutf)
>
> I did talk to some CPU architects about this a few years ago and our
> conclusion is that it wouldn't be worth it. The conversion was never a hot
> path in any of the content we looked at, and the instructions this would
> create would end up one of those complex beasts few people ever use because
> they're not fast for anything except the narrow use-case they were designed
> for.

I didn't mean dedicated opcodes - those are almost always a dead end,
with only a few honourable exceptions - I rather meant that AVX before
AVX-512 made working with bytes awkward due to how the instruction set
is designed. AVX-512 shows it can be done better.

> You may be one of the few who would, but you're also one of the few who
> probably remember the STTNI (STring and Text New Instructions) from SSE 4.2 -
> the PCMPxSTRx instructions[1]. You'll also note that those have never been
> extended to 256- and 512-bit. 10 years ago, I rewrote the UTF16-to-Latin1
> codec in Qt with PCMPESTRM to detect out-of-range characters[2]. About 5 years
> ago I yanked it out and replaced with a much faster PMINUW[3].

I have a bad memory of that opcode. I spent far too much time trying to
make it perform well for the problem I was solving, and I failed. It had
too much latency if I remember rightly, or at least, the way I was using
it made the CPU stall.
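
For readers who have not seen it, the PMINUW range check described in the
quoted text can be sketched roughly as below. This is not Qt's actual code:
the name fits_in_latin1 is made up for illustration, and the scalar
fallback is an assumption added so the sketch still builds where SSE4.1 is
not enabled.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE4_1__)
#include <smmintrin.h>  // SSE4.1, provides _mm_min_epu16 (PMINUW)
#endif

// Hypothetical helper: true if every UTF-16 code unit is <= 0x00FF,
// i.e. the text can be narrowed to Latin-1 without loss.
inline bool fits_in_latin1(const char16_t* src, std::size_t len) {
    std::size_t i = 0;
#if defined(__SSE4_1__)
    const __m128i limit = _mm_set1_epi16(0x00FF);
    for (; i + 8 <= len; i += 8) {
        const __m128i v =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        // Unsigned min(v, 0x00FF) equals v in a lane exactly when that
        // lane is already <= 0x00FF.
        const __m128i clamped = _mm_min_epu16(v, limit);
        const __m128i same = _mm_cmpeq_epi16(clamped, v);
        if (_mm_movemask_epi8(same) != 0xFFFF) return false;
    }
#endif
    // Scalar tail (and the whole input when SSE4.1 is unavailable).
    for (; i < len; ++i) {
        if (src[i] > 0x00FF) return false;
    }
    return true;
}
```

The point of the pattern is that one cheap min plus one compare covers
eight code units at a time, with no string-specific instruction involved.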

I remember replacing it with something which did a bunch of bitscan
rights. Those are single cycle and many of them can be interleaved
concurrently, so that was orders of magnitude faster, AND the exact same
code pattern also worked great on ARM, so it was a keeper.
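
As an illustration of the general pattern only (the original problem is
not stated above, so the task here, finding the first non-ASCII byte, is
purely an assumption), a bit-scan based loop might look something like
this. The names lowest_set_bit and first_non_ascii are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Index of the lowest set bit; x must be non-zero.
inline unsigned lowest_set_bit(std::uint64_t x) {
#if defined(__GNUC__) || defined(__clang__)
    // GCC/Clang builtin; maps to a bit-scan style instruction on x86 and ARM.
    return static_cast<unsigned>(__builtin_ctzll(x));
#else
    unsigned n = 0;
    while ((x & 1u) == 0) { x >>= 1; ++n; }
    return n;
#endif
}

// Returns the offset of the first byte >= 0x80, or len if all bytes are ASCII.
inline std::size_t first_non_ascii(const unsigned char* p, std::size_t len) {
    std::size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        // Assemble 8 bytes with byte k in bits [8k, 8k+7], independent of
        // host endianness; compilers typically turn this into one load.
        std::uint64_t w = 0;
        for (unsigned k = 0; k < 8; ++k) {
            w |= static_cast<std::uint64_t>(p[i + k]) << (8 * k);
        }
        // Keep only the top bit of each byte; any set bit marks a non-ASCII byte.
        const std::uint64_t high = w & 0x8080808080808080ULL;
        if (high != 0) return i + lowest_set_bit(high) / 8;
    }
    for (; i < len; ++i) {
        if (p[i] >= 0x80) return i;
    }
    return len;
}
```

Nothing in this is x86-specific, which is the property that lets the same
code pattern carry over to ARM.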

And that fits the historical trend for forty years now: special-purpose
CPU opcodes generally become slower than the alternatives within two or
three CPU generations. So those SSE ops are no loss, I think.

(I apologise for being vague. Tomorrow I will probably remember all the
details of what I was solving back then.)

In any case, I appreciate that SIMD wasn't originally intended for work
in which values cross lanes, hence byte-work with SIMD was historically
inefficient. We can do better on very new CPUs thankfully, but it does
mean we need to think in terms of the API being designed to allow
sucking down hundreds of bytes per opcode.
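
To make the API point a little more concrete, here is a minimal sketch of
what a bulk-oriented entry point could look like. It is not a proposal
from this thread: the name count_code_points is hypothetical, and it
assumes the input is already well-formed UTF-8.

```cpp
#include <cstddef>

// Hypothetical bulk entry point: counts code points in well-formed UTF-8.
// It receives the whole buffer, so the implementation (or the compiler's
// auto-vectoriser) is free to process it in wide chunks.
inline std::size_t count_code_points(const unsigned char* data,
                                     std::size_t len) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < len; ++i) {
        // Every byte that is not a 10xxxxxx continuation byte starts a
        // new code point.
        n += (data[i] & 0xC0) != 0x80;
    }
    return n;
}
```

The contrast is with an interface that hands out one code point per call:
that shape forces a branchy decode per element, whereas a whole-buffer
call leaves room for the wide loads discussed above.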

Niall

Received on 2023-03-02 14:59:18