Date: Thu, 2 Mar 2023 16:45:32 +0100
On 02/03/2023 16.28, Steve Downey via SG16 wrote:
> That ICU apparently has poor performance in this area indicates a little that many users are OK with it. But it does also sound like there are use cases for eager algorithms that could provide better performance. Probably something like the ones that JeanHeyd was proposing for C?
>
> Incidentally, should the charconv functions get `span`ed? Those char* make me itch a little? Or would that be an ABI pessimization?
<charconv> was intentionally designed to be a bare-bones interface
where other abstractions can be layered on top.
People have a number of ergonomics gripes about the interface,
so feel free to propose wrappers that are "nicer" to use,
if you feel you can make a bang-for-the-buck argument for a
facility somewhere between (say) strtoul and <charconv>.
Jens
> On Thu, Mar 2, 2023 at 9:59 AM Niall Douglas via SG16 <sg16_at_[hidden]> wrote:
>
> On 01/03/2023 20:43, Thiago Macieira via SG16 wrote:
> > On Tuesday, 28 February 2023 07:18:07 PST Niall Douglas via SG16 wrote:
> >> I really wish SIMD had better support for UTF-8, only AVX-512 enables a
> >> decent fraction of main memory bandwidth
> >> (https://github.com/simdutf/simdutf)
> >
> > I did talk to some CPU architects about this a few years ago and our
> > conclusion is that it wouldn't be worth it. The conversion was never a hot
> > path in any of the content we looked at, and the instructions this would
> > create would end up one of those complex beasts few people ever use because
> > they're not fast for anything except the narrow use-case they were designed
> > for.
>
> I wasn't meaning dedicated opcodes - those are almost always a dead end
> with only a few honourable exceptions - I rather meant that AVX pre-512
> made working with bytes awkward due to how the instruction set is
> designed. AVX512 shows it can be done better.
>
> > You may be one of the few who would, but you're also one of the few who
> > probably remember the STTNI (STring and Text New Instructions) from SSE 4.2 -
> > the PCMPxSTRx instructions[1]. You'll also note that those have never been
> > extended to 256- and 512-bit. 10 years ago, I rewrote the UTF16-to-Latin1
> > codec in Qt with PCMPESTRM to detect out-of-range characters[2]. About 5 years
> > ago I yanked it out and replaced with a much faster PMINUW[3].
>
> I have a bad memory of that opcode. I spent far too much time trying to
> make it perform well for the thing I was solving, and I failed. It had
> too much latency if I remember rightly, or at least, however I was using
> it was making the CPU stall.
>
> I remember replacing it with something which did a bunch of bitscan
> rights, and those are single cycle and many of them can be interleaved
> concurrently, and that was orders of magnitude faster AND the exact same
> code pattern also worked great on ARM, so it was a keeper.
>
> And that fits the historical trend for forty years now: dedicated
> operation CPU opcodes generally become slower than alternatives within
> two or three CPU generations. So no loss to those SSE ops I think.
>
> (I apologise for being vague. I probably would remember all the details
> of what I was solving back then tomorrow)
>
> In any case, I appreciate that SIMD wasn't originally intended for work
> which has values cross lanes, hence byte-work with SIMD was historically
> inefficient. We can do better on very new CPUs thankfully, but it does
> mean we need to think in terms of the API being designed to allow
> sucking down hundreds of bytes per opcode.
>
> Niall
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
>
Received on 2023-03-02 15:45:42