We need to consider the use cases.
Sure, being able to decode 15 gigabytes per second sounds nice, but why are we doing that?
Do we want 60 gigabytes of code points in memory? (That is roughly what 15 GB of mostly-ASCII UTF-8 becomes once decoded to 4-byte code points.)
You need to have *some* process to feed that data to. (Note that Lemire provides some use cases, so they do exist.)

We should also consider that SIMD only makes sense if you can feed the CPU to begin with. Most strings are small.
It certainly doesn't make sense in the context of feeding Unicode algorithms; there, decoding is not the bottleneck!
(For small strings, you might spend more CPU cycles making sure the data is aligned and prefetching state tables than doing actual work.)
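
To make the setup-cost point concrete, here is a minimal sketch of the size-based dispatch this implies; decode_scalar, decode_simd, and the threshold are invented names for illustration, not any existing library's API, and the two helpers are only declared here:

#include <cstddef>
#include <string_view>

// Byte-at-a-time loop: no alignment or table setup, so it is cheap for short input.
std::size_t decode_scalar(std::string_view utf8, char32_t* out);  // implementation elided

// Wide path: pays for loading masks/tables and aligning, wins only on long input.
std::size_t decode_simd(std::string_view utf8, char32_t* out);    // implementation elided

std::size_t decode_utf8(std::string_view utf8, char32_t* out) {
    // The cut-over point is a guess; a real library would tune it per architecture.
    constexpr std::size_t simd_threshold = 64;
    return utf8.size() < simd_threshold ? decode_scalar(utf8, out)
                                        : decode_simd(utf8, out);
}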

I'd also question whether ICU constitutes the paragon of non-SIMD performance. The benchmarks should probably consider:
https://github.com/unicode-org/icu/blob/main/icu4c/source/common/ustrtrns.cpp#L260
https://bjoern.hoehrmann.de/utf-8/decoder/dfa/ 

For validation, the interface will look like bool-or-some-elaborate-result-type is_valid(iterators-or-range)
regardless, and for that, implementers could use SIMD (which I'd argue is the best-motivated use case for eagerly munching through massive amounts of UTF data).
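
For concreteness, here is a minimal sketch of that shape, assuming made-up names (validate_result, validate_utf8) and a plain scalar reference loop; an implementer could put a SIMD bulk path behind the same signature without callers noticing:

#include <cstddef>
#include <string_view>

struct validate_result {
    bool ok = true;
    std::size_t error_offset = 0;  // byte offset of the first ill-formed sequence when !ok
};

validate_result validate_utf8(std::string_view s) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
    const std::size_t n = s.size();
    std::size_t i = 0;
    auto cont = [&](std::size_t k) {  // is byte i+k a continuation byte (0x80..0xBF)?
        return i + k < n && (p[i + k] & 0xC0u) == 0x80u;
    };
    while (i < n) {
        const unsigned char b = p[i];
        if (b < 0x80u) { ++i; continue; }                                // ASCII
        if (b >= 0xC2u && b <= 0xDFu && cont(1)) { i += 2; continue; }   // 2-byte
        if (b >= 0xE0u && b <= 0xEFu && cont(1) && cont(2)) {            // 3-byte
            const bool bad = (b == 0xE0u && p[i + 1] < 0xA0u)            // overlong
                          || (b == 0xEDu && p[i + 1] > 0x9Fu);           // surrogate
            if (!bad) { i += 3; continue; }
        }
        if (b >= 0xF0u && b <= 0xF4u && cont(1) && cont(2) && cont(3)) { // 4-byte
            const bool bad = (b == 0xF0u && p[i + 1] < 0x90u)            // overlong
                          || (b == 0xF4u && p[i + 1] > 0x8Fu);           // above U+10FFFF
            if (!bad) { i += 4; continue; }
        }
        return {false, i};
    }
    return {true, n};
}

Whether the body behind that signature is this loop, a DFA, or an AVX-512 kernel is invisible to callers, which is exactly why validation is the easiest place to spend the SIMD effort.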

And again, to get Lemire-level performance, you need a Lemire-level time commitment, for a few users (most people do not compile with AVX-512,
https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512), and will implementers support NEON, AltiVec, etc.? Lemire et al. actually provide different implementations for different Intel generations.


On Tue, Feb 28, 2023 at 8:51 PM Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:
On 2/28/23 1:43 PM, Jens Maurer via SG16 wrote:
On 28/02/2023 16.18, Niall Douglas via SG16 wrote:
On 26/02/2023 01:48, Steve Downey via SG16 wrote:

Much text processing is tied to IO and the performance is mostly
secondary. If we could make accidentally incorrect harder to do that
would be a win.
My consumer hardware storage here does 14Gb/sec reads (two PCIe 4.0 SSDs
in RAID0). Only a few years ago that was main memory speeds for a high
end PC.

I think you need to assume text processing, and especially Unicode
parsing, is basically main memory speeds whether it is from i/o or not.

I really wish SIMD had better support for UTF-8, only AVX-512 enables a
decent fraction of main memory bandwidth
(https://github.com/simdutf/simdutf).
Thanks for the pointer.  I was looking for a comparison like that.

So, this means we do leave 5-10x performance on the table if we
go for an interface that can deliver ICU-level performance (only).

Sadness engulfs me.

  I'd like to see as much of that
performance passed through by the standard library as possible, even if
it makes the API non-STL-like.
So, it seems we need an idea how to employ SIMD with a ranges-based
interface, or we go for eager transcoding algorithms (possibly
in addition to the ranges-based ones).

The interfaces JeanHeyd is proposing for C in WG14 N3095 (Restartable Functions for Efficient Character Conversions) are intended to support implementation with SIMD. As a worst case fallback, that should be an option if/once it is adopted and deployed.

Tom.

Jens

--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16