Date: Wed, 1 Mar 2023 10:36:21 +0100
We need to consider the use cases.
Sure, being able to decode 15 gigabytes per second sounds nice, but why are
we doing that?
Do we want 60 gigabytes of code points in memory?
You need to have *some* process to feed that data too (note that Lemire
provides some use cases, so they do exist).
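
To spell out where the 60 comes from, here is a back-of-the-envelope sketch.
It assumes the decoded output is stored as UTF-32 (4 bytes per code point)
and that one second of input is all ASCII, which is the worst case for
expansion. The function name utf32_bytes is made up for illustration.

```cpp
#include <cstdint>

// Hypothetical helper: bytes of UTF-32 produced when decoding `utf8_bytes`
// bytes of UTF-8 in which every code point is encoded with `utf8_len` bytes.
// Every code point occupies exactly 4 bytes in UTF-32.
constexpr std::uint64_t utf32_bytes(std::uint64_t utf8_bytes, int utf8_len) {
    return utf8_bytes / static_cast<std::uint64_t>(utf8_len) * 4;
}

// One second of input at 15 GB/s, all ASCII: 4x expansion.
static_assert(utf32_bytes(15'000'000'000ULL, 1) == 60'000'000'000ULL, "ASCII");
// The same input made only of 3-byte sequences (e.g. CJK) grows far less.
static_assert(utf32_bytes(15'000'000'000ULL, 3) == 20'000'000'000ULL, "3-byte");
```

So the 4x blow-up only applies to ASCII-heavy input, but that is also the
input on which the SIMD decoders tend to post their best numbers.
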
We should also consider that SIMD only makes sense if you can feed the CPU
to begin with. Most strings are small.
It certainly doesn't make sense in the context of feeding Unicode
algorithms: decoding is not the bottleneck!
(For small strings, you might spend more CPU cycles making sure the data is
aligned and prefetching state tables than doing actual work.)
I'd also question whether ICU constitutes the paragon of non-SIMD
performance. The benchmarks should probably consider
https://github.com/unicode-org/icu/blob/main/icu4c/source/common/ustrtrns.cpp#L260
https://bjoern.hoehrmann.de/utf-8/decoder/dfa/
For validation, the interface will look like
bool-or-some-elaborate-result-type is-valid(iterators-or-range)
regardless, and for that, implementers could use SIMD (I'd argue that this
is the best motivated use case for eagerly munching through massive amounts
of UTF data).
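
For concreteness, the validation interface described above might be sketched
like this in C++. The names utf8_validation_result and is_valid_utf8 are
hypothetical, not taken from any proposal, and the body is a plain scalar
loop following the well-formed byte sequence table in the Unicode Standard.
An implementation would be free to swap in SIMD behind the same signature.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical result type: a bool plus the length of the valid prefix.
struct utf8_validation_result {
    bool valid;
    std::size_t valid_up_to;  // bytes of well-formed UTF-8 at the start
    explicit operator bool() const { return valid; }
};

// Hypothetical name. Scalar reference loop; rejects overlong encodings,
// surrogates, code points above U+10FFFF, and truncated sequences.
inline utf8_validation_result is_valid_utf8(std::string_view s) {
    const auto* p = reinterpret_cast<const unsigned char*>(s.data());
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = p[i];
        if (b <= 0x7F) { ++i; continue; }    // ASCII fast path
        std::size_t len = 0;
        unsigned char lo = 0x80, hi = 0xBF;  // allowed range of the 2nd byte
        if (b >= 0xC2 && b <= 0xDF)      { len = 2; }
        else if (b == 0xE0)              { len = 3; lo = 0xA0; }  // no overlongs
        else if (b == 0xED)              { len = 3; hi = 0x9F; }  // no surrogates
        else if (b >= 0xE1 && b <= 0xEF) { len = 3; }
        else if (b == 0xF0)              { len = 4; lo = 0x90; }  // no overlongs
        else if (b >= 0xF1 && b <= 0xF3) { len = 4; }
        else if (b == 0xF4)              { len = 4; hi = 0x8F; }  // <= U+10FFFF
        else return {false, i};              // 0x80..0xC1 or 0xF5..0xFF
        if (n - i < len) return {false, i};  // truncated sequence
        if (p[i + 1] < lo || p[i + 1] > hi) return {false, i};
        for (std::size_t k = 2; k < len; ++k)
            if (p[i + k] < 0x80 || p[i + k] > 0xBF) return {false, i};
        i += len;
    }
    return {true, n};
}
```

A caller that only wants the bool can write if (is_valid_utf8(text)), while
one that wants to report or repair errors can look at valid_up_to.
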
And again, to get Lemire kind of performance, you need Lemire kind of time
commitment, for a few users (most people do not compile with AVX-512
https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512), and will
implementers support NEON, AltiVec, etc.? Lemire et al. actually provided
different implementations for different Intel generations.
On Tue, Feb 28, 2023 at 8:51 PM Tom Honermann via SG16 <
sg16_at_[hidden]> wrote:
> On 2/28/23 1:43 PM, Jens Maurer via SG16 wrote:
>
> On 28/02/2023 16.18, Niall Douglas via SG16 wrote:
>
> On 26/02/2023 01:48, Steve Downey via SG16 wrote:
>
>
> Much text processing is tied to IO and the performance is mostly
> secondary. If we could make accidentally incorrect harder to do that
> would be a win.
>
> My consumer hardware storage here does 14Gb/sec reads (two PCIe 4.0 SSDs
> in RAID0). Only a few years ago that was main memory speeds for a high
> end PC.
>
> I think you need to assume text processing, and especially Unicode
> parsing, is basically main memory speeds whether it is from i/o or not.
>
> I really wish SIMD had better support for UTF-8, only AVX-512 enables a
> decent fraction of main memory bandwidth
> (https://github.com/simdutf/simdutf).
>
> Thanks for the pointer. I was looking for a comparison like that.
>
> So, this means we do leave 5-10x performance on the table if we
> go for an interface that can deliver ICU-level performance (only).
>
> Sadness engulfs me.
>
>
> I'd like to see as much of that
> performance passed through by the standard library as possible, even if
> it makes the API non-STL-like.
>
> So, it seems we need an idea how to employ SIMD with a ranges-based
> interface, or we go for eager transcoding algorithms (possibly
> in addition to the ranges-based ones).
>
> The interfaces JeanHeyd is proposing for C in WG14 N3095 (Restartable
> Functions for Efficient Character Conversions)
> <https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3095.htm> are intended
> to support implementation with SIMD. As a worst case fallback, that should
> be an option if/once it is adopted and deployed.
>
> Tom.
>
> Jens
Received on 2023-03-01 09:36:35