sg14: Re: [SG14] Low Latency C++ Brainstorming

From: Sophia Poirier <spoirier_at_[hidden]>
Date: Thu, 13 May 2021 10:08:39 -0500

> On May 13, 2021, at 4:05 AM, Niall Douglas via SG14 <sg14_at_[hidden]> wrote:
>
> On 12/05/2021 15:47, Michael Wong via SG14 wrote:
>> As of the Apr 2021 discussion on low latency general brainstorming
>> chaired by Staffan Tjernström, captured in notes here:
>> https://lists.isocpp.org/sg14/2021/04/0616.php
>>
>> This is a summary of that discussion aiming to elicit more refinement,
>> leading to possible features:
>
> LLFIO and its related WG21 proposal papers cover a bunch of these already:
>
>> 2. cache-control
>
> In order to support directly mapped storage devices (Storage Class
> Memory), we implement force cache line eviction to main memory (i.e.
> fsync() for preceding RAM writes by a CPU)
>
> The remaining cache control which is useful is non-cache-affecting reads
> and writes, but let me come back to that later.
>
>> 3. scheduling periodic thread intervals or by the deadline
>
> LLFIO implements deadline i/o, either relative or absolute.
>
>> 4. CPU pinning like madvise
>
> LLFIO exposes as much of madvise() as is reasonably portable.
>
>> 5. manage memory-mapped files
>
> LLFIO provides comprehensive memory mapped file support, in so far as is
> reasonably portable. This is proposed for standardisation at
> http://wg21.link/P1883.
>
>> 7. enforce the real-time constraint on a call path or trap at
>> runtime if unable to satisfy
>
> I am unaware that this is possible to reliably implement outside hard
> realtime OSs. And if it can't be reliable, in my opinion it's not worth
> doing. The user base for hard realtime OS is tiny, so supporting them as
> a special case isn't worth the committee time, in my opinion.

This conversation was not limited to hard realtime OSes, where the constraint can be enforced. General-purpose OSes also commonly provide a mechanism to express this intent for a thread, and the scheduler then honors it as best it can (with no guarantees, since it is not a hard realtime OS). So I think it is worth considering a generalized mechanism for a common thread attribute like this, to support the domains that are the focus of this study group.
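
As a rough illustration of what expressing this intent looks like today, here is a minimal sketch for POSIX systems. The helper name `request_realtime_best_effort` is hypothetical, and picking SCHED_FIFO at its minimum priority is only an assumption for the example. The call may fail with EPERM when the process lacks the privilege, and even on success a general-purpose OS gives no latency guarantee.

```cpp
#include <pthread.h>
#include <sched.h>
#include <cerrno>

// Hypothetical helper (not part of LLFIO or any proposal): asks the OS to
// schedule the given thread under a real-time policy on a best-effort basis.
// Returns 0 on success, otherwise an errno-style code (commonly EPERM when
// the process is not allowed to use a real-time policy).
inline int request_realtime_best_effort(pthread_t thread) {
    const int policy = SCHED_FIFO;  // assumption: fixed-priority FIFO policy
    sched_param param{};
    // Lowest real-time priority: ranks above normal threads without
    // competing with other real-time work on the system.
    param.sched_priority = sched_get_priority_min(policy);
    if (param.sched_priority == -1) {
        return errno;  // the policy is not supported on this system
    }
    return pthread_setschedparam(thread, policy, &param);
}
```

A caller would typically invoke `request_realtime_best_effort(pthread_self())` at thread start and carry on at normal priority if the result is nonzero, which is the "as best it can" behavior described above.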

> I mentioned earlier non-cache-affecting reads and writes. Implementing
> these portably is painful, and probably a LLFIO i/o handle is one of the
> less worse ways because the handle can specify any unusual
> architecture-specific alignment and granularity requirements. For
> example, on Intel, the minimum alignment and granularity for NTA is 16
> bytes, and NTA is actually useful at cache line granularity which is 64
> bytes. For portable code, cache line size is a *runtime* property. It
> can be assumed constant for some architectures only.
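
To make the runtime-property point concrete, here is a minimal sketch, assuming a POSIX system. `_SC_LEVEL1_DCACHE_LINESIZE` is a glibc extension rather than standard POSIX, so the code falls back to an assumed 64 bytes when the query is unavailable or reports nothing. Both helper names are hypothetical.

```cpp
#include <unistd.h>
#include <cstddef>

// Hypothetical helper: returns the L1 data cache line size reported by the
// OS at runtime, or `fallback` when the platform cannot report it.
inline std::size_t runtime_cache_line_size(std::size_t fallback = 64) {
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    const long n = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (n > 0) {
        return static_cast<std::size_t>(n);
    }
#endif
    return fallback;  // assumption: 64 bytes, typical but not universal
}

// Hypothetical helper: rounds a byte count up to a whole number of cache
// lines, e.g. to size a buffer that will be written at line granularity.
inline std::size_t round_up_to_lines(std::size_t bytes, std::size_t line) {
    return (bytes + line - 1) / line * line;
}
```

The query is done once at startup and the result passed around as an ordinary value, rather than baked in as a compile-time constant.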
>
> This brings me to a missing item from the above list: portable support
> for scalable vector arithmetic. This lets you do math in SIMD widths
> which literally vary from loop to loop, because only the width currently
> not in use by the CPU is chosen. ARM and many GPUs support scalable
> vector arithmetic, and it can *greatly* improve tail latencies by orders
> of magnitude. In fact, if you have tail latency problems in bulk math,
> I'd call SVA *magical*.
>
> How best to support SVA in standard C++ is a puzzle however. A really
> loopy proposal is a special LLFIO i/o handle which applies bulk math to
> two buffers it reads from, and writes out to another buffer. LLFIO's i/o
> handles are easily lightweight enough that this "just works" with zero
> added overhead, but it "feels" weird to do vector math by writing code
> which does i/o. Still, it does actually work, and the hot loop which
> does the SVA tends to be self contained and thus the weird i/o
> abstraction doesn't leak out.
>
> Niall

Received on 2021-05-13 10:09:03