sg14: Re: [SG14] Low Latency C++ Brainstorming

From: Niall Douglas <s_sourceforge_at_[hidden]>
Date: Thu, 13 May 2021 10:05:45 +0100

On 12/05/2021 15:47, Michael Wong via SG14 wrote:
> As of the Apr 2021 discussion on low latency general brainstorming
> chaired by Staffan Tjernström, captured in notes here:
> https://lists.isocpp.org/sg14/2021/04/0616.php
>
> This is a summary of that discussion aiming to elicit more refinement,
> leading to possible features:

LLFIO and its related WG21 proposal papers cover a bunch of these already:

> 2. cache-control

In order to support directly mapped storage devices (Storage Class
Memory), LLFIO implements forced cache line eviction to main memory
(i.e. the equivalent of fsync() for preceding RAM writes by a CPU).
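To make the idea concrete, here is a minimal sketch of flushing a byte
range out of the CPU caches. It is not LLFIO's API: the function name
flush_stores_to_memory and the 64 byte line size are assumptions made
for illustration, and the flush itself only happens on x86-64, where the
CLFLUSH intrinsic is available. On other targets the function merely
counts the lines it would have flushed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h>  // _mm_clflush, and _mm_sfence via xmmintrin.h
#define FLUSH_SKETCH_X86 1
#else
#define FLUSH_SKETCH_X86 0
#endif

// Assumption for this sketch only: a 64 byte cache line. Portable code
// would need to discover this at runtime.
constexpr std::size_t kAssumedCacheLine = 64;

// Hypothetical helper: evicts every cache line overlapping
// [data, data + bytes) to main memory and returns how many lines the
// range spans. On non-x86-64 targets nothing is flushed.
inline std::size_t flush_stores_to_memory(const void* data, std::size_t bytes) {
  if (bytes == 0) return 0;
  const std::uintptr_t mask = ~(static_cast<std::uintptr_t>(kAssumedCacheLine) - 1);
  const std::uintptr_t begin = reinterpret_cast<std::uintptr_t>(data) & mask;
  const std::uintptr_t end = reinterpret_cast<std::uintptr_t>(data) + bytes;
  std::size_t lines = 0;
  for (std::uintptr_t p = begin; p < end; p += kAssumedCacheLine) {
#if FLUSH_SKETCH_X86
    _mm_clflush(reinterpret_cast<const void*>(p));  // write back and evict one line
#endif
    ++lines;
  }
#if FLUSH_SKETCH_X86
  _mm_sfence();  // conservative: order the flushes before any later stores
#endif
  return lines;
}
```

A caller would write to the mapped region as normal and then call
flush_stores_to_memory(ptr, len) at the point where the data must have
reached the device.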

The remaining cache control which is useful is non-cache-affecting reads
and writes, but let me come back to that later.

> 3. scheduling periodic thread intervals or by the deadline

LLFIO implements deadline i/o, either relative or absolute.
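As an illustration of what "relative or absolute" means in practice,
here is a small sketch of a deadline type. The names Deadline, Result
and run_with_deadline are hypothetical and are not LLFIO's actual
types. The sketch also busy-polls for simplicity, which a real
implementation would not do.

```cpp
#include <cassert>
#include <chrono>
#include <variant>

using Clock = std::chrono::steady_clock;

// Hypothetical deadline: either a duration relative to the start of the
// operation, or an absolute point in time.
struct Deadline {
  std::variant<Clock::duration, Clock::time_point> value;

  // Resolve to an absolute expiry, given when the operation started.
  Clock::time_point expiry(Clock::time_point started) const {
    if (const auto* rel = std::get_if<Clock::duration>(&value)) {
      return started + *rel;
    }
    return std::get<Clock::time_point>(value);
  }
};

enum class Result { done, timed_out };

// Calls try_once() repeatedly until it reports success or the deadline
// expires. try_once is always attempted at least once.
template <class F>
Result run_with_deadline(F&& try_once, const Deadline& d) {
  const Clock::time_point expiry = d.expiry(Clock::now());
  do {
    if (try_once()) return Result::done;
  } while (Clock::now() < expiry);
  return Result::timed_out;
}
```

The same call site then accepts either form, e.g.
Deadline{Clock::duration(std::chrono::milliseconds(5))} for a relative
deadline or Deadline{Clock::now() + std::chrono::seconds(1)} for an
absolute one.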

> 4. CPU pinning like madvise

LLFIO exposes as much of madvise() as is reasonably portable.

> 5. manage memory-mapped files

LLFIO provides comprehensive memory mapped file support, in so far as is
reasonably portable. This is proposed for standardisation at
http://wg21.link/P1883.
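For readers who have not used the underlying facilities, the sketch
below shows the raw POSIX calls that such a library has to wrap:
mmap(), madvise(), msync() and munmap(). It is not LLFIO code and not
the P1883 interface. The function name roundtrip_through_mapping and
the /tmp path are assumptions for illustration, and it only builds on
POSIX systems.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical demo: writes `text` into a temporary file through a
// shared memory mapping, then reads it back with pread(). Returns an
// empty string on any failure.
inline std::string roundtrip_through_mapping(const std::string& text) {
  char path[] = "/tmp/mmap_sketch_XXXXXX";
  const int fd = mkstemp(path);
  if (fd < 0) return {};
  unlink(path);  // the file vanishes once fd is closed

  const std::size_t len = text.size();
  std::string out;
  if (len > 0 && ftruncate(fd, static_cast<off_t>(len)) == 0) {
    void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p != MAP_FAILED) {
      madvise(p, len, MADV_SEQUENTIAL);  // a hint only; the kernel may ignore it
      std::memcpy(p, text.data(), len);  // plain stores into the file
      msync(p, len, MS_SYNC);            // push dirty pages to the file
      munmap(p, len);

      out.resize(len);
      const ssize_t n = pread(fd, &out[0], len, 0);
      if (n != static_cast<ssize_t>(len)) out.clear();
    }
  }
  close(fd);
  return out;
}
```

Every one of these calls has platform-specific behaviour and a different
spelling on Windows, which is the portability gap a library wrapper is
meant to close.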

> 7. enforce the real-time constraint on a call path or trap at
> runtime if unable to satisfy

I am not aware that this can be implemented reliably outside hard
realtime OSs. And if it can't be reliable, in my opinion it's not worth
doing. The user base for hard realtime OSs is tiny, so supporting them
as a special case isn't worth the committee's time.

I mentioned earlier non-cache-affecting reads and writes. Implementing
these portably is painful, and an LLFIO i/o handle is probably one of
the less bad ways, because the handle can specify any unusual
architecture-specific alignment and granularity requirements. For
example, on Intel, the minimum alignment and granularity for NTA
(non-temporal access) is 16 bytes, and NTA is actually useful at cache
line granularity, which is 64 bytes. For portable code, cache line size
is a *runtime* property. It can be assumed constant for some
architectures only.
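A rough sketch of what "the handle specifies its requirements" could
look like follows. NtRequirements, query_nt_requirements and nt_copy
are hypothetical names, not LLFIO's. The 16 byte figures are those of
the SSE2 streaming store intrinsic used here, the cache line size is
queried at runtime only where glibc's sysconf() exposes it, and 64 is
just a fallback guess.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h>  // _mm_stream_si128, _mm_loadu_si128
#define NT_SKETCH_X86 1
#else
#define NT_SKETCH_X86 0
#endif
#if defined(__linux__)
#include <unistd.h>  // sysconf
#endif

// Hypothetical description of what a non-temporal write needs.
struct NtRequirements {
  std::size_t alignment;    // required destination alignment, in bytes
  std::size_t granularity;  // length must be a multiple of this
  std::size_t cache_line;   // discovered at runtime where possible
};

inline NtRequirements query_nt_requirements() {
  std::size_t line = 64;  // fallback guess, not a portable fact
#if defined(__linux__) && defined(_SC_LEVEL1_DCACHE_LINESIZE)
  const long v = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
  if (v > 0) line = static_cast<std::size_t>(v);
#endif
  return {16, 16, line};  // 16/16 matches the SSE2 intrinsic used below
}

// Copies bytes to dst. Returns true if the non-temporal path was used,
// false if it fell back to an ordinary memcpy.
inline bool nt_copy(void* dst, const void* src, std::size_t bytes,
                    const NtRequirements& r) {
  const bool eligible =
      NT_SKETCH_X86 && bytes % r.granularity == 0 &&
      reinterpret_cast<std::uintptr_t>(dst) % r.alignment == 0;
  if (!eligible) {
    std::memcpy(dst, src, bytes);
    return false;
  }
#if NT_SKETCH_X86
  auto* d = static_cast<__m128i*>(dst);
  const auto* s = static_cast<const __m128i*>(src);
  for (std::size_t i = 0; i < bytes / 16; ++i) {
    _mm_stream_si128(d + i, _mm_loadu_si128(s + i));  // store bypassing the cache
  }
  _mm_sfence();  // make the streamed stores visible before returning
#endif
  return true;
}
```

Code that wants NTA to pay off would additionally round its transfers
to multiples of cache_line, as the paragraph above suggests.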

This brings me to a missing item from the above list: portable support
for scalable vector arithmetic (SVA). This lets you do math in SIMD
widths which literally vary from loop to loop, because the CPU chooses
only the width which is not currently in use. ARM and many GPUs support
scalable vector arithmetic, and it can improve tail latencies by orders
of magnitude. In fact, if you have tail latency problems in bulk math,
I'd call SVA *magical*.
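The loop shape this implies can be sketched in plain C++. This is an
emulation only: real code would use something like the ARM SVE
intrinsics with predication, and the lanes_available callback here is a
hypothetical stand-in for asking the hardware how wide the next
iteration may be. The point is that the loop never hardcodes a vector
width.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Adds a[i] + b[i] into out[i] for n elements, asking lanes_available()
// before every iteration how many lanes it may process. Returns the
// number of outer iterations taken.
template <class WidthFn>
std::size_t add_length_agnostic(const float* a, const float* b, float* out,
                                std::size_t n, WidthFn lanes_available) {
  std::size_t iterations = 0;
  for (std::size_t i = 0; i < n; ++iterations) {
    // Clamp to at least one lane, and to what remains of the input.
    const std::size_t vl =
        std::min(std::max<std::size_t>(lanes_available(), 1), n - i);
    // Stand-in for a single predicated vector add of vl lanes.
    for (std::size_t lane = 0; lane < vl; ++lane) {
      out[i + lane] = a[i + lane] + b[i + lane];
    }
    i += vl;
  }
  return iterations;
}
```

Because the width is only an input to the loop, the same source runs
unchanged whether each iteration gets two lanes or sixty-four.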

How best to support SVA in standard C++ is a puzzle, however. A really
loopy proposal is a special LLFIO i/o handle which applies bulk math to
two buffers it reads from, and writes out to another buffer. LLFIO's i/o
handles are easily lightweight enough that this "just works" with zero
added overhead, but it "feels" weird to do vector math by writing code
which does i/o. Still, it does actually work, and the hot loop which
does the SVA tends to be self contained and thus the weird i/o
abstraction doesn't leak out.

Niall

Received on 2021-05-13 04:05:54