Date: Sat, 10 Aug 2024 12:05:27 +0200
Hi Phil,
From the referenced intro and the conversation so far, I take it that you are suggesting:
- standard topologies like hypercubes, where the standard library handles communication and synchronization
- locally shared non-unified memory in these topologies, where the standard library handles or supports data transfers
You also state that Nvidia will increasingly have to move in that direction with their architecture, probably also considering the increasingly hierarchical levels of well-interconnected racks Nvidia is introducing in data centers, which have to be programmed as a unit.
That could make some algorithms much easier to implement, but it would make it much more difficult to optimize them for the target architectures. On the other hand, the architectures are getting so complicated that few would want to fully optimize for large-scale systems anyway.
Understood so far, and I at least partly agree.
Why not use a library feature providing those functions (roughly along the lines of the sketch below)?
What part of C++ would need to be extended to be able to write code running on a single node of a hypercube?
What abstractions are needed in the core language?
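Just to make the library-only alternative concrete, here is a minimal sketch of what such an interface could look like. All names (hypercube_node, neighbor, exchange) are made up for illustration and are not an existing or proposed API:

  // Purely hypothetical sketch, not an existing or proposed API: a library-only
  // view of one node in a d-dimensional hypercube.
  #include <cstddef>
  #include <span>

  class hypercube_node {
  public:
      hypercube_node(std::size_t dimensions, std::size_t rank)
          : dimensions_(dimensions), rank_(rank) {}

      // Neighbor along dimension k: the rank that differs in exactly bit k.
      std::size_t neighbor(std::size_t k) const { return rank_ ^ (std::size_t{1} << k); }

      // Pairwise exchange with that neighbor; the library would decide whether
      // this becomes a local copy, an NVLink transfer or a network message.
      void exchange(std::size_t k, std::span<const std::byte> send, std::span<std::byte> receive);

  private:
      std::size_t dimensions_;
      std::size_t rank_;
  };

If that much can be expressed as a plain class, the open question is what, if anything, the core language would actually have to add beyond it.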
Also, are the same abstractions needed for local data processing (32 to 10^3 to 10^5 threads in the CUDA world) and for huge data centers? From a theoretical standpoint a unified model is simpler to formulate, but CUDA, for example, has been quite successful over many years with its separation into grid, block, warp and thread. Perhaps one should keep some separation of the hierarchies instead of unifying everything into one framework, losing any way for the algorithms to specifically profit from the parallel hardware features.
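For comparison, the CUDA separation boils down to index arithmetic like the following. This is plain C++ standing in for the blockIdx/blockDim/threadIdx built-ins a real kernel gets from the compiler, just to show that each hierarchy level is directly visible to the algorithm:

  // Plain C++ stand-in for the CUDA built-ins, 1D case: each hierarchy level
  // is exposed to the algorithm through simple index arithmetic.
  #include <cstddef>

  constexpr std::size_t warp_size = 32;  // fixed by the hardware

  constexpr std::size_t global_index(std::size_t block_index, std::size_t block_dim,
                                     std::size_t thread_index) {
      return block_index * block_dim + thread_index;  // which element this thread owns
  }

  constexpr std::size_t warp_index(std::size_t thread_index) {
      return thread_index / warp_size;  // which warp within the block
  }

  static_assert(global_index(2, 256, 5) == 517);
  static_assert(warp_index(70) == 2);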
Not all problems can be mapped onto a hypercube. Your framework would also have to allow combining different topologies within one algorithm at different scale levels, e.g. for locally calculating an FFT-like function within a hypercube.
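The reason the hypercube fits FFT-like algorithms so well is the butterfly schedule: in round k each node talks to the neighbor whose rank differs in bit k. A small sketch of just that schedule (the actual transfer and combine steps are left out):

  #include <cstddef>
  #include <vector>

  // Butterfly schedule on a d-dimensional hypercube, as used by FFT-like and
  // all-reduce algorithms: in round k, the node pairs with rank ^ (1 << k).
  std::vector<std::size_t> butterfly_partners(std::size_t dimensions, std::size_t rank) {
      std::vector<std::size_t> partners;
      for (std::size_t k = 0; k < dimensions; ++k)
          partners.push_back(rank ^ (std::size_t{1} << k));  // flip bit k
      return partners;
  }

  // Example: in a 3-dimensional cube, node 5 (binary 101) pairs with
  // nodes 4 (100), 7 (111) and 1 (001) over the three rounds.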
To be efficient, those topologies would have to be analyzed at compile time to generate optimal code, or handled by some kind of JIT compiler.
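For the compile-time case: if the topology's shape were a compile-time constant, the schedule could be unrolled and specialized without any language change, while a topology only known at run time would indeed need something like a JIT. A purely hypothetical C++20 sketch:

  #include <cstddef>
  #include <utility>

  // Hypothetical sketch: with the hypercube dimension as a template parameter,
  // the per-dimension schedule is unrolled at compile time, so the compiler can
  // specialize each exchange round for the target architecture.
  template <std::size_t Dimensions, class Step>
  constexpr void for_each_dimension(Step step) {
      [&]<std::size_t... K>(std::index_sequence<K...>) {
          (step(K), ...);  // one call per dimension, no run-time loop
      }(std::make_index_sequence<Dimensions>{});
  }

  // Usage: for_each_dimension<3>([](std::size_t k) { /* exchange along dimension k */ });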
IMHO there is of course general interest in extending C++ to better support parallel programming, but before standardization there should be a proven implementation out there. This seems much too complex to be designed theoretically by the committee alone.
Best,
Sebastian
-----Original Message-----
From: Phil Bouchard via Std-Proposals <std-proposals_at_[hidden]>
Sent: Sat 10.08.2024 11:03
Subject: Re: [std-proposals] C++ and Parallel Programming
To: Jens Maurer <jens.maurer_at_[hidden]>; std-proposals_at_[hidden];
CC: Phil Bouchard <boost_at_[hidden]>;
> Memory is not always shared, depending on your system architecture.
>
> This is the framework for concurrency that just got into the
> Working Draft:
>
> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p2300r10.html
Is the working draft considering modular memory then?
> What are you looking for that wouldn't be covered by that framework?
Well this is actually an intro to true parallel programming:
https://courses.grainger.illinois.edu/cs554/fa2015/notes/01_overview_8up.pdf
And I think Nvidia needs to head that direction with their architecture.
But as soon as they do, the C++ concurrency framework will need to be
rewritten again, because it doesn't look like the working draft is
considering it.