I've decided where I'm going to go with this. 

The existing function 'alloca' allocates stack memory that stays valid until the enclosing function returns (not just until the end of the current scope).
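To make that lifetime rule concrete, here is a small sketch. It assumes a platform that provides the non-standard <alloca.h> header (glibc with GCC or Clang, for example), and the helper name 'sum_after_scope' is made up for illustration.

```cpp
#include <alloca.h>  // non-standard header; assumed available (glibc with GCC/Clang)

// Hypothetical helper: allocates up to 16 ints inside a nested scope,
// then reads them back after that scope has ended.
int sum_after_scope(int count) {
    int* slots[16] = {};
    if (count > 16) count = 16;
    for (int i = 0; i < count; ++i) {
        {
            int* p = static_cast<int*>(alloca(sizeof(int)));
            *p = i * 10;
            slots[i] = p;
        }  // the scope ends here, but the alloca'd memory is still live
    }
    int sum = 0;
    for (int i = 0; i < count; ++i)
        sum += *slots[i];  // still valid: released only when the function returns
    return sum;
}
```

Each pass through the loop carves out a fresh block, so the stack usage grows with 'count' until the function returns.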

So I will create a new function, 'classalloca', which will not just allocate stack memory but also invoke a constructor.

The GNU compiler gives a function a quick scan before it compiles it. It looks for things like the use of 'alloca', and it forces the use of a frame pointer if 'alloca' is used at all.

Similarly, if my new function, 'classalloca', is used at least once, then the function will be given a hidden local variable, which will be a linked list for destructors that need to be called.

When the function is exited, either by a 'return' or a 'throw', the linked list will be traversed and each destructor in it will be invoked.
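The hidden list can be roughly emulated in today's C++ to see how it would behave. This is only a sketch and not the compiler implementation: 'DtorNode', 'DtorList', 'construct_and_register', 'Tracer' and 'demo' are all made-up names, the list is an ordinary local variable instead of a hidden one, and it assumes <alloca.h> is available and that alloca's result is aligned well enough for the types involved.

```cpp
#include <alloca.h>   // non-standard header; assumed available (glibc with GCC/Clang)
#include <cstddef>
#include <new>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// One list entry: which object to destroy and how to destroy it.
struct DtorNode {
    void (*destroy)(void*);
    void* object;
    DtorNode* next;
};

template <class T>
void destroy_as(void* p) { static_cast<T*>(p)->~T(); }

// Stands in for the hidden local variable. Its destructor runs on 'return'
// and also during stack unwinding after a 'throw'.
struct DtorList {
    DtorNode* head = nullptr;
    DtorList() = default;
    DtorList(const DtorList&) = delete;
    DtorList& operator=(const DtorList&) = delete;
    ~DtorList() {
        // Newest node is at the head, so this is reverse order of construction.
        for (DtorNode* n = head; n != nullptr; n = n->next)
            n->destroy(n->object);
    }
};

// Constructs a T in caller-provided stack memory and records its destructor.
// The caller must obtain both blocks from alloca in its own frame.
template <class T, class... Args>
T* construct_and_register(DtorList& list, void* obj_mem, void* node_mem, Args&&... args) {
    // Assumption: alloca returns memory aligned for any ordinary type.
    static_assert(alignof(T) <= alignof(std::max_align_t), "over-aligned type");
    T* obj = ::new (obj_mem) T(std::forward<Args>(args)...);
    list.head = ::new (node_mem) DtorNode{&destroy_as<T>, obj, list.head};
    return obj;
}

// Test type that logs its construction and destruction.
struct Tracer {
    std::vector<std::string>& log;
    unsigned id;
    Tracer(std::vector<std::string>& l, unsigned i) : log(l), id(i) {
        log.push_back("ctor " + std::to_string(id));
    }
    ~Tracer() { log.push_back("dtor " + std::to_string(id)); }
};

void demo(std::vector<std::string>& log, unsigned count, bool fail) {
    DtorList dtors;  // what the compiler would add silently
    for (unsigned i = 0; i < count; ++i) {
        void* obj_mem = alloca(sizeof(Tracer));
        void* node_mem = alloca(sizeof(DtorNode));
        construct_and_register<Tracer>(dtors, obj_mem, node_mem, log, i);
    }
    if (fail) throw std::runtime_error("exit by throw");
    log.push_back("body done");
}
```

The objects outlive the loop body and are destroyed newest-first when 'demo' exits, whichever way it exits. The alloca calls are kept as separate statements in the calling function because the memory has to belong to that function's frame.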

After I have 'classalloca' working perfectly, I'll then implement the unary '%' operator, which will pretty much be short-hand for 'classalloca', meaning we'll be able to do:

for (unsigned i = 0; i < N; ++i)
    for (unsigned j = 0; j < M; ++j)
        %async(Func);

So in the above snippet, we'll have N*M concurrent threads. And if we only want M concurrent threads, we can do:

for (unsigned i = 0; i < N; ++i)
    [&]()
    {
        for (unsigned j = 0; j < M; ++j)
            %async(Func);
    }();

The above lambda is of course a bona fide function, and so the objects created by 'classalloca' will be destroyed when the lambda returns. This means we'll get at most M concurrent threads at a time.
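The same scoping effect can be approximated with standard C++ today, which may help to picture it. In this sketch a std::vector of futures plays the part of the stack objects that '%' would create, and std::launch::async is passed so that each future's destructor waits for its thread. 'Stats', 'flat' and 'batched' are made-up names, and 'Func' here is just a stand-in that records how many calls overlap.

```cpp
#include <atomic>
#include <functional>
#include <future>
#include <vector>

// Hypothetical bookkeeping so the amount of overlap can be observed.
struct Stats {
    std::atomic<unsigned> running{0};
    std::atomic<unsigned> peak{0};
    std::atomic<unsigned> total{0};
};

// Stand-in for the task: records the highest number of simultaneous calls.
void Func(Stats& s) {
    unsigned now = ++s.running;
    unsigned seen = s.peak.load();
    while (seen < now && !s.peak.compare_exchange_weak(seen, now)) {}
    ++s.total;
    --s.running;
}

// Futures live until the function returns: up to N*M threads at once.
void flat(Stats& s, unsigned N, unsigned M) {
    std::vector<std::future<void>> alive;
    for (unsigned i = 0; i < N; ++i)
        for (unsigned j = 0; j < M; ++j)
            alive.push_back(std::async(std::launch::async, Func, std::ref(s)));
}   // every future is destroyed here, each one waiting for its thread

// Futures live until the lambda returns: at most M threads at once.
void batched(Stats& s, unsigned N, unsigned M) {
    for (unsigned i = 0; i < N; ++i)
        [&]()
        {
            std::vector<std::future<void>> alive;
            for (unsigned j = 0; j < M; ++j)
                alive.push_back(std::async(std::launch::async, Func, std::ref(s)));
        }();  // this batch of M is waited for before the next batch starts
}
```

The difference from the proposal is that the vector puts the futures on the heap and has to be written out by hand, whereas '%' would keep them on the stack and register the destructors automatically.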

This will be fun to implement.

By the way, I've discarded the idea of using multiple percent signs such as %% or %%%. Instead, you'll have to use lambdas for finer control.