ISOCPP std-proposals List: Re: [std-proposals] Allowing coroutine_handle::from_promise to accept a pointer-interconvertible object

From: Lewis Baker <lewissbaker_at_[hidden]>
Date: Wed, 4 Jan 2023 16:54:27 +1030

On Tue, 3 Jan 2023 at 11:24 am, Aaron Jacobs via Std-Proposals <
std-proposals_at_[hidden]> wrote:

> Hello all,
>
> While working on a C++ coroutine library for use at Google I've found that
> some
> important functionality is hindered by a precondition on
> `std::coroutine_handle<Promise>::from_promise` that seems to be very
> slightly
> stronger than necessary, and I'm seeking feedback on a proposal to weaken
> it
> accordingly. I think this can be done while keeping existing
> implementations
> conforming.
>
> [coroutine.handle.con] currently lists the following precondition for
> `from_promise(Promise& p)`:
>
> > Preconditions: `p` is a reference to a promise object of a coroutine.
>
> I propose changing it to something like the following (feedback on wording
> welcome):
>
> > Preconditions: p is a reference to an object that is
> > pointer-interconvertible with the promise object of a coroutine, and
> > has the same alignment as the promise.
>
> ---
>
> The motivation for this is to allow iterating over a call chain of
> suspended
> coroutines as a linked list, despite the fact that the coroutines in the
> call
> chain may have different promise types. Promise types naturally form a
> singly-linked list via coroutine handles to resume when a coroutine
> completes:
>
> struct NaivePromise {
> [...]
>
> // The coroutine to resume when my coroutine finishes.
> std::coroutine_handle<NaivePromise> waiter;
> };
>
> But this doesn't work if there are multiple promise types involved, which
> is
> very common since a promise is likely to be templated on the coroutine's
> result
> type:
>
> template <typename Result>
> struct MoreRealisticPromise {
> [...]
>
> // The value co_returned by the coroutine.
> std::optional<Result> result;
>
> // The coroutine to resume when my coroutine finishes.
> std::coroutine_handle<???> waiter;
> };
>
> In the example above we could use `std::coroutine_handle<>` for the
> waiter, but
> then we could no longer use it as a link in a list because there would be
> no
> UB-free way to get hold of the next node given a current node. Ideally we
> would
> be able to do this instead:
>
> struct PromiseBase {
> // The coroutine to resume when my coroutine finishes.
> std::coroutine_handle<PromiseBase> waiter;
> };
>
> template <typename Result>
> struct Promise : PromiseBase {
> [...]
>
> // The value co_returned by the coroutine.
> std::optional<Result> result;
> };
>
> static_assert(std::is_pointer_interconvertible_base_of_v<
> PromiseBase, Promise<int>>);
> static_assert(alignof(PromiseBase) == alignof(Promise<int>));
>
> This retains the ability to both resume the waiter and to iterate over the
> chain of promises. Except it's not possible to do in a standard-compliant
> way
> as far as I can tell, because due to the precondition on `from_promise`
> there
> seems to be no UB-free wait to obtain a
> `std::coroutine_handle<PromiseBase>`. A
> reference to the `PromiseBase` sub-object of `Promise<int>` is not
> technically
> a reference to a coroutine's promise object.
>
> ---
>
> For example, consider this case where Foo co_awaits the result of a call
> Bar,
> which co_awaits the result of a call to Baz:
>
> Task<int32_t> Baz() {
> [...]
> co_return 0;
> }
>
> Task<int64_t> Bar() {
> co_return (co_await Baz()) + 1;
> }
>
> Task<void> Foo() {
> assert(co_await Bar() == 1);
> }
>
> While Baz is suspended, it is useful to be able iterate in reverse over the
> co_await chain Baz -> Bar -> Foo. I've found the need for this in two
> contexts
> so far:
>
> 1. To offer good async backtraces. For example if Baz crashes, then with a
> naive unwinding implementation the backtrace in the logs would show
> only the
> functions on the physical thread call stack. But that reveals only
> what most
> recently resumed Baz:
>
> Baz()
> WakeCoroutine()
> SomeRpcFinished()
> ABunchOfNetworkCode()
> [...]
> start_thread
>
> This is useful, but another useful aspect is what the logical
> call/await
> chain is among the coroutines:
>
> Baz()
> Bar()
> Foo()
> [...]
>
> See [1] for a similar effort in Rust.

For an example of how async stack traces were done at FB, see this blog
post:
https://developers.facebook.com/blog/post/2021/09/16/async-stack-traces-folly-Introduction/

and the code in folly::AsyncFrame also has some detailed docs:
https://github.com/facebook/folly/blob/37f0006ea9c0599948da6b3c639e7bb411ac71ee/folly/tracing/AsyncStack.h#L52

This design also was trying to work around some of the limitations that you
mention here. What was implemented was ultimately the best portable
approach that I could reach at the time, but did come at a nontrivial
runtime cost which required several additional pointers to be stored per
frame.

With some tweaks to the ABI of coroutines to force the offset between the
frame address and the promise address to be determined without needing to
know the size/alignment of the promise type, much of this overhead could be
eliminated.
Adopting this sort of coroutine ABI in all compilers would be a necessary
prerequisite for adopting the proposal you mention here.

2. To clean up after cancellation of a coroutine call/await chain without
> stack overflow, by destroying the callee's frame before the caller's
> frame.
>
> In my environment cancellation is expressed by synchronizing on being
> suspended and then destroying the coroutine frame without resuming from
> co_await/co_yield. This can be done by arranging for the task
> destructor to
> call `std::coroutine_handle<>::destroy` and then destroying the "root"
> coroutine, letting destructors take care of the rest. Except this may
> cause
> a stack overflow if there are many coroutines in the chain, since the
> compiler can't necessarily use a tail call in the task's destructor.
> This
> is out of character with everything else in the coroutine environment,
> which can typically be made to be use only O(1) stack frames by
> leaning on
> symmetric transfer of control.
>
> An alternative that avoids stack overflow is to have the cancellation
> process itself destroy the coroutine frames, from the leaf to the
> root, so
> that callee if correctly cleaned up before caller. But this requires
> the
> ability for the cancellation process to iterate over the coroutine
> handles:
>
> for (std::coroutine_handle<PromiseBase> h = leaf; h;) {
> const std::coroutine_handle<PromiseBase> next =
> h.promise().waiter;
>
> h.destroy();
> h = next;
> }
>

The main motivation for guaranteed tail-calls was to avoid potentially
unbounded stack growth when a coroutine awaits another coroutine in a loop
and where that nested coroutine can potentially complete synchronously.
Without tail-recursion, the caller would resume() the child coroutine which
would then run to completion synchronously within that call to resume() and
would then call resume() on the caller's coroutine to resume execution of
the loop and then next time around the loop we now have some extra frames
still on the stack during the execution of the next iteration. As loops can
potentially execute an arbitrary number of iterations then this can easily
run into stack-overflow.

Developers tend to be more careful about recursion as they've been trained
to avoid unbounded/deep recursion due to the risk of stack overflow with
normal functions and so I haven't really seen issues with stack-overflow
due to deep levels of recursion of coroutines in practice.

Do you have use cases where the recursion depth during destruction is a
concern here?

I think if we want to support tail-recursion for the unwind/cancellation
paths of coroutines then we ideally want to separate the concerns of
destruction from the concerns of unwinding.

Instead of calling destroy() to unwind, we should ideally have a separate
"resume_with_cancel" operation that we can resume the coroutine along a
different path - a cancellation path that unwinds from the current suspend
point back to final_suspend(), but without destroying the coroutine - and
only allow destruction of the coroutine frame when the coroutine is
suspended either at an initial or final suspend point. The fact that we
have combined "destruction" and "unwind/cancellation" into the same
operation causes problems in general for use-cases like async-generators
where we might want to cancel the generator but still perform some async
cleanup work.

This was something the papers P1662, P1663 and P1745 identified and
explored potential solutions to - having separate resumption paths along
with symmetric-transfer leads to a need to separate the suspend-point
(which represents many possible resumption paths) from the continuation
(which represents a chosen resumption path).
I'm not sure how viable taking an approach like that is now that we have
standardised C++20 coroutines, however.
It is an area for future investigation.

Nevertheless, I think that the approach suggested here would only be at
best a partial solution to the problem of wanting to unwind/cancel a chain
of coroutines.
e.g. the approach of manually walking the coroutine chain and calling
destroy() wouldn't work if we wanted to do async cleanup work during unwind.

> ---
>
> I'm not an expert in various compilers' implementations of coroutines, but
> from
> my limited understanding I think at least the major ones would all already
> be
> in compliance with the proposed more relaxed precondition.
>
> For example, clang's ABI is that the coroutine frame starts with two
> function
> pointers, followed by an integer index, followed by the promise object at
> an
> appropriate alignment. [2] [3]

I think from memory the integer index is placed after the promise in Clang.
Not all coroutines require an index to be stored so it is generally
excluded from the ABI stable part of the coroutine frame.
In theory, clang could retain the same ABI and just update the function
pointers during suspend rather than updating an index and keeping the
function-pointers unchanged.

There can also potentially be some padding between the function pointers
and the promise if the promise has an alignment requirement greater than
two pointers in size.
This is one of the reasons why you need to know the concrete promise type
alignment when translating between coroutine_handle<> address and its
promise object - under Clang, the offset from the address of the function
pointers to the address of the promise is dependent on the alignment of the
promise.

The libc++ implementation of `from_promise` is
> to use a clang intrinsic for getting base address of the coroutine frame
> that
> passes the address of the promise and its alignment [4], which by
> definition
> are identical for an object that is pointer-interconvertible and has the
> same
> alignment. The intrinsic's implementation [5] would have no trouble with
> this
> as far as I can tell, and in practice using the code above does work
> reliably
> in clang despite technically being UB.
>
> The story for libstdc++ and gcc seems to be identical [6] [7].

I believe that GCC and current MSVC implements the coroutine frame layout
that was discussed during a meeting about ABI in 2019.
See
https://docs.google.com/document/d/1t53lAuQNd3rPqN-VByZabwL6PL2Oyl4zdJxm-gQlhfU/edit

The difference between this ABI and the one Clang currently uses revolves
around where the padding necessary to correctly align the promise object is
placed.
With the Clang ABI the padding is placed between the function pointers and
the promise object.
With the ABI listed in the document above, the padding is placed before the
function pointers so that the address of the promise is always located
immediately after the function-pointers. This means that the layout of the
promise type does not need to be known in order to convert between the
address of the coroutine-frame and the address of the promise object, which
would enable the semantics required by your proposal.

Unfortunately, to my knowledge, I don't think the clang developers ever got
around to implementing that ABI.
The ABI for coroutines in clang would need to be updated before the change
you propose here would be implementable in clang.

I don't have access to the MSVC source, but their STL also uses an intrinsic
> that simply accepts the address of the promise [8]. In this case there
> isn't
> even an alignment provided. (I don't see how it can be correct to leave
> out an
> alignment; perhaps it's a bug and over-aligned promises are not correctly
> supported.)
>

In some earlier versions of MSVC c. 2017/2019, the layout of the coroutine
frame had the promise object placed before the function-pointers.
i.e. it had something like:

struct __frame {
  promise_type promise;
  void(*resumeFn)(void*); // <-- coroutine address() pointed here - used
for both resume/destroy
  std::uint32_t resumeIndex; // least-significant bit indicates (0 -
resume, 1 - destroy)
  bool elided_allocation;
  // ...
};

Where the address of the promise was computed by subtracting the size of
the promise (padded to alignment of a function pointer).

The initial implementation divergence in coroutine frame layout where some
of the compilers had an offset between the frame address and the promise
object that was dependent on the layout of the promise was the main reason
for the restrictions that are currently in place in C++20.

- Lewis

Received on 2023-01-04 06:24:42