On Sun, Sep 4, 2022 at 3:46 PM Nikl Kelbon <kelbonage@gmail.com> wrote:

> The design of std::generator allows an optional allocator type to be specified.
It is possible in my realisation too. Default allocator is std::allocator<std::byte>
In case when you want to not specify default constructible allocator you can do

generator<int> foo_coro(/*arguments with alloc*/)
generator<int> foo(/*arguments without alloc*/) {
return foo_coro(std::allocator_arg, Alloc{}, Args...);
}
In this case if allocator is empty compiler must optimize frame and do not store allocator arg and alloc on frame
I dont see why it must be part of generator type and pollute the interface + break vector<generator<X>> just for syntax sugar

What about the current design breaks vector<generator<X>>?

I can write:

std::vector<std::generator<int>> v;

v.push_back([]() -> std::generator<int> { co_yield 42; }());

v.push_back([](std::allocator_arg_t, SomeAlloc) { co_yield 42; }(std::allocator_arg, {}));

It's possible we could add an implicit/explicit cast from generator<Ref, Val, Alloc> to generator<Ref, Val, void>.

The point of the Alloc template arg was to make it more ergonomic to write coroutine functions that use custom stateless allocators without having to pollute the signature of the functions with allocator_arg arguments.

> There is no extra type-erasure needed for allocators.
> Coroutines are themselves inherently type-erased

I know that coroutines is a type erasure, but it is not clear that implementation will not use additional type erasure and why Alloc = void called 'type erasure case'

It's called the type-erasure case because the allocator is type-erased.

The allocator can be type-erased without additional type-erasure above what the coroutine provides.

It is only the deallocation that is type-erased. The allocation happens with static knowledge of the allocator type.

Whether or not the implementation chooses to do additional type-erasure or takes advantage of the coroutine's inherent type-erasure is QoI.

> There is indeed some overhead for supporting the recursive generator use-case

This is why i say "Pay for what you dont use"

You are free to use a different generator type if you don't want the overhead of the recursive-generator type.

If you think that we should also add a non-recursive generator type to the standard library then I'm sure that Library Evolution would welcome a paper on that.

> Current compilers generally struggle to inline/optimise operator++() as this necessarily needs an extra level of indirection to be able to resume different coroutines depending on what is the current leaf coroutine

Because it is recursive generator. Is there a benchmarks where just generator will lose perfomance compare to recusrive generator?

See "Performance & Benchmarks" section of P2168R3

Recursive generator has overhead even when compiler optimized operator++, we want to create many generators, use them like fast-write output / input iterators etc etc

There is _some_ overhead, yes. Is it significant?

Let's image that recursive generator in recursive case 10% better then just generator, but recursive generator used in 1% use cases when just generator needed in 99% cases. So we will have perfomance loss in general.

In cases where the generator can be completely inlined there is a significant win.

Non-recursive generator is more likely to be inlined here.

In cases where the generator is not inlined there is very little difference in performance between recursive and non-recursive (maybe 1-2%).

> Would you expect begin() to advance the iteration to the next element?
It can be just documented, that generator::begin uses resume, but why it need to be UNDEFINED behavior?

If the generator is already at the end() then calling begin() and having that unconditionally call resume() would be UB.

To guard against this you'd need to conditionally call resume(), adding a branch to begin().

I think I recall some feedback during the review process that this branch was unnecessary.

I dont see any case when compiler/implementation will optimize something using this ub. Why not impl defined eventually?

Making calling begin() multiple times have implementation-defined behaviour would make code that relies on that non-portable.

And what about case when generator created by move from other generator? Will it be UB to use begin first time for this generator?

Generally, you are not allowed to move a view/range after you start iterating over it.

If you have not started iterating over the generator, move it, then call begin() - that should be fine.

for(auto& x : gen) {
handle(x);
if(token.stop_requested())
break;
}
... // time to continue
for(auto& x :gen) // UB

Yes, if you are writing a generic algorithm across input_range objects then attempting to iterate over them twice like this would be UB for at least some types that model input_range.

I don't think input_range guarantees that you can call begin() multiple times, only forward_range or stronger gives you this.

If you are going to try to continue iterating over an input_range in different for-loops then just use an iterator and do the iteration manually.

> You are not handling alignment requirements of the allocator object here. Need to pad to an address that is a multiple of alignof(Alloc).

I know and i leave a big comment about it. I dont really know why operator new/delete used, why not just "allocate_frame" function like await_suspend/await_resume/get_return_object?

I guess it may allow some reuse of wording around allocation/deallocation functions but otherwise there's no reason why it needed to be spelled 'operator new'.

Its really logically incorrect, this operator new used for corotuine frame, but defined for promise type.

Yes. This is a known inconsistency.

And i dont see any guarantees about frame alignment, what alignment i must use?)

Generally, the caller of an allocation function should be able to guarantee that it can construct any object of the requested size where that object does not have new-extended-alignment.

So if it asks to allocate >=8 bytes then generally it should be 8 byte aligned.

If it asks for >=16 bytes then it should be 16 byte aligned, etc.

Up to __STDCPP_DEFAULT_NEW_ALIGNMENT__ which is the maximum alignment that the default operator new should be assumed to provide.

Above this alignment, implementations are supposed to call the align_val_t overloads to obtain memory with greater alignment.

However coroutines do not currently support dispatching to align_val_t overloads.

So the compiler can generally only assume at most __STDCPP_DEFAULT_NEW_ALIGNMENT__ and if the frame requires greater alignment then they over-allocate and dynamically align within the returned buffer.

Why i cant use alloc which user provided and it is stored on frame? It is overhead.

The problem here is that during deallocation the deallocation function called _after_ the parameter-copies stored in that allocated frame are destroyed.

So you need to store an additional copy of a stateful allocator outside of the frame so that it's still available at the time when the deallocation function is called.

And main problem - why ALLOCATOR support required?

Lots of code-bases have existing custom-allocators that they have implemented and they want to be able to reuse them to customise allocations of things.

It is useful to be able to reuse them to customise allocation of generator coroutine frames.

When it must be memory resource support. In this case it must be exactly one non typed allocation. Use allocator for this is too much.

The reference implementation by Casey rebinds the allocator to a trivial type that has size/alignment of __STDCPP_DEFAULT_NEW_ALIGNMENT__ and then uses the allocator to allocate an array of those objects big enough to host the requested size.

What about some std:: type which will store this logic like "Im exactly one allocation and deallocation from this object X, it can be memory resource or allocator, and im a tag-type like std::allocator_arg_t"

I'm really not following what you're suggesting here.

generator<int> x(std::one_allocation_from<Alloc>);

> If you are storing the promise-pointer anyway, you could store a pointer to the base-class instead.
i dont see why std::coroutine_handle<void>::promise is not exist.

MSVC implementation dont use nothing from promise type to get promise(just reinterpret_cast), llvm implementation uses alignment, but why dont provide alignment in promise method?

The MSVC ABI places the alignment padding before the stable-ABI part (resume-fn/destroy-fn pointers) which allows it to place the promise immediately after the function-pointers.

The clang implementation places the alignment padding between the function-pointers and promise and so we need to know the promise alignment in order to calculate the address of the promise given the address of the function-pointers.

The clang implementation could be changed to use the same ABI as MSVC - in which case it would be easier to implement a coroutine_handle<void>::promise() that returned the address of the promise.

A paper would need to be written proposing to add this to coroutine_handle<void> and implementations for Clang and GCC would need to be changed so that their ABI allowed calculation of the promise address without knowing the promise-type.

> currently implemented coroutine ABIs
I dont see how to do not change ABI with coroutines. Changing code in coroutine may change ABI, optimization in corotuine frame may change ABI

It is only the ABI-stable portion that needs to be considered here. ie. the ABI of the coroutine that coroutine_handle interacts with.

The rest of the layout of the frame is type-erased and is handled by the relevant resume/destroy function pointers.

Changing the body of a coroutine function does not change the ABI of the coroutine-frame.

It is not an ABI break for those other aspects of the layout to change since things in other translation units will only ever interact with them through the coroutine_handle type, which only interacts with the ABI-stable part of the coroutine-frame.

вс, 4 сент. 2022 г. в 06:07, Lewis Baker <lewissbaker@gmail.com>:

On Sat, Sep 3, 2022 at 6:19 PM Nikl Kelbon via Std-Proposals <std-proposals@lists.isocpp.org> wrote:
std::generator has huge problems.
I will briefly list them:
1. The generator must not depend on the allocator (only the promise_type depends), but std::generator has a template argument for alloc, in particular, this prohibits adding 2 generators that allocated memory differently into a container. It is possible to implement this without type erasure.
Link to example implementation:
https://github.com/kelbon/dd_generator/blob/4053fcff456c885d9330708d2e67f103c4212319/include/dd_generator.hpp#L264

The design of std::generator allows an optional allocator type to be specified.

If you do not specify it then the allocator is determined solely by the arguments.
If there is a std::allocator_arg + allocator object pair passed as the first two arguments of the coroutine then it will use that.
Otherwise, it will fall back to using std::allocator by default.
In this way it is possible to create a container of std::generator objects that have different allocator types by just not specifying the allocator in the template argument.

If the template argument is specified then either:
- the allocator is stateless in which case it uses a default-constructed allocator and avoids the user needing to explictly specify the allocator
- the allocator is stateful and the caller still needs to pass an instance of the allocator using std::allocator_arg

The main motivation for allowing the allocator as a template argument to std::generator is to allow the stateless-allocator use-case to be more ergonomic.
e.g. so we can define an alias

template<typename Ref, typename Val = remove_cvref_t<Ref>>
using my_generator = std:::generator<Ref, Val, my_stateless_allocator>;

And then can just use my_generator throughout my program and it will just use the custom allocator without having to worry about passing std::allocator_arg to every coroutine function.

There is no additional type-erasure necessary for supporting these use-cases.
We can just leverage the inherent type-erasure of coroutines in general to handle this.
e.g. see https://github.com/lewissbaker/generator/blob/main/include/__generator.hpp#L277-L330

2. Overhead for ueslss default type erasure(default behavior) and stack in generator (used only for element_of, which is basically for(auto&& v : gen) co_yield v;
Its just uselss
You pay for what you dont use.

What "useless default type erasure" are you referring to here?
There is no extra type-erasure needed for allocators.
Coroutines are themselves inherently type-erased (which we can take advantage of).

There is indeed some overhead for supporting the recursive generator use-case.
Current compilers generally struggle to inline/optimise operator++() as this necessarily needs an extra level of indirection to be able to resume different coroutines depending on what is the current leaf coroutine.

The recursive generator is more efficient for recursive use-cases but less efficient for non-recursive use-cases.
Supporting 'co_yield elements_of(x)' is not possible without the recursive support.

We discussed whether it would be better to add a non-recursive generator or a recursive generator.
I think the consensus was that it was much harder to write your own recursive generator and so that had more value to exist in the standard library.
We can add another non-recursive generator type in future if compilers are unable to optimise the recursive generator sufficiently.

3. It is undefined behavior to call begin for std::generator twice. But iteraing not all values in one loop is a very logical action and expected behavior - just keep generating values
Besides, such an error is extremely non-obvious when using ranges or range based for loop.

This is a typical restriction provided by many other input_range implementations.

Would you expect begin() to advance the iteration to the next element?
Or would you expect to be able to call begin() multiple times and each time just get a copy of the iterator pointing to the same element?
Supporting either of these would potentially affect performance - either requiring more state to be held in the generator to keep track of whether this was the first call to begin() or not, or require an extra indirection when starting the coroutine for the first time (since a subsequent call to begin() might need to resume a different coroutine than the top-level coroutine).

4. interface that is very obscure to the user and requires knowledge of implementation to write efficient code
No one will write generator<const T&> foo(); without knowing why it is needed and why it is better for perfmornace

Can you please expand on what part of this you think is "obscure"?
The user needs to make a choice about whether they want to yield a sequence of lvalues, const-lvalues, rvalues.
This will affect what kinds of values they can pass to co_yield and whether they incur extra copies/etc.

But there are way to write generator without such problems
https://github.com/kelbon/dd_generator

Line 43: You are not handling alignment requirements of the allocator object here. Need to pad to an address that is a multiple of alignof(Alloc).
Line 32,48: You also need to check for std::allocator_traits<Alloc>::is_always_equal

The need for any_coroutine_handle (which stores the coroutine_handle and the promise pointer) and reinterpret_cast to get back to the 'value' data-member seems like a fairly unsafe way of implementing this.
If you are storing the promise-pointer anyway, you could store a pointer to the base-class instead.
e.g. see my generator<Ref, Val, use_allocator_arg> implementation

Note that we cannot have coroutine_handle<void> return a pointer to the promise with the currently implemented coroutine ABIs because the offset between the frame pointer and the promise address is not necessarily a constant - there may be padding added to align a promise type with alignment >= 2*sizeof(void*). We would either need to change the ABI so that the padding was placed before the function-pointers, or we'd need to store an extra field to record the offset of the promise so we can compute its address correctly.

I would also worry about the void* obtained from the coroutine_handle<void>::promise() being used in an unsafe way
- it could be confused with address()/from_address()
- you need to use reinterpret_cast<> which is difficult to get right without invoking UB
These are minor concerns, however.

- Lewis.