Bug: Very inefficient code generated for async functions setup (and likely for generators in general) #99504

Open
@schreter

Description

What we see in our project, which also uses larger Futures, is a lot of unnecessary memory copying. These memcpy() calls are the hottest function in the profile (and I mean, in some cases, very dominant, like 90%, even at optimization level 3). I searched for similar issues, but found none, so here we go:

What happens:

  • Calling an async function in fact calls a stub producing a Future (which is in fact a generator that is later polled). This Future is stored in the caller's async function "stack" (i.e., its Future), so the caller's Future is the aggregate of the parent function's state and the called function's Future (with some layout optimizations when calling multiple async functions in independent blocks).
  • Unfortunately, instead of telling the Future generator to materialize the Future directly into the proper place in the caller's state, the Future is first materialized on the regular stack and then copied from the stack to the caller's Future.
  • Now, we have a loop calling a request handler which supports various request types (dispatched by a match statement), one or more of which produce a large Future. The call to the request handler then materializes a Future by writing a few words to the stack, and this (practically empty) space is then copied in full from the stack to the caller's Future (i.e., including the uninitialized space - it's simply a binary move).

This wastes cycles like there is no tomorrow - instead of materializing the callee's Future on the stack, async functions should materialize the callee's Future directly in-place in the caller's Future. That would save the copying (and, especially, copying of still uninitialized space!).

A minimal example in Compiler Explorer is here: https://godbolt.org/z/b45MTex3e. You can see that callee().await first materializes the Future on the stack and then copies it into the proper place.

Generated instructions (Intel):

        sub     rsp, 520    # in function prelude, reserve space for temporaries
...
        mov     r15, rdi    # in function prelude, arg0 (&mut Future) of the caller
...
        mov     rbx, rsp    # the address of a temporary for callee's Future
        mov     rdi, rbx    # set as arg0 of the stub generating the Future
        call    qword ptr [rip + example::callee@GOTPCREL]
        mov     edx, 516    # size of the callee's Future (including any uninitialized stuff)
        mov     rdi, r15    # position of the callee's Future inside of caller's Future
        mov     rsi, rbx    # temporary variable with callee's Future
        call    qword ptr [rip + memcpy@GOTPCREL]

What I'd expect to see:

        (no space reservation for the temporary of callee's Future)
...
        mov     r15, rdi    # in function prelude, arg0 (&mut Future) of the caller
...
        mov     rdi, r15    # set as arg0 of the stub generating the Future to the proper position
        call    qword ptr [rip + example::callee@GOTPCREL]
        (no memcpy, since the callee's Future is already materialized in the right place)

This might be related to #97540; I also posted it there first.

Interestingly, the same problem does NOT happen when calling a function producing a large result and storing it in a variable in the async closure, then using that variable later. In that case, the function producing the large result writes the value directly into the future's state. The same is true when storing the large generated future in a variable, pinning it explicitly and awaiting it (as demonstrated via https://godbolt.org/z/dWzoqEjh1).

We found some temporary workarounds for the problem: boxing some especially large futures and/or the above-mentioned explicit pinning. This improves performance somewhat, but memory allocation is not particularly cheap either. Further, hunting down these issues requires a lot of analysis time (since it's also not possible to set a warning level to warn about large futures). Therefore, these workarounds are not practical.

A real solution to the problem, removing the unnecessary memcpy, would be very desirable, since it would help the performance of async code in general. It looks like some move optimization is missing.

BTW, I tried to post this directly as a bug, but the "Submit new issue" button was grayed out. Therefore, I'm submitting it as a blank issue.

Metadata


    Labels

    A-mir-opt (Area: MIR optimizations), C-optimization (Category: An issue highlighting optimization opportunities or PRs implementing such), I-slow (Issue: Problems and improvements with respect to performance of generated code)
