Description
What we see in our project, which also uses larger Futures, is a lot of unnecessary memory copying. These memcpy() calls are the hottest function in the profile (in some cases very dominant, around 90%, even at optimization level 3). I searched for similar issues but found none, so here we go:
What happens:
- Calling an async function in fact calls a stub that produces a Future (which is in fact a generator that is polled later). This Future is stored in the caller's async function "stack" (i.e., its Future), so the caller's Future is the aggregate of the parent function's state and the called function's Future (with some layout optimizations when multiple async functions are called in independent blocks).
- Unfortunately, instead of telling the Future generator to materialize the Future directly into the proper place in the caller's state, the Future is first materialized on the regular stack and then copied from the stack to the caller's Future.
- Now, we have a loop calling a request handler which supports various request types (dispatched by a match statement), where one or more of them produce a large Future. The call to the request handler then materializes a Future by writing a few words to the stack, and this (practically empty) space is copied in full from the stack to the caller's Future (i.e., including the uninitialized space; it's simply a binary move).
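To make the request-handler case concrete, here is a minimal sketch of the shape described above. It is not our production code and not the code behind the Compiler Explorer link; the names Request, handle and handle_bulk and the 512-byte buffer are hypothetical. It only shows that the handler's Future has the same (large) size for every request type, so a byte-for-byte move of it costs the same even for a cheap request.

```rust
use std::mem::size_of_val;

// Hypothetical request types: one cheap, one that needs a large buffer.
enum Request {
    Ping,
    Bulk(u8),
}

// Hypothetical handler whose buffer is live across an .await, so the
// buffer has to be stored inside this function's Future.
async fn handle_bulk(seed: u8) -> u8 {
    let buf = [seed; 512];
    std::future::ready(()).await;
    buf[511]
}

// Dispatch by match. The Future returned by handle() must be able to hold
// handle_bulk's Future, whichever arm is actually taken at run time.
async fn handle(req: Request) -> u8 {
    match req {
        Request::Ping => 0,
        Request::Bulk(seed) => handle_bulk(seed).await,
    }
}

fn main() {
    let ping = handle(Request::Ping);
    let bulk = handle(Request::Bulk(1));
    // Both values have the same type, hence the same size.
    println!("Ping future: {} bytes", size_of_val(&ping));
    println!("Bulk future: {} bytes", size_of_val(&bulk));
}
```

The exact sizes depend on the compiler version, which is why the sketch prints them instead of assuming a number.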
This wastes cycles like there is no tomorrow. Instead of materializing the callee's Future on the stack, async functions should materialize the callee's Future directly in place in the caller's Future. That would save the copying (and, especially, the copying of still uninitialized space!).
A minimal example in Compiler Explorer is here: https://godbolt.org/z/b45MTex3e. You can see that callee().await first materializes the Future on the stack, and then it is copied into the proper place.
Generated instructions (Intel):
sub rsp, 520 # in function prelude, reserve space for temporaries
...
mov r15, rdi # in function prelude, arg0 (&mut Future) of the caller
...
mov rbx, rsp # the address of a temporary for callee's Future
mov rdi, rbx # set as arg0 of the stub generating the Future
call qword ptr [rip + example::callee@GOTPCREL]
mov edx, 516 # size of the callee's Future (including any uninitialized stuff)
mov rdi, r15 # position of the callee's Future inside of caller's Future
mov rsi, rbx # temporary variable with callee's Future
call qword ptr [rip + memcpy@GOTPCREL]
What I'd expect to see:
(no space reservation for the temporary of callee's Future)
...
mov r15, rdi # in function prelude, arg0 (&mut Future) of the caller
...
mov rdi, r15 # set as arg0 of the stub generating the Future to the proper position
call qword ptr [rip + example::callee@GOTPCREL]
(no memcpy, since the callee's Future is already materialized in the right place)
This might be related to #97540, I also posted it there first.
Interestingly, the same problem does NOT happen when calling a function that produces a large result, storing that result in a variable in the async closure, and using the variable later. In that case, the function writes the value directly into the future's state. The same is true when storing the large generated future in a variable, pinning it explicitly and awaiting it (as demonstrated via https://godbolt.org/z/dWzoqEjh1).
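For readers who do not want to open the link, the "store in a variable, pin explicitly, await" form can be sketched roughly as follows. This is an illustration written for this report, not the code behind the link: callee, caller_direct, caller_pinned and the tiny block_on helper are hypothetical names, and std::pin::pin! is just one way to pin a local. Whether the copy actually disappears depends on the compiler version, so check the generated assembly.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Hypothetical callee with a large state (buffer live across an .await).
async fn callee(seed: u8) -> u8 {
    let buf = [seed; 512];
    std::future::ready(()).await;
    buf[511]
}

// The direct form from the report: the callee's Future is awaited in place.
async fn caller_direct() -> u8 {
    callee(7).await
}

// The workaround form: bind the Future to a pinned local, then await it.
async fn caller_pinned() -> u8 {
    let fut = pin!(callee(7));
    fut.await
}

// Waker that does nothing; enough for futures that never really suspend.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

// Tiny busy-polling executor, only here to make the sketch runnable.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

fn main() {
    // Both forms compute the same value; only the codegen may differ.
    println!("direct: {}", block_on(caller_direct()));
    println!("pinned: {}", block_on(caller_pinned()));
}
```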
We found some temporary workarounds for the problem: boxing some especially large futures and/or using the above-mentioned pinned-variable form. This improves performance somewhat, but memory allocation is not particularly cheap either. Further, hunting down these issues requires a lot of analysis time (since it's also not possible to set a warning level to warn about large futures). Therefore, these workarounds are not practicable.
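The boxing workaround looks roughly like the sketch below. Again, the names callee, caller_inline and caller_boxed and the 512-byte buffer are hypothetical. Box::pin moves the callee's Future to the heap, so the caller's Future only has to hold a pointer, at the price of one allocation per call.

```rust
use std::mem::size_of_val;

// Hypothetical callee with a large state (buffer live across an .await).
async fn callee(seed: u8) -> u8 {
    let buf = [seed; 512];
    std::future::ready(()).await;
    buf[511]
}

// Inline await: the callee's Future is embedded in the caller's Future.
async fn caller_inline() -> u8 {
    callee(7).await
}

// Boxed await: only the Pin<Box<..>> pointer is live across the .await.
async fn caller_boxed() -> u8 {
    Box::pin(callee(7)).await
}

fn main() {
    println!("inline caller: {} bytes", size_of_val(&caller_inline()));
    println!("boxed caller:  {} bytes", size_of_val(&caller_boxed()));
}
```

Boxing shrinks what gets moved around afterwards, but the callee's Future is still created first and then moved into the Box, so this trades the copies for an allocation and does not remove the underlying problem.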
A real solution to the problem, removing the unnecessary memcpy, would be very desirable, since it would help the performance of async code in general. It looks like some move optimization is missing.
BTW, I tried to post this directly as a bug, but the "Submit new issue" was grayed out. Therefore, I'm submitting it as a blank issue.