Bug: Very inefficient code generated for async functions setup (and likely for generators in general) #99504

Open
@schreter

Description

What we see in our project, which also uses larger Futures, is a lot of unnecessary memory copying. These memcpy() calls are the hottest function in the profile (and I mean, in some cases, very dominant, like 90%, even at optimization level 3). I searched for similar issues, but found none, so here we go:

What happens:

  • Calling an async function in fact calls a stub producing a Future (which is in fact a generator that is later polled). This Future is stored in the caller's async function "stack" (i.e., its Future), so the caller's Future is the aggregate of the parent function's state and the called function's Future (with some layout optimizations when calling multiple async functions in independent blocks).
  • Unfortunately, instead of telling the Future generator to materialize the Future directly into the proper place in the caller's state, the Future is first materialized on the regular stack and then copied from the stack to the caller's Future.
  • Now, we have a loop calling a request handler which supports various request types (dispatched by a match statement), one or more of which produce a large Future. The call to the request handler then materializes a Future by writing a few words to the stack, and this (practically empty) space is then copied in full from the stack to the caller's Future (i.e., including the uninitialized space - it's simply a binary move).

This wastes cycles like there is no tomorrow - instead of materializing the callee's Future on the stack, async functions should materialize the callee's Future directly in-place in the caller's Future. That would save the copying (and, especially, copying of still uninitialized space!).

A minimal example in Compiler Explorer is here: https://godbolt.org/z/b45MTex3e. You can see that callee().await first materializes the Future on the stack and then copies it into the proper place.

Generated instructions (Intel):

        sub     rsp, 520    # in function prelude, reserve space for temporaries
...
        mov     r15, rdi    # in function prelude, arg0 (&mut Future) of the caller
...
        mov     rbx, rsp    # the address of a temporary for callee's Future
        mov     rdi, rbx    # set as arg0 of the stub generating the Future
        call    qword ptr [rip + example::callee@GOTPCREL]
        mov     edx, 516    # size of the callee's Future (including any uninitialized stuff)
        mov     rdi, r15    # position of the callee's Future inside of caller's Future
        mov     rsi, rbx    # temporary variable with callee's Future
        call    qword ptr [rip + memcpy@GOTPCREL]

What I'd expect to see:

        (no space reservation for the temporary of callee's Future)
...
        mov     r15, rdi    # in function prelude, arg0 (&mut Future) of the caller
...
        mov     rdi, r15    # set as arg0 of the stub generating the Future to the proper position
        call    qword ptr [rip + example::callee@GOTPCREL]
        (no memcpy, since the callee's Future is already materialized in the right place)

This might be related to #97540; I also posted it there first.

Interestingly, the same problem does NOT happen when calling a function producing a large result and storing it in a variable in the async closure, then using that variable later. In that case, the function producing the large result writes the value directly into the future's state. The same is true when storing the large generated future in a variable, pinning it explicitly and awaiting it (as demonstrated via https://godbolt.org/z/dWzoqEjh1).

We found some temporary workarounds for the problem: boxing some especially large futures and/or the above-mentioned explicit pinning. This improves performance somewhat, but memory allocation is not particularly cheap either. Further, hunting down these issues requires a lot of analysis time (since it's also not possible to set a warning level to warn about large futures). Therefore, these workarounds are not practical.

A real solution to the problem, removing the unnecessary memcpy, would be very desirable, since it would help the performance of async code in general. It looks like some move optimization is missing.

BTW, I tried to post this directly as a bug, but the "Submit new issue" button was grayed out. Therefore, I'm submitting it as a blank issue.

Metadata


    Labels

    A-mir-opt (Area: MIR optimizations), C-optimization (Category: An issue highlighting optimization opportunities or PRs implementing such), I-slow (Issue: Problems and improvements with respect to performance of generated code)
