Micro-optimize the __morestack fast path #3565

Closed
@brson

This is very performance-critical code used for growing the stack, and it currently wastes a lot of instructions on the non-allocating fast path. There are a number of distinct optimizations we can identify.

Here's what happens after calling into __morestack, on the fast path:

  • Set up the frame pointer
  • Push all possible argument registers of the calling function in case the call to upcall_new_stack clobbers them
  • Shuffle the argument registers from the __morestack custom calling convention registers to the C calling convention registers used by upcall_new_stack
  • Call upcall_new_stack, through the indirection of the dynamic linker
  • Call get_sp_limit, an entire assembly function consisting of movq %fs:112, %rax
  • Compare the sp_limit to 0 and don't branch to the rust_get_current_task slow path. This branch always makes the same decision during a __morestack call.
  • Do some math to find the task pointer from the stack limit
  • Check the stack canary to make sure we haven't run off the end of the stack
  • Assert that the task pointer is not null
  • Get the minimum stack size
  • Do some simple math and pointer indirections to determine if task->stk->next is a big enough stack segment to use
  • Assert some invariants
  • memcpy the arguments from the old stack to the new stack
  • Align the new stack frame
  • Call reuse_valgrind_stack to give valgrind hints
  • Call record_stack_limit to execute another single instruction
  • Return the stack pointer to __morestack
  • Pop all the saved argument registers
  • Finally, call the original function
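
The segment-reuse check in the middle of that list can be sketched roughly as follows. This is a minimal illustration, not the runtime's actual code: the struct layouts, field names, and the `try_reuse_next` function are all assumptions made for the example, standing in for the "is task->stk->next big enough" step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical segment header; the field names are illustrative only. */
struct stk_seg {
    struct stk_seg *prev;  /* segment we grew from */
    struct stk_seg *next;  /* cached segment kept around for reuse, or NULL */
    uintptr_t end;         /* one past the highest usable address */
    uint8_t *data;         /* lowest usable address */
};

/* Hypothetical task record holding the active segment. */
struct task {
    struct stk_seg *stk;
    size_t min_stack_size;
};

/* Usable bytes in a segment. */
static size_t seg_size(const struct stk_seg *seg) {
    return (size_t)(seg->end - (uintptr_t)seg->data);
}

/*
 * Fast path: switch to the cached next segment if it can hold the request.
 * Returns the segment on success, or NULL when the caller must take the
 * allocating slow path.
 */
static struct stk_seg *try_reuse_next(struct task *task, size_t requested) {
    assert(task != NULL && task->stk != NULL);
    size_t needed = requested > task->min_stack_size
                        ? requested : task->min_stack_size;
    struct stk_seg *next = task->stk->next;
    if (next != NULL && seg_size(next) >= needed) {
        task->stk = next;
        return next;
    }
    return NULL;
}
```

Even in this stripped-down form the check costs several dependent loads (task, then stk, then next, then its bounds), which is the kind of indirection the fast path pays for on every call.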

And returning from the segment:

  • Call upcall_del_stack through the dynamic linker
  • Call get_sp_limit, an entire function consisting of movq %fs:112, %rax
  • Compare the sp_limit to 0, etc.
  • Check the stack canary to make sure we haven't run off the end of the stack
  • Assert that the task pointer is not null
  • Update the current stack pointer in the task
  • Call record_stack_limit
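
As a rough sketch of the return path, assuming a hypothetical segment layout and a plain variable standing in for the thread-local slot at %fs:112, the work reduces to a pointer swap and one store once the helpers are inlined. `RED_ZONE_SIZE`, `sp_limit_slot`, and `del_stack` are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: bytes reserved below the limit for the __morestack call. */
#define RED_ZONE_SIZE 256

/* Hypothetical segment and task layouts, for illustration only. */
struct stk_seg {
    struct stk_seg *prev;
    struct stk_seg *next;
    uintptr_t end;
    uint8_t *data;
};

struct task {
    struct stk_seg *stk;
};

/* Stand-in for the thread-local slot read by movq %fs:112, %rax. */
static uintptr_t sp_limit_slot;

/* Each helper is a single load or store, so it is a natural inline. */
static inline uintptr_t get_sp_limit(void) { return sp_limit_slot; }
static inline void record_stack_limit(uintptr_t limit) { sp_limit_slot = limit; }

/* Return to the previous segment and publish its stack limit. */
static void del_stack(struct task *task) {
    assert(task != NULL);              /* assert that the task pointer is not null */
    struct stk_seg *prev = task->stk->prev;
    assert(prev != NULL);              /* there must be a segment to return to */
    task->stk = prev;                  /* update the current segment in the task */
    record_stack_limit((uintptr_t)prev->data + RED_ZONE_SIZE);
}
```

The segment being left stays linked as `prev->next` in this sketch, so a later call to __morestack can reuse it without allocating.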

Potential optimizations:

  • Don't save the frame pointer - This could be tricky to make work with DWARF unwinding, due to the odd frame shapes around __morestack. Will be easier after rolling our own unwinder (Invoke instructions kick us off the FastISel path #3551).
  • Inline get_sp_limit, record_stack_limit (Inline get_sp_limit, set_sp_limit, get_sp runtime functions #2521)
  • Statically link upcall_new_stack and upcall_del_stack, hitting new dynamically linked upcalls for the slow path
  • Create a new version of rust_get_current_task that doesn't have a fallback path for the case when the task pointer can't be retrieved from the stack segment. Use it from upcall_new_stack/del_stack.
  • Consider saving the task pointer between upcall_new_stack/del_stack to avoid calculating it again
  • Do fewer pointer indirections and calculations to verify the suitability of the stack segment, possibly storing more information directly in the stack segment header, never accessing the task pointer directly. (See also Remove unnecessary logic in new_stack_fast #3566).
  • Put all asserts under the compile-time debug flag, including the canary check
  • Put the valgrind hinting under a debug flag too. I believe it does have a runtime penalty.
  • Ensure that upcall_new_stack doesn't use xmm registers and remove the xmm saves and restores in __morestack (Stop saving floating point registers in __morestack #2043)
  • Inline upcall_del_stack into __morestack
  • Write the entire fast path in assembly
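
Two of these ideas, debug-only asserts and keeping the needed information in the segment header, might combine along the following lines. This is only a sketch under assumptions: the `RT_DEBUG` flag, the `RT_ASSERT` macro, the canary value, and the `seg_header` layout with its cached `next_size` field are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical compile-time debug flag: checks vanish in release builds. */
#ifdef RT_DEBUG
#define RT_ASSERT(cond) assert(cond)
#else
#define RT_ASSERT(cond) ((void)0)
#endif

/* Hypothetical canary value written into every segment header. */
#define CANARY_VALUE 0xABCDABCDu

/*
 * Hypothetical header that caches the size of the next segment, so the
 * fast path never has to touch the task or the next segment itself.
 */
struct seg_header {
    uint32_t canary;
    size_t next_size;  /* usable bytes in the cached next segment, 0 if none */
};

/* Returns 1 if the cached next segment can hold `needed` bytes, else 0. */
static int can_reuse_next(const struct seg_header *cur, size_t needed) {
    RT_ASSERT(cur != NULL);                     /* debug builds only */
    RT_ASSERT(cur->canary == CANARY_VALUE);     /* debug builds only */
    return cur->next_size != 0 && cur->next_size >= needed;
}
```

Without `RT_DEBUG` defined, the function body is a single load and compare, at the cost of having to keep `next_size` up to date whenever a segment is allocated or freed.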

Labels: A-runtime (Area: std's runtime and "pre-main" init for handling backtraces, unwinds, stack overflows), I-slow (Issue: Problems and improvements with respect to performance of generated code)
