This is very performance critical code used for growing the stack, and it currently wastes a lot of instructions on the non-allocating fast path. There are a number of distinct optimizations we can identify.
Here's what happens after calling into `__morestack`, on the fast path:
- Set up the frame pointer
- Push all possible argument registers of the calling function in case the call to `upcall_new_stack` clobbers them
- Shuffle the argument registers from the `__morestack` custom calling convention registers to the C calling convention registers used by `upcall_new_stack`
- Call `upcall_new_stack`, through the indirection of the dynamic linker
- Call `get_sp_limit`, an entire assembly function consisting of `movq %fs:112, %rax`
- Compare the `sp_limit` to 0 and don't branch to the `rust_get_current_task` slow path. This branch always makes the same decision during a `__morestack` call.
- Do some math to find the `task` pointer from the stack limit
- Check the stack canary to make sure we haven't run off the end of the stack
- Assert that the task pointer is not null
- Get the minimum stack size
- Do some simple math and pointer indirections to determine if `task->stk->next` is a big enough stack segment to use
- Assert some invariants
- memcpy the arguments from the old stack to the new stack
- Align the new stack frame
- Call `reuse_valgrind_stack` to give valgrind hints
- Call `record_stack_limit` to execute another single instruction
- Return the stack pointer to `__morestack`
- Pop all the saved argument registers
- Finally, call the original function
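
As a rough illustration of the middle of that sequence (the size check on the next segment, the argument copy, and the frame alignment), here is a minimal sketch in C. The `stk_seg` layout, the `new_stack_fast` signature, and the 16-byte alignment are assumptions made for the example, not the runtime's actual definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical segment descriptor; the real runtime layout may differ. */
typedef struct stk_seg {
    struct stk_seg *next; /* next (already allocated) segment, or NULL */
    uint8_t *base;        /* lowest usable address of this segment */
    size_t size;          /* usable bytes in this segment */
} stk_seg;

/* Non-allocating fast path: if the next segment is big enough, copy the
 * arguments to its top, align the new stack pointer down to 16 bytes
 * (assumed ABI alignment), and return it. Returns NULL when the caller
 * has to fall back to the allocating slow path. */
static uint8_t *new_stack_fast(const stk_seg *cur, size_t requested,
                               const void *args, size_t args_size) {
    assert(cur != NULL);
    assert(args != NULL || args_size == 0);

    const stk_seg *next = cur->next;
    /* The extra 16 bytes leave room for the alignment adjustment below. */
    if (next == NULL || next->size < requested + args_size + 16) {
        return NULL;
    }

    uintptr_t top = (uintptr_t)next->base + next->size;
    uint8_t *sp = (uint8_t *)((top - args_size) & ~(uintptr_t)15);
    if (args_size != 0) {
        memcpy(sp, args, args_size);
    }
    return sp;
}
```

Everything in this sketch is a handful of loads, one comparison, and a copy, which is the amount of work the fast path arguably needs; the other steps in the list are overhead around it.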
And returning from the segment:
- Call `upcall_del_stack` through the dynamic linker
- Call `get_sp_limit`, an entire function consisting of `movq %fs:112, %rax`
- Compare the `sp_limit` to 0, etc.
- Check the stack canary to make sure we haven't run off the end of the stack
- Assert that the task pointer is not null
- Update the current stack pointer in the task
- Call `record_stack_limit`
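
Both paths call `get_sp_limit` and `record_stack_limit`, each of which does a single instruction's worth of work behind a full function call. A minimal sketch of what inlinable accessors could look like follows. It uses a C11 `_Thread_local` variable as a portable stand-in for the `%fs:112` slot, which is an assumption for illustration; the real runtime reads the slot directly in assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the thread-local slot the runtime keeps at %fs:112.
 * Hypothetical: a real implementation would address the slot directly. */
static _Thread_local uintptr_t sp_limit_slot;

/* Read the current stack limit; small enough for the compiler to inline. */
static inline uintptr_t get_sp_limit(void) {
    return sp_limit_slot;
}

/* Record a new stack limit. A limit of 0 is treated as "not set",
 * matching the compare-against-0 step described above. */
static inline void record_stack_limit(uintptr_t limit) {
    assert(limit != 0);
    sp_limit_slot = limit;
}
```

With definitions like these visible to the caller, each call site can reduce to a single thread-local load or store.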
Potential optimizations:
- Don't save the frame pointer - This could be tricky to make work with DWARF unwinding, due to the odd frame shapes around `__morestack`. Will be easier after rolling our own unwinder (Invoke instructions kick us off the FastISel path, #3551).
- Inline `get_sp_limit`, `record_stack_limit` (Inline get_sp_limit, set_sp_limit, get_sp runtime functions, #2521)
- Statically link `upcall_new_stack` and `upcall_del_stack`, hitting new dynamically linked upcalls for the slow path
- Create a new version of `rust_get_current_task` that doesn't have a fallback path for the case when the task pointer can't be retrieved from the stack segment. Use it from `upcall_new_stack`/`upcall_del_stack`.
- Consider saving the task pointer between `upcall_new_stack`/`upcall_del_stack` to avoid calculating it again
- Do fewer pointer indirections and calculations to verify the suitability of the stack segment, possibly storing more information directly in the stack segment header, never accessing the task pointer directly. (See also Remove unnecessary logic in new_stack_fast, #3566.)
- Put all asserts under the compile-time debug flag, including the canary check
- Put the valgrind hinting under a debug flag too. I believe it does have a runtime penalty.
- Ensure that `upcall_new_stack` doesn't use xmm registers and remove the xmm saves and restores in `__morestack` (Stop saving floating point registers in __morestack, #2043)
- Inline `upcall_del_stack` into `__morestack`
- Write the entire fast path in assembly
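
To make the "asserts under a compile-time debug flag" item concrete, here is a minimal sketch in C. The `RT_DEBUG` flag, the `RT_ASSERT` macro, the `seg_header` layout, and the canary value are all hypothetical names chosen for the example, not the runtime's own.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical build flag: checks are compiled in only when RT_DEBUG is set. */
#ifdef RT_DEBUG
#define RT_ASSERT(cond) assert(cond)
#else
#define RT_ASSERT(cond) ((void)0)
#endif

/* Hypothetical canary value and segment header layout. */
#define CANARY_VALUE 0xABCDABCDABCDABCDull

typedef struct {
    uint64_t canary; /* overwritten if the stack runs off the end */
    uintptr_t end;   /* highest address of the segment */
} seg_header;

/* True while the canary word is untouched. */
static inline int canary_intact(const seg_header *seg) {
    return seg->canary == CANARY_VALUE;
}

/* Fetch the segment end. In release builds the null and canary checks
 * compile away, leaving a single load on the fast path. */
static inline uintptr_t seg_end_checked(const seg_header *seg) {
    RT_ASSERT(seg != 0);
    RT_ASSERT(canary_intact(seg));
    return seg->end;
}
```

The trade-off is that a corrupted canary goes unnoticed in release builds, so this only makes sense if debug builds are exercised regularly.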