Micro-optimize the __morestack fast path

`__morestack` is performance-critical code used for growing the stack, and it currently wastes many instructions on the non-allocating fast path. There are a number of distinct optimizations we can identify.

Here's what happens after calling into `__morestack`, on the fast path:
- Set up the frame pointer
- Push all possible argument registers of the calling function in case the call to `upcall_new_stack` clobbers them
- Shuffle the argument registers from the `__morestack` custom calling convention registers to the C calling convention registers used by `upcall_new_stack`
- Call `upcall_new_stack`, through the indirection of the dynamic linker
- Call `get_sp_limit`, an entire assembly function consisting of `movq %fs:112, %rax`
- Compare the `sp_limit` to 0 and don't branch to the `rust_get_current_task` slow path. This branch always makes the same decision during a `__morestack` call.
- Do some math to find the `task` pointer from the stack limit
- Check the stack canary to make sure we haven't run off the end of the stack
- Assert that the task pointer is not null
- Get the minimum stack size
- Do some simple math and pointer indirections to determine if `task->stk->next` is a big enough stack segment to use
- Assert some invariants
- memcpy the arguments from the old stack to the new stack
- Align the new stack frame
- Call `reuse_valgrind_stack` to give valgrind hints
- Call `record_stack_limit` to execute another single instruction
- Return the stack pointer to `__morestack`
- Pop all the saved argument registers
- Finally, call the original function
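
To make the middle of that list concrete, here is a rough sketch of the "is `task->stk->next` big enough, then copy the arguments and align" steps. It is not the runtime's code: the `stk_seg` layout, the `try_reuse_next` name, and the 16-byte alignment are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical segment header; the real runtime layout is not shown here. */
typedef struct stk_seg {
    struct stk_seg *prev;
    struct stk_seg *next;
    uintptr_t start;  /* lowest usable address */
    uintptr_t end;    /* one past the highest usable address */
} stk_seg;

/* Fast path: if the next segment is already allocated and large enough,
 * copy the arguments to its top and return the aligned stack pointer.
 * Returns 0 when the caller must take the allocating slow path. */
static uintptr_t try_reuse_next(const stk_seg *cur, size_t min_size,
                                size_t frame_size, const void *args,
                                size_t args_size) {
    assert(cur != NULL);
    const stk_seg *next = cur->next;
    if (next == NULL)
        return 0;
    assert(next->start < next->end);  /* "assert some invariants" */

    /* 15 bytes of slack cover the worst-case alignment adjustment below. */
    size_t needed = min_size + frame_size + args_size + 15;
    if ((size_t)(next->end - next->start) < needed)
        return 0;

    uintptr_t sp = next->end - args_size;
    sp &= ~(uintptr_t)15;             /* align the new frame downward */
    memcpy((void *)sp, args, args_size);
    return sp;
}
```

Everything in this sketch is a handful of loads, one compare, and a small copy, which is why the surrounding calls and checks dominate the cost of the fast path.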

And returning from the segment:
- Call `upcall_del_stack` through the dynamic linker
- Call `get_sp_limit`, an entire function consisting of `movq %fs:112, %rax`
- Compare the `sp_limit` to 0, etc.
- Check the stack canary to make sure we haven't run off the end of the stack
- Assert that the task pointer is not null
- Update the current stack pointer in the task
- Call `record_stack_limit`
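
Stripped of the lookups and checks, the return path does very little. The sketch below assumes that "the current stack pointer in the task" is a pointer to the active segment and that the segment being left stays linked for later reuse; the `seg` and `task` types and the `del_stack` name are hypothetical, and a plain field stands in for `record_stack_limit`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration only. */
typedef struct seg {
    struct seg *prev;
    struct seg *next;
    uintptr_t limit;     /* stack limit to record while this segment is active */
} seg;

typedef struct task {
    seg *stk;            /* currently active segment */
    uintptr_t sp_limit;  /* stand-in for the thread-local limit slot */
} task;

/* Step back to the previous segment. The segment being left stays linked
 * as prev->next, so a later grow can reuse it without allocating.
 * Returns 0 on success, -1 if there is no previous segment. */
static int del_stack(task *t) {
    assert(t != NULL && t->stk != NULL);
    seg *prev = t->stk->prev;
    if (prev == NULL)
        return -1;
    t->stk = prev;
    t->sp_limit = prev->limit;  /* what record_stack_limit would do */
    return 0;
}
```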

Potential optimizations:
- Don't save the frame pointer. This could be tricky to make work with DWARF unwinding, due to the odd frame shapes around `__morestack`. It will be easier after rolling our own unwinder (#3551).
- Inline `get_sp_limit`, `record_stack_limit` (#2521)
- Statically link `upcall_new_stack` and `upcall_del_stack`, and have them call new dynamically linked upcalls only on the slow path
- Create a new version of `rust_get_current_task` that doesn't have a fallback path for the case when the task pointer can't be retrieved from the stack segment. Use it from `upcall_new_stack` and `upcall_del_stack`.
- Consider saving the task pointer between `upcall_new_stack` and `upcall_del_stack` to avoid calculating it again
- Do fewer pointer indirections and calculations to verify the suitability of the stack segment, possibly storing more information directly in the stack segment header, never accessing the task pointer directly. (See also #3566).
- Put all asserts under the compile-time debug flag, including the canary check
- Put the valgrind hinting under a debug flag too. I believe it does have a runtime penalty.
- Ensure that `upcall_new_stack` doesn't use xmm registers and remove the xmm saves and restores in `__morestack` (#2043)
- Inline `upcall_del_stack` into `__morestack`
- Write the entire fast path in assembly
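
Two of these are easy to picture in C: inlining the one-instruction accessors and compiling the asserts (including the canary check) out of release builds. The following is only a sketch under assumptions. A C11 thread-local variable stands in for the `%fs:112` slot so the example stays portable, and the `RT_DEBUG_CHECKS` flag, `CANARY_VALUE`, `seg_header`, and `switch_to_segment` are made-up names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable stand-in for the thread-local slot read via %fs:112. */
static _Thread_local uintptr_t sp_limit_slot;

/* As static inline functions these compile to a single load or store
 * at the call site instead of a call and return. */
static inline uintptr_t get_sp_limit(void) { return sp_limit_slot; }
static inline void record_stack_limit(uintptr_t limit) { sp_limit_slot = limit; }

/* Hypothetical compile-time debug flag: checks vanish when it is unset. */
#ifdef RT_DEBUG_CHECKS
#define RT_ASSERT(cond) assert(cond)
#else
#define RT_ASSERT(cond) ((void)0)
#endif

#define CANARY_VALUE ((uintptr_t)0xABCDAB1CU)  /* made-up constant */

/* Hypothetical header holding everything the fast path needs. */
typedef struct seg_header {
    uintptr_t canary;
    uintptr_t limit;
} seg_header;

static inline void switch_to_segment(const seg_header *seg) {
    RT_ASSERT(seg != NULL);
    RT_ASSERT(seg->canary == CANARY_VALUE);  /* canary check, debug only */
    record_stack_limit(seg->limit);
}
```

With the flag unset, `switch_to_segment` reduces to one load and one store, which is roughly the cost the fast path should be aiming for.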


Micro-optimize the __morestack fast path #3565