Improve performance of slice::sort with ipn_stable #108005
Conversation
r? @cuviper (rustbot has picked a reviewer for you, use r? to override)

Hey! It looks like you've submitted a new PR for the library teams!

r? @thomcc
Fix CI failures? @rustbot author

@rustbot ready
I developed a pretty thorough test suite (https://github.com/Voultapher/sort-research-rs/blob/main/tests/main.rs) that checks things like: will the original elements still be in the input if any of the possible comparison calls panics? What if the type has self-modified on the stack via interior mutability and the comparison function panics? Will all modifications be visible? It also covers various ways in which Ord can be violated, and more, all with various input sizes and patterns. I run this test suite normally, with ASAN, or with Miri, and I have libfuzzer and AFL harnesses. This version of the code passes all those tests.
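As an illustration, here is a minimal sketch in the spirit of one of those checks (not the suite's actual code): sort with a comparison that panics partway through, then verify that no element was lost or duplicated.

```rust
use std::panic;

fn main() {
    // Silence the expected panic message on stderr.
    panic::set_hook(Box::new(|_| {}));

    let original: Vec<i32> = (0..100).rev().collect();
    let mut v = original.clone();

    // Sort with a comparison function that panics partway through.
    let mut calls = 0;
    let _ = panic::catch_unwind(panic::AssertUnwindSafe(|| {
        v.sort_by(|a, b| {
            calls += 1;
            if calls == 50 {
                panic!("induced comparison panic");
            }
            a.cmp(b)
        });
    }));

    // Panic-safety invariant: the slice must still be a permutation of the
    // original elements, with nothing lost or duplicated.
    let mut expected = original;
    expected.sort_unstable();
    v.sort_unstable();
    assert_eq!(v, expected);
}
```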
This looks pretty good to me. It's certainly complex, but is very well commented. @bors try @rust-timer queue
⌛ Trying commit c7403f2722fdd76f3932c2736b8beefc5f838767 with merge 2c260bb5eff802611a4d5640969e76a92946c62d...

☀️ Try build successful - checks-actions
Finished benchmarking commit (2c260bb5eff802611a4d5640969e76a92946c62d): comparison URL.

Overall result: ❌ regressions - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never

Instruction count: This is a highly reliable metric that was used to determine the overall result at the top of this comment.

Max RSS (memory usage): This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Cycles: This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Interesting regression results. I could see it being a real slowdown for smallish inputs (size < 15), and I have some ideas for how to address that. At the same time, I wonder how representative that metric is. This new implementation deliberately exploits instruction-level parallelism, so it certainly issues and retires more instructions, but it is faster in terms of wall clock. Is there a way to have these measurements done with time instead of instruction count? Tagging @rylev, you seem to know a lot about these performance regression benchmarks.
This is achieved by writing code that exploits instruction-parallel hardware:

- New small sort that handles anything up to size 32
- Faster bi-directional merge function
- Faster run creation with the help of a sorting-network

Probing for a common value as done in ipn_stable is omitted and can be added incrementally in the future. A detailed writeup can be found here: https://github.com/Voultapher/sort-research-rs/blob/main/writeup/glidesort_perf_analysis/text.md. The only difference is low-cardinality inputs, which will perform similarly to random inputs.
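As a rough illustration of the sorting-network idea (a minimal sketch, not the PR's actual code): each layer's compare-exchanges touch disjoint elements, so a superscalar CPU can execute them in parallel, and the branchless selects avoid branch mispredictions.

```rust
// Branchless compare-exchange: the compiler can lower these selects to
// conditional moves instead of branches.
fn swap_if_less(v: &mut [u64], a: usize, b: usize) {
    let (x, y) = (v[a], v[b]);
    let swap = y < x;
    v[a] = if swap { y } else { x };
    v[b] = if swap { x } else { y };
}

// Optimal 5-comparator network for 4 elements. The first two swaps (and the
// next two) operate on disjoint elements and can retire in parallel.
fn sort4(v: &mut [u64; 4]) {
    swap_if_less(v, 0, 2);
    swap_if_less(v, 1, 3);
    swap_if_less(v, 0, 1);
    swap_if_less(v, 2, 3);
    swap_if_less(v, 1, 2);
}

fn main() {
    let mut v = [3, 1, 4, 2];
    sort4(&mut v);
    assert_eq!(v, [1, 2, 3, 4]);
}
```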
Avoid perf regressions compared to rust_std_stable by carefully putting an inlined insertion sort in the hot path. Re-tune heuristics for when stable sorting-networks are worth it. Use a different pattern-analysis algorithm that will respect a pre-sorted streak in the small-sort if it is the largest part of the input. Also improve edge run creation.
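For reference, a simplified safe sketch of an insertion sort of this shape (the PR's inlined version is more optimized):

```rust
// Simple insertion sort driven by a user comparison, of the kind used for
// small or mostly-sorted inputs.
fn insertion_sort<T, F: FnMut(&T, &T) -> bool>(v: &mut [T], is_less: &mut F) {
    for i in 1..v.len() {
        // Shift v[i] left until its predecessor is not greater.
        let mut j = i;
        while j > 0 && is_less(&v[j], &v[j - 1]) {
            v.swap(j, j - 1);
            j -= 1;
        }
    }
}

fn main() {
    let mut v = vec![5, 2, 4, 1, 3];
    insertion_sort(&mut v, &mut |a, b| a < b);
    assert_eq!(v, vec![1, 2, 3, 4, 5]);
}
```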
To get an idea how much time rustc spends in `slice::sort`, I injected timing code around the call:

```rust
#[link(name = "my_inject", kind = "static")]
extern "C" {
    fn inject_timer_start() -> u64;
    fn inject_timer_stop(len: usize, start_time: u64);
}

[...]

let start_time = unsafe { inject_timer_start() };
sort::merge_sort(v, &mut is_less, elem_alloc_fn, elem_dealloc_fn, run_alloc_fn, run_dealloc_fn);
unsafe {
    inject_timer_stop(v.len(), start_time);
}
```

And the C code:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <pthread.h>
#include <x86intrin.h>

pthread_mutex_t LOCK = PTHREAD_MUTEX_INITIALIZER;
FILE* APPEND_LOG = NULL;

// rdtsc counts reference cycles, not CPU core clock cycles, see
// https://stackoverflow.com/a/51907627.
inline __attribute__((always_inline)) uint64_t clock_start(void) {
    _mm_lfence(); // wait for earlier insns to retire before reading the clock
    uint64_t tsc = __rdtsc();
    return tsc;
}

inline __attribute__((always_inline)) uint64_t clock_end(void) {
    uint64_t tsc = __rdtsc();
    _mm_lfence(); // block later instructions until rdtsc retires
    return tsc;
}

__attribute__((visibility("default"))) uint64_t inject_timer_start() {
    return clock_start();
}

__attribute__((visibility("default"))) void inject_timer_stop(
    size_t len,
    uint64_t start_time) {
    const uint64_t end_time = clock_end();
    const uint64_t time_diff_cycles = end_time - start_time;

    pthread_mutex_lock(&LOCK);
    if (!APPEND_LOG) {
        APPEND_LOG = fopen("/tmp/inject_log.txt", "a+");
        if (!APPEND_LOG) {
            pthread_mutex_unlock(&LOCK);
            // fprintf(stderr, "Failed to open file.\n");
            return;
        }
    }

    // fprintf(APPEND_LOG, "%zo %lo\n", len, time_diff_cycles); // WRONG
    fprintf(APPEND_LOG, "%zu %" PRIu64 "\n", len, time_diff_cycles);
    pthread_mutex_unlock(&LOCK);
}
```

This appends the slice length and the time it took, in reference tsc cycles, to a file. I looked into measuring real core cycles with hardware performance counters, and I then ran the html5ever benchmark.
The total cycles reported by perf and the tsc cycles measured with the injection code are not perfectly comparable; however, tsc cycles are perfectly comparable with each other. I think this shows two things:

Another interesting insight: rustc does a total of 492_581 calls to [...]

EDIT: Numbers fixed, to avoid confusion.
@rustbot ready

Thank you for the thorough measurement! (but unsatisfying that it leaves the cause of regression unknown...)
So, I uhm made a pretty hilarious and stupid mistake 😝. While writing the code I looked at the wrong printf specifier column and printed the numbers as octal. Luckily I can fix that in post-processing, along the lines of the sketch below.

Everything else remains the same.
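A minimal sketch of that post-processing step (the log path follows the code above; the rest is an assumption):

```rust
use std::fs;

fn main() -> std::io::Result<()> {
    // The buggy specifiers ("%zo %lo") wrote both columns in octal, so
    // re-parse each column as base 8 and print it back out in decimal.
    let log = fs::read_to_string("/tmp/inject_log.txt")?;
    for line in log.lines() {
        let mut cols = line.split_whitespace();
        let len = u64::from_str_radix(cols.next().unwrap(), 8).unwrap();
        let cycles = u64::from_str_radix(cols.next().unwrap(), 8).unwrap();
        println!("{len} {cycles}");
    }
    Ok(())
}
```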
I think the main reason is that it generates more code, and generating more code means more work for the compiler. This is kind of unavoidable with more sophisticated algorithms.

It's tricky to say how much we want something like that, though. Notably, we get lots of requests for a faster compiler, and comparatively fewer requests for a faster sort implementation. Still, I'll take another look later when I have time.
To further reduce binary size and limit a possible negative impact on certain types, the stable sorting-network was removed. To trim code-gen even more, only types that qualify will now call bi_directional_merge.
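A hedged sketch of what such type gating can look like; the predicate name and the thresholds are assumptions for illustration, not the PR's exact conditions:

```rust
use std::mem;

// Only cheaply movable, plain data is worth the extra code paths: no drop
// glue and small enough that speculative copies stay cheap.
fn qualifies_for_branchless_merge<T>() -> bool {
    !mem::needs_drop::<T>() && mem::size_of::<T>() <= mem::size_of::<[usize; 4]>()
}

fn main() {
    assert!(qualifies_for_branchless_merge::<u64>());
    assert!(!qualifies_for_branchless_merge::<String>()); // has drop glue
}
```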
I ran some benchmarks in real-world scenarios and got a little spooked by the stable sorting-network; it's great for integers, but I suspect that limiting it to [...]

hot-u64-random-10k:

hot-string-random-10k:

Would be good to do another timing run. Also arguably, for integers users should use unstable sort, which I plan on upstreaming changes to soon (tm).
Sorry for the delay, I missed that comment. Thanks for the investigation. @bors try @rust-timer queue
⌛ Trying commit 556ba1b with merge 430e29582121afa3a686a8e4cb4b84e300ff2925...
☀️ Try build successful - checks-actions
Finished benchmarking commit (430e29582121afa3a686a8e4cb4b84e300ff2925): comparison URL.

Overall result: ❌ regressions - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never

Instruction count: This is a highly reliable metric that was used to determine the overall result at the top of this comment.

Max RSS (memory usage): This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Cycles: This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Looking at cycles, it's now down to 5 affected crates, with a mean of 1.6% and a peak of 2.1%. That seems acceptable to me.

Okay, I'll try to re-review soon. I might not be able to get to it until next week. Note that cycles is a highly noisy metric, so it's hard to draw too much from it; perhaps someone from the perf group wants to weigh in with their thoughts. CC @rust-lang/wg-compiler-performance
This one is certainly tricky, because the CI perf runs can be affected both by the compiler itself using the new sorting code and by the compiler compiling the new sorting code in the benchmarks. This is the kind of change where instruction count is not such a good metric, and cycles and wall time are better. Those results don't look so bad. I see https://github.com/Voultapher/sort-research-rs/blob/main/writeup/glidesort_perf_analysis/text.md has detailed benchmarking. Does that cover the final version of the code with the sorting-network removed? If it's not too painful, I'd be very interested to see (probably in a different PR) a CI perf run with the current sort algorithm replaced with something very simple and small, like a basic mergesort. That would give an interesting perspective on the various factors.
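For concreteness, a minimal sketch of the kind of "basic mergesort" meant here (illustrative only, not a proposal for the actual std code):

```rust
// Textbook top-down merge sort: tiny code footprint, O(n) scratch space.
fn merge_sort<T: Ord + Copy>(v: &mut [T]) {
    let n = v.len();
    if n <= 1 {
        return;
    }
    let mid = n / 2;
    merge_sort(&mut v[..mid]);
    merge_sort(&mut v[mid..]);

    // Merge the two sorted halves through a scratch buffer.
    let mut buf = Vec::with_capacity(n);
    let (mut i, mut j) = (0, mid);
    while i < mid && j < n {
        if v[i] <= v[j] {
            buf.push(v[i]);
            i += 1;
        } else {
            buf.push(v[j]);
            j += 1;
        }
    }
    buf.extend_from_slice(&v[i..mid]);
    buf.extend_from_slice(&v[j..]);
    v.copy_from_slice(&buf);
}

fn main() {
    let mut v = [3, 1, 2];
    merge_sort(&mut v);
    assert_eq!(v, [1, 2, 3]);
}
```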
It benchmarked a version with two additions to what is currently being reviewed: the sorting-network for [...]
I already have something like that, but it goes in the other direction: adding something larger (#108662) to get a reference for how bad that would be. This can then be used in a synthetic compile-time benchmark. Here are some clean compile-time numbers for instantiating sorts with 256 NewType wrappers (50% int, 45% String, and 5% Cell), adjusted for the baseline:
The experiment PR is there to correlate this to overall regression.
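A hypothetical sketch of how such an instantiation benchmark can be set up (macro and module names are made up; the real harness differs):

```rust
// Every module gets its own NewType, forcing a fresh monomorphization of
// slice::sort per wrapper, which is what the compile-time benchmark measures.
macro_rules! instantiate_sorts {
    ($($m:ident),* $(,)?) => {
        $(
            pub mod $m {
                #[derive(PartialEq, Eq, PartialOrd, Ord)]
                pub struct NewType(pub u64);

                pub fn sort(v: &mut [NewType]) {
                    v.sort();
                }
            }
        )*
    };
}

// In the real benchmark this list would have 256 entries, mixing wrappers
// around integers, String, and Cell.
instantiate_sorts!(t000, t001, t002, t003);

fn main() {
    let mut v = [t000::NewType(2), t000::NewType(1)];
    t000::sort(&mut v);
    assert!(v[0].0 <= v[1].0);
}
```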
There is other ongoing work in this area, so I'm closing this.