
Improve performance of slice::sort with ipn_stable #108005


Closed
wants to merge 9 commits into from

Conversation

Voultapher
Contributor

This is achieved by writing code that exploits instruction-parallel hardware:

  • A new small sort that handles anything up to size 32
  • A faster bi-directional merge function
  • Faster run creation with the help of a sorting-network

Probing for a common value as done in ipn_stable is omitted and can be added incrementally in the future. A detailed writeup can be found at https://github.com/Voultapher/sort-research-rs/blob/main/writeup/glidesort_perf_analysis/text.md. The only difference is low-cardinality inputs, which will perform similarly to random inputs.
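As a rough illustration of the instruction-parallel idea (a hypothetical sketch of mine, not the PR's actual code): a sorting network executes a fixed, data-independent sequence of compare-exchanges, so comparators that touch disjoint indices can be in flight in the CPU at the same time.

```rust
// Hypothetical sketch, not the PR's implementation: a sorting network for
// 4 elements built from branchless-style compare-exchanges.
fn compare_swap(v: &mut [i32], a: usize, b: usize) {
    // Compute both outcomes, then select; this pattern tends to compile to
    // conditional moves rather than branches.
    let swap = v[b] < v[a];
    let (x, y) = (v[a], v[b]);
    v[a] = if swap { y } else { x };
    v[b] = if swap { x } else { y };
}

fn sort4(v: &mut [i32; 4]) {
    // Optimal 5-comparator network for 4 elements.
    compare_swap(v, 0, 1);
    compare_swap(v, 2, 3);
    compare_swap(v, 0, 2); // (0,2) and (1,3) touch disjoint indices,
    compare_swap(v, 1, 3); // so they can execute in parallel
    compare_swap(v, 1, 2);
}
```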

@rustbot
Collaborator

rustbot commented Feb 13, 2023

r? @cuviper

(rustbot has picked a reviewer for you, use r? to override)

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Feb 13, 2023
@rustbot
Collaborator

rustbot commented Feb 13, 2023

Hey! It looks like you've submitted a new PR for the library teams!

If this PR contains changes to any rust-lang/rust public library APIs then please comment with @rustbot label +T-libs-api -T-libs to tag it appropriately. If this PR contains changes to any unstable APIs please edit the PR description to add a link to the relevant API Change Proposal or create one if you haven't already. If you're unsure where your change falls no worries, just leave it as is and the reviewer will take a look and make a decision to forward on if necessary.

Examples of T-libs-api changes:

  • Stabilizing library features
  • Introducing insta-stable changes such as new implementations of existing stable traits on existing stable types
  • Introducing new or changing existing unstable library APIs (excluding permanently unstable features / features without a tracking issue)
  • Changing public documentation in ways that create new stability guarantees
  • Changing observable runtime behavior of library APIs

@Voultapher
Contributor Author

r? @thomcc

@rustbot rustbot assigned thomcc and unassigned cuviper Feb 13, 2023
@rust-log-analyzer

This comment has been minimized.

@thomcc
Member

thomcc commented Feb 14, 2023

Fix CI failures?

@rustbot author

@rustbot rustbot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 14, 2023
@Voultapher
Contributor Author

@rustbot ready

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Feb 14, 2023
@Voultapher
Contributor Author

I developed a pretty thorough test suite https://github.com/Voultapher/sort-research-rs/blob/main/tests/main.rs that checks things like: will the original elements still be in the input if any of the possible comparison calls panics? What if the type has self-modified on the stack via interior mutability and the comparison function panics, will all modifications be visible? It also covers various ways in which Ord can be violated, and more. All of that with various input sizes and patterns. I run this test suite normally, with ASAN, and with Miri, and I have libfuzzer and AFL harnesses. This version of the code passes all those tests.
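For readers unfamiliar with this class of tests, here is a minimal sketch of the panic-safety property described above (an assumption of mine about the shape of such a check, not code from the linked suite): if the comparator panics mid-sort, no element may be lost or duplicated.

```rust
use std::panic::{self, AssertUnwindSafe};

// Minimal sketch, not the actual sort-research-rs harness: abort the sort
// via a panicking comparator, then verify the slice is still a permutation
// of the original elements.
fn panic_safety_check(mut v: Vec<i32>) {
    let mut expected = v.clone();
    expected.sort();

    let mut calls = 0u32;
    let result = panic::catch_unwind(AssertUnwindSafe(|| {
        v.sort_by(|a, b| {
            calls += 1;
            if calls == 5 {
                panic!("injected comparator panic");
            }
            a.cmp(b)
        });
    }));
    assert!(result.is_err(), "comparator should have panicked");

    // The sort was aborted, but every original element must still be present.
    v.sort();
    assert_eq!(v, expected);
}
```

The standard library's stable sort documents this guarantee, which is why checks like the one above can be run against it under Miri and ASAN as well.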

@thomcc
Member

thomcc commented Feb 19, 2023

This looks pretty good to me. It's certainly complex, but it is very well commented.

@bors try @rust-timer queue

@rust-timer

This comment has been minimized.

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 19, 2023
@bors
Collaborator

bors commented Feb 19, 2023

⌛ Trying commit c7403f2722fdd76f3932c2736b8beefc5f838767 with merge 2c260bb5eff802611a4d5640969e76a92946c62d...

@bors
Collaborator

bors commented Feb 19, 2023

☀️ Try build successful - checks-actions
Build commit: 2c260bb5eff802611a4d5640969e76a92946c62d (2c260bb5eff802611a4d5640969e76a92946c62d)

@rust-timer

This comment has been minimized.

@rust-timer
Collaborator

Finished benchmarking commit (2c260bb5eff802611a4d5640969e76a92946c62d): comparison URL.

Overall result: ❌ regressions - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

                            | mean  | range          | count
----------------------------|-------|----------------|------
Regressions ❌ (primary)    | 2.3%  | [0.4%, 8.2%]   | 53
Regressions ❌ (secondary)  | 1.9%  | [0.4%, 9.4%]   | 47
Improvements ✅ (primary)   | -     | -              | 0
Improvements ✅ (secondary) | -     | -              | 0
All ❌✅ (primary)          | 2.3%  | [0.4%, 8.2%]   | 53

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

                            | mean  | range           | count
----------------------------|-------|-----------------|------
Regressions ❌ (primary)    |  3.9% | [0.8%, 6.5%]    | 7
Regressions ❌ (secondary)  |  2.3% | [1.3%, 3.3%]    | 2
Improvements ✅ (primary)   | -5.7% | [-5.7%, -5.7%]  | 1
Improvements ✅ (secondary) | -3.6% | [-3.6%, -3.6%]  | 1
All ❌✅ (primary)          |  2.7% | [-5.7%, 6.5%]   | 8

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

                            | mean  | range          | count
----------------------------|-------|----------------|------
Regressions ❌ (primary)    | 2.8%  | [1.4%, 6.3%]   | 30
Regressions ❌ (secondary)  | 2.0%  | [1.1%, 6.7%]   | 22
Improvements ✅ (primary)   | -     | -              | 0
Improvements ✅ (secondary) | -     | -              | 0
All ❌✅ (primary)          | 2.8%  | [1.4%, 6.3%]   | 30

@rustbot rustbot added perf-regression Performance regression. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Feb 19, 2023
@Voultapher
Contributor Author

Interesting regression results. I could see it being a real slowdown for smallish inputs (size < 15), and I have some ideas for how to address that. At the same time, I wonder how representative that metric is. This new implementation deliberately exploits instruction-level parallelism, so it certainly issues and retires more instructions, but it is faster in terms of wall clock. Is there a way to have these measurements done with time instead of instruction count?

Tagging @rylev, since you seem to know a lot about these performance regression benchmarks.

Avoid perf regressions compared to rust_std_stable by carefully putting
an inlined insertion sort in the hot path. Re-tune heuristics for when
stable sorting-networks are worth it. Use a different pattern-analysis
algorithm that respects a pre-sorted streak in the small-sort if it is
the largest part of the input.

Also improve edge run creation.
@rustbot rustbot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 22, 2023
@Voultapher
Contributor Author

Voultapher commented Feb 22, 2023

It could be evidence that the compiler's own sorting performance is negatively impacted.

To get an idea of how much time rustc spends in slice::sort, I instrumented the standard library with custom C code:

#[link(name = "my_inject", kind = "static")]
extern "C" {
    fn inject_timer_start() -> u64;
    fn inject_timer_stop(len: usize, start_time: u64);
}

[...]

let start_time = unsafe { inject_timer_start() };
sort::merge_sort(v, &mut is_less, elem_alloc_fn, elem_dealloc_fn, run_alloc_fn, run_dealloc_fn);
unsafe {
    inject_timer_stop(v.len(), start_time);
}

And the C code:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <pthread.h>

#include <x86intrin.h>

pthread_mutex_t LOCK = PTHREAD_MUTEX_INITIALIZER;
FILE* APPEND_LOG = NULL;

// rdtsc counts reference cycles, not CPU core clock cycles see
// https://stackoverflow.com/a/51907627.
inline __attribute__((always_inline)) uint64_t clock_start(void) {
  _mm_lfence();  // wait for earlier insns to retire before reading the clock
  uint64_t tsc = __rdtsc();
  return tsc;
}

inline __attribute__((always_inline)) uint64_t clock_end() {
  uint64_t tsc = __rdtsc();
  _mm_lfence();  // block later instructions until rdtsc retires
  return tsc;
}

__attribute__((visibility("default"))) uint64_t inject_timer_start() {
  return clock_start();
}

__attribute__((visibility("default"))) void inject_timer_stop(
    size_t len,
    uint64_t start_time) {
  const uint64_t end_time = clock_end();
  const uint64_t time_diff_cycles = end_time - start_time;

  pthread_mutex_lock(&LOCK);
  if (!APPEND_LOG) {
    APPEND_LOG = fopen("/tmp/inject_log.txt", "a+");
    if (!APPEND_LOG) {
      pthread_mutex_unlock(&LOCK);
      //   fprintf(stderr, "Failed to open file.\n");
      return;
    }
  }

  // fprintf(APPEND_LOG, "%zo %lo\n", len, time_diff_cycles);  // bug: octal specifiers
  fprintf(APPEND_LOG, "%zu %" PRIu64 "\n", len, time_diff_cycles);
  pthread_mutex_unlock(&LOCK);
}

This appends the slice length and the time the sort took, in reference TSC cycles, to a file. I looked into measuring real core cycles with hardware performance counters and rdpmc, but that turned out to be rather tricky and I gave up. So I'm using rdtsc with fences, which is still quite precise. The tests were done before the 'Make sorting slices of len < 2 a likely branch' commit.
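The resulting log can be aggregated with a few lines of Rust (a sketch of my own, not part of the PR; the "<len> <cycles>" line format matches the fprintf above):

```rust
// Sketch (not part of the PR): aggregate the "<len> <cycles>" lines written
// by inject_timer_stop, reporting total cycles spent in the sort and the
// share of calls on slices of length < 2.
fn summarize(log: &str) -> (u64, f64) {
    let mut total_cycles = 0u64;
    let mut calls = 0u64;
    let mut trivial = 0u64; // len 0 or 1: nothing to sort
    for line in log.lines() {
        let mut fields = line.split_whitespace();
        let len: usize = fields.next().unwrap().parse().unwrap();
        let cycles: u64 = fields.next().unwrap().parse().unwrap();
        total_cycles += cycles;
        calls += 1;
        if len < 2 {
            trivial += 1;
        }
    }
    (total_cycles, trivial as f64 / calls as f64)
}
```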

I then ran cargo clean && rm -f /tmp/inject_log.txt && taskset -c 2 perf stat -d cargo build --release for html5ever, which shows the largest cycle regression at 2.43% for opt full.

html5ever

master:
total: 134_244_308_745 | in sort: 98_255_486 | ratio: ~0.073%
total: 134_577_777_746 | in sort: 99_223_434 | ratio: ~0.074%
 
This PR:
total: 134_230_898_387 | in sort: 75_349_000 | ratio: ~0.056%
total: 134_410_537_807 | in sort: 75_731_973 | ratio: ~0.056%

master clean no inject:
total: 123,579,549,549
total: 123,247,438,086

The total cycles reported by perf and the TSC cycles measured with the injection code are not perfectly comparable; however, TSC cycles are perfectly comparable with each other. I think this shows two things:

  • The new code seems faster in this scenario
  • Even if the sort perf had regressed by 2x, the time spent in slice::sort would still be an order of magnitude too small to be the main cause of the 2.43% regression found by the CI

Another interesting insight: rustc does a total of 492_581 calls to slice::sort when building html5ever and its dependencies. 352_576 (~71.6%) of those calls are for slices of length 0, and 112_245 (~22.8%) are for slices of length 1. In total this accounts for ~94.4% of all calls. Extending that to anything with v.len() <= 20, which is the threshold below which both the old and new implementations use insertion sort, covers 491_265 (~99.7%) of all calls.

EDIT: Numbers fixed to avoid confusion.
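To illustrate why that call distribution matters (a simplified sketch of my own, not the PR's actual dispatch code; the thresholds are illustrative): both implementations bail out early for tiny slices and use insertion sort below a threshold, so the vast majority of rustc's calls never reach the main merge machinery.

```rust
// Simplified sketch, not the PR's actual code: the dispatch structure shared
// by the old and new implementations.
fn sort_dispatch(v: &mut [i32]) {
    // ~94.4% of rustc's calls take this branch (len 0 or 1).
    if v.len() < 2 {
        return;
    }
    // ~99.7% of calls fall at or below the insertion-sort threshold.
    if v.len() <= 20 {
        insertion_sort(v);
        return;
    }
    v.sort(); // stand-in for the full merge-sort path
}

fn insertion_sort(v: &mut [i32]) {
    for i in 1..v.len() {
        let mut j = i;
        while j > 0 && v[j] < v[j - 1] {
            v.swap(j, j - 1);
            j -= 1;
        }
    }
}
```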

@Voultapher
Contributor Author

@rustbot ready

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Feb 22, 2023
@cuviper
Member

cuviper commented Feb 22, 2023

Thank you for the thorough measurement! (but unsatisfying that it leaves the cause of regression unknown...)

@Voultapher
Contributor Author

So, I uhm, made a pretty hilarious and stupid mistake 😝. While writing the code I looked at the wrong printf specifier column and printed the numbers as octal. Luckily I can fix that in post-processing:

master:
total: 134_244_308_745 | in sort: 98_255_486 | ratio: ~0.073%
total: 134_577_777_746 | in sort: 99_223_434 | ratio: ~0.074%
 
This PR:
total: 134_230_898_387 | in sort: 75_349_000 | ratio: ~0.056%
total: 134_410_537_807 | in sort: 75_731_973 | ratio: ~0.056%

Everything else remains the same.

@Voultapher
Contributor Author

Thank you for the thorough measurement! (but unsatisfying that it leaves the cause of regression unknown...)

I think the main reason is that it generates more code, and generating more code means more work for the compiler. This is kind of unavoidable with more sophisticated algorithms.

@thomcc
Member

thomcc commented Feb 22, 2023

This is kind of unavoidable with more sophisticated algorithms.

It's tricky to say how much we want something like that, though. Notably, we get lots of requests for a faster compiler, and comparatively few requests for a faster sort implementation.

Still, I'll take another look later when I have time.

To further reduce binary size and limit a possible negative impact on
certain types, the stable sorting-network was removed. To trim code-gen
even more, only types that apply will now call bi_directional_merge.
@Voultapher
Contributor Author

Voultapher commented Feb 23, 2023

I ran some benchmarks in real-world scenarios and got a little spooked by the stable sorting-network: it's great for integers, but I suspect that limiting it to Copy types is not enough, and I saw some serious regressions. So, to be on the safe side and to limit code-gen even further, I completely deleted it. While a lot less ambitious, the current version still yields:

hot-u64-random-10k:

with network and bi-dir merge: 110us
with bi-dir merge: 165us (current PR state)
master: 260us

hot-string-random-10k:

with bi-dir merge: 848us
master: 1214us

It would be good to do another timing run.

Also, arguably, for integers users should use unstable sort, which I plan to upstream changes to soon (tm).
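For readers curious what "bi-directional merge" means here, a deliberately simplified safe sketch (an assumption of mine, not the PR's implementation, which works in place with unsafe pointer arithmetic): the output is filled from both ends at once, giving the CPU two mostly independent dependency chains to overlap.

```rust
// Simplified sketch of a bidirectional merge: merge two sorted runs by
// writing the smallest remaining element to the front of the output and the
// largest remaining element to the back, until the two cursors meet.
fn bidirectional_merge(left: &[i32], right: &[i32]) -> Vec<i32> {
    let n = left.len() + right.len();
    let mut out = vec![0; n];
    let (mut i, mut ri) = (0, left.len()); // remaining left: left[i..ri]
    let (mut j, mut rj) = (0, right.len()); // remaining right: right[j..rj]
    let (mut k, mut rk) = (0, n); // unfilled output: out[k..rk]
    while k < rk {
        // Front step: smallest remaining element; prefer left on ties (stable).
        if j == rj || (i < ri && left[i] <= right[j]) {
            out[k] = left[i];
            i += 1;
        } else {
            out[k] = right[j];
            j += 1;
        }
        k += 1;
        if k == rk {
            break;
        }
        // Back step: largest remaining element; prefer right on ties (stable).
        rk -= 1;
        if j == rj || (i < ri && left[ri - 1] > right[rj - 1]) {
            out[rk] = left[ri - 1];
            ri -= 1;
        } else {
            out[rk] = right[rj - 1];
            rj -= 1;
        }
    }
    out
}
```

The front and back steps share no data dependencies within an iteration, which is the kind of structure instruction-parallel hardware rewards.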

@thomcc
Member

thomcc commented Feb 26, 2023

Sorry for the delay, missed that comment. Thanks for the investigation.

@bors try @rust-timer queue

@rust-timer

This comment has been minimized.

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 26, 2023
@bors
Collaborator

bors commented Feb 26, 2023

⌛ Trying commit 556ba1b with merge 430e29582121afa3a686a8e4cb4b84e300ff2925...

@rust-timer

This comment has been minimized.

@bors
Collaborator

bors commented Feb 26, 2023

☀️ Try build successful - checks-actions
Build commit: 430e29582121afa3a686a8e4cb4b84e300ff2925 (430e29582121afa3a686a8e4cb4b84e300ff2925)

@rust-timer

This comment has been minimized.

@rust-timer
Collaborator

Finished benchmarking commit (430e29582121afa3a686a8e4cb4b84e300ff2925): comparison URL.

Overall result: ❌ regressions - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

                            | mean  | range          | count
----------------------------|-------|----------------|------
Regressions ❌ (primary)    | 0.7%  | [0.2%, 2.4%]   | 38
Regressions ❌ (secondary)  | 0.8%  | [0.2%, 1.6%]   | 9
Improvements ✅ (primary)   | -     | -              | 0
Improvements ✅ (secondary) | -     | -              | 0
All ❌✅ (primary)          | 0.7%  | [0.2%, 2.4%]   | 38

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

                            | mean  | range           | count
----------------------------|-------|-----------------|------
Regressions ❌ (primary)    |  2.0% | [0.5%, 3.9%]    | 3
Regressions ❌ (secondary)  |  0.9% | [0.9%, 0.9%]    | 1
Improvements ✅ (primary)   | -4.5% | [-4.7%, -4.2%]  | 2
Improvements ✅ (secondary) | -4.9% | [-4.9%, -4.9%]  | 1
All ❌✅ (primary)          | -0.6% | [-4.7%, 3.9%]   | 5

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

                            | mean  | range          | count
----------------------------|-------|----------------|------
Regressions ❌ (primary)    | 1.6%  | [1.3%, 2.1%]   | 5
Regressions ❌ (secondary)  | 1.3%  | [1.1%, 1.6%]   | 3
Improvements ✅ (primary)   | -     | -              | 0
Improvements ✅ (secondary) | -     | -              | 0
All ❌✅ (primary)          | 1.6%  | [1.3%, 2.1%]   | 5

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 27, 2023
@Voultapher
Contributor Author

Looking at cycles, it's now down to 5 affected crates, with a mean of 1.6% and a peak of 2.1%. That seems acceptable to me.

@thomcc
Member

thomcc commented Feb 27, 2023

Okay, I'll try to re-review soon, though I might not be able to get to it until next week. Note that the cycles metric is highly noisy, so it's hard to draw too much from it; perhaps someone from the perf group wants to weigh in with their thoughts.

CC @rust-lang/wg-compiler-performance

@nnethercote
Contributor

This one is certainly tricky, because the CI perf runs can be affected both by the compiler itself using the new sorting code and by the compiler compiling the new sorting code in the benchmarks. This is the kind of change where instruction counts are not such a good metric, and cycles and wall time are better. Those results don't look so bad.

I see https://github.com/Voultapher/sort-research-rs/blob/main/writeup/glidesort_perf_analysis/text.md has detailed benchmarking. Does that cover the final version of the code with the sorting-network removed?

If it's not too painful, I'd be very interested to see (probably in a different PR) a CI perf run with the current sort algorithm replaced with something very simple and small, like a basic mergesort. That would give an interesting perspective on the various factors.

@Voultapher
Contributor Author

Does that cover the final version of the code with the sorting-network removed?

It benchmarked a version with two additions on top of what is currently being reviewed: the sorting-network for Copy types and common-value filtering. If you look at the string performance, that is representative of the current performance for patterns without low cardinality. Here is a simple benchmark of the current state: #108005 (comment).

If it's not too painful, I'd be very interested to see (probably in a different PR) a CI perf run with the current sort algorithm replaced with something very simple and small

I already have something like that, but it goes in the other direction of adding something larger (#108662) to get a reference for how bad that would be. This can then be used in a synthetic compile-time benchmark. Here are some clean compile-time numbers for instantiating sorts with 256 newtype wrappers (50% int, 45% String, 5% Cell), adjusted for the baseline:

Name                      Debug     Release
------------------------|---------|---------
- heapsort:                 254ms | 1_447ms
- slice::sort:            2_341ms | 6_783ms
- slice::sort_unstable:   4_122ms | 17_368ms
- glidesort:             10_997ms | 48_827ms

The experiment PR is there to correlate this with the overall regression.

@Voultapher
Contributor Author

There is other ongoing work in this area, so I'm closing this.

Labels
perf-regression Performance regression. S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue.