Improve performance of slice::sort with ipn_stable #108005
Conversation
r? @cuviper (rustbot has picked a reviewer for you, use r? to override)

Hey! It looks like you've submitted a new PR for the library teams!

r? @thomcc
Fix CI failures? @rustbot author

@rustbot ready
I developed a pretty thorough test suite (https://github.com/Voultapher/sort-research-rs/blob/main/tests/main.rs) that checks things like: will the original elements still be in the input if any of the possible comparison calls panics? What if the type has self-modified on the stack via interior mutability and the comparison function panics? Will all modifications be visible? It also covers various ways in which Ord can be violated, and more, all with various input sizes and patterns. I run this test suite normally, with ASAN, or with Miri, and I have libfuzzer and AFL harnesses. This version of the code passes all those tests.
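As an illustration, here is a minimal sketch in the spirit of one of those checks (not the suite's actual code): sort with a comparison that panics partway through, then verify that no element was lost or duplicated.

```rust
use std::panic;

fn main() {
    // Silence the expected panic message on stderr.
    panic::set_hook(Box::new(|_| {}));

    let original: Vec<i32> = (0..100).rev().collect();
    let mut v = original.clone();

    // Sort with a comparison function that panics partway through.
    let mut calls = 0;
    let _ = panic::catch_unwind(panic::AssertUnwindSafe(|| {
        v.sort_by(|a, b| {
            calls += 1;
            if calls == 50 {
                panic!("induced comparison panic");
            }
            a.cmp(b)
        });
    }));

    // Panic-safety invariant: the slice must still be a permutation of the
    // original elements, with nothing lost or duplicated.
    let mut expected = original;
    expected.sort_unstable();
    v.sort_unstable();
    assert_eq!(v, expected);
}
```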
This looks pretty good to me. It's certainly complex, but is very well commented. @bors try @rust-timer queue
⌛ Trying commit c7403f2722fdd76f3932c2736b8beefc5f838767 with merge 2c260bb5eff802611a4d5640969e76a92946c62d...

☀️ Try build successful - checks-actions
Finished benchmarking commit (2c260bb5eff802611a4d5640969e76a92946c62d): comparison URL.

Overall result: ❌ regressions - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never

Instruction count: This is a highly reliable metric that was used to determine the overall result at the top of this comment.

Max RSS (memory usage): This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Cycles: This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Interesting regression results. I could see it being a real slowdown for smallish inputs (size < 15), and I have some ideas for how to address that. At the same time, I wonder how representative that metric is. This new implementation deliberately exploits instruction-level parallelism, so it certainly issues and retires more instructions, but it is faster in terms of wall clock. Is there a way to have these measurements done with time instead of instruction count? Tagging @rylev, you seem to know a lot about these performance regression benchmarks.
This is achieved by writing code that exploits instruction-parallel hardware:

- New small sort that handles anything up to size 32
- Faster bi-directional merge function
- Faster run creation with the help of a sorting-network

Probing for a common value as done in ipn_stable is omitted and can be added incrementally in the future. A detailed writeup can be found here: https://github.com/Voultapher/sort-research-rs/blob/main/writeup/glidesort_perf_analysis/text.md. The only difference is low-cardinality inputs, which will perform similarly to random inputs.
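As a rough illustration of the sorting-network idea (a minimal sketch, not the PR's actual code): each layer's compare-exchanges touch disjoint elements, so a superscalar CPU can execute them in parallel, and the branchless selects avoid branch mispredictions.

```rust
// Branchless compare-exchange: the compiler can lower these selects to
// conditional moves instead of branches.
fn swap_if_less(v: &mut [u64], a: usize, b: usize) {
    let (x, y) = (v[a], v[b]);
    let swap = y < x;
    v[a] = if swap { y } else { x };
    v[b] = if swap { x } else { y };
}

// Optimal 5-comparator network for 4 elements. The first two swaps (and the
// next two) operate on disjoint elements and can retire in parallel.
fn sort4(v: &mut [u64; 4]) {
    swap_if_less(v, 0, 2);
    swap_if_less(v, 1, 3);
    swap_if_less(v, 0, 1);
    swap_if_less(v, 2, 3);
    swap_if_less(v, 1, 2);
}

fn main() {
    let mut v = [3, 1, 4, 2];
    sort4(&mut v);
    assert_eq!(v, [1, 2, 3, 4]);
}
```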
Avoid perf regressions compared to rust_std_stable by carefully putting an inlined insertion sort in the hot path. Re-tune heuristics for when stable sorting-networks are worth it. Use a different pattern-analysis algorithm that will respect a pre-sorted streak in the small-sort if it is the largest part of the input. Also improve edge run creation.
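For reference, a simplified safe sketch of an insertion sort of this shape (the PR's inlined version is more optimized):

```rust
// Simple insertion sort driven by a user comparison, of the kind used for
// small or mostly-sorted inputs.
fn insertion_sort<T, F: FnMut(&T, &T) -> bool>(v: &mut [T], is_less: &mut F) {
    for i in 1..v.len() {
        // Shift v[i] left until its predecessor is not greater.
        let mut j = i;
        while j > 0 && is_less(&v[j], &v[j - 1]) {
            v.swap(j, j - 1);
            j -= 1;
        }
    }
}

fn main() {
    let mut v = vec![5, 2, 4, 1, 3];
    insertion_sort(&mut v, &mut |a, b| a < b);
    assert_eq!(v, vec![1, 2, 3, 4, 5]);
}
```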
To get an idea how much time rustc spends in `slice::sort`, I injected timing code around the call:

```rust
#[link(name = "my_inject", kind = "static")]
extern "C" {
    fn inject_timer_start() -> u64;
    fn inject_timer_stop(len: usize, start_time: u64);
}

[...]

let start_time = unsafe { inject_timer_start() };
sort::merge_sort(v, &mut is_less, elem_alloc_fn, elem_dealloc_fn, run_alloc_fn, run_dealloc_fn);
unsafe {
    inject_timer_stop(v.len(), start_time);
}
```

And the C code:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <pthread.h>
#include <x86intrin.h>

pthread_mutex_t LOCK = PTHREAD_MUTEX_INITIALIZER;
FILE* APPEND_LOG = NULL;

// rdtsc counts reference cycles, not CPU core clock cycles, see
// https://stackoverflow.com/a/51907627.
inline __attribute__((always_inline)) uint64_t clock_start(void) {
    _mm_lfence(); // wait for earlier insns to retire before reading the clock
    uint64_t tsc = __rdtsc();
    return tsc;
}

inline __attribute__((always_inline)) uint64_t clock_end(void) {
    uint64_t tsc = __rdtsc();
    _mm_lfence(); // block later instructions until rdtsc retires
    return tsc;
}

__attribute__((visibility("default"))) uint64_t inject_timer_start() {
    return clock_start();
}

__attribute__((visibility("default"))) void inject_timer_stop(
    size_t len,
    uint64_t start_time) {
    const uint64_t end_time = clock_end();
    const uint64_t time_diff_cycles = end_time - start_time;

    pthread_mutex_lock(&LOCK);
    if (!APPEND_LOG) {
        APPEND_LOG = fopen("/tmp/inject_log.txt", "a+");
        if (!APPEND_LOG) {
            pthread_mutex_unlock(&LOCK);
            // fprintf(stderr, "Failed to open file.\n");
            return;
        }
    }

    // fprintf(APPEND_LOG, "%zo %lo\n", len, time_diff_cycles); // WRONG
    fprintf(APPEND_LOG, "%zu %" PRIu64 "\n", len, time_diff_cycles);
    pthread_mutex_unlock(&LOCK);
}
```

This appends the slice length and the time it took, in reference tsc cycles, to a file. I looked into measuring real core cycles with hardware performance counters, and I then ran the html5ever benchmark.
The total cycles reported by perf and the tsc cycles measured with the injection code are not perfectly comparable; however, tsc cycles are perfectly comparable with each other. I think this shows two things:

Another interesting insight: rustc does a total of 492_581 calls to [...]

EDIT: Numbers fixed, to avoid confusion.
@rustbot ready

Thank you for the thorough measurement! (but unsatisfying that it leaves the cause of regression unknown...)
So, I uhm made a pretty hilarious and stupid mistake 😝. While writing the code I looked at the wrong printf specifier column and printed the numbers as octal. Luckily I can fix that in post-processing, along the lines of the sketch below.

Everything else remains the same.
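A minimal sketch of that post-processing step (the log path follows the code above; the rest is an assumption):

```rust
use std::fs;

fn main() -> std::io::Result<()> {
    // The buggy specifiers ("%zo %lo") wrote both columns in octal, so
    // re-parse each column as base 8 and print it back out in decimal.
    let log = fs::read_to_string("/tmp/inject_log.txt")?;
    for line in log.lines() {
        let mut cols = line.split_whitespace();
        let len = u64::from_str_radix(cols.next().unwrap(), 8).unwrap();
        let cycles = u64::from_str_radix(cols.next().unwrap(), 8).unwrap();
        println!("{len} {cycles}");
    }
    Ok(())
}
```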
I think the main reason is that it generates more code, and generating more code means more work for the compiler. This is kind of unavoidable with more sophisticated algorithms.

It's tricky to say how much we want something like that, though. Notably, we get lots of requests for a faster compiler, and comparatively fewer requests for a faster sort implementation. Still, I'll take another look later when I have time.
To further reduce binary size and limit a possible negative impact on certain types, the stable sorting-network was removed. To trim code-gen even more, only types that qualify will now call bi_directional_merge.
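A hedged sketch of what such type gating can look like; the predicate name and the thresholds are assumptions for illustration, not the PR's exact conditions:

```rust
use std::mem;

// Only cheaply movable, plain data is worth the extra code paths: no drop
// glue and small enough that speculative copies stay cheap.
fn qualifies_for_branchless_merge<T>() -> bool {
    !mem::needs_drop::<T>() && mem::size_of::<T>() <= mem::size_of::<[usize; 4]>()
}

fn main() {
    assert!(qualifies_for_branchless_merge::<u64>());
    assert!(!qualifies_for_branchless_merge::<String>()); // has drop glue
}
```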
I ran some benchmarks in real-world scenarios and got a little spooked by the stable sorting-network; it's great for integers, but I suspect that limiting it to [...]

hot-u64-random-10k:

hot-string-random-10k:

Would be good to do another timing run. Also arguably, for integers users should use unstable sort, which I plan on upstreaming changes to soon (tm).
Sorry for the delay, I missed that comment. Thanks for the investigation. @bors try @rust-timer queue
⌛ Trying commit 556ba1b with merge 430e29582121afa3a686a8e4cb4b84e300ff2925...
☀️ Try build successful - checks-actions
Finished benchmarking commit (430e29582121afa3a686a8e4cb4b84e300ff2925): comparison URL.

Overall result: ❌ regressions - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never

Instruction count: This is a highly reliable metric that was used to determine the overall result at the top of this comment.

Max RSS (memory usage): This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Cycles: This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Looking at cycles, it's now down to 5 affected crates, with a mean of 1.6% and a peak of 2.1%. That seems acceptable to me.

Okay, I'll try to re-review soon. I might not be able to get to it until next week. Note that cycles is a highly noisy metric, so it's hard to draw too much from it; perhaps someone from the perf group wants to weigh in with their thoughts. CC @rust-lang/wg-compiler-performance
This one is certainly tricky, because the CI perf runs can be affected both by the compiler itself using the new sorting code and by the compiler compiling the new sorting code in the benchmarks. This is the kind of change where instruction count is not such a good metric, and cycles and wall time are better. Those results don't look so bad. I see https://github.com/Voultapher/sort-research-rs/blob/main/writeup/glidesort_perf_analysis/text.md has detailed benchmarking. Does that cover the final version of the code with the sorting-network removed? If it's not too painful, I'd be very interested to see (probably in a different PR) a CI perf run with the current sort algorithm replaced with something very simple and small, like a basic mergesort. That would give an interesting perspective on the various factors.
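For concreteness, a minimal sketch of the kind of "basic mergesort" meant here (illustrative only, not a proposal for the actual std code):

```rust
// Textbook top-down merge sort: tiny code footprint, O(n) scratch space.
fn merge_sort<T: Ord + Copy>(v: &mut [T]) {
    let n = v.len();
    if n <= 1 {
        return;
    }
    let mid = n / 2;
    merge_sort(&mut v[..mid]);
    merge_sort(&mut v[mid..]);

    // Merge the two sorted halves through a scratch buffer.
    let mut buf = Vec::with_capacity(n);
    let (mut i, mut j) = (0, mid);
    while i < mid && j < n {
        if v[i] <= v[j] {
            buf.push(v[i]);
            i += 1;
        } else {
            buf.push(v[j]);
            j += 1;
        }
    }
    buf.extend_from_slice(&v[i..mid]);
    buf.extend_from_slice(&v[j..]);
    v.copy_from_slice(&buf);
}

fn main() {
    let mut v = [3, 1, 2];
    merge_sort(&mut v);
    assert_eq!(v, [1, 2, 3]);
}
```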
It benchmarked a version with two additions to what is currently being reviewed: the sorting-network for [...]
I already have something like that, but it goes in the other direction: adding something larger (#108662) to get a reference for how bad that would be. This can then be used in a synthetic compile-time benchmark. Here are some clean compile-time numbers for instantiating sorts with 256 NewType wrappers (50% int, 45% String, and 5% Cell), adjusted for the baseline:
The experiment PR is there to correlate this to overall regression.
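A hypothetical sketch of how such an instantiation benchmark can be set up (macro and module names are made up; the real harness differs):

```rust
// Every module gets its own NewType, forcing a fresh monomorphization of
// slice::sort per wrapper, which is what the compile-time benchmark measures.
macro_rules! instantiate_sorts {
    ($($m:ident),* $(,)?) => {
        $(
            pub mod $m {
                #[derive(PartialEq, Eq, PartialOrd, Ord)]
                pub struct NewType(pub u64);

                pub fn sort(v: &mut [NewType]) {
                    v.sort();
                }
            }
        )*
    };
}

// In the real benchmark this list would have 256 entries, mixing wrappers
// around integers, String, and Cell.
instantiate_sorts!(t000, t001, t002, t003);

fn main() {
    let mut v = [t000::NewType(2), t000::NewType(1)];
    t000::sort(&mut v);
    assert!(v[0].0 <= v[1].0);
}
```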
There is other ongoing work in this area, so I'm closing this.