Reduce builder size of jobs that take less than an hour #124985

Merged: 1 commit merged into rust-lang:master on May 20, 2024

dpaoliello (Contributor) commented May 10, 2024

The longest build is currently dist-x86_64-linux-alt at ~2hr. It already runs on a 16-core builder, so we can't speed it up by throwing more hardware at it.

Given that the overall build will take at least 2hrs, we can reduce build costs by shrinking the builder for any job that takes less than 1hr, since it will still complete before dist-x86_64-linux-alt does.

Note that scaling isn't linear: halving the core count increases end-to-end build times by about 25-50%. In this sample build, arm-android went from ~52m to 1h 5m and dist-arm-linux went from ~55m to 1h 17m (then failed due to missing metrics).
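As a rough check of the figures above, the sketch below computes the slowdown for the two sample jobs and, under the assumption that billing is proportional to core-minutes (the PR does not state the pricing model), how much each job would cost relative to its current size. The helper names are illustrative only.

```python
def slowdown_percent(before_min: float, after_min: float) -> float:
    """Percentage increase in wall-clock time after halving the cores."""
    return (after_min / before_min - 1.0) * 100.0


def relative_cost(cores_before: int, before_min: float,
                  cores_after: int, after_min: float) -> float:
    """Cost after / cost before, ASSUMING cost scales with core-minutes."""
    return (cores_after * after_min) / (cores_before * before_min)


# Approximate timings quoted above: (cores before, minutes before, cores after, minutes after).
samples = {
    "arm-android": (8, 52, 4, 65),      # ~52m -> 1h 5m
    "dist-arm-linux": (16, 55, 8, 77),  # ~55m -> 1h 17m
}

for job, (c0, t0, c1, t1) in samples.items():
    print(f"{job}: +{slowdown_percent(t0, t1):.0f}% time, "
          f"{relative_cost(c0, t0, c1, t1):.3f}x cost")
# arm-android: +25% time, 0.625x cost
# dist-arm-linux: +40% time, 0.700x cost
```

Under that assumption, a job that slows down by less than 100% after halving its cores still costs less than before.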

Current job builder sizes and times and proposed new sizes:

| Job | Size | Proposed | Run 1 | Run 2 | Run 3 | Run 4 |
|---|---|---|---|---|---|---|
| aarch64-gnu | - | | 1h 9m 1s | 1h 8m 47s | 1h 8m 45s | 1h 9m 6s |
| arm-android | 8c | 4c | 52m 32s | 52m 38s | 51m 30s | 53m 13s |
| armhf-gnu | 8c | 4c | 37m 30s | 37m 40s | 38m 41s | 37m 56s |
| dist-aarch64-linux | 8c | 4c | 57m 11s | 56m 48s | 55m 53s | 56m 19s |
| dist-android | 8c | 4c | 24m 37s | 25m 13s | 25m 15s | 24m 17s |
| dist-arm-linux | 16c | 8c | 53m 34s | 55m 11s | 56m 1s | 54m 29s |
| dist-armhf-linux | 8c | 4c | 42m 1s | 43m 32s | 43m 27s | 41m 55s |
| dist-armv7-linux | 8c | 4c | 44m 51s | 44m 35s | 43m 34s | 46m 2s |
| dist-i586-gnu-i586-i686-musl | 8c | 4c | 37m 59s | 37m 56s | 38m 4s | 38m 24s |
| dist-i686-linux | 8c | 4c | 52m 20s | 51m 3s | 52m 53s | 50m 38s |
| dist-loongarch64-linux | 8c | 4c | 40m 39s | 40m 20s | 41m 6s | 40m 44s |
| dist-ohos | 8c | 4c | 25m 5s | 24m 34s | 25m 18s | 23m 40s |
| dist-powerpc-linux | 8c | 4c | 42m 31s | 43m 53s | 42m 35s | 41m 56s |
| dist-powerpc64-linux | 8c | 4c | 42m 52s | 44m 36s | 45m 32s | 43m 51s |
| dist-powerpc64le-linux | 8c | 4c | 43m 41s | 44m 11s | 43m 2s | 44m 21s |
| dist-riscv64-linux | 8c | 4c | 41m 25s | 42m 41s | 41m 52s | 43m 47s |
| dist-s390x-linux | 8c | 4c | 46m 48s | 47m 18s | 47m 27s | 46m 49s |
| dist-various-1 | 8c | 4c | 42m 14s | 43m 20s | 43m 20s | 41m 41s |
| dist-various-2 | 8c | 4c | 36m 18s | 38m 15s | 37m 41s | 39m 28s |
| dist-x86_64-freebsd | 8c | 4c | 39m 21s | 39m 40s | 40m 1s | 40m 2s |
| dist-x86_64-illumos | 8c | 4c | 45m 35s | 46m 43s | 46m 2s | 46m 4s |
| dist-x86_64-linux | 16c | | 1h 53m 10s | 1h 51m 15s | 1h 52m 18s | 1h 52m 26s |
| dist-x86_64-linux-alt | 16c | | 2h 3m 33s | 2h 3m 31s | 2h 4m 12s | 2h 2m 21s |
| dist-x86_64-musl | 8c | | 1h 5m 42s | 1h 6m 13s | 1h 7m 49s | 1h 6m 6s |
| dist-x86_64-netbsd | 8c | 4c | 40m 4s | 39m 48s | 40m 16s | 39m 43s |
| i686-gnu | 8c | | 1h 13m 38s | 1h 13m 39s | 1h 13m 48s | 1h 13m 12s |
| i686-gnu-nopt | 8c | | 1h 17m 44s | 1h 18m 14s | 1h 19m 55s | 1h 18m 44s |
| mingw-check | 4c | | 28m 15s | 27m 39s | 28m 36s | 28m 38s |
| test-various | 8c | 4c | 37m 45s | 37m 17s | 38m 26s | 38m 11s |
| x86_64-gnu | 4c | | 1h 34m 1s | 1h 31m 51s | 1h 30m 35s | 1h 32m 53s |
| x86_64-gnu-stable | 4c | | 1h 28m 26s | 1h 28m 11s | 1h 29m 40s | 1h 46m 28s |
| x86_64-gnu-aux | 4c | | 1h 33m 32s | 1h 31m 57s | 1h 34m 8s | 1h 32m 57s |
| x86_64-gnu-integration | 8c | | 1h 22m 2s | 1h 20m 14s | 1h 19m 46s | 1h 21m 24s |
| x86_64-gnu-debug | 8c | 4c | 52m 41s | 53m 40s | 51m 51s | 56m 9s |
| x86_64-gnu-distcheck | 8c | | 1h 9m 14s | 1h 5m 31s | 1h 6m 29s | 1h 5m 50s |
| x86_64-gnu-llvm-18 | 8c | | 1h 39m 47s | 1h 37m 57s | 1h 38m 40s | 1h 37m 38s |
| x86_64-gnu-llvm-17 | 8c | | 1h 41m 50s | 1h 45m 43s | 1h 45m 4s | 1h 43m 4s |
| x86_64-gnu-nopt | 4c | | 1h 20m 42s | 1h 21m 38s | 1h 20m 4s | 1h 22m 11s |
| x86_64-gnu-tools | 8c | | 1h 5m 0s | 1h 5m 30s | 1h 3m 1s | 1h 3m 20s |
| dist-x86_64-apple | xl | | 1h 35m 1s | 1h 39m 57s | 2h 2m 31s | 1h 47m 37s |
| dist-apple-various | xl | | 1h 18m 54s | 1h 22m 31s | 1h 13m 19s | 1h 38m 18s |
| x86_64-apple-1 | xl | | 1h 32m 8s | 1h 40m 12s | 1h 51m 28s | 1h 40m 26s |
| x86_64-apple-2 | xl | | 1h 0m 32s | 1h 4m 5s | 1h 9m 0s | 1h 7m 17s |
| dist-aarch64-apple | m1 | | 1h 3m 9s | 1h 1m 14s | 1h 2m 6s | 1h 2m 24s |
| aarch64-apple | m1 | | 53m 38s | 1h 1m 5s | 1h 3m 15s | 1h 6m 11s |
| x86_64-msvc | 8c | | 1h 27m 48s | 1h 29m 38s | 1h 29m 55s | 1h 28m 4s |
| i686-msvc | 8c | | 1h 38m 28s | 1h 34m 7s | 1h 39m 19s | 1h 39m 28s |
| x86_64-msvc-ext | 8c | | 1h 44m 5s | 1h 38m 40s | 1h 45m 21s | 1h 44m 19s |
| i686-mingw | 8c | | 1h 49m 57s | 1h 45m 1s | 1h 52m 4s | 1h 51m 4s |
| x86_64-mingw | 8c | | 1h 44m 2s | 1h 37m 36s | 1h 49m 58s | 1h 47m 5s |
| dist-x86_64-msvc | 8c | | 1h 57m 14s | 1h 49m 43s | 1h 52m 53s | 1h 52m 35s |
| dist-i686-msvc | 8c | | 1h 8m 5s | 1h 4m 9s | 1h 9m 26s | 1h 12m 0s |
| dist-aarch64-msvc | 8c | | 1h 18m 40s | 1h 14m 4s | 1h 22m 1s | 1h 19m 6s |
| dist-i686-mingw | 8c | | 1h 15m 36s | 1h 14m 36s | 1h 16m 38s | 1h 16m 2s |
| dist-x86_64-mingw | 8c | | 1h 11m 54s | 1h 16m 12s | 1h 16m 54s | 1h 18m 2s |
| dist-x86_64-msvc-alt | 8c | | 1h 11m 17s | 1h 10m 0s | 1h 11m 8s | 1h 13m 14s |
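The selection rule described above (downsize jobs whose runs all finish in under an hour) can be sketched as a small filter over the timing data. This is only an illustration on four rows copied from the table; `parse_duration` and the `runs` mapping are hypothetical helpers, not part of the PR, and the actual change was made by hand in the CI configuration.

```python
import re


def parse_duration(text: str) -> int:
    """Convert a duration such as '1h 8m 47s' or '52m 32s' to seconds."""
    units = {"h": 3600, "m": 60, "s": 1}
    return sum(int(n) * units[u] for n, u in re.findall(r"(\d+)([hms])", text))


# A few rows from the table: job -> (current size, four run times).
runs = {
    "arm-android": ("8c", ["52m 32s", "52m 38s", "51m 30s", "53m 13s"]),
    "dist-android": ("8c", ["24m 37s", "25m 13s", "25m 15s", "24m 17s"]),
    "dist-x86_64-musl": ("8c", ["1h 5m 42s", "1h 6m 13s", "1h 7m 49s", "1h 6m 6s"]),
    "dist-x86_64-linux-alt": ("16c", ["2h 3m 33s", "2h 3m 31s", "2h 4m 12s", "2h 2m 21s"]),
}

ONE_HOUR = 3600

# Keep only jobs whose slowest observed run is still under one hour.
candidates = [
    job for job, (_size, times) in runs.items()
    if max(parse_duration(t) for t in times) < ONE_HOUR
]
print(candidates)  # ['arm-android', 'dist-android']
```

The time threshold is not the whole rule in the table: mingw-check also finishes in under an hour but is already on a 4c builder and has no proposed size.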

rustbot (Collaborator) commented May 10, 2024

r? @Mark-Simulacrum

rustbot has assigned @Mark-Simulacrum.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added the A-testsuite, S-waiting-on-review, and T-infra labels May 10, 2024
@dpaoliello dpaoliello marked this pull request as ready for review May 10, 2024 21:19
matthiaskrgr (Member) commented
Hmm, I would be careful.
Let's say we have an 8c runner that currently fails after 30 minutes (a regular CI failure) and all jobs get cancelled.

If we make it slower by reducing it to 4 cores, we learn about the same test failure 50 or 60 minutes after the start instead of 30, which means CI now runs $num_jobs x 30 minutes longer in total, which may be more expensive?
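The concern above can be put into numbers with a small sketch. It assumes that the other jobs keep running until the failure is detected and that cost scales with core-minutes; the job mix below is hypothetical, not taken from a real run.

```python
def extra_core_minutes(running_job_cores: list[int], delay_minutes: float) -> float:
    """Extra core-minutes burned when a failure is detected `delay_minutes` later.

    Each entry in `running_job_cores` is the core count of one job that is
    still running (and still billed) until the whole run is cancelled.
    """
    return sum(running_job_cores) * delay_minutes


# Hypothetical run: ten 8-core jobs and two 16-core jobs are still going.
still_running = [8] * 10 + [16] * 2

print(extra_core_minutes(still_running, 30))  # 3360, the worst case described above
print(extra_core_minutes(still_running, 10))  # 1120
```

The extra cost grows with both the detection delay and the number of jobs that have not yet finished, so an early failure in a downsized job matters more than a late one.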

dpaoliello (Contributor, Author) commented
> Hmm, I would be careful. Let's say we have an 8c runner that currently fails after 30 minutes (a regular CI failure) and all jobs get cancelled.
>
> If we make it slower by reducing it to 4 cores, we learn about the same test failure 50 or 60 minutes after the start instead of 30, which means CI now runs $num_jobs x 30 minutes longer in total, which may be more expensive?

That's a reasonable concern.

One thing to note is that halving the core count doesn't double the time it takes to run the build: I was seeing regressions closer to 25-50% (although if you want to do a try build, we can get better numbers).

But yes, it's reasonable to expect that if one of these jobs fails, we may see the failure 5-20 mins later than before.

Looking at the rust-lang-ci queue for the auto and try branches, there are currently 6,071 successful runs and 2,268 failures, so roughly an 8:3 success-to-failure ratio. Of the most recent 25 failures, 8 failed in <10min and 14 (including those 8) in <30min, so failures usually happen early. The question, then, is whether the 11 late failures (assuming they were all in these particular jobs) would have eaten away the savings from the ~29 successful builds that correspond to them at the 8:3 ratio.
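The arithmetic in that estimate can be reproduced as below. The counts come from the comment; the break-even helper and its two cost parameters are hypothetical, since the comment gives no per-run savings or per-failure cost.

```python
from fractions import Fraction

successes, failures = 6071, 2268
ratio = successes / failures  # about 2.68, close to 8/3 (about 2.67)

recent_failures = 25
failed_within_30_min = 14  # includes the 8 that failed within 10 minutes
late_failures = recent_failures - failed_within_30_min  # 11

# Successful runs expected alongside those late failures at an 8:3 ratio.
matching_successes = late_failures * Fraction(8, 3)  # 88/3, about 29


def downsizing_pays_off(saving_per_success: float,
                        extra_cost_per_late_failure: float) -> bool:
    """True if total savings on successes exceed total extra cost of late failures.

    Both arguments are HYPOTHETICAL quantities in the same unit
    (for example core-minutes); the PR does not provide them.
    """
    return (matching_successes * saving_per_success
            > late_failures * extra_cost_per_late_failure)


print(f"{ratio:.2f}", late_failures, round(matching_successes))  # 2.68 11 29
```

In this simple model the change pays off as long as a late failure costs less than 8/3 (about 2.7) times what a successful run saves.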

Mark-Simulacrum (Member) commented
@bors r+

This seems broadly reasonable, and we can always revert if it gives us significant problems.

bors (Collaborator) commented May 19, 2024

📌 Commit 1b5d91b has been approved by Mark-Simulacrum

It is now in the queue for this repository.

@bors bors added the S-waiting-on-bors label and removed the S-waiting-on-review label May 19, 2024
bors (Collaborator) commented May 20, 2024

⌛ Testing commit 1b5d91b with merge 44d679b...

bors (Collaborator) commented May 20, 2024

☀️ Test successful - checks-actions
Approved by: Mark-Simulacrum
Pushing 44d679b to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label May 20, 2024
@bors bors merged commit 44d679b into rust-lang:master May 20, 2024
7 checks passed
@rustbot rustbot added this to the 1.80.0 milestone May 20, 2024
@dpaoliello dpaoliello deleted the rebalance branch May 20, 2024 14:47
rust-timer (Collaborator) commented
Finished benchmarking commit (44d679b): comparison URL.

Overall result: no relevant changes - no action needed

@rustbot label: -perf-regression

Instruction count

This benchmark run did not return any relevant results for this metric.

Max RSS (memory usage)

Results (primary -4.4%, secondary -2.1%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | -4.4% | [-4.4%, -4.4%] | 1 |
| Improvements ✅ (secondary) | -2.1% | [-2.1%, -2.1%] | 1 |
| All ❌✅ (primary) | -4.4% | [-4.4%, -4.4%] | 1 |

Cycles

Results (secondary -3.5%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | -3.5% | [-5.1%, -1.9%] | 2 |
| All ❌✅ (primary) | - | - | 0 |

Binary size

This benchmark run did not return any relevant results for this metric.

Bootstrap: 670.321s -> 671.145s (0.12%)
Artifact size: 316.18 MiB -> 316.05 MiB (-0.04%)
