
Set up benchmarks server #55007

@datapythonista

This issue is to keep track of the benchmarks project we are about to start. Adding context and some initial ideas, but we'll keep evolving the project as we make progress.

The final goal of our benchmarks system is to detect when some functionality in pandas starts taking longer for no good reason. See this fictional example:

>>> import pandas
>>> pandas.__version__
2.0
>>> data = pandas.Series([1, 2, 3])
>>> %time data.sum()
1 ms

>>> import pandas
>>> pandas.__version__
2.1
>>> data = pandas.Series([1, 2, 3])
>>> %time data.sum()
5 ms

In the example, Series.sum on the same data takes 1 millisecond in 2.0, while in 2.1 it takes 5 times as long, probably because of an unexpected side effect, not because we made a change that we considered worth a 5x slowdown. When we introduce a performance regression like the one above, it is likely that users will end up reporting it via a GitHub issue. But ideally we would detect it much earlier, and never release versions of pandas with such performance regressions.

For this reason, we implemented many functions that exercise pandas functionality with arbitrary but constant data: our benchmark suite, whose code lives in asv_bench/benchmarks. The benchmarks are implemented using a framework named asv. Running the suite gives us the time it takes to run each of many pandas functions with data that is consistent across runs. By comparing the executions of two different pandas versions, we can detect performance regressions.
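For illustration, here is a minimal sketch of what a benchmark in this style looks like (an invented example following asv conventions, not a file copied from asv_bench/benchmarks): asv discovers classes with time_* methods, calls setup before timing, and reports how long each time_* method takes.

import numpy as np
import pandas as pd


class SeriesSum:
    # Hypothetical benchmark class, following asv naming conventions
    def setup(self):
        # Arbitrary but constant data, so timings are comparable across runs
        self.data = pd.Series(np.random.default_rng(42).standard_normal(1_000_000))

    def time_sum(self):
        # asv times this method and reports the result
        self.data.sum()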

In an ideal world, we would detect performance regressions in the CI of our PRs. That way, before merging a PR we could see that it introduces a performance regression and decide whether the change is worth making pandas slower. This would be equivalent to what we do with tests or linting, where the CI turns red when something is not right, and merging doesn't happen until we're happy with the PR.

In practice, running consistent benchmarks for every commit in a pull request is not feasible at this point. The main reasons are:

  • Our benchmarks take around 7 hours on decent hardware
  • The CI doesn't use decent hardware; it uses small workers, where the suite would take far longer
  • To compare benchmarks we need consistent hardware. We could run the benchmarks for both main and the PR of interest in the same worker, but that would double the execution time
  • Virtual hardware, like a GitHub CI worker, introduces a lot of noise. The time to execute a function depends as much on how busy the rest of the physical host is as on the time our implementation takes

For now, we are giving up on executing the benchmarks for every commit in an open PR, and the focus has been on executing them after a PR is merged. We run the benchmarks on a physical server, not a virtual one. The server we've been using was bought by Wes many years ago and ran 24/7 from his home; at some point it was moved to Tom's home. This is still how we run the benchmarks now. Some months ago, OVH Cloud donated credits to pandas to use a dedicated server from their cloud for our benchmarks (we also got credits to host the website/docs and for other things we may need). There was some work on setting up the server and improving things, but not even the initial work was completed.

There are three main challenges to what would otherwise be a somewhat simple project:

  • Having stable benchmarks is very hard
  • Our benchmarks suite takes a very long time
  • ASV is not as good as we would like. The codebase was unmaintained for several years, and while a lot of work has gone into it more recently, it is complex and not easy to deal with, and the UX is not always intuitive.

The most common approach to benchmark stability is what is usually called statistical benchmarking. In its simplest form, the idea is that a single benchmark result is obtained by running the function to time something like 100 times and taking the mean. ASV does a slightly smarter version of this, where the first few runs (warm-up) are discarded and the exact number of repetitions depends on the variance of the first runs. But 100 repetitions is common.
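As a rough illustration of the idea (not how asv is implemented internally), a statistical benchmark in its simplest form looks like this:

import statistics
import time


def bench(func, warmup=3, repeat=100):
    # Warm-up runs are discarded (caches, imports, lazy initialization)
    for _ in range(warmup):
        func()
    samples = []
    # Time the function e.g. 100 times and aggregate the samples
    for _ in range(repeat):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)

asv additionally adapts the number of repetitions to the observed variance, but the principle is the same.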

This repetition brings more stability, but obviously makes the second challenge worse, as timing every function 100 times makes the benchmarks 100 times slower. We have a CI job where we run our benchmark suite with just one call per function, and the timing is very reasonable: the job takes 25 minutes on the CI worker. But the results are discarded, since they are very unstable, both because of the lack of repetition and because of the instability of using a virtual machine / CI worker.

For now, what we want is:

  • Configure our dedicated server to be as stable as possible. An amazing resource to better understand the problem and possible solutions is Victor Stinner's blog series. Also this page: https://pyperf.readthedocs.io/en/latest/system.html
  • Find a way to run the benchmarks for each commit to main. The existing server uses the code in https://github.com/asv-runner. There are many options here, but anything running on the server affects its stability. We could implement a web server that receives webhooks from GitHub for each commit, but we can only run the benchmarks one at a time, and I'm not sure whether the activity of the web server itself would be significant for the benchmark results (see the sketch after this list)
  • Find a way to run the benchmarks faster than commits are added to the queue. If the benchmarks take 6 hours, we can run the suite 4 times per day; since we likely merge more than 4 PRs per day, the queue would keep growing and over time the last executed benchmark would fall far behind the recent commits. Running the benchmarks for every commit would be amazing, but it may require reducing the amount of data used by the slower benchmarks, capping the number of repetitions, or other solutions. If we need to skip commits, we may want to do that in a known and consistent way
  • Make sure that results are meaningful and useful. The results are currently available here: https://asv-runner.github.io/asv-collection/pandas/. ASV is able to run the benchmark suite with different environments. This means that for a specific pandas commit it can run for a matrix of Python versions, Cython versions, NumPy versions... I'm not sure there is a clear decision on how we'd like to benchmark things regarding environments, but I personally find the results quite hard to understand. See for example: https://asv-runner.github.io/asv-collection/pandas/#algos.isin.IsIn.time_isin_mismatched_dtype. Even when filtering out all but one result, the results aren't obvious. There are around 50 results with their commit hashes, but it's not easy to tell the dates of these measurements. Is ASV just keeping the last 50 measurements? Are there commits that didn't run? I guess most of the problem is in ASV itself, but since other projects use it, I guess we can set it up in a way that makes it more useful
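To make the webhook idea from the list above more concrete, here is a hypothetical sketch (the port, the payload handling and the exact asv invocation are assumptions, not an existing setup): GitHub push events are queued and a single worker processes them, so only one benchmark run happens at a time.

import json
import queue
import subprocess
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

commits = queue.Queue()


def worker():
    # Single worker: benchmark runs never overlap
    while True:
        sha = commits.get()
        subprocess.run(["asv", "run", f"{sha}^!"], check=False)  # benchmark just this commit
        commits.task_done()


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        sha = payload.get("after")  # head commit of a GitHub push event
        if sha:
            commits.put(sha)
        self.send_response(202)
        self.end_headers()


if __name__ == "__main__":
    threading.Thread(target=worker, daemon=True).start()
    HTTPServer(("", 8000), WebhookHandler).serve_forever()

Whether even a small HTTP server like this introduces measurable noise on the benchmark machine is exactly the open question above; it could also live on a separate machine that triggers the runs on the benchmark server.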

Once we have a reasonable version of the above, some other things can be considered. Some ideas:

  • Implement more complex benchmarks (benchmark whole pipelines, not only individual functions). TPC-H for example, as Polars does
  • Consider ASV alternatives. I did some research on this, and there is no obvious tool that we should be using instead of ASV, but ASV will be a blocker for any innovation
  • Anything that can be helpful towards the final goal of early detection of performance regressions.

CC: @DeaMariaLeon @lithomas1 @rhshadrach
