It would be valuable to have a benchmark suite for Pandas-like projects. This would help users reasonably compare the performance tradeoffs of different implementations and help developers identify possible performance issues.
There are, I think, a few axes that such a benchmark suite might cover (a rough sketch of how they might combine follows this list):
- Operation type: filters, aggregations, random access, groupby-aggregate, set-index, merge, time series stuff, assignment, uniqueness, ...
- Datatype: grouping on ints, floats, strings, categoricals, etc.
- Cardinality: lots of distinct floats vs. just a few common strings
- Data Size: How well do projects scale up? How well do they scale down?
- Cluster size: for those projects for which this is appropriate
- (probably lots of other things I'm missing)
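To make the combinatorial nature of these axes concrete, here is a rough sketch of a parameter grid over operation type, datatype, cardinality, and data size. All of the names here (`make_frame`, `OPERATIONS`, the particular sizes) are hypothetical placeholders to show the structure, not a proposed API:

```python
# Hypothetical parameter grid over the axes listed above.
import itertools
import numpy as np
import pandas as pd

SIZES = [10_000, 1_000_000]              # data size axis
CARDINALITIES = [10, 100_000]            # number of distinct grouping keys
DTYPES = ["int64", "float64", "object"]  # grouping-key datatype axis

def make_frame(size, cardinality, dtype):
    """Build a frame whose grouping key has the requested dtype and cardinality."""
    key = np.random.randint(0, cardinality, size)
    if dtype == "float64":
        key = key.astype("float64")
    elif dtype == "object":
        key = pd.Series(key).map("group-{}".format).to_numpy()
    return pd.DataFrame({"key": key, "value": np.random.randn(size)})

# Operation-type axis: each entry is one benchmarkable operation.
OPERATIONS = {
    "filter": lambda df: df[df["value"] > 0],
    "groupby-agg": lambda df: df.groupby("key")["value"].mean(),
    "set-index": lambda df: df.set_index("key"),
}

for size, card, dtype in itertools.product(SIZES, CARDINALITIES, DTYPES):
    df = make_frame(size, card, dtype)
    for name, op in OPERATIONS.items():
        op(df)  # in a real suite this call would be timed and recorded
```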
Additionally, there are a few projects that I think might benefit from such an endeavor:
- Pandas itself
- Newer pandas developments (whatever gets built on top of Arrow memory), which may have enough API compatibility to take advantage of this?
- Pandas on Ray (see this nice blogpost: https://rise.cs.berkeley.edu/blog/pandas-on-ray/)
- Dask.dataframe
- Spark dataframes? If we can build in API tweaking, which I suspect will be necessary; a rough shim sketch follows this list.
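To illustrate the kind of API tweaking I mean, here is a minimal sketch of a backend shim that lets one benchmark body drive both pandas and dask.dataframe. The class and function names are made up for illustration, and a Spark or Ray backend would need a similar (probably larger) shim:

```python
# Hypothetical backend shim: each backend exposes the same two hooks
# (ingest a pandas frame, materialize a result), so one benchmark body
# can run unchanged against multiple implementations.
import numpy as np
import pandas as pd
import dask.dataframe as dd

class PandasBackend:
    name = "pandas"
    def ingest(self, df):
        return df
    def compute(self, obj):
        return obj  # pandas is eager; nothing to materialize

class DaskBackend:
    name = "dask"
    def __init__(self, npartitions=8):
        self.npartitions = npartitions
    def ingest(self, df):
        return dd.from_pandas(df, npartitions=self.npartitions)
    def compute(self, obj):
        return obj.compute()  # force the lazy graph to execute

def bench_groupby_mean(backend, df):
    data = backend.ingest(df)
    return backend.compute(data.groupby("key")["value"].mean())

df = pd.DataFrame({"key": np.random.randint(0, 100, 1_000_000),
                   "value": np.random.randn(1_000_000)})
for backend in (PandasBackend(), DaskBackend()):
    bench_groupby_mean(backend, df)  # a Spark backend would need its own shim
```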
Some operational questions:
- How does one socially organize such a collection of benchmarks in a sensible way? My guess is that no one individual is likely to have time to put this together (though I would love to be proved wrong here). The objectives here are somewhat different from what currently lives in `asv_bench/benchmarks`.
- How does one consistently execute such a benchmark? I was looking at http://pytest-benchmark.readthedocs.io/en/latest (a minimal example follows this list).
- What challenges are we likely to observe due to the differences in each project? How do we reasonably work around them?
- How do we avoid developer bias when forming benchmarks?
- Does anyone have enthusiasm about working on this?
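For reference, a minimal pytest-benchmark example might look like the following. This assumes the plugin is installed; the `benchmark` fixture comes from pytest-benchmark, while the data shape and operation are arbitrary placeholders:

```python
# Run with `pytest`; pytest-benchmark collects timing statistics per test.
import numpy as np
import pandas as pd
import pytest

@pytest.fixture(scope="module")
def frame():
    return pd.DataFrame({"key": np.random.randint(0, 1000, 1_000_000),
                         "value": np.random.randn(1_000_000)})

def test_groupby_mean(benchmark, frame):
    # benchmark() calls the function repeatedly and records the timings
    result = benchmark(lambda: frame.groupby("key")["value"].mean())
    assert not result.empty
```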
Anyway, those are some thoughts. Please let me know if this is out of scope for this issue tracker.