It would be valuable to have a benchmark suite for Pandas-like projects. This would help users reasonably compare the performance tradeoffs of different implementations and help developers identify possible performance issues.
There are, I think, a few axes that such a benchmark suite might cover (a rough sketch of how they might combine follows this list):
- Operation type: filters, aggregations, random access, groupby-aggregate, set-index, merge, time series stuff, assignment, uniqueness, ...
- Datatype: grouping on ints, floats, strings, categoricals, etc.
- Cardinality: lots of distinct floats vs. just a few common strings
- Data Size: How well do projects scale up? How well do they scale down?
- Cluster size: for those projects for which this is appropriate
- (probably lots of other things I'm missing)
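To make the combinatorial nature of these axes concrete, here is a rough sketch of a parameter grid over operation type, datatype, cardinality, and data size. All of the names here (`make_frame`, `OPERATIONS`, the particular sizes) are hypothetical placeholders to show the structure, not a proposed API:

```python
# Hypothetical parameter grid over the axes listed above.
import itertools
import numpy as np
import pandas as pd

SIZES = [10_000, 1_000_000]              # data size axis
CARDINALITIES = [10, 100_000]            # number of distinct grouping keys
DTYPES = ["int64", "float64", "object"]  # grouping-key datatype axis

def make_frame(size, cardinality, dtype):
    """Build a frame whose grouping key has the requested dtype and cardinality."""
    key = np.random.randint(0, cardinality, size)
    if dtype == "float64":
        key = key.astype("float64")
    elif dtype == "object":
        key = pd.Series(key).map("group-{}".format).to_numpy()
    return pd.DataFrame({"key": key, "value": np.random.randn(size)})

# Operation-type axis: each entry is one benchmarkable operation.
OPERATIONS = {
    "filter": lambda df: df[df["value"] > 0],
    "groupby-agg": lambda df: df.groupby("key")["value"].mean(),
    "set-index": lambda df: df.set_index("key"),
}

for size, card, dtype in itertools.product(SIZES, CARDINALITIES, DTYPES):
    df = make_frame(size, card, dtype)
    for name, op in OPERATIONS.items():
        op(df)  # in a real suite this call would be timed and recorded
```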
Additionally, there are a few projects that I think might benefit from such an endeavor:
- Pandas itself
- Newer pandas developments (whatever gets built on top of Arrow memory), which may have enough API compatibility to take advantage of this?
- Pandas on Ray (see this nice blogpost: https://rise.cs.berkeley.edu/blog/pandas-on-ray/)
- Dask.dataframe
- Spark dataframes? If we can build in API tweaking, which I suspect will be necessary; a rough shim sketch follows this list.
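To illustrate the kind of API tweaking I mean, here is a minimal sketch of a backend shim that lets one benchmark body drive both pandas and dask.dataframe. The class and function names are made up for illustration, and a Spark or Ray backend would need a similar (probably larger) shim:

```python
# Hypothetical backend shim: each backend exposes the same two hooks
# (ingest a pandas frame, materialize a result), so one benchmark body
# can run unchanged against multiple implementations.
import numpy as np
import pandas as pd
import dask.dataframe as dd

class PandasBackend:
    name = "pandas"
    def ingest(self, df):
        return df
    def compute(self, obj):
        return obj  # pandas is eager; nothing to materialize

class DaskBackend:
    name = "dask"
    def __init__(self, npartitions=8):
        self.npartitions = npartitions
    def ingest(self, df):
        return dd.from_pandas(df, npartitions=self.npartitions)
    def compute(self, obj):
        return obj.compute()  # force the lazy graph to execute

def bench_groupby_mean(backend, df):
    data = backend.ingest(df)
    return backend.compute(data.groupby("key")["value"].mean())

df = pd.DataFrame({"key": np.random.randint(0, 100, 1_000_000),
                   "value": np.random.randn(1_000_000)})
for backend in (PandasBackend(), DaskBackend()):
    bench_groupby_mean(backend, df)  # a Spark backend would need its own shim
```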
Some operational questions:
- How does one socially organize such a collection of benchmarks in a sensible way? My guess is that no one individual is likely to have time to put this together (though I would love to be proved wrong here). The objectives here are somewhat different from what currently lives in `asv_bench/benchmarks`.
- How does one consistently execute such a benchmark? I was looking at http://pytest-benchmark.readthedocs.io/en/latest (a minimal example follows this list).
- What challenges are we likely to observe due to the differences in each project? How do we reasonably work around them?
- How do we avoid developer bias when forming benchmarks?
- Does anyone have enthusiasm about working on this?
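For reference, a minimal pytest-benchmark example might look like the following. This assumes the plugin is installed; the `benchmark` fixture comes from pytest-benchmark, while the data shape and operation are arbitrary placeholders:

```python
# Run with `pytest`; pytest-benchmark collects timing statistics per test.
import numpy as np
import pandas as pd
import pytest

@pytest.fixture(scope="module")
def frame():
    return pd.DataFrame({"key": np.random.randint(0, 1000, 1_000_000),
                         "value": np.random.randn(1_000_000)})

def test_groupby_mean(benchmark, frame):
    # benchmark() calls the function repeatedly and records the timings
    result = benchmark(lambda: frame.groupby("key")["value"].mean())
    assert not result.empty
```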
Anyway, those are some thoughts. Please let me know if this is out of scope for this issue tracker.