
Shared Benchmark suite for Pandas-like projects #19988

Open
@mrocklin

Description

It would be valuable to have a benchmark suite for Pandas-like projects. This would help users reasonably compare the performance tradeoffs of different implementations and help developers identify possible performance issues.

There are, I think, a few axes along which such a benchmark suite might vary (a parameterized sketch follows this list):

  1. Operation type: filters, aggregations, random access, groupby-aggregate, set-index, merge, time-series operations, assignment, uniqueness, ...
  2. Datatype: grouping on ints, floats, strings, categoricals, etc.
  3. Cardinality: lots of distinct floats vs. just a few common strings
  4. Data Size: How well do projects scale up? How well do they scale down?
  5. Cluster size: for those projects for which this is appropriate
  6. (probably lots of other things I'm missing)
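
To make the axes concrete, here is a minimal sketch in the parameterized style of pandas' existing asv_bench/benchmarks, crossing datatype, cardinality, and data size for a single groupby-aggregate. The particular grid values are illustrative, not a proposed matrix:

```python
import numpy as np
import pandas as pd


class GroupByAggregate:
    # Axes 2-4 above (datatype, cardinality, data size) become benchmark
    # parameters; asv runs every combination and reports each separately.
    params = (
        ["int64", "float64", "str", "category"],  # key dtype
        [10, 10_000],                             # distinct keys (cardinality)
        [100_000, 10_000_000],                    # rows (data size)
    )
    param_names = ["dtype", "n_keys", "n_rows"]

    def setup(self, dtype, n_keys, n_rows):
        keys = np.random.randint(0, n_keys, size=n_rows)
        self.df = pd.DataFrame(
            {"key": pd.Series(keys).astype(dtype),
             "value": np.random.randn(n_rows)}
        )

    def time_groupby_sum(self, dtype, n_keys, n_rows):
        self.df.groupby("key")["value"].sum()
```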

Additionally, there are a few projects that I think might benefit from such an endeavor (a sketch of sharing one benchmark body across backends follows this list):

  1. Pandas itself
  2. Newer pandas developments (whatever gets built on top of Arrow memory), which may have enough API compatibility to take advantage of this?
  3. Pandas on Ray (see this nice blogpost: https://rise.cs.berkeley.edu/blog/pandas-on-ray/)
  4. Dask.dataframe
  5. Spark DataFrames, if we can build in API tweaking (which I suspect will be necessary)
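
On the cross-project question (including the API tweaking in item 5), one hedged sketch is to hide construction and materialization behind small per-backend helpers so the benchmark body itself stays identical. The names here (BACKENDS, make_*, bench_filter) are hypothetical, not an existing API:

```python
import numpy as np
import pandas as pd


def make_pandas(data):
    return pd.DataFrame(data)


def make_dask(data, npartitions=4):
    import dask.dataframe as dd
    # Dask frames are built from a pandas frame rather than constructed
    # directly: one example of the per-project API tweaking mentioned above.
    return dd.from_pandas(pd.DataFrame(data), npartitions=npartitions)


# Hypothetical registry; a pandas-on-Ray module could slot in the same way.
BACKENDS = {"pandas": make_pandas, "dask": make_dask}


def bench_filter(df):
    result = df[df["x"] > 0]
    # Lazy backends need an explicit materialization step.
    return result.compute() if hasattr(result, "compute") else result


if __name__ == "__main__":
    data = {"x": np.random.randn(1_000_000)}
    for name, make in BACKENDS.items():
        bench_filter(make(data))  # timing harness omitted for brevity
```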

Some operational questions:

  1. How does one socially organize such a collection of benchmarks in a sensible way? My guess is that no one individual is likely to have time to put this together (though I would love to be proved wrong here). The objectives here are somewhat different from what currently lives in asv_bench/benchmarks.
  2. How does one consistently execute such a benchmark? I was looking at http://pytest-benchmark.readthedocs.io/en/latest (a minimal sketch follows this list).
  3. What challenges are we likely to observe due to the differences in each project? How do we reasonably work around them?
  4. How do we avoid developer bias when forming benchmarks?
  5. Does anyone have enthusiasm about working on this?
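
For question 2, a minimal sketch of what a pytest-benchmark test could look like; the benchmark fixture handles calibration, repetition, and statistics uniformly, and runs can be saved for later comparison (e.g. with --benchmark-autosave). The fixture and test names are illustrative:

```python
import numpy as np
import pandas as pd
import pytest


@pytest.fixture
def df():
    rng = np.random.default_rng(0)
    return pd.DataFrame({"key": rng.integers(0, 100, size=1_000_000),
                         "value": rng.standard_normal(1_000_000)})


def test_groupby_sum(benchmark, df):
    # pytest-benchmark's `benchmark` fixture calls the function repeatedly
    # and records timing statistics (min/mean/stddev, rounds, etc.).
    benchmark(lambda: df.groupby("key")["value"].sum())
```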

Anyway, those are some thoughts. Please let me know if this is out of scope for this issue tracker.
