Implement high performance rolling_rank #9481

Closed
@PH82

Description

xref SO issue here

I'm looking to compute a rolling rank on a DataFrame. Having posted, discussed and analysed the code, it looks like the suggested way is to pass the pandas Series.rank function to rolling_apply. However, on large datasets the performance is particularly poor. I have tried different implementations, and using bottleneck's rankdata is roughly an order of magnitude faster, but it only offers the average option for ties. It is also still far slower than rolling_mean (3.64 s versus 0.008 s in the benchmark below). I have previously implemented a rolling rank function which monitors changes on a moving window (in a similar way to algos.roll_mean, I believe) rather than recalculating the rank from scratch on each window. The benchmark below highlights the performance gap; it should be possible to implement a rolling rank with performance comparable to rolling_mean.
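To make the incremental idea concrete, here is a minimal pure-Python sketch. It is only an illustration of the approach, not the implementation mentioned above and not an existing pandas function: `rolling_rank_sketch` is a hypothetical name, and the sketch assumes the input contains no NaNs. It keeps the window contents in a sorted list, removes the value that leaves the window, inserts the new one, and reads the descending rank of the newest value from its position, supporting both the min and average tie rules.

```python
from bisect import bisect_left, bisect_right, insort


def rolling_rank_sketch(values, window, method="min"):
    """Descending rank of the newest value within each trailing window.

    Hypothetical helper for illustration; assumes `values` has no NaNs.
    Positions before the first full window are filled with NaN.
    """
    if method not in ("min", "average"):
        raise ValueError("method must be 'min' or 'average'")
    vals = [float(v) for v in values]
    out = [float("nan")] * len(vals)
    sorted_win = []  # current window contents, kept in ascending order
    for i, x in enumerate(vals):
        if i >= window:
            # Remove the value that has just dropped out of the window.
            old = vals[i - window]
            del sorted_win[bisect_left(sorted_win, old)]
        insort(sorted_win, x)
        if i >= window - 1:
            lo = bisect_left(sorted_win, x)
            hi = bisect_right(sorted_win, x)
            n_greater = len(sorted_win) - hi  # values strictly above x
            n_equal = hi - lo                 # ties, including x itself
            if method == "min":
                out[i] = float(n_greater + 1)
            else:
                # Mean of the ranks n_greater+1 .. n_greater+n_equal.
                out[i] = n_greater + (n_equal + 1) / 2.0
    return out
```

Something like `rolling_rank_sketch(df["A"].values, rollWindow)` would give the same kind of output as the rolling_apply variants in the benchmark. Each step costs a binary search plus a list insert and delete, instead of a full re-rank of the window; a compiled version backed by a skip list or balanced tree could presumably reduce the per-step cost further.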

python: 2.7.3
pandas: 0.15.2
scipy: 0.10.1
bottleneck: 0.7.0

```python
import numpy as np
import pandas as pd
import scipy.stats as sc   # assumed import behind the `sc` alias
import bottleneck as bd    # assumed import behind the `bd` alias


# Each helper returns the descending rank of the last element in the window.
def rollingRankOnSeries(array):
    s = pd.Series(array)
    return s.rank(method='min', ascending=False)[len(s) - 1]

def rollingRankSciPy(array):
    return array.size + 1 - sc.rankdata(array)[-1]

def rollingRankBottleneck(array):
    return array.size + 1 - bd.rankdata(array)[-1]

def rollingRankArgSort(array):
    return array.size - array.argsort().argsort()[-1]


rollWindow = 240
df = pd.DataFrame(np.random.randn(100000, 4), columns=list('ABCD'),
                  index=pd.date_range('1/1/2000', periods=100000, freq='1H'))
# Set known values at the end of A (two tied 7.5s, then 5.5).
# Indexing rows and column in one .iloc call avoids chained assignment.
colA = df.columns.get_loc('A')
df.iloc[-3:-1, colA] = 7.5
df.iloc[-1, colA] = 5.5

df["SER_RK"] = pd.rolling_apply(df["A"], rollWindow, rollingRankOnSeries)
# 28.9 secs (allows competition/min ranking for ties)

df["SCIPY_RK"] = pd.rolling_apply(df["A"], rollWindow, rollingRankSciPy)
# 70.89 secs (allows competition/min ranking for ties)

df["BNECK_RK"] = pd.rolling_apply(df["A"], rollWindow, rollingRankBottleneck)
# 3.64 secs (only provides average ranking for ties)

df["ASRT_RK"] = pd.rolling_apply(df["A"], rollWindow, rollingRankArgSort)
# 3.56 secs (only provides competition/min ranking for ties, not necessarily correct result)

df["MEAN"] = pd.rolling_mean(df['A'], window=rollWindow)
# 0.008 secs
```

I think this is likely to be a common request from users who want to use pandas for analysis on large datasets, and it would be a useful addition to the pandas moving statistics/moments suite.


Labels: Performance (Memory or execution speed performance), Window (rolling, ewma, expanding)
