Skip to content

API? how to do a "rolling groupby" or groupby with overlapping groups? #26959

Open
@jorisvandenbossche

Description

@jorisvandenbossche

Sparked by @betatim's question on twitter: https://twitter.com/betatim/status/1141321049918906368

Suppose you have the following data:

repo_id = np.random.choice(np.arange(1000), 10000)
index = pd.Timestamp("2019-01-01") + pd.to_timedelta(np.random.randint(0, 24*30*6, size=10000), unit='H')
df = pd.DataFrame({'repo_id': repo_id}, index=index).sort_index()
In [111]: df.head() 
Out[111]: 
                     repo_id
2019-01-01 01:00:00      162
2019-01-01 01:00:00      850
2019-01-01 01:00:00      414
2019-01-01 02:00:00      753
2019-01-01 02:00:00      125

Data at somewhat randomly points in time (here hourly, but precision doesn't matter much).

Assume now you want to calculate a rolling 30D (monthly) statistic for each day.
For example the rolling number of unique values over a period of time (the elegant but somewhat verbose python code to do this: https://gist.github.com/betatim/c59039682d92fab89859358e8c585313)

A rolling operation does not exactly give what you want, because df.rolling() will calculate this 24H mean for each row, while there might be many rows for a single day, and you are only interested in 1 value for that day.
A groupby/resample operation can also not achieve this goal, as that could give monthly number of unique values, but it cannot shift this month period one day at a time, since the groups cannot be overlapping.

With rolling you can do something like df.rolling("30D").nunique().resample('D').last(). But this is 1) very slow, as the rolling in the first step is doing way too many calculations that you afterwards throw away with resample().last(), and 2) I am not sure the binning for timestamps within the same day happens fully as desired.

I found this an interesting use case, to think about if there are ways to better handle this in pandas, or how such an API could look like.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions