Description
Sparked by @betatim's question on twitter: https://twitter.com/betatim/status/1141321049918906368
Suppose you have the following data:
import numpy as np
import pandas as pd

repo_id = np.random.choice(np.arange(1000), 10000)
index = pd.Timestamp("2019-01-01") + pd.to_timedelta(np.random.randint(0, 24*30*6, size=10000), unit='H')
df = pd.DataFrame({'repo_id': repo_id}, index=index).sort_index()
In [111]: df.head()
Out[111]:
repo_id
2019-01-01 01:00:00 162
2019-01-01 01:00:00 850
2019-01-01 01:00:00 414
2019-01-01 02:00:00 753
2019-01-01 02:00:00 125
Data at somewhat random points in time (here hourly, but the exact precision doesn't matter much).
Assume you now want to calculate a rolling 30D (monthly) statistic for each day.
For example, the rolling number of unique values over that period (elegant but somewhat verbose Python code to do this: https://gist.github.com/betatim/c59039682d92fab89859358e8c585313).
A rolling operation does not give exactly what you want: df.rolling() will calculate this 30D statistic for each row, while there might be many rows within a single day, and you are only interested in one value per day.
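For illustration, a minimal sketch of that per-row behaviour (using count as the statistic, since unlike nunique it is a built-in rolling method):

# rolling evaluates the window at every row, so the result has
# one value per row (10000 here), not one per day
per_row = df['repo_id'].rolling('30D').count()
per_row.shape  # (10000,)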
A groupby/resample operation cannot achieve this either: it could give the monthly number of unique values, but it cannot shift this month-long period forward one day at a time, since the groups cannot overlap.
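For example, a minimal sketch of the non-overlapping variant (Resampler does support nunique, since resampling dispatches to groupby):

# one nunique per fixed, consecutive 30-day bin; the bins
# cannot slide forward one day at a time
df.resample('30D')['repo_id'].nunique()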
With rolling you can do something like df.rolling("30D").apply(lambda x: len(np.unique(x)), raw=True).resample('D').last() (rolling has no nunique method, so apply is needed). But this is 1) very slow, as the rolling step does way too many calculations that are afterwards thrown away by resample().last(), and 2) I am not sure the binning of timestamps within the same day happens exactly as desired.
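For reference, a minimal sketch of the kind of explicit per-day loop the linked gist performs (the variable names here are mine, not from the gist):

window = pd.Timedelta('30D')
days = pd.date_range(df.index.min().normalize(),
                     df.index.max().normalize(), freq='D')

# one value per day: nunique over the trailing 30-day window
# ending at the end of that day; consecutive windows overlap
counts = pd.Series(
    [
        df.loc[(df.index >= day + pd.Timedelta('1D') - window)
               & (df.index < day + pd.Timedelta('1D')), 'repo_id'].nunique()
        for day in days
    ],
    index=days,
)

This gives exactly one value per day over overlapping trailing windows, but it scans the data once per day, which is what dedicated support could avoid.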
I find this an interesting use case to think about: are there ways to handle this better in pandas, and what could such an API look like?