Description
Sparked by @betatim's question on twitter: https://twitter.com/betatim/status/1141321049918906368
Suppose you have the following data:
import numpy as np
import pandas as pd

repo_id = np.random.choice(np.arange(1000), 10000)
index = pd.Timestamp("2019-01-01") + pd.to_timedelta(np.random.randint(0, 24*30*6, size=10000), unit='H')
df = pd.DataFrame({'repo_id': repo_id}, index=index).sort_index()
In [111]: df.head()
Out[111]:
repo_id
2019-01-01 01:00:00 162
2019-01-01 01:00:00 850
2019-01-01 01:00:00 414
2019-01-01 02:00:00 753
2019-01-01 02:00:00 125
Data at somewhat random points in time (here hourly, but the exact precision doesn't matter much).
Assume you now want to calculate a rolling 30D (monthly) statistic for each day.
For example, the rolling number of unique values over that period (elegant but somewhat verbose Python code to do this: https://gist.github.com/betatim/c59039682d92fab89859358e8c585313).
A rolling operation does not give exactly what you want: df.rolling() will calculate this 30D statistic for each row, while there might be many rows within a single day, and you are only interested in one value per day.
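For illustration, a minimal sketch of that per-row behaviour (using count as the statistic, since unlike nunique it is a built-in rolling method):

# rolling evaluates the window at every row, so the result has
# one value per row (10000 here), not one per day
per_row = df['repo_id'].rolling('30D').count()
per_row.shape  # (10000,)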
A groupby/resample operation cannot achieve this either: it could give the monthly number of unique values, but it cannot shift this month-long period forward one day at a time, since the groups cannot overlap.
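For example, a minimal sketch of the non-overlapping variant (Resampler does support nunique, since resampling dispatches to groupby):

# one nunique per fixed, consecutive 30-day bin; the bins
# cannot slide forward one day at a time
df.resample('30D')['repo_id'].nunique()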
With rolling you can do something like df.rolling("30D").apply(lambda x: len(np.unique(x)), raw=True).resample('D').last() (rolling has no nunique method, so apply is needed). But this is 1) very slow, as the rolling step does way too many calculations that are afterwards thrown away by resample().last(), and 2) I am not sure the binning of timestamps within the same day happens exactly as desired.
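For reference, a minimal sketch of the kind of explicit per-day loop the linked gist performs (the variable names here are mine, not from the gist):

window = pd.Timedelta('30D')
days = pd.date_range(df.index.min().normalize(),
                     df.index.max().normalize(), freq='D')

# one value per day: nunique over the trailing 30-day window
# ending at the end of that day; consecutive windows overlap
counts = pd.Series(
    [
        df.loc[(df.index >= day + pd.Timedelta('1D') - window)
               & (df.index < day + pd.Timedelta('1D')), 'repo_id'].nunique()
        for day in days
    ],
    index=days,
)

This gives exactly one value per day over overlapping trailing windows, but it scans the data once per day, which is what dedicated support could avoid.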
I find this an interesting use case to think about: are there ways to handle this better in pandas, and what could such an API look like?