Description
Hey pandas team. Sorry to have gone MIA the past week, super busy with work. I promise (and look forward to) contributing more soon. :)
Still, I wanted to note that I came across what I believe to be a bug in resample() when trying to change the interval of the binning with closed='left'. I know that there have been a few changes to the resample() API since Wes' book; however, I don't believe they changed this functionality, but I have been wrong before :)
Bug can be reproduced using the example from Wes' book, generating 12 mins of data like:
In [3]: rng = pd.date_range('1/1/2000', periods=12, freq='T')
In [4]: ts = pd.Series(np.arange(12), index=rng)
In [5]: ts
Out[5]:
2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 00:07:00 7
2000-01-01 00:08:00 8
2000-01-01 00:09:00 9
2000-01-01 00:10:00 10
2000-01-01 00:11:00 11
Freq: T, dtype: int64
we can do a simple resample to 5 mins like:
In [6]: ts.resample('5min', how='sum')
Out[6]:
2000-01-01 00:00:00 10
2000-01-01 00:05:00 35
2000-01-01 00:10:00 21
Freq: 5T, dtype: int64
For my use, I need this resampling to be 'backwards looking' so that the summation at each resampled timestamp includes the previous 4 minutes. The documentation (and Wes' book) suggest this is achieved by binning with closed='left'; however, this results in the same output as above:
In [7]: ts.resample('5min', how='sum', closed='left')
Out[7]:
2000-01-01 00:00:00 10
2000-01-01 00:05:00 35
2000-01-01 00:10:00 21
Freq: 5T, dtype: int64
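One possible explanation for the identical output, offered as a sketch and not a definitive diagnosis: as far as the resample() documentation describes it, closed already defaults to 'left' for minute-based frequencies, so passing closed='left' explicitly would not be expected to change anything. The snippet below assumes a recent pandas release, where the aggregation is a method call (.sum()) in place of the how= keyword and the minute alias is spelled 'min'.

```python
import numpy as np
import pandas as pd

# Same 12 minutes of data as in the report above.
rng = pd.date_range("1/1/2000", periods=12, freq="min")
ts = pd.Series(np.arange(12), index=rng)

# Resample once with the defaults and once with closed='left' spelled out.
default_sum = ts.resample("5min").sum()
left_sum = ts.resample("5min", closed="left").sum()

# If closed='left' is already the default here, the two results are identical.
same = default_sum.equals(left_sum)
print(same)
```

If the comparison holds, closed='left' is a no-op for this frequency, and it is the bin label, not the closed side, that would need to change to get timestamps at the end of each interval.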
I was looking for the following result (note that the first timestamp is at 00:05:00 and the hanging data is dropped):
2000-01-01 00:05:00 10
2000-01-01 00:10:00 35
Freq: 5T, dtype: int64
I am able to generate this by combining closed='left' with loffset='5min' and then slicing the resultant Series to remove the final, partial bin:
In [10]: ts.resample('5min', how='sum', closed='left', loffset='5min')[:-1]
Out[10]:
2000-01-01 00:05:00 10
2000-01-01 00:10:00 35
Freq: 5T, dtype: int64
but this is hardly ideal, as it's not known in advance whether the time series ends with a timestamp that resolves equally to the final timestamp of the resampling procedure!
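A minimal sketch of one way to avoid the hard-coded [:-1] slice, under two assumptions that are mine and not from the docs or the book: that label='right' is an acceptable substitute for loffset='5min' (it stamps each bin with its right edge), and that "hanging data" means any bin holding fewer than the expected 5 observations. The name expected_per_bin is purely illustrative. As above, this assumes a recent pandas release with method-style aggregation.

```python
import numpy as np
import pandas as pd

# Same 12 minutes of data as in the report above.
rng = pd.date_range("1/1/2000", periods=12, freq="min")
ts = pd.Series(np.arange(12), index=rng)

# Bins are [t, t + 5min), labelled with their right edge.
binned = ts.resample("5min", closed="left", label="right")
sums = binned.sum()
counts = binned.count()

# Hypothetical completeness rule: keep only bins with all 5 one-minute samples.
expected_per_bin = 5
full = sums[counts == expected_per_bin]
print(full)
```

Filtering on the per-bin count means the final bin is dropped only when it is actually incomplete, so the result does not depend on where the series happens to end.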
Apologies if I am missing something—any thoughts, help or guidance is welcomed!
Thanks so much.