Description
Feature Type
-
Adding new functionality to pandas
-
Changing existing functionality in pandas
-
Removing existing functionality in pandas
Problem Description
I'm trying to group a dataframe for a given interval of timestamp, non-overlapping, if their values of another column are above a certain threshold.
Something that could help would be:
- Adding an option for non-overlapping windows to
rolling
- Adding an option to filter windows to
rolling
If there is any other way to achieve this using pandas (instead of loops), let me know.
I still think it could be a worthwhile addition.
Interest in non-overlapping windows (although those are simpler cases that could easily be achieved by groupby since there are not thresholds involved):
- https://stackoverflow.com/questions/57595661/non-overlapping-rolling-windows-in-pandas-dataframes
- https://stackoverflow.com/questions/59212955/how-to-rolling-non-overlapping-window-in-pandas
- https://stackoverflow.com/questions/65469362/non-overlapping-rolling-windows-in-pandas-groupby
- https://stackoverflow.com/questions/61772142/get-sliding-window-mean-with-no-overlap
Feature Description
Filter options in rolling windows
Currently, rolling
has a min_periods
option which is especially useful with windows defined by a timedelta
, allowing to discard certain time windows when there aren't enough observations. I see this as a kind of threshold to take the window into account. In the same way it would be nice if other kinds of thresholds could be added based on other data of the window besides length (number of observations).
For example, a dataframe which has 2 columns, one with timestamps (index) and another with number related to a measurement (B), it could be interesting to use something like:
df.rolling("100ns", filter="len(index) >= 5 & B.sum() > 16")
Or a function:
df.rolling("100ns", filter=lambda x: len(x) >= 5 and x.B.sum() > 16 )
Where the len(index)
part would work in a similar way to min_periods
.
Currently, this could be achieved by iterating over the windows (btw, min_periods
does nothing regarding which windows are created).
I have added an example:
df_time = pd.DataFrame({'B': [0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11]},
index = [pd.Timestamp('20130101 09:00:00'),
pd.Timestamp('20130101 09:00:03'),
pd.Timestamp('20130101 09:00:04'),
pd.Timestamp('20130101 09:00:05'),
pd.Timestamp('20130101 09:00:06'),
pd.Timestamp('20130101 09:00:07'),
pd.Timestamp('20130101 09:00:15'),
pd.Timestamp('20130101 09:00:16'),
pd.Timestamp('20130101 09:00:17'),
pd.Timestamp('20130101 09:00:18'),
pd.Timestamp('20130101 09:00:19')])
df_time.rolling('5s', min_periods=5).sum()
Results in:
B
2013-01-01 09:00:00 NaN
2013-01-01 09:00:03 NaN
2013-01-01 09:00:04 NaN
2013-01-01 09:00:05 NaN
2013-01-01 09:00:06 NaN
2013-01-01 09:00:07 16.0
2013-01-01 09:00:15 NaN
2013-01-01 09:00:16 NaN
2013-01-01 09:00:17 NaN
2013-01-01 09:00:18 NaN
2013-01-01 09:00:19 45.0
The proposed behavior would result in:
B
2013-01-01 09:00:00 NaN
2013-01-01 09:00:03 NaN
2013-01-01 09:00:04 NaN
2013-01-01 09:00:05 NaN
2013-01-01 09:00:06 NaN
2013-01-01 09:00:07 NaN
2013-01-01 09:00:15 NaN
2013-01-01 09:00:16 NaN
2013-01-01 09:00:17 NaN
2013-01-01 09:00:18 NaN
2013-01-01 09:00:19 45.0
One difference from the min_periods
is that the windows are still available when using the rolling as an iterator:
sum(1 for i in df_time.rolling('5s', min_periods=5) if len(i) < 5) # = 9 instead of 0
The proposed behavior would only allow the iteration over the windows that respect the filter:
for i in df.rolling("100ns", filter=lambda x: len(x) >= 5 and x.B.sum() > 16 ):
print(i)
Would only show one window:
B
2013-01-01 09:00:15 7
2013-01-01 09:00:16 8
2013-01-01 09:00:17 9
2013-01-01 09:00:18 10
2013-01-01 09:00:19 11
Which, currently, can be obtained using:
for i in df_time.rolling('5s', min_periods=5):
if len(i) >= 5 and i.B.sum() > 16:
print(i)
So, why have it as an argument inside rolling
instead of extending the filter
method to rolling
objects?
Non-overlapping windows
One thing that currently isn't supported is non-overlapping windows.
This means that when a window is created the following window can only be created starting on the entry next to the last entry of the previous window.
When there aren't thresholds, it can be easily achieved by dividing the dataframe according to timestamp intervals, but when threshold are involved, the beginning of the window isn't know beforehand.
Alternative Solutions
A very slow for loop.
Additional Context
I don't know if this should be one or two issues, one for the filtering and one for the non-overlapping windows, but I think they are closely related.
If this isn't something that makes a worthwhile addition or if there is another way of doing this, please let me know.