Skip to content

ENH: Filter options for rolling windows and non-overlapping windows #53052

Open
@BGameiro2000

Description

@BGameiro2000

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I'm trying to group a dataframe for a given interval of timestamp, non-overlapping, if their values of another column are above a certain threshold.

Something that could help would be:

  • Adding an option for non-overlapping windows to rolling
  • Adding an option to filter windows to rolling

If there is any other way to achieve this using pandas (instead of loops), let me know.
I still think it could be a worthwhile addition.

Related: https://stackoverflow.com/questions/76063050/dataframe-groupby-timestamp-interval-non-overlapping-and-sum-of-column-values

Interest in non-overlapping windows (although those are simpler cases that could easily be achieved by groupby since there are not thresholds involved):

Feature Description

Filter options in rolling windows

Currently, rolling has a min_periods option which is especially useful with windows defined by a timedelta, allowing to discard certain time windows when there aren't enough observations. I see this as a kind of threshold to take the window into account. In the same way it would be nice if other kinds of thresholds could be added based on other data of the window besides length (number of observations).

For example, a dataframe which has 2 columns, one with timestamps (index) and another with number related to a measurement (B), it could be interesting to use something like:

df.rolling("100ns", filter="len(index) >= 5 & B.sum() > 16")

Or a function:

df.rolling("100ns", filter=lambda x: len(x) >= 5 and x.B.sum() > 16 )

Where the len(index) part would work in a similar way to min_periods.

Currently, this could be achieved by iterating over the windows (btw, min_periods does nothing regarding which windows are created).

I have added an example:

df_time = pd.DataFrame({'B': [0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11]},
                       index = [pd.Timestamp('20130101 09:00:00'),
                                pd.Timestamp('20130101 09:00:03'),
                                pd.Timestamp('20130101 09:00:04'),
                                pd.Timestamp('20130101 09:00:05'),
                                pd.Timestamp('20130101 09:00:06'),
                                pd.Timestamp('20130101 09:00:07'),
                                pd.Timestamp('20130101 09:00:15'),
                                pd.Timestamp('20130101 09:00:16'),
                                pd.Timestamp('20130101 09:00:17'),
                                pd.Timestamp('20130101 09:00:18'),
                                pd.Timestamp('20130101 09:00:19')])

df_time.rolling('5s', min_periods=5).sum()

Results in:

	B
2013-01-01 09:00:00	NaN
2013-01-01 09:00:03	NaN
2013-01-01 09:00:04	NaN
2013-01-01 09:00:05	NaN
2013-01-01 09:00:06	NaN
2013-01-01 09:00:07	16.0
2013-01-01 09:00:15	NaN
2013-01-01 09:00:16	NaN
2013-01-01 09:00:17	NaN
2013-01-01 09:00:18	NaN
2013-01-01 09:00:19	45.0

The proposed behavior would result in:

	B
2013-01-01 09:00:00	NaN
2013-01-01 09:00:03	NaN
2013-01-01 09:00:04	NaN
2013-01-01 09:00:05	NaN
2013-01-01 09:00:06	NaN
2013-01-01 09:00:07	NaN
2013-01-01 09:00:15	NaN
2013-01-01 09:00:16	NaN
2013-01-01 09:00:17	NaN
2013-01-01 09:00:18	NaN
2013-01-01 09:00:19	45.0

One difference from the min_periods is that the windows are still available when using the rolling as an iterator:

sum(1 for i in df_time.rolling('5s', min_periods=5) if len(i) < 5) # = 9 instead of 0

The proposed behavior would only allow the iteration over the windows that respect the filter:

for i in df.rolling("100ns", filter=lambda x: len(x) >= 5 and x.B.sum() > 16 ):
    print(i)

Would only show one window:

                      B
2013-01-01 09:00:15   7
2013-01-01 09:00:16   8
2013-01-01 09:00:17   9
2013-01-01 09:00:18  10
2013-01-01 09:00:19  11

Which, currently, can be obtained using:

for i in df_time.rolling('5s', min_periods=5):
    if len(i) >= 5 and i.B.sum() > 16:
        print(i)

So, why have it as an argument inside rolling instead of extending the filter method to rolling objects?

Non-overlapping windows

One thing that currently isn't supported is non-overlapping windows.
This means that when a window is created the following window can only be created starting on the entry next to the last entry of the previous window.
When there aren't thresholds, it can be easily achieved by dividing the dataframe according to timestamp intervals, but when threshold are involved, the beginning of the window isn't know beforehand.

Alternative Solutions

A very slow for loop.

Additional Context

I don't know if this should be one or two issues, one for the filtering and one for the non-overlapping windows, but I think they are closely related.

If this isn't something that makes a worthwhile addition or if there is another way of doing this, please let me know.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions