Description
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
import numpy as np
import pandas as pd

readings = pd.DataFrame(
    [
        ('A', 'Saturday', 101),
        ('A', 'Sunday', 88),
        ('A', 'Saturday', 103),
        ('A', 'Sunday', 82),
        ('A', 'Saturday', 100),
        ('B', 'Saturday', 27),
        ('B', 'Sunday', 13),
        ('B', 'Saturday', 21),
        ('B', 'Sunday', 17),
        ('B', 'Saturday', 25)
    ],
    columns=['building', 'day', 'reading']
)


class ShiftedWindow(pd.api.indexers.BaseIndexer):
    """Window over the previous window_size rows, excluding the current row."""

    def __init__(self, window_size):
        self.window_size = window_size

    def get_window_bounds(self, num_values=0, min_periods=None, center=None, closed=None):
        # Row i covers the half-open range [i - window_size, i).
        starts = np.arange(-self.window_size, num_values - self.window_size)
        ends = starts + self.window_size
        # Clip the negative starts of the first rows to 0.
        starts[:self.window_size] = 0
        return starts, ends


readings.groupby('building')['reading'].rolling(window=ShiftedWindow(2), min_periods=1).mean()
Problem description
I've defined a custom window that covers only the previous window_size values and therefore excludes the current value. It's very useful for, say, target encoding on time series.
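To make the intended window concrete, here is a minimal NumPy-only sketch that repeats the arithmetic from get_window_bounds for one group of five rows and averages each half-open window [start, end) by hand. The variable names (window_size, num_values, values, means) are illustrative and not part of pandas; the sketch does not call pandas at all, so it shows only what the bounds should produce, not what rolling actually does.

```python
import numpy as np

window_size = 2   # same size as ShiftedWindow(2) in the report
num_values = 5    # number of rows in one building's group

# Same arithmetic as get_window_bounds in the report.
starts = np.arange(-window_size, num_values - window_size)
ends = starts + window_size
starts[:window_size] = 0  # clip the negative starts of the first rows

values = np.array([101, 88, 103, 82, 100], dtype=float)  # building A readings

# Mean of each half-open window [start, end); an empty window gives NaN.
means = [
    float(values[s:e].mean()) if e > s else float("nan")
    for s, e in zip(starts, ends)
]

print(starts.tolist())  # [0, 0, 0, 1, 2]
print(ends.tolist())    # [0, 1, 2, 3, 4]
print(means)            # [nan, 101.0, 94.5, 95.5, 92.5]
```

The first row has an empty window, which is why the expected result starts with NaN for each building.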
Expected Output
I expected the following output:
>>> readings.groupby('building')['reading'].apply(lambda x: x.shift(1).rolling(2, min_periods=1).mean())
0 NaN
1 101.0
2 94.5
3 95.5
4 92.5
5 NaN
6 27.0
7 20.0
8 17.0
9 19.0
Name: reading, dtype: float64
Instead, I'm getting:
>>> readings.groupby('building')['reading'].rolling(window=ShiftedWindow(2), min_periods=1).mean()
building
A 0 101.0
1 88.0
2 103.0
3 82.0
4 100.0
B 5 27.0
6 13.0
7 21.0
8 17.0
9 25.0
Name: reading, dtype: float64
I've checked and my custom window works as expected without using groupby. I've checked to see if get_window_bounds gets called when a groupby is used, and the answer is no. Basically, it seems that my custom window is being ignored entirely.
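Until the custom indexer is honoured under groupby, one possible workaround is to shift within each group and then apply an ordinary fixed-size rolling mean. The sketch below is a variant of the expression used for the expected output above, written with transform instead of apply so the result keeps the original row index; previous_mean is an illustrative name, and the day column is omitted because it plays no part in the calculation.

```python
import pandas as pd

readings = pd.DataFrame(
    {
        "building": ["A"] * 5 + ["B"] * 5,
        "reading": [101, 88, 103, 82, 100, 27, 13, 21, 17, 25],
    }
)

# Shift within each group so the current row is excluded, then take a
# fixed-size rolling mean over the remaining (previous) rows.
previous_mean = readings.groupby("building")["reading"].transform(
    lambda s: s.shift(1).rolling(2, min_periods=1).mean()
)

print(previous_mean.tolist())
# [nan, 101.0, 94.5, 95.5, 92.5, nan, 27.0, 20.0, 17.0, 19.0]
```

This sidesteps the custom window rather than fixing it, so it only helps when the window can be expressed as a shift followed by a standard rolling window.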
Output of pd.show_versions()
INSTALLED VERSIONS
commit : d9fff27
python : 3.7.4.final.0
python-bits : 64
OS : Darwin
OS-release : 19.5.0
Version : Darwin Kernel Version 19.5.0: Tue May 26 20:41:44 PDT 2020; root:xnu-6153.121.2~2/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.0
numpy : 1.18.2
pytz : 2019.3
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.4.0
Cython : 0.29.13
pytest : 5.2.1
hypothesis : None
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : 1.2.1
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.8.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : 1.2.1
fsspec : 0.5.2
fastparquet : None
gcsfs : None
matplotlib : 3.1.2
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.9
tables : 3.5.2
tabulate : 0.8.7
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.45.1