Skip to content

BUG: groupby sort issue when using rolling #50296

Open
@domjanbaric

Description

@domjanbaric

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Simpler reproducible example

index = pd.MultiIndex.from_product(
    [[580, 546], [520, 655, 529], pd.date_range(dt.date(2020, 11, 1), dt.date(2020, 11, 3))],
    names=["machine_id",'product_id', "sample_date"],
)
df = pd.DataFrame(
    data={'sales_qty': np.random.random(index.shape[0])},
    index=index,
).reset_index()
df = df.sort_values(by=['machine_id', 'product_id', 'sample_date'])
df = df[df['machine_id'].ne(546) | df['product_id'].ne(529)]
result = df.groupby(['machine_id','product_id'], sort=False)['sales_qty'].rolling(2, center=True).sum()
print(result)

Reproducible Example

import pandas as pd

df = pd.MultiIndex.from_product(
    [[580,546], [519,520,527,526,528,655,526,529,532],pd.date_range(dt.date(2020, 11, 1), dt.date(2022, 10, 31))],
    names=["machine_id",'product_id', "sample_date"],
)
df = pd.DataFrame(index=df).reset_index()
df["sample_date"] = pd.to_datetime(df["sample_date"]).dt.date
df['sales_qty']=np.random.random(df.shape[0])

df=df.sort_values(by=['machine_id','product_id','sample_date'])
df=df.reset_index(drop=True)
df=df[~((df['machine_id']==546) & (df['product_id'].isin([526,529,532])))]
df = df.groupby(['machine_id','product_id'],sort=False)['sales_qty'].rolling(7, center=True).sum().reset_index()[['machine_id','product_id']].drop_duplicates()
df

Issue Description

When using pandas groupby with rolling, flag sort in groupby, sort order of group keys is changed. This occurs when we have more than 1 group key and values in key 2 are not equal for all values of key 1.

In example above we create dataframe with 4 columns, machine_id, product_id, sample_date and sales_qty. Dataframe consist of 2 different machine and 9 different products. Dataframe is sorted by: machine_id, product_id, sample_date. We want to calculate rolling mean for each machine_id-product_id group. However, if we remove some products for machine with lower machine_id (machine_id which is prior in sort, in this example 546) and then calculate rolling mean with flag sort= False in groupby order of group keys in result is actually changed. product_ids which were removed from machine with machine_id = 546 (product_id = 526,529,532) are at the end of dataframe.

    machine_id 	product_id
0 	546 	519
730 	546 	520
1460 	546 	527
2190 	546 	528
2920 	546 	655
3650 	580 	519
4380 	580 	520
5110 	580 	527
5840 	580 	528
6570 	580 	655
7300 	580 	526    <---
8760 	580 	529    <---
9490 	580 	532    <---

Expected Behavior

Because flag sort = False in groupby expected behaviour is that result will have same sort as the original dataframe, i.e. expected result is:

    machine_id 	product_id
0 	546 	519
730 	546 	520
1460 	546 	527
2190 	546 	528
2920 	546 	655
3650 	580 	519
4380 	580 	520
5110 	580 	526    <---
6570 	580 	527
7300 	580 	528
8030 	580 	529    <---
8760 	580 	532    <---
9490 	580 	655

Installed Versions

INSTALLED VERSIONS

commit : 91111fd
python : 3.10.8.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-56-generic
Version : #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.5.1
numpy : 1.23.4
pytz : 2022.6
dateutil : 2.8.2
setuptools : 63.2.0
pip : 22.2.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.5
jinja2 : 3.1.2
IPython : 8.6.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.5.3
numba : 0.56.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 10.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.3
snappy : None
sqlalchemy : 1.4.42
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

Metadata

Metadata

Assignees

Labels

BugGroupbySortinge.g. sort_index, sort_valuesWindowrolling, ewma, expanding

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions