Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
import numpy as np
F = pd.DataFrame({ "a" : np.r_[0:10], "b" : np.r_[0:10]//5 })
# this works as expected: the outer group indices are present
F.groupby("b").apply(lambda x : x.iloc[:2])
# this does not work as expected: the outer group indices are gone
F.groupby("b").apply(lambda x : x.iloc[:])
Issue Description
groupby.apply() omits the outer group index in the dataframe it returns if the row indices of the returned dataframe are identical to the indices of the input dataframe. For example, given the following dataframe:
F = pd.DataFrame({ "a" : np.r_[0:10], "b" : np.r_[0:10]//5 })
   a  b
0  0  0
1  1  0
2  2  0
3  3  0
4  4  0
5  5  1
6  6  1
7  7  1
8  8  1
9  9  1
if I group by column b and take only the first two rows from each group,
In [1]: F.groupby("b").apply(lambda x : x.iloc[:2])
Out[1]:
     a  b
b
0 0  0  0
  1  1  0
1 5  5  1
  6  6  1
the group index b becomes the outer index of the dataframe, as expected. However, if I instead take all the rows from each group
In [2]: F.groupby("b").apply(lambda x : x.iloc[:])
Out[2]:
   a  b
0  0  0
1  1  0
2  2  0
3  3  0
4  4  0
5  5  1
6  6  1
7  7  1
8  8  1
9  9  1
the outer group index disappears. It seems that any time the row indices of the dataframe returned by groupby.apply() are identical to those of the input dataframe, the outer group index disappears:
In [3]: F.groupby("b").apply(lambda x : x)
Out[3]:
   a  b
0  0  0
1  1  0
2  2  0
3  3  0
4  4  0
5  5  1
6  6  1
7  7  1
8  8  1
9  9  1
In [4]: F.groupby("b").apply(lambda x : x + 1)
Out[4]:
    a  b
0   1  1
1   2  1
2   3  1
3   4  1
4   5  1
5   6  2
6   7  2
7   8  2
8   9  2
9  10  2
If the row indices of the returned dataframe differ from those of the input dataframe, the outer group index is present:
In [5]: F.groupby("b").apply(lambda x : x.iloc[np.r_[0, 4, 1, 2, 3]])
Out[5]:
     a  b
b
0 0  0  0
  4  4  0
  1  1  0
  2  2  0
  3  3  0
1 5  5  1
  9  9  1
  6  6  1
  7  7  1
  8  8  1
In [6]: F.groupby("b").apply(lambda x : x.iloc[::-1])
Out[6]:
     a  b
b
0 4  4  0
  3  3  0
  2  2  0
  1  1  0
  0  0  0
1 9  9  1
  8  8  1
  7  7  1
  6  6  1
  5  5  1
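To compare the cases above without reading each output by eye, the number of index levels of each result can be printed side by side. This is only a diagnostic sketch: the labels in `cases` are names I made up for this illustration, and the printed values depend on the installed pandas version, so no expected output is given.

```python
import warnings

import numpy as np
import pandas as pd

F = pd.DataFrame({"a": np.r_[0:10], "b": np.r_[0:10] // 5})

# Hypothetical labels for the functions used in the examples above.
cases = {
    "first two rows": lambda x: x.iloc[:2],
    "all rows": lambda x: x.iloc[:],
    "identity": lambda x: x,
    "reversed": lambda x: x.iloc[::-1],
}

index_levels = {}
for label, func in cases.items():
    with warnings.catch_warnings():
        # Some pandas releases emit deprecation warnings for apply(); ignored here.
        warnings.simplefilter("ignore")
        result = F.groupby("b").apply(func)
    # 2 levels means the group key "b" was prepended; 1 level means it was dropped.
    index_levels[label] = result.index.nlevels
    print(f"{label}: index levels = {result.index.nlevels}")
```

On the version reported below, I would expect the "all rows" and "identity" cases to report one level and the other two to report two, matching the outputs shown above.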
Expected Behavior
I expect the outer group indices to always be present:
In [1]: F = pd.DataFrame({ "a" : np.r_[0:10], "b" : np.r_[0:10]//5 })
In [2]: F.groupby("b").apply(lambda x : x.iloc[:])
Out[2]:
     a  b
b
0 0  0  0
  1  1  0
  2  2  0
  3  3  0
  4  4  0
1 5  5  1
  6  6  1
  7  7  1
  8  8  1
  9  9  1
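For reference, a result with this shape can be built without going through groupby.apply() at all, by concatenating the groups manually. This is a possible workaround sketch, not a fix for the behaviour reported here. It assumes that iterating over the groupby yields (key, sub-frame) pairs and relies on pd.concat using the dict keys as the outer index level.

```python
import numpy as np
import pandas as pd

F = pd.DataFrame({"a": np.r_[0:10], "b": np.r_[0:10] // 5})

# Apply the per-group function by hand; the dict keys are the group labels.
pieces = {key: group.iloc[:] for key, group in F.groupby("b")}

# pd.concat turns the dict keys into the outer index level, named "b" here.
expected = pd.concat(pieces, names=["b"])

print(expected.index.nlevels)
print(expected)
```

Because the outer level comes from the dict keys rather than from groupby.apply(), it is present whether or not the per-group function changes the row index.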
Installed Versions
INSTALLED VERSIONS
------------------
commit : 06d230151e6f18fdb8139d09abf539867a8cd481
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 4.4.0-210-generic
Version : #242-Ubuntu SMP Fri Apr 16 09:57:56 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.4.1
numpy : 1.20.3
pytz : 2020.4
dateutil : 2.8.1
pip : 20.3.3
setuptools : 51.0.0.post20201207
Cython : 0.29.22
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 8.0.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : 0.9.0
gcsfs : 0.8.0
matplotlib : 3.4.2
numba : 0.53.1
numexpr : 2.7.2
odfpy : None
openpyxl : 3.0.6
pandas_gbq : None
pyarrow : 3.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.5.4
sqlalchemy : None
tables : 3.6.1
tabulate : 0.8.7
xarray : None
xlrd : 2.0.1
xlwt : None
zstandard : None