Code Sample
import pandas as pd

# Two dataframes that are identical, except that column D is categorical in df2
df1 = pd.DataFrame({
    'A': [1, 1, 1, 2, 2],
    'B': [100, 100, 200, 100, 100],
    'C': ['apple', 'orange', 'mango', 'mango', 'orange'],
    'D': ['jupiter', 'mercury', 'mars', 'venus', 'venus'],
})
df2 = df1.astype({'D': 'category'})
# These groupby-aggregations all give results as expected
expected_result_1 = df1.groupby(by='A').first()
expected_result_2 = df2.groupby(by='A').first()
expected_result_3 = df1.groupby(by=['A', 'B']).first()
expected_result_4 = df1.groupby(by=['A', 'B']).head(1)
expected_result_5 = df2.groupby(by=['A', 'B']).head(1)
# This groupby-aggregation gives an unexpected result
unexpected_result = df2.groupby(by=['A', 'B']).first()
The resulting dataframes look as follows:
In [1]: expected_result_1
Out[1]:
     B      C        D
A
1  100  apple  jupiter
2  100  mango    venus
In [2]: expected_result_2
Out[2]:
     B      C        D
A
1  100  apple  jupiter
2  100  mango    venus
In [3]: expected_result_3
Out[3]:
           C        D
A B
1 100  apple  jupiter
  200  mango     mars
2 100  mango    venus
In [4]: expected_result_4
Out[4]:
   A    B      C        D
0  1  100  apple  jupiter
2  1  200  mango     mars
3  2  100  mango    venus
In [5]: expected_result_5
Out[5]:
   A    B      C        D
0  1  100  apple  jupiter
2  1  200  mango     mars
3  2  100  mango    venus
In [6]: unexpected_result
Out[6]:
           C
A B
1 100  apple
  200  mango
2 100  mango
Problem description
A groupby aggregation that uses first() as the aggregation function and passes multiple columns as by seems to discard categorical columns (see the unexpected_result dataframe in the code sample above). Besides being unexpected, this behaviour is inconsistent with the other, similar groupby operations above, which all keep column D.
See also this stackoverflow page.
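Until the behaviour is fixed, one possible workaround is to aggregate on plain object columns and restore the categorical dtype afterwards. The sketch below is only an illustration: the helper name `first_keeping_categoricals` is hypothetical (it is not part of pandas), and it assumes `by` is a list of column names.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def first_keeping_categoricals(df, by):
    """Hypothetical helper: groupby(by).first() that keeps categorical columns."""
    # Remember the dtype of every categorical column that is not a grouping key.
    cat_dtypes = {
        col: dtype
        for col, dtype in df.dtypes.items()
        if isinstance(dtype, CategoricalDtype) and col not in by
    }
    # Aggregate on plain object columns, then restore the categorical dtypes.
    as_object = df.astype({col: object for col in cat_dtypes})
    return as_object.groupby(by=by).first().astype(cat_dtypes)


df1 = pd.DataFrame({
    'A': [1, 1, 1, 2, 2],
    'B': [100, 100, 200, 100, 100],
    'C': ['apple', 'orange', 'mango', 'mango', 'orange'],
    'D': ['jupiter', 'mercury', 'mars', 'venus', 'venus'],
})
df2 = df1.astype({'D': 'category'})

result = first_keeping_categoricals(df2, by=['A', 'B'])
print(result)
```

The round trip through object dtype costs an extra copy of each categorical column, so this is meant as a stopgap for small frames rather than a general replacement for first().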
Expected Output
The result of df2.groupby(by=['A', 'B']).first() would be expected as:
           C        D
A B
1 100  apple  jupiter
  200  mango     mars
2 100  mango    venus
The older pandas version 0.21.0 reportedly does produce this expected output.
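Since the behaviour appears to depend on the pandas version, a small check like the following could help confirm whether a given installation is affected. It is a sketch only: `columns_dropped_by_first` is a hypothetical helper written for this report, and it assumes `by` is a list of column names.

```python
import pandas as pd


def columns_dropped_by_first(df, by):
    """Hypothetical helper: non-key columns missing from groupby(by).first()."""
    result = df.groupby(by=by).first()
    return [
        col for col in df.columns
        if col not in by and col not in result.columns
    ]


df1 = pd.DataFrame({
    'A': [1, 1, 1, 2, 2],
    'B': [100, 100, 200, 100, 100],
    'C': ['apple', 'orange', 'mango', 'mango', 'orange'],
    'D': ['jupiter', 'mercury', 'mars', 'venus', 'venus'],
})
df2 = df1.astype({'D': 'category'})

print(pd.__version__)
# No categorical columns: nothing should be dropped.
print(columns_dropped_by_first(df1, ['A', 'B']))
# Categorical column D: an affected version would list 'D' here.
print(columns_dropped_by_first(df2, ['A', 'B']))
```

An empty list for df2 would indicate that the installed version produces the expected output; a list containing 'D' would match the behaviour reported here for 0.23.4.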
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.23.4
pytest: 3.7.2
pip: 10.0.1
setuptools: 40.0.0
Cython: 0.28.5
numpy: 1.15.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: 1.7.7
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.7
feather: None
matplotlib: 2.2.3
openpyxl: 2.5.5
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.7
lxml: 4.2.4
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.10
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None