Grouping-aggregation with first() discards categoricals column #22512

Closed
@zoquda

Code Sample

import pandas as pd

# Two dataframes that are identical, except that df2 has a categorical column
df1 = pd.DataFrame({
    'A': [1, 1, 1, 2, 2],
    'B': [100, 100, 200, 100, 100],
    'C': ['apple', 'orange', 'mango', 'mango', 'orange'],
    'D': ['jupiter', 'mercury', 'mars', 'venus', 'venus'],
})
df2 = df1.astype({'D': 'category'})

# These groupby-aggregations all give results as expected
expected_result_1 = df1.groupby(by='A').first()
expected_result_2 = df2.groupby(by='A').first()
expected_result_3 = df1.groupby(by=['A', 'B']).first()
expected_result_4 = df1.groupby(by=['A', 'B']).head(1)
expected_result_5 = df2.groupby(by=['A', 'B']).head(1)

# This groupby-aggregation gives an unexpected result
unexpected_result = df2.groupby(by=['A', 'B']).first()

The resulting dataframes look as follows:

In [1]: expected_result_1
Out[1]:
     B      C        D
A
1  100  apple  jupiter
2  100  mango    venus

In [2]: expected_result_2
Out[2]:
     B      C        D
A
1  100  apple  jupiter
2  100  mango    venus

In [3]: expected_result_3
Out[3]:
           C        D
A B
1 100  apple  jupiter
  200  mango     mars
2 100  mango    venus

In [4]: expected_result_4
Out[4]:
   A    B      C        D
0  1  100  apple  jupiter
2  1  200  mango     mars
3  2  100  mango    venus

In [5]: expected_result_5
Out[5]:
   A    B      C        D
0  1  100  apple  jupiter
2  1  200  mango     mars
3  2  100  mango    venus

In [6]: unexpected_result
Out[6]:
           C
A B
1 100  apple
  200  mango
2 100  mango

Problem description

A groupby aggregation that uses first() as the aggregation function and passes multiple columns as by appears to drop categorical columns from the result (see the unexpected_result dataframe in the code sample above, where column D is missing). Besides being unexpected, this behaviour is inconsistent with the other, similar groupby operations above: the same call works with a single grouping column, with a non-categorical column D, and with head(1) instead of first().

See also this Stack Overflow page.
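
The inconsistency described above can also be checked programmatically. The following is only a minimal sketch: dropped_columns is a hypothetical helper name introduced here for illustration (it is not part of the pandas API), and it simply reports which non-key columns are missing from the result of first().

```python
import pandas as pd


def dropped_columns(df, by):
    """List the non-key columns of df that are missing after groupby(by).first().

    Hypothetical helper for illustration only; not part of pandas.
    """
    result = df.groupby(by=by).first()
    expected = [col for col in df.columns if col not in by]
    return [col for col in expected if col not in result.columns]


df1 = pd.DataFrame({
    'A': [1, 1, 1, 2, 2],
    'B': [100, 100, 200, 100, 100],
    'C': ['apple', 'orange', 'mango', 'mango', 'orange'],
    'D': ['jupiter', 'mercury', 'mars', 'venus', 'venus'],
})
df2 = df1.astype({'D': 'category'})

# Without a categorical column, nothing should be dropped
print(dropped_columns(df1, ['A', 'B']))

# With a categorical column, any name listed here was silently discarded
print(dropped_columns(df2, ['A', 'B']))
```

With the pandas 0.23.4 install reported in this issue, the second call would be expected to list 'D'. An empty list would indicate that the categorical column is retained.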

Expected Output

The expected result of df2.groupby(by=['A', 'B']).first() is:

           C        D
A B
1 100  apple  jupiter
  200  mango     mars
2 100  mango    venus

The older pandas version 0.21.0 reportedly does produce this expected output.
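
A possible workaround, sketched here under the assumption that the grouped columns contain no missing values, builds on the observation above that head(1) keeps the categorical column. Note that head(1) takes the first row of each group, whereas first() takes the first non-null value per column, so the two only agree when there are no nulls. The variable names keys and workaround are introduced here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'A': [1, 1, 1, 2, 2],
    'B': [100, 100, 200, 100, 100],
    'C': ['apple', 'orange', 'mango', 'mango', 'orange'],
    'D': ['jupiter', 'mercury', 'mars', 'venus', 'venus'],
})
df2 = df1.astype({'D': 'category'})

keys = ['A', 'B']

# head(1) keeps the first row of every group in original row order;
# moving the grouping keys into the index mimics the shape returned by first()
workaround = df2.groupby(by=keys).head(1).set_index(keys)
print(workaround)
```

Because head(1) preserves the original row order instead of sorting by group key, a sort_index() call may be needed on data where the groups do not already appear in sorted order.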

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: 3.7.2
pip: 10.0.1
setuptools: 40.0.0
Cython: 0.28.5
numpy: 1.15.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: 1.7.7
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.7
feather: None
matplotlib: 2.2.3
openpyxl: 2.5.5
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.7
lxml: 4.2.4
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.10
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Labels

Categorical: Categorical Data Type
Needs Tests: Unit test(s) needed to prevent regressions
Nuisance Columns: Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply
good first issue
