Skip to content

Groupby.transform returns data with incorrect indexing when grouping on categoricals with missing values #28788

Closed
@CassonStallings

Description

@CassonStallings

Code Sample, a copy-pastable example if possible

import pandas as pd
from pandas.api.types import CategoricalDtype

cat1_type = CategoricalDtype(categories=[101, 202, 303], ordered=True)
cat2_type = CategoricalDtype(categories=["a", "b", "c", "d"], ordered=True)

df = pd.DataFrame({'cat1': [101, 101, 101, 101, 101, 101], 
                  'vals':[1, 2, 3, 4, 5, 6],
                  'cat2': ["a", "b", "d", "d", "a", "c"]
                  })

# Original DataFrame
print(df)

# Everything works as expected when "cat1" and "cat2" are not categories
df['cat_max'] = df.groupby(["cat1", "cat2"]).transform(max)
print(df)

# Convert to cat1 and cat2 to categories
df = df.astype({'cat1': cat1_type, 'cat2': cat2_type, 'vals': 'float64'})

# Everything works when there are no missing intermediate categories
df.cat_max = df.groupby(["cat1", "cat2"]).transform(max)
print(df)

# With missing "c" category everything works if observed=True is specified
df_miss_c = df[df.cat2 != "c"].copy()
df_miss_c.loc[:, 'cat_max'] = df_miss_c.groupby(["cat1", "cat2"], observed=True).transform(max)
print(df_miss_c)

# With missing "c" category the next group picks up its NaNs
df_miss_c = df[df.cat2 != "c"].copy()
df_miss_c.loc[:, 'cat_max'] = df_miss_c.groupby(["cat1", "cat2"]).transform(max)
print(df_miss_c)

# With with missing "202" category you get two different NaNs
# If you group by cat1 then cat2 you get missing values like above
df_miss_202 = df.copy()
df_miss_202.loc[5, 'cat1'] = 303
df_miss_202.cat_max = df_miss_202.groupby(["cat2", "cat1"]).transform(max)
print(df_miss_202)

# With with missing "101" everything works as expected as long as only one value for cat1 is used
df_miss_101 = df.copy()
df_miss_101.cat1 = 202
df_miss_101.cat_max = df_miss_101.groupby(["cat2", "cat1"]).transform(max)
print(df_miss_101)

print(f'pandas version {pd.show_versions()}')

Problem description

From the documentation:
Transformation: perform some group-specific computations and return a like-indexed object.

Given that transformation's purpose is to return a like-indexed object is seems that it should do this be default, even without specifying observed=True when there are missing categories. Also the default behavior with missing categories changes depending on which category is missing.

Expected Output

The expected output and output provided if the grouper values are not categories or if observed = True is specified follow.

>>> # Original DataFrame
>>> print(df)
    cat1  vals cat2
0   101     1    a
1   101     2    b
2   101     3    d
3   101     4    d
4   101     5    a
5   101     6    c
>>>
>>> # Everything works when "cat1" and "cat2" are not categories
>>> df['cat_max'] = df.groupby(["cat1", "cat2"]).transform(max)
>>> print(df)
    cat1  vals cat2  cat_max
0   101     1    a        5
1   101     2    b        2
2   101     3    d        4
3   101     4    d        4
4   101     5    a        5
5   101     6    c        6

>>> # Convert to cat1 and cat2 to categories
>>> df = df.astype({'cat1': cat1_type, 'cat2': cat2_type, 'vals': 'float64'})

>>> # With missing "c" category everything works if observed=True is specified
>>> df_miss_c = df[df.cat2 != "c"].copy()
>>> df_miss_c.loc[:, 'cat_max'] = df_miss_c.groupby(["cat1", "cat2"], observed=True).transform(max)
>>> print(df_miss_c)
   cat1  vals cat2  cat_max
0  101   1.0    a      5.0
1  101   2.0    b      2.0
2  101   3.0    d      4.0
3  101   4.0    d      4.0
4  101   5.0    a      5.0

>>> # With with missing "101" everything works as expected as long as only one value for cat1 is used
>>> df_miss_101 = df.copy()
>>> df_miss_101.cat1 = 202
>>> df_miss_101.cat_max = df_miss_101.groupby(["cat2", "cat1"]).transform(max)
>>> print(df_miss_101)
    cat1  vals cat2  cat_max
0   202   1.0    a      5.0
1   202   2.0    b      2.0
2   202   3.0    d      4.0
3   202   4.0    d      4.0
4   202   5.0    a      5.0
5   202   6.0    c      6.0

When an intermediate categorical is missing, "202", or "c" the indexing is thrown off.

>>> # With missing "c" category the next group picks up its NaNs
>>> df_miss_c = df[df.cat2 != "c"].copy()
>>> df_miss_c.loc[:, 'cat_max'] = df_miss_c.groupby(["cat1", "cat2"]).transform(max)
>>> print(df_miss_c)
  cat1  vals cat2  cat_max
0  101   1.0    a      5.0
1  101   2.0    b      2.0
2  101   3.0    d      NaN
3  101   4.0    d      NaN
4  101   5.0    a      5.0
>>>
>>> # With with missing "202" category you get two different NaNs
>>> # If you group by cat1 then cat2 you get missing values like above
>>> df_miss_202 = df.copy()
>>> df_miss_202.loc[5, 'cat1'] = 303
>>> df_miss_202.cat_max = df_miss_202.groupby(["cat2", "cat1"]).transform(max)
>>> print(df_miss_202)
  cat1  vals cat2  cat_max
0  101   1.0    a      5.0
1  101   2.0    b      NaN
2  101   3.0    d      2.0
3  101   4.0    d      2.0
4  101   5.0    a      5.0
5  303   6.0    c      NaN

Output of pd.show_versions()

[paste the output of pd.show_versions() here below this line]

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 0.25.1
numpy : 1.17.2
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : 0.2.1
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.12.0
pytables : None
s3fs : None
scipy : 1.2.0
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

Metadata

Metadata

Assignees

No one assigned

    Labels

    ApplyApply, Aggregate, Transform, MapBugCategoricalCategorical Data TypeGroupby

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions