Description
Code Sample, a copy-pastable example if possible
import pandas as pd
from pandas.api.types import CategoricalDtype
cat1_type = CategoricalDtype(categories=[101, 202, 303], ordered=True)
cat2_type = CategoricalDtype(categories=["a", "b", "c", "d"], ordered=True)
df = pd.DataFrame({'cat1': [101, 101, 101, 101, 101, 101],
'vals':[1, 2, 3, 4, 5, 6],
'cat2': ["a", "b", "d", "d", "a", "c"]
})
# Original DataFrame
print(df)
# Everything works as expected when "cat1" and "cat2" are not categories
df['cat_max'] = df.groupby(["cat1", "cat2"]).transform(max)
print(df)
# Convert to cat1 and cat2 to categories
df = df.astype({'cat1': cat1_type, 'cat2': cat2_type, 'vals': 'float64'})
# Everything works when there are no missing intermediate categories
df.cat_max = df.groupby(["cat1", "cat2"]).transform(max)
print(df)
# With missing "c" category everything works if observed=True is specified
df_miss_c = df[df.cat2 != "c"].copy()
df_miss_c.loc[:, 'cat_max'] = df_miss_c.groupby(["cat1", "cat2"], observed=True).transform(max)
print(df_miss_c)
# With missing "c" category the next group picks up its NaNs
df_miss_c = df[df.cat2 != "c"].copy()
df_miss_c.loc[:, 'cat_max'] = df_miss_c.groupby(["cat1", "cat2"]).transform(max)
print(df_miss_c)
# With with missing "202" category you get two different NaNs
# If you group by cat1 then cat2 you get missing values like above
df_miss_202 = df.copy()
df_miss_202.loc[5, 'cat1'] = 303
df_miss_202.cat_max = df_miss_202.groupby(["cat2", "cat1"]).transform(max)
print(df_miss_202)
# With with missing "101" everything works as expected as long as only one value for cat1 is used
df_miss_101 = df.copy()
df_miss_101.cat1 = 202
df_miss_101.cat_max = df_miss_101.groupby(["cat2", "cat1"]).transform(max)
print(df_miss_101)
print(f'pandas version {pd.show_versions()}')
Problem description
From the documentation:
Transformation: perform some group-specific computations and return a like-indexed object.
Given that transformation's purpose is to return a like-indexed object is seems that it should do this be default, even without specifying observed=True when there are missing categories. Also the default behavior with missing categories changes depending on which category is missing.
Expected Output
The expected output and output provided if the grouper values are not categories or if observed = True is specified follow.
>>> # Original DataFrame
>>> print(df)
cat1 vals cat2
0 101 1 a
1 101 2 b
2 101 3 d
3 101 4 d
4 101 5 a
5 101 6 c
>>>
>>> # Everything works when "cat1" and "cat2" are not categories
>>> df['cat_max'] = df.groupby(["cat1", "cat2"]).transform(max)
>>> print(df)
cat1 vals cat2 cat_max
0 101 1 a 5
1 101 2 b 2
2 101 3 d 4
3 101 4 d 4
4 101 5 a 5
5 101 6 c 6
>>> # Convert to cat1 and cat2 to categories
>>> df = df.astype({'cat1': cat1_type, 'cat2': cat2_type, 'vals': 'float64'})
>>> # With missing "c" category everything works if observed=True is specified
>>> df_miss_c = df[df.cat2 != "c"].copy()
>>> df_miss_c.loc[:, 'cat_max'] = df_miss_c.groupby(["cat1", "cat2"], observed=True).transform(max)
>>> print(df_miss_c)
cat1 vals cat2 cat_max
0 101 1.0 a 5.0
1 101 2.0 b 2.0
2 101 3.0 d 4.0
3 101 4.0 d 4.0
4 101 5.0 a 5.0
>>> # With with missing "101" everything works as expected as long as only one value for cat1 is used
>>> df_miss_101 = df.copy()
>>> df_miss_101.cat1 = 202
>>> df_miss_101.cat_max = df_miss_101.groupby(["cat2", "cat1"]).transform(max)
>>> print(df_miss_101)
cat1 vals cat2 cat_max
0 202 1.0 a 5.0
1 202 2.0 b 2.0
2 202 3.0 d 4.0
3 202 4.0 d 4.0
4 202 5.0 a 5.0
5 202 6.0 c 6.0
When an intermediate categorical is missing, "202", or "c" the indexing is thrown off.
>>> # With missing "c" category the next group picks up its NaNs
>>> df_miss_c = df[df.cat2 != "c"].copy()
>>> df_miss_c.loc[:, 'cat_max'] = df_miss_c.groupby(["cat1", "cat2"]).transform(max)
>>> print(df_miss_c)
cat1 vals cat2 cat_max
0 101 1.0 a 5.0
1 101 2.0 b 2.0
2 101 3.0 d NaN
3 101 4.0 d NaN
4 101 5.0 a 5.0
>>>
>>> # With with missing "202" category you get two different NaNs
>>> # If you group by cat1 then cat2 you get missing values like above
>>> df_miss_202 = df.copy()
>>> df_miss_202.loc[5, 'cat1'] = 303
>>> df_miss_202.cat_max = df_miss_202.groupby(["cat2", "cat1"]).transform(max)
>>> print(df_miss_202)
cat1 vals cat2 cat_max
0 101 1.0 a 5.0
1 101 2.0 b NaN
2 101 3.0 d 2.0
3 101 4.0 d 2.0
4 101 5.0 a 5.0
5 303 6.0 c NaN
Output of pd.show_versions()
[paste the output of pd.show_versions()
here below this line]
INSTALLED VERSIONS
commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None
pandas : 0.25.1
numpy : 1.17.2
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : 0.2.1
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.12.0
pytables : None
s3fs : None
scipy : 1.2.0
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None