Description
First, thank you very much for your work! I love pandas!
The problem
Code Sample
import pandas as pd
# Initialise the data as a dict of lists.
data = {'Fruit':['Apple', 'Orange', 'Apple'],
'Origin':['France', 'France', 'Spain'],
'Price':[10, 15, 20]}
# Create DataFrame without categorical variable
df = pd.DataFrame(data)
# Same dataframe with categorical variable
df_category = df.copy()
df_category['Origin'] = df_category['Origin'].astype('category')
df
# Normal behavior
df.groupby(['Fruit','Origin'])['Price'].mean()
# Abnormal behavior
df_category.groupby(['Fruit','Origin'])['Price'].mean()
Return
# Normal behavior
Fruit   Origin
Apple   France    10
        Spain     20
Orange  France    15
Name: Price, dtype: int64
# Abnormal behavior
Fruit   Origin
Apple   France    10.0
        Spain     20.0
Orange  France    15.0
        Spain      NaN
Name: Price, dtype: float64
Problem description
This bug should have been fixed by #20583, so I suppose it is a regression. There has been no new discussion there since May 2018, so I prefer to open a new issue.
Brief description: when one of the groupby columns is categorical, the output includes rows with NaN for combinations that never occur in the data (here Orange / Spain).
Expected Output
Fruit   Origin
Apple   France    10
        Spain     20
Orange  France    15
Name: Price, dtype: int64
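Possible workaround (a sketch, not a confirmed fix for this issue): groupby accepts an observed parameter for categorical groupers. Assuming a pandas version that supports it, passing observed=True should keep only the (Fruit, Origin) pairs that actually occur in the data, which drops the Orange / Spain row. The name observed_mean is only illustrative, and the dtype of the result may differ between pandas versions.

```python
import pandas as pd

data = {
    "Fruit": ["Apple", "Orange", "Apple"],
    "Origin": ["France", "France", "Spain"],
    "Price": [10, 15, 20],
}
df_category = pd.DataFrame(data)
df_category["Origin"] = df_category["Origin"].astype("category")

# observed=True restricts the result to category combinations present in the data,
# instead of the full product of the grouping keys.
observed_mean = df_category.groupby(["Fruit", "Origin"], observed=True)["Price"].mean()
print(observed_mean)
```

If the unobserved combinations are wanted (for example to report empty groups explicitly), observed=False keeps them and they can be filled or dropped afterwards.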
Output of pd.show_versions()
pandas : 0.25.1
numpy : 1.16.5
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.2
setuptools : 41.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.1
html5lib : None
pymysql : None
psycopg2 : 2.7.5 (dt dec pq3 ext lo64)
jinja2 : 2.10.1
IPython : 7.8.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : 2.6.2
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : None
tables : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None