Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
import numpy as np
# Create a DataFrame with a categorical column with two categories and a (NumPy) boolean column that is randomly True or False
# (no random seed is set, so the exact values shown below vary between runs)
df = pd.DataFrame.from_dict({'category':['A']*10+['B']*10,
'bool_numpy': np.random.rand(20)>0.5})
#Now make another column that is a copy of the numpy boolean column, but converted to pyarrow
df['bool_arrow'] = df['bool_numpy'].astype('bool[pyarrow]')
print(df.head())
# category bool_numpy bool_arrow
# 0 A True True
# 1 A True True
# 2 A True True
# 3 A True True
# 4 A False False
# Now do a groupby and aggregate to compute the fraction of True values in each column:
true_fracs = df.groupby('category').agg(lambda x: x.sum()/x.count())
print(true_fracs)
# bool_numpy bool_arrow
# category
# A 0.7 True
# B 0.6 True
#I expect both columns above to have identical floating-point values, not boolean.
Issue Description
Doing a groupby and aggregation on a bool[pyarrow] column returns a different dtype than the same operation on a NumPy bool column. In particular, it seems to always return another bool[pyarrow] column, regardless of the aggregation performed.
Expected Behavior
I would expect the same datatype and results to be returned regardless of the backend chosen. Specifically, I would expect the result for category 'A' to be the same as the result of the following calculation, which is the same regardless of backend:
# Use the same category 'A' subset for both the sum and the count
subset_a = df.query("category=='A'")[['bool_numpy','bool_arrow']]
print(subset_a.sum()/subset_a.count())
# bool_numpy 0.7
# bool_arrow 0.7
# dtype: float64
OR, if this is the intended behavior, I would expect this change to be prominently displayed in the groupby documentation.
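To make the expected values concrete, here is a minimal sketch of a reference calculation that computes the fraction of True values group by group without going through `agg`. It uses only a NumPy-backed column; the helper name `expected_true_fraction` and the seeded data are hypothetical and not from the original report.

```python
import numpy as np
import pandas as pd

# Seeded data so the sketch is repeatable (the original report uses unseeded random data).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "category": ["A"] * 10 + ["B"] * 10,
    "bool_numpy": rng.random(20) > 0.5,
})

def expected_true_fraction(frame, value_col, by="category"):
    """Fraction of True values per group, computed one group at a time."""
    fractions = {}
    for key, group in frame.groupby(by):
        values = group[value_col]
        fractions[key] = values.sum() / values.count()
    return pd.Series(fractions, name=value_col, dtype="float64")

expected = expected_true_fraction(df, "bool_numpy")
print(expected)
```

The result of the groupby aggregation in the reproducible example could then be compared against this Series for each backend.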
Installed Versions
pandas : 2.0.1
numpy : 1.23.5
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 57.5.0
pip : 23.0.1
Cython : 0.29.33
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.10.0
pandas_datareader: None
bs4 : 4.11.2
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.0
numba : 0.56.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.1
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : 2023.1.0
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : 2.3.0
pyqt5 : None