Pandas version checks
- I have checked that the issue still exists on the latest versions of the docs on main.
Location of the documentation
https://pandas.pydata.org/docs/dev/reference/api/pandas.DataFrame.groupby.html#pandas.DataFrame.groupby
https://pandas.pydata.org/docs/dev/user_guide/groupby.html#
Documentation problem
As far as I understand, groupby works on any hashable (or combination of hashables), but some unintuitive situations deserve a mention in the docs.
For example, the following should and does work (in the sense that no exception is raised, but the result is not necessarily what one expects):
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [False, 1, True, 'string'], 'b': [1, 2, 3, 4]})
>>> df
        a  b
0   False  1
1       1  2
2    True  3
3  string  4
# let's groupby on 'a':
>>> df.groupby('a').describe()
           b
       count mean       std  min   25%  50%   75%  max
a
False    1.0  1.0       NaN  1.0  1.00  1.0  1.00  1.0
1        2.0  2.5  0.707107  2.0  2.25  2.5  2.75  3.0
string   1.0  4.0       NaN  4.0  4.00  4.0  4.00  4.0
We can see that the values 1 and True were grouped together. Coming from a Python background, that is not unexpected: 1 == True evaluates to True, i.e. 1 and True are equal and thus end up in the same group.
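For reference, the same behaviour can be reproduced in plain Python without pandas. This is only a minimal sketch of the underlying equality and hashing rules; the `groups` dict is an illustrative stand-in and says nothing about how groupby is implemented internally.

```python
# In Python, bool is a subclass of int, so 1 and True compare equal
# and share the same hash. The same holds for 0 and False.
assert 1 == True and hash(1) == hash(True)
assert 0 == False and hash(0) == hash(False)

# Illustrative stand-in for hash-based grouping: a dict keyed on the values.
groups = {}
for value in [False, 1, True, "string"]:
    groups.setdefault(value, []).append(value)

# True lands under the existing key 1, so only three keys remain.
print(len(groups))  # 3
print(groups[1])    # [1, True]
```

Any container that relies on `==` and `hash()` will merge these values in the same way.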
While this data layout may be questionable, it certainly happens in practice. I often find myself using (abusing?) groupby() on multiple columns to identify unique combinations of values, and in that case I would not want 1 to be considered equal to True (or 0 equal to False, or whichever other 'convenience' conversions Python does).
Suggested fix for documentation
I think the docs could benefit from mentioning this pitfall. The workaround I would suggest is to convert the columns to strings before the groupby:
>>> df.groupby(df['a'].apply(str)).describe()
           b
       count mean std  min  25%  50%  75%  max
a
1        1.0  2.0 NaN  2.0  2.0  2.0  2.0  2.0
False    1.0  1.0 NaN  1.0  1.0  1.0  1.0  1.0
True     1.0  3.0 NaN  3.0  3.0  3.0  3.0  3.0
string   1.0  4.0 NaN  4.0  4.0  4.0  4.0  4.0
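One caveat with the string conversion: it keeps 1 and True apart, but it would merge the integer 1 with the string '1' if both occurred in the column. A possible variant, sketched below, also groups on the type name of each value. The extra '1' row and the names `type_key`, `str_key` and `sizes` are assumptions made for illustration and are not part of the original example.

```python
import pandas as pd

# Same data as above, plus a hypothetical string '1' to show the collision case.
df = pd.DataFrame({"a": [False, 1, True, "string", "1"], "b": [1, 2, 3, 4, 5]})

# Key 1: the Python type name of each value ('bool', 'int', 'str').
type_key = df["a"].map(lambda v: type(v).__name__).rename("a_type")
# Key 2: the string form of each value, as in the workaround above.
str_key = df["a"].map(str).rename("a_str")

# Grouping on both keys keeps 1, True and '1' in separate groups.
sizes = df.groupby([type_key, str_key]).size()
print(sizes)
```

With this input every row should end up in its own group, so `sizes` has five entries of size 1.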