DOC: Clarify how groupby forms groups for mixed-value columns #57526

Closed
@gabuzi

Description

Pandas version checks

  • I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

https://pandas.pydata.org/docs/dev/reference/api/pandas.DataFrame.groupby.html#pandas.DataFrame.groupby
https://pandas.pydata.org/docs/dev/user_guide/groupby.html#

Documentation problem

As I understand it, groupby works on any hashable values (or combinations of them), but some unintuitive cases deserve a mention in the docs.

For example, the following runs without raising an exception, but the result is not necessarily what one would expect:

>>> import pandas as pd
>>> df = pd.DataFrame({'a': [False, 1, True, 'string'], 'b': [1, 2, 3, 4]})
>>> df
        a  b
0   False  1
1       1  2
2    True  3
3  string  4

# let's groupby on 'a':
>>> df.groupby('a').describe()
           b
       count mean       std  min   25%  50%   75%  max
a
False    1.0  1.0       NaN  1.0  1.00  1.0  1.00  1.0
1        2.0  2.5  0.707107  2.0  2.25  2.5  2.75  3.0
string   1.0  4.0       NaN  4.0  4.00  4.0  4.00  4.0

The values 1 and True end up in the same group. Coming from Python, this is not surprising: 1 == True evaluates to True and hash(1) == hash(True), so the two values are treated as the same key and land in the same group.
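
To illustrate the equality point in plain Python, here is a minimal sketch that groups the same keys with an ordinary dict. It is only an analogy: it assumes pandas groups object-dtype keys by hash and equality in a comparable way, which matches the output above but is not taken from the pandas source.

```python
# Keys and values mirror columns 'a' and 'b' of the example frame.
keys = [False, 1, True, "string"]
values = [1, 2, 3, 4]

# bool is a subclass of int, so these compare equal and hash alike.
print(1 == True, hash(1) == hash(True))    # True True
print(0 == False, hash(0) == hash(False))  # True True

# Group values by key with a plain dict (hash-and-equality lookup).
groups = {}
for key, value in zip(keys, values):
    groups.setdefault(key, []).append(value)

# True falls into the bucket already created for 1.
print(groups)  # {False: [1], 1: [2, 3], 'string': [4]}
```

The dict keeps the key that was inserted first, which is consistent with the group above being labelled 1 rather than True.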

While this data layout may be questionable, it certainly happens in reality. I often find myself using (abusing?) groupby() on multiple columns to identify unique combinations of values, and in that case I would not want 1 to be considered equal to True (or 0 equal to False, or any other 'convenience' equality Python provides).

Suggested fix for documentation

I think the docs could benefit from mentioning this pitfall. One workaround I see is to convert the column to strings before the groupby:

>>> df.groupby(df['a'].apply(str)).describe()
           b
       count mean std  min  25%  50%  75%  max
a
1        1.0  2.0 NaN  2.0  2.0  2.0  2.0  2.0
False    1.0  1.0 NaN  1.0  1.0  1.0  1.0  1.0
True     1.0  3.0 NaN  3.0  3.0  3.0  3.0  3.0
string   1.0  4.0 NaN  4.0  4.0  4.0  4.0  4.0
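
One caveat with the string conversion is that it introduces collisions of its own: the integer 1 and the string '1' both become '1'. A possible variation, sketched below, tags each value with its type name before grouping. The helper `typed_key` is hypothetical (it is not part of pandas), and `sort=False` is passed only to sidestep ordering of mixed-type keys.

```python
import pandas as pd

df = pd.DataFrame({"a": [False, 1, True, "string"], "b": [1, 2, 3, 4]})


def typed_key(value):
    """Hypothetical helper: build a key such as 'bool:True' or 'int:1'."""
    return f"{type(value).__name__}:{value}"


# Grouping on the raw column merges 1 and True into one group.
plain = df.groupby("a", sort=False).ngroups

# Grouping on the type-tagged key keeps them apart.
typed = df.groupby(df["a"].map(typed_key), sort=False).ngroups
print(plain, typed)  # 3 4

# str() alone would merge the integer 1 with the string '1'.
mixed = pd.Series([1, "1"])
print(mixed.map(str).nunique(), mixed.map(typed_key).nunique())  # 1 2
```

The trade-off is that the group labels become tagged strings, so this suits identifying unique combinations more than presenting results.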
