
BUG: groupby doesn't distinguish between different kinds of null values #48476

Open
@rhshadrach

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, pd.NA, None], 'b': [1, 2, 3]})
gb = df.groupby('a', dropna=False)
print(gb.sum())

#      b
# a     
# NaN  6

The three kinds of null value are currently combined into a single group. There are various places in pandas where different kinds of null value are treated as identical, e.g. pd.Series([np.nan, None]) converts None to np.nan. However, within groupby, I think that if the input contains distinct values, they should remain distinct in the groupby result.
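To make the distinction concrete, here is a minimal sketch (not part of the original report) showing that the three values are separate Python objects that pandas nonetheless all reports as missing, and that Series construction coerces None to np.nan when it infers a float dtype. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# The three null values from the example are distinct Python objects...
nulls = [np.nan, pd.NA, None]
distinct = len({id(v) for v in nulls})

# ...yet pd.isna reports each of them as missing.
all_missing = all(pd.isna(v) for v in nulls)

# With only np.nan and None, Series construction infers float64,
# so None is converted to np.nan.
s = pd.Series([np.nan, None])

print(distinct, all_missing, s.dtype)
```

Because pd.isna does not tell the three apart, any code path that relies on a generic null check will fold them together.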

Changing this requires modifying factorize, which could affect code outside of groupby. Our tests didn't catch any such instances (see #48477 for an implementation), but I plan to look into this further.
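As a rough illustration of why factorize is the relevant place, the sketch below calls the public pd.factorize on an object array holding all three null kinds plus one ordinary value. It assumes the behavior described in this issue (all nulls treated alike) and does not reflect the change proposed in #48477; the input array is made up for the example.

```python
import numpy as np
import pandas as pd

# Object array holding the three null kinds plus one ordinary value.
values = np.array([np.nan, pd.NA, None, "x"], dtype=object)

# By default, factorize assigns the sentinel code -1 to every null,
# regardless of which kind of null it is.
codes, uniques = pd.factorize(values)

print(codes)
print(uniques)
```

Since every null receives the same code, anything built on those codes has no way to recover which kind of null was in the input.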

Assuming we consider the current output undesirable, we need to decide whether to call this a bug or to put it through deprecation. The current behavior is tested in test_groupby_dropna_multi_index_dataframe_nan_in_two_groups, which I think might suggest deprecating; however, that test was added in the original PR that introduced dropna to groupby (#30584). I went through that PR and did not see any discussion of this behavior. That makes me lean toward calling it a bug, but I could go either way here.

Slightly related: #32265

cc @charlesdong1991 @jorisvandenbossche @jbrockmendel @mroeschke @jreback for any thoughts.

Labels: Bug; Groupby; Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate); Needs Discussion (requires discussion from core team before further action)