BUG: groupby doesn't distinguish between different kinds of null values

```
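import numpy as np
import pandas as pd

# Group on an object column holding three different null values
df = pd.DataFrame({'a': [np.nan, pd.NA, None], 'b': [1, 2, 3]})
gb = df.groupby('a', dropna=False)
print(gb.sum())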

#      b
# a     
# NaN  6
```

The three types of null values are currently combined into a single group. There are various places in pandas where different types of null values are treated as identical, e.g. `pd.Series([np.nan, None])` converts `None` to `np.nan`. However, within groupby, I think that if the input contains distinct values, they should remain distinct in the groupby result.
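To make the distinction concrete, here is a minimal sketch contrasting the constructor coercion mentioned above with an object-dtype Series, where the three null objects stay as passed in. The variable names are only for illustration.

```python
import numpy as np
import pandas as pd

# Default construction: dtype inference converts None to np.nan (float64).
coerced = pd.Series([np.nan, None])
print(coerced.dtype)  # float64

# With an explicit object dtype, the three null objects are stored unchanged,
# so the input to groupby really does contain three distinct values.
preserved = pd.Series([np.nan, pd.NA, None], dtype=object)
kinds = [type(v).__name__ for v in preserved]
print(kinds)  # ['float', 'NAType', 'NoneType']
```

This is the situation in the example at the top: column `a` is object dtype, so the distinct nulls exist in the input and are only merged once groupby runs.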

Changing this requires a change in `factorize`, which could affect parts of pandas outside of groupby. Our tests didn't catch any such cases (see #48477 for an implementation), but I plan to look into this further.
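As a rough illustration of why `factorize` is the relevant place, the sketch below calls the public `pd.factorize` with default arguments on the same three values. It only shows that all nulls are treated alike there; it is not a trace of the exact code path groupby takes internally.

```python
import numpy as np
import pandas as pd

values = np.array([np.nan, pd.NA, None], dtype=object)

# With default arguments, every null value receives the sentinel code -1
# and none of them appears in the uniques.
codes, uniques = pd.factorize(values)
print(codes)
print(len(uniques))
```

All three inputs end up with the same code, so any caller that builds groups from these codes cannot tell the nulls apart.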

Assuming we consider the current output undesirable, we need to decide whether to call this a bug or put it through a deprecation cycle. The current behavior is tested in `test_groupby_dropna_multi_index_dataframe_nan_in_two_groups`, which might suggest deprecating; however, that test was added in the original PR that introduced `dropna` to groupby (#30584). I went through that PR and did not see any discussion of this behavior. That makes me lean toward calling it a bug, but I could go either way here.

Slightly related: https://github.com/pandas-dev/pandas/issues/32265

cc @charlesdong1991 @jorisvandenbossche @jbrockmendel @mroeschke @jreback for any thoughts.

BUG: groupby doesn't distinguish between different kinds of null values #48476