Description
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, pd.NA, None], 'b': [1, 2, 3]})
gb = df.groupby('a', dropna=False)
print(gb.sum())
#      b
# a
# NaN  6
The three types of null values currently get combined into a single group. There are various places in pandas where different types of null values are identified, e.g. pd.Series([np.nan, None]) converts None to np.nan. However, within groupby, I think if the input contains distinct values, then they should remain distinct in the groupby result.
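To make the contrast concrete, here is a minimal sketch of the two situations described above: Series construction collapsing nulls, versus an object-dtype column (as in the groupby example) that still holds three distinct null objects. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Constructing a Series from np.nan and None infers float64,
# so None is stored as NaN and the distinction is lost up front.
s = pd.Series([np.nan, None])
print(s.dtype)
print(s.isna().all())

# With object dtype the original null objects are preserved,
# so the input to groupby really does contain distinct values.
obj = pd.Series([np.nan, pd.NA, None], dtype=object)
print([type(v).__name__ for v in obj])
```

In the second case the values are still distinguishable before grouping, which is why collapsing them inside groupby is a separate decision from the coercion that happens at construction time.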
Changing this would require a change in factorize, which could affect code outside of groupby. Our tests didn't catch any such instances (see #48477 for an implementation), but I plan to look into this further.
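As a rough illustration of where the grouping codes come from, the sketch below calls the public pd.factorize on the same three null values. It assumes pd.factorize reflects the logic groupby relies on internally, and that use_na_sentinel is available (pandas 1.5 or later); it shows current behavior only, not the change proposed in #48477.

```python
import numpy as np
import pandas as pd

values = np.array([np.nan, pd.NA, None], dtype=object)

# Default: every null is treated as missing and receives the sentinel code.
codes, uniques = pd.factorize(values)
print(codes, uniques)

# use_na_sentinel=False keeps nulls among the uniques instead of using a
# sentinel; this is the closest public analogue to groupby(dropna=False).
codes_keep, uniques_keep = pd.factorize(values, use_na_sentinel=False)
print(codes_keep, uniques_keep)
```

Comparing codes_keep across pandas versions is one way to check whether the three nulls are assigned the same code or distinct ones.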
Assuming we consider the current output undesirable, we need to decide whether to call this a bug or to put it through deprecation. The current behavior is tested for in test_groupby_dropna_multi_index_dataframe_nan_in_two_groups, which I think might suggest deprecating; however, that test was added in the original PR that introduced dropna to groupby (#30584). I went through that PR and did not see any discussion of this behavior. That makes me lean toward calling it a bug, but I could go either way here.
Slightly related: #32265
cc @charlesdong1991 @jorisvandenbossche @jbrockmendel @mroeschke @jreback for any thoughts.