Description
Once #46072 is implemented, many groupby ops will default to numeric_only=False in 2.0. However, a number of groupby ops can only ever work on numeric data. For API consistency, I believe these ops should raise when a user tries to operate on non-numeric columns. Consider the example
import pandas as pd

df = pd.DataFrame({'a': [1, 1], 'b': [3, 4], 'c': [5, 6]})
df['c'] = df['c'].astype(object)
gb = df.groupby('a')
print(gb.mean())
which gives the output
     b
a
1  3.5
If a user has a numeric column that accidentally ends up as object dtype, the result will silently be missing expected columns. This is why I think we should run the op on all provided data, regardless of whether it is numeric.
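As a rough illustration of the problem, the columns at risk of being silently skipped can be listed up front. This is only a sketch: it uses `select_dtypes` as an approximation of what a numeric-only op would keep (the groupby machinery may classify some dtypes, such as bool, differently), and the variable names are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1], "b": [3, 4], "c": [5, 6]})
df["c"] = df["c"].astype(object)

# Columns (other than the grouping key) that are not numeric dtype.
# Approximation: select_dtypes("number") is used as a stand-in for
# whatever the groupby op itself would treat as numeric.
non_numeric = df.drop(columns="a").select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# Passing numeric_only=True explicitly makes the column drop intentional
# rather than silent.
result = df.groupby("a").mean(numeric_only=True)
print(result)
```

Here `c` holds integers but is stored as object dtype, so it shows up in `non_numeric` and is absent from `result`.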
The following groupby ops have no numeric_only argument, act like numeric_only=True, and only make sense on numeric data.
- quantile (ENH: Add numeric_only to certain groupby ops #46728)
- std (ENH: Add numeric_only to certain groupby ops #46728)
- mad (DEPR: mad #46707)
- sem (ENH: Add numeric_only to certain groupby ops #46728)
- var (ENH: Add numeric_only to certain groupby ops #46728)
- corrwith (ENH: Add numeric_only to frame methods #46708)
- cumprod (ENH: Add numeric_only to frame methods #46708)
- cov (ENH: Add numeric_only to frame methods #46708)
- corr (ENH: Add numeric_only to frame methods #46708)
The following groupby ops have no numeric_only argument and act like numeric_only=True, but make sense on non-numeric data.
- idxmin (ENH: Add numeric_only to certain groupby ops #46728)
- idxmax (ENH: Add numeric_only to certain groupby ops #46728)
- cummax (ENH: Add numeric_only to frame methods #46708)
- cumsum (ENH: Add numeric_only to frame methods #46708)
- cummin (ENH: Add numeric_only to frame methods #46708)
For both groups of ops, I propose we add the numeric_only argument in 1.5, defaulting to True and emitting a warning that the default will change to False in the future. The warning would only be emitted if numeric_only=True and numeric_only=False would give different output, i.e. if there are non-numeric columns that could have been operated on.
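The conditional warning could look roughly like the sketch below. Everything here is hypothetical and not pandas internals: the helper `resolve_numeric_only`, the `_NO_DEFAULT` sentinel, and the warning text are made up for illustration. It also simplifies "columns that could have been operated on" to "any non-numeric, non-key column".

```python
import warnings

import pandas as pd

# Hypothetical sentinel meaning "the caller did not pass numeric_only".
_NO_DEFAULT = object()


def resolve_numeric_only(df, by, numeric_only=_NO_DEFAULT):
    """Return the effective numeric_only, warning only when the default matters."""
    if numeric_only is not _NO_DEFAULT:
        # An explicit choice never warns.
        return bool(numeric_only)

    # Simplification: treat every non-numeric, non-key column as one that
    # numeric_only=False could have operated on.
    candidates = df.drop(columns=by)
    non_numeric = candidates.select_dtypes(exclude="number").columns.tolist()
    if non_numeric:
        warnings.warn(
            "The default value of numeric_only will change to False in a "
            f"future version; columns {non_numeric} are currently dropped. "
            "Pass numeric_only explicitly to silence this warning.",
            FutureWarning,
            stacklevel=2,
        )
    return True


df = pd.DataFrame({"a": [1, 1], "b": [3, 4], "c": [5, 6]})
df_obj = df.astype({"c": object})

resolve_numeric_only(df, "a")  # all columns numeric: no warning
resolve_numeric_only(df_obj, "a", numeric_only=False)  # explicit: no warning
```

With this shape, users whose frames are entirely numeric never see the warning, which is the property that keeps the deprecation low-noise.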
It's not ideal to add an argument and deprecate its default value in the same minor release (assuming 1.5 is the last minor release in the 1.x series), but I believe the impact on users will be minor. The alternatives would be to not carry out the deprecation of numeric_only=True, or to leave these ops behaving as if numeric_only=True (with no numeric_only argument). Both of these seem worse to me.
cc @jreback @jbrockmendel @jorisvandenbossche @simonjayhawkins @Dr-Irv