Description
Everything in this issue also applies to Series.groupby and SeriesGroupBy; I will just be writing it for DataFrame.
Currently DataFrame.groupby has two arguments that are essentially for the same thing:
- as_index: Whether to include the group keys in the index or, when the groupby is done on column labels (see #49519), in the columns.
- group_keys: Whether to include the group keys in the index when calling DataFrameGroupBy.apply.
as_index only applies to reductions, while group_keys only applies to apply. I think this is confusing and unnecessarily restrictive.
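For reference, here is a minimal sketch of the current split. The toy DataFrame and the first_row UDF are made up for illustration, and the exact index layout returned by apply has shifted a bit between pandas versions, so the sketch only relies on the parts I believe are stable.

```python
import pandas as pd

# Toy data, invented for this illustration.
df = pd.DataFrame({"a": [1, 1, 2, 2], "b": [1, 2, 3, 4]})

# as_index controls where the keys go for a reduction...
keys_in_index = df.groupby("a", as_index=True).sum()
keys_in_columns = df.groupby("a", as_index=False).sum()

# ...but it has no effect on a transform such as cumsum,
# which keeps the input's index either way.
transformed = df.groupby("a", as_index=False).cumsum()


def first_row(group):
    # Filter-like UDF (hypothetical): keep the first row of each group.
    return group.head(1)


# group_keys controls where the keys go for apply only.
with_keys = df.groupby("a", group_keys=True)[["b"]].apply(first_row)
without_keys = df.groupby("a", group_keys=False)[["b"]].apply(first_row)

print(keys_in_index)
print(keys_in_columns)
print(transformed)
print(with_keys)
print(without_keys)
```

So two different arguments are needed to control what is conceptually one thing, and neither of them covers transforms or filters.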
I propose we
- Deprecate both as_index and group_keys
- Add keys_axis to both DataFrame.groupby and DataFrameGroupBy.apply; these take the same arguments, the only difference being that the value in DataFrameGroupBy.apply, if specified, overrides the value in DataFrame.groupby.
keys_axis can accept the following values:
- "infer" (the default): One of the following behaviors, inferred from the computation depending on whether it is a reduction, transform, or filter.
- "index" or 0: Add the keys to the index (similar to as_index=True or group_keys=True)
- "columns" or 1: Add the keys to the columns (similar to as_index=False)
- "none": Add the keys to neither the index nor the columns. For pandas methods (e.g. sum, cumsum, head), reductions will return a RangeIndex, while transforms and filters will behave as they do today, returning the input's index (or a subset of it for a filter). For apply, this will behave the same as group_keys=False today.
Unlike as_index, this argument will be respected in all groupby functions, whether they be reductions, transforms, or filters.
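To make the explicit values concrete, here is a rough emulation for a single reduction using only arguments that exist today. The helper sum_with_keys_axis is hypothetical and not part of pandas; it deliberately leaves out "infer", since the inference rules are not pinned down here.

```python
import pandas as pd


def sum_with_keys_axis(df, by, keys_axis):
    """Hypothetical helper: emulate the proposed keys_axis for a sum reduction."""
    if keys_axis in ("index", 0):
        # Keys go in the index, as with as_index=True today.
        return df.groupby(by, as_index=True).sum()
    if keys_axis in ("columns", 1):
        # Keys go in the columns, as with as_index=False today.
        return df.groupby(by, as_index=False).sum()
    if keys_axis == "none":
        # Keys go nowhere: drop them and fall back to a RangeIndex.
        return df.groupby(by, as_index=True).sum().reset_index(drop=True)
    raise ValueError(f"unsupported keys_axis: {keys_axis!r}")


# Toy data, invented for this illustration.
df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})

in_index = sum_with_keys_axis(df, "a", "index")
in_columns = sum_with_keys_axis(df, "a", "columns")
nowhere = sum_with_keys_axis(df, "a", "none")

print(in_index)
print(in_columns)
print(nowhere)
```

The point of the proposal is that this mapping would live in one argument and apply uniformly, rather than being spread over as_index, group_keys, and manual reset_index calls.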
Path to implementation:
- Add keys_axis in 2.0, and add either a PendingDeprecationWarning or a DeprecationWarning to as_index / group_keys
- Change warnings for as_index / group_keys to a FutureWarning in 2.1
- Enforce deprecations in 3.0
A few natural questions come to mind:
- Why introduce a new argument, why not keep either as_index or group_keys?
Currently these arguments are Boolean, whereas the new argument needs to accept more than two values, and its name should reflect that it accepts an axis. Also, adding a new argument provides a cleaner and more gradual path for deprecation.
- Why add keys_axis to DataFrameGroupBy.apply?
In other groupby methods, we can reliably use keys_axis="infer" to determine the correct placement of the keys. However, in apply the behavior is inferred from the output, and various cases can coincide, e.g. a reduction and a transformation on a DataFrame with a single row. We want the user to be able to use "infer" on other groupby methods, but still be able to specify how their UDF in apply acts. E.g.
gb = df.groupby(["a", "b"], keys_axis="infer")
print(gb.sum()) # Act as a reduction
print(gb.head()) # Act as a filter
print(gb.cumsum()) # Act as a transform
print(gb.apply(my_udf, keys_axis="index")) # infer from the groupby call is not reliable here, allow user to specify how apply should act
- Why should keys_axis accept the value "none"?
This is currently how transforms and filters work: the keys are added to neither the index nor the columns. We need to keep the ability to tell groupby(...).apply that the UDF provided acts as a transform or filter.
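As a small illustration of the behavior that "none" is meant to preserve, here is how a built-in transform and a built-in filter treat the index today. The DataFrame and its string index are made up for the example.

```python
import pandas as pd

# Toy data with a non-default index, so it is easy to see what is kept.
df = pd.DataFrame(
    {"a": [1, 1, 2, 2], "b": [1, 2, 3, 4]},
    index=["w", "x", "y", "z"],
)
gb = df.groupby("a")

# Transform: the result carries the input's index; the keys are not added to it.
transformed = gb.cumsum()

# Filter: the result carries a subset of the input's index.
filtered = gb.head(1)

print(transformed)
print(filtered)
```

A UDF passed to apply can act in either of these ways, which is why apply needs a way to be told not to add the keys anywhere.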
- Why not name the argument group_keys_axis?
I find "group" here redundant, but would be fine with this name too, and am happy to consider other potential names.
cc @pandas-dev/pandas-core @pandas-dev/pandas-triage