API: Consolidate groupby as_index and group_keys #49543

Open
@rhshadrach

Everything in this issue also applies to Series.groupby and SeriesGroupBy; I will just be writing it for DataFrame.

Currently DataFrame.groupby has two arguments that essentially do the same thing:

  • as_index: Whether to include the group keys in the index or, when the groupby is done on column labels (see #49519), in the columns.
  • group_keys: Whether to include the group keys in the index when calling DataFrameGroupBy.apply.

as_index applies only to reductions, while group_keys applies only to apply. I think this is confusing and unnecessarily restrictive.
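
To make the split concrete, here is a minimal sketch of today's behavior using only existing pandas arguments. The toy frame and the `head(1)` UDF are made up for illustration; the apply call selects column "b" so the result does not depend on how a given pandas version treats the grouping column.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})

# as_index controls where the keys go for a reduction.
red_in_index = df.groupby("a", as_index=True).sum()     # keys become the index
red_in_columns = df.groupby("a", as_index=False).sum()  # keys stay as a column

# group_keys controls whether apply prepends the keys to the index.
with_keys = df.groupby("a", group_keys=True)["b"].apply(lambda s: s.head(1))
without_keys = df.groupby("a", group_keys=False)["b"].apply(lambda s: s.head(1))

print(red_in_index.index.name)        # name of the reduction's index
print(list(red_in_columns.columns))   # columns when as_index=False
print(with_keys.index.nlevels)        # levels in the index with group_keys=True
print(list(without_keys.index))       # original row labels with group_keys=False
```

Neither argument has any effect on the other kind of operation, which is the restriction the proposal is aimed at.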

I propose we

  • Deprecate both as_index and group_keys
  • Add keys_axis to both DataFrame.groupby and DataFrameGroupBy.apply. Both accept the same values; the only difference is that a value passed to DataFrameGroupBy.apply, if specified, overrides the value given to DataFrame.groupby.

keys_axis can accept the following values:

  • "infer" (the default): One of the following behaviors, inferred from the computation depending on whether it is a reduction, transform, or filter.
  • "index" or 0: Add the keys to the index (similar to as_index=True or group_keys=True)
  • "columns" or 1: Add the keys to the columns (similar to as_index=False)
  • "none": Add the keys to neither the index nor the columns. For pandas methods (e.g. sum, cumsum, head), reductions will return a RangeIndex, while transforms and filters will behave as they do today, returning the input's index or, for a filter, a subset of it. For apply, this will behave the same as group_keys=False today.

Unlike as_index, this argument will be respected in all groupby functions whether they be reductions, transforms, or filters.
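
For a reduction, the three explicit values can be approximated with what pandas offers today. The helper below is purely hypothetical (keys_axis does not exist in pandas, and `sum_with_keys_axis` is a name invented for this sketch); it only illustrates the intended output shapes and leaves "infer" out.

```python
import pandas as pd

def sum_with_keys_axis(df, by, keys_axis):
    """Hypothetical emulation of the proposed keys_axis for a sum reduction."""
    if keys_axis in ("index", 0):
        # Keys placed in the index, as with as_index=True.
        return df.groupby(by, as_index=True).sum()
    if keys_axis in ("columns", 1):
        # Keys placed in the columns, as with as_index=False.
        return df.groupby(by, as_index=False).sum()
    if keys_axis == "none":
        # Keys dropped entirely; the result gets a fresh RangeIndex.
        return df.groupby(by).sum().reset_index(drop=True)
    raise ValueError(f"unsupported keys_axis in this sketch: {keys_axis!r}")

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
by_index = sum_with_keys_axis(df, "a", "index")
by_columns = sum_with_keys_axis(df, "a", "columns")
no_keys = sum_with_keys_axis(df, "a", "none")
print(by_index)
print(by_columns)
print(no_keys)
```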

Path to implementation:

  • Add keys_axis in 2.0, and either add a PendingDeprecationWarning or a DeprecationWarning to as_index / group_keys
  • Change warnings for as_index / group_keys to a FutureWarning in 2.1
  • Enforce deprecations in 3.0
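
The first step of that path could look roughly like the shim below. This is only a sketch of the general pattern, not pandas code: `resolve_keys_axis`, the `_UNSET` sentinel, and the warning text are all invented here, and only as_index is handled.

```python
import warnings

_UNSET = object()  # hypothetical sentinel meaning "argument not passed"

def resolve_keys_axis(keys_axis="infer", as_index=_UNSET):
    """Hypothetical shim: map the old boolean onto the proposed keys_axis."""
    if as_index is _UNSET:
        # The old argument was not used, so the new one applies as given.
        return keys_axis
    warnings.warn(
        "as_index is deprecated; use keys_axis instead",
        DeprecationWarning,
        stacklevel=2,
    )
    # as_index=True puts keys in the index, as_index=False in the columns.
    return "index" if as_index else "columns"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    resolved_old = resolve_keys_axis(as_index=False)
    resolved_new = resolve_keys_axis(keys_axis="none")

print(resolved_old, resolved_new, len(caught))
```

Switching the category to FutureWarning in a later release would only change the class passed to `warnings.warn`.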

A few natural questions come to mind:

  1. Why introduce a new argument, why not keep either as_index or group_keys?

Currently these arguments are Boolean, whereas the new argument needs to accept more than two values, and its name should reflect that it accepts an axis. Also, adding a new argument provides a cleaner and more gradual path for deprecation.

  2. Why add keys_axis to DataFrameGroupBy.apply?

In other groupby methods, we can reliably use keys_axis="infer" to determine the correct placement of the keys. However, in apply, the placement is inferred from the output, and various cases can coincide, e.g. a reduction and a transformation on a DataFrame with a single row. We want the user to be able to use "infer" on other groupby methods, but still be able to specify how their UDF in apply acts. E.g.

gb = df.groupby(["a", "b"], keys_axis="infer")
print(gb.sum())  # Act as a reduction
print(gb.head())  # Act as a filter
print(gb.cumsum())  # Act as a transform
print(gb.apply(my_udf, keys_axis="index"))  # infer from the groupby call is not reliable here, allow user to specify how apply should act

  3. Why should keys_axis accept the value "none"?

This is currently how transforms and filters work: the keys are added to neither the index nor the columns. We need to keep the ability to tell groupby(...).apply that the UDF the user provides acts as a transform or filter.
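
For apply, the "index" and "none" values correspond to what group_keys already does. The sketch below emulates that mapping with a hypothetical helper (`apply_with_keys_axis` is not a pandas function) and a transform-like UDF; it assumes a pandas version in which an explicit group_keys=True is respected for transform-like UDFs.

```python
import pandas as pd

def apply_with_keys_axis(df, by, column, udf, keys_axis):
    """Hypothetical emulation of keys_axis for apply via today's group_keys."""
    if keys_axis == "index":
        group_keys = True   # prepend the group keys to the result's index
    elif keys_axis == "none":
        group_keys = False  # keep only the index produced by the UDF
    else:
        raise ValueError(f"unsupported keys_axis in this sketch: {keys_axis!r}")
    return df.groupby(by, group_keys=group_keys)[column].apply(udf)

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})

def running_total(s):
    # Transform-like UDF: the output has the same index as the input group.
    return s.cumsum()

keyed = apply_with_keys_axis(df, "a", "b", running_total, "index")
unkeyed = apply_with_keys_axis(df, "a", "b", running_total, "none")
print(keyed)
print(unkeyed)
```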

  4. Why not name the argument group_keys_axis?

I find "group" here redundant, but would be fine with this name too, and happy to consider other potential names.

cc @pandas-dev/pandas-core @pandas-dev/pandas-triage
