Description
This topic was discussed in the dev meeting on 2024-04-10. I believe I captured all ideas expressed there.
Summary: DataFrameGroupBy.agg will sometimes pass the UDF each column of the original DataFrame as a Series ("by-column"), and sometimes pass the UDF the entire DataFrame ("by-frame"). This causes issues for users. Either we should only operate by-column, or allow users to control whether the UDF they provide is passed a Series or DataFrame. If we do enable users to control this, we need to decide on what the right behavior is in the by-frame case.
Relevant code:
pandas/pandas/core/groupby/generic.py
Lines 1546 to 1569 in b4493b6
Logic: In the case where the user passes a callable (UDF), pandas will operate by-frame when:
1. There is one grouping (e.g. `.groupby("a")`, not `.groupby(["a", "b"])`) and the user is passing `args` or `kwargs` through to the UDF; or
2. There is one grouping and attempting `.agg([func])` raises a ValueError with "No objects to concatenate" in the message.
In all other cases, pandas operates by-column. The only place (2) is currently hit in the test suite is when aggregating an empty DataFrame; I do not see how else it might be possible to hit (2).
Impact: When a user provides a UDF that can operate either by-column or by-frame but does not necessarily produce the same result in both cases, whether they have a single grouping and/or pass an arg/kwarg through to the UDF can change the result. This is #39169.
This seems to me like a particularly bad behavior in DataFrameGroupBy.agg, and I'd like to resolve this bug.
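To make the ambiguity concrete, here is a minimal sketch (deliberately not going through `agg` itself, so it is version-independent) of a UDF that is valid under both calling conventions yet produces different results:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5], "c": [6, 7, 8]})

# A UDF that is valid both by-column and by-frame, but gives different answers:
# Series.sum() returns a scalar, while DataFrame.sum().sum() collapses all columns.
def func(x):
    return x.sum() if isinstance(x, pd.Series) else x.sum().sum()

group1 = df[df["a"] == 1][["b", "c"]]  # the a == 1 group, minus the grouping column

by_column = {col: func(group1[col]) for col in group1.columns}  # {'b': 7, 'c': 13}
by_frame = func(group1)                                         # 20
```

Whether the user ends up with the per-column sums or the single collapsed total depends only on which code path `agg` happens to take.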
Option A: Only support by-column in groupby(...).agg
Pros: Most straightforward.
Cons: Removes the ability for users to use agg when they want the UDF to operate on multiple columns (though that ability isn't very discoverable anyway).
Option B: Add an argument
Users would communicate what they want the UDF to accept by passing an argument.
df.groupby(...).agg(func, by="frame")
# or
df.groupby(...).agg(func, by="column")
Pros: Enables users to have their UDF accept a frame when they so desire.
Cons: We need to decide on behavior in the case of by="frame" (see below).
Option C-a: Required Decorator
Users would communicate what they want the UDF to accept by adding a decorator.
@pd.agg_decorator(by="frame")
def func(...):
...
# or
@pd.agg_decorator(by="column")
def func(...):
...
Pros: Enables users to have their UDF accept a frame when they so desire.
Cons:
- Would require users to use a decorator.
- Would be awkward to use with lambdas.
- We need to decide on behavior in the case of by="frame" (see below).
Option C-b: Optional Decorator
This is somewhat of a combination of Option A and Option C-a. Without a decorator, we would always operate by-column. If the decorator is provided, we could operate by-frame.
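A minimal sketch of what such a decorator could look like. Note that `agg_decorator` and the `_agg_by` attribute are hypothetical names used here for illustration, not existing pandas API:

```python
import pandas as pd

# Hypothetical: pd.agg_decorator does not exist; this sketches one possible shape.
def agg_decorator(by="column"):
    def wrap(func):
        func._agg_by = by  # agg could inspect this attribute to pick a code path
        return func
    return wrap

@agg_decorator(by="frame")
def total(df):
    # Receives the whole group DataFrame and returns one scalar per group
    return df.sum().sum()

# Under Option C-b, undecorated UDFs (including lambdas) would default to by-column.
```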
If we do decide to support by-frame
Then we need to decide on its behavior. Currently _aggregate_frame builds a dictionary, roughly as
{group: func(group_df, *args, **kwargs) for group, group_df in self}
passes this dictionary to the DataFrame constructor, and then takes a transpose.
pandas/pandas/core/groupby/generic.py
Lines 1614 to 1621 in b4493b6
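Roughly, that construction can be emulated as follows (a sketch of the dictionary-plus-transpose described above, not the actual implementation):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
gb = df.groupby("a")

def func(g):
    return g["b"].sum()  # one scalar per group

# Build {group label: UDF result}, hand it to the DataFrame constructor, transpose.
result = {name: func(group) for name, group in gb}        # {1: 7, 2: 5}
out = pd.DataFrame(result, index=df.columns.drop("a")).T  # index [1, 2], column "b"
```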
This behavior works well when the UDF returns scalars, but not when it returns iterable objects. E.g.
df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
gb = df.groupby("a")
def func1(x, y):
return [1, 2, 3]
print(gb.agg(func1, y=3))
# ValueError: Length of values (3) does not match length of index (1)
def func2(x, y):
return [1]
print(gb.agg(func2, y=3))
# b
# a
# 1 1
# 2 1
def func3(x, y):
return pd.Series([1, 2, 3])
print(gb.agg(func3, y=3))
# b
# a
# 1 NaN
# 2 NaN
Option I: Treat all returns as if they are scalars.
df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
gb = df.groupby("a")
gb.agg(lambda x: pd.Series({"x": 5, "y": 4}))
# The result is a Series with index [1, 2] and the elements are themselves Series
# 1 x 5
# y 4
# dtype: int64
# 2 x 5
# y 4
# dtype: int64
# dtype: object
Pros: Simple for users to understand and us to implement.
Cons: Does not allow a performance enhancement of operating on a DataFrame and returning a Series (e.g. lambda x: x.sum() where x is a DataFrame)
Option II: Add an argument.
Users would communicate what they want the UDF return interpreted as via an argument.
df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
gb = df.groupby("a")
gb.agg(lambda x: pd.Series({"x": 5, "y": 4}), by="frame", returns="row")
# x y
# 0 5 4
# 1 5 4
df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
gb = df.groupby("a")
gb.agg(lambda x: pd.Series({"x": 5, "y": 4}), by="frame", returns="scalar")
# The result is a Series with index [1, 2] and the elements are themselves Series
# 1 x 5
# y 4
# dtype: int64
# 2 x 5
# y 4
# dtype: int64
# dtype: object
Pros: Explicit user control.
Cons: Has an additional argument.
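The two proposed `returns` modes could be emulated today by combining the per-group results manually. This is a sketch of the proposed semantics only; the `by`/`returns` arguments themselves do not exist:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
gb = df.groupby("a")

func = lambda g: pd.Series({"x": 5, "y": 4})

# returns="row": each group's Series becomes one row of the result
as_rows = pd.DataFrame([func(group) for _, group in gb])

# returns="scalar": each group's Series is stored as a single object element
as_scalars = pd.Series({name: func(group) for name, group in gb})
```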
Option IIIa: A Series return is interpreted as a row to stack vertically; all other returns are treated as scalars.
df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
gb = df.groupby("a")
gb.agg(lambda x: pd.Series({"x": 5, "y": 4}), by="frame")
# x y
# 0 5 4
# 1 5 4
Pros: No additional argument / features needed.
Cons: Somewhat magical; users cannot have pandas treat a Series as if it were a scalar.
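The dispatch under Option IIIa might look roughly like this (`combine_results` is a hypothetical helper for illustration, not pandas code):

```python
import pandas as pd

def combine_results(results):
    # results: {group label: UDF return value}
    # Option IIIa: if every return is a Series, stack them vertically as rows;
    # otherwise, treat every return as a scalar.
    if all(isinstance(v, pd.Series) for v in results.values()):
        return pd.DataFrame(results).T
    return pd.Series(results)

rows = combine_results({1: pd.Series({"x": 5, "y": 4}), 2: pd.Series({"x": 5, "y": 4})})
scalars = combine_results({1: 7, 2: 5})
```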
Option IIIb: A Series return is interpreted as a row to stack vertically unless boxed with pd.Scalar; all other returns are treated as scalars.
df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
gb = df.groupby("a")
gb.agg(lambda x: pd.Series({"x": 5, "y": 4}), by="frame")
# x y
# 0 5 4
# 1 5 4
Boxing the UDF's result with pd.Scalar tells pandas to treat it like a scalar. Note that pd.Scalar does not currently exist.
df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
gb = df.groupby("a")
gb.agg(lambda x: pd.Scalar(pd.Series({"x": 5, "y": 4})), by="frame")
# The result is a Series with index [1, 2] and the elements are themselves Series
# 1 x 5
# y 4
# dtype: int64
# 2 x 5
# y 4
# dtype: int64
# dtype: object
Pros: No additional argument needed.
Cons: Somewhat magical; need to add pd.Scalar.
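A sketch of what the opt-out box and its dispatch could look like (pd.Scalar does not exist; `Scalar` and `combine_results` here are hypothetical stand-ins):

```python
import pandas as pd

class Scalar:
    # Hypothetical box: wraps a value that should be treated as a scalar,
    # even if that value is a Series.
    def __init__(self, value):
        self.value = value

def combine_results(results):
    # Option IIIb dispatch: unboxed Series stack as rows; boxed values are scalars.
    if all(isinstance(v, pd.Series) for v in results.values()):
        return pd.DataFrame(results).T
    return pd.Series(
        {k: v.value if isinstance(v, Scalar) else v for k, v in results.items()}
    )

stacked = combine_results({1: pd.Series({"x": 5}), 2: pd.Series({"x": 5})})
boxed = combine_results({1: Scalar(pd.Series({"x": 5})), 2: Scalar(pd.Series({"x": 5}))})
```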