
ENH: Should we support aggregating by-frame in DataFrameGroupBy.agg #58225

@rhshadrach

Description

This topic was discussed in the dev meeting on 2024-04-10. I believe I captured all ideas expressed there.

Summary: DataFrameGroupBy.agg will sometimes pass the UDF each column of the original DataFrame as a Series ("by-column"), and sometimes pass the UDF the entire DataFrame ("by-frame"). This causes issues for users. Either we should only operate by-column, or allow users to control whether the UDF they provide is passed a Series or DataFrame. If we do enable users to control this, we need to decide on what the right behavior is in the by-frame case.
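For concreteness, here is a minimal sketch of the inconsistency (made-up data; assumes current behavior): with a bare UDF and a single grouping, the UDF receives each column as a Series, while passing a kwarg through to the same UDF routes it an entire group DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5], "c": [6, 7, 8]})

seen = []

def udf(x, **kwargs):
    # Record what pandas actually handed the UDF.
    seen.append(type(x).__name__)
    return 1

# Single grouping, no extra args/kwargs: by-column (x is a Series).
df.groupby("a").agg(udf)

# Single grouping with a passthrough kwarg: by-frame (x is a DataFrame).
df.groupby("a").agg(udf, y=0)

print(sorted(set(seen)))
```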

Relevant code:

if self._grouper.nkeys > 1:
    # test_groupby_as_index_series_scalar gets here with 'not self.as_index'
    return self._python_agg_general(func, *args, **kwargs)
elif args or kwargs:
    # test_pass_args_kwargs gets here (with and without as_index)
    # can't return early
    result = self._aggregate_frame(func, *args, **kwargs)
else:
    # try to treat as if we are passing a list
    gba = GroupByApply(self, [func], args=(), kwargs={})
    try:
        result = gba.agg()
    except ValueError as err:
        if "No objects to concatenate" not in str(err):
            raise
        # _aggregate_frame can fail with e.g. func=Series.mode,
        # where it expects 1D values but would be getting 2D values
        # In other tests, using aggregate_frame instead of GroupByApply
        # would give correct values but incorrect dtypes
        # object vs float64 in test_cython_agg_empty_buckets
        # float64 vs int64 in test_category_order_apply
        result = self._aggregate_frame(func)

Logic: In the case where the user passes a callable (UDF), pandas will operate by-frame when:

  1. There is one grouping (e.g. .groupby("a") and not .groupby(["a", "b"])) and the user is passing args or kwargs through to the UDF; or
  2. There is one grouping and attempting .agg([func]) raises a ValueError with "No objects to concatenate" in the message

In all other cases, pandas operates by-column. The only place (2) is currently hit in the test suite is when aggregating an empty DataFrame. I do not see how else it might be possible to hit (2).
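As an illustration of (2) (a sketch, assuming current behavior and made-up data): grouping an empty DataFrame makes the internal `.agg([func])` attempt raise "No objects to concatenate", so pandas silently falls back to `_aggregate_frame`:

```python
import pandas as pd

df = pd.DataFrame({"a": [], "b": []})
gb = df.groupby("a")

# With no groups, the list-style path has nothing to concatenate,
# so pandas falls back to the by-frame _aggregate_frame path.
res = gb.agg(lambda x: x.sum())
print(res)
print(res.empty)
```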

Impact: When a user provides a UDF that can operate either by-column or by-frame, but does not necessarily produce the same result in both cases, whether they have a single grouping and/or provide an arg/kwarg to pass through can change the result. This is #39169.

This seems to me like a particularly bad behavior in DataFrameGroupBy.agg, and I'd like to resolve this bug.

Option A: Only support by-column in groupby(...).agg

Pros: Most straightforward.
Cons: Removes the ability for users to use agg when they want the UDF to use multiple columns (which isn't very accessible anyway).
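For comparison (a sketch, with made-up data): under Option A, a UDF that needs several columns at once could still be written with `.apply`, which always hands the UDF the whole group DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5], "c": [6, 7, 8]})

# .apply always receives the group as a DataFrame, so a UDF that needs
# several columns at once still works under a by-column-only agg.
out = df.groupby("a").apply(lambda g: (g["b"] * g["c"]).sum())
print(out)
```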

Option B: Add an argument

Users would communicate what they want the UDF to accept by passing an argument.

df.groupby(...).agg(func, by="frame")

# or

df.groupby(...).agg(func, by="column")

Pros: Enables users to have their UDF accept a frame when they desire.
Cons: We need to decide on behavior in the case of by="frame" (see below).

Option C-a: Required Decorator

Users would communicate what they want the UDF to accept by adding a decorator.

@pd.agg_decorator(by="frame")
def func(...):
    ...

# or

@pd.agg_decorator(by="column")
def func(...):
    ...

Pros: Enables users to have their UDF accept a frame when they desire.
Cons:

  • Would require users to use a decorator.
  • Would be awkward to use with lambdas.
  • We need to decide on behavior in the case of by="frame" (see below).

Option C-b: Optional Decorator

This is somewhat of a combination of Option A and Option C-a. Without a decorator, we would always operate by-column. If the decorator is provided, we could operate by-frame.

If we do decide to support by-frame

Then we need to decide on its behavior. Currently _aggregate_frame will create a dictionary, roughly as

{group: func(group_df, *args, **kwargs) for group, group_df in self}

pass this dictionary to the DataFrame constructor, and then take a transpose.

result: dict[Hashable, NDFrame | np.ndarray] = {}
for name, grp_df in self._grouper.get_iterator(obj):
    fres = func(grp_df, *args, **kwargs)
    result[name] = fres
result_index = self._grouper.result_index
out = self.obj._constructor(result, index=obj.columns, columns=result_index)
out = out.T
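A standalone sketch of that dict-then-transpose construction (made-up data; the UDF returns a scalar per group, which is the case this construction handles well):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
func = lambda g: g["b"].sum()

# Roughly what _aggregate_frame does: one dict entry per group ...
result = {name: func(grp_df) for name, grp_df in df.groupby("a")}

# ... then build a frame indexed by the non-grouped columns and transpose.
out = pd.DataFrame(result, index=["b"]).T
print(out)
```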

This behavior works well when the UDF returns scalars, but not when it returns iterable objects. E.g.

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
gb = df.groupby("a")


def func1(x, y):
    return [1, 2, 3]

print(gb.agg(func1, y=3))
# ValueError: Length of values (3) does not match length of index (1)

def func2(x, y):
    return [1]
print(gb.agg(func2, y=3))
#    b
# a   
# 1  1
# 2  1

def func3(x, y):
    return pd.Series([1, 2, 3])
print(gb.agg(func3, y=3))
#     b
# a    
# 1 NaN
# 2 NaN

Option I: Treat all returns as if they are scalars.

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
gb = df.groupby("a")
gb.agg(lambda x: pd.Series({"x": 5, "y": 4}))
# The result is a Series with index [1, 2] and the elements are themselves Series
# 1    x    5
# y    4
# dtype: int64
# 2    x    5
# y    4
# dtype: int64
# dtype: object

Pros: Simple for users to understand and us to implement.
Cons: Does not allow performance enhancement by operating on a DataFrame and returning a Series (e.g. lambda x: x.sum() where x is a DataFrame)

Option II: Add an argument.

Users would communicate how they want the UDF's return value interpreted via an argument.

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
gb = df.groupby("a")
gb.agg(lambda x: pd.Series({"x": 5, "y": 4}), by="frame", returns="row")
#    x  y
# 0  5  4
# 1  5  4
df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
gb = df.groupby("a")
gb.agg(lambda x: pd.Series({"x": 5, "y": 4}), by="frame", returns="scalar")
# The result is a Series with index [1, 2] and the elements are themselves Series
# 1    x    5
# y    4
# dtype: int64
# 2    x    5
# y    4
# dtype: int64
# dtype: object

Pros: Explicit user control.
Cons: Has an additional argument.

Option IIIa: A Series is interpreted as stacking vertically; all other returns are treated as scalars.

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
gb = df.groupby("a")
gb.agg(lambda x: pd.Series({"x": 5, "y": 4}), by="frame")
#    x  y
# 0  5  4
# 1  5  4

Pros: No additional argument / features needed.
Cons: Somewhat magical, users cannot have pandas treat Series as if it were a scalar.

Option IIIb: A Series is interpreted as stacking vertically unless boxed with pd.Scalar; all other returns are treated as scalars.

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
gb = df.groupby("a")
gb.agg(lambda x: pd.Series({"x": 5, "y": 4}), by="frame")
#    x  y
# 0  5  4
# 1  5  4

Boxing the UDF's result with pd.Scalar tells pandas to treat it like a scalar. Note that pd.Scalar does not currently exist.

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
gb = df.groupby("a")
gb.agg(lambda x: pd.Scalar(pd.Series({"x": 5, "y": 4})), by="frame")
# The result is a Series with index [1, 2] and the elements are themselves Series
# 1    x    5
# y    4
# dtype: int64
# 2    x    5
# y    4
# dtype: int64
# dtype: object

Pros: No additional argument needed.
Cons: Somewhat magical, need to add pd.Scalar.

cc @jorisvandenbossche @jbrockmendel @Dr-Irv

Metadata

Labels: Apply (Apply, Aggregate, Transform, Map), Enhancement, Groupby, Needs Discussion (requires discussion from core team before further action)
