Description
This topic was discussed in the dev meeting on 2024-04-10. I believe I captured all ideas expressed there.
Summary: DataFrameGroupBy.agg will sometimes pass the UDF each column of the original DataFrame as a Series ("by-column"), and sometimes pass the UDF the entire DataFrame ("by-frame"). This causes issues for users. Either we should only operate by-column, or allow users to control whether the UDF they provide is passed a Series or DataFrame. If we do enable users to control this, we need to decide on what the right behavior is in the by-frame case.
Relevant code:
pandas/pandas/core/groupby/generic.py
Lines 1546 to 1569 in b4493b6
Logic: In the case where the user passes a callable (UDF), pandas will operate by-frame when:
1. There is one grouping (e.g. `.groupby("a")`, not `.groupby(["a", "b"])`) and the user is passing `args` or `kwargs` through to the UDF; or
2. There is one grouping and attempting `.agg([func])` raises a ValueError with "No objects to concatenate" in the message.
In all other cases, pandas operates by-column. The only place (2) is currently hit in the test suite is when aggregating an empty DataFrame; I do not see how else it might be possible to hit (2).
Impact: When a user provides a UDF that can operate either by-column or by-frame but does not necessarily produce the same result in both cases, whether they have a single grouping and/or pass an arg/kwarg through to the UDF can change the result. This is #39169.
This seems to me like a particularly bad behavior in DataFrameGroupBy.agg, and I'd like to resolve this bug.
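To make the ambiguity concrete, here is a minimal sketch (deliberately not going through `agg` itself, so it is version-independent) of a UDF that is valid under both calling conventions yet produces different results:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5], "c": [6, 7, 8]})

# A UDF that is valid both by-column and by-frame, but gives different answers:
# Series.sum() returns a scalar, while DataFrame.sum().sum() collapses all columns.
def func(x):
    return x.sum() if isinstance(x, pd.Series) else x.sum().sum()

group1 = df[df["a"] == 1][["b", "c"]]  # the a == 1 group, minus the grouping column

by_column = {col: func(group1[col]) for col in group1.columns}  # {'b': 7, 'c': 13}
by_frame = func(group1)                                         # 20
```

Whether the user ends up with the per-column sums or the single collapsed total depends only on which code path `agg` happens to take.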
Option A: Only support by-column in groupby(...).agg
Pros: Most straightforward.
Cons: Removes the ability for users to use agg when they want the UDF to operate on multiple columns (though that ability isn't very discoverable anyway).
Option B: Add an argument
Users would communicate what they want the UDF to accept by passing an argument.
df.groupby(...).agg(func, by="frame")
# or
df.groupby(...).agg(func, by="column")
Pros: Enables users to have their UDF accept a frame when they so desire.
Cons: We need to decide on behavior in the case of by="frame" (see below).
Option C-a: Required Decorator
Users would communicate what they want the UDF to accept by adding a decorator.
@pd.agg_decorator(by="frame")
def func(...):
...
# or
@pd.agg_decorator(by="column")
def func(...):
...
Pros: Enables users to have their UDF accept a frame when they so desire.
Cons:
- Would require users to use a decorator.
- Would be awkward to use with lambdas.
- We need to decide on behavior in the case of by="frame" (see below).
Option C-b: Optional Decorator
This is somewhat of a combination of Option A and Option C-a. Without a decorator, we would always operate by-column. If the decorator is provided, we could operate by-frame.
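A minimal sketch of what such a decorator could look like. Note that `agg_decorator` and the `_agg_by` attribute are hypothetical names used here for illustration, not existing pandas API:

```python
import pandas as pd

# Hypothetical: pd.agg_decorator does not exist; this sketches one possible shape.
def agg_decorator(by="column"):
    def wrap(func):
        func._agg_by = by  # agg could inspect this attribute to pick a code path
        return func
    return wrap

@agg_decorator(by="frame")
def total(df):
    # Receives the whole group DataFrame and returns one scalar per group
    return df.sum().sum()

# Under Option C-b, undecorated UDFs (including lambdas) would default to by-column.
```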
If we do decide to support by-frame
Then we need to decide on its behavior. Currently _aggregate_frame builds a dictionary, roughly as
{group: func(group_df, *args, **kwargs) for group, group_df in self}
passes this dictionary to the DataFrame constructor, and then takes a transpose.
pandas/pandas/core/groupby/generic.py
Lines 1614 to 1621 in b4493b6
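Roughly, that construction can be emulated as follows (a sketch of the dictionary-plus-transpose described above, not the actual implementation):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
gb = df.groupby("a")

def func(g):
    return g["b"].sum()  # one scalar per group

# Build {group label: UDF result}, hand it to the DataFrame constructor, transpose.
result = {name: func(group) for name, group in gb}        # {1: 7, 2: 5}
out = pd.DataFrame(result, index=df.columns.drop("a")).T  # index [1, 2], column "b"
```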
This behavior works well when the UDF returns scalars, but not when it returns iterable objects. E.g.
df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
gb = df.groupby("a")
def func1(x, y):
return [1, 2, 3]
print(gb.agg(func1, y=3))
# ValueError: Length of values (3) does not match length of index (1)
def func2(x, y):
return [1]
print(gb.agg(func2, y=3))
# b
# a
# 1 1
# 2 1
def func3(x, y):
return pd.Series([1, 2, 3])
print(gb.agg(func3, y=3))
# b
# a
# 1 NaN
# 2 NaN
Option I: Treat all returns as if they are scalars.
df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
gb = df.groupby("a")
gb.agg(lambda x: pd.Series({"x": 5, "y": 4}))
# The result is a Series with index [1, 2] and the elements are themselves Series
# 1 x 5
# y 4
# dtype: int64
# 2 x 5
# y 4
# dtype: int64
# dtype: object
Pros: Simple for users to understand and us to implement.
Cons: Does not allow a performance enhancement of operating on a DataFrame and returning a Series (e.g. lambda x: x.sum() where x is a DataFrame)
Option II: Add an argument.
Users would communicate what they want the UDF return interpreted as via an argument.
df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
gb = df.groupby("a")
gb.agg(lambda x: pd.Series({"x": 5, "y": 4}), by="frame", returns="row")
# x y
# 0 5 4
# 1 5 4
df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
gb = df.groupby("a")
gb.agg(lambda x: pd.Series({"x": 5, "y": 4}), by="frame", returns="scalar")
# The result is a Series with index [1, 2] and the elements are themselves Series
# 1 x 5
# y 4
# dtype: int64
# 2 x 5
# y 4
# dtype: int64
# dtype: object
Pros: Explicit user control.
Cons: Has an additional argument.
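The two proposed `returns` modes could be emulated today by combining the per-group results manually. This is a sketch of the proposed semantics only; the `by`/`returns` arguments themselves do not exist:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
gb = df.groupby("a")

func = lambda g: pd.Series({"x": 5, "y": 4})

# returns="row": each group's Series becomes one row of the result
as_rows = pd.DataFrame([func(group) for _, group in gb])

# returns="scalar": each group's Series is stored as a single object element
as_scalars = pd.Series({name: func(group) for name, group in gb})
```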
Option IIIa: A Series return is interpreted as a row to stack vertically; all other returns are treated as scalars.
df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
gb = df.groupby("a")
gb.agg(lambda x: pd.Series({"x": 5, "y": 4}), by="frame")
# x y
# 0 5 4
# 1 5 4
Pros: No additional argument / features needed.
Cons: Somewhat magical; users cannot have pandas treat a Series as if it were a scalar.
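The dispatch under Option IIIa might look roughly like this (`combine_results` is a hypothetical helper for illustration, not pandas code):

```python
import pandas as pd

def combine_results(results):
    # results: {group label: UDF return value}
    # Option IIIa: if every return is a Series, stack them vertically as rows;
    # otherwise, treat every return as a scalar.
    if all(isinstance(v, pd.Series) for v in results.values()):
        return pd.DataFrame(results).T
    return pd.Series(results)

rows = combine_results({1: pd.Series({"x": 5, "y": 4}), 2: pd.Series({"x": 5, "y": 4})})
scalars = combine_results({1: 7, 2: 5})
```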
Option IIIb: A Series return is interpreted as a row to stack vertically unless boxed with pd.Scalar; all other returns are treated as scalars.
df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
gb = df.groupby("a")
gb.agg(lambda x: pd.Series({"x": 5, "y": 4}), by="frame")
# x y
# 0 5 4
# 1 5 4
Boxing the UDF's result with pd.Scalar tells pandas to treat it like a scalar. Note that pd.Scalar does not currently exist.
df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
gb = df.groupby("a")
gb.agg(lambda x: pd.Scalar(pd.Series({"x": 5, "y": 4})), by="frame")
# The result is a Series with index [1, 2] and the elements are themselves Series
# 1 x 5
# y 4
# dtype: int64
# 2 x 5
# y 4
# dtype: int64
# dtype: object
Pros: No additional argument needed.
Cons: Somewhat magical; need to add pd.Scalar.
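A sketch of what the opt-out box and its dispatch could look like (pd.Scalar does not exist; `Scalar` and `combine_results` here are hypothetical stand-ins):

```python
import pandas as pd

class Scalar:
    # Hypothetical box: wraps a value that should be treated as a scalar,
    # even if that value is a Series.
    def __init__(self, value):
        self.value = value

def combine_results(results):
    # Option IIIb dispatch: unboxed Series stack as rows; boxed values are scalars.
    if all(isinstance(v, pd.Series) for v in results.values()):
        return pd.DataFrame(results).T
    return pd.Series(
        {k: v.value if isinstance(v, Scalar) else v for k, v in results.items()}
    )

stacked = combine_results({1: pd.Series({"x": 5}), 2: pd.Series({"x": 5})})
boxed = combine_results({1: Scalar(pd.Series({"x": 5})), 2: Scalar(pd.Series({"x": 5}))})
```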