API: Clarify difference between agg and apply for Series / DataFrame

Currently the use of `Series.apply` / `Series.agg` and `DataFrame.apply` / `DataFrame.agg` is confusing. In particular, sometimes the user calls `apply` and gets the results of `agg` or vice-versa:

 - `apply` with list- or dict-like arguments calls `agg`.
 - `DataFrame.agg` with a UDF calls `DataFrame.apply`.
 - `Series.agg` with a UDF calls `Series.apply`, and if this fails, attempts to pass the Series to the UDF.
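
To make the first two bullets concrete, here is a minimal sketch of the dispatch as described in this issue. It assumes a pandas version where `apply` still routes list-likes through `agg`; later versions may behave differently.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# apply with a list of function names goes down the agg code path,
# so both calls build the same frame (one row per function name).
via_apply = df.apply(["sum", "max"])
via_agg = df.agg(["sum", "max"])
print(via_apply.equals(via_agg))

# DataFrame.agg with a UDF goes down the apply code path,
# so the UDF receives each column and the results match.
def value_range(col):
    return col.max() - col.min()

agg_udf = df.agg(value_range)
apply_udf = df.apply(value_range)
print(agg_udf.equals(apply_udf))
```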

If we are to change the current behavior, it will need to go through deprecation. This will be a bit tricky with the way the code paths switch between `agg` and `apply`, but I believe it can be done (see https://github.com/pandas-dev/pandas/pull/49672#issuecomment-1312775348).
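
As a rough illustration of what a deprecation step could look like from the user's side, the sketch below wraps `DataFrame.apply` in a helper that warns when a list- or dict-like is passed. The helper name and the message are hypothetical and are not taken from the linked PR; the real change would live inside pandas' own code paths.

```python
import warnings

import pandas as pd


def apply_with_warning(df, func):
    """Hypothetical helper (not pandas API): warn before apply dispatches to agg."""
    if isinstance(func, (list, dict)):
        warnings.warn(
            "hypothetical: apply with a list- or dict-like currently behaves "
            "like agg; this may change in a future version",
            FutureWarning,
            stacklevel=2,
        )
    return df.apply(func)


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Record warnings so the deprecation message can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = apply_with_warning(df, ["sum", "max"])

messages = [str(w.message) for w in caught]
print(messages)
```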

To clarify the difference between `agg` and `apply` for users, I propose the following when a single function is passed:

 - (unchanged) `DataFrame.apply` will apply the function to each Series; the result shape will be inferred from the output.
 - (unchanged) `DataFrame.applymap` will apply the function to each cell.
 - (unchanged) `Series.apply` will apply the function to each element (row) of the Series.
 - (changed) `DataFrame.agg` will act on each Series that makes up the DataFrame; the result will always be a Series. Currently the result shape is inferred from the output.
 - (changed) `Series.agg` will act on the Series as a whole, and the result will be whatever the function returns. Currently `apply` is tried first, and only when that fails will `agg` act on the Series.
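
The proposed `DataFrame.agg` semantics can be emulated today with a small helper. This is only a sketch: `agg_as_reducer` is a hypothetical name, not pandas API, and it assumes the default column-wise axis.

```python
import numpy as np
import pandas as pd


def agg_as_reducer(df, func):
    """Hypothetical helper: treat func as a reducer on every column.

    The result is always a Series with one entry per column, whatever
    func returns (scalar, list, or even another Series).
    """
    values = np.empty(len(df.columns), dtype=object)
    for i, col in enumerate(df.columns):
        # Integer indexing stores the returned object as a single element.
        values[i] = func(df[col])
    return pd.Series(values, index=df.columns)


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

sums = agg_as_reducer(df, lambda col: col.sum())
lists = agg_as_reducer(df, list)
nested = agg_as_reducer(df, lambda col: pd.concat([col, col]))
```

In all three cases the result is a Series indexed by column name; only the type of each entry differs.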
 
And when multiple functions are passed:

 - (changed) When given a list-like or dict-like, `agg` will call `agg` for each argument and `apply` will call `apply`. Currently `apply` will call `agg` in this case.
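
A minimal sketch of that distinction for a Series, using two hypothetical helpers (`apply_each` and `agg_each` are illustrative names, and the way results are combined here is an assumption rather than something the proposal specifies):

```python
import pandas as pd


def apply_each(ser, funcs):
    """Hypothetical: call Series.apply once per function, one column per result."""
    return pd.concat(
        {name: ser.apply(func) for name, func in funcs.items()}, axis=1
    )


def agg_each(ser, funcs):
    """Hypothetical: pass the whole Series to each function, one entry per result."""
    return pd.Series({name: func(ser) for name, func in funcs.items()})


ser = pd.Series([1, 2, 3])

# Each function sees one element at a time.
elementwise = apply_each(
    ser, {"double": lambda v: v * 2, "square": lambda v: v ** 2}
)

# Each function sees the whole Series.
reduced = agg_each(
    ser, {"total": lambda s: s.sum(), "largest": lambda s: s.max()}
)
```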

I've put up #49672 to show the implementation and the impact on our tests. Some examples of the proposed behavior:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# For reducers, apply and agg act the same on a DataFrame

print(df.apply(str))
# a    0    1\n1    2\n2    3\nName: a, dtype: int64
# b    0    4\n1    5\n2    6\nName: b, dtype: int64
# dtype: object

print(df.agg(str))
# a    0    1\n1    2\n2    3\nName: a, dtype: int64
# b    0    4\n1    5\n2    6\nName: b, dtype: int64
# dtype: object

# apply sees a Series output as not being a reducer, combines results with `concat(..., axis=1)`
# agg treats everything as a reducer. The result is a Series whose entries are themselves Series.

print(df.apply(lambda x: pd.concat([x, x])))
#    a  b
# 0  1  4
# 1  2  5
# 2  3  6
# 0  1  4
# 1  2  5
# 2  3  6

print(df.agg(lambda x: pd.concat([x, x])))
# a    0    1
# 1    2
# 2    3
# 0    1
# 1    2
# 2    3
# Name...
# b    0    4
# 1    5
# 2    6
# 0    4
# 1    5
# 2    6
# Name...
# dtype: object

# apply sees list output as not being a reducer, makes them into columns of the result (a no-op in this case)
# agg treats everything as a reducer

print(df.apply(lambda x: list(x)))
#    a  b
# 0  1  4
# 1  2  5
# 2  3  6

print(df.agg(lambda x: list(x)))
# a    [1, 2, 3]
# b    [4, 5, 6]
# dtype: object
```

API: Clarify difference between agg and apply for Series / DataFrame #49673
