Description
Currently the use of `Series.apply` / `Series.agg` and `DataFrame.apply` / `DataFrame.agg` is confusing. In particular, sometimes the user calls `apply` and gets the results of `agg` or vice-versa:
- `apply` with list- or dict-like arguments calls `agg`.
- `DataFrame.agg` with a UDF calls `DataFrame.apply`.
- `Series.agg` with a UDF calls `Series.apply`, and if this fails, attempts to pass the Series to the UDF.
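The `Series.agg` fallback in the last bullet can be sketched in plain Python. This is a toy model under stated assumptions, not pandas internals: a plain list stands in for a Series, and `toy_series_apply` and `toy_series_agg_current` are hypothetical names introduced only for illustration.

```python
def toy_series_apply(values, func):
    """Toy stand-in for Series.apply: call func on each element."""
    return [func(v) for v in values]


def toy_series_agg_current(values, func):
    """Toy model of the current behavior described above: try the
    element-wise path first and, only if that raises, pass the whole
    collection to func."""
    try:
        return toy_series_apply(values, func)
    except Exception:
        return func(values)


values = [1, 2, 3]

# sum(1) raises TypeError, so the fallback passes the whole list to sum.
reduced = toy_series_agg_current(values, sum)

# str(1) succeeds, so str is never called on the whole list.
elementwise = toy_series_agg_current(values, str)

print(reduced)
print(elementwise)
```

In this model, which path runs depends only on whether the function happens to raise on a single element, which is the source of the confusion.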
If we are to change the current behavior, it will need to go through deprecation. This will be a bit tricky with the way the code paths switch between `agg` and `apply`, but I believe it can be done (see #49672 (comment)).
In order to clarify the difference between agg and apply for users, I propose the following for a single argument:
- (unchanged) `DataFrame.apply` will apply the function to each Series; the result shape will be inferred from the output.
- (unchanged) `DataFrame.applymap` will apply the function to each cell.
- (unchanged) `Series.apply` will apply the function to each row.
- (changed) `DataFrame.agg` will act on each Series that makes up the DataFrame; the result will always be a Series. Currently the result shape is inferred from the output.
- (changed) `Series.agg` will act on the Series, and the result will be whatever the return is. Currently `apply` is tried first and only when that fails will `agg` act on the Series.
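The difference between "shape inferred from the output" and "always a reduction" can be sketched with a toy model. This is an illustration only, not the pandas implementation: a dict of lists stands in for a DataFrame, and `toy_frame_apply` and `toy_frame_agg` are hypothetical helpers. The rule "list results become columns" is a simplifying assumption of this sketch.

```python
def toy_frame_agg(frame, func):
    """Proposed agg in this toy model: every result is treated as a
    reduction, so the output has exactly one entry per column."""
    data = {name: func(column) for name, column in frame.items()}
    return {"kind": "reduced", "data": data}


def toy_frame_apply(frame, func):
    """apply in this toy model: the output shape is inferred from
    what func returns."""
    data = {name: func(column) for name, column in frame.items()}
    if all(isinstance(result, list) for result in data.values()):
        # List results become the columns of a new frame.
        return {"kind": "frame", "data": data}
    # Anything else is treated as one reduced value per column.
    return {"kind": "reduced", "data": data}


frame = {"a": [1, 2, 3], "b": [4, 5, 6]}

# A true reducer: both paths agree.
apply_sum = toy_frame_apply(frame, sum)
agg_sum = toy_frame_agg(frame, sum)

# A function returning a list: apply infers a frame, agg still reduces.
apply_list = toy_frame_apply(frame, list)
agg_list = toy_frame_agg(frame, list)
```

Under these assumptions the two only diverge when the function returns something that `apply` would interpret as non-scalar.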
And for multiples:
- (changed) When given a list-like or dict-like, `agg` will call `agg` for each argument and `apply` will call `apply`. Currently `apply` will call `agg` in this case.
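The list-like dispatch can be sketched the same way. Again this is a toy model, not pandas code: a plain list stands in for a Series, results are keyed by function name, and every `toy_*` function and `describe` is a hypothetical name used only for illustration.

```python
def toy_agg(values, func):
    """Toy agg: act on the whole collection."""
    return func(values)


def toy_apply(values, func):
    """Toy apply: act on each element."""
    return [func(v) for v in values]


def toy_agg_list(values, funcs):
    """agg with a list calls agg once per function."""
    return {f.__name__: toy_agg(values, f) for f in funcs}


def toy_apply_list_proposed(values, funcs):
    """Proposed: apply with a list calls apply once per function."""
    return {f.__name__: toy_apply(values, f) for f in funcs}


def toy_apply_list_current(values, funcs):
    """Current behavior as described above: apply hands a list off to agg."""
    return toy_agg_list(values, funcs)


def describe(x):
    """Report what kind of object the function received."""
    return f"got {type(x).__name__}"


values = [1, 2]
current = toy_apply_list_current(values, [describe])
proposed = toy_apply_list_proposed(values, [describe])

print(current)   # the function saw the whole list
print(proposed)  # the function saw each element
```

In this sketch the same call, `apply` with a list of functions, passes the whole collection to each function today but would pass individual elements under the proposal.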
I've put up #49672 to show the implementation and the impact on our tests. Some examples:
```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# For reducers, apply and agg act the same on a DataFrame
print(df.apply(str))
# a    0    1\n1    2\n2    3\nName: a, dtype: int64
# b    0    4\n1    5\n2    6\nName: b, dtype: int64
# dtype: object
print(df.agg(str))
# a    0    1\n1    2\n2    3\nName: a, dtype: int64
# b    0    4\n1    5\n2    6\nName: b, dtype: int64
# dtype: object

# apply sees a Series output as not being a reducer, combines results with `concat(..., axis=1)`
# agg treats everything as a reducer. The result is a Series whose entries are themselves Series.
print(df.apply(lambda x: pd.concat([x, x])))
#    a  b
# 0  1  4
# 1  2  5
# 2  3  6
# 0  1  4
# 1  2  5
# 2  3  6
print(df.agg(lambda x: pd.concat([x, x])))
# a    0    1
#      1    2
#      2    3
#      0    1
#      1    2
#      2    3
#      Name...
# b    0    4
#      1    5
#      2    6
#      0    4
#      1    5
#      2    6
#      Name...
# dtype: object

# apply sees list output as not being a reducer, makes them into columns of the result (a no-op in this case)
# agg treats everything as a reducer
print(df.apply(lambda x: list(x)))
#    a  b
# 0  1  4
# 1  2  5
# 2  3  6
print(df.agg(lambda x: list(x)))
# a    [1, 2, 3]
# b    [4, 5, 6]
# dtype: object
```