Description
I've lately been working on making `Series.map` simpler as part of implementing `na_action` on all `ExtensionArray.map` methods. As part of that, I made #52033. That PR (and the current `SeriesApply.apply_standard` more generally) shows very clearly how `Series.apply` and `Series.map` are very similar, but different enough that it is confusing when it's a good idea to use one over the other, and when `Series.apply` in particular is a bad idea to use.
I propose making some changes to how `Series.apply` works when given a single callable. This change is somewhat fundamental, so I understand it may be controversial, but I believe it will be for the better for pandas. I'm of course ready for discussion and possibly (but hopefully not 😄) disagreement. We'll see.
I'll lay out the proposal below. First I'll show the similarities and differences between the two methods, then what I see as the problems with the current API, and then my proposed solution.
Similarities and differences between `Series.apply` and `Series.map`
The main similarity is that both methods fall back to `Series._map_values` and from there use `algorithms.map_array` or `ExtensionArray.map` as relevant.
The differences are many, but each one is relatively minor:

- `Series.apply` has a `convert_dtype` parameter, which `Series.map` doesn't
- `Series.map` has an `na_action` parameter, which `Series.apply` doesn't
- `Series.apply` can take advantage of numpy ufuncs, which `Series.map` can't
- `Series.apply` can take `args` and `**kwargs`, which `Series.map` can't
- `Series.apply` will return a DataFrame if its result is a listlike of Series, which `Series.map` won't
- `Series.apply` is more general and can take a string, e.g. `"sum"`, or lists or dicts of inputs, which `Series.map` can't.

Two of these differences are demonstrated in the sketch below. Also, `Series.apply` is a bit of a parent method of `Series.agg` & `Series.transform`.
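To illustrate the `na_action` and `args` differences from the list above (a sketch assuming `import numpy as np` and `import pandas as pd`):
>>> ser = pd.Series([1, np.nan, 3])
>>> ser.map(lambda x: x + 1, na_action="ignore")  # NaN is propagated without calling the function
0    2.0
1    NaN
2    4.0
dtype: float64
>>> pd.Series([1, 2, 3]).apply(lambda x, y: x + y, args=(10,))  # extra positional args are forwarded
0    11
1    12
2    13
dtype: int64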
The problems
The above similarities and many minor differences make for (IMO) confusing and overly complex rules for when it's a good idea to use `.apply` over `.map`, and vice versa. I will show some examples below.
First some setup:
>>> import numpy as np
>>> import pandas as pd
>>>
>>> small_ser = pd.Series([1, 2, 3])
>>> large_ser = pd.Series(range(100_000))
1: string vs. numpy funcs in `Series.apply`
>>> small_ser.apply("sum")
6
>>> small_ser.apply(np.sum)
0 1
1 2
2 3
dtype: int64
It will surprise new users that these two give different results. Also, anyone using the second pattern is probably making a mistake: `np.sum` is not a ufunc, so `Series.apply` falls back to applying it element-wise, and summing a single element just returns that element.
Note that giving `np.sum` to `DataFrame.apply` aggregates properly:
>>> small_ser.to_frame().apply(np.sum)
0 6
dtype: int64
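The element-wise fallback explains the identity result above: reducing a single element is a no-op, while reducing the whole series aggregates:
>>> np.sum(1)  # what apply(np.sum) does per element
1
>>> np.sum(small_ser)  # what users presumably intended
6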
1.5 Callables vs. list/dict of callables (added 2023-04-07)
>>> small_ser.apply(np.sum)
0 1
1 2
2 3
dtype: int64
>>> small_ser.apply([np.sum])
sum 6
dtype: int64
Also with non-numpy callables:
>>> small_ser.apply(lambda x: x.sum())
AttributeError: 'int' object has no attribute 'sum'
>>> small_ser.apply([lambda x: x.sum()])
<lambda> 6
dtype: int64
In both cases above, the difference is that `Series.apply` operates element-wise if given a callable, but series-wise if given a list/dict of callables.
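My understanding is that the list/dict path dispatches through the agg/transform machinery, so the list example above behaves like the equivalent `Series.agg` call:
>>> small_ser.agg([np.sum])
sum    6
dtype: int64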
2. Functions in `Series.apply` (& `Series.transform`)
The `Series.apply` doc string has examples using lambdas, but lambdas in `Series.apply` are a bad practice because of bad performance:
>>> %timeit large_ser.apply(lambda x: x + 1)
24.1 ms ± 88.8 µs per loop
Currently, `Series` does not have a method that makes a callable operate on a series' data. Instead, users need to use `Series.pipe` for that operation in order for it to be efficient:
>>> %timeit large_ser.pipe(lambda x: x + 1)
44 µs ± 363 ns per loop
(The reason for the above performance difference is that `apply` calls the lambda once per element, while `pipe` calls `x.__add__(1)`, which operates on the whole array.)
Note also that `.pipe` operates on the `Series` while `apply` currently operates on each element in the data, so there are some differences that may have consequences in some cases.
Also notice that `Series.transform` has the same performance problem:
>>> %timeit large_ser.transform(lambda x: x + 1)
24.1 ms ± 88.8 µs per loop
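A rough sketch of why the timings differ (simplified, not pandas' actual implementation):

def func(x):
    return x + 1

# roughly what large_ser.apply(func) and large_ser.transform(func) do today:
# one Python-level call per element
result_slow = pd.Series([func(x) for x in large_ser], index=large_ser.index)

# roughly what large_ser.pipe(func) does: a single call on the whole Series,
# so the addition is vectorized
result_fast = func(large_ser)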
3. ufuncs in `Series.apply` vs. in `Series.map`
Performance-wise, ufuncs are fine in `Series.apply`, but not in `Series.map`:
>>> %timeit large_ser.apply(np.sqrt)
71.6 µs ± 1.17 µs per loop
>>> %timeit large_ser.map(np.sqrt)
63.9 ms ± 69.5 µs per loop
It's difficult for users to understand why one is fast and the other slow (answer: only `apply` works correctly with ufuncs). It is also difficult to understand why ufuncs are fast in `apply`, while other callables are slow in `apply` (answer: ufuncs operate on the whole array, while other callables operate element-wise).
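Concretely, when handed a ufunc, `apply` passes the whole series to it, so it is equivalent to calling the ufunc directly:
>>> np.sqrt(small_ser)  # effectively what small_ser.apply(np.sqrt) does
0    1.000000
1    1.414214
2    1.732051
dtype: float64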
4. callables in `Series.apply` are bad, callables in `SeriesGroupby.apply` are fine
I showed above that using (non-ufunc) callables in `Series.apply` is bad performance-wise. OTOH, using them in `SeriesGroupby.apply` is fine:
>>> %timeit large_ser.apply(lambda x: x + 1)
24.3 ms ± 24 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit large_ser.groupby(large_ser > 50_000).apply(lambda x: x + 1)
11.3 ms ± 185 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Note that most of the time in the groupby example was spent on the groupby ops themselves, so the actual difference in the `apply` op is much larger, and similar to example 2 above. The reason `SeriesGroupby.apply` is fine is that it calls the function once per group on a whole sub-Series, so the lambda is vectorized within each group.
Having callables be ok to use in the `SeriesGroupby.apply` method, but not in `Series.apply`, is confusing IMO.
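One way to see this is that the callable given to `SeriesGroupby.apply` receives whole sub-Series (one per group), not individual elements:
>>> large_ser.groupby(large_ser > 50_000).apply(lambda x: type(x).__name__)
False    Series
True     Series
dtype: object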
5: callables in `Series.apply` that return Series transform the data to a DataFrame
`Series.apply` has an exception: if the callable's results are a listlike of Series, the Series will be concatenated into a DataFrame. This is a very slow operation and hence generally a bad idea:
>>> small_ser.apply(lambda x: pd.Series([x, x+1], index=["a", "b"]))
   a  b
0  1  2
1  2  3
2  3  4
>>> %timeit large_ser.apply(lambda x: pd.Series([x, x+1]))
# timing takes too long to measure
It's probably never a good idea to use this pattern; the `.pipe`-based equivalent, e.g. `large_ser.pipe(lambda x: pd.DataFrame({"a": x, "b": x + 1}))`, will be much faster. If we really do need to operate on single elements in that fashion, it is still possible using `pipe`, e.g. `large_ser.pipe(lambda x: pd.DataFrame({"a": x, "b": x.map(some_func)}))`, and also just directly `pd.DataFrame({"a": large_ser, "b": large_ser.map(some_func)})`.
So giving callables that return `Series` to `Series.apply` is a bad pattern and should be discouraged. (If users really want that pattern, they should build the list of Series themselves and take responsibility for the slowdown.)
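For completeness, the fast `pipe`-based equivalent of the example above:
>>> small_ser.pipe(lambda x: pd.DataFrame({"a": x, "b": x + 1}))
   a  b
0  1  2
1  2  3
2  3  4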
6. `Series.apply` vs. `Series.agg`
The doc string for `Series.agg` says about the method's `func` parameter: "If a function, must ... work when passed ... to Series.apply". But compare these:
>>> small_ser.apply(np.sum)
0 1
1 2
2 3
dtype: int64
>>> small_ser.agg(np.sum)
6
You could argue the doc string is correct (it doesn't raise...), but you could also argue it isn't (because the results are different). I'd personally expect "must work when passed to Series.apply" to mean "gives the same result when passed to `agg` and to `apply`".
7. dictlikes vs. listlikes in `Series.apply` (added 2023-06-04)
Giving a list of transforming arguments to `Series.apply` returns a `DataFrame`:
>>> small_ser.apply(["sqrt", np.abs])
sqrt absolute
0 1.000000 1
1 1.414214 2
2 1.732051 3
But giving a dict of transforming arguments returns a `Series` with a `MultiIndex`:
>>> small_ser.apply({"sqrt" :"sqrt", "abs" : np.abs})
sqrt 0 1.000000
1 1.414214
2 1.732051
abs 0 1.000000
1 2.000000
2 3.000000
dtype: float64
These two should give same-shaped output for consistency. Using `Series.transform` instead of `Series.apply` returns a `DataFrame` in both cases, and I think the dictlike example above should return a `DataFrame` similar to the listlike example.
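For comparison, the `Series.transform` behavior referred to above:
>>> small_ser.transform({"sqrt": "sqrt", "abs": np.abs})
       sqrt  abs
0  1.000000    1
1  1.414214    2
2  1.732051    3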
Minor additional info: listlikes and dictlikes of aggregation arguments do behave the same, so this is only a problem with dictlikes of transforming arguments when using `apply`.
Proposal
With the above in mind, I propose that:

- `Series.apply` takes callables that always operate on the series. I.e. let `series.apply(func)` be similar to `func(series)` + the needed additional functionality.
- `Series.map` takes callables that operate on each element individually. I.e. `series.map(func)` will be similar to the current `series._map_values(func)` + the needed additional functionality.
- The parameter `convert_dtype` will be deprecated in `Series.apply` (already done in DEPR: Deprecate the convert_dtype param in Series.apply #52257).
- A parameter `convert_dtype` will NOT be added to `Series.map` (comment by @rhshadrach).
- The ability in `Series.apply` to convert a `list[Series]` to a DataFrame will be deprecated (already done in DEPR: Deprecate returning a DataFrame in SeriesApply.apply_standard #52123).
- The ability to convert a `list[Series]` to a DataFrame will NOT be added to `Series.map`.
- The changes made to `Series.apply` will propagate to `Series.agg` and `Series.transform`.
The difference between `Series.apply()` & `Series.map()` will then be that:

- `Series.apply()` makes the passed-in callable operate on the series, similarly to how `(DataFrame|SeriesGroupby|DataFrameGroupBy).apply` operate on series. This is very fast and can do almost anything.
- `Series.map()` makes the passed-in callable operate on each of the series' data elements individually. This is very flexible, but can be very slow, so should only be used if `Series.apply` can't do it.
So, IMO, this API change will help make the pandas `Series.(apply|map)` API simpler without losing functionality, and let their functionality be explainable in a simple manner, which would be a win for pandas.
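In other words, the proposed end state would roughly be (hypothetical future behavior, not current pandas):

# hypothetical behavior after the proposal:
large_ser.apply(lambda x: x + 1)  # the callable receives the whole Series -> vectorized and fast
large_ser.map(lambda x: x + 1)    # the callable receives each element -> flexible but slow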
Deprecation process
The cumbersome part of the deprecation process will be changing `Series.apply` to only work array-wise, i.e. to always do `func(series._values)`. This can be done by adding an `array_ops_only` parameter to `Series.apply`, so:
>>> def apply(self, ..., array_ops_only: bool | NoDefault = no_default, ...):
...     if array_ops_only is no_default:
...         warn("....")
...         array_ops_only = False
...     ...
and then in pandas v3.0 change the meaning of that parameter again, so that people can remove it from their code.
The other changes are easier: `convert_dtype` in `Series.apply` will be deprecated just as you normally would for method parameters. The ability to convert a list of Series to a DataFrame will emit a deprecation warning when that code path is encountered.