
API: make the func in Series.apply always operate on the Series #52140


Description

@topper-123

I've lately been working on making Series.map simpler as part of implementing the na_action parameter on all ExtensionArray.map methods. As part of that, I made #52033. That PR (and the current SeriesApply.apply_standard more generally) shows very clearly that Series.apply & Series.map are very similar, but different enough for it to be confusing when it's a good idea to use one over the other, and when Series.apply especially is a bad idea to use.

I propose making some changes to how Series.apply works when given a single callable. This change is somewhat fundamental, so I understand that this can be controversial, but I believe that this change will be for the better for pandas. I'm of course ready for discussion and possibly (but hopefully not 😄 ) disagreement. We'll see.

I'll lay out the proposal below. First I'll show what the similarities and differences are between the two methods, then what the problems with the current API are in my view, and finally my proposed solution.

Similarities and differences between Series.apply and Series.map

The main similarity is that both methods fall back to Series._map_values, which in turn uses algorithms.map_array or ExtensionArray.map as relevant.
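
For example, with a plain (non-ufunc) callable, the two currently give identical results, because both end up in the same elementwise code path:

>>> import pandas as pd
>>> s = pd.Series([1, 2, 3])
>>> s.apply(lambda x: x + 1).equals(s.map(lambda x: x + 1))
True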

The differences are many, but each one is relatively minor:

  1. Series.apply has a convert_dtype parameter, which Series.map doesn't
  2. Series.map has a na_action parameter, which Series.apply doesn't
  3. Series.apply can take advantage of numpy ufuncs, which Series.map can't
  4. Series.apply can take args and **kwargs, which Series.map can't
  5. Series.apply will return a DataFrame if its result is a listlike of Series, which Series.map won't
  6. Series.apply is more general and can take a string, e.g. "sum", or lists or dicts of inputs, which Series.map can't.
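
To illustrate differences 2 and 4 above (a sketch; outputs from a recent pandas version):

>>> pd.Series([1.0, None, 3.0]).map(lambda x: x + 1, na_action="ignore")
0    2.0
1    NaN
2    4.0
dtype: float64
>>> pd.Series([1, 2, 3]).apply(lambda x, y: x + y, args=(10,))
0    11
1    12
2    13
dtype: int64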

Also, Series.apply is a bit of a parent method of Series.agg & Series.transform.

The problems

The above similarities and many minor differences make for (IMO) confusing and overly complex rules for when it's a good idea to use .apply over .map, and vice versa. I will show some examples below.

First some setup:

>>> import numpy as np
>>> import pandas as pd 
>>>
>>> small_ser = pd.Series([1, 2, 3])
>>> large_ser = pd.Series(range(100_000))

1. string vs. numpy funcs in Series.apply

>>> small_ser.apply("sum")
6
>>> small_ser.apply(np.sum)
0    1
1    2
2    3
dtype: int64

It will surprise new users that these two give different results. Also, anyone using the second pattern is probably making a mistake.
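
The reason is that np.sum applied to a single scalar just returns the scalar, so the elementwise call is effectively a no-op:

>>> np.sum(2)
2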

Note that giving np.sum to DataFrame.apply aggregates properly:

>>> small_ser.to_frame().apply(np.sum)
0    6
dtype: int64

1.5 Callables vs. list/dict of callables (added 2023-04-07)

>>> small_ser.apply(np.sum)
0    1
1    2
2    3
dtype: int64
>>> small_ser.apply([np.sum])
sum    6
dtype: int64

Also with non-numpy callables:

>>> small_ser.apply(lambda x: x.sum())
AttributeError: 'int' object has no attribute 'sum'
>>> small_ser.apply([lambda x: x.sum()])
<lambda>    6
dtype: int64

In both cases above, the difference is that Series.apply operates element-wise if given a callable, but series-wise if given a list/dict of callables.

2. Functions in Series.apply (& Series.transform)

The Series.apply doc string has examples using lambdas, but lambdas in Series.apply are a bad practice because of bad performance:

>>> %timeit large_ser.apply(lambda x: x + 1)
24.1 ms ± 88.8 µs per loop

Currently, Series does not have a method that makes a callable operate on the series' data as a whole. Instead, users need to use Series.pipe for the operation to be efficient:

>>> %timeit large_ser.pipe(lambda x: x + 1)
44 µs ± 363 ns per loop

(The reason for the above performance difference is that apply calls the lambda on each single element, while pipe calls x.__add__(1), which operates on the whole array.)
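
Conceptually, the current elementwise path behaves like a Python-level loop; a simplified sketch (not the actual implementation):

>>> func = lambda x: x + 1
>>> result = pd.Series([func(x) for x in large_ser], index=large_ser.index)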

Note also that .pipe operates on the Series while apply currently operates on each element in the data, so there are some differences that may have consequences in some cases.

Also notice that Series.transform has the same performance problems:

>>> %timeit large_ser.transform(lambda x: x + 1)
24.1 ms ± 88.8 µs per loop

3. ufuncs in Series.apply vs. in Series.map

Performance-wise, ufuncs are fine in Series.apply, but not in Series.map:

>>> %timeit large_ser.apply(np.sqrt)
71.6 µs ± 1.17 µs per loop
>>> %timeit large_ser.map(np.sqrt)
63.9 ms ± 69.5 µs per loop

It's difficult for users to understand why one is fast and the other slow (answer: only apply works correctly with ufuncs).

It is also difficult to understand why ufuncs are fast in apply, while other callables are slow in apply (answer: it's because ufuncs operate on the whole array, while other callables operate elementwise).
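
To see the difference: a ufunc processes the whole array in a single vectorized call, e.g.:

>>> np.sqrt(np.array([1, 4, 9]))
array([1., 2., 3.])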

4. callables in Series.apply are bad, callables in SeriesGroupby.apply are fine

I showed above that using (non-ufunc) callables in Series.apply is bad performance-wise. OTOH, using them in SeriesGroupby.apply is fine:

>>> %timeit large_ser.apply(lambda x: x + 1)
24.3 ms ± 24 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit large_ser.groupby(large_ser > 50_000).apply(lambda x: x + 1)
11.3 ms ± 185 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Note that most of the time in the groupby example is spent on the groupby ops themselves, so the actual difference in the apply op is much larger, similar to example 2 above.

Having callables be OK to use in the SeriesGroupby.apply method, but not in Series.apply, is confusing IMO.
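
The reason is that SeriesGroupby.apply hands the callable a whole Series per group, so the callable runs once per group rather than once per element:

>>> large_ser.groupby(large_ser > 50_000).apply(lambda x: type(x).__name__)
False    Series
True     Series
dtype: object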

5. callables in Series.apply that return Series transform data to a DataFrame

Series.apply has an exception: if the callable returns a list-like of Series, the Series will be concatenated into a DataFrame. This is a very slow operation and hence generally a bad idea:

>>> small_ser.apply(lambda x: pd.Series([x, x+1], index=["a", "b"]))
   a  b
0  1  2
1  2  3
2  3  4
>>> %timeit large_ser.apply(lambda x: pd.Series([x, x+1]))
# timing takes too long to measure

It's probably never a good idea to use this pattern; e.g. large_ser.pipe(lambda x: pd.DataFrame({"a": x, "b": x + 1})) will be much faster. If we really do need to operate on single elements in that fashion, it is still possible using pipe, e.g. large_ser.pipe(lambda x: pd.DataFrame({"a": x, "b": x.map(some_func)})), or just directly pd.DataFrame({"a": large_ser, "b": large_ser.map(some_func)}).

So giving callables that return Series to Series.apply is a bad pattern and should be discouraged. (If users really want that pattern, they should build the list of Series themselves and take responsibility for the slowdown.)
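
For completeness, a runnable sketch of the faster array-wise alternative (some_func here is just a stand-in for any elementwise function):

>>> some_func = lambda x: x + 1  # stand-in elementwise function
>>> df = pd.DataFrame({"a": large_ser, "b": large_ser.map(some_func)})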

6. Series.apply vs. Series.agg

The doc string for Series.agg says about the method's func parameter: "If a function, must ... work when passed ... to Series.apply". But compare these:

>>> small_ser.apply(np.sum)
0    1
1    2
2    3
dtype: int64
>>> small_ser.agg(np.sum)
6

You could argue the doc string is correct (it doesn't raise...), but you could also argue it isn't (because the results are different). I'd personally expect "must work when passed to Series.apply" to mean "gives the same result when passed to agg and to apply".

7. dictlikes vs. listlikes in Series.apply (added 2023-06-04)

Giving a list of transforming arguments to Series.apply returns a DataFrame:

>>> small_ser.apply(["sqrt", np.abs])
       sqrt  absolute
0  1.000000         1
1  1.414214         2
2  1.732051         3

But giving a dict of transforming arguments returns a Series with a MultiIndex:

>>> small_ser.apply({"sqrt" :"sqrt", "abs" : np.abs})
sqrt  0    1.000000
      1    1.414214
      2    1.732051
abs   0    1.000000
      1    2.000000
      2    3.000000
dtype: float64

These two should give same-shaped output for consistency. Using Series.transform instead of Series.apply, it returns a DataFrame in both cases, and I think the dictlike example above should return a DataFrame similar to the listlike example.
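
For comparison, Series.transform with a dictlike already returns a DataFrame (output from a recent pandas version):

>>> small_ser.transform({"sqrt": "sqrt", "abs": np.abs})
       sqrt  abs
0  1.000000    1
1  1.414214    2
2  1.732051    3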

Minor additional info: listlikes and dictlikes of aggregation arguments do behave the same, so this is only a problem with dictlikes of transforming arguments when using apply.

Proposal

With the above in mind, I propose that:

  1. Series.apply takes callables that always operate on the series. I.e. let series.apply(func) be similar to func(series) + the needed additional functionality.
  2. Series.map takes callables that operate on each element individually. I.e. series.map(func) will be similar to the current series._map_values(func) + the needed additional functionality.
  3. The parameter convert_dtype will be deprecated in Series.apply (already done in DEPR: Deprecate the convert_dtype param in Series.Apply #52257).
  4. A parameter convert_dtype will NOT be added to Series.map (comment by @rhshadrach).
  5. The ability in Series.apply to convert a list[Series] to a DataFrame will be deprecated (already done in DEPR: Deprecate returning a DataFrame in SeriesApply.apply_standard #52123).
  6. The ability to convert a list[Series] to a DataFrame will NOT be added to Series.map.
  7. The changes made to Series.apply will propagate to Series.agg and Series.transform.
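
In code, the proposed semantics of points 1 and 2 would roughly be (hypothetical future behavior, sketched in comments):

>>> func = lambda x: x + 1
>>> small_ser.apply(func)  # proposed: equivalent to func(small_ser), i.e. series-wise
>>> small_ser.map(func)    # unchanged: applies func to each element individually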

The difference between Series.apply() & Series.map() will then be that:

  • Series.apply() makes the passed-in callable operate on the series, similarly to how (DataFrame|SeriesGroupby|DataFrameGroupBy).apply operate on series. This is very fast and can do almost anything.
  • Series.map() makes the passed-in callable operate on each of the series' data elements individually. This is very flexible, but can be very slow, so it should only be used if Series.apply can't do it.

So, IMO, this API change will help make the pandas Series.(apply|map) API simpler without losing functionality, and let their functionality be explainable in a simple manner, which would be a win for pandas.

Deprecation process

The cumbersome part of the deprecation process will be to change Series.apply to only work array-wise, i.e. to always do func(series._values). This can be done by adding an array_ops_only parameter to Series.apply, so:

>>> def apply(self, ..., array_ops_only: bool | NoDefault = no_default, ...):
...     if array_ops_only is no_default:
...         warn("...")
...         array_ops_only = False
...     ...

and then change the meaning of that parameter again in pandas v3.0, so people can remove it from their code.

The other changes are easier: convert_dtype in Series.apply will be deprecated just like you normally would for method parameters. The ability to convert a list of Series to a DataFrame will emit a deprecation warning when that code path is encountered.
