Description
A discussion has been going on in #54747 (PDEP 13) about making Series.transform and DataFrame.transform always operate on Series. See #54747 (comment) and related comments. I am opening a separate issue to keep that discussion apart from PDEP 13/#54747.
Currently, Series.transform first tries to operate on each element of the series and, if that fails, falls back to operating on the whole series. This fallback mechanism makes the method difficult to use, and the first choice (element-wise operation) is very slow. DataFrame.transform operates on series (i.e. columns/rows) when given a callable, but on elements when given a list or dict of callables, which is inconsistent. Examples:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({"x": range(100_000)})
>>> %timeit df["x"].transform(lambda x: x + 1) # operates on elements, slow
15.5 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit df['x'].transform(np.sin) # ufunc, fast
784 µs ± 19.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit df['x'].transform(lambda x: np.sin(x)) # non-ufunc, operates on elements, slow
86.6 ms ± 1.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df.transform(lambda x: x + 1) # operates on the columns/series, fast
142 µs ± 589 ns per loop
>>> %timeit df.transform([lambda x: x + 1]) # lists/dicts operate on the elements, slow
16.7 ms ± 165 µs per loop
All in all, the above is inconsistent and difficult for users to reason about, much like the behavior of apply discussed in #54747/PDEP 13.
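One way to check which mode a given call uses is to record the type of argument the callable receives. The following is a minimal sketch; the helper name argument_types is hypothetical, and what it reports for Series.transform and for list input depends on the installed pandas version.

```python
import pandas as pd


def argument_types(transform, func):
    """Return the sorted type names that `func` was called with.

    `transform` is a bound method such as `df["x"].transform`.
    This helper is illustrative only and not part of pandas.
    """
    seen = set()

    def probe(arg):
        # Record what pandas passed in, then delegate to the real function.
        seen.add(type(arg).__name__)
        return func(arg)

    transform(probe)
    return sorted(seen)


def add_one(x):
    return x + 1


df = pd.DataFrame({"x": range(5)})

# Scalar type names indicate element-wise calls; "Series" indicates
# that the callable was handed a whole column.
print("Series.transform(func):     ", argument_types(df["x"].transform, add_one))
print("DataFrame.transform(func):  ", argument_types(df.transform, add_one))
print("DataFrame.transform([func]):", argument_types(lambda f: df.transform([f]), add_one))
```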
I propose to deprecate element-wise operations in (Series|DataFrame).transform, so that in pandas v3.0 giving callables (and lists/dicts of callables) to (Series|DataFrame).transform always operates on series. The benefit is that the (Series|DataFrame).transform method becomes much more predictable and faster. Users who want element-wise operations should be directed to (Series|DataFrame).map. No functionality is lost, and we get a clearer separation between series-wise and element-wise operations.
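The separation can be illustrated with methods that already exist, as a minimal sketch. Series.map calls the function on each element, while Series.pipe (used here only as a stand-in for the proposed series-wise transform) hands the function the whole Series in a single call. The function names and the call counter are illustrative.

```python
import pandas as pd

ser = pd.Series(range(5), name="x")
calls = {"map": 0, "pipe": 0}


def add_one_scalar(value):
    # Called by Series.map with one element at a time.
    calls["map"] += 1
    return value + 1


def add_one_series(s):
    # Called by Series.pipe with the whole Series, so the addition is vectorized.
    calls["pipe"] += 1
    return s + 1


elementwise = ser.map(add_one_scalar)
serieswise = ser.pipe(add_one_series)

print(calls)                           # number of calls made by each approach
print(elementwise.equals(serieswise))  # both approaches give the same values
```

Both calls produce the same result here, but the element-wise version pays Python function-call overhead per element, which is the cost visible in the timings above.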
I propose implementing the deprecation in pandas v2.2 by adding a new keyword parameter series_ops_only to (Series|DataFrame).transform. When set to True, (Series|DataFrame).transform will always operate on the whole series. When set to False, the old behavior will be kept and a deprecation warning will be emitted. In pandas v3.0, the old behavior will be removed and (Series|DataFrame).transform will only operate on series.
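The proposed semantics could look roughly like the following sketch for the Series case. It is a standalone function, not the pandas implementation: the name transform_sketch, the warning class (FutureWarning), the warning text, and the index check are all assumptions made for illustration.

```python
import warnings

import pandas as pd


def transform_sketch(ser, func, series_ops_only=False):
    """Hypothetical emulation of the proposed `series_ops_only` keyword."""
    if series_ops_only:
        # Proposed behavior: always pass the whole Series to the callable.
        result = func(ser)
        # Assumed check: a transform should return a Series with the same index.
        if not isinstance(result, pd.Series) or not result.index.equals(ser.index):
            raise ValueError("Function did not transform")
        return result

    # Old behavior: defer to the existing method and warn about the change.
    warnings.warn(
        "Element-wise operation in transform is deprecated; "
        "pass series_ops_only=True or use map for element-wise operations.",
        FutureWarning,
        stacklevel=2,
    )
    return ser.transform(func)


ser = pd.Series([1, 2, 3])
print(transform_sketch(ser, lambda s: s + 1, series_ops_only=True).tolist())
```

With series_ops_only=True the callable is invoked exactly once, with the Series itself, which is what would make the method's behavior predictable.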
Related issues:
- PDEP-13: Deprecate the apply method on Series and DataFrame and make the agg and transform methods operate on series data #54747
- API: make the func in Series.apply always operate on the Series #52140
- DEPR: make Series.agg aggregate when possible #53325 (similar issue for agg, already implemented)
- API: Signature of UDF methods #40112