Why does Series.transform() exist? #31937

Closed
@UchuuStranger

This is my first issue on GitHub, so apologies in advance if there's something wrong with the format.

My issue does not have an expected output; I just want to understand whether, and why, the Series.transform() method is not redundant. Overall, the transform() methods are very similar to the apply() methods. While trying to figure out the difference between them (this Stack Overflow topic was helpful), I pinpointed three primary differences:

  1. When the DataFrame is grouped on several categories, apply() passes each entire sub-DataFrame to the function, while transform() passes each column of each sub-DataFrame separately. That's why the function can't access values in other columns within transform();
  2. When the input passed to the function is an iterable of a certain length, apply() can still produce output of any length, while transform() is limited to returning an iterable of the same length as the input;
  3. When the function outputs a scalar, apply() returns that scalar, while transform() broadcasts that scalar to an iterable of the input length.
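A minimal sketch of differences 1 and 3 on grouped data follows. The sample values and variable names are illustrative assumptions. Difference 1 is shown only from the apply() side, because how transform() hands data to the function may vary between pandas versions.

```python
import pandas as pd

# Illustrative data; values are chosen only for this sketch.
df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3], 'b': [6, 10, 3, 11]})
grouped = df.groupby('State')

# Difference 1 (apply side): the function receives a whole sub-DataFrame,
# so it can combine columns 'a' and 'b'.
cross_column = grouped[['a', 'b']].apply(lambda g: (g['a'] * g['b']).sum())

# Difference 3: the function returns one scalar per group.
per_group = grouped['a'].apply(lambda s: s.sum())     # one value per group
per_row = grouped['a'].transform(lambda s: s.sum())   # scalar repeated for every row

print(cross_column)
print(per_group)
print(per_row)
```

Here per_group is indexed by State, while per_row keeps the original row index of df, so it can be assigned straight back as a new column.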

I conducted a series of experiments that test these three differences on each applicable pandas object type: Series, DataFrame, SeriesGroupBy, and DataFrameGroupBy. I can send my ipynb with the code and the results if necessary, but it would be sufficient to just look at the conclusion for the Series type:

1 – not applicable. In both cases the function receives a scalar input.
2 – not applicable. No matter what the function returns, in both cases the result is assigned to a single cell, even if that means entire DataFrames inside the cells of a Series.
3 – not applicable. The input length is always "1" (it's considered "1" even when it's an iterable), so there's nothing to broadcast.
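As a quick check of the Series case, the sketch below compares the two methods for a simple elementwise function. The helper add_one and the sample values are hypothetical, introduced only for illustration. It covers just this one kind of function, not every possible input.

```python
import pandas as pd

s = pd.Series([4, 5, 1, 3], name='a')

def add_one(x):
    # Hypothetical elementwise helper used only for this comparison.
    return x + 1

via_apply = s.apply(add_one)
via_transform = s.transform(add_one)

# Compare values, index and dtype of the two results.
print(via_apply.equals(via_transform))
```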

Inapplicability of 1 is self-explanatory. But 2 was a surprise. Below is the code I tried:

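import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3], 'b': [6, 10, 3, 11]})

# Ignores its input and always returns a 2x2 DataFrame.
def return_df(x):
    return pd.DataFrame([[4, 5], [3, 2]])

# Ignores its input and always returns a 2-element Series.
def return_series(x):
    return pd.Series([1, 2])

# Both calls are made on a plain (ungrouped) Series.
df['a'].transform(return_df)
df['a'].transform(return_series)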

If you try this code, you'll see that it doesn't matter what the function returns. Whatever it is, it will be put inside the single Series cell in its entirety. Is this behavior intentional? It results in the output size being predetermined by the input size, so all the size checks that Series.transform() has within itself become redundant. I can't imagine any situation where Series.transform() could behave in a different way from Series.apply(). And that raises the question I posed: why does Series.transform() exist?
