REGR: groupby.transform with a UDF performance #55256

Open
@rhshadrach

import numpy as np
import pandas as pd

pd.options.mode.copy_on_write = False  # set to True for the CoW=True timings
size = 10_000
df = pd.DataFrame(
    {
        'a': np.random.randint(0, 100, size),
        'b': np.random.randint(0, 100, size),
        'c': np.random.randint(0, 100, size),
    }
).set_index(['a', 'b']).sort_index()

gb = df.groupby(['a', 'b'])

%timeit gb.transform(lambda x: x == x.shift(-1).fillna(0))

# 2.0.x - CoW=False
# 1.46 s ± 14.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 
# 2.0.x - CoW=True
# 1.47 s ± 6.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 
# main - CoW=False
# 4.35 s ± 50.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 
# main - CoW=True
# 9.11 s ± 76.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Encountered this trying to update some code to use CoW. The regression exists without CoW, but is also worse with it. Haven't done any investigation yet as to why.

cc @phofl @jorisvandenbossche

PS: This code should not have been using transform with a UDF 😄
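
For anyone hitting the same slowdown, here is a minimal sketch of how the same comparison might be written without a UDF, using the built-in `groupby(...).shift`. This is a workaround sketch, not a fix for the regression itself. The helper names `build_frame`, `with_udf`, and `without_udf` are hypothetical, and the frame is smaller and seeded so the check runs quickly.

```python
import numpy as np
import pandas as pd


def build_frame(size=2_000, seed=0):
    # Same shape as the frame in the report, but smaller and seeded (assumption).
    rng = np.random.default_rng(seed)
    return (
        pd.DataFrame(
            {
                "a": rng.integers(0, 100, size),
                "b": rng.integers(0, 100, size),
                "c": rng.integers(0, 100, size),
            }
        )
        .set_index(["a", "b"])
        .sort_index()
    )


def with_udf(df):
    # The pattern from the report: a Python lambda is called for every group.
    return df.groupby(["a", "b"]).transform(lambda x: x == x.shift(-1).fillna(0))


def without_udf(df):
    # Shift within each group using the built-in groupby method,
    # then do the fillna and the comparison once on the whole frame.
    shifted = df.groupby(["a", "b"]).shift(-1).fillna(0)
    return df == shifted


df = build_frame()
# Check that both approaches give the same values on this sample frame.
print((with_udf(df).to_numpy() == without_udf(df).to_numpy()).all())
```

The second version calls no Python function per group, so it should be much less sensitive to per-group overhead, though I have not timed it against the numbers above.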
