Skip to content

ASV: reduce overall run time for GroupByMethods benchmarks #44604

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Nov 25, 2021
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 8 additions & 5 deletions asv_bench/benchmarks/groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -443,7 +443,7 @@ class GroupByMethods:
"var",
],
["direct", "transformation"],
[1, 2, 5, 10],
[1, 5],
]

def setup(self, dtype, method, application, ncols):
Expand All @@ -455,6 +455,7 @@ def setup(self, dtype, method, application, ncols):
raise NotImplementedError

if application == "transformation" and method in [
"describe",
"head",
"tail",
"unique",
Expand All @@ -464,7 +465,12 @@ def setup(self, dtype, method, application, ncols):
# DataFrameGroupBy doesn't have these methods
raise NotImplementedError

ngroups = 1000
if method == "describe" and ncols == 5:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could do this without the ncols check?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im imagining looking at results and seeing the ncols==5 case being faster than the ncols==1 case (or just much less than 5x slower) and being confused until i remember this special case

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i agree or add a comment here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, certainly (I initially used ngroups = 100 for each of the three methods for both ncols parameters, but then only describe for ncols=5 was still on the slow side.
Using the smaller ngroups for both cases of describe will make the ncols=1 still faster, but that can't hurt I suppose.

ngroups = 20
elif method in ["describe", "mad", "skew"]:
ngroups = 100
else:
ngroups = 1000
size = ngroups * 2
rng = np.arange(ngroups).reshape(-1, 1)
rng = np.broadcast_to(rng, (len(rng), ncols))
Expand All @@ -491,9 +497,6 @@ def setup(self, dtype, method, application, ncols):
cols = cols[0]

if application == "transformation":
if method == "describe":
raise NotImplementedError

self.as_group_method = lambda: df.groupby("key")[cols].transform(method)
self.as_field_method = lambda: df.groupby(cols)["key"].transform(method)
else:
Expand Down