PERF: optimize MultiIndex.from_product #7627
Looks like a nice speedup, but could you please verify that #6439 (cartesian product of a DatetimeIndex) is still fixed? e.g.:
import pandas as pd
idx = pd.MultiIndex.from_product([[1, 2], pd.date_range('2000-01-01', periods=2)]).values
print [x.day for _, x in idx]
# should print [1, 2, 1, 2]
In retrospect, I should have added a test for MultiIndex.from_product in #6451. |
@shoyer You can always still add a test for that in a new PR |
@shoyer works for me
In [24]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:import pandas as pd
:idx = pd.MultiIndex.from_product([[1, 2], pd.date_range('2000-01-01', periods=2)]).values
:print [x.day for _, x in idx]
:--
[1, 2, 1, 2] |
IIRC there are tests for using DatetimeIndex in MultiIndex.from_product. @immerrr if not enough test coverage, pls add (otherwise ok) |
@immerrr looks ok, pls verify test coverage then can merge |
Ok, added the test |
ok, ping when green |
good to go |
PERF: optimize MultiIndex.from_product
thanks! |
@immerrr Thanks for adding that test! |
@shoyer you're welcome :) |
This PR speeds up MultiIndex.from_product by exploiting the fact that operating on categorical codes is faster than operating on the values themselves.
This yields about a 2x improvement in the benchmark
It's only marginally slower in small size cases:
And this case came as a surprise, because the cartesian product is blazingly fast in both the old and new versions, but profiling showed that factorization is a lot faster when done on a smaller array:
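
For readers who want to see the idea in isolation, here is a minimal sketch of the approach described above: factorize each small input once, then take the cartesian product of the integer codes instead of the values. This is not the code from this PR or the pandas implementation. The helper names `cartesian_positions` and `from_product_via_codes` are hypothetical, the sketch assumes non-empty inputs, and it assumes a recent pandas in which the `MultiIndex` constructor accepts `codes` (older versions called this argument `labels`).

```python
import numpy as np
import pandas as pd


def cartesian_positions(sizes):
    """Return one integer array per input giving positions into that input,
    laid out in cartesian-product order (last input varies fastest).

    Hypothetical helper; assumes every size is >= 1.
    """
    total = int(np.prod(sizes))
    positions = []
    repeat = total
    for n in sizes:
        repeat //= n                  # consecutive repeats of each position
        tile = total // (n * repeat)  # how often the repeated block recurs
        positions.append(np.tile(np.repeat(np.arange(n), repeat), tile))
    return positions


def from_product_via_codes(iterables, names=None):
    """Build a MultiIndex from the cartesian product of `iterables`.

    Hypothetical helper illustrating the technique, not pandas' own code.
    """
    # Factorize each small input once; the uniques become the levels.
    factorized = [pd.factorize(pd.Index(it)) for it in iterables]
    input_codes = [codes for codes, _ in factorized]
    levels = [uniques for _, uniques in factorized]

    # Expand the small code arrays to full length using integer indexing only.
    positions = cartesian_positions([len(c) for c in input_codes])
    codes = [c[p] for c, p in zip(input_codes, positions)]
    return pd.MultiIndex(levels=levels, codes=codes, names=names)


if __name__ == "__main__":
    dates = pd.date_range("2000-01-01", periods=2)
    print(from_product_via_codes([[1, 2], dates]))
```

The point of the design is that factorization runs on arrays as long as each input, while the full-length arrays are only ever built from integers. That matches the profiling observation above that factorization is faster on a smaller array.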