PERF: optimize MultiIndex.from_product #7627

Merged (1 commit) on Jul 1, 2014

Conversation

immerrr
Contributor

@immerrr immerrr commented Jul 1, 2014

This PR speeds up MultiIndex.from_product by exploiting the fact that operating on categorical codes is faster than operating on the values themselves.

This yields about a 2x improvement in the benchmark:

In [1]: import pandas.util.testing as tm

In [2]: data = [tm.makeStringIndex(10000), tm.makeFloatIndex(20)]

In [3]: %timeit pd.MultiIndex.from_product(data)
100 loops, best of 3: 10.6 ms per loop

In [4]: %timeit pd.MultiIndex.from_arrays(pd.tools.util.cartesian_product(data))
10 loops, best of 3: 23.4 ms per loop

It's only marginally slower for small inputs:

In [1]: data = [np.arange(20).astype(object), np.arange(20)]

In [2]: %timeit pd.MultiIndex.from_product(data)
1000 loops, best of 3: 317 µs per loop

In [3]: %timeit pd.MultiIndex.from_arrays(pd.tools.util.cartesian_product(data))
1000 loops, best of 3: 308 µs per loop

In [4]: data_int = [np.arange(20), np.arange(20)]

In [5]: %timeit pd.MultiIndex.from_product(data_int)
1000 loops, best of 3: 285 µs per loop

In [6]: %timeit pd.MultiIndex.from_arrays(pd.tools.util.cartesian_product(data_int))
1000 loops, best of 3: 269 µs per loop

This case came as a surprise, because the cartesian product itself is very fast in both the old and new versions, but profiling showed that factorization is a lot faster when done on a smaller array:

In [7]: data_large = [np.arange(10000), np.arange(20)]

In [8]: %timeit pd.MultiIndex.from_arrays(pd.tools.util.cartesian_product(data_large))
100 loops, best of 3: 9.88 ms per loop

In [9]: %timeit pd.MultiIndex.from_product(data_large)
100 loops, best of 3: 2.74 ms per loop
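
For readers who want to see the idea in isolation, here is a minimal sketch of the approach described above. It is not the code from this PR: the helper name `from_product_via_codes` is hypothetical, it uses the current `pd.MultiIndex(levels=..., codes=...)` constructor rather than the 2014-era API, and it assumes every input is non-empty and contains no missing values. Each input is factorized once while it is still small, and only the integer codes are expanded into the cartesian product.

```python
import numpy as np
import pandas as pd


def from_product_via_codes(iterables):
    """Build a cartesian-product MultiIndex by combining integer codes.

    Hypothetical helper for illustration; assumes non-empty inputs.
    """
    levels, codes = [], []
    for values in iterables:
        # Factorize each small input once: codes are positions into uniques.
        level_codes, uniques = pd.factorize(pd.Index(values), sort=True)
        codes.append(level_codes)
        levels.append(pd.Index(uniques))

    sizes = [len(c) for c in codes]
    total = int(np.prod(sizes))

    # Expand only the integer codes; the values themselves are never copied.
    product_codes = []
    repeat = total
    for level_codes, n in zip(codes, sizes):
        repeat //= n                   # length of each run of one element
        tile = total // (n * repeat)   # how often the whole pattern repeats
        product_codes.append(np.tile(np.repeat(level_codes, repeat), tile))

    return pd.MultiIndex(levels=levels, codes=product_codes)


mi = from_product_via_codes([["b", "a"], [1, 2, 3]])
# The result can be checked against the built-in constructor.
same = mi.equals(pd.MultiIndex.from_product([["b", "a"], [1, 2, 3]]))
```

The design point is that `pd.factorize` runs on arrays of length 2 and 3 here, not on the 6-element product, which is where the saving reported by the profiler would come from on larger inputs.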

@immerrr immerrr changed the title from "(WIP) PERF: optimize MultiIndex.from_product" to "PERF: optimize MultiIndex.from_product" Jul 1, 2014
@shoyer
Member

shoyer commented Jul 1, 2014

Looks like a nice speedup, but could you please verify that #6439 (cartesian product of a DatetimeIndex) is still fixed?

e.g.:

import pandas as pd
idx = pd.MultiIndex.from_product([[1, 2], pd.date_range('2000-01-01', periods=2)]).values 
print [x.day for _, x in idx]
# should print [1, 2, 1, 2]

In retrospect, I should have added a test for MultiIndex.from_product in #6451.
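
A regression test along those lines might look like the sketch below. The test name is hypothetical and this is not necessarily the test that was later added; it is written for Python 3 and a current pandas, and it simply restates the check above as assertions.

```python
import pandas as pd


def test_from_product_datetimeindex():
    # Hypothetical test name; mirrors the manual check for #6439.
    dates = pd.date_range("2000-01-01", periods=2)
    values = pd.MultiIndex.from_product([[1, 2], dates]).values

    # The second element of each tuple should still be a Timestamp,
    # and the dates should cycle within each outer value.
    assert all(isinstance(ts, pd.Timestamp) for _, ts in values)
    assert [ts.day for _, ts in values] == [1, 2, 1, 2]
```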

@jorisvandenbossche
Member

@shoyer You can always still add a test for that in a new PR

@immerrr
Contributor Author

immerrr commented Jul 1, 2014

@shoyer works for me

In [24]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:import pandas as pd
:idx = pd.MultiIndex.from_product([[1, 2], pd.date_range('2000-01-01', periods=2)]).values 
:print [x.day for _, x in idx]
:--
[1, 2, 1, 2]

@jreback
Contributor

jreback commented Jul 1, 2014

IIRC there are tests for using DatetimeIndex in MultiIndex.from_product. @immerrr if there is not enough test coverage, pls add (otherwise ok)

@jreback jreback added this to the 0.14.1 milestone Jul 1, 2014
@jreback
Contributor

jreback commented Jul 1, 2014

@immerrr looks ok, pls verify test coverage then can merge

@immerrr
Contributor Author

immerrr commented Jul 1, 2014

Ok, added the test

@jreback
Contributor

jreback commented Jul 1, 2014

ok, ping when green

@immerrr
Contributor Author

immerrr commented Jul 1, 2014

good to go

jreback added a commit that referenced this pull request Jul 1, 2014
@jreback jreback merged commit c8a3eba into pandas-dev:master Jul 1, 2014
@jreback
Contributor

jreback commented Jul 1, 2014

thanks!

@immerrr immerrr deleted the perf-multiindex-fromproduct branch July 1, 2014 11:38
@shoyer
Member

shoyer commented Jul 1, 2014

@immerrr Thanks for adding that test!

@immerrr
Contributor Author

immerrr commented Jul 1, 2014

@shoyer you're welcome :)

Labels: MultiIndex, Performance (memory or execution speed performance)
4 participants