Remove duplicate is_lexsorted function #19305

Merged
merged 1 commit into from
Jan 19, 2018

Conversation

jbrockmendel
Member

_libs.lib and _libs.algos have near-identical is_lexsorted functions. The only differences appear to be small optimizations/modernizations in the algos version. AFAICT the algos version is only used in tests.test_algos ATM. This PR removes the _libs.lib version and changes its one usage (in indexes.multi) to use the algos version.
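
For readers unfamiliar with the check, here is a minimal pure-Python sketch of what a lexsort test over a list of label arrays computes. It is an illustration only, not the Cython code from either module, and the name `is_lexsorted_sketch` is made up for this example.

```python
def is_lexsorted_sketch(list_of_arrays):
    """Return True if the rows formed by zipping the label arrays are in
    non-decreasing lexicographic order (equal neighbouring rows are allowed)."""
    rows = list(zip(*list_of_arrays))
    # Tuples compare element by element, which is exactly lexicographic order.
    return all(prev <= cur for prev, cur in zip(rows, rows[1:]))


sorted_labels = [[0, 0, 1, 1], [0, 1, 0, 1]]
unsorted_labels = [[0, 0, 1, 1], [1, 0, 0, 1]]
print(is_lexsorted_sketch(sorted_labels))    # rows (0,0) (0,1) (1,0) (1,1)
print(is_lexsorted_sketch(unsorted_labels))  # rows start (0,1) (0,0)
```

The first call reports a sorted layout and the second does not, because its second row is smaller than its first.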

@jreback
Contributor

jreback commented Jan 18, 2018

No problem removing. Can you post a timeit with the perf comparison (of the is_lexsorted of each version)? Do we have an asv to cover this?

@jreback added the Indexing, MultiIndex, and Clean labels Jan 18, 2018
@jbrockmendel
Member Author

Weird results from timeit, using one of the examples in asv_bench.multiindex_object and mimicking the usage in MultiIndex.lexsort_depth:

import string
import numpy as np
import pandas as pd
from pandas.core.dtypes.common import _ensure_int64
from pandas._libs import lib, algos

mi_large = pd.MultiIndex.from_product(
            [np.arange(1000), np.arange(20), list(string.ascii_letters)],
            names=['one', 'two', 'three'])

self = mi_large
int64_labels = [_ensure_int64(lab) for lab in self.labels]

k = 3
%timeit lib.is_lexsorted(int64_labels[:k])
%timeit algos.is_lexsorted(int64_labels[:k])

For k=1 and k=2, the perf is effectively indistinguishable between the two versions. For k=3, the libalgos version (the one this PR prefers) is roughly 5000 times slower:

In [48]: %timeit lib.is_lexsorted(int64_labels[:k])
The slowest run took 5.17 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.75 µs per loop

In [51]: %timeit algos.is_lexsorted(int64_labels[:k])
100 loops, best of 3: 9.36 ms per loop
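
As a sanity check on the "roughly 5000 times" figure, and as a way to reproduce "best of 3" style numbers outside IPython, something like the following could be used. It is a sketch built on the standard-library `timeit` module; `best_time` is a hypothetical helper, not part of pandas or of this PR.

```python
import timeit


def best_time(func, number=1000, repeat=3):
    """Best per-call time in seconds over `repeat` runs of `number` calls each,
    similar in spirit to the 'best of 3' line printed by %timeit."""
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number


# Ratio of the two timings quoted above: 9.36 ms vs 1.75 µs.
slowdown = 9.36e-3 / 1.75e-6
print(round(slowdown))

# Example use of the helper on a trivial callable.
print(best_time(lambda: sum(range(100)), number=100) > 0)
```

The computed ratio is a little over 5300, consistent with the rounded figure in the comment.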

AFAICT the difference comes from the fact that in the k=3 case, the lib version returns False directly, while the algos version calls free and then returns False. In the k=1,2 cases both versions call free.

I'd guess that calling free is the right thing to do here, emphasis on "guess".
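
One plausible reading of why only k=3 comes back False, sketched in pure Python: `string.ascii_letters` lists lowercase before uppercase, so if the level values are stored sorted (an assumption about how the MultiIndex factorizes its inputs), the third level's integer labels start at 26 and later drop back to 0. The helper names and the shrunken 3 x 2 x 52 layout below are made up for illustration; the depth loop only mimics the `labels[:k]` usage shown in the benchmark.

```python
import string


def is_lexsorted_sketch(list_of_arrays):
    rows = list(zip(*list_of_arrays))
    return all(prev <= cur for prev, cur in zip(rows, rows[1:]))


def lexsort_depth_sketch(labels):
    # Try the widest prefix of levels first, then progressively narrower ones.
    for k in range(len(labels), 0, -1):
        if is_lexsorted_sketch(labels[:k]):
            return k
    return 0


# Assumption: a value's label is its position in the sorted level.
letters = list(string.ascii_letters)      # 'a'..'z' followed by 'A'..'Z'
level = sorted(set(letters))              # 'A'..'Z' followed by 'a'..'z'
letter_codes = [level.index(ch) for ch in letters]

# Small from_product-style layout: 3 x 2 x 52 rows.
n_one, n_two = 3, 2
labels = [
    [i for i in range(n_one) for _ in range(n_two * len(letters))],
    [j for _ in range(n_one) for j in range(n_two) for _ in letters],
    letter_codes * (n_one * n_two),
]

print(letter_codes[:3])              # the first letters map to codes 26, 27, 28
print(lexsort_depth_sketch(labels))  # first two levels sorted, third is not
```

Under that assumption the first two levels pass the check and the full three-level check fails, which matches the k=1,2 versus k=3 split reported above.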

@codecov

codecov bot commented Jan 19, 2018

Codecov Report

Merging #19305 into master will decrease coverage by 0.02%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #19305      +/-   ##
==========================================
- Coverage   91.52%    91.5%   -0.03%     
==========================================
  Files         150      150              
  Lines       48875    48875              
==========================================
- Hits        44733    44721      -12     
- Misses       4142     4154      +12

Flag        Coverage Δ
#multiple   89.87% <100%> (-0.03%) ⬇️
#single     41.66% <100%> (ø) ⬆️

Impacted Files                  Coverage Δ
pandas/core/indexes/multi.py    96.22% <100%> (ø) ⬆️
pandas/plotting/_converter.py   65.22% <0%> (-1.74%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ca2d261...b4d3e8f.

@jreback
Contributor

jreback commented Jan 19, 2018

Hmm, that's a memory leak in lib.pyx. OK, let's go with the algos one.

@jreback added this to the 0.23.0 milestone Jan 19, 2018
@jreback merged commit 0f1c9c5 into pandas-dev:master Jan 19, 2018
@jbrockmendel deleted the is_lexsorted branch January 19, 2018 16:03