Skip to content

PERF: performance regression in slicing a Series #33323

Closed
@jorisvandenbossche

Description

@jorisvandenbossche

All of the slice indexing benchmarks are showing regressions, eg https://pandas.pydata.org/speed/pandas/#indexing.NumericSeriesIndexing.time_loc_slice?p-index_dtype=%3Cclass%20'pandas.core.indexes.numeric.Int64Index'%3E&p-index_structure='unique_monotonic_inc'&commits=80d37adc-4f89c261

Replicating one of them with a small snippet confirms this:

In [1]: pd.__version__   
Out[1]: '1.1.0.dev0+1122.gc7c640ec7'

In [2]: N = 10 ** 6  

In [3]: data = pd.Series(np.random.rand(N), index=pd.Int64Index(range(N))) 

In [4]: %timeit data.iloc[:800000] 
103 ms ± 8.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: %timeit data[:800000] 
97.2 ms ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

vs

In [1]: pd.__version__
Out[1]: '1.0.3'

In [2]: N = 10 ** 6  

In [3]: data = pd.Series(np.random.rand(N), index=pd.Int64Index(range(N))) 

In [4]: %timeit data.iloc[:800000]  
55.7 µs ± 2.72 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [5]: %timeit data[:800000]   
69.6 µs ± 2.48 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Checking with snakeviz shows that on master, quasi all time the indexing operation takes is spent in setting the mgr_locs (creating the BlockPlacement object), which is certainly not as expected:

image

Putting a breakpoint in the Block init just before creating the BlockPlacement, showed that the value being passed on master is a range object, while this should be a slice object, I assume.
Going up the stack to see where this range object is created, points to:

block = blk.make_block_same_class(array, placement=range(len(array)))

which according to git blame is last touched by this PR: #32421. And indeed, a range object was added there instead of a slice (other places in the PR did use a slice, so this was probably kind of a typo).

Metadata

Metadata

Assignees

No one assigned

    Labels

    IndexingRelated to indexing on series/frames, not to indexes themselvesPerformanceMemory or execution speed performanceRegressionFunctionality that used to work in a prior pandas version

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions