
PERF: Speed up boolean masking on Series / index #30349

Closed
@TomAugspurger


On master, applying a boolean mask to a 10,000-element Series takes ~300 µs:

In [2]: s = pd.Series(np.arange(10000))
   ...: m1 = np.zeros(len(s), dtype="bool")
   ...: m2 = pd.array(m1, dtype="boolean")

In [3]: %timeit s[m1]
275 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Much of that time is spent in a try / except inside Index.get_value:

%load_ext line_profiler
%lprun -f pd.Index.get_value s[m1]

  4481         1          1.0      1.0      0.1          try:
  4482         1        626.0    626.0     63.9              return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
  4483         1          1.0      1.0      0.1          except KeyError as e1:

That engine lookup can never succeed for a boolean mask. Skipping that path entirely improves performance on this example by roughly 2x:

In [3]: %timeit s[m1]
155 µs ± 4.33 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

We might be able to restructure Index.get_value or Series.__getitem__ a bit to not go down this path when we have a boolean ndarray as a mask.
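
As a rough illustration of the restructuring suggested above, here is a minimal toy sketch in pure Python. It is not pandas code: `ToySeries`, `getitem_slow`, `getitem_fast`, and the dict-based `_engine` are hypothetical stand-ins, and the mask is a plain list of bools rather than an ndarray. It only shows the general idea of checking for a boolean mask before attempting a label lookup that is certain to fail.

```python
from typing import Hashable, Sequence


class ToySeries:
    """Hypothetical stand-in for a Series; none of this is pandas API."""

    def __init__(self, values: Sequence, index: Sequence[Hashable]) -> None:
        self.values = list(values)
        self.index = list(index)
        # Label -> position map, playing the role of the index engine.
        self._engine = {label: pos for pos, label in enumerate(self.index)}
        # Counts how often the label-lookup path is attempted.
        self.engine_lookups = 0

    def _is_bool_mask(self, key) -> bool:
        # A mask here is a list of bools with one entry per value.
        return (
            isinstance(key, list)
            and len(key) == len(self.values)
            and all(isinstance(k, bool) for k in key)
        )

    def _apply_mask(self, mask):
        return [v for v, keep in zip(self.values, mask) if keep]

    def getitem_slow(self, key):
        # Same shape as the profiled path: try a label lookup first and
        # fall back only after it raises.
        try:
            self.engine_lookups += 1
            return self.values[self._engine[key]]
        except (KeyError, TypeError):
            # A list key is unhashable, so the lookup always fails here.
            return self._apply_mask(key)

    def getitem_fast(self, key):
        # Dispatch on the key type first, so a mask never reaches the engine.
        if self._is_bool_mask(key):
            return self._apply_mask(key)
        self.engine_lookups += 1
        return self.values[self._engine[key]]


s = ToySeries([10, 20, 30], ["a", "b", "c"])
mask = [True, False, True]
print(s.getitem_slow(mask), s.engine_lookups)
print(s.getitem_fast(mask), s.engine_lookups)
```

Both methods return the same selection; the difference is that `getitem_fast` leaves `engine_lookups` unchanged for a mask. Where the equivalent check would best live in pandas (Index.get_value or Series.__getitem__) is left open here, as in the issue.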

Labels: Indexing (related to indexing on series/frames, not to indexes themselves); Performance (memory or execution speed performance)
