Series.__getitem__ materializes Categorical to ndarray #19318

Closed
@TomAugspurger

For

In [12]: c = pd.Series(pd.Categorical(['a'] * 1000))

In [13]: c[0]

we hit

def get_value(self, series, key):
    """
    Fast lookup of value from 1-dimensional ndarray. Only use this if you
    know what you're doing
    """
    # if we have something that is Index-like, then
    # use this, e.g. DatetimeIndex
    s = getattr(series, '_values', None)
    if isinstance(s, Index) and is_scalar(key):
        try:
            return s[key]
        except (IndexError, ValueError):
            # invalid type as an indexer
            pass
    s = _values_from_object(series)
    k = _values_from_object(key)

_values_from_object calls series.get_values(), which dispatches to Categorical.get_values, which materializes the full ndarray of values just to look up a single element.

I have a branch based on my ExtensionArray stuff that "fixes" this by checking whether s is an instance of ExtensionArray, which has the correct semantics for what we need here. But that's not necessarily the best fix.
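
To make the difference between the two paths concrete, here is a minimal pure-Python sketch. It is not the pandas implementation: ToyCategorical, get_value_slow, and get_value_fast are hypothetical names invented for illustration, and the toy class only imitates the codes-plus-categories layout of a categorical array. The slow path expands every element before indexing, while the fast path indexes the array-like container directly, roughly the kind of isinstance check described above.

```python
class ToyCategorical:
    """Hypothetical stand-in for a categorical array: integer codes plus categories."""

    def __init__(self, values):
        self.categories = sorted(set(values))
        lookup = {cat: i for i, cat in enumerate(self.categories)}
        self.codes = [lookup[v] for v in values]
        self.materialize_calls = 0  # counts full O(n) conversions

    def get_values(self):
        # Expands every code back into its category: O(n) time and memory.
        self.materialize_calls += 1
        return [self.categories[c] for c in self.codes]

    def __getitem__(self, i):
        # Scalar lookup touches a single code: O(1).
        return self.categories[self.codes[i]]


def get_value_slow(arr, key):
    # Mirrors the path described above: materialize everything, then index.
    return arr.get_values()[key]


def get_value_fast(arr, key):
    # Sketch of the proposed check: index the array-like container directly.
    if isinstance(arr, ToyCategorical):
        return arr[key]
    # Anything else keeps the existing (materializing) path.
    return get_value_slow(arr, key)


c = ToyCategorical(['a'] * 1000)
assert get_value_slow(c, 0) == get_value_fast(c, 0) == 'a'
print(c.materialize_calls)  # 1: only the slow path materialized the values
```

Both functions return the same scalar; only the slow one pays for building a 1000-element list first, which is the kind of overhead the timings below measure on the real objects.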

master:

In [3]: c = pd.Series(pd.Categorical(['a'] * 1000))

In [4]: %timeit c[0]
50.1 µs ± 1.86 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

My branch:

In [4]: %timeit c[0]
5.76 µs ± 126 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Labels: Categorical (Categorical Data Type), Indexing (Related to indexing on series/frames, not to indexes themselves), Performance (Memory or execution speed performance)