Index.difference performance #12044

Closed
@Winand

I need to append several large Series to a large categorical Series.
While trying to update the categories quickly, I found that Index.difference uses Python's set, which is slow to build for large inputs (I have up to 500k categories and 1.3M values).
numpy's setdiff1d is more than an order of magnitude faster (measured on a datetime64 Categorical):

import numpy as np
import pandas as pd
tmp_unique = tmp.unique()
new_cats = pd.Index(np.setdiff1d(tmp_unique[~pd.isnull(tmp_unique)], to.cat.categories))

The equivalent call through Index.difference is much slower:

new_cats = pd.Index(tmp_unique[~pd.isnull(tmp_unique)]).difference(to.cat.categories)
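
For anyone who wants to reproduce this on a small scale, here is a minimal sketch of the whole workflow. The contents of `to` and `tmp` are made-up stand-ins (the real data is much larger), and the names `candidates`, `via_numpy`, `via_index` and `combined` are only illustrative. It checks that both approaches give the same new categories, then registers them and appends. Relative timings will depend on the pandas and numpy versions in use, so the sketch does not try to measure them.

```python
import numpy as np
import pandas as pd

# Hypothetical small stand-ins for the `to` and `tmp` Series in the report.
to = pd.Series(
    pd.to_datetime(["2016-01-01", "2016-01-02", "2016-01-02"])
).astype("category")
tmp = pd.Series(
    pd.to_datetime(["2016-01-02", "2016-01-03", None, "2016-01-04", "2016-01-03"])
)

# Unique non-null incoming values, as a plain datetime64 ndarray.
candidates = np.asarray(tmp.dropna().unique())

# NumPy's sorted-array set difference versus pandas' Index.difference.
via_numpy = pd.Index(np.setdiff1d(candidates, to.cat.categories.values))
via_index = pd.Index(candidates).difference(to.cat.categories)

# Both approaches should produce the same set of new categories.
assert via_numpy.equals(via_index)

# Register the new categories on `to`, cast `tmp` to the same
# categorical dtype, and append; the result stays categorical.
to = to.cat.add_categories(via_numpy)
combined = pd.concat([to, tmp.astype(to.dtype)], ignore_index=True)
print(combined.dtype)
```

Casting `tmp` to `to.dtype` before `pd.concat` matters here: if the two sides do not share the same categories, the concatenated result falls back to a non-categorical dtype.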

Labels: Indexing (related to indexing on series/frames, not to indexes themselves), Performance (memory or execution speed performance)
