Skip to content

Huge performance hit in get_indexer_non_unique #15364

Closed
@horta

Description

@horta

Let df0 and df1 be indexed data frames. Assume that at least one of them has a non-unique index. If I'm not mistaken, the function #L309 will always be called upon the operation:

df0.loc[df1.index]

Now imagine that both of indexes are reasonably large (lets say at least one millions of records). The operations at lines

will be of order O(n^2*log(n)^2) even when both indices are sorted.

Am I missing something? Let me know if the above reasoning is right so I can help with it. If both indexes are sorted, It looks like that function can be as fast as O(n) simply by looping over the values and targets arrays simultaneously. (Assuming that the index duplication is very small compared to n.

Metadata

Metadata

Assignees

No one assigned

    Labels

    IndexingRelated to indexing on series/frames, not to indexes themselvesMultiIndexPerformanceMemory or execution speed performance

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions