Huge performance hit in get_indexer_non_unique

Let `df0` and `df1` be indexed data frames, and assume that at least one of them has a non-unique index. If I'm not mistaken, the function `get_indexer_non_unique` at [#L309](https://github.com/pandas-dev/pandas/blob/3853fe6d2b884f186b93933ef53ff5e475c2d80c/pandas/index.pyx#L309) will always be called by the operation:

```python
df0.loc[df1.index]
```

Now imagine that both indexes are reasonably large (say, at least one million records). The operations at lines

- [#L341](https://github.com/pandas-dev/pandas/blob/3853fe6d2b884f186b93933ef53ff5e475c2d80c/pandas/index.pyx#L341)
- [#L342](https://github.com/pandas-dev/pandas/blob/3853fe6d2b884f186b93933ef53ff5e475c2d80c/pandas/index.pyx#L342)
- [#L351](https://github.com/pandas-dev/pandas/blob/3853fe6d2b884f186b93933ef53ff5e475c2d80c/pandas/index.pyx#L351)

will be of order `O(n^2*log(n)^2)` even when both indices are sorted.

Am I missing something? Let me know if the above reasoning is right so I can help with it. If both indexes are sorted, it looks like that function could run in `O(n)` simply by looping over the `values` and `targets` arrays simultaneously (assuming that the amount of index duplication is very small compared to `n`).
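To make the suggestion concrete, here is a minimal pure-Python sketch of the simultaneous loop. It is not the pandas implementation: the function name `sorted_indexer_non_unique` is hypothetical, it assumes both inputs are already sorted in ascending order, and it only mimics the shape of the result as I understand it (one position in `values` per match, `-1` for a target with no match, plus a list of the unmatched target positions).

```python
def sorted_indexer_non_unique(values, targets):
    """Two-pointer lookup of sorted `targets` in sorted `values`.

    Hypothetical helper for illustration only; both inputs must be
    sorted ascending. Returns (indexer, missing).
    """
    indexer = []   # positions in `values`, or -1 where a target is absent
    missing = []   # positions in `targets` that had no match
    n = len(values)
    i = 0          # start of the run in `values` that could match the target

    for j, target in enumerate(targets):
        # Skip values smaller than the current target. `i` never moves
        # backwards, so this part costs O(len(values)) over the whole loop.
        while i < n and values[i] < target:
            i += 1

        # Emit every position equal to the target. `i` itself stays put
        # so that a duplicated target can match the same run again.
        k = i
        while k < n and values[k] == target:
            indexer.append(k)
            k += 1

        if k == i:  # nothing matched this target
            indexer.append(-1)
            missing.append(j)

    return indexer, missing


indexer, missing = sorted_indexer_non_unique([1, 2, 2, 4], [2, 3, 4])
print(indexer, missing)
```

The total work is proportional to `len(values) + len(targets)` plus the number of emitted matches, so it stays close to linear as long as duplication is small compared to `n`, which is the assumption made above.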

Huge performance hit in get_indexer_non_unique #15364