New engine for MultiIndex?

Currently, ``MultiIndex.get_loc()`` and ``MultiIndex.get_indexer()`` both rely on an ``_engine`` that is either a ``MultiIndexObjectEngine`` or a ``MultiIndexHashEngine``, but both of these are thin layers over the flat ``ObjectEngine``. This means that the actual structure of labels and levels is completely discarded (except, e.g., for partial indexing; see ``_get_level_indexer()``).

In principle, a completely different scheme could be used:
- first look for each key element in the corresponding entry of ``levels``, and find its code
- then look for those codes in the ``labels``
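
A minimal sketch of the two steps above, using plain NumPy arrays as stand-ins for ``levels`` and ``labels``. The function name ``get_loc_by_codes`` and the sample data are made up for illustration, and step 2 here is a simple linear scan with boolean masks rather than a hash lookup, so this only shows the shape of the idea, not how an actual engine would be written. It also assumes each level is sorted and unique.

```python
import numpy as np

# Hypothetical stand-ins for a two-level MultiIndex: sorted unique values
# per level, and one integer code array per level (the ``.labels``).
# Together they encode: (a, 10), (a, 20), (b, 10), (c, 10), (c, 20).
levels = [np.array(["a", "b", "c"]), np.array([10, 20])]
labels = [np.array([0, 0, 1, 2, 2]), np.array([0, 1, 0, 0, 1])]


def get_loc_by_codes(key, levels, labels):
    """Return the positions at which ``key`` occurs, or raise KeyError."""
    mask = np.ones(len(labels[0]), dtype=bool)
    for value, level, codes in zip(key, levels, labels):
        # Step 1: translate the key element into its integer code.
        pos = np.searchsorted(level, value)
        if pos == len(level) or level[pos] != value:
            raise KeyError(key)
        # Step 2: vectorized integer comparison against this level's codes.
        mask &= codes == pos
    hits = np.flatnonzero(mask)
    if hits.size == 0:
        raise KeyError(key)
    return hits
```

Because each level is looked up with its own dtype, a key element is only ever compared against values of that level, never as part of an ``object`` tuple.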

In most cases, the second part should be the computationally expensive one. It would consist of running ``nlevels`` searches in arrays of ``dtype=int`` (the ``.labels``) rather than, as now, one search in an ``object`` array in which each element is actually a ``tuple`` of ``nlevels`` elements. My guess is that, thanks to vectorization, the former should be much faster than the latter.

Moreover (and maybe more importantly), with the current engine, fixing a bug such as #18485 is a nightmare. The same applies to
```
In [2]: (4, True) in pd.MultiIndex.from_tuples([(4, 1)])
Out[2]: True
```
and probably others. This is because even though levels are not mixed, the elements are compared as ``objects``.
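
The behaviour above can be reproduced without pandas, which suggests where it comes from: in Python, ``bool`` is a subclass of ``int``, so tuples holding ``True`` and ``1`` are equal and hash identically. The snippet below is only an illustration using a plain ``dict`` (the name ``lookup`` is made up); it does not touch the actual engine code.

```python
# bool is a subclass of int, so True and 1 compare equal and hash alike.
assert True == 1
assert hash(True) == hash(1)

# Tuples compare and hash element-wise, so these two keys cannot be told
# apart by any hash table that stores them as generic Python objects.
assert (4, True) == (4, 1)
assert hash((4, True)) == hash((4, 1))

lookup = {(4, 1): 0}
found = (4, True) in lookup
print(found)  # True
```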

One caveat is that the codes for a single level would very often be non-unique, and I'm not sure what impact this has with the current implementation of hash tables.

New engine for MultiIndex? #18519