Description
Currently, MultiIndex.get_loc() and MultiIndex.get_indexer() both rely on an _engine which is either a MultiIndexObjectEngine or a MultiIndexHashEngine, but both of these are thin layers over the flat ObjectEngine. This means that the actual structure of labels and levels is completely discarded (except e.g. for partial indexing, see _get_level_indexer()).
In principle, a completely different scheme could be used:
- first look for the key elements in levels, and find the corresponding code
- then look for the code in the labels
In most cases, the second part should be the computationally expensive one. It would consist of running nlevels searches in arrays of dtype=int (the .labels) rather than (as it is now) one search in an object array in which each element is actually a tuple of nlevels elements. My guess is that, thanks to vectorization, the former should be much faster than the latter.
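The two-step scheme can be sketched roughly as follows. This is only an illustration under stated assumptions, not pandas code: the names `key_to_codes` and `get_locs` are hypothetical, and plain Python lists stand in for the levels and for the integer label arrays (a real engine would run the per-level comparisons as vectorized operations on integer arrays).

```python
# Hypothetical stand-ins for MultiIndex.levels and MultiIndex.labels:
# levels[i] holds the unique values of level i, and labels[i] holds, for
# each row, the position of that row's value within levels[i].
levels = [["a", "b"], [1, 2, 3]]
labels = [[0, 0, 1, 1], [0, 2, 1, 2]]
# Rows encoded: ("a", 1), ("a", 3), ("b", 2), ("b", 3)


def key_to_codes(key, levels):
    """Step 1: translate each key element into its integer code."""
    codes = []
    for element, level in zip(key, levels):
        try:
            codes.append(level.index(element))
        except ValueError:
            # The element does not occur in this level at all.
            return None
    return codes


def get_locs(key, levels, labels):
    """Step 2: find the rows whose code matches on every level."""
    codes = key_to_codes(key, levels)
    if codes is None:
        raise KeyError(key)
    mask = [True] * len(labels[0])
    # One pass of integer comparisons per level (nlevels passes in total).
    for code, level_labels in zip(codes, labels):
        mask = [m and (lab == code) for m, lab in zip(mask, level_labels)]
    locs = [i for i, m in enumerate(mask) if m]
    if not locs:
        raise KeyError(key)
    return locs


print(get_locs(("b", 2), levels, labels))  # [2]
```

Step 1 only touches the (typically small) levels, while step 2 compares integers only, which is the part that would be expected to benefit from vectorization.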
Moreover (and maybe more importantly), with the current engine, fixing a bug such as #18485 is a nightmare. The same applies to
In [2]: (4, True) in pd.MultiIndex.from_tuples([(4, 1)])
Out[2]: True
and probably others. This is because, even though levels are not mixed, the elements are compared as objects, and as Python objects True and 1 compare (and hash) equal.
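The underlying Python behaviour can be reproduced without pandas. The snippet below is a minimal sketch: the first part shows why a tuple-based lookup cannot tell the two keys apart, and `strict_position` is a hypothetical helper (not part of pandas) illustrating what a per-level, type-aware lookup could do instead.

```python
# bool is a subclass of int, so True and 1 compare equal and hash equal.
print(True == 1, hash(True) == hash(1))  # True True

# Consequently, tuples that differ only in this respect are indistinguishable
# to any container that compares elements as generic Python objects.
print((4, True) == (4, 1))    # True
print((4, True) in {(4, 1)})  # True


def strict_position(element, level):
    """Hypothetical helper: position of element in level, or -1.

    Unlike list.index, a match requires the exact same type, so a bool
    never matches an int.
    """
    for pos, value in enumerate(level):
        if type(value) is type(element) and value == element:
            return pos
    return -1


int_level = [1, 2, 3]
print(strict_position(1, int_level))     # 0
print(strict_position(True, int_level))  # -1
```

With a lookup that goes through each level separately, a check like this would only need to run once per key element against the level, rather than being threaded through a comparison of whole tuples.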
One caveat is that the single levels would very often be non-unique, and I'm not sure what the impact of this is with the current implementation of hash tables.