Description
One area where the general ExtensionArray support is lacking is storing them in an Index (right now they get converted to an ndarray when stored in an Index) and having efficient indexing operations (hashtable, index engines).
Several of our long-standing extension dtypes have their own Index subclass (CategoricalIndex, PeriodIndex, DatetimeIndex, IntervalIndex), but we need to solve this generally for ExtensionArrays (so it also works for external EAs), and we should also focus on solving it well for the new nullable ExtensionArrays (which use masked arrays).
I think there are multiple aspects to this (probably more, but currently thinking of those):
1) Storing ExtensionArrays in an Index object
Supporting "just" storing EAs in the Index and supporting its methods (eg by falling back to an ndarray for the indexing engine) is probably not that hard. There are PRs #34159 (storing EAs in the base Index class) and #37869 (having a specific ExtensionIndex subclass).
I think the two approaches are technically not that different (putting the required special cases in if blocks in the base class vs in overridden methods in the subclass), but for me it's mainly a user API design discussion (summarized as: I don't think that end users should see an "ExtensionIndex").
So for this part, we should have that API discussion.
2) A protocol for specifying the values (ndarray) used for indexing operations
While for an initial version of support we can use np.asarray(EA) as the values passed to the IndexEngine, we should ideally have a general method in the EA interface to specify which values can be used for indexing.
There is some discussion related to this in #32586 and #33276 (eg can we re-use some of the existing _values_for_.. methods? ...), and we can probably continue this aspect over there.
A general method is mostly important for external EAs, because we will probably have special support for our own EAs: the existing Index subclasses already do this, and for the nullable EAs we need to add this (see next section below).
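To make the idea of such a protocol a bit more concrete, here is a rough sketch in plain Python. Everything in it is hypothetical and only for illustration: the hook name `_values_for_index`, the helper `values_for_engine`, and the toy classes are not pandas API, and `list(arr)` merely stands in for the np.asarray(EA) fallback mentioned above.

```python
# Hypothetical sketch: none of these names are pandas API.


class ToyEngine:
    """Stand-in for an index engine: maps each value to its first position."""

    def __init__(self, values):
        self._mapping = {}
        for pos, val in enumerate(values):
            # setdefault keeps the first position for duplicated values
            self._mapping.setdefault(val, pos)

    def get_loc(self, key):
        return self._mapping[key]  # raises KeyError if the key is absent


def values_for_engine(arr):
    """Choose the values handed to the engine.

    Use the (hypothetical) `_values_for_index` hook when the array defines
    it; otherwise fall back to a generic conversion, which plays the role
    of np.asarray(EA) in this toy setting.
    """
    hook = getattr(arr, "_values_for_index", None)
    if hook is not None:
        return hook()
    return list(arr)


class FallbackArray:
    """Toy array without the hook: it only supports iteration."""

    def __init__(self, data):
        self._data = list(data)

    def __iter__(self):
        return iter(self._data)


class HookedArray(FallbackArray):
    """Toy array that opts in to the hypothetical protocol."""

    def _values_for_index(self):
        # Hand over the backing storage directly, skipping the generic path.
        return self._data


engine_a = ToyEngine(values_for_engine(FallbackArray(["a", "b", "c"])))
engine_b = ToyEngine(values_for_engine(HookedArray([10, 20, 30])))
```

The point of the sketch is only the dispatch: an external EA that knows a cheap representation of its values can provide it, while every other EA keeps working through the generic conversion.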
3) Support for masked arrays in the indexing operations (IndexEngine, HashTable, etc)
Specifically to have better support for the nullable dtypes (without needing to convert to ndarray), I think we should look into adding support for using masks in the low-level index operations (IndexEngine, HashTable, etc).
Some (not index-related) hashtable methods like HashTable.unique already have optional support for masks.
I think this is technically the most challenging item, and what this work item would entail needs to be worked out in more detail.
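As a starting point for that discussion, a minimal pure-Python sketch of what "using masks" could mean for a hashtable: the table receives the values together with a boolean mask and tracks missing entries separately, so the placeholder stored under a masked position never collides with a real value. The class and method names (`MaskedToyHashTable`, `build`, `lookup`, `lookup_na`, `unique_masked`) are made up for illustration and say nothing about how the actual Cython hashtables are or should be implemented.

```python
# Hypothetical sketch: a dict-based toy, not the pandas hashtable code.


class MaskedToyHashTable:
    """Toy hashtable over (values, mask); mask[i] is True for missing."""

    def __init__(self):
        self._table = {}
        self._na_position = -1  # -1 means no missing value seen so far

    def build(self, values, mask):
        for pos, (val, is_na) in enumerate(zip(values, mask)):
            if is_na:
                # Masked slots never touch the dict, so their placeholder
                # value cannot shadow a real value.
                if self._na_position == -1:
                    self._na_position = pos
            else:
                self._table.setdefault(val, pos)

    def lookup(self, val):
        return self._table[val]  # raises KeyError if absent

    def lookup_na(self):
        if self._na_position == -1:
            raise KeyError("no missing value in table")
        return self._na_position


def unique_masked(values, mask):
    """Return unique (values, mask) in order of first appearance."""
    seen = set()
    seen_na = False
    out_values, out_mask = [], []
    for val, is_na in zip(values, mask):
        if is_na:
            if not seen_na:
                seen_na = True
                out_values.append(0)  # placeholder, ignored because masked
                out_mask.append(True)
        elif val not in seen:
            seen.add(val)
            out_values.append(val)
            out_mask.append(False)
    return out_values, out_mask


# Position 1 is missing (its stored 0 is only a placeholder);
# position 3 holds a real 0.
values = [1, 0, 2, 0, 1]
mask = [False, True, False, False, False]

table = MaskedToyHashTable()
table.build(values, mask)
uniques, uniques_mask = unique_masked(values, mask)
```

Here `table.lookup(0)` finds the real zero at position 3 while `table.lookup_na()` finds the missing entry at position 1, which is the distinction a conversion to an object ndarray would otherwise be needed for.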