ENH: Support ExtensionArray (and masked EAs specifically) in indexing #39133

One area where general ExtensionArray support is lacking is storing them in an index (right now they get converted to an ndarray when stored in an `Index`) and having efficient indexing operations on them (hashtables, index engines).

Several of our long-standing extension dtypes have their own Index subclass (CategoricalIndex, PeriodIndex, DatetimeIndex, IntervalIndex), but we need to solve this *generally* for ExtensionArrays (so it can also work for external EAs), and we should also focus on solving it well for the new nullable ExtensionArrays (which use masked arrays).

I think there are multiple aspects to this (probably more, but these are the ones I am currently thinking of):

### 1) Storing ExtensionArrays in an Index object

"Just" storing EAs in the Index and supporting its methods (while eg falling back to ndarray for the indexing engine) is probably not that hard. There are two PRs: https://github.com/pandas-dev/pandas/pull/34159 (storing EAs in the base Index class) and https://github.com/pandas-dev/pandas/pull/37869 (having a specific ExtensionIndex subclass).

I think both approaches are *technically* not that different (put the required special cases in `if` blocks in the base class vs in overridden methods in the subclass), but for me it's mainly a **user API** design discussion (summarized as: I don't think that end users should see an "ExtensionIndex").
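
To make the "technically not that different" point concrete, here is a rough toy sketch of the two layouts. None of these classes are pandas classes: `ToyExtensionArray`, `IndexA`, `IndexB` and `ExtensionIndexB` are hypothetical stand-ins that only show where the special-casing would live and what type the user ends up seeing.

```python
# Toy stand-ins only: these are NOT the pandas classes.

class ToyExtensionArray:
    """Hypothetical minimal extension array wrapping a Python list."""

    def __init__(self, values):
        self._values = list(values)

    def take(self, positions):
        return ToyExtensionArray(self._values[i] for i in positions)

    def to_list(self):
        return list(self._values)


# Approach A: special cases live in `if` blocks in the base class.
class IndexA:
    def __init__(self, data):
        self._data = data

    def take(self, positions):
        if isinstance(self._data, ToyExtensionArray):
            return IndexA(self._data.take(positions))
        return IndexA([self._data[i] for i in positions])

    def to_list(self):
        if isinstance(self._data, ToyExtensionArray):
            return self._data.to_list()
        return list(self._data)


# Approach B: special cases live in overridden methods of a subclass.
class IndexB:
    def __init__(self, data):
        self._data = data

    def take(self, positions):
        return type(self)([self._data[i] for i in positions])

    def to_list(self):
        return list(self._data)


class ExtensionIndexB(IndexB):
    def take(self, positions):
        return type(self)(self._data.take(positions))

    def to_list(self):
        return self._data.to_list()


ea = ToyExtensionArray([10, 20, 30])
a = IndexA(ea).take([2, 0])
b = ExtensionIndexB(ea).take([2, 0])
print(type(a).__name__, a.to_list())  # IndexA [30, 10]
print(type(b).__name__, b.to_list())  # ExtensionIndexB [30, 10]
```

The results are identical; the only visible difference is the class name the user sees, which is the API question above.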

So for this part, we should have that API discussion.

### 2) A protocol for specifying the values (ndarray) used for indexing operations

While an initial version of this support can use `np.asarray(EA)` as the values passed to the `IndexEngine`, we should ideally have a general method in the EA interface that lets an array specify which values can be used for indexing.
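
As a rough illustration of what such a protocol could look like (a sketch only, not a proposal for the actual name): the helper below first looks for a hook on the array and otherwise falls back to `np.asarray`. The hook name `_values_for_index`, the helper `values_for_index_engine` and the `ToyCodesArray` class are all made up for this example and do not exist in pandas.

```python
import numpy as np


def values_for_index_engine(arr):
    """Return the ndarray that would be handed to an index engine.

    `_values_for_index` is a hypothetical hook name used for illustration.
    """
    hook = getattr(arr, "_values_for_index", None)
    if hook is not None:
        return np.asarray(hook())
    # Initial-version fallback: generic conversion to ndarray.
    return np.asarray(arr)


class ToyCodesArray:
    """Hypothetical array storing integer codes into a list of labels."""

    def __init__(self, codes, categories):
        self.codes = np.asarray(codes, dtype="int64")
        self.categories = list(categories)

    def __array__(self, dtype=None, copy=None):
        # Generic conversion materializes the labels as an object ndarray.
        return np.array([self.categories[c] for c in self.codes], dtype=object)

    def _values_for_index(self):
        # Cheaper representation for hashing and lookups: the integer codes.
        return self.codes


arr = ToyCodesArray([0, 1, 0], ["a", "b"])
engine_values = values_for_index_engine(arr)
print(engine_values.dtype, engine_values.tolist())  # int64 [0, 1, 0]
print(np.asarray(arr).tolist())                     # ['a', 'b', 'a']
```

The point of the hook is that the array, not the Index, decides which ndarray is cheap and correct to hash; an array without the hook still works through the `np.asarray` fallback.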

There is some discussion related to this in https://github.com/pandas-dev/pandas/issues/32586 and https://github.com/pandas-dev/pandas/issues/33276 (eg can we re-use some of the existing `_values_for_..` methods? ...), and we can probably continue that part of the discussion over there.

A general method is mostly important for external EAs, because we will probably have special support for our own EAs: the existing Index subclasses already do this, and for the nullable EAs we need to add this (see next section below).

### 3) Support for masked arrays in the indexing operations (IndexEngine, HashTable, etc)

Specifically to get better support for the nullable dtypes (without needing to convert to ndarray), I think we should look into adding support for using masks in the low-level index operations (IndexEngine, HashTable, etc).

Some (non-index-related) hashtable methods like `HashTable.unique` already have optional support for masks.

I think this is technically the most challenging item, and we need to work out in more detail what it would entail.
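
To give a feel for what "using masks in the lookup" means, here is a small pure-Python sketch of a mask-aware `get_loc`. It is only a toy: `ToyMaskedEngine` and its `key_is_na` argument are invented for this example, it uses a plain `dict` instead of the Cython hashtables, and it only tracks the first occurrence of each value.

```python
import numpy as np


class ToyMaskedEngine:
    """Toy lookup over a (data, mask) pair; not the pandas IndexEngine."""

    def __init__(self, data, mask):
        self.data = np.asarray(data)
        self.mask = np.asarray(mask, dtype=bool)
        self._table = {}   # value -> first position where it is valid
        self._na_loc = -1  # first masked position, or -1 if there is none
        pairs = zip(self.data.tolist(), self.mask.tolist())
        for i, (value, is_na) in enumerate(pairs):
            if is_na:
                # The value under the mask is a placeholder; never hash it.
                if self._na_loc == -1:
                    self._na_loc = i
            else:
                self._table.setdefault(value, i)

    def get_loc(self, key, key_is_na=False):
        if key_is_na:
            if self._na_loc == -1:
                raise KeyError("NA")
            return self._na_loc
        return self._table[key]


# Position 1 is missing; the 0 stored there is only a placeholder.
engine = ToyMaskedEngine(data=[1, 0, 3], mask=[False, True, False])
print(engine.get_loc(3))                   # 2
print(engine.get_loc(None, key_is_na=True))  # 1
```

Looking up `0` raises `KeyError` here, because the only `0` in `data` sits under the mask. Getting that behaviour without first converting to an object ndarray is the part that needs support in the low-level code.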

cc @jbrockmendel @TomAugspurger @jreback 

