Description
On a call yesterday with some of the cuDF maintainers, the question came up of why they haven't implemented an ExtensionArray. They pointed to operations where we convert* to numpy (which is very expensive for their hypothetical EA), in particular groupby construction and merge.
\* Not actually doing `EA.to_numpy()`, but having `EA.factorize` or `EA.argsort` return ndarrays in these cases means moving everything from GPU to CPU. Potential modin or dask distributed EAs would have analogous pain points.
I said something to the effect of: "if you implemented EAs, the pandas team would be very much on board with helping make sure it worked". In retrospect I should have spoken only for myself, so I want to ask: how do folks feel about extending the EA interface in order to make GPU/Distributed EAs viable? cc @pandas-dev/pandas-core
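To make the footnoted pain point concrete, here is a minimal sketch of the problem as I understand it. `GpuArray` and its `to_host()` method are made up for illustration; the point is just that the current `factorize`/`argsort` contracts pin the return type to numpy, forcing a device-to-host copy:

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray


class GpuArray(ExtensionArray):
    """Hypothetical GPU-backed EA; only the relevant methods are sketched."""

    def to_host(self) -> np.ndarray:
        # made-up method: copy device memory back to a host ndarray (expensive)
        ...

    def factorize(self, use_na_sentinel: bool = True):
        # The EA interface says: return (codes: ndarray[intp], uniques: ExtensionArray).
        # Because the codes must be an ndarray, groupby construction drags the
        # data off the GPU even though the rest of the column could stay there.
        codes, uniques = pd.factorize(self.to_host(), use_na_sentinel=use_na_sentinel)
        return codes, type(self)._from_sequence(uniques)

    def argsort(self, *, ascending=True, kind="quicksort", na_position="last", **kwargs):
        # Same constraint: the result is pinned to ndarray[intp].
        return np.argsort(self.to_host(), kind="stable")
```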
Some thoughts on what this might entail:
- groupby construction produces an `ndarray[intp]` of labels assigning each row (focusing only on `axis=0`) to a group (a sketch of the simple case follows below the list).
  - that construction has roughly one zillion cases.
  - in the simple case of `df.groupby(col)`, it is mostly `df[col].factorize(sort=False)`.
  - we'd need to let `.factorize` return something other than an ndarray, probably another EA.
  - That `ndarray[intp]` currently lives in `BaseGrouper.group_info[0]`. So any place that currently uses that would need to be adapted, e.g. the `ids` keyword in `_groupby_op` introduced in REF: let EAs override WrappedCythonOp groupby implementations #51166.
  - Some of that adaptation we're going to need to do anyway if we want to support pyarrow dtypes without converting to numpy. (again xref REF: let EAs override WrappedCythonOp groupby implementations #51166)
- merge code I haven't looked into as closely.
  - in a lot of it we convert to numpy and then call our libjoin functions.
  - so we could plausibly let EAs specify something other than those libjoin functions to use (a hypothetical hook is sketched below the list).
- IndexEngine - we have a non-performant EA engine and a performant MaskedEngine. In principle we could allow EAs to bring their own (also sketched below the list).
- Window - no idea what this would take.
  - I don't know this code that well, but my best vibes-based guess is that it mostly works with numpy dtypes. If we're going to support pyarrow dtypes, can we support general EAs?
  - xref ENH: Add masked support for rolling operations #50449 for masked dtypes.
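As referenced in the groupby item above, here is a minimal sketch of the simple case using only public API; the thing to notice is that the group labels come back as an `ndarray[intp]` regardless of what backs the column:

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "a", "c", "b"], "val": range(5)})

# The simple df.groupby("key") case is mostly this factorize call; the codes
# are what ends up in BaseGrouper.group_info[0].
codes, uniques = df["key"].factorize(sort=False)
print(codes)    # [0 1 0 2 1]  <- ndarray[intp]
print(uniques)  # Index(['a', 'b', 'c'], dtype='object')

# The extension being floated: allow factorize (or a parallel hook) to return
# the codes as another EA, so a GPU/distributed column never has to materialize
# them as a numpy array.
```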
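For the merge item, a sketch of what "let EAs specify something other than those libjoin functions" could look like. The hook name `_join_indexers` is invented here purely to illustrate the shape of the extension point; nothing like it exists today:

```python
class GpuArray:  # stand-in for a GPU-backed ExtensionArray
    def _join_indexers(self, other: "GpuArray", how: str = "inner"):
        """Hypothetical hook: compute (left_indexer, right_indexer) on-device.

        Today merge roughly does np.asarray(left), np.asarray(right) and calls
        the cython routines in pandas._libs.join; a hook like this would let
        the EA keep the join keys (and the hash table) on the GPU and only
        hand the indexers back to pandas.
        """
        raise NotImplementedError
```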
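And for the IndexEngine item, equally hypothetical: a made-up attribute through which an EA could point the Index machinery at its own engine rather than falling back to the generic (slow) extension engine:

```python
class GpuIndexEngine:
    """Would implement get_loc / get_indexer-style lookups against device-resident data."""


class GpuArray:  # stand-in for a GPU-backed ExtensionArray
    # invented attribute name; today the engine choice is hard-coded inside Index
    _index_engine_type = GpuIndexEngine
```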
Some potential reasons not to do this:
- In the groupby case in particular, the data locality (either for GPU or distributed) needs to be the same for your group labels and each of your columns if you want to be performant, i.e. your columns need to be all-GPU or all-distributed. Maybe EAs aren't the right abstraction for that?
- Do we draw the line somewhere? Plotting? I/O?
- Early on we wanted to keep the EA namespace limited. This could make it significantly larger.