Description
On a call yesterday with some of the cuDF maintainers, the question came up of why they haven't implemented an ExtensionArray. They pointed to operations where we convert* to numpy (which is very expensive for their hypothetical EA), in particular groupby construction and merge.
\* Not actually doing `EA.to_numpy()`, but having `EA.factorize` or `EA.argsort` return ndarrays in these cases means moving everything from GPU to CPU. Potential modin or dask distributed EAs would have analogous pain points.
I said something to the effect of: "if you implemented EAs, the pandas team would be very much on board with helping make sure it worked". In retrospect I should have spoken only for myself, so I want to ask: how do folks feel about extending the EA interface in order to make GPU/Distributed EAs viable? cc @pandas-dev/pandas-core
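To make the footnoted pain point concrete, here is a minimal sketch of the problem as I understand it. `GpuArray` and its `to_host()` method are made up for illustration; the point is just that the current `factorize`/`argsort` contracts pin the return type to numpy, forcing a device-to-host copy:

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray


class GpuArray(ExtensionArray):
    """Hypothetical GPU-backed EA; only the relevant methods are sketched."""

    def to_host(self) -> np.ndarray:
        # made-up method: copy device memory back to a host ndarray (expensive)
        ...

    def factorize(self, use_na_sentinel: bool = True):
        # The EA interface says: return (codes: ndarray[intp], uniques: ExtensionArray).
        # Because the codes must be an ndarray, groupby construction drags the
        # data off the GPU even though the rest of the column could stay there.
        codes, uniques = pd.factorize(self.to_host(), use_na_sentinel=use_na_sentinel)
        return codes, type(self)._from_sequence(uniques)

    def argsort(self, *, ascending=True, kind="quicksort", na_position="last", **kwargs):
        # Same constraint: the result is pinned to ndarray[intp].
        return np.argsort(self.to_host(), kind="stable")
```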
Some thoughts on what this might entail:
- groupby construction produces an `ndarray[intp]` of labels assigning each row (focusing only on `axis=0`) to a group (a sketch of the simple case follows below the list).
  - that construction has roughly one zillion cases.
  - in the simple case of `df.groupby(col)`, it is mostly `df[col].factorize(sort=False)`.
  - we'd need to let `.factorize` return something other than an ndarray, probably another EA.
  - That `ndarray[intp]` currently lives in `BaseGrouper.group_info[0]`. So any place that currently uses that would need to be adapted, e.g. the `ids` keyword in `_groupby_op` introduced in REF: let EAs override WrappedCythonOp groupby implementations #51166.
  - Some of that adaptation we're going to need to do anyway if we want to support pyarrow dtypes without converting to numpy. (again xref REF: let EAs override WrappedCythonOp groupby implementations #51166)
- merge code I haven't looked into as closely.
  - in a lot of it we convert to numpy and then call our libjoin functions.
  - so we could plausibly let EAs specify something other than those libjoin functions to use (a hypothetical hook is sketched below the list).
- IndexEngine - we have a non-performant EA engine and a performant MaskedEngine. In principle we could allow EAs to bring their own (also sketched below the list).
- Window - no idea what this would take.
  - I don't know this code that well, but my best vibes-based guess is that it mostly works with numpy dtypes. If we're going to support pyarrow dtypes, can we support general EAs?
  - xref ENH: Add masked support for rolling operations #50449 for masked dtypes.
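As referenced in the groupby item above, here is a minimal sketch of the simple case using only public API; the thing to notice is that the group labels come back as an `ndarray[intp]` regardless of what backs the column:

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "a", "c", "b"], "val": range(5)})

# The simple df.groupby("key") case is mostly this factorize call; the codes
# are what ends up in BaseGrouper.group_info[0].
codes, uniques = df["key"].factorize(sort=False)
print(codes)    # [0 1 0 2 1]  <- ndarray[intp]
print(uniques)  # Index(['a', 'b', 'c'], dtype='object')

# The extension being floated: allow factorize (or a parallel hook) to return
# the codes as another EA, so a GPU/distributed column never has to materialize
# them as a numpy array.
```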
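For the merge item, a sketch of what "let EAs specify something other than those libjoin functions" could look like. The hook name `_join_indexers` is invented here purely to illustrate the shape of the extension point; nothing like it exists today:

```python
class GpuArray:  # stand-in for a GPU-backed ExtensionArray
    def _join_indexers(self, other: "GpuArray", how: str = "inner"):
        """Hypothetical hook: compute (left_indexer, right_indexer) on-device.

        Today merge roughly does np.asarray(left), np.asarray(right) and calls
        the cython routines in pandas._libs.join; a hook like this would let
        the EA keep the join keys (and the hash table) on the GPU and only
        hand the indexers back to pandas.
        """
        raise NotImplementedError
```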
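And for the IndexEngine item, equally hypothetical: a made-up attribute through which an EA could point the Index machinery at its own engine rather than falling back to the generic (slow) extension engine:

```python
class GpuIndexEngine:
    """Would implement get_loc / get_indexer-style lookups against device-resident data."""


class GpuArray:  # stand-in for a GPU-backed ExtensionArray
    # invented attribute name; today the engine choice is hard-coded inside Index
    _index_engine_type = GpuIndexEngine
```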
Some potential reasons not to do this:
- In the groupby case in particular, the data locality (either for GPU or distributed) needs to be the same for your group labels and each of your columns if you want to be performant, i.e. your columns need to be all-GPU or all-distributed. Maybe EAs aren't the right abstraction for that?
- Do we draw the line somewhere? Plotting? I/O?
- Early on we wanted to keep the EA namespace limited. This could make it significantly larger.