ENH: Extending EAs #51471

Open
@jbrockmendel

Description

On a call yesterday with some of the cuDF maintainers, the question came up of why they haven't implemented an ExtensionArray (EA). They pointed to operations where we convert* to numpy, which would be very expensive for their hypothetical EA, in particular groupby construction and merge.

* We are not actually calling EA.to_numpy(), but having EA.factorize or EA.argsort return ndarrays in these cases means moving everything from the GPU to the CPU. Potential modin or dask distributed EAs would have analogous pain points.
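To make the footnote concrete, here is a minimal sketch using the built-in nullable `Int64` array as a stand-in for a third-party EA (the sample values are made up). It only illustrates that the results of `factorize` and `argsort` come back as host-side NumPy arrays, which is where a GPU-backed EA would have to copy data off the device.

```python
import numpy as np
import pandas as pd

# A built-in masked EA, used here only as a stand-in for a GPU-backed one.
arr = pd.array([3, 1, 3, None, 1], dtype="Int64")

# factorize returns (codes, uniques): the codes are a NumPy array,
# only the uniques stay in the EA's own type.
codes, uniques = arr.factorize()

# argsort also returns a NumPy array of positions.
order = arr.argsort()

print(type(codes).__name__, codes.dtype)
print(type(uniques).__name__)
print(type(order).__name__)
```

For an array whose data lives on a GPU or across a cluster, producing `codes` and `order` in this form implies a device-to-host transfer on every call.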

I said something to the effect of: "if you implemented EAs, the pandas team would be very much on board with helping make sure it worked". In retrospect I should have spoken only for myself, so I want to ask: how do folks feel about extending the EA interface in order to make GPU/Distributed EAs viable? cc @pandas-dev/pandas-core

Some thoughts on what this might entail:

  1. groupby construction produces an ndarray[intp] of labels assigning each row (focusing only on axis=0) to a group.

  2. merge - I haven't looked into this code as closely

    • in a lot of it we convert to numpy and then call our libjoin functions.
    • so we could plausibly let EAs specify something other than those libjoin functions to use.
  3. IndexEngine - we have a non-performant EA engine and a performant MaskedEngine. In principle we could allow EAs to bring their own.

  4. Window - no idea what this would take.
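One possible shape for the "bring their own" idea in the list above is an optional hook with a host-side fallback. The sketch below is purely illustrative: `_group_labels`, `DeviceLabels`, `HostArray`, `DeviceArray`, and `group_labels` are hypothetical names invented for this example, not part of pandas or the EA interface, and plain Python lists stand in for real array storage.

```python
from typing import Any, List


def default_labels(values: List[Any]) -> List[int]:
    """Fallback path: host-side labels, numbered by first appearance."""
    seen: dict = {}
    return [seen.setdefault(v, len(seen)) for v in values]


class DeviceLabels:
    """Hypothetical container for labels that stay on the GPU/cluster."""

    def __init__(self, codes: List[int]) -> None:
        self.codes = codes


class HostArray:
    """Stand-in for an ordinary array with no special hook."""

    def __init__(self, values: List[Any]) -> None:
        self.values = list(values)


class DeviceArray:
    """Stand-in for a GPU/distributed array that brings its own routine."""

    def __init__(self, values: List[Any]) -> None:
        self.values = list(values)

    def _group_labels(self) -> DeviceLabels:  # hypothetical hook name
        # A real implementation would compute this on the device;
        # here we reuse the fallback and wrap the result.
        return DeviceLabels(default_labels(self.values))


def group_labels(arr: Any) -> Any:
    """Use the array's own hook if it defines one, else the fallback."""
    hook = getattr(arr, "_group_labels", None)
    if hook is not None:
        return hook()
    return default_labels(arr.values)


host_result = group_labels(HostArray(["a", "b", "a"]))
device_result = group_labels(DeviceArray(["a", "b", "a"]))
```

The same dispatch pattern could in principle apply to the join functions or an index engine, at the cost of a larger EA namespace.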

Some potential reasons not to do this:

  1. In the groupby case in particular, the data-locality (either for GPU or distributed) needs to be the same for your group labels and each of your columns if you want to be performant. i.e. your columns need to be all-GPU or all-distributed. Maybe EAs aren't the right abstraction for that?

  2. Do we draw the line somewhere? Plotting? I/O?

  3. Early on we wanted to keep the EA namespace limited. This could make it significantly larger.

Labels: ExtensionArray (Extending pandas with custom dtypes or arrays), Needs Discussion (Requires discussion from core team before further action)
