ENH/PERF: ExtensionArray should offer a duplicated function #48747

Closed
@tehunter

Description

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

In version 1.5.0, functions that use Series.duplicated (including DataFrame.duplicated with a single-column subset, and .drop_duplicates) go through pd.core.algorithms.duplicated, which calls _ensure_data. There is currently no way for an ExtensionArray to provide its own duplicated behavior, so when execution reaches pd.core.algorithms._ensure_data, pandas may be forced to fall back on np.asarray(values, dtype=object) if the ExtensionArray is not a coerceable type. Here's the docstring for _ensure_data:

def _ensure_data(values: ArrayLike) -> np.ndarray:
    """
    routine to ensure that our data is of the correct
    input dtype for lower-level routines

    This will coerce:
    - ints -> int64
    - uint -> uint64
    - bool -> uint8
    - datetimelike -> i8
    - datetime64tz -> i8 (in local tz)
    - categorical -> codes

    Parameters
    ----------
    values : np.ndarray or ExtensionArray

    Returns
    -------
    np.ndarray
    """

This np.asarray call can be very expensive. For example, I have an ExtensionArray with several million rows backed by 10 Categorical/numerical arrays. np.asarray uses __iter__ to loop through my array and construct an np.ndarray of base Python objects, which is a very expensive operation. Unfortunately, I have no way of hinting to pandas that I have much more efficient, vectorized algorithms for computing the duplicates.
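As a rough, self-contained illustration of the object fallback (using a built-in nullable dtype purely for demonstration; pandas has fast paths for its own arrays, but the shape of the conversion is the same for a third-party ExtensionArray):

```python
import numpy as np
import pandas as pd

# An ExtensionArray that a plain ndarray cannot represent losslessly.
arr = pd.array(range(100_000), dtype="Int64")

# The fallback path: every element is materialized as a Python-level
# object, one at a time, before the hashing/duplicate machinery runs.
obj = np.asarray(arr, dtype=object)

assert obj.dtype == object
assert len(obj) == len(arr)
```

For an array backed by several compact columns, this per-element boxing dominates the runtime of what is otherwise a vectorizable operation.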

Feature Description

Add a duplicated method to ExtensionArray with the following signature:

def duplicated(
    self, keep: Literal["first", "last", False] = "first"
) -> npt.NDArray[np.bool_]:
    # Return a boolean array indicating duplicated values in the ExtensionArray.
    # The default delegates to the existing internal algorithm (note: `self`,
    # not `self._values` -- an ExtensionArray has no `_values` attribute);
    # subclasses override this with an efficient implementation.
    return pd.core.algorithms.duplicated(self, keep=keep)
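For reference, the default delegation would preserve today's observable Series behavior for all three `keep` modes:

```python
import pandas as pd

# Current Series.duplicated semantics the proposed default would keep.
s = pd.Series(["a", "b", "a"])

first = s.duplicated(keep="first").tolist()  # mark all but first occurrence
last = s.duplicated(keep="last").tolist()    # mark all but last occurrence
none = s.duplicated(keep=False).tolist()     # mark every repeated value

assert first == [False, False, True]
assert last == [True, False, False]
assert none == [True, False, True]
```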

Then have IndexOpsMixin call that method instead of calling pd.core.algorithms.duplicated directly:

    @final
    def _duplicated(
        self, keep: Literal["first", "last", False] = "first"
    ) -> npt.NDArray[np.bool_]:
        # Since self._values can be ExtensionArray or np.ndarray, may need to add a type check here
        # and fall back on pd.core.algorithms.duplicated if an np.ndarray
        return self._values.duplicated(keep=keep)

Alternative Solutions

Currently, for computing duplicated with a multi-column subset, pandas routes each column through algorithms.factorize and then passes the codes through get_group_index. The Series duplicated path could instead implement a factorize-based algorithm for custom ExtensionArrays. This may eliminate the need for users to write their own duplicated check if they already have a custom factorize in place.
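The factorize-based alternative can be sketched as follows (a hypothetical helper, not pandas API; it assumes no missing values, since pd.factorize encodes NA as -1, which np.bincount rejects):

```python
import numpy as np
import pandas as pd


def duplicated_via_factorize(values, keep="first"):
    # Build a duplicate mask from factorize codes: any ExtensionArray with
    # an efficient factorize would get an efficient duplicated for free.
    codes, _ = pd.factorize(values)
    codes = np.asarray(codes)
    if keep is False:
        # Mark every member of a group that occurs more than once.
        counts = np.bincount(codes)
        return counts[codes] > 1
    if keep == "last":
        codes = codes[::-1]
    # Mark everything except the first occurrence of each code.
    _, first_idx = np.unique(codes, return_index=True)
    mask = np.ones(len(codes), dtype=bool)
    mask[first_idx] = False
    if keep == "last":
        mask = mask[::-1]
    return mask


vals = ["a", "b", "a", "c", "b", "a"]
assert duplicated_via_factorize(vals, "first").tolist() == [False, False, True, False, True, True]
assert duplicated_via_factorize(vals, "last").tolist() == [True, True, True, False, False, False]
assert duplicated_via_factorize(vals, False).tolist() == [True, True, True, False, True, True]
```

Because the mask is derived purely from integer codes, the only ExtensionArray-specific work is the factorize call itself, which is exactly the hook custom arrays already implement.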

Additional Context

This came out of PR #45534 as a result of #45236, so this might be viewed as a v1.5 regression; I hadn't implemented this ExtensionArray feature in my own code prior to 1.5 so I haven't backtested. It's also potentially related to #27035 and #15929 as well.

Metadata

Assignees

No one assigned

    Labels

Enhancement, Needs Triage (issue that has not been reviewed by a pandas team member)
