ENH/PERF: ExtensionArray should offer a duplicated function #48747

Closed
@tehunter

Description

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

In version 1.5.0, functions that use Series.duplicated (including DataFrame.duplicated with a single-column subset, and .drop_duplicates) go through pd.core.algorithms.duplicated, which calls _ensure_data. There is currently no way for an ExtensionArray to provide its own duplicated behavior, so when execution reaches pd.core.algorithms._ensure_data, pandas may be forced to fall back on np.asarray(values, dtype=object) if the ExtensionArray is not a coerceable type. Here's the docstring for _ensure_data:

def _ensure_data(values: ArrayLike) -> np.ndarray:
    """
    routine to ensure that our data is of the correct
    input dtype for lower-level routines

    This will coerce:
    - ints -> int64
    - uint -> uint64
    - bool -> uint8
    - datetimelike -> i8
    - datetime64tz -> i8 (in local tz)
    - categorical -> codes

    Parameters
    ----------
    values : np.ndarray or ExtensionArray

    Returns
    -------
    np.ndarray
    """

This np.asarray call can be very expensive. For example, I have an ExtensionArray with several million rows backed by 10 Categorical/numerical arrays. np.asarray uses __iter__ to loop through my array and construct an np.ndarray of base Python objects, which is a very expensive operation. Unfortunately, I have no way of hinting to pandas that I have much more efficient, vectorized algorithms for computing the duplicates.
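As a rough, self-contained illustration of the object fallback (using a built-in nullable dtype purely for demonstration; pandas has fast paths for its own arrays, but the shape of the conversion is the same for a third-party ExtensionArray):

```python
import numpy as np
import pandas as pd

# An ExtensionArray that a plain ndarray cannot represent losslessly.
arr = pd.array(range(100_000), dtype="Int64")

# The fallback path: every element is materialized as a Python-level
# object, one at a time, before the hashing/duplicate machinery runs.
obj = np.asarray(arr, dtype=object)

assert obj.dtype == object
assert len(obj) == len(arr)
```

For an array backed by several compact columns, this per-element boxing dominates the runtime of what is otherwise a vectorizable operation.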

Feature Description

Add a duplicated method to ExtensionArray with the following signature:

def duplicated(
    self, keep: Literal["first", "last", False] = "first"
) -> npt.NDArray[np.bool_]:
    # Return a boolean array indicating duplicated values in the ExtensionArray.
    # The default delegates to the existing internal algorithm (note: `self`,
    # not `self._values` -- an ExtensionArray has no `_values` attribute);
    # subclasses override this with an efficient implementation.
    return pd.core.algorithms.duplicated(self, keep=keep)
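For reference, the default delegation would preserve today's observable Series behavior for all three `keep` modes:

```python
import pandas as pd

# Current Series.duplicated semantics the proposed default would keep.
s = pd.Series(["a", "b", "a"])

first = s.duplicated(keep="first").tolist()  # mark all but first occurrence
last = s.duplicated(keep="last").tolist()    # mark all but last occurrence
none = s.duplicated(keep=False).tolist()     # mark every repeated value

assert first == [False, False, True]
assert last == [True, False, False]
assert none == [True, False, True]
```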

Then have IndexOpsMixin call that method instead of calling pd.core.algorithms.duplicated directly:

    @final
    def _duplicated(
        self, keep: Literal["first", "last", False] = "first"
    ) -> npt.NDArray[np.bool_]:
        # Since self._values can be ExtensionArray or np.ndarray, may need to add a type check here
        # and fall back on pd.core.algorithms.duplicated if an np.ndarray
        return self._values.duplicated(keep=keep)

Alternative Solutions

Currently, for computing duplicated with a multi-column subset, pandas routes each column through algorithms.factorize and then passes the codes through get_group_index. The Series duplicated path could instead implement a factorize-based algorithm for custom ExtensionArrays. This may eliminate the need for users to write their own duplicated check if they already have a custom factorize in place.
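The factorize-based alternative can be sketched as follows (a hypothetical helper, not pandas API; it assumes no missing values, since pd.factorize encodes NA as -1, which np.bincount rejects):

```python
import numpy as np
import pandas as pd


def duplicated_via_factorize(values, keep="first"):
    # Build a duplicate mask from factorize codes: any ExtensionArray with
    # an efficient factorize would get an efficient duplicated for free.
    codes, _ = pd.factorize(values)
    codes = np.asarray(codes)
    if keep is False:
        # Mark every member of a group that occurs more than once.
        counts = np.bincount(codes)
        return counts[codes] > 1
    if keep == "last":
        codes = codes[::-1]
    # Mark everything except the first occurrence of each code.
    _, first_idx = np.unique(codes, return_index=True)
    mask = np.ones(len(codes), dtype=bool)
    mask[first_idx] = False
    if keep == "last":
        mask = mask[::-1]
    return mask


vals = ["a", "b", "a", "c", "b", "a"]
assert duplicated_via_factorize(vals, "first").tolist() == [False, False, True, False, True, True]
assert duplicated_via_factorize(vals, "last").tolist() == [True, True, True, False, False, False]
assert duplicated_via_factorize(vals, False).tolist() == [True, True, True, False, True, True]
```

Because the mask is derived purely from integer codes, the only ExtensionArray-specific work is the factorize call itself, which is exactly the hook custom arrays already implement.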

Additional Context

This came out of PR #45534 as a result of #45236, so this might be viewed as a v1.5 regression; I hadn't implemented this ExtensionArray feature in my own code prior to 1.5 so I haven't backtested. It's also potentially related to #27035 and #15929 as well.

Metadata

Assignees

No one assigned

    Labels

Enhancement, Needs Triage (issue that has not been reviewed by a pandas team member)
