Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
In version 1.5.0, functions that use Series.duplicated (including DataFrame.duplicated with a single-column subset, and .drop_duplicates) go through pd.core.algorithms.duplicated, which calls _ensure_data. Currently there is no method for an ExtensionArray to offer its own duplicated behavior, so when execution reaches pd.core.algorithms._ensure_data, it may be forced to fall back on np.asarray(values, dtype=object) if the ExtensionArray is not a coerceable type. Here's the docstring for _ensure_data:
def _ensure_data(values: ArrayLike) -> np.ndarray:
    """
    routine to ensure that our data is of the correct
    input dtype for lower-level routines

    This will coerce:
    - ints -> int64
    - uint -> uint64
    - bool -> uint8
    - datetimelike -> i8
    - datetime64tz -> i8 (in local tz)
    - categorical -> codes

    Parameters
    ----------
    values : np.ndarray or ExtensionArray

    Returns
    -------
    np.ndarray
    """
This np.asarray call can be very expensive. For example, I have an ExtensionArray with several million rows backed by 10 Categorical/numerical arrays. np.asarray uses __iter__ to loop through my array and construct an np.ndarray of base Python objects, which is a very expensive operation. Unfortunately, I have no way of hinting to pandas that I have much more efficient, vectorized algorithms for computing the duplicates.
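For context, here is a minimal sketch of what that fallback costs, using a plain Categorical purely as stand-in data (pandas actually handles Categorical natively via its codes; an arbitrary third-party ExtensionArray that hits the object path is the assumption here):

import numpy as np
import pandas as pd
from pandas.core import algorithms

n = 2_000_000
cat = pd.Categorical(np.random.randint(0, 1_000, n).astype(str))  # stand-in data

# What the fallback effectively does for a non-coerceable ExtensionArray:
# box every element into a Python object before the hashtable kernel runs.
boxed = np.asarray(cat, dtype=object)              # O(n) Python-object boxing
dup_boxed = algorithms.duplicated(boxed, keep="first")

# The same answer computed on the underlying integer codes stays in NumPy
# and avoids the boxing entirely.
dup_codes = algorithms.duplicated(cat.codes, keep="first")

assert (dup_boxed == dup_codes).all()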
Feature Description
Add a duplicated method to ExtensionArray with the following signature:
def duplicated(self, keep: Literal["first", "last", False]) -> npt.NDArray[np.bool_]:
    # Return a boolean array indicating which values in the ExtensionArray are duplicated
    return pd.core.algorithms.duplicated(self, keep=keep)
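As a usage sketch of the proposed hook, a hypothetical third-party array (modeled here by subclassing pd.Categorical just so the example runs) could override it to deduplicate on its integer codes instead of boxed values:

import pandas as pd
from pandas.core import algorithms


class MyCategorical(pd.Categorical):
    # Hypothetical stand-in for a third-party ExtensionArray; the override
    # shows where the proposed hook would let it supply a vectorized path.
    def duplicated(self, keep="first"):
        # Deduplicate on the integer codes rather than on object-boxed values.
        # (All -1 codes for missing values compare equal, matching how pandas'
        # duplicated treats NaN.)
        return algorithms.duplicated(self.codes, keep=keep)


arr = MyCategorical(["a", "b", "a", None])
arr.duplicated()
# array([False, False,  True, False])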
Then have IndexOpsMixin call that method instead of directly calling pd.core.algorithms.duplicated:
@final
def _duplicated(
    self, keep: Literal["first", "last", False] = "first"
) -> npt.NDArray[np.bool_]:
    # Since self._values can be an ExtensionArray or an np.ndarray, a type check may be
    # needed here, falling back on pd.core.algorithms.duplicated when it is an np.ndarray
    return self._values.duplicated(keep=keep)
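For illustration, a sketch of the dispatch that the comment above alludes to, written as a standalone function (duplicated_dispatch is an illustrative name, not pandas code), assuming the proposed ExtensionArray.duplicated hook exists:

import numpy as np
import numpy.typing as npt
from typing import Literal

from pandas.api.extensions import ExtensionArray
from pandas.core import algorithms


def duplicated_dispatch(
    values, keep: Literal["first", "last", False] = "first"
) -> npt.NDArray[np.bool_]:
    # values is whatever IndexOpsMixin._values returns: an ExtensionArray
    # or a plain np.ndarray.
    if isinstance(values, ExtensionArray) and hasattr(values, "duplicated"):
        # Delegate to the (proposed) ExtensionArray.duplicated hook so the
        # array can use its own vectorized implementation.
        return values.duplicated(keep=keep)
    # Plain ndarrays, and arrays without the hook, keep the current path.
    return algorithms.duplicated(values, keep=keep)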
Alternative Solutions
Currently, for a duplicated over a multi-column subset, pandas routes each column through algorithms.factorize and then passes the resulting codes through get_group_index. The Series duplicated function could possibly implement a factorize-based algorithm for custom ExtensionArrays instead. This might eliminate the need for users to code up their own duplicated check if they already have a custom factorize in place.
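A rough sketch of that factorize-based route (the helper name dedup_via_factorize is illustrative, and this only loosely mirrors what DataFrame.duplicated does for a multi-column subset rather than quoting it):

import numpy as np
import pandas as pd
from pandas.core import algorithms
from pandas.core.sorting import get_group_index


def dedup_via_factorize(columns, keep="first"):
    # Factorize each column into integer codes, combine the codes into a
    # single group index, then run the ordinary duplicated kernel on it.
    labels, shape = [], []
    for col in columns:
        codes, uniques = pd.factorize(col, use_na_sentinel=False)
        labels.append(codes.astype(np.int64, copy=False))
        shape.append(len(uniques))
    ids = get_group_index(labels, tuple(shape), sort=False, xnull=False)
    return algorithms.duplicated(ids, keep=keep)


df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})
dedup_via_factorize([df["a"], df["b"]])
# array([False,  True, False]) -- same as df.duplicated().to_numpy()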
Additional Context
This came out of PR #45534 as a result of #45236, so it might be viewed as a v1.5 regression; I hadn't implemented this ExtensionArray feature in my own code prior to 1.5, so I haven't backtested. It's also potentially related to #27035 and #15929.