ENH: support 'duplicated' functionality for ExtensionArrays #27264

Closed
@jorisvandenbossche

For the hashtable-based functionality behind factorize, unique, and groupby, we added the _values_for_factorize() / factorize() methods to ExtensionArray, and those code paths work nicely. However, other hashtable-based methods such as duplicated() and drop_duplicates() do not use this machinery: the EA is still coerced to a numpy array before being passed to the algos code.

A small illustration of this, patching IntegerArray to print whenever it is coerced to a numpy array:

--- a/pandas/core/arrays/integer.py
+++ b/pandas/core/arrays/integer.py
@@ -364,6 +364,7 @@ class IntegerArray(ExtensionArray, ExtensionOpsMixin):
         the array interface, return my values
         We return an object array here to preserve our scalar values
         """
+        print("getting coerced to an array")
         return self._coerce_to_ndarray()

In [2]: s = pd.Series([1, 2, 1, 2, None], dtype='Int64')

In [3]: s
Out[3]: getting coerced to an array

0      1
1      2
2      1
3      2
4    NaN
dtype: Int64

In [4]: s.duplicated()
getting coerced to an array
getting coerced to an array
Out[4]: 
0    False
1    False
2     True
3     True
4    False
dtype: bool

In [5]: s.unique()
Out[5]: 
<IntegerArray>
[1, 2, NaN]
Length: 3, dtype: Int64
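
To make the idea concrete, here is a rough pure-Python sketch of how duplicated() could be computed from the values exposed by the factorize hook, without going through the generic array coercion. This is not pandas code: ToyIntegerArray, its to_objects() method, the coercions counter, and the _NA sentinel are all hypothetical names made up for illustration, and only the hook name _values_for_factorize() is taken from the ExtensionArray interface. The real implementation would use the hashtable routines in the algos code rather than Python sets.

```python
_NA = object()  # hypothetical sentinel standing in for a missing value


class ToyIntegerArray:
    """Hypothetical stand-in for a masked integer ExtensionArray (not pandas code)."""

    def __init__(self, data):
        self._mask = [x is None for x in data]
        self._data = [0 if x is None else int(x) for x in data]
        self.coercions = 0  # counts conversions to a generic object sequence

    def to_objects(self):
        # Analogue of coercing to an object ndarray; the sketch below avoids it.
        self.coercions += 1
        return [None if m else v for v, m in zip(self._data, self._mask)]

    def _values_for_factorize(self):
        # Same hook name as on ExtensionArray: hashable values plus the NA marker.
        values = [_NA if m else v for v, m in zip(self._data, self._mask)]
        return values, _NA


def duplicated(arr, keep="first"):
    """Boolean mask of duplicates, built from the factorize hook only."""
    values, _ = arr._values_for_factorize()
    if keep is False:
        # Mark every occurrence of any value that appears more than once.
        counts = {}
        for v in values:
            counts[v] = counts.get(v, 0) + 1
        return [counts[v] > 1 for v in values]
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first', 'last' or False")
    # Walk forwards to keep the first occurrence, backwards to keep the last.
    order = range(len(values)) if keep == "first" else reversed(range(len(values)))
    seen, mask = set(), [False] * len(values)
    for i in order:
        mask[i] = values[i] in seen
        seen.add(values[i])
    return mask


arr = ToyIntegerArray([1, 2, 1, 2, None])
print(duplicated(arr))   # same mask as Out[4] above
print(arr.coercions)     # stays at 0: the coercion path was never taken
```

Because every missing entry maps to the same sentinel, repeated missing values count as duplicates of each other, which matches how duplicated() treats NaN today.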

Labels: Enhancement, ExtensionArray, duplicated
