Closed
Description
For the hashtable-based functionality behind factorize, unique and groupby, we added a _values_for_factorize() / factorize() method to the ExtensionArray interface, and for those methods it works nicely. However, some of the other hashtable-based methods, such as duplicated() or drop_duplicates(), do not use this machinery: the EA is still coerced to a numpy array before being passed to the algos code.
A small illustration of this, patching IntegerArray to print whenever it is coerced to a numpy array:
--- a/pandas/core/arrays/integer.py
+++ b/pandas/core/arrays/integer.py
@@ -364,6 +364,7 @@ class IntegerArray(ExtensionArray, ExtensionOpsMixin):
         the array interface, return my values
         We return an object array here to preserve our scalar values
         """
+        print("getting coerced to an array")
         return self._coerce_to_ndarray()

In [2]: s = pd.Series([1, 2, 1, 2, None], dtype='Int64')

In [3]: s
Out[3]: getting coerced to an array
0      1
1      2
2      1
3      2
4    NaN
dtype: Int64

In [4]: s.duplicated()
getting coerced to an array
getting coerced to an array
Out[4]:
0    False
1    False
2     True
3     True
4    False
dtype: bool

In [5]: s.unique()
Out[5]:
<IntegerArray>
[1, 2, NaN]
Length: 3, dtype: Int64
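
To make the idea concrete, here is a rough, pure-Python sketch of how a duplicated()-style result could be computed from factorize-style integer codes instead of from a coerced object array. This is not the pandas implementation: the helper names factorize and duplicated_from_codes are hypothetical, None stands in for a missing value, and -1 is assumed as the missing-value code.

```python
def factorize(values):
    """Toy stand-in for an array's factorize(): map each value to an integer code.

    Missing values (None in this sketch) get the sentinel code -1 and are
    not included in the uniques.
    """
    codes, uniques, seen = [], [], {}
    for v in values:
        if v is None:
            codes.append(-1)
            continue
        if v not in seen:
            seen[v] = len(uniques)
            uniques.append(v)
        codes.append(seen[v])
    return codes, uniques


def duplicated_from_codes(codes):
    """Flag every code that already appeared earlier (keep='first' behaviour)."""
    seen = set()
    mask = []
    for c in codes:
        mask.append(c in seen)
        seen.add(c)
    return mask


# Same data as the Series in the example above.
values = [1, 2, 1, 2, None]
codes, uniques = factorize(values)
mask = duplicated_from_codes(codes)
print(codes, uniques, mask)
```

The point of the sketch is only that, once an array can produce integer codes itself, the duplicate check can run on those codes and never needs the original values as a numpy object array. For the data above the mask agrees with Out[4].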