Description
I noticed in GeoPandas that filtering a series with a mask "densifies" the ExtensionArray (= converting the ExtesionArray to a materialized numpy array), which can potentially be very expensive (if the ExtensionArray doesn't store a numpy array of scalars under the hood).
This is quite problematic for such a basic operation (and problematic for the next version of GeoPandas, we will probably have to override __getitem__
for this).
Example with pandas itself to see this. I used this small edit to check what call converts to a numpy array:
--- a/pandas/tests/extension/decimal/array.py
+++ b/pandas/tests/extension/decimal/array.py
@@ -81,6 +81,13 @@ class DecimalArray(ExtensionArray, ExtensionScalarOpsMixin):
def _from_factorized(cls, values, original):
return cls(values)
+ def __array__(self, dtype=None):
+ print("__array__ being called from:")
+ import inspect
+ frames = inspect.getouterframes(inspect.currentframe())
+ for frame in frames[:7]:
+ print(" {0} from {1}".format(frame.function, frame.filename))
+ return self._data
+
and then you see this:
In [1]: from pandas.tests.extension.decimal import DecimalArray, make_data
In [2]: a = DecimalArray(make_data())
In [3]: s = pd.Series(a)
In [4]: mask = s > 0.5
In [5]: subset = s[mask]
__array__ being called from:
__array__ from /home/joris/scipy/pandas/pandas/tests/extension/decimal/array.py
asarray from /home/joris/miniconda3/envs/dev/lib/python3.7/site-packages/numpy/core/_asarray.py
to_dense from /home/joris/scipy/pandas/pandas/core/internals/blocks.py
get_values from /home/joris/scipy/pandas/pandas/core/internals/managers.py
_internal_get_values from /home/joris/scipy/pandas/pandas/core/series.py
get_value from /home/joris/scipy/pandas/pandas/core/indexes/base.py
__getitem__ from /home/joris/scipy/pandas/pandas/core/series.py
(s.loc[mask]
does not have the problem)
So this comes from the fact that we first try index.get_value(..)
in __getitem__
before doing anything else:
Lines 1075 to 1079 in d134b47
And inside Index.get_value
, this is calling values_from_object
:
pandas/pandas/core/indexes/base.py
Lines 4620 to 4621 in d134b47