
Indexing (__getitem__) of DataFrame/Series with ExtensionArray densifies the array #29708

Closed
@jorisvandenbossche


I noticed in GeoPandas that filtering a Series with a boolean mask "densifies" the ExtensionArray (that is, it converts the ExtensionArray to a materialized numpy array), which can be very expensive if the ExtensionArray doesn't store a numpy array of scalars under the hood.

This is quite problematic for such a basic operation. It is also a problem for the next version of GeoPandas, where we will probably have to override __getitem__ to work around it.

Here is an example with pandas itself. I used this small edit to check which call converts the array to a numpy array:

--- a/pandas/tests/extension/decimal/array.py
+++ b/pandas/tests/extension/decimal/array.py
@@ -81,6 +81,13 @@ class DecimalArray(ExtensionArray, ExtensionScalarOpsMixin):
     def _from_factorized(cls, values, original):
         return cls(values)
 
+    def __array__(self, dtype=None):
+        print("__array__ being called from:")
+        import inspect
+        frames = inspect.getouterframes(inspect.currentframe())
+        for frame in frames[:7]:
+            print("  {0} from {1}".format(frame.function, frame.filename))
+        return self._data
+

and then you see this:

In [1]: from pandas.tests.extension.decimal import DecimalArray, make_data 

In [2]: a = DecimalArray(make_data())   

In [3]: s = pd.Series(a)    

In [4]: mask = s > 0.5   

In [5]: subset = s[mask]   
__array__ being called from:
  __array__ from /home/joris/scipy/pandas/pandas/tests/extension/decimal/array.py
  asarray from /home/joris/miniconda3/envs/dev/lib/python3.7/site-packages/numpy/core/_asarray.py
  to_dense from /home/joris/scipy/pandas/pandas/core/internals/blocks.py
  get_values from /home/joris/scipy/pandas/pandas/core/internals/managers.py
  _internal_get_values from /home/joris/scipy/pandas/pandas/core/series.py
  get_value from /home/joris/scipy/pandas/pandas/core/indexes/base.py
  __getitem__ from /home/joris/scipy/pandas/pandas/core/series.py

(s.loc[mask] does not have the problem)
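The tracing trick used above can be tried outside pandas. The following is a minimal, dependency-free sketch: `TracedArray`, `to_dense` and `get_value` are hypothetical stand-ins (not pandas classes or functions), and `__array__` is called directly instead of through `np.asarray` so the example needs only the standard library.

```python
import inspect


class TracedArray:
    """Hypothetical stand-in for an ExtensionArray; not a pandas class."""

    def __init__(self, values):
        self._data = list(values)
        # One entry per materialization: the calling function names,
        # innermost frame first.
        self.calls = []

    def __array__(self, dtype=None):
        # Walk the call stack outward from the current frame and keep
        # the first few function names, as in the diff above.
        frames = inspect.getouterframes(inspect.currentframe())
        self.calls.append([frame.function for frame in frames[:7]])
        return self._data


def to_dense(arr):
    # Hypothetical helper that mimics a densifying code path.
    return arr.__array__()


def get_value(arr):
    # Hypothetical caller one level further out.
    return to_dense(arr)


arr = TracedArray([1, 2, 3])
get_value(arr)

print("__array__ being called from:")
for name in arr.calls[0][:3]:
    print("  {0}".format(name))
```

Recording the frames on the instance rather than only printing them makes it possible to assert on the call path in a test.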

So this comes from the fact that we first try index.get_value(..) in __getitem__ before doing anything else:

pandas/pandas/core/series.py

Lines 1075 to 1079 in d134b47

def __getitem__(self, key):
    key = com.apply_if_callable(key, self)
    try:
        result = self.index.get_value(self, key)

And inside Index.get_value, this is calling values_from_object:

s = com.values_from_object(series)
k = com.values_from_object(key)
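To illustrate why the order of operations matters, here is a toy model of the two lookup strategies. It is only a sketch under stated assumptions: `LazyArray`, `getitem_lookup_first` and `getitem_mask_first` are hypothetical names, plain lists stand in for arrays, and none of this is the actual pandas implementation or a proposed patch.

```python
class LazyArray:
    """Hypothetical array whose dense form is expensive to build."""

    def __init__(self, values):
        self._values = list(values)
        self.densify_count = 0  # how many times to_dense() was called

    def to_dense(self):
        # Stands in for the costly conversion to a numpy array.
        self.densify_count += 1
        return list(self._values)

    def take_mask(self, mask):
        # Filter element-wise without building the dense form.
        return LazyArray(v for v, keep in zip(self._values, mask) if keep)


def is_bool_mask(key):
    return isinstance(key, list) and all(isinstance(k, bool) for k in key)


def getitem_lookup_first(arr, key):
    # Mirrors the order described above: densify and try a scalar
    # lookup first, and only then fall back to mask filtering.
    dense = arr.to_dense()
    try:
        return dense[key]
    except TypeError:
        return arr.take_mask(key)


def getitem_mask_first(arr, key):
    # Alternative order: handle boolean masks before any densification.
    if is_bool_mask(key):
        return arr.take_mask(key)
    # Scalar lookups still densify in this toy model.
    return arr.to_dense()[key]


mask = [True, False, True]

a = LazyArray([10, 20, 30])
r1 = getitem_lookup_first(a, mask)

b = LazyArray([10, 20, 30])
r2 = getitem_mask_first(b, mask)

print(a.densify_count, b.densify_count)  # prints "1 0"
```

Both strategies return the same filtered result, but only the lookup-first order pays for a dense conversion when the key is a mask.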

Labels: ExtensionArray (Extending pandas with custom dtypes or arrays); Indexing (Related to indexing on series/frames, not to indexes themselves)
