
BUG: Infinite recursion when creating Series with arrow-backed extension dtype and no data #41377

Open
@mosalx

Description

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Problem description

I am implementing arrow-backed extension arrays with non-primitive data types. It works well overall: almost all relevant unit tests from pandas.tests.extension.base pass. One of the remaining issues is instantiating Series objects with missing data, which results in infinite recursion for dict-like arrow data types. Consider this example of an ExtensionArray (I am not implementing all abstract methods; only what is necessary to demonstrate the issue):

import numpy as np
import pandas as pd
import pyarrow as pa
from pandas.api.extensions import ExtensionDtype, ExtensionArray

class MyDtype(ExtensionDtype):
    type = pa.StructType
    name = 'arrow_struct_dtype'
    arrow_type = pa.struct([('a', pa.int32())])
    na_value = pa.scalar(None, arrow_type)

    @classmethod
    def construct_array_type(cls):
        return MyArray


class MyArray(ExtensionArray):
    dtype = MyDtype()

    def __init__(self, array):
        self.data = array

    def __array__(self, *args, **kwargs):
        """Convert to numpy array"""
        # can't use np.asarray here because it produces a 2-dimensional
        # array when scalars are iterable and all have the same length
        result = np.empty(shape=(len(self),), dtype='object')
        for i, scalar in enumerate(self):
            result[i] = scalar
        return result

    def __len__(self):
        return len(self.data)

    def __getitem__(self, item):
        # assuming item is an integer
        return self.data[item]

    @classmethod
    def _from_sequence(cls, scalars, *, dtype=None, copy=False):
        # assuming scalars is Iterable[pyarrow.Scalar]
        return cls(pa.array([s.as_py() for s in scalars], 
                            type=cls.dtype.arrow_type))

dtype = MyDtype()
valid_value = pa.scalar({'a': 5}, MyDtype.arrow_type)
na_value = MyDtype.na_value

Now if we try to create a Series object, all three cases below result in infinite recursion:

pd.Series(data=na_value, index=[1,2], dtype=dtype)
pd.Series(data={}, index=[1,2], dtype=dtype)
pd.Series(data=None, index=[1,2], dtype=dtype)

Being more explicit (to avoid broadcasting) works fine:

>>> pd.Series(data=[na_value] * 2, index=[1,2], dtype=dtype)
1    None
2    None
dtype: arrow_struct_dtype

>>> pd.Series(data=[na_value, valid_value], index=[1,2], dtype=dtype)
1        None
2    {'a': 5}
dtype: arrow_struct_dtype

>>> pd.Series(data=[valid_value] * 2, index=[1,2], dtype=dtype)
1    {'a': 5}
2    {'a': 5}
dtype: arrow_struct_dtype

The root cause seems to be that pandas attempts to analyze the internal structure of pyarrow.Scalar instances in the Series constructor. Specifically:

>>> from pandas.core.dtypes.common import is_dict_like
>>> is_dict_like(MyDtype.na_value)
True
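For context, is_dict_like is a duck-typing check: any instance exposing keys, __getitem__, and __contains__ is treated as a mapping, which a pyarrow StructScalar does. A simplified sketch (an approximation of pandas.core.dtypes.inference.is_dict_like, not the exact implementation), with a stand-in class mimicking a struct scalar's surface:

```python
# Simplified sketch of pandas's dict-like duck-typing check
# (approximation, not the exact pandas implementation)
def is_dict_like(obj):
    dict_like_attrs = ("__getitem__", "keys", "__contains__")
    return (all(hasattr(obj, attr) for attr in dict_like_attrs)
            and not isinstance(obj, type))


# A pyarrow StructScalar exposes keys()/__getitem__/__contains__, so a
# stand-in with the same surface passes the check even though the object
# is conceptually a single scalar value.
class StructScalarLike:
    def keys(self):
        return ["a"]

    def __getitem__(self, key):
        return None

    def __contains__(self, key):
        return key in self.keys()
```

This is why the Series constructor takes the mapping branch for a struct scalar instead of broadcasting it like an ordinary scalar.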

Expected Output

I believe some special treatment is needed for all instances of pyarrow.Scalar and its subclasses, because the type itself communicates that the object is a "primitive-like" data container, and pandas should not inspect its internal structure to decide how to treat it. For example, this seems reasonable:

  • is_scalar should return True
  • is_dict_like, is_list_like and is_array_like should all return False

Although pyarrow itself has some inconsistencies in how it treats scalar objects (see ARROW-12695), I think in this case pyarrow is not causing this issue.
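A hypothetical fix along these lines (the function name is illustrative, not a pandas internal) could short-circuit the duck-typed check for pyarrow scalars before falling back to the attribute test:

```python
def is_dict_like_with_arrow_guard(obj):
    # Hypothetical variant: treat any pyarrow.Scalar as atomic before
    # applying the duck-typed attribute check pandas uses today.
    try:
        import pyarrow as pa
        if isinstance(obj, pa.Scalar):
            return False
    except ImportError:
        pass  # pyarrow not installed; nothing to guard against
    dict_like_attrs = ("__getitem__", "keys", "__contains__")
    return (all(hasattr(obj, attr) for attr in dict_like_attrs)
            and not isinstance(obj, type))
```

With a guard like this, `pd.Series(data=na_value, index=[1, 2], dtype=dtype)` would take the scalar-broadcast path instead of the mapping path.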

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 2cb9652
python : 3.9.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 165 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252

pandas : 1.2.4
numpy : 1.20.2
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.1
setuptools : 52.0.0.post20210125
Cython : 0.29.23
pytest : 6.2.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.4.0
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.23.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 2021.04.0
fastparquet : None
gcsfs : None
matplotlib : 3.4.1
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : 4.0.0
pyxlsb : None
s3fs : None
scipy : 1.6.3
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : None
numba : 0.53.1

    Labels

  • Bug
  • Constructors: Series/DataFrame/Index/pd.array constructors
  • ExtensionArray: extending pandas with custom dtypes or arrays
  • Nested Data: data where the values are collections (lists, sets, dicts, objects, etc.)