BUG: _repr_*_ methods should not iterate over Sequence when dtype=object #44799

Open
@randolf-scholz

Description

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd
d = {0: range(3), 1: range(100)}
s = pd.Series(d, dtype=object)
print(s)

Gives

0                                            (0, 1, 2)
1    (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
dtype: object

Instead of

0      range(0, 3)
1    range(0, 100)
dtype: object

Issue Description

When displaying a Series/DataFrame of dtype=object whose contents conform to the collections.abc.Sequence protocol, pandas iterates over these objects to build the display string instead of calling repr on them.

To see why this is generally a bad idea, consider the following example:

import pandas as pd
from time import sleep

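# Despite the name, this is a lazy sequence: it only defines __len__ and __getitem__.
class MyIterator:
    def __len__(self):
        return 100

    def __getitem__(self, key):
        # Simulate an expensive computation on every item access.
        print(f"computing {key=} ...")
        sleep(3)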

d = {0: MyIterator(), 1: range(100)}
s = pd.Series(d, dtype=object)
print(s)
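The cost of implicit iteration can also be seen without pandas. The following is a minimal sketch, assuming a hypothetical LazySequence class that counts item accesses instead of sleeping: calling repr never touches the items, while any code that loops over the object triggers one __getitem__ call per item.

```python
class LazySequence:
    """Hypothetical stand-in for an object whose items are expensive to compute."""

    def __init__(self, n):
        self.n = n
        self.calls = 0  # number of times __getitem__ has run

    def __len__(self):
        return self.n

    def __getitem__(self, key):
        if not 0 <= key < self.n:
            raise IndexError(key)
        self.calls += 1  # a real object would do the expensive work here
        return key


seq = LazySequence(100)

# repr() falls back to object.__repr__ and never computes an item.
text = repr(seq)
calls_after_repr = seq.calls

# iter() succeeds through the legacy __getitem__ protocol, so looping
# over the first ten items costs ten item computations.
first_items = [item for _, item in zip(range(10), seq)]
calls_after_iteration = seq.calls

print(calls_after_repr, calls_after_iteration)
```

With three seconds per item, as in the example above, even a truncated display of a handful of items adds up quickly.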

Where would this occur in practice? In my case, I tried to store some torch.utils.data.DataLoader objects in a Series in order to leverage the powerful pandas Multi-Indexing over hierarchical cross-validation splits. Printing the Series in a Jupyter Notebook would then take 5+ minutes, whereas instantiating it was practically instantaneous. This is especially problematic when using Jupyter with %config InteractiveShell.ast_node_interactivity='last_expr_or_assign' mode, since every assignment then triggers the display as well.

Expected Behavior

When dtype=object, pandas should use the repr method of the object to get a string representation and not try to do something fancy. Possibly one can make exceptions / special cases for Python built-ins such as tuple and list. (I presume the current behaviour exists to deal with these two when they hold lots of items.)
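As a rough illustration of the proposed rule (not pandas' actual formatting code), here is a sketch of a hypothetical format_cell helper. The name and the max_items parameter are assumptions made for this example: only exact list and tuple instances are iterated and truncated, everything else goes through repr.

```python
def format_cell(value, max_items=3):
    """Hypothetical cell formatter: iterate only built-in list/tuple values."""
    if type(value) in (list, tuple):
        # Built-in containers are cheap to slice, so truncating them is safe.
        shown = [repr(v) for v in value[:max_items]]
        if len(value) > max_items:
            shown.append("...")
        left, right = ("[", "]") if type(value) is list else ("(", ")")
        return left + ", ".join(shown) + right
    # Any other object, including lazy sequences, is shown via its repr.
    return repr(value)


print(format_cell(range(100)))
print(format_cell([1, 2, 3, 4, 5]))
```

Checking the exact type rather than using isinstance keeps subclasses with overridden item access out of the special case; whether that is the right trade-off is open to discussion.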

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 945c9ed766a61c7d2c0a7cbb251b6edebf9cb7d5
python           : 3.9.7.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.11.0-41-generic
Version          : #45~20.04.1-Ubuntu SMP Wed Nov 10 10:20:10 UTC 2021
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.3.4
numpy            : 1.21.4
pytz             : 2021.3
dateutil         : 2.8.2
pip              : 21.3.1
setuptools       : 59.4.0
Cython           : None
pytest           : 6.2.5
hypothesis       : None
sphinx           : 4.3.1
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : 1.1
pymysql          : None
psycopg2         : None
jinja2           : 3.0.3
IPython          : 7.30.1
pandas_datareader: None
bs4              : 4.10.0
bottleneck       : None
fsspec           : 2021.11.1
fastparquet      : 0.7.2
gcsfs            : None
matplotlib       : 3.5.0
numexpr          : 2.7.3
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 6.0.1
pyxlsb           : None
s3fs             : None
scipy            : 1.7.3
sqlalchemy       : 1.4.27
tables           : 3.6.1
tabulate         : 0.8.9
xarray           : 0.20.1
xlrd             : None
xlwt             : None
numba            : 0.53.1

Metadata

Assignees

No one assigned

    Labels

    Enhancement
    Needs Discussion: Requires discussion from core team before further action
    Nested Data: Data where the values are collections (lists, sets, dicts, objects, etc.)
    Output-Formatting: __repr__ of pandas objects, to_string
