Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
from sklearn.datasets import fetch_openml
data = fetch_openml('mnist_784')
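
The example above needs scikit-learn; a minimal pandas-only sketch that should hit the same code path on PyPy (assumption: any DataFrame with the default RangeIndex reaches RangeIndex.nbytes through memory_usage) is:

import pandas as pd

# DataFrame.memory_usage(index=True, the default) calls
# self.index.memory_usage(), which for a RangeIndex evaluates
# RangeIndex.nbytes and therefore sys.getsizeof(), raising TypeError on PyPy.
df = pd.DataFrame({"a": [1, 2, 3]})
print(df.memory_usage(deep=True))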
Issue Description
PyPy doesn't support this usage of getsizeof: RangeIndex.nbytes calls sys.getsizeof() without a default, which always raises a TypeError on PyPy:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~/mambaforge-pypy3/site-packages/sklearn/datasets/_openml.py in wrapper(*args, **kw)
60 try:
---> 61 return f(*args, **kw)
62 except HTTPError:

~/mambaforge-pypy3/site-packages/sklearn/datasets/_openml.py in _load_arff_response(url, data_home, return_type, encode_nominal, parse_arff, md5_checksum)
528
--> 529 parsed_arff = parse_arff(arff)
530

~/mambaforge-pypy3/lib_pypy/_functools.py in __call__(self, *fargs, **fkeywords)
79 fkeywords = dict(self._keywords, **fkeywords)
---> 80 return self._func(*(self._args + fargs), **fkeywords)
81

~/mambaforge-pypy3/site-packages/sklearn/datasets/_openml.py in _convert_arff_data_dataframe(arff, columns, features_dict)
353
--> 354 row_bytes = first_df.memory_usage(deep=True).sum()
355 chunksize = get_chunk_n_rows(row_bytes)

~/mambaforge-pypy3/site-packages/pandas/core/frame.py in memory_usage(self, index, deep)
3223 result = self._constructor_sliced(
-> 3224 self.index.memory_usage(deep=deep), index=["Index"]
3225 ).append(result)

~/mambaforge-pypy3/site-packages/pandas/core/indexes/range.py in memory_usage(self, deep)
344 """
--> 345 return self.nbytes
346

~/mambaforge-pypy3/site-packages/pandas/_libs/properties.pyx in pandas._libs.properties.CachedProperty.__get__()

~/mambaforge-pypy3/site-packages/pandas/core/indexes/range.py in nbytes(self)
316 rng = self._range
--> 317 return getsizeof(rng) + sum(
318 getsizeof(getattr(rng, attr_name))

TypeError: getsizeof(...)
getsizeof(object, default) -> int

Return the size of object in bytes.

sys.getsizeof(object, default) will always return default on PyPy, and
raise a TypeError if default is not provided.

First note that the CPython documentation says that this function may
raise a TypeError, so if you are seeing it, it means that the program
you are using is not correctly handling this case.

On PyPy, though, it always raises TypeError. Before looking for
alternatives, please take a moment to read the following explanation as
to why it is the case. What you are looking for may not be possible.

A memory profiler using this function is most likely to give results
inconsistent with reality on PyPy. It would be possible to have
sys.getsizeof() return a number (with enough work), but that may or
may not represent how much memory the object uses. It doesn't even
make really sense to ask how much *one* object uses, in isolation
with the rest of the system. For example, instances have maps,
which are often shared across many instances; in this case the maps
would probably be ignored by an implementation of sys.getsizeof(),
but their overhead is important in some cases if they are many
instances with unique maps. Conversely, equal strings may share
their internal string data even if they are different objects---or
empty containers may share parts of their internals as long as they
are empty. Even stranger, some lists create objects as you read
them; if you try to estimate the size in memory of range(10**6) as
the sum of all items' size, that operation will by itself create one
million integer objects that never existed in the first place.

During handling of the above exception, another exception occurred:

TypeError Traceback (most recent call last)
/tmp/ipykernel_18403/3646006393.py in <module>
----> 1 data = fetch_openml('mnist_784')

~/mambaforge-pypy3/site-packages/sklearn/datasets/_openml.py in fetch_openml(name, version, data_id, data_home, target_column, cache, return_X_y, as_frame)
965 target_columns=target_columns,
966 data_columns=data_columns,
--> 967 md5_checksum=data_description["md5_checksum"],
968 )
969

~/mambaforge-pypy3/site-packages/sklearn/datasets/_openml.py in _download_data_to_bunch(url, sparse, data_home, as_frame, features_list, data_columns, target_columns, shape, md5_checksum)
659 encode_nominal=not as_frame,
660 parse_arff=parse_arff,
--> 661 md5_checksum=md5_checksum,
662 )
663 X, y, frame, nominal_attributes = postprocess(*out)

~/mambaforge-pypy3/site-packages/sklearn/datasets/_openml.py in wrapper(*args, **kw)
67 if os.path.exists(local_path):
68 os.unlink(local_path)
---> 69 return f(*args, **kw)
70
71 return wrapper

~/mambaforge-pypy3/site-packages/sklearn/datasets/_openml.py in _load_arff_response(url, data_home, return_type, encode_nominal, parse_arff, md5_checksum)
527 )
528
--> 529 parsed_arff = parse_arff(arff)
530
531 # consume remaining stream, if early exited

~/mambaforge-pypy3/lib_pypy/_functools.py in __call__(self, *fargs, **fkeywords)
78 if self._keywords:
79 fkeywords = dict(self._keywords, **fkeywords)
---> 80 return self._func(*(self._args + fargs), **fkeywords)
81
82 @_recursive_repr()

~/mambaforge-pypy3/site-packages/sklearn/datasets/_openml.py in _convert_arff_data_dataframe(arff, columns, features_dict)
352 first_df = pd.DataFrame([first_row], columns=arff_columns)
353
--> 354 row_bytes = first_df.memory_usage(deep=True).sum()
355 chunksize = get_chunk_n_rows(row_bytes)
356

~/mambaforge-pypy3/site-packages/pandas/core/frame.py in memory_usage(self, index, deep)
3222 if index:
3223 result = self._constructor_sliced(
-> 3224 self.index.memory_usage(deep=deep), index=["Index"]
3225 ).append(result)
3226 return result

~/mambaforge-pypy3/site-packages/pandas/core/indexes/range.py in memory_usage(self, deep)
343 numpy.ndarray.nbytes
344 """
--> 345 return self.nbytes
346
347 @property

~/mambaforge-pypy3/site-packages/pandas/_libs/properties.pyx in pandas._libs.properties.CachedProperty.__get__()

~/mambaforge-pypy3/site-packages/pandas/core/indexes/range.py in nbytes(self)
315 """
316 rng = self._range
--> 317 return getsizeof(rng) + sum(
318 getsizeof(getattr(rng, attr_name))
319 for attr_name in ["start", "stop", "step"]

TypeError: getsizeof(...)
getsizeof(object, default) -> int

Return the size of object in bytes.

sys.getsizeof(object, default) will always return default on PyPy, and
raise a TypeError if default is not provided.

First note that the CPython documentation says that this function may
raise a TypeError, so if you are seeing it, it means that the program
you are using is not correctly handling this case.

On PyPy, though, it always raises TypeError. Before looking for
alternatives, please take a moment to read the following explanation as
to why it is the case. What you are looking for may not be possible.

A memory profiler using this function is most likely to give results
inconsistent with reality on PyPy. It would be possible to have
sys.getsizeof() return a number (with enough work), but that may or
may not represent how much memory the object uses. It doesn't even
make really sense to ask how much *one* object uses, in isolation
with the rest of the system. For example, instances have maps,
which are often shared across many instances; in this case the maps
would probably be ignored by an implementation of sys.getsizeof(),
but their overhead is important in some cases if they are many
instances with unique maps. Conversely, equal strings may share
their internal string data even if they are different objects---or
empty containers may share parts of their internals as long as they
are empty. Even stranger, some lists create objects as you read
them; if you try to estimate the size in memory of range(10**6) as
the sum of all items' size, that operation will by itself create one
million integer objects that never existed in the first place.
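For reference, the difference in sys.getsizeof behavior described in the message above can be shown directly (a small sketch; the fallback value 48 is arbitrary):

import sys

rng = range(10)
try:
    # CPython: returns the size of the range object in bytes.
    # PyPy: always raises TypeError when no default is given.
    print(sys.getsizeof(rng))
except TypeError:
    print("sys.getsizeof() without a default is unsupported on this interpreter")

# With a default, PyPy always returns the default; CPython still returns
# the real size and only falls back to the default when the size cannot
# be determined.
print(sys.getsizeof(rng, 48))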
Expected Behavior
DataFrame.memory_usage() (and RangeIndex.nbytes) should not raise on PyPy. Here's an example of how the usage of getsizeof is avoided elsewhere in pandas:
https://github.com/pandas-dev/pandas/blob/main/pandas/core/indexes/multi.py#L1266-L1267
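A similar fallback could be used in RangeIndex.nbytes. The sketch below is not the actual patch; it only illustrates the idea of passing a default object size to getsizeof, as the MultiIndex code linked above does (the 24-byte fallback is an assumption borrowed from that code):

from sys import getsizeof

def range_nbytes(rng: range) -> int:
    # Mirrors the body of RangeIndex.nbytes, but passes a fallback size to
    # getsizeof so that interpreters without a useful getsizeof (PyPy)
    # get a rough estimate instead of a TypeError.
    objsize = 24  # assumed per-object fallback, as in MultiIndex._nbytes
    return getsizeof(rng, objsize) + sum(
        getsizeof(getattr(rng, attr_name), objsize)
        for attr_name in ["start", "stop", "step"]
    )

print(range_nbytes(range(0, 100, 2)))  # works on both CPython and PyPy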
Installed Versions
To replicate this bug, use a mambaforge-pypy3 environment (pandas running under PyPy).