
getsizeof usage for memory utilization estimation is incompatible with PyPy #46176

Description

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

from sklearn.datasets import fetch_openml
data = fetch_openml('mnist_784')
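
The sklearn call is just the trigger; a minimal pandas-only reproducer (a sketch, which should fail the same way on PyPy since it goes through the same RangeIndex.nbytes path) is:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# On CPython this returns a Series of byte counts; on PyPy it raises
# TypeError, because RangeIndex.nbytes calls sys.getsizeof() with no default.
df.memory_usage(deep=True)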

Issue Description

PyPy doesn't support this usage of getsizeof (calling it without a fallback default):


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/mambaforge-pypy3/site-packages/sklearn/datasets/_openml.py in wrapper(*args, **kw)
     60             try:
---> 61                 return f(*args, **kw)
     62             except HTTPError:

~/mambaforge-pypy3/site-packages/sklearn/datasets/_openml.py in _load_arff_response(url, data_home, return_type, encode_nominal, parse_arff, md5_checksum)
    528
--> 529         parsed_arff = parse_arff(arff)
    530

~/mambaforge-pypy3/lib_pypy/_functools.py in __call__(self, *fargs, **fkeywords)
     79             fkeywords = dict(self._keywords, **fkeywords)
---> 80         return self._func(*(self._args + fargs), **fkeywords)
     81

~/mambaforge-pypy3/site-packages/sklearn/datasets/_openml.py in _convert_arff_data_dataframe(arff, columns, features_dict)
    353
--> 354     row_bytes = first_df.memory_usage(deep=True).sum()
    355     chunksize = get_chunk_n_rows(row_bytes)

~/mambaforge-pypy3/site-packages/pandas/core/frame.py in memory_usage(self, index, deep)
   3223             result = self._constructor_sliced(
-> 3224                 self.index.memory_usage(deep=deep), index=["Index"]
   3225             ).append(result)

~/mambaforge-pypy3/site-packages/pandas/core/indexes/range.py in memory_usage(self, deep)
    344         """
--> 345         return self.nbytes
    346

~/mambaforge-pypy3/site-packages/pandas/_libs/properties.pyx in pandas._libs.properties.CachedProperty.__get__()

~/mambaforge-pypy3/site-packages/pandas/core/indexes/range.py in nbytes(self)
    316         rng = self._range
--> 317         return getsizeof(rng) + sum(
    318             getsizeof(getattr(rng, attr_name))

TypeError: getsizeof(...)
getsizeof(object, default) -> int

Return the size of object in bytes.

sys.getsizeof(object, default) will always return default on PyPy, and
raise a TypeError if default is not provided.

First note that the CPython documentation says that this function may
raise a TypeError, so if you are seeing it, it means that the program
you are using is not correctly handling this case.

On PyPy, though, it always raises TypeError.  Before looking for
alternatives, please take a moment to read the following explanation as
to why it is the case.  What you are looking for may not be possible.

A memory profiler using this function is most likely to give results
inconsistent with reality on PyPy.  It would be possible to have
sys.getsizeof() return a number (with enough work), but that may or
may not represent how much memory the object uses.  It doesn't even
make really sense to ask how much *one* object uses, in isolation
with the rest of the system.  For example, instances have maps,
which are often shared across many instances; in this case the maps
would probably be ignored by an implementation of sys.getsizeof(),
but their overhead is important in some cases if they are many
instances with unique maps.  Conversely, equal strings may share
their internal string data even if they are different objects---or
empty containers may share parts of their internals as long as they
are empty.  Even stranger, some lists create objects as you read
them; if you try to estimate the size in memory of range(10**6) as
the sum of all items' size, that operation will by itself create one
million integer objects that never existed in the first place.

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_18403/3646006393.py in <module>
----> 1 data = fetch_openml('mnist_784')

~/mambaforge-pypy3/site-packages/sklearn/datasets/_openml.py in fetch_openml(name, version, data_id, data_home, target_column, cache, return_X_y, as_frame)
    965         target_columns=target_columns,
    966         data_columns=data_columns,
--> 967         md5_checksum=data_description["md5_checksum"],
    968     )
    969

~/mambaforge-pypy3/site-packages/sklearn/datasets/_openml.py in _download_data_to_bunch(url, sparse, data_home, as_frame, features_list, data_columns, target_columns, shape, md5_checksum)
    659         encode_nominal=not as_frame,
    660         parse_arff=parse_arff,
--> 661         md5_checksum=md5_checksum,
    662     )
    663     X, y, frame, nominal_attributes = postprocess(*out)

~/mambaforge-pypy3/site-packages/sklearn/datasets/_openml.py in wrapper(*args, **kw)
     67                 if os.path.exists(local_path):
     68                     os.unlink(local_path)
---> 69                 return f(*args, **kw)
     70
     71         return wrapper

~/mambaforge-pypy3/site-packages/sklearn/datasets/_openml.py in _load_arff_response(url, data_home, return_type, encode_nominal, parse_arff, md5_checksum)
    527         )
    528
--> 529         parsed_arff = parse_arff(arff)
    530
    531         # consume remaining stream, if early exited

~/mambaforge-pypy3/lib_pypy/_functools.py in __call__(self, *fargs, **fkeywords)
     78         if self._keywords:
     79             fkeywords = dict(self._keywords, **fkeywords)
---> 80         return self._func(*(self._args + fargs), **fkeywords)
     81
     82     @_recursive_repr()

~/mambaforge-pypy3/site-packages/sklearn/datasets/_openml.py in _convert_arff_data_dataframe(arff, columns, features_dict)
    352     first_df = pd.DataFrame([first_row], columns=arff_columns)
    353
--> 354     row_bytes = first_df.memory_usage(deep=True).sum()
    355     chunksize = get_chunk_n_rows(row_bytes)
    356

~/mambaforge-pypy3/site-packages/pandas/core/frame.py in memory_usage(self, index, deep)
   3222         if index:
   3223             result = self._constructor_sliced(
-> 3224                 self.index.memory_usage(deep=deep), index=["Index"]
   3225             ).append(result)
   3226         return result

~/mambaforge-pypy3/site-packages/pandas/core/indexes/range.py in memory_usage(self, deep)
    343         numpy.ndarray.nbytes
    344         """
--> 345         return self.nbytes
    346
    347     @property

~/mambaforge-pypy3/site-packages/pandas/_libs/properties.pyx in pandas._libs.properties.CachedProperty.__get__()

~/mambaforge-pypy3/site-packages/pandas/core/indexes/range.py in nbytes(self)
    315         """
    316         rng = self._range
--> 317         return getsizeof(rng) + sum(
    318             getsizeof(getattr(rng, attr_name))
    319             for attr_name in ["start", "stop", "step"]

TypeError: getsizeof(...)
getsizeof(object, default) -> int

Return the size of object in bytes.

sys.getsizeof(object, default) will always return default on PyPy, and
raise a TypeError if default is not provided.

First note that the CPython documentation says that this function may
raise a TypeError, so if you are seeing it, it means that the program
you are using is not correctly handling this case.

On PyPy, though, it always raises TypeError.  Before looking for
alternatives, please take a moment to read the following explanation as
to why it is the case.  What you are looking for may not be possible.

A memory profiler using this function is most likely to give results
inconsistent with reality on PyPy.  It would be possible to have
sys.getsizeof() return a number (with enough work), but that may or
may not represent how much memory the object uses.  It doesn't even
make really sense to ask how much *one* object uses, in isolation
with the rest of the system.  For example, instances have maps,
which are often shared across many instances; in this case the maps
would probably be ignored by an implementation of sys.getsizeof(),
but their overhead is important in some cases if they are many
instances with unique maps.  Conversely, equal strings may share
their internal string data even if they are different objects---or
empty containers may share parts of their internals as long as they
are empty.  Even stranger, some lists create objects as you read
them; if you try to estimate the size in memory of range(10**6) as
the sum of all items' size, that operation will by itself create one
million integer objects that never existed in the first place.
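
As the PyPy message explains, getsizeof only works there when given a fallback. A minimal illustration (the 28-byte fallback is an arbitrary, CPython-ish estimate, not anything pandas prescribes):

import sys

x = 123
# Real size on CPython; on PyPy, returns the fallback (28) instead of raising.
print(sys.getsizeof(x, 28))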

Expected Behavior

memory_usage should not raise under PyPy. Here's an example of how pandas already avoids this getsizeof pitfall elsewhere, by passing a fallback size to getsizeof:

https://github.com/pandas-dev/pandas/blob/main/pandas/core/indexes/multi.py#L1266-L1267
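
Applying the same pattern to RangeIndex.nbytes would mean passing a fallback to every getsizeof call. A sketch of the idea only (range_nbytes is a hypothetical helper, and the fallback constants are rough CPython-based guesses, not values taken from pandas):

from sys import getsizeof

def range_nbytes(rng: range) -> int:
    # Fallbacks make this PyPy-safe: sys.getsizeof(obj, default)
    # returns `default` on PyPy instead of raising TypeError.
    objsize = 48  # rough CPython size of a range object (assumption)
    intsize = 28  # rough CPython size of a small int (assumption)
    return getsizeof(rng, objsize) + sum(
        getsizeof(getattr(rng, attr), intsize)
        for attr in ("start", "stop", "step")
    )

print(range_nbytes(range(0, 10**6, 2)))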

Installed Versions

To reproduce this bug, use a mambaforge-pypy environment.
