BUG: pandas.SparseDtype from pandas.CategoricalDtype fails #39874

Open
@PetarMPetrov

Code Sample

import numpy as np
import pandas as pd

# Create categorical type and sparse type from it.
custom_type = pd.CategoricalDtype(categories=['Zero', 'One'])
categorical_sparse_type = pd.SparseDtype(dtype=custom_type, fill_value='Zero')

# Create sparse type from string type
string_sparse_type = pd.SparseDtype(dtype='str', fill_value='Zero')

# Dummy Data
data = np.array([['Zero', 'Zero'],
                 ['One', 'Zero']])

# Create sparse data frame from categorical sparse type
categorical_sparse_df = pd.DataFrame(
    data=data,
    columns=list('AB'),
).astype(categorical_sparse_type)

# Create sparse data frame from string sparse type
string_sparse_df = pd.DataFrame(
    data=data,
    columns=list('AB'),
).astype(string_sparse_type)

The following operation raises a TypeError:

dense_df = categorical_sparse_df.sparse.to_dense()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-95-364b3ddaf122> in <module>
----> 1 dense_df = categorical_sparse_df.sparse.to_dense()

~/anaconda3/lib/python3.7/site-packages/pandas/core/arrays/sparse/accessor.py in to_dense(self)
    302         from pandas import DataFrame
    303 
--> 304         data = {k: v.array.to_dense() for k, v in self._parent.items()}
    305         return DataFrame(data, index=self._parent.index, columns=self._parent.columns)
    306 

~/anaconda3/lib/python3.7/site-packages/pandas/core/arrays/sparse/accessor.py in <dictcomp>(.0)
    302         from pandas import DataFrame
    303 
--> 304         data = {k: v.array.to_dense() for k, v in self._parent.items()}
    305         return DataFrame(data, index=self._parent.index, columns=self._parent.columns)
    306 

~/anaconda3/lib/python3.7/site-packages/pandas/core/arrays/sparse/array.py in to_dense(self)
   1132         arr : NumPy array
   1133         """
-> 1134         return np.asarray(self, dtype=self.sp_values.dtype)
   1135 
   1136     _internal_get_values = to_dense

~/anaconda3/lib/python3.7/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     83 
     84     """
---> 85     return array(a, dtype, copy=False, order=order)
     86 
     87 

TypeError: data type not understood

In addition, the following does not raise an error, but when groupby is applied, the values in column B (which contains only the fill value "Zero") are unexpectedly truncated to "Z".

string_sparse_df.groupby(level=0).apply(lambda x:x)
      A  B
0  Zero  Z
1   One  Z

If the dense version of the data frame is used, the outcome is as expected.

string_sparse_df.sparse.to_dense().groupby(level=0).apply(lambda x:x)
      A     B
0  Zero  Zero
1   One  Zero

Problem description

From the description of pandas.SparseDtype, my understanding is that the dtype argument can be an ExtensionDtype, and CategoricalDtype is one. However, certain operations on a sparse data frame of such a type (for example, sparse.to_dense() above) raise a TypeError.
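Reading the traceback above, the failing call is np.asarray(self, dtype=self.sp_values.dtype), so the dtype handed to NumPy appears to be the CategoricalDtype itself. The snippet below is only a minimal sketch of that step in isolation; the helper numpy_understands is a hypothetical name introduced for illustration and is not part of pandas or NumPy.

```python
import numpy as np
import pandas as pd

custom_type = pd.CategoricalDtype(categories=["Zero", "One"])


def numpy_understands(dtype) -> bool:
    """Return True if NumPy can interpret `dtype` as one of its own dtypes.

    Hypothetical helper for illustration only.
    """
    try:
        np.dtype(dtype)
    except TypeError:
        # NumPy cannot turn a pandas extension dtype into a np.dtype.
        return False
    return True


print(numpy_understands("int64"))
print(numpy_understands(custom_type))
```

If this is indeed what happens inside to_dense, any SparseDtype whose subtype is a pandas extension dtype would presumably hit the same error, not only the categorical one.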

In addition, replacing the CategoricalDtype with a str dtype partially works around the problem: to_dense() succeeds. However, groupby still produces wrong values when a column consists only of the fill_value.
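One possible explanation for the truncation to "Z", which I have not confirmed against the pandas source, is that the fill values end up being written into a NumPy buffer allocated with the width-less str dtype. The sketch below shows that NumPy behaviour on its own, without pandas; whether the groupby path actually does this is an assumption.

```python
import numpy as np

# Assumption: something equivalent to this allocation happens internally.
# A width-less "str" dtype becomes a one-character unicode dtype ("<U1").
narrow = np.empty(2, dtype="str")
narrow[:] = "Zero"  # silently truncated to fit one character

# With an explicit width of four characters the value survives intact.
wide = np.empty(2, dtype="U4")
wide[:] = "Zero"

print(narrow.dtype, narrow.tolist())
print(wide.dtype, wide.tolist())
```

If this is the cause, it would be consistent with the dense path giving the expected result, because there the column already holds full-length strings.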

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 7d32926
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 4.18.0-22-generic
Version : #23~18.04.1-Ubuntu SMP Thu Jun 6 08:37:25 UTC 2019
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.2.2
numpy : 1.18.1
pytz : 2020.4
dateutil : 2.8.1
pip : 21.0.1
setuptools : 46.4.0.post20200518
Cython : 0.29.14
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.1.8
lxml.etree : 4.3.4
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : 4.9.0
bottleneck : 1.2.1
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.3
numexpr : 2.7.1
odfpy : None
openpyxl : 2.6.2
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.5.4
sqlalchemy : 1.3.5
tables : 3.5.2
tabulate : None
xarray : 0.16.1
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.44.1

Labels

Bug, Categorical, ExtensionArray, Sparse
