Behaviour of Categorical inputs to sparse data structures #19278

Open
@jnothman

Code Sample, a copy-pastable example if possible

>>> import pandas as pd
>>> c = pd.Categorical(list('abcabc'))
>>> c
[a, b, c, a, b, c]
Categories (3, object): [a, b, c]
>>> pd.Series(c).dtype
CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)
>>> pd.Series(c).to_sparse().dtype
dtype('O')
>>> pd.SparseArray(c)
[a, b, c, a, b, c]
Fill: nan
IntIndex
Indices: array([0, 1, 2, 3, 4, 5], dtype=int32)

>>> pd.SparseArray(c).dtype
dtype('O')
>>> pd.SparseSeries(c)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/joel/anaconda3/envs/pandas-dev/lib/python3.6/site-packages/pandas/core/sparse/series.py", line 175, in __init__
    length = len(index)
TypeError: object of type 'NoneType' has no len()
>>> pd.DataFrame({'a': c})['a']
0    a
1    b
2    c
3    a
4    b
5    c
Name: a, dtype: category
Categories (3, object): [a, b, c]
>>> pd.SparseDataFrame({'a': c})['a']
0    a
1    b
2    c
3    a
4    b
5    c
Name: a, dtype: object
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([6], dtype=int32)

Problem description

  • Categoricals are upcast to object dtype when put into SparseArray and SparseDataFrame (or when calling Series.to_sparse()). This is inconsistent with the categorical dtype retained by dense Series and DataFrame.
  • SparseSeries raises an error when constructed with a categorical argument. This is inconsistent with the SparseArray and SparseDataFrame behaviour.

Expected Output

SparseDataFrame({'a': c})['a'].dtype == SparseSeries(c).dtype == SparseArray(c).dtype == Series(c).dtype

or at a minimum:

SparseSeries(c) raises no error, and produces object dtype.
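
The expectation above can be phrased as a small check. What follows is only a sketch: `dtype_report`, `is_consistent`, and `pandas_constructors` are hypothetical helper names, not part of pandas, and the sparse constructors are looked up defensively on the assumption that they may not exist in the installed pandas version.

```python
def dtype_report(constructors, values):
    """Map each constructor name to the dtype it yields, or to the exception it raises."""
    report = {}
    for name, build in constructors.items():
        try:
            report[name] = build(values).dtype
        except Exception as exc:  # keep failures (e.g. a TypeError) in the report
            report[name] = exc
    return report


def is_consistent(report, reference="Series"):
    """True when nothing raised and every dtype equals the reference entry's dtype."""
    expected = report[reference]
    if isinstance(expected, Exception):
        return False
    return all(
        not isinstance(result, Exception) and expected == result
        for result in report.values()
    )


def pandas_constructors():
    """Hypothetical wiring for the containers named in this report."""
    import pandas as pd  # imported lazily so the helpers above stay pandas-free

    constructors = {"Series": pd.Series}
    # Assumption: the sparse containers may be missing, so fetch them with a default.
    for name in ("SparseArray", "SparseSeries"):
        ctor = getattr(pd, name, None)
        if ctor is not None:
            constructors[name] = ctor
    sparse_frame = getattr(pd, "SparseDataFrame", None)
    if sparse_frame is not None:
        # Select column 'a' so the result exposes a single .dtype, as in the report.
        constructors["SparseDataFrame"] = lambda values: sparse_frame({"a": values})["a"]
    return constructors
```

With pandas installed, `is_consistent(dtype_report(pandas_constructors(), pd.Categorical(list('abcabc'))))` would express the first expectation. The weaker one only requires that the `SparseSeries` entry of the report is a dtype rather than an exception.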

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Darwin
OS-release: 17.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_AU.UTF-8
LOCALE: en_AU.UTF-8

pandas: 0+unknown
pytest: None
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Labels: Constructors, Enhancement, Sparse
