Code Sample, a copy-pastable example if possible
import scipy.sparse as sp
import pandas as pd
import numpy as np
shape = (500000, 50000)
data = np.repeat(1, 10000)
i = np.random.choice(shape[0], 10000, replace=False)
j = np.random.choice(shape[1], 10000, replace=False)
X = sp.coo_matrix((data, (i, j)), shape=shape)
# this works fine: build with an integer index, then assign the string index afterwards
df = pd.SparseDataFrame(X, index=np.arange(shape[0]))
df.index = np.arange(shape[0]).astype(str)
# passing the string index to the constructor requires 400GB of memory and takes an hour
df = pd.SparseDataFrame(X, index=np.arange(shape[0]).astype(str))
Problem description
pd.SparseDataFrame densifies its input if it is handed a string index. This is extremely undesirable and very confusing for the user.
Expected Output
The data frame should be created in a matter of seconds, without coercing to a dense matrix.
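One way to check the expected behaviour is to build the frame with a string index and confirm the result is still sparse. The sketch below is not the code path from this report: it assumes a newer pandas (0.25 or later) where `pd.DataFrame.sparse.from_spmatrix` and the `.sparse` accessor are available, and it uses a much smaller shape (the `nnz` name and the reduced sizes are illustrative choices) so it runs quickly.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

# Smaller than the shape in the report so the sketch runs quickly.
shape = (5000, 500)
nnz = 100  # number of stored (non-zero) entries; illustrative value

rng = np.random.RandomState(0)
i = rng.choice(shape[0], nnz, replace=False)
j = rng.choice(shape[1], nnz, replace=False)
X = sp.coo_matrix((np.ones(nnz, dtype=np.int64), (i, j)), shape=shape)

# Assumption: pandas >= 0.25, where DataFrame.sparse.from_spmatrix exists.
# The string index is passed at construction time, as in the failing call.
df = pd.DataFrame.sparse.from_spmatrix(
    X, index=np.arange(shape[0]).astype(str)
)

# If the data stayed sparse, the density equals nnz / (rows * cols).
print(df.shape)
print(df.sparse.density)
```

If the density printed here matches `nnz / (shape[0] * shape[1])`, the input was not coerced to a dense matrix on the way in.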
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.18.3-arch1-1-ARCH
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: 3.7.3
pip: 18.0
setuptools: 40.0.0
Cython: 0.28.5
numpy: 1.15.0
scipy: 1.1.0
pyarrow: 0.10.0
xarray: None
IPython: 6.5.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.7
feather: 0.4.0
matplotlib: 2.2.3
openpyxl: 2.5.5
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 4.2.4
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.10
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.6.0