Description
Code Sample
This example is taken directly from the docs, except that I've changed the sparsity of the arrays from 90% to 99%.
import pandas as pd
from scipy.sparse import csr_matrix
import numpy as np
arr = np.random.random(size=(1000, 5))
arr[arr < .99] = 0
sp_arr = csr_matrix(arr)
%timeit sdf = pd.SparseDataFrame(sp_arr)
4.78 ms ± 381 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Now, here's what happens when I increase the number of columns from 5 to 2000:
import pandas as pd
from scipy.sparse import csr_matrix
import numpy as np
arr = np.random.random(size=(1000, 2000))
arr[arr < .99] = 0
sp_arr = csr_matrix(arr)
%timeit sdf = pd.SparseDataFrame(sp_arr)
8.69 s ± 208 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Note that initializing the scipy.sparse.csr_matrix object itself is dramatically faster:
import pandas as pd
from scipy.sparse import csr_matrix
import numpy as np
arr = np.random.random(size=(1000, 2000))
arr[arr < .99] = 0
%timeit sp_arr = csr_matrix(arr)
13 ms ± 248 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Problem description
Constructing a SparseDataFrame with many columns is extremely slow. I've traced the problem to this line in the SparseDataFrame._init_dict() function. I don't know why the frame is built by assigning individual columns to a DataFrame object one at a time; the DataFrame._init_dict method appears to use a much more efficient approach.
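For illustration, here is a minimal sketch of the two construction patterns as I understand them (the variable names and the column data are made up for the comparison; this is not the pandas internals verbatim):
import numpy as np
import pandas as pd

n_rows, n_cols = 1000, 2000
cols = {j: np.zeros(n_rows) for j in range(n_cols)}

# Pattern 1: column-at-a-time assignment, analogous to what
# SparseDataFrame._init_dict appears to do. Each __setitem__ aligns
# the index and updates the frame's internal blocks, so the
# per-column overhead accumulates as the number of columns grows.
df_slow = pd.DataFrame(index=range(n_rows))
for j, col in cols.items():
    df_slow[j] = col

# Pattern 2: one-shot dict construction, analogous to the dense
# DataFrame._init_dict path, which assembles all columns into
# blocks in a single pass.
df_fast = pd.DataFrame(cols)
Timing both patterns with %timeit should show the gap between them widening as n_cols grows, which matches the slowdown reported above.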
Output of pd.show_versions()
pandas: 0.20.2
pytest: 3.1.2
pip: 9.0.1
setuptools: 36.0.1
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 6.1.0
sphinx: 1.6.1
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 0.9.6
lxml: None
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None