SparseDataFrame constructor has horrible performance for df with many columns #16773

Closed
@flo-compbio

Description

Code Sample

This example is taken directly from the docs, except that I've changed the sparsity of the arrays from 90% to 99%.

import pandas as pd
from scipy.sparse import csr_matrix
import numpy as np

arr = np.random.random(size=(1000, 5))
arr[arr < .99] = 0
sp_arr = csr_matrix(arr)
%timeit sdf = pd.SparseDataFrame(sp_arr)
4.78 ms ± 381 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Now, here's what happens when I increase the number of columns from 5 to 2000:

import pandas as pd
from scipy.sparse import csr_matrix
import numpy as np

arr = np.random.random(size=(1000, 2000))
arr[arr < .99] = 0
sp_arr = csr_matrix(arr)
%timeit sdf = pd.SparseDataFrame(sp_arr)
8.69 s ± 208 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Note that initializing the scipy.sparse.csr_matrix object itself is way (!!!) faster:

import pandas as pd
from scipy.sparse import csr_matrix
import numpy as np

arr = np.random.random(size=(1000, 2000))
arr[arr < .99] = 0
%timeit sp_arr = csr_matrix(arr)
13 ms ± 248 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
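
The timings above use the IPython `%timeit` magic. Outside IPython, a comparable measurement across several column counts could be sketched with the standard `timeit` module. The helper names below (`make_sparse_like_array`, `time_constructor`, `scaling_table`) are hypothetical and only illustrate the measurement, and the default column counts are arbitrary:

```python
import timeit

import numpy as np


def make_sparse_like_array(n_rows, n_cols, density=0.01, seed=0):
    """Dense array in which roughly (1 - density) of the entries are zero."""
    rng = np.random.RandomState(seed)
    arr = rng.random_sample((n_rows, n_cols))
    arr[arr < 1.0 - density] = 0
    return arr


def time_constructor(constructor, arr, repeat=3):
    """Best wall-clock time (seconds) of constructor(arr) over `repeat` runs."""
    return min(timeit.repeat(lambda: constructor(arr), number=1, repeat=repeat))


def scaling_table(constructor, n_rows=1000, column_counts=(5, 50, 500)):
    """Map each column count to the best time measured at that width."""
    return {
        n_cols: time_constructor(
            constructor, make_sparse_like_array(n_rows, n_cols)
        )
        for n_cols in column_counts
    }
```

Passing `csr_matrix` as `constructor` times the scipy step alone. Passing `lambda a: pd.SparseDataFrame(csr_matrix(a))` times both steps together on a pandas version that still provides `SparseDataFrame`, such as the 0.20.2 used here.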

Problem description

Constructing a SparseDataFrame with many columns is ridiculously slow: going from 5 to 2000 columns raises the construction time from about 5 ms to about 8.7 s. I've traced the problem to this line in the SparseDataFrame._init_dict() function. I don't know why the data frame is built by assigning the columns of a DataFrame object one at a time. I think DataFrame._init_dict takes a much more efficient approach.
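
The difference between the two construction patterns can be illustrated with ordinary dense DataFrames. This is only a sketch of the general pattern, not the actual `SparseDataFrame._init_dict` code. The function names are hypothetical, and the size of the gap depends on the pandas version and the number of columns:

```python
import time

import numpy as np
import pandas as pd


def build_column_by_column(arr):
    """Assign one column at a time to an initially empty frame."""
    df = pd.DataFrame(index=range(arr.shape[0]))
    for j in range(arr.shape[1]):
        df[j] = arr[:, j]  # each assignment goes through column insertion
    return df


def build_in_one_call(arr):
    """Hand all columns to the DataFrame constructor at once."""
    return pd.DataFrame({j: arr[:, j] for j in range(arr.shape[1])})


def compare_build_times(arr):
    """Return the wall-clock time in seconds of each strategy on `arr`."""
    timings = {}
    for name, build in [
        ("column_by_column", build_column_by_column),
        ("one_call", build_in_one_call),
    ]:
        start = time.perf_counter()
        build(arr)
        timings[name] = time.perf_counter() - start
    return timings
```

Calling `compare_build_times(np.random.random(size=(1000, 2000)))` gives a rough idea of how the two strategies compare at the width used in this report. Recent pandas versions may also emit a fragmentation warning for the column-by-column variant.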

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-24-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.2
pytest: 3.1.2
pip: 9.0.1
setuptools: 36.0.1
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 6.1.0
sphinx: 1.6.1
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 0.9.6
lxml: None
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Labels: Performance (memory or execution speed performance), Sparse (sparse data type)
