BUG: Serializing sparse dataframe 15x slower than converting to dense version + serializing the dense version #41023

Open
@melkonyan

Description

Code to reproduce:

import timeit

import numpy as np
import pandas as pd
from scipy import sparse as sc

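# Build a 1000x1000 integer matrix and zero out every value above 3,
# so that most entries are zero.
np.random.seed(42)
vals = np.random.randint(0, 10, size=(1000, 1000))
to_zero = vals > 3
vals[to_zero] = 0
sparse_mtx = sc.coo_matrix(vals)
sparse_pd = pd.DataFrame.sparse.from_spmatrix(sparse_mtx)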

num_tries = 30
t1 = timeit.timeit(lambda: sparse_pd.to_csv('sparse_pd.csv'), number=num_tries)
t2 = timeit.timeit(lambda: sparse_pd.sparse.to_dense().to_csv('sparse_pd.csv'), number=num_tries)

overhead = t1/t2

print(t1, t2, overhead)

Output:

56.591012510471046 3.7841985523700714 14.954556883657089
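
Until the sparse path is faster, the timings above suggest a workaround: densify first, then serialize. The following is a minimal sketch, assuming the dense frame fits in memory and that every column is sparse (as in the reproduction). `to_csv_via_dense` is a hypothetical helper name, not part of pandas.

```python
import numpy as np
import pandas as pd
from scipy import sparse as sc


def to_csv_via_dense(df, path_or_buf=None, **kwargs):
    """Hypothetical helper: densify an all-sparse DataFrame, then call to_csv.

    Assumes the dense copy fits in memory. Frames that are not entirely
    sparse are passed to to_csv unchanged.
    """
    is_sparse = [isinstance(dtype, pd.SparseDtype) for dtype in df.dtypes]
    if is_sparse and all(is_sparse):
        # The .sparse accessor is only valid when every column is sparse.
        df = df.sparse.to_dense()
    return df.to_csv(path_or_buf, **kwargs)


# Small demonstration on a 2x2 matrix; with path_or_buf=None,
# to_csv returns the CSV text instead of writing a file.
small = np.array([[0, 1], [2, 0]])
small_sparse = pd.DataFrame.sparse.from_spmatrix(sc.coo_matrix(small))
csv_text = to_csv_via_dense(small_sparse)
print(csv_text)
```

The helper only changes the route taken to the file. It does not address why `to_csv` is slow on sparse columns in the first place.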

Versions:

  • python == 3.9.2
  • pandas == 1.2.4

Metadata

Assignees

No one assigned

Labels

  • IO CSV (read_csv, to_csv)
  • Performance (memory or execution speed performance)
  • Sparse (sparse data type)
