PERF: sparse to_csv #49066

Closed
wants to merge 9 commits into from

rtlee9
Contributor

@rtlee9 rtlee9 commented Oct 13, 2022

Improves NDFrame.to_csv performance for sparse DataFrames by casting to dense before initializing DataFrameFormatter. This results in far fewer calls to to_native_types, which saves time. Added a new ASV benchmark based on the example provided by the OP in #41023.
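For readers who want to see the idea from the user side, here is a minimal sketch of "densify first, then serialize". It is not the change made in this PR (which touches pandas internals); the helper name `sparse_to_csv_via_dense` and the data sizes are hypothetical and chosen only for illustration. It assumes every column in the frame has a sparse dtype.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical data: a mostly-zero 100 x 20 matrix (illustrative sizes only).
rng = np.random.default_rng(0)
values = rng.random((100, 20))
values[values < 0.9] = 0.0
sparse_df = pd.DataFrame(values).astype(pd.SparseDtype("float64", 0.0))


def sparse_to_csv_via_dense(df, **kwargs):
    """Convert all-sparse columns to dense, then delegate to DataFrame.to_csv."""
    dense = df.sparse.to_dense()  # materializes the whole frame in memory
    return dense.to_csv(**kwargs)


csv_text = sparse_to_csv_via_dense(sparse_df)
```

Note that this materializes the entire frame at once, so it trades memory for speed.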

Benchmark results:

[100.00%] ··· io.csv.ToCSVSparse.time_sparse_to_dense_to_csv                                        1.16±0.02s
       before           after         ratio
     [56d82a9b]       [945a3525]
     <main~1>         <main>
-       13.6±0.1s       1.22±0.04s     0.09  io.csv.ToCSVSparse.time_sparse_to_csv

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

@mroeschke mroeschke added Performance Memory or execution speed performance IO CSV read_csv, to_csv Sparse Sparse Data Type labels Oct 13, 2022
Improves to_csv performance for sparse matrices by casting to dense
before initializing DataFrameFormatter. Results in many fewer calls to
`to_native_types`, which saves time.
@rtlee9
Contributor Author

rtlee9 commented Oct 27, 2022

Revised asv benchmarks (upstream/main vs. this PR) after moving materialization to the chunk level to conserve memory. Chunk-level materialization takes longer than all-at-once materialization but is still a significant improvement over upstream/main.
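A rough user-level sketch of chunk-level materialization follows. It is an assumption about the general shape of the idea, not the PR's internal implementation: the helper `to_csv_in_dense_chunks`, its `chunksize` parameter, and the sample data are all hypothetical. Only one row chunk is dense at any time, which bounds peak memory at the cost of repeated conversions.

```python
import io

import numpy as np
import pandas as pd


def to_csv_in_dense_chunks(df, buf, chunksize=1000):
    """Write an all-sparse frame to `buf`, densifying one row chunk at a time."""
    if len(df) == 0:
        # Nothing to chunk; still emit the header row.
        df.to_csv(buf)
        return
    for start in range(0, len(df), chunksize):
        chunk = df.iloc[start:start + chunksize]
        dense_chunk = chunk.sparse.to_dense()  # only this chunk is materialized
        # Write the header once, with the first chunk.
        dense_chunk.to_csv(buf, header=(start == 0))


# Hypothetical data, small enough to inspect by hand.
data = np.zeros((10, 4))
data[::3, 1] = 1.5
frame = pd.DataFrame(data, columns=list("abcd")).astype(
    pd.SparseDtype("float64", 0.0)
)

buffer = io.StringIO()
to_csv_in_dense_chunks(frame, buffer, chunksize=3)
```

A smaller `chunksize` lowers peak memory but adds per-chunk overhead, which is consistent with the slower timings reported here relative to all-at-once materialization.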

[100.00%] ··· io.csv.ToCSVSparse.time_sparse_to_dense_to_csv                                         87.2±3ms
       before           after         ratio
     [9c9789c5]       [6d092ee6]
     <main^2>         <main>
-         641±4ms          285±3ms     0.44  io.csv.ToCSVSparse.time_sparse_to_csv

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

@phofl
Member

phofl commented Oct 27, 2022

Please don't wait for reviewers to resolve conversations

@github-actions
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Dec 24, 2022
@mroeschke
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@mroeschke mroeschke closed this Jan 4, 2023
Labels
IO CSV read_csv, to_csv Performance Memory or execution speed performance Sparse Sparse Data Type Stale
Development

Successfully merging this pull request may close these issues.

BUG: Serializing sparse dataframe 15x slower than converting to dense version + serializing the dense version
4 participants