PERF: sparse to_csv #49066

rtlee9 · 2022-10-13T05:16:54Z

Improves NDFrame.to_csv performance for sparse DataFrames by casting to dense before initializing DataFrameFormatter. This results in far fewer calls to to_native_types, which saves time. Adds a new ASV benchmark based on the example provided by the OP in #41023.
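For readers on a pandas version without this change, the same idea can be tried at the user level. The snippet below is a minimal sketch, not the code in this PR: the helper name `to_csv_via_dense`, the frame size, and the fill value are all made up for illustration, and it only handles frames where every column is sparse.

```python
import io

import numpy as np
import pandas as pd

# Illustrative data only: roughly 90% zeros, stored as sparse float64 columns.
rng = np.random.default_rng(0)
values = rng.random((100, 20))
values[values < 0.9] = 0.0
sparse_df = pd.DataFrame(values).astype(pd.SparseDtype("float64", 0.0))


def to_csv_via_dense(df, **kwargs):
    """Hypothetical helper: densify an all-sparse frame, then call to_csv."""
    all_sparse = len(df.columns) > 0 and all(
        isinstance(dtype, pd.SparseDtype) for dtype in df.dtypes
    )
    if all_sparse:
        # DataFrame.sparse is only available when every column is sparse.
        df = df.sparse.to_dense()
    return df.to_csv(**kwargs)


csv_text = to_csv_via_dense(sparse_df)
```

Note that this materializes the whole frame at once, so it trades memory for speed.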

Benchmark results:

[100.00%] ··· io.csv.ToCSVSparse.time_sparse_to_dense_to_csv                                        1.16±0.02s
       before           after         ratio
     [56d82a9b]       [945a3525]
     <main~1>         <main>
-       13.6±0.1s       1.22±0.04s     0.09  io.csv.ToCSVSparse.time_sparse_to_csv

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
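For context on what the two timings compare, here is a rough ASV-style sketch. It is an assumption-laden reconstruction, not the benchmark class added in this PR: only the two method names come from the output above, while the class name `ToCSVSparseSketch`, the frame shape, the sparsity level, and the temp-file handling are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd


class ToCSVSparseSketch:
    """Hypothetical ASV-style benchmark; ASV calls setup/teardown around each timing."""

    def setup(self):
        rng = np.random.default_rng(0)
        values = rng.random((200, 50))  # illustrative size, not the PR's
        values[values < 0.95] = 0.0
        self.sparse_df = pd.DataFrame(values).astype(pd.SparseDtype("float64", 0.0))
        fd, self.fname = tempfile.mkstemp(suffix=".csv")
        os.close(fd)

    def teardown(self):
        os.remove(self.fname)

    def time_sparse_to_csv(self):
        # Path under test: serialize the sparse frame directly.
        self.sparse_df.to_csv(self.fname)

    def time_sparse_to_dense_to_csv(self):
        # Reference path: densify first, then serialize.
        self.sparse_df.sparse.to_dense().to_csv(self.fname)
```

The second method acts as a baseline: after the change, the direct path should land close to it.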

closes #41023 (BUG: Serializing sparse dataframe 15x slower than converting to dense version + serializing the dense version)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

pandas/core/generic.py

Improves to_csv performance for sparse matrix by casting to dense before initializing DataFrameFormatter. Results in many fewer calls to `to_native_types`, which saves time.

doc/source/whatsnew/v2.0.0.rst

pandas/io/formats/csvs.py

asv_bench/benchmarks/io/csv.py

rtlee9 · 2022-10-27T05:38:11Z

Revised ASV benchmarks (upstream/main vs. this PR) after moving materialization to the chunk level to limit memory use. Chunk-level materialization takes longer than all-at-once materialization but is still a significant improvement over upstream/main.

[100.00%] ··· io.csv.ToCSVSparse.time_sparse_to_dense_to_csv                                         87.2±3ms
       before           after         ratio
     [9c9789c5]       [6d092ee6]
     <main^2>         <main>
-         641±4ms          285±3ms     0.44  io.csv.ToCSVSparse.time_sparse_to_csv

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
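The chunk-level idea can be approximated outside pandas internals. The following is a sketch under stated assumptions, not the implementation in pandas/io/formats/csvs.py: the function name `write_csv_in_dense_chunks` and the default chunk size are hypothetical, and it assumes every column of the frame is sparse.

```python
import io

import numpy as np
import pandas as pd


def write_csv_in_dense_chunks(df, buf, chunksize=1000):
    """Hypothetical helper: densify and write one row chunk at a time."""
    if len(df) == 0:
        # Nothing to chunk; let pandas write just the header.
        df.to_csv(buf)
        return
    for start in range(0, len(df), chunksize):
        # Only this slice is held in dense form at any moment.
        chunk = df.iloc[start:start + chunksize].sparse.to_dense()
        # Emit the header once, with the first chunk.
        chunk.to_csv(buf, header=(start == 0))


# Illustrative data only: mostly-zero sparse float64 columns.
rng = np.random.default_rng(1)
values = rng.random((50, 8))
values[values < 0.9] = 0.0
sparse_df = pd.DataFrame(values).astype(pd.SparseDtype("float64", 0.0))

buf = io.StringIO()
write_csv_in_dense_chunks(sparse_df, buf, chunksize=7)
```

Peak dense memory is bounded by the chunk size rather than the full frame, at the cost of one conversion per chunk, which is consistent with the slower but still improved timing reported above.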

phofl · 2022-10-27T07:20:24Z

Please don't wait for reviewers to resolve conversations

github-actions · 2022-12-24T00:05:07Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

mroeschke · 2023-01-04T01:50:11Z

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

phofl reviewed Oct 13, 2022

pandas/core/generic.py (outdated, resolved)

mroeschke added the Performance, IO CSV, and Sparse labels Oct 13, 2022

rtlee9 added 2 commits October 13, 2022 19:22

PERF: sparse to_csv

71f92a9

Improves to_csv performance for sparse matrix by casting to dense before initializing DataFrameFormatter. Results in many fewer calls to `to_native_types`, which saves time.

Move optimization deeper in the call stack

4e3e1fa

rtlee9 force-pushed the main branch from 945a352 to 4e3e1fa Compare October 14, 2022 02:28

Merge remote-tracking branch 'upstream/main'

5d67058

mzeitlin11 reviewed Oct 20, 2022

doc/source/whatsnew/v2.0.0.rst (outdated, resolved)

mzeitlin11 reviewed Oct 20, 2022

pandas/io/formats/csvs.py (outdated, resolved)

mzeitlin11 reviewed Oct 20, 2022

asv_bench/benchmarks/io/csv.py (outdated, resolved)

mzeitlin11 reviewed Oct 20, 2022

asv_bench/benchmarks/io/csv.py (outdated, resolved)

rtlee9 added 5 commits October 22, 2022 20:49

Update whatsnew to reference user-facing to_csv instead of `CSVForm…

c0b794f

…atter`

Reduce test frame size and use self.fname for asv test case

6fa18f6

Merge remote-tracking branch 'upstream/main'

6ca039a

Move sparse conversion deeper in the call stack

c6c50d1

This should reduce memory consumption by materializing only one chunk at a time.

Merge remote-tracking branch 'upstream/main'

6d092ee

Merge remote-tracking branch 'upstream/main'

f0060e4

github-actions bot added the Stale label Dec 24, 2022

mroeschke closed this Jan 4, 2023
