
to_sql() performance regression (#19664) when DF contains many columns #21146

Closed
@schettino72

Description

After #19664, to_sql() performance is highly dependent on the number of columns.

Problem description

For a single column, 0.23 is faster than 0.22. Adding more columns has little influence on 0.22 but quickly degrades performance on 0.23; e.g., with 20 columns, 0.23 takes 1.5 times as long as 0.22.

The degradation occurs regardless of how the chunksize parameter is set.

import time

import numpy as np
import pandas as pd
from sqlalchemy import create_engine


start = time.time()

N_COLS = 20
df = pd.DataFrame({n: np.arange(20_000) for n in range(N_COLS)})

# create the engine to connect pandas to PostgreSQL
engine = create_engine('postgresql://user:@localhost/db')
# create connection
conn = engine.connect()

# write df to a sql table
df.to_sql('test', engine, if_exists='replace', chunksize=1_000)
# read back to verify the write
result = conn.execute("select * from test")
conn.close()
print('WRITE: {}'.format(time.time() - start))

Tested on Linux with PostgreSQL 9.6, Python 3.6, and SQLAlchemy 1.2.2.
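For anyone wanting to reproduce the scaling without a PostgreSQL server, here is a minimal sketch using an in-memory SQLite database instead (an assumption for self-containment; absolute timings will differ from PostgreSQL, but the growth with column count should still be visible). chunksize is kept at 1_000 as in the script above:

import time

import numpy as np
import pandas as pd
from sqlalchemy import create_engine

# in-memory SQLite keeps the sketch self-contained; swap in the
# PostgreSQL URL from the script above to reproduce the original setup
engine = create_engine('sqlite://')

for n_cols in (1, 5, 10, 20):
    df = pd.DataFrame({n: np.arange(20_000) for n in range(n_cols)})
    start = time.time()
    df.to_sql('test', engine, if_exists='replace', chunksize=1_000)
    print('{:>2} cols: {:.3f}s'.format(n_cols, time.time() - start))

Running this under 0.22 and 0.23 and comparing the per-column-count times should show the 0.23 numbers growing much faster with n_cols.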


Labels: Performance (memory or execution speed performance), Regression (functionality that used to work in a prior pandas version)
