#### Code Sample, a copy-pastable example if possible
```python
import sys

m = int(sys.argv[1])  # number of rows
n = int(sys.argv[2])  # number of columns

# Write an m x n CSV of ones with header c0,...,c{n-1}.
with open('df.csv', 'wt') as f:
    for i in range(n - 1):
        f.write('c' + str(i) + ',')
    f.write('c' + str(n - 1) + '\n')
    for j in range(m):
        for i in range(n - 1):
            f.write('1,')
        f.write('1\n')

# Imports are deliberately placed just before each measurement, so the
# baseline RSS does not include pandas itself.
import psutil
print(psutil.Process().memory_info().rss / 1024**2)  # baseline RSS, MiB

import pandas as pd
df = pd.read_csv('df.csv')
print(df.shape)
print(psutil.Process().memory_info().rss / 1024**2)  # RSS after read_csv

import gc
del df
gc.collect()
print(psutil.Process().memory_info().rss / 1024**2)  # RSS after del + gc
```
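As a cross-check on the RSS figures, the frame's own accounting can be compared against them. This is a hedged sketch, not part of the reproduction: it builds a small in-memory frame mirroring the CSV layout (the sizes here are illustrative, not the ones from the runs below) and sums `DataFrame.memory_usage`:

```python
import numpy as np
import pandas as pd

# Small frame of ones mirroring the generated CSV layout (illustrative sizes).
df = pd.DataFrame(np.ones((1000, 15), dtype=np.int64),
                  columns=['c%d' % i for i in range(15)])

# Bytes actually held by the column data, plus the index.
nbytes = int(df.memory_usage(deep=True).sum())
print(nbytes)  # at least 1000 rows * 15 cols * 8 bytes, plus the index
```

Any gap between this figure and the process RSS is memory held by the allocator or the parser, not by the frame itself.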
#### Problem description
```
$ ~/miniconda3/bin/python3 g.py 1 1
11.60546875
(1, 1)
64.02734375
64.02734375
$ ~/miniconda3/bin/python3 g.py 5000000 15
11.58203125
(5000000, 15)
640.45703125
68.25
$ ~/miniconda3/bin/python3 g.py 5000000 20
11.84375
(5000000, 20)
1586.65625
823.71875 - !!!
$ ~/miniconda3/bin/python3 g.py 10000000 10
11.83984375
(10000000, 10)
830.92578125
67.984375
$ ~/miniconda3/bin/python3 g.py 10000000 15
11.89453125
(10000000, 15)
2344.3046875
1199.89453125 - !!!
```
Two issues:

- There is a "standard" leak of ~53 MB after reading any CSV, or even just creating a frame with `pd.DataFrame()`: RSS never returns to the baseline after `del df; gc.collect()`.
- In some other cases (see the `5000000 20` and `10000000 15` runs above) we see a much larger leak, with hundreds of MB still allocated after the frame is deleted.
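One hypothesis worth ruling out is that the retained memory is not truly leaked but free()'d and cached by the allocator. On glibc-based Linux systems this can be probed with `malloc_trim`, which asks glibc to return free heap pages to the OS; the sketch below (an assumption about the cause, not a confirmed fix) is a no-op on other platforms:

```python
import ctypes
import ctypes.util

def trim_heap():
    """Ask glibc to release free()'d heap pages back to the OS.

    Returns True if glibc reports memory was released, False if nothing
    was released or malloc_trim is unavailable (non-glibc platforms).
    """
    libc_path = ctypes.util.find_library('c')
    if libc_path is None:
        return False
    libc = ctypes.CDLL(libc_path)
    if not hasattr(libc, 'malloc_trim'):
        return False
    # malloc_trim(0): trim as much as possible; returns 1 if memory freed.
    return bool(libc.malloc_trim(0))

trimmed = trim_heap()
print(trimmed)
```

If calling this after `del df; gc.collect()` drops RSS back toward the baseline, the "leak" is allocator caching rather than memory pandas still references.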
cc @gfyoung
#### Output of `pd.show_versions()`

(same for 0.21, 0.22, 0.23)

```
pandas: 0.23.0
pytest: None
pip: 9.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
```