Code Sample (copy-pastable)
from __future__ import division, print_function

import pandas as pd
import numpy as np
import os
import gc
import psutil


def log_memory(label):
    # Collect every GC generation first, so only memory that cannot be
    # reclaimed shows up in the measurement.
    for i in range(3):
        gc.collect(i)
    process = psutil.Process(os.getpid())
    # Resident set size, converted from bytes to MiB.
    mem_usage = process.memory_info().rss / float(2 ** 20)
    print("[Memory usage] {:<25s} {:12.1f} MB".format(
        label, mem_usage
    ))


def generate_test_data(num_partitions=20):
    for i in range(num_partitions):
        N = 10 * 1000 * 1000
        # randomness required, identical files don't have the issue
        df = pd.DataFrame({
            "A": np.random.uniform(0, 1, size=N),
        })
        df.to_msgpack("/tmp/pd_test_{:02d}.msg".format(i), compress='zlib')


def load_msgpack(f):
    # Read the whole file into a string and pass that string to pandas.
    with open(f, "rb") as fh:
        data = fh.read()
    df = pd.read_msgpack(data)
    return df


def load_partitions_sequentially(num_partitions=20):
    for i in range(num_partitions):
        fn = "/tmp/pd_test_{:02d}.msg".format(i)
        df = load_msgpack(fn)
        # Drop the only reference to the frame before measuring.
        del df
        log_memory("After partition {}".format(i+1))


log_memory("At initialization")
generate_test_data()
log_memory("After data generation")
load_partitions_sequentially()
Problem description
pandas.read_msgpack leaks memory when it reads from a string. Calling pandas.read_msgpack(str_data) increases the reference count of str_data if and only if read_msgpack sees the content of str_data for the first time. The leak therefore shows up only when different files are read: reading the same file over and over again leaks str_data only once.
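The reference-count behaviour described above can be checked with `sys.getrefcount`. The sketch below is only an illustration of the measurement. `leaky_parse` and `refcount_delta` are hypothetical names, and `leaky_parse` is a stand-in that deliberately keeps a reference to any payload whose content it has not seen before. It is not how pandas is implemented.

```python
import os
import sys

_seen = set()  # hypothetical cache that plays the role of the leak


def leaky_parse(data):
    """Stand-in parser: retains each payload whose content is new."""
    _seen.add(data)  # a set keeps the existing element if an equal one is present
    return len(data)


def refcount_delta(func, arg):
    """Return how many extra references to `arg` exist after calling func(arg)."""
    before = sys.getrefcount(arg)
    func(arg)
    after = sys.getrefcount(arg)
    return after - before


if __name__ == "__main__":
    payload = os.urandom(64)  # random content, as in the reproduction script
    print("first call :", refcount_delta(leaky_parse, payload))
    print("second call:", refcount_delta(leaky_parse, payload))
```

To test the real function, `leaky_parse` would be replaced by `pd.read_msgpack` on a pandas version that still provides it. A non-zero delta after the returned frame has been discarded would point at a retained reference.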
The problem does not occur when reading from file handles or BytesIO.
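Based on that observation, a possible workaround is to avoid handing the raw string to the reader. The following is a minimal sketch, not a verified fix. The helper names and the `reader` parameter are assumptions added for illustration, and the default reader requires a pandas version that still ships `read_msgpack` (it was removed in pandas 1.0).

```python
import io


def _default_reader():
    # Imported lazily so the helpers can be defined without pandas installed.
    import pandas as pd
    return pd.read_msgpack


def load_msgpack_via_buffer(path, reader=None):
    """Read the file into memory, then wrap the bytes in a BytesIO object."""
    reader = reader or _default_reader()
    with open(path, "rb") as fh:
        buf = io.BytesIO(fh.read())
    return reader(buf)


def load_msgpack_via_handle(path, reader=None):
    """Pass the open file handle straight to the reader."""
    reader = reader or _default_reader()
    with open(path, "rb") as fh:
        return reader(fh)
```

In the reproduction script, `load_msgpack(fn)` could be swapped for either helper to compare the memory log against the leaking variant.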
Output of above example
The output clearly shows the effect of the memory leak when loading data frame partitions sequentially:
[Memory usage] At initialization                 39.4 MB
[Memory usage] After data generation             39.9 MB
[Memory usage] After partition 1                185.9 MB
[Memory usage] After partition 2                329.8 MB
[Memory usage] After partition 3                473.7 MB
[Memory usage] After partition 4                617.6 MB
[Memory usage] After partition 5                761.5 MB
[Memory usage] After partition 6                905.4 MB
[Memory usage] After partition 7               1049.3 MB
[Memory usage] After partition 8               1193.2 MB
[Memory usage] After partition 9               1337.1 MB
[Memory usage] After partition 10              1481.0 MB
[Memory usage] After partition 11              1624.9 MB
[Memory usage] After partition 12              1768.8 MB
[Memory usage] After partition 13              1912.7 MB
[Memory usage] After partition 14              2056.6 MB
[Memory usage] After partition 15              2200.4 MB
[Memory usage] After partition 16              2344.3 MB
[Memory usage] After partition 17              2488.2 MB
[Memory usage] After partition 18              2631.7 MB
[Memory usage] After partition 19              2775.6 MB
[Memory usage] After partition 20              2919.5 MB
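To put a number on the growth, the per-partition increase can be computed from the log itself. The sketch below parses the lines above and compares the average step with the raw size of one float64 column of 10 million values. The names `PATTERN`, `partition_deltas`, `mean_growth` and `raw_mb` are illustrative, and the values are MiB because the script divides by 2 ** 20. The comparison is only for scale, since the log alone does not say which buffers are retained.

```python
import re

LOG = """\
[Memory usage] After partition 1 185.9 MB
[Memory usage] After partition 2 329.8 MB
[Memory usage] After partition 3 473.7 MB
[Memory usage] After partition 4 617.6 MB
[Memory usage] After partition 5 761.5 MB
[Memory usage] After partition 6 905.4 MB
[Memory usage] After partition 7 1049.3 MB
[Memory usage] After partition 8 1193.2 MB
[Memory usage] After partition 9 1337.1 MB
[Memory usage] After partition 10 1481.0 MB
[Memory usage] After partition 11 1624.9 MB
[Memory usage] After partition 12 1768.8 MB
[Memory usage] After partition 13 1912.7 MB
[Memory usage] After partition 14 2056.6 MB
[Memory usage] After partition 15 2200.4 MB
[Memory usage] After partition 16 2344.3 MB
[Memory usage] After partition 17 2488.2 MB
[Memory usage] After partition 18 2631.7 MB
[Memory usage] After partition 19 2775.6 MB
[Memory usage] After partition 20 2919.5 MB
"""

# Matches the label and the RSS value regardless of column padding.
PATTERN = re.compile(r"After partition (\d+)\s+([\d.]+) MB")


def partition_deltas(log_text):
    """Return the RSS increase between consecutive partitions, in MiB."""
    rss = [float(m.group(2)) for m in PATTERN.finditer(log_text)]
    return [round(b - a, 1) for a, b in zip(rss, rss[1:])]


deltas = partition_deltas(LOG)
mean_growth = sum(deltas) / len(deltas)

# Size of one partition's float64 column (10 million values, 8 bytes each).
raw_mb = 10 * 1000 * 1000 * 8 / 2 ** 20

print("growth per partition: {:.1f} MiB (min {:.1f}, max {:.1f})".format(
    mean_growth, min(deltas), max(deltas)))
print("raw float64 column:   {:.1f} MiB".format(raw_mb))
```

The step is nearly constant across all partitions, which is consistent with a fixed amount of memory being retained per file rather than with allocator noise.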
Output of pd.show_versions()
pandas: 0.20.2
pytest: None
pip: 9.0.1
setuptools: 36.0.1
Cython: None
numpy: 1.13.0
scipy: None
xarray: None
IPython: 5.4.1
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None