DOC: warning section on memory overflow when joining/merging dataframes on index with duplicate keys #14736

Closed
@xgdgsc

Code Sample, a copy-pastable example if possible

http://stackoverflow.com/questions/32750970/python-pandas-merge-causing-memory-overflow

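# coding: utf-8
import pandas as pd

# Use '<Date>' as the index; per this report, that index contains duplicate keys
data = pd.read_csv('https://gist.githubusercontent.com/xgdgsc/8671a22136e1da937f1046a5f211c0ff/raw/d261706a6e7d1d7014e45e47122ead71e7159ef4/small.csv', index_col='<Date>')
print(data.shape)
# Select a single column (note the leading space in the column name)
another = data[[' <Open>']]
# Join on the duplicated index; this is the call that exhausts memory
joined = data.join([another])
print(joined.shape)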

Problem description

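Currently, joining DataFrames on an index that contains duplicate keys can cause severe memory overflow: every row carrying a given key on one side is paired with every row carrying that key on the other, so the result grows multiplicatively. This sometimes freezes the computer and forces a hard reboot, which can be disastrous for unsaved work.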
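To make the growth concrete, here is a minimal sketch on synthetic data rather than the CSV from the report. The frame contents, the column names `open` and `open_again`, and the row count of 300 are all assumptions chosen for illustration.

```python
import pandas as pd

# Synthetic stand-in for the reported data: 300 rows that all share one index key.
n = 300
left = pd.DataFrame({"open": range(n)}, index=["2016-01-01"] * n)

# A second frame with the same duplicated index (column renamed to avoid overlap).
right = left[["open"]].rename(columns={"open": "open_again"})

# Each of the n left rows matches each of the n right rows on the shared key.
joined = left.join(right)

print(left.shape)
print(joined.shape)
```

With a single key repeated n times on both sides, the join produces n * n rows, so a frame of modest size can turn into a result that does not fit in memory.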

Expected Output

A simple check before joining/merging that stops the operation and warns the user would be enough.
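One way such a check might look in user code is sketched below. It is not part of pandas: `estimate_left_join_rows`, `safe_join`, and the `max_rows` threshold are hypothetical names introduced here, and the estimate assumes an index-on-index left join (the default for `DataFrame.join`) with no missing values in either index.

```python
import pandas as pd


def estimate_left_join_rows(left, right):
    """Estimate the row count of left.join(right) without performing the join.

    Hypothetical helper: assumes an index-on-index left join.
    """
    left_counts = left.index.value_counts()
    right_counts = right.index.value_counts()
    # A key absent from `right` still yields one output row per left row.
    matches = right_counts.reindex(left_counts.index, fill_value=1)
    # Each key contributes (rows on the left) * (matching rows on the right).
    return int((left_counts * matches).sum())


def safe_join(left, right, max_rows=10_000_000, **kwargs):
    """Join only if the estimated result stays under max_rows (hypothetical guard)."""
    expected = estimate_left_join_rows(left, right)
    if expected > max_rows:
        raise ValueError(
            f"join would produce about {expected} rows (limit {max_rows}); "
            "check the index for duplicate keys"
        )
    return left.join(right, **kwargs)


left = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "a", "b"])
right = pd.DataFrame({"y": [10, 20, 30, 40]}, index=["a", "a", "a", "c"])

print(estimate_left_join_rows(left, right))
print(len(safe_join(left, right)))
```

The estimate only counts index values, so it costs far less than the join itself. The threshold is arbitrary and would need tuning to the memory available.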

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: None
