BUG: Segmentation faults in pd.read_csv for large files #35051

Open
@Krytic

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import numpy as np
import pandas as pd

def from_file(filepath):
    name_hints = []
    name_hints.extend(['m1','m2','a0','e0'])
    name_hints.extend(['weight','evolution_age','rejuvenation_age'])

    df = pd.read_csv(filepath,
                     nrows=None,
                     names=name_hints,
                     sep=r'\s+',
                     engine='python',
                     dtype=np.float64)

    return df

df = from_file('data/big_data_set.dat.gz')

Problem description

When loading sufficiently large files (mine average more than 800 MB), pd.read_csv crashes with a Bus Error. I am loading the .gz files directly, as I believe pandas decompresses them automatically.
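The original data file is not attached here. For anyone who wants to try reproducing this, a file with the same layout (seven whitespace-separated float columns, gzip-compressed) could be generated roughly as follows. This is only a sketch: the function names, the random values and the row count are assumptions, and it has not been confirmed that a synthetic file triggers the crash.

```python
import gzip
import random

# Seven columns, matching the seven names in name_hints above.
N_COLUMNS = 7


def write_synthetic(path, n_rows, seed=0):
    """Write n_rows of whitespace-separated floats to a gzip-compressed text file."""
    rng = random.Random(seed)
    with gzip.open(path, "wt", encoding="ascii") as fh:
        for _ in range(n_rows):
            fields = [f"{rng.random():.6e}" for _ in range(N_COLUMNS)]
            fh.write("   ".join(fields) + "\n")


def count_rows(path):
    """Stream the file back and check that every row has N_COLUMNS fields."""
    n = 0
    with gzip.open(path, "rt", encoding="ascii") as fh:
        for line in fh:
            if len(line.split()) != N_COLUMNS:
                raise ValueError(f"row {n + 1} does not have {N_COLUMNS} fields")
            n += 1
    return n


if __name__ == "__main__":
    # Hypothetical file name and size; increase n_rows to approach the
    # file sizes described above.
    write_synthetic("synthetic.dat.gz", 1000)
    print(count_rows("synthetic.dat.gz"))
```

count_rows also works as a quick check that an existing .gz file decompresses cleanly and has a consistent number of columns, which helps rule out a truncated or ragged input.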

I cannot reproduce this with a smaller dataset (I have tried files with ~364,000 lines). Running the code under gdb gives the following stack trace:

Thread 0x00002aaab53d0700 (most recent call first):
File "/opt/nesi/CS400_centos7_bdw/Python/3.8.2-gimkl-2020a/lib/python3.8/threading.py", line 306 in wait
File "/opt/nesi/CS400_centos7_bdw/Python/3.8.2-gimkl-2020a/lib/python3.8/threading.py", line 558 in wait
File "/home/sric560/.local/lib/python3.8/site-packages/tqdm/_monitor.py", line 69 in run
File "/opt/nesi/CS400_centos7_bdw/Python/3.8.2-gimkl-2020a/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/opt/nesi/CS400_centos7_bdw/Python/3.8.2-gimkl-2020a/lib/python3.8/threading.py", line 890 in _bootstrap

Current thread 0x00002aaaaaaea9c0 (most recent call first):
File "/home/sric560/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 1763 in _infer_types
File "/home/sric560/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 1708 in _convert_to_ndarrays
File "/home/sric560/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 2528 in _convert_data
File "/home/sric560/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 2464 in read
File "/home/sric560/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 1133 in read
File "/home/sric560/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 454 in _read
File "/home/sric560/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 676 in parser_f
File "/scale_wlg_persistent/filesets/home/sric560/masters/takahe/takahe/load.py", line 14 in from_file
File "/scale_wlg_persistent/filesets/home/sric560/masters/takahe/takahe/load.py", line 35 in from_directory
File "BigData.py", line 11 in <module>
/var/spool/slurm/job13442044/slurm_script: line 10: 67748 Segmentation fault (core dumped) gdb -ex r --args python BigData.py
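The Python-level frames above appear to be in the format written by the standard-library faulthandler module. For anyone trying to capture a similar trace, a minimal sketch of enabling it at the top of the script is shown below. The log file name is an assumption.

```python
import faulthandler

# Keep the file object alive for the whole run: faulthandler writes to its
# file descriptor directly when a fatal signal (SIGSEGV, SIGBUS, ...) arrives.
fault_log = open("faulthandler.log", "w")

# Dump the Python stack of every thread, not only the crashing one.
faulthandler.enable(file=fault_log, all_threads=True)
```

The same handler can be switched on without editing the script by running python -X faulthandler BigData.py or by setting the PYTHONFAULTHANDLER environment variable, in which case the trace goes to stderr.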

Expected Output

I would expect the data to be read successfully.
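As a possible way to narrow this down, the same file could be read in chunks so that type conversion runs on a bounded number of rows at a time. The sketch below keeps the parameters from the code sample and only adds chunksize. The function name and the chunk size are assumptions, and it is not confirmed that this avoids the crash.

```python
import numpy as np
import pandas as pd

NAMES = ['m1', 'm2', 'a0', 'e0', 'weight', 'evolution_age', 'rejuvenation_age']


def from_file_chunked(filepath, chunksize=100_000):
    # With chunksize set, read_csv returns an iterator of DataFrames
    # instead of parsing the whole file in one call.
    reader = pd.read_csv(filepath,
                         names=NAMES,
                         sep=r'\s+',
                         engine='python',
                         dtype=np.float64,
                         chunksize=chunksize)

    # Collect the pieces, then stitch them into a single frame.
    chunks = [chunk for chunk in reader]
    return pd.concat(chunks, ignore_index=True)
```

If one particular chunk crashes, that would point at the rows involved. If every chunk succeeds, the problem is more likely related to the total size being converted at once.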

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.8.2.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-693.2.2.el7.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_NZ.UTF-8
LOCALE : en_NZ.UTF-8

pandas : 1.0.5
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 19.2.3
setuptools : 45.2.0
Cython : 0.29.15
pytest : 5.3.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.0
html5lib : None
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.11.1
IPython : 7.13.0
pandas_datareader: None
bs4 : None
bottleneck : 1.3.1
fastparquet : None
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.1.3
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.3.5
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.13
tables : None
tabulate : 0.8.6
xarray : 0.15.0
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
numba : 0.48.0

Labels: Bug; IO CSV (read_csv, to_csv); Segfault (Non-Recoverable Error)
