#### Description

- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
#### Code Sample, a copy-pastable example

```python
import numpy as np
import pandas as pd

def from_file(filepath):
    name_hints = []
    name_hints.extend(['m1', 'm2', 'a0', 'e0'])
    name_hints.extend(['weight', 'evolution_age', 'rejuvenation_age'])
    df = pd.read_csv(filepath,
                     nrows=None,
                     names=name_hints,
                     sep=r'\s+',
                     engine='python',
                     dtype=np.float64)
    return df

df = from_file('data/big_data_set.dat.gz')
```
#### Problem description
When loading sufficiently big files (mine are >800 MB on average), `pd.read_csv` fails with a bus error. I am loading `.gz` files directly, as I believe pandas is able to decompress these automatically. I cannot reproduce this with smaller datasets (I have tried files with ~364,000 lines). Running the code under `gdb` produces the following stack trace:
```
Current thread 0x00002aaaaaaea9c0 (most recent call first):
  File "/home/sric560/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 1763 in _infer_types
  File "/home/sric560/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 1708 in _convert_to_ndarrays
  File "/home/sric560/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 2528 in _convert_data
  File "/home/sric560/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 2464 in read
  File "/home/sric560/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 1133 in read
  File "/home/sric560/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 454 in _read
  File "/home/sric560/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 676 in parser_f
  File "/scale_wlg_persistent/filesets/home/sric560/masters/takahe/takahe/load.py", line 14 in from_file
  File "/scale_wlg_persistent/filesets/home/sric560/masters/takahe/takahe/load.py", line 35 in from_directory
  File "BigData.py", line 11 in <module>

/var/spool/slurm/job13442044/slurm_script: line 10: 67748 Segmentation fault (core dumped) gdb -ex r --args python BigData.py
```
#### Expected Output
I would expect the data to be read successfully.
#### Output of `pd.show_versions()`

```
INSTALLED VERSIONS
------------------
commit : None
python : 3.8.2.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-693.2.2.el7.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_NZ.UTF-8
LOCALE : en_NZ.UTF-8

pandas : 1.0.5
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 19.2.3
setuptools : 45.2.0
Cython : 0.29.15
pytest : 5.3.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.0
html5lib : None
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.11.1
IPython : 7.13.0
pandas_datareader: None
bs4 : None
bottleneck : 1.3.1
fastparquet : None
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.1.3
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.3.5
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.13
tables : None
tabulate : 0.8.6
xarray : 0.15.0
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
numba : 0.48.0
```