Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
file_path = "\\path\\to\\outpattern_user_cost.sas7bdat"  # fails with IndexError
# file_path = "\\path\\to\\outpattern_user_cost_fixed.sas7bdat"  # reads successfully
df = pd.read_sas(filepath_or_buffer=file_path, encoding="infer")
print(df)
Issue Description
With the latest release of pandas (2.2.3), my team noticed that calling read_sas() on a particular .sas7bdat file fails with IndexError: list index out of range in _process_columnname_subheader() while the table's metadata is being processed.
The files in question are in sas_data_files.zip. The archive contains two files: outpattern_user_cost.sas7bdat (the original data file that causes the IndexError) and outpattern_user_cost_fixed.sas7bdat (the same table after being read into a new table in SAS and exported, as described below; it does not cause the error).
Reading the original file produces the following traceback:
Traceback
Traceback (most recent call last):
  File "C:\Users\XXXXXXX\Desktop\examples\my_tests\pandas_sas_issue.py", line 11, in <module>
    df = pd.read_sas(filepath_or_buffer=file_path, encoding="infer")
  File "C:\Users\XXXXXXX\Desktop\processorenv\lib\site-packages\pandas\io\sas\sasreader.py", line 164, in read_sas
    reader = SAS7BDATReader(
  File "C:\Users\XXXXXXX\Desktop\processorenv\lib\site-packages\pandas\io\sas\sas7bdat.py", line 228, in __init__
    self._parse_metadata()
  File "C:\Users\XXXXXXX\Desktop\processorenv\lib\site-packages\pandas\io\sas\sas7bdat.py", line 384, in _parse_metadata
    done = self._process_page_meta()
  File "C:\Users\XXXXXXX\Desktop\processorenv\lib\site-packages\pandas\io\sas\sas7bdat.py", line 390, in _process_page_meta
    self._process_page_metadata()
  File "C:\Users\XXXXXXX\Desktop\processorenv\lib\site-packages\pandas\io\sas\sas7bdat.py", line 453, in _process_page_metadata
    subheader_processor(subheader_offset, subheader_length)
  File "C:\Users\XXXXXXX\Desktop\processorenv\lib\site-packages\pandas\io\sas\sas7bdat.py", line 573, in _process_columnname_subheader
    name_raw = self.column_names_raw[idx]
IndexError: list index out of range
However, if the data is saved to a new table in SAS with no changes made to it:
data outpattern_user_cost_fixed;
set outpattern_user_cost;
run;
and the new table is then exported as outpattern_user_cost_fixed.sas7bdat, read_sas() reads the new file correctly.
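The difference in behavior between the two files can be checked side by side with a short script along these lines. This is only a minimal sketch: try_read_sas is a hypothetical helper introduced for illustration, and the paths are the same placeholders used in the reproducible example.

```python
import pandas as pd


def try_read_sas(path):
    """Attempt to read a SAS file; return (DataFrame or None, error or None).

    Hypothetical helper for illustration, not part of pandas.
    """
    try:
        return pd.read_sas(path, encoding="infer"), None
    except Exception as exc:  # report any failure instead of stopping
        return None, exc


if __name__ == "__main__":
    # Placeholder paths, as in the reproducible example.
    for path in (
        r"\path\to\outpattern_user_cost.sas7bdat",
        r"\path\to\outpattern_user_cost_fixed.sas7bdat",
    ):
        df, err = try_read_sas(path)
        if err is None:
            print(f"{path}: OK, shape={df.shape}")
        else:
            print(f"{path}: {type(err).__name__}: {err}")
```

Based on the behavior described here, the first file would be expected to report the IndexError and the second to report a shape.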
We used SAS to output the metadata of both test files with:
proc contents data="C:\Users\XXXXXX\Desktop\test\gconfid\pandas-ticket\outpattern_user_cost.sas7bdat";
title 'Not Fixed';
run;
proc contents data="C:\Users\XXXXXX\Desktop\test\gconfid\pandas-ticket\outpattern_user_cost_fixed.sas7bdat";
title 'Fixed';
run;
According to the PROC CONTENTS output, the only differences are that "Number of Data Set Pages" is 2 for the original, failing table (outpattern_user_cost.sas7bdat) versus 1 for outpattern_user_cost_fixed.sas7bdat, and that the original file is larger. All other reported metadata, including variable names, types, and lengths, is the same.
Expected Behavior
The outpattern_user_cost.sas7bdat dataset should load without error.
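One way to express this expectation as a check is sketched below, assuming the re-exported file holds exactly the same data as the original, as described above. check_same_table is a hypothetical helper, and its reader parameter is an assumption added only so the function can be exercised without the SAS files.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def check_same_table(original_path, fixed_path, reader=pd.read_sas):
    """Read both files and raise AssertionError if the tables differ.

    Hypothetical helper; `reader` defaults to pd.read_sas and can be
    swapped out when testing without the real files.
    """
    original = reader(original_path, encoding="infer")
    fixed = reader(fixed_path, encoding="infer")
    # Fails if values, dtypes, column names, or index differ.
    assert_frame_equal(original, fixed)
    return original
```

With pandas 2.2.3 this would be expected to stop at the first read with the IndexError shown in the traceback, rather than reach the comparison.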
Installed Versions
INSTALLED VERSIONS
commit : 0691c5c
python : 3.10.11
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : AMD64 Family 25 Model 1 Stepping 1, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : English_United States.1252
pandas : 2.2.3
numpy : 1.26.4
pytz : 2023.3
dateutil : 2.8.2
pip : 24.0
Cython : None
sphinx : None
IPython : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.2
blosc : None
bottleneck : 1.3.8
dataframe-api-compat : None
fastparquet : None
fsspec : None
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : 3.1.2
lxml.etree : 4.9.3
matplotlib : None
numba : 0.60.0
numexpr : 2.9.0
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : 18.1.0
pyreadstat : None
pytest : 7.4.2
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.14.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None