BUG: IndexError in _process_columnname_subheader() when using read_sas() #60809

Open
@andrdom

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

file_path = "\\path\\to\\outpattern_user_cost.sas7bdat"  # fails with IndexError
# file_path = "\\path\\to\\outpattern_user_cost_fixed.sas7bdat"  # succeeds

df = pd.read_sas(filepath_or_buffer=file_path, encoding="infer")
print(df)

Issue Description

With the latest pandas release (2.2.3), calling read_sas() on a particular .sas7bdat file fails with IndexError: list index out of range. The error is raised in _process_columnname_subheader() while the reader parses the table's metadata.

Both files are attached in sas_data_files.zip. The archive contains outpattern_user_cost.sas7bdat, the original data file that triggers the IndexError, and outpattern_user_cost_fixed.sas7bdat, the same table after being copied into a new table in SAS and exported (see below), which does not trigger the error.

Reading the original file produces the following traceback:

Traceback (most recent call last):
  File "C:\Users\XXXXXXX\Desktop\examples\my_tests\pandas_sas_issue.py", line 11, in <module>
    df = pd.read_sas(filepath_or_buffer=file_path, encoding="infer")
  File "C:\Users\XXXXXXX\Desktop\processorenv\lib\site-packages\pandas\io\sas\sasreader.py", line 164, in read_sas
    reader = SAS7BDATReader(
  File "C:\Users\XXXXXXX\Desktop\processorenv\lib\site-packages\pandas\io\sas\sas7bdat.py", line 228, in __init__
    self._parse_metadata()
  File "C:\Users\XXXXXXX\Desktop\processorenv\lib\site-packages\pandas\io\sas\sas7bdat.py", line 384, in _parse_metadata
    done = self._process_page_meta()
  File "C:\Users\XXXXXXX\Desktop\processorenv\lib\site-packages\pandas\io\sas\sas7bdat.py", line 390, in _process_page_meta
    self._process_page_metadata()
  File "C:\Users\XXXXXXX\Desktop\processorenv\lib\site-packages\pandas\io\sas\sas7bdat.py", line 453, in _process_page_metadata
    subheader_processor(subheader_offset, subheader_length)
  File "C:\Users\XXXXXXX\Desktop\processorenv\lib\site-packages\pandas\io\sas\sas7bdat.py", line 573, in _process_columnname_subheader
    name_raw = self.column_names_raw[idx]
IndexError: list index out of range
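
For triage, it can help to run both files through the same call and record where each failure surfaces. The following is a minimal sketch, not part of the original report: `probe_files` is a hypothetical helper, and the reader is passed in as a callable so the helper itself does not depend on pandas.

```python
import traceback
from typing import Any, Callable, Dict, Iterable


def probe_files(paths: Iterable[str], reader: Callable[[str], Any]) -> Dict[str, str]:
    """Call reader(path) for each path and summarize the outcome.

    `probe_files` is a hypothetical helper written for this issue.
    The result maps each path to "ok" or to a one-line description of
    the exception, including the innermost function that raised it.
    """
    results: Dict[str, str] = {}
    for path in paths:
        try:
            reader(path)
        except Exception as exc:
            # The last traceback entry is the frame where the error was raised.
            last = traceback.extract_tb(exc.__traceback__)[-1]
            results[path] = (
                f"{type(exc).__name__}: {exc} (in {last.name}, line {last.lineno})"
            )
        else:
            results[path] = "ok"
    return results
```

With pandas installed, the reader argument could be `lambda p: pd.read_sas(p, encoding="infer")`. Based on the traceback above, the original file would then be expected to report an IndexError in _process_columnname_subheader, and the re-saved file "ok".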

However, if the dataset is copied to a new table in SAS, with no changes made to it:

data outpattern_user_cost_fixed;
	set outpattern_user_cost;
run;

and the new table is then exported as outpattern_user_cost_fixed.sas7bdat, read_sas() reads the new file correctly.

We used PROC CONTENTS in SAS to print the metadata of both test files:

proc contents data="C:\Users\XXXXXX\Desktop\test\gconfid\pandas-ticket\outpattern_user_cost.sas7bdat";
	title 'Not Fixed';
run;

proc contents data="C:\Users\XXXXXX\Desktop\test\gconfid\pandas-ticket\outpattern_user_cost_fixed.sas7bdat";
	title 'Fixed';
run;

[Images: PROC CONTENTS output for the original ("Not Fixed") and re-saved ("Fixed") files]

As shown above, the only differences reported are that "Number of Data Set Pages" is 2 for the original, failing table (outpattern_user_cost.sas7bdat) versus 1 for outpattern_user_cost_fixed.sas7bdat, and that the original file is larger. All other reported metadata, including variable names, types, and lengths, is identical.
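
The page-count difference can also be checked without SAS by reading a few header fields directly. This is a rough sketch under stated assumptions: the byte offsets come from the community reverse-engineered description of the sas7bdat layout, not from any official specification, and they have not been verified against the attached files. `parse_header_fields` and `read_header_fields` are hypothetical helper names.

```python
import struct

# ASSUMPTION: all offsets follow the reverse-engineered sas7bdat layout.
ALIGN_FLAG_OFFSET = 35    # b"3" here shifts the fields below by 4 bytes
ENDIAN_OFFSET = 37        # b"\x01" is taken to mean little-endian
HEADER_SIZE_OFFSET = 196
PAGE_SIZE_OFFSET = 200
PAGE_COUNT_OFFSET = 204
PREFIX_BYTES = 256        # enough to cover the fields read below


def parse_header_fields(header: bytes) -> dict:
    """Extract header length, page size, and page count from a file prefix."""
    if len(header) < PREFIX_BYTES:
        raise ValueError(f"need at least {PREFIX_BYTES} bytes, got {len(header)}")
    shift = 4 if header[ALIGN_FLAG_OFFSET:ALIGN_FLAG_OFFSET + 1] == b"3" else 0
    little = header[ENDIAN_OFFSET:ENDIAN_OFFSET + 1] == b"\x01"
    fmt = "<I" if little else ">I"

    def read_uint(offset: int) -> int:
        # Each field is read as a 4-byte unsigned integer. Some files may
        # store the page count in 8 bytes; that case is not handled here.
        return struct.unpack_from(fmt, header, offset + shift)[0]

    return {
        "header_length": read_uint(HEADER_SIZE_OFFSET),
        "page_size": read_uint(PAGE_SIZE_OFFSET),
        "page_count": read_uint(PAGE_COUNT_OFFSET),
    }


def read_header_fields(path: str) -> dict:
    """Read the first bytes of a file and parse them as a sas7bdat header."""
    with open(path, "rb") as fh:
        return parse_header_fields(fh.read(PREFIX_BYTES))
```

If the assumed offsets hold, calling `read_header_fields` on each of the two files should give page counts matching the PROC CONTENTS output, which would confirm that the extra page is present in the file itself.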

Expected Behavior

The outpattern_user_cost.sas7bdat dataset should be loaded without error.

Installed Versions

INSTALLED VERSIONS

commit : 0691c5c
python : 3.10.11
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : AMD64 Family 25 Model 1 Stepping 1, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : English_United States.1252

pandas : 2.2.3
numpy : 1.26.4
pytz : 2023.3
dateutil : 2.8.2
pip : 24.0
Cython : None
sphinx : None
IPython : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.2
blosc : None
bottleneck : 1.3.8
dataframe-api-compat : None
fastparquet : None
fsspec : None
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : 3.1.2
lxml.etree : 4.9.3
matplotlib : None
numba : 0.60.0
numexpr : 2.9.0
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : 18.1.0
pyreadstat : None
pytest : 7.4.2
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.14.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None
