
BUG: ValueError: Unknown subheader signature in _get_subheader_index() when using read_sas() #47041

Open
@jmarinero

Description


Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
df = pd.read_sas(path_to_file)

where path_to_file is the path to a file that loaded correctly until recently.

The error is:

Traceback (most recent call last):
  File "C:\Users\XXXXXXX\Anaconda3\lib\site-packages\pandas\io\sas\sas7bdat.py", line 204, in __init__
    self._parse_metadata()
  File "C:\Users\XXXXXXX\Anaconda3\lib\site-packages\pandas\io\sas\sas7bdat.py", line 409, in _parse_metadata
    done = self._process_page_meta()
  File "C:\Users\XXXXXXX\Anaconda3\lib\site-packages\pandas\io\sas\sas7bdat.py", line 415, in _process_page_meta
    self._process_page_metadata()
  File "C:\Users\XXXXXXX\Anaconda3\lib\site-packages\pandas\io\sas\sas7bdat.py", line 447, in _process_page_metadata
    subheader_index = self._get_subheader_index(
  File "C:\Users\XXXXXXX\Anaconda3\lib\site-packages\pandas\io\sas\sas7bdat.py", line 462, in _get_subheader_index
    raise ValueError("Unknown subheader signature")
ValueError: Unknown subheader signature
python-BaseException

Issue Description

It seems to me that the problem is related to the encoding of the SAS file. According to SAS, the encoding of the failing file, and also of files that do work, is latin9 European (ISO).

The error is produced when trying to find the index corresponding to a given signature in the instruction:

index = const.subheader_signature_to_index.get(signature)

of the function _get_subheader_index. The variable signature in my case has the value b'\xf4\x81\xf0?\xf4\x81\xf0?', which is not a key in the dictionary const.subheader_signature_to_index, so the exception is raised. The signature from another file that works (an analogous file, but from a previous year) is b'\xf7\xf7\xf7\xf7\x00\x00\x00\x00', which is in the dictionary.

The content of the dictionary is:

 b'\xf7\xf7\xf7\xf7' = {int} 0
 b'\x00\x00\x00\x00\xf7\xf7\xf7\xf7' = {int} 0
 b'\xf7\xf7\xf7\xf7\x00\x00\x00\x00' = {int} 0
 b'\xf7\xf7\xf7\xf7\xff\xff\xfb\xfe' = {int} 0
 b'\xf6\xf6\xf6\xf6' = {int} 1
 b'\x00\x00\x00\x00\xf6\xf6\xf6\xf6' = {int} 1
 b'\xf6\xf6\xf6\xf6\x00\x00\x00\x00' = {int} 1
 b'\xf6\xf6\xf6\xf6\xff\xff\xfb\xfe' = {int} 1
 b'\x00\xfc\xff\xff' = {int} 2
 b'\xff\xff\xfc\x00' = {int} 2
 b'\x00\xfc\xff\xff\xff\xff\xff\xff' = {int} 2
 b'\xff\xff\xff\xff\xff\xff\xfc\x00' = {int} 2
 b'\xfd\xff\xff\xff' = {int} 3
 b'\xff\xff\xff\xfd' = {int} 3
 b'\xfd\xff\xff\xff\xff\xff\xff\xff' = {int} 3
 b'\xff\xff\xff\xff\xff\xff\xff\xfd' = {int} 3
 b'\xff\xff\xff\xff' = {int} 4
 b'\xff\xff\xff\xff\xff\xff\xff\xff' = {int} 4
 b'\xfc\xff\xff\xff' = {int} 5
 b'\xff\xff\xff\xfc' = {int} 5
 b'\xfc\xff\xff\xff\xff\xff\xff\xff' = {int} 5
 b'\xff\xff\xff\xff\xff\xff\xff\xfc' = {int} 5
 b'\xfe\xfb\xff\xff' = {int} 6
 b'\xff\xff\xfb\xfe' = {int} 6
 b'\xfe\xfb\xff\xff\xff\xff\xff\xff' = {int} 6
 b'\xff\xff\xff\xff\xff\xff\xfb\xfe' = {int} 6
 b'\xfe\xff\xff\xff' = {int} 7
 b'\xff\xff\xff\xfe' = {int} 7
 b'\xfe\xff\xff\xff\xff\xff\xff\xff' = {int} 7
 b'\xff\xff\xff\xff\xff\xff\xff\xfe' = {int} 7
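For illustration, the failing lookup can be sketched as follows. This is a paraphrase of _get_subheader_index, not the pandas source itself: the dictionary is truncated to two of the entries above, and the standalone function name is mine.

```python
# Truncated copy of const.subheader_signature_to_index (two entries only).
subheader_signature_to_index = {
    b"\xf7\xf7\xf7\xf7\x00\x00\x00\x00": 0,  # signature seen in the working file
    b"\xff\xff\xff\xff\xff\xff\xff\xff": 4,
}

def get_subheader_index(signature: bytes) -> int:
    """Paraphrase of the lookup in _get_subheader_index."""
    index = subheader_signature_to_index.get(signature)
    if index is None:
        # The bytes read from the failing file land here.
        raise ValueError("Unknown subheader signature")
    return index

get_subheader_index(b"\xf7\xf7\xf7\xf7\x00\x00\x00\x00")  # known signature, returns 0
# get_subheader_index(b"\xf4\x81\xf0?\xf4\x81\xf0?")      # raises ValueError
```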

This signature is produced in the function _read_bytes by the instruction return self._cached_page[offset : offset + length]

The variable offset has the value 130034 and comes from the pointer object, which is:

result = {_SubheaderPointer} <pandas.io.sas.sas7bdat._SubheaderPointer object at 0x000000013FD9EBB0>
 compression = {int} 4
 length = {int} 1038
 offset = {int} 130034
 ptype = {int} 1

The pointer object of the file which loads properly is:

result = {_SubheaderPointer} <pandas.io.sas.sas7bdat._SubheaderPointer object at 0x0000001DAAAAB610>
 compression = {int} 0
 length = {int} 808
 offset = {int} 130264
 ptype = {int} 1

which has a different value for compression. (Trying to load this file with the pyreadstat package, which was my previous approach since I only want to load some columns, triggers the error 'Unsupported compression scheme', but I have not been able to debug into that package's code.)
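If I read pandas/io/sas/sas_constants.py correctly, the pointer's compression field distinguishes uncompressed, truncated, and compressed subheaders. The constant values below are my reading of the source and may differ between pandas versions; the helper function is mine, not a pandas API.

```python
# Assumed values from pandas/io/sas/sas_constants.py (may vary by version).
TRUNCATED_SUBHEADER_ID = 1
COMPRESSED_SUBHEADER_ID = 4

def describe_pointer_compression(compression: int) -> str:
    """Human-readable label for a _SubheaderPointer.compression value."""
    if compression == 0:
        return "uncompressed"
    if compression == TRUNCATED_SUBHEADER_ID:
        return "truncated"
    if compression == COMPRESSED_SUBHEADER_ID:
        return "compressed"
    return "unknown"

describe_pointer_compression(4)  # the failing file's pointer
describe_pointer_compression(0)  # the working file's pointer
```

Under that reading, the failing file's pointer (compression = 4) points at a compressed subheader, while the working file's pointer (compression = 0) does not.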

The variable length has the value 8 and comes from the object self, which has these values:

result = {SAS7BDATReader} <pandas.io.sas.sas7bdat.SAS7BDATReader object at 0x000000013FD952E0>
 U64 = {bool} True
 blank_missing = {bool} True
 byte_order = {str} '<'
 chunksize = {int} 10000
 column_formats = {list: 0} []
 column_names = {list: 0} []
 column_names_strings = {list: 0} []
 columns = {list: 0} []
 compression = {bytes: 0} b''
 convert_dates = {bool} True
 convert_header_text = {bool} True
 convert_text = {bool} True
 date_created = {datetime} 2022-02-23 08:52:49.773133
 date_modified = {datetime} 2022-05-13 14:17:11.306824
 default_encoding = {str} 'latin-1'
 encoding = {str} 'ISO 8859-1'
 file_encoding = {str} 'unknown (code=40)'
 file_type = {str} 'DATA'
 handles = {IOHandles} IOHandles(handle=<_io.BufferedReader name='C:\\Users\\XXXXXXX\\Desktop\\f_cal_echi_detalle_tratado_2022.sas7bdat'>, compression={'method': None}, created_handles=[], is_wrapped=False, is_mmap=False)
 header_length = {int} 131072
 index = {NoneType} None
 name = {str} ''
 os_name = {str} 'x86_64'
 os_version = {str} '4.18.0-240.10.1.'
 platform = {str} 'unix'
 sas_release = {str} '9.0401M7'
 server_type = {str} 'Linux'
  _abc_impl = {_abc_data} <_abc._abc_data object at 0x000000013FDB0DC0>
  _cached_page = {bytes: 131072} b'Y\xae\xa8\x95\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xc9\xe22\x02\x00\x00\x00\x00\x83\x02\x00\x00\x00\x00\x00\x00\x00\x00z\x00z\x00\x00\x00\xf2\xfb\x01\x00\x00\x00\x00\x00\x0e\x04\x00\x00\x00\x00\x00\x00\x04\x01\x00\x00\x00\x00\x00\x00r\xf7\x01\
  _column_data_lengths = {list: 0} []
  _column_data_offsets = {list: 0} []
  _column_types = {list: 0} []
  _current_page_block_count = {int} 122
  _current_page_data_subheader_pointers = {list: 0} []
  _current_page_subheaders_count = {int} 122
  _current_page_type = {int} 0
  _current_row_in_file_index = {int} 0
  _current_row_on_page_index = {int} 0
  _int_length = {int} 8
  _page_bit_offset = {int} 32
  _page_count = {int} 365786
  _page_length = {int} 131072
  _path_or_buf = {BufferedReader} <_io.BufferedReader name='C:\\Users\\XXXXXXX\\Desktop\\f_cal_echi_detalle_tratado_2022.sas7bdat'>
  _subheader_pointer_length = {int} 24
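Note that self.compression is the empty bytes b'' in the dump above. As I read _get_subheader_index in pandas 1.4.x, an unknown signature is only tolerated (and treated as a compressed data subheader) when the reader has already detected a file-level compression, so with self.compression == b'' the fallback never fires and the ValueError is raised. A sketch of that condition follows; the constant values and the data-subheader index are my assumptions, not the pandas source.

```python
COMPRESSED_SUBHEADER_ID = 4    # assumed value of const.compressed_subheader_id
COMPRESSED_SUBHEADER_TYPE = 1  # assumed value of const.compressed_subheader_type
DATA_SUBHEADER_INDEX = 9       # hypothetical placeholder for the data-subheader index

def resolve_unknown_signature(file_compression: bytes,
                              compression: int,
                              ptype: int) -> int:
    """Sketch of the fallback taken when a signature is not in the table."""
    looks_compressed = compression in (0, COMPRESSED_SUBHEADER_ID)
    is_compressed_type = ptype == COMPRESSED_SUBHEADER_TYPE
    if file_compression != b"" and looks_compressed and is_compressed_type:
        return DATA_SUBHEADER_INDEX
    raise ValueError("Unknown subheader signature")

# With the values observed above (self.compression == b"", pointer.compression == 4,
# pointer.ptype == 1) the condition fails:
# resolve_unknown_signature(b"", 4, 1)  # raises ValueError
```

So even though the pointer itself looks compressed (compression = 4, ptype = 1), the missing file-level compression marker is what turns the unknown signature into a hard error.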

Expected Behavior

The dataset should load without error.

Installed Versions

INSTALLED VERSIONS

commit : 4bfe3d0
python : 3.9.7.final.0
python-bits : 64
OS : Windows
OS-release : 8.1
Version : 6.3.9600
machine : AMD64
processor : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : es_ES.cp1252
pandas : 1.4.2
numpy : 1.20.3
pytz : 2021.3
dateutil : 2.8.2
pip : 21.2.4
setuptools : 58.0.4
Cython : 0.29.24
pytest : 6.2.4
hypothesis : None
sphinx : 4.2.0
blosc : None
feather : None
xlsxwriter : 3.0.1
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.29.0
pandas_datareader: None
bs4 : 4.10.0
bottleneck : 1.3.2
brotli :
fastparquet : None
fsspec : 2021.10.1
gcsfs : None
markupsafe : 1.1.1
matplotlib : 3.4.3
numba : 0.54.1
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 7.0.0
pyreadstat : 1.1.5
pyxlsb : None
s3fs : None
scipy : 1.7.1
snappy : None
sqlalchemy : 1.4.22
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
zstandard : None
