Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
df = pd.read_sas(path_to_file)
where path_to_file is the path to a file whose loading had been working until recently.
The error is:
Traceback (most recent call last):
File "C:\Users\XXXXXXX\Anaconda3\lib\site-packages\pandas\io\sas\sas7bdat.py", line 204, in __init__
self._parse_metadata()
File "C:\Users\XXXXXXX\Anaconda3\lib\site-packages\pandas\io\sas\sas7bdat.py", line 409, in _parse_metadata
done = self._process_page_meta()
File "C:\Users\XXXXXXX\Anaconda3\lib\site-packages\pandas\io\sas\sas7bdat.py", line 415, in _process_page_meta
self._process_page_metadata()
File "C:\Users\XXXXXXX\Anaconda3\lib\site-packages\pandas\io\sas\sas7bdat.py", line 447, in _process_page_metadata
subheader_index = self._get_subheader_index(
File "C:\Users\XXXXXXX\Anaconda3\lib\site-packages\pandas\io\sas\sas7bdat.py", line 462, in _get_subheader_index
raise ValueError("Unknown subheader signature")
ValueError: Unknown subheader signature
python-BaseException
Issue Description
It seems to me that the problem is related to the encoding of the SAS file. According to SAS, the encoding of the file that fails, and also of the files that load correctly, is latin9 European (ISO).
The error is raised while looking up the index corresponding to a signature, in the statement index = const.subheader_signature_to_index.get(signature) of the function _get_subheader_index. In my case the variable signature has the value b'\xf4\x81\xf0?\xf4\x81\xf0?', which is not a key in the dictionary const.subheader_signature_to_index, so the exception is raised. The signature of another file that loads correctly (an analogous file from a previous year) is b'\xf7\xf7\xf7\xf7\x00\x00\x00\x00', which is in the dictionary.
The content of the dictionary is:
b'\xf7\xf7\xf7\xf7' = {int} 0
b'\x00\x00\x00\x00\xf7\xf7\xf7\xf7' = {int} 0
b'\xf7\xf7\xf7\xf7\x00\x00\x00\x00' = {int} 0
b'\xf7\xf7\xf7\xf7\xff\xff\xfb\xfe' = {int} 0
b'\xf6\xf6\xf6\xf6' = {int} 1
b'\x00\x00\x00\x00\xf6\xf6\xf6\xf6' = {int} 1
b'\xf6\xf6\xf6\xf6\x00\x00\x00\x00' = {int} 1
b'\xf6\xf6\xf6\xf6\xff\xff\xfb\xfe' = {int} 1
b'\x00\xfc\xff\xff' = {int} 2
b'\xff\xff\xfc\x00' = {int} 2
b'\x00\xfc\xff\xff\xff\xff\xff\xff' = {int} 2
b'\xff\xff\xff\xff\xff\xff\xfc\x00' = {int} 2
b'\xfd\xff\xff\xff' = {int} 3
b'\xff\xff\xff\xfd' = {int} 3
b'\xfd\xff\xff\xff\xff\xff\xff\xff' = {int} 3
b'\xff\xff\xff\xff\xff\xff\xff\xfd' = {int} 3
b'\xff\xff\xff\xff' = {int} 4
b'\xff\xff\xff\xff\xff\xff\xff\xff' = {int} 4
b'\xfc\xff\xff\xff' = {int} 5
b'\xff\xff\xff\xfc' = {int} 5
b'\xfc\xff\xff\xff\xff\xff\xff\xff' = {int} 5
b'\xff\xff\xff\xff\xff\xff\xff\xfc' = {int} 5
b'\xfe\xfb\xff\xff' = {int} 6
b'\xff\xff\xfb\xfe' = {int} 6
b'\xfe\xfb\xff\xff\xff\xff\xff\xff' = {int} 6
b'\xff\xff\xff\xff\xff\xff\xfb\xfe' = {int} 6
b'\xfe\xff\xff\xff' = {int} 7
b'\xff\xff\xff\xfe' = {int} 7
b'\xfe\xff\xff\xff\xff\xff\xff\xff' = {int} 7
b'\xff\xff\xff\xff\xff\xff\xff\xfe' = {int} 7
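The failing lookup can be reproduced outside pandas with a small sketch. It copies only the 8-byte keys from the dictionary above and compares the two signatures quoted in this report. The names SIGNATURE_TO_INDEX and lookup_subheader_index are hypothetical stand-ins, not pandas identifiers, and the function only mimics the behaviour described for _get_subheader_index.

```python
# Subset of const.subheader_signature_to_index (8-byte keys only),
# copied from the dictionary listing in this report.
SIGNATURE_TO_INDEX = {
    b"\xf7\xf7\xf7\xf7\x00\x00\x00\x00": 0,
    b"\xf6\xf6\xf6\xf6\x00\x00\x00\x00": 1,
    b"\x00\xfc\xff\xff\xff\xff\xff\xff": 2,
    b"\xfd\xff\xff\xff\xff\xff\xff\xff": 3,
    b"\xff\xff\xff\xff\xff\xff\xff\xff": 4,
    b"\xfc\xff\xff\xff\xff\xff\xff\xff": 5,
    b"\xfe\xfb\xff\xff\xff\xff\xff\xff": 6,
    b"\xfe\xff\xff\xff\xff\xff\xff\xff": 7,
}


def lookup_subheader_index(signature: bytes) -> int:
    """Hypothetical stand-in for the lookup described in this report."""
    index = SIGNATURE_TO_INDEX.get(signature)
    if index is None:
        # Same message as the exception in the traceback.
        raise ValueError("Unknown subheader signature")
    return index


# Signature read from the file that loads correctly.
working = b"\xf7\xf7\xf7\xf7\x00\x00\x00\x00"
# Signature read from the file that fails.
failing = b"\xf4\x81\xf0?\xf4\x81\xf0?"

if __name__ == "__main__":
    print(lookup_subheader_index(working))
    try:
        lookup_subheader_index(failing)
    except ValueError as exc:
        print(f"failing signature: {exc}")
```

Both signatures are 8 bytes long, so the difference is in their content and not in their length.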
This signature is produced in the function _read_bytes, at the statement return self._cached_page[offset : offset + length]. The variable offset has the value 130034 and comes from the object pointer, which is:
result = {_SubheaderPointer} <pandas.io.sas.sas7bdat._SubheaderPointer object at 0x000000013FD9EBB0>
compression = {int} 4
length = {int} 1038
offset = {int} 130034
ptype = {int} 1
The pointer object of the file which loads properly is:
result = {_SubheaderPointer} <pandas.io.sas.sas7bdat._SubheaderPointer object at 0x0000001DAAAAB610>
compression = {int} 0
length = {int} 808
offset = {int} 130264
ptype = {int} 1
The two pointers differ in their compression value (0 for the file that loads, 4 for the file that fails). Trying to load the failing file with the package pyreadstat, which was my previous approach since I only want to read some columns, triggers the error 'Unsupported compression scheme', but I have not been able to debug into the code of that package.
The variable length has the value 8 and is taken from the object self, which has the values:
result = {SAS7BDATReader} <pandas.io.sas.sas7bdat.SAS7BDATReader object at 0x000000013FD952E0>
U64 = {bool} True
blank_missing = {bool} True
byte_order = {str} '<'
chunksize = {int} 10000
column_formats = {list: 0} []
column_names = {list: 0} []
column_names_strings = {list: 0} []
columns = {list: 0} []
compression = {bytes: 0} b''
convert_dates = {bool} True
convert_header_text = {bool} True
convert_text = {bool} True
date_created = {datetime} 2022-02-23 08:52:49.773133
date_modified = {datetime} 2022-05-13 14:17:11.306824
default_encoding = {str} 'latin-1'
encoding = {str} 'ISO 8859-1'
file_encoding = {str} 'unknown (code=40)'
file_type = {str} 'DATA'
handles = {IOHandles} IOHandles(handle=<_io.BufferedReader name='C:\\Users\\XXXXXXX\\Desktop\\f_cal_echi_detalle_tratado_2022.sas7bdat'>, compression={'method': None}, created_handles=[], is_wrapped=False, is_mmap=False)
header_length = {int} 131072
index = {NoneType} None
name = {str} ''
os_name = {str} 'x86_64'
os_version = {str} '4.18.0-240.10.1.'
platform = {str} 'unix'
sas_release = {str} '9.0401M7'
server_type = {str} 'Linux'
_abc_impl = {_abc_data} <_abc._abc_data object at 0x000000013FDB0DC0>
_cached_page = {bytes: 131072} b'Y\xae\xa8\x95\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xc9\xe22\x02\x00\x00\x00\x00\x83\x02\x00\x00\x00\x00\x00\x00\x00\x00z\x00z\x00\x00\x00\xf2\xfb\x01\x00\x00\x00\x00\x00\x0e\x04\x00\x00\x00\x00\x00\x00\x04\x01\x00\x00\x00\x00\x00\x00r\xf7\x01\
_column_data_lengths = {list: 0} []
_column_data_offsets = {list: 0} []
_column_types = {list: 0} []
_current_page_block_count = {int} 122
_current_page_data_subheader_pointers = {list: 0} []
_current_page_subheaders_count = {int} 122
_current_page_type = {int} 0
_current_row_in_file_index = {int} 0
_current_row_on_page_index = {int} 0
_int_length = {int} 8
_page_bit_offset = {int} 32
_page_count = {int} 365786
_page_length = {int} 131072
_path_or_buf = {BufferedReader} <_io.BufferedReader name='C:\\Users\\XXXXXXX\\Desktop\\f_cal_echi_detalle_tratado_2022.sas7bdat'>
_subheader_pointer_length = {int} 24
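To make the values above easier to relate to each other, here is a minimal sketch of how a 24-byte subheader pointer and the 8-byte signature slice might fit together. The field layout (8-byte offset, 8-byte length, 1-byte compression, 1-byte type, 6 padding bytes, little-endian) is an assumption chosen to match _subheader_pointer_length = 24, _int_length = 8 and byte_order = '<'. The page is synthetic, and parse_pointer and read_signature are hypothetical helpers, not pandas functions.

```python
import struct

PAGE_LENGTH = 131072  # _page_length from the dump above
INT_LENGTH = 8        # _int_length from the dump above

# Assumed pointer layout: offset (u64), length (u64), compression (u8),
# type (u8), then 6 padding bytes. "<" means little-endian, no alignment.
POINTER_FORMAT = "<QQBB6x"


def parse_pointer(raw: bytes) -> dict:
    """Decode one pointer under the assumed layout (hypothetical helper)."""
    offset, length, compression, ptype = struct.unpack(POINTER_FORMAT, raw)
    return {
        "offset": offset,
        "length": length,
        "compression": compression,
        "ptype": ptype,
    }


def read_signature(page: bytes, pointer: dict) -> bytes:
    """Slice the first INT_LENGTH bytes of the subheader, like _read_bytes."""
    start = pointer["offset"]
    return page[start : start + INT_LENGTH]


# Build a synthetic page that holds the failing signature at the reported offset.
signature = b"\xf4\x81\xf0?\xf4\x81\xf0?"
page_buffer = bytearray(PAGE_LENGTH)
page_buffer[130034 : 130034 + INT_LENGTH] = signature
page = bytes(page_buffer)

# Pack and re-parse a pointer with the values reported for the failing file.
raw_pointer = struct.pack(POINTER_FORMAT, 130034, 1038, 4, 1)
pointer = parse_pointer(raw_pointer)

if __name__ == "__main__":
    print(pointer)
    print(read_signature(page, pointer))
    # The subheader ends exactly at the end of the page.
    print(pointer["offset"] + pointer["length"] == PAGE_LENGTH)
```

With the reported values, offset + length equals the page length for both files (130034 + 1038 and 130264 + 808 are both 131072), so in both cases the subheader sits at the very end of the page and the pointers differ only in size and in the compression field.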
Expected Behavior
The dataset should be loaded
Installed Versions
INSTALLED VERSIONS
commit : 4bfe3d0
python : 3.9.7.final.0
python-bits : 64
OS : Windows
OS-release : 8.1
Version : 6.3.9600
machine : AMD64
processor : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : es_ES.cp1252
pandas : 1.4.2
numpy : 1.20.3
pytz : 2021.3
dateutil : 2.8.2
pip : 21.2.4
setuptools : 58.0.4
Cython : 0.29.24
pytest : 6.2.4
hypothesis : None
sphinx : 4.2.0
blosc : None
feather : None
xlsxwriter : 3.0.1
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.29.0
pandas_datareader: None
bs4 : 4.10.0
bottleneck : 1.3.2
brotli :
fastparquet : None
fsspec : 2021.10.1
gcsfs : None
markupsafe : 1.1.1
matplotlib : 3.4.3
numba : 0.54.1
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 7.0.0
pyreadstat : 1.1.5
pyxlsb : None
s3fs : None
scipy : 1.7.1
snappy : None
sqlalchemy : 1.4.22
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
zstandard : None