Description
Code Sample, a copy-pastable example if possible
Here are the sas7bdat files for testing (the Chinese names end with special characters):
issue.sas7bdat has Chinese values and is imported correctly by pandas.
issue1.sas7bdat has Chinese variable names and fails with the error shown under "Problem description".
import pandas as pd
df1 = pd.read_sas('issue1.sas7bdat', encoding='GBK')
Problem description
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-2-bcd7a0b4819c> in <module>()
----> 1 df1=pd.read_sas('issue1.sas7bdat',encoding='GBK')
~/.pyenv/versions/3.6.4/envs/ts/lib/python3.6/site-packages/pandas/io/sas/sasreader.py in read_sas(filepath_or_buffer, format, index, encoding, chunksize, iterator)
59 reader = SAS7BDATReader(filepath_or_buffer, index=index,
60 encoding=encoding,
---> 61 chunksize=chunksize)
62 else:
63 raise ValueError('unknown SAS format')
~/.pyenv/versions/3.6.4/envs/ts/lib/python3.6/site-packages/pandas/io/sas/sas7bdat.py in __init__(self, path_or_buf, index, convert_dates, blank_missing, chunksize, encoding, convert_text, convert_header_text)
96
97 self._get_properties()
---> 98 self._parse_metadata()
99
100 def close(self):
~/.pyenv/versions/3.6.4/envs/ts/lib/python3.6/site-packages/pandas/io/sas/sas7bdat.py in _parse_metadata(self)
276 raise ValueError(
277 "Failed to read a meta data page from the SAS file.")
--> 278 done = self._process_page_meta()
279
280 def _process_page_meta(self):
~/.pyenv/versions/3.6.4/envs/ts/lib/python3.6/site-packages/pandas/io/sas/sas7bdat.py in _process_page_meta(self)
282 pt = [const.page_meta_type, const.page_amd_type] + const.page_mix_types
283 if self._current_page_type in pt:
--> 284 self._process_page_metadata()
285 return ((self._current_page_type in [256] + const.page_mix_types) or
286 (self._current_page_data_subheader_pointers is not None))
~/.pyenv/versions/3.6.4/envs/ts/lib/python3.6/site-packages/pandas/io/sas/sas7bdat.py in _process_page_metadata(self)
312 self._get_subheader_index(subheader_signature,
313 pointer.compression, pointer.ptype))
--> 314 self._process_subheader(subheader_index, pointer)
315
316 def _get_subheader_index(self, signature, compression, ptype):
~/.pyenv/versions/3.6.4/envs/ts/lib/python3.6/site-packages/pandas/io/sas/sas7bdat.py in _process_subheader(self, subheader_index, pointer)
382 raise ValueError("unknown subheader index")
383
--> 384 processor(offset, length)
385
386 def _process_rowsize_subheader(self, offset, length):
~/.pyenv/versions/3.6.4/envs/ts/lib/python3.6/site-packages/pandas/io/sas/sas7bdat.py in _process_columntext_subheader(self, offset, length)
432
433 if self.convert_header_text:
--> 434 cname = cname.decode(self.encoding or self.default_encoding)#cname.decode(self.encoding or self.default_encoding,'ignore')
435 self.column_names_strings.append(cname)
436
UnicodeDecodeError: 'gbk' codec can't decode byte 0x8c in position 0: illegal multibyte sequence
Expected Output
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
pandas: 0.22.0
pytest: None
pip: 9.0.2
setuptools: 28.8.0
Cython: None
numpy: 1.14.0
scipy: None
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
The code that triggers the issue (in _process_columntext_subheader):
if self.convert_header_text:
    cname = cname.decode(self.encoding or self.default_encoding)
One possible change is to pass 'ignore' as the error handler:
if self.convert_header_text:
    cname = cname.decode(self.encoding or self.default_encoding, 'ignore')
Another is to comment out the conversion:
# if self.convert_header_text:
#     cname = cname.decode(self.encoding or self.default_encoding)
and decode the column names afterwards with the following code:
col = df1.columns.tolist()
col = [x.decode('GBK', 'ignore') for x in col]
df1.columns = pd.Index(col)
By the way, the sas7bdat package works well.
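To illustrate the difference between the two error handlers without needing the test files, here is a minimal sketch using only the standard library. The helper `decode_column_name` and the byte string `raw_name` are hypothetical and made up for illustration. They are not taken from pandas or from issue1.sas7bdat. The invalid trailing bytes are just one way to provoke a similar "illegal multibyte sequence" error.

```python
def decode_column_name(raw: bytes, encoding: str = "gbk", errors: str = "strict") -> str:
    """Decode raw column-name bytes; `errors` is passed straight to bytes.decode."""
    return raw.decode(encoding, errors)


# Hypothetical header bytes: two valid GBK characters followed by a byte pair
# that is not a valid GBK sequence (chosen for illustration only).
raw_name = "变量".encode("gbk") + b"\x8c\x00"

# Strict decoding (the default) raises, as in the traceback above.
try:
    decode_column_name(raw_name)
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print("strict decoding failed:", exc)

# With errors="ignore" the undecodable bytes are dropped instead of raising.
lenient = decode_column_name(raw_name, errors="ignore")
print(repr(lenient))
```

Note that 'ignore' silently discards the bytes it cannot decode, so the resulting column name may differ from the one stored in the file. Using 'replace' instead would mark the affected positions with U+FFFD.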