Description
Code Sample
import os
import pandas as pd
directory = '/Data'
# more than 300 files
for d, dirs, files in os.walk(directory):
# i can not use read_excel function because i do not know how many sheets in each xls
# i use instance of ExcelFile
# in fact it returns xlrd.Book instance (because engine: Acceptable values are None or ``xlrd``)
try
current_file = pd.ExcelFile(os.path.join(d,f))
#do not work. My files use cp1251 charset. See more in output
except UnicodeDecodeError:
pass
sheet_names_ = current_file.sheet_names()
for sn in current_file.sheet_names():
# this Data Frame i shoud parse and import to my Pstgresql DB
current_file.parse(sheet_name=sn)
Problem description
I have a lot of XLS files encoded with cp1251. Each file contains 1 or more sheets. I need to import them into my Postgesql DB. But Pandas can not do it simple. I can load one sheet (list of, but i don`t have it) at a time in any encoding using pandas.read_excel(file,encoding=my_encoding). Or I can load all sheets from all files in UTF encoding using instance of pandas.ExcelFile(file). It looks like some bad recursion.
I can use xlrd to read my files with specific charset to get xlrd.Book instance. Then pass it to pandas.read_excel(). But I do not like this decision.
If pandas.ExcelFile() would have an attribute of the encoding, then it could be passed to any "engine". I saw a new version of excel package. The same problem.
(I dont have MS Excel and can not create "bad" file for test)
Sorry for my english(((
Expected Output
Output pandas/xlrd Error
*** No CODEPAGE record, no encoding_override: will use 'ascii'
*** No CODEPAGE record, no encoding_override: will use 'ascii'
Traceback (most recent call last):
File "TableLoader.py", line 13, in
c = ExelLoader()
File "TableLoader.py", line 10, in ExelLoader
my_xls_x = pandas.read_excel('Data/qq.xls')
File "/home/dz/.local/lib/python3.6/site-packages/pandas/util/_decorators.py", line 188, in wrapper
return func(*args, **kwargs)
File "/home/dz/.local/lib/python3.6/site-packages/pandas/util/_decorators.py", line 188, in wrapper
return func(*args, **kwargs)
File "/home/dz/.local/lib/python3.6/site-packages/pandas/io/excel.py", line 350, in read_excel
io = ExcelFile(io, engine=engine)
File "/home/dz/.local/lib/python3.6/site-packages/pandas/io/excel.py", line 653, in init
self._reader = self._enginesengine
File "/home/dz/.local/lib/python3.6/site-packages/pandas/io/excel.py", line 424, in init
self.book = xlrd.open_workbook(filepath_or_buffer)
File "/home/dz/.local/lib/python3.6/site-packages/xlrd/init.py", line 157, in open_workbook
ragged_rows=ragged_rows,
File "/home/dz/.local/lib/python3.6/site-packages/xlrd/book.py", line 120, in open_workbook_xls
bk.get_sheets()
File "/home/dz/.local/lib/python3.6/site-packages/xlrd/book.py", line 723, in get_sheets
self.get_sheet(sheetno)
File "/home/dz/.local/lib/python3.6/site-packages/xlrd/book.py", line 714, in get_sheet
sh.read(self)
File "/home/dz/.local/lib/python3.6/site-packages/xlrd/sheet.py", line 820, in read
strg = unpack_string(data, 6, bk.encoding or bk.derive_encoding(), lenlen=2)
File "/home/dz/.local/lib/python3.6/site-packages/xlrd/biffh.py", line 250, in unpack_string
return unicode(data[pos:pos+nchars], encoding)
File "/home/dz/.local/lib/python3.6/site-packages/xlrd/timemachine.py", line 31, in
unicode = lambda b, enc: b.decode(enc)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcd in position 0: ordinal not in range(128)
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.7.final.0
python-bits: 32
OS: Linux
OS-release: 4.15.0-45-generic
machine: i686
processor: i686
byteorder: little
LC_ALL: None
LANG: ru_RU.UTF-8
LOCALE: ru_RU.UTF-8
pandas: 0.24.1
pytest: None
pip: 9.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.16.2
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None