Description
Code Sample, a copy-pastable example if possible
import pandas
df = pandas.read_csv("s3://noaa-global-hourly-pds/2019/72047299999.csv",
        parse_dates=["DATE"],
        dtype=dict.fromkeys(["STATION", "NAME", "REPORT_TYPE",
            "QUALITY_CONTROL", "WND", "CIG", "VIS", "TMP", "DEW", "SLP", "AA1",
            "AA2", "AY1", "AY2", "GF1", "MW1", "REM"], pandas.StringDtype()))
df.to_hdf("/tmp/fubar.h5", "root")
pandas.read_hdf("/tmp/fubar.h5", "root")
Problem description
The code fails to write the HDF5 file. The to_hdf call raises AttributeError: 'StringArray' object has no attribute 'size':
Traceback (most recent call last):
  File "mwe15.py", line 7, in <module>
    df.to_hdf("/tmp/fubar.h5", "root")
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/core/generic.py", line 2504, in to_hdf
    encoding=encoding,
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/io/pytables.py", line 282, in to_hdf
    f(store)
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/io/pytables.py", line 274, in <lambda>
    encoding=encoding,
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/io/pytables.py", line 1042, in put
    errors=errors,
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/io/pytables.py", line 1709, in _write_to_group
    data_columns=data_columns,
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/io/pytables.py", line 3100, in write
    self.write_array(f"block{i}_values", blk.values, items=blk_items)
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/io/pytables.py", line 2906, in write_array
    empty_array = value.size == 0
AttributeError: 'StringArray' object has no attribute 'size'
It does write a file, /tmp/fubar.h5, but pandas cannot read it. Commenting out the write call and leaving only the read call results in:
Traceback (most recent call last):
  File "mwe15.py", line 8, in <module>
    pandas.read_hdf("/tmp/fubar.h5", "root")
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/io/pytables.py", line 428, in read_hdf
    auto_close=auto_close,
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/io/pytables.py", line 814, in select
    return it.get_result()
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/io/pytables.py", line 1829, in get_result
    results = self.func(self.start, self.stop, where)
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/io/pytables.py", line 798, in func
    return s.read(start=_start, stop=_stop, where=_where, columns=columns)
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/io/pytables.py", line 3068, in read
    blk_items = self.read_index(f"block{i}_items")
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/io/pytables.py", line 2752, in read_index
    variety = _ensure_decoded(getattr(self.attrs, f"{key}_variety"))
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/tables/attributeset.py", line 301, in __getattr__
    "'%s'" % (name, self._v__nodepath))
AttributeError: Attribute 'block0_items_variety' does not exist in node: '/root'
Closing remaining open files:/tmp/fubar.h5...done
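Because the failed write leaves an unreadable file behind, one way to avoid the leftover is to write to a temporary path and rename it only on success. The following is a minimal sketch using only the standard library. The helper name write_atomically and its write_func parameter are hypothetical and not part of pandas.

```python
import os
import tempfile


def write_atomically(path, write_func):
    """Call write_func(tmp_path), then move tmp_path to path only on success.

    Hypothetical helper: if write_func raises, the temporary file is removed
    and nothing is created or changed at `path`.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target so the rename stays on
    # the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)
    try:
        write_func(tmp_path)
        os.replace(tmp_path, path)
    finally:
        # After a successful replace the temporary path no longer exists;
        # after a failure it is cleaned up here.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

With the DataFrame from the code sample, this might be called as write_atomically("/tmp/fubar.h5", lambda p: df.to_hdf(p, key="root", mode="w")). Passing mode="w" is an assumption made here because the temporary file already exists as an empty file when the writer is called. The exception from to_hdf still propagates; only the partial file is avoided.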
Expected Output
I expect a well-formed HDF5 file to be written, and the pandas.read_hdf call to return a DataFrame identical to the one written to disk just before.
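Until that works, a possible workaround is to cast the StringDtype columns to object dtype before writing. This is a minimal sketch, assuming the failure is specific to the StringArray-backed columns. The helper name string_columns_to_object is hypothetical, and the round trip through HDF5 has not been verified against this pandas version.

```python
import pandas as pd


def string_columns_to_object(df):
    """Return a copy of df with every StringDtype column cast to object.

    Hypothetical helper; assumes column names are unique.
    """
    string_cols = [
        col for col in df.columns
        if isinstance(df[col].dtype, pd.StringDtype)
    ]
    return df.astype({col: object for col in string_cols})


# Small stand-in for the frame in the code sample (no S3 access needed).
demo = pd.DataFrame({
    "NAME": pd.array(["a", "b"], dtype="string"),
    "TMP": [1.0, 2.0],
})
converted = string_columns_to_object(demo)
```

The converted frame could then be written with string_columns_to_object(df).to_hdf("/tmp/fubar.h5", key="root"). The frame read back would have object columns rather than StringDtype, so it would not be identical to the original, and missing values in those columns may need separate handling.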
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 4.12.14-lp150.12.82-default
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8
pandas : 1.0.0rc0
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.0
pip : 19.3.1
setuptools : 41.6.0.post20191101
Cython : 0.29.14
pytest : 5.3.0
hypothesis : None
sphinx : 2.2.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.9.0
pandas_datareader: None
bs4 : 4.8.1
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.3.0
s3fs : 0.4.0
scipy : 1.3.2
sqlalchemy : 1.3.11
tables : 3.6.1
tabulate : None
xarray : 0.14.1
xlrd : None
xlwt : None
xlsxwriter : None
numba : 0.46.0