Writing HDF5 with StringDtype columns fails with AttributeError: 'StringArray' object has no attribute 'size' #31199

Open
@gerritholl

Code Sample, a copy-pastable example if possible

import pandas
df = pandas.read_csv("s3://noaa-global-hourly-pds/2019/72047299999.csv",
        parse_dates=["DATE"],
        dtype=dict.fromkeys(["STATION", "NAME", "REPORT_TYPE",
            "QUALITY_CONTROL", "WND", "CIG", "VIS", "TMP", "DEW", "SLP", "AA1",
            "AA2", "AY1", "AY2", "GF1", "MW1", "REM"], pandas.StringDtype()))
df.to_hdf("/tmp/fubar.h5", "root")
pandas.read_hdf("/tmp/fubar.h5", "root")

Problem description

Writing the DataFrame to HDF5 fails with AttributeError: 'StringArray' object has no attribute 'size':

Traceback (most recent call last):
  File "mwe15.py", line 7, in <module>
    df.to_hdf("/tmp/fubar.h5", "root")
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/core/generic.py", line 2504, in to_hdf
    encoding=encoding,
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/io/pytables.py", line 282, in to_hdf
    f(store)
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/io/pytables.py", line 274, in <lambda>
    encoding=encoding,
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/io/pytables.py", line 1042, in put
    errors=errors,
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/io/pytables.py", line 1709, in _write_to_group
    data_columns=data_columns,
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/io/pytables.py", line 3100, in write
    self.write_array(f"block{i}_values", blk.values, items=blk_items)
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/io/pytables.py", line 2906, in write_array
    empty_array = value.size == 0
AttributeError: 'StringArray' object has no attribute 'size'
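
The last frame of the traceback shows write_array evaluating value.size == 0, which assumes a NumPy-style size attribute. The sketch below reproduces that failure pattern in isolation. NoSizeArray, is_empty_strict and is_empty_tolerant are hypothetical names introduced for illustration, not pandas code, and the len() fallback is only one possible way to tolerate such objects, not a confirmed fix.

```python
class NoSizeArray:
    """Hypothetical stand-in for an array-like that defines __len__ but no .size."""

    def __init__(self, values):
        self._values = list(values)

    def __len__(self):
        return len(self._values)


def is_empty_strict(value):
    # Same shape as the check in the traceback: requires a .size attribute.
    return value.size == 0


def is_empty_tolerant(value):
    # Assumed alternative: fall back to len() when .size is missing.
    size = getattr(value, "size", None)
    if size is None:
        size = len(value)
    return size == 0


arr = NoSizeArray(["a", "b"])
try:
    is_empty_strict(arr)
except AttributeError as exc:
    print("strict check failed:", exc)
print("tolerant check says empty:", is_empty_tolerant(arr))
```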

A file /tmp/fubar.h5 is still written, but pandas cannot read it back. Commenting out the to_hdf call and running only the read_hdf call results in:

Traceback (most recent call last):
  File "mwe15.py", line 8, in <module>
    pandas.read_hdf("/tmp/fubar.h5", "root")
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/io/pytables.py", line 428, in read_hdf
    auto_close=auto_close,
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/io/pytables.py", line 814, in select
    return it.get_result()
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/io/pytables.py", line 1829, in get_result
    results = self.func(self.start, self.stop, where)
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/io/pytables.py", line 798, in func
    return s.read(start=_start, stop=_stop, where=_where, columns=columns)
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/io/pytables.py", line 3068, in read
    blk_items = self.read_index(f"block{i}_items")
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/pandas/io/pytables.py", line 2752, in read_index
    variety = _ensure_decoded(getattr(self.attrs, f"{key}_variety"))
  File "/media/nas/x21324/miniconda3/envs/py37e/lib/python3.7/site-packages/tables/attributeset.py", line 301, in __getattr__
    "'%s'" % (name, self._v__nodepath))
AttributeError: Attribute 'block0_items_variety' does not exist in node: '/root'
Closing remaining open files:/tmp/fubar.h5...done
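
Because the failed write still leaves an unreadable file behind, one way to avoid picking it up later is to write to a temporary location and move the result into place only on success. This is a minimal sketch under that assumption. write_atomically is a hypothetical helper, not part of pandas, and it does nothing about the underlying error.

```python
import os
import shutil
import tempfile


def write_atomically(path, writer):
    """Hypothetical helper: run writer(tmp_path), then move the file to
    path only if writer returned without raising."""
    directory = os.path.dirname(os.path.abspath(path))
    # Use a fresh temporary directory so tmp_path does not exist yet.
    tmp_dir = tempfile.mkdtemp(dir=directory)
    tmp_path = os.path.join(tmp_dir, os.path.basename(path))
    try:
        writer(tmp_path)
        os.replace(tmp_path, path)
    finally:
        # Removes any partial output when writer raised.
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

With the example above, this could be called as write_atomically("/tmp/fubar.h5", lambda p: df.to_hdf(p, key="root")). The AttributeError would still propagate, but no partial /tmp/fubar.h5 would remain.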

Expected Output

I expect a well-formed HDF5 file to be written, and the pandas.read_hdf command to return a DataFrame identical to the one written to disk just before.
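
Until that works, a possible stopgap is to cast the StringDtype columns to object dtype before writing. This is an untested assumption, not something confirmed in this report. string_columns_to_object is a hypothetical helper, and a file written this way would not give back StringDtype columns on read, so it does not meet the expectation above.

```python
import pandas as pd


def string_columns_to_object(df):
    """Hypothetical helper: return a copy of df with every StringDtype
    column cast to plain object dtype."""
    string_cols = [
        col for col, dtype in df.dtypes.items()
        if isinstance(dtype, pd.StringDtype)
    ]
    return df.astype({col: object for col in string_cols})


# Small stand-in frame; the column names are borrowed from the example above.
df = pd.DataFrame({
    "NAME": pd.array(["a", "b"], dtype="string"),
    "N": [1, 2],
})
converted = string_columns_to_object(df)
print(converted.dtypes)
```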

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 4.12.14-lp150.12.82-default
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.0.0rc0
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.0
pip : 19.3.1
setuptools : 41.6.0.post20191101
Cython : 0.29.14
pytest : 5.3.0
hypothesis : None
sphinx : 2.2.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.9.0
pandas_datareader: None
bs4 : 4.8.1
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.3.0
s3fs : 0.4.0
scipy : 1.3.2
sqlalchemy : 1.3.11
tables : 3.6.1
tabulate : None
xarray : 0.14.1
xlrd : None
xlwt : None
xlsxwriter : None
numba : 0.46.0

Metadata

Labels: Bug; ExtensionArray (Extending pandas with custom dtypes or arrays); IO HDF5 (read_hdf, HDFStore); Strings (String extension data type and string data)
