We use HDF5 to store our pandas DataFrames on disk. We store only one DataFrame per HDF5 file, so the feature of pandas.read_hdf() that allows omitting the key when an HDF file contains a single pandas object is very convenient for our workflow.
However, this feature doesn't work when the saved DataFrame contains one or more categorical columns:
```python
import pandas as pd

df = pd.DataFrame({'col1': [11, 21, 31], 'col2': ['a', 'b', 'a']})

# This works fine.
df.to_hdf('no_cat.hdf5', 'data', format='table')
df2 = pd.read_hdf('no_cat.hdf5')
print((df == df2).all().all())

# But this raises an exception:
df.assign(col2=pd.Categorical(df.col2)).to_hdf('cat.hdf5', 'data', format='table')
df3 = pd.read_hdf('cat.hdf5')
# ValueError: key must be provided when HDF file contains multiple datasets.
```
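Until this is fixed, passing the key explicitly sidesteps the inference entirely. A minimal sketch of that workaround (it needs the PyTables package installed, and it writes to a temporary directory instead of the working directory):

```python
import os
import tempfile

import pandas as pd

# Same data as the repro above, with the categorical column from the start.
df = pd.DataFrame({'col1': [11, 21, 31],
                   'col2': pd.Categorical(['a', 'b', 'a'])})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cat.hdf5')
    df.to_hdf(path, key='data', format='table')
    # Giving the key explicitly avoids the "key must be provided" ValueError.
    df3 = pd.read_hdf(path, 'data')
    assert df3.equals(df)
```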
It looks like this happens because pandas.read_hdf() doesn't ignore the metadata node that pandas itself creates to store the categorical codes:
```python
print(pd.HDFStore('cat.hdf5'))
```
```
<class 'pandas.io.pytables.HDFStore'>
File path: cat.hdf5
/data frame_table (typ->appendable,nrows->3,ncols->2,indexers->[index])
/data/meta/values_block_1/meta series_table (typ->appendable,nrows->2,ncols->1,indexers->[index],dc->[values])
```
It'd be nice if this feature worked even when some of the columns are categorical. It should be possible to ignore the metadata that pandas creates when checking whether only one dataset is stored, no?
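For illustration, the check could filter out pandas-internal metadata paths before counting datasets. A minimal sketch of that idea (the helper name and the `'/meta/'` heuristic are assumptions based on the store listing above, not pandas API):

```python
def user_visible_keys(keys):
    # Hypothetical helper: drop pandas-internal metadata groups, which
    # are stored under '<key>/meta/...' for categorical columns.
    return [k for k in keys if '/meta/' not in k]

# Keys as reported by the HDFStore listing above:
keys = ['/data', '/data/meta/values_block_1/meta']
print(user_visible_keys(keys))  # ['/data']
```

With the metadata node filtered out, only one user-visible dataset remains, so the existing single-dataset shortcut in read_hdf() could apply.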