Automatic detection of HDF5 dataset identifier fails when data contains categoricals #13231

Closed
@chrish42

Description

We use HDF5 to store our pandas dataframes on disk. We only store one dataframe per HDF5 file, so the feature of pandas.read_hdf() that allows omitting the key when an HDF file contains a single pandas object is very nice for our workflow.

However, said feature doesn't work when the dataframe saved contains one or more categorical columns:

import pandas as pd

df = pd.DataFrame({'col1': [11, 21, 31], 'col2': ['a', 'b', 'a']})

# This works fine.
df.to_hdf('no_cat.hdf5', key='data', format='table')
df2 = pd.read_hdf('no_cat.hdf5')
print((df == df2).all().all())

# But this produces an exception.
df.assign(col2=pd.Categorical(df.col2)).to_hdf('cat.hdf5', key='data', format='table')
df3 = pd.read_hdf('cat.hdf5')

# ValueError: key must be provided when HDF file contains multiple datasets.

It looks like this is because pandas.read_hdf() doesn't ignore the metadata node used to store the categories of the categorical column:

print(pd.HDFStore('cat.hdf5'))

<class 'pandas.io.pytables.HDFStore'>
File path: cat.hdf5
/data                                     frame_table  (typ->appendable,nrows->3,ncols->2,indexers->[index])             
/data/meta/values_block_1/meta            series_table (typ->appendable,nrows->2,ncols->1,indexers->[index],dc->[values])

It'd be nice if this feature worked even when some of the columns are categoricals. It should be possible to ignore the metadata that pandas creates when checking whether only one dataset is stored, no?
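
To make the suggestion concrete, here is a minimal sketch of the kind of filtering I have in mind. It assumes that metadata nodes always live under `<dataset key>/meta/`, as in the store listing above. The helper name `dataset_keys` is hypothetical, and this is not necessarily how pandas would or should implement it:

```python
def dataset_keys(keys):
    """Return the keys that are not metadata nodes nested under another key.

    Assumption: a metadata node for dataset '/data' has a key that starts
    with '/data/meta/'. This mirrors the listing in this issue only.
    """
    keys = list(keys)
    result = []
    for key in keys:
        # A key counts as metadata if some *other* key is its parent
        # and the remainder of the path begins with 'meta/'.
        is_meta = any(
            key.startswith(parent.rstrip('/') + '/meta/')
            for parent in keys
            if parent != key
        )
        if not is_meta:
            result.append(key)
    return result


# Keys taken from the store listing in this issue.
keys = ['/data', '/data/meta/values_block_1/meta']
print(dataset_keys(keys))  # ['/data']
```

With a filter like this, the "single dataset" check would see only `/data` for the categorical file. In the meantime, passing the key explicitly, e.g. `pd.read_hdf('cat.hdf5', key='data')`, avoids the automatic detection altogether.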