We use HDF5 to store our pandas DataFrames on disk. We store only one DataFrame per HDF5 file, so the feature of pandas.read_hdf() that allows omitting the key when an HDF file contains a single pandas object is very convenient for our workflow.
However, this feature doesn't work when the saved DataFrame contains one or more categorical columns:
```python
import pandas as pd

df = pd.DataFrame({'col1': [11, 21, 31], 'col2': ['a', 'b', 'a']})

# This works fine.
df.to_hdf('no_cat.hdf5', 'data', format='table')
df2 = pd.read_hdf('no_cat.hdf5')
print((df == df2).all().all())

# But this raises an exception:
df.assign(col2=pd.Categorical(df.col2)).to_hdf('cat.hdf5', 'data', format='table')
df3 = pd.read_hdf('cat.hdf5')
# ValueError: key must be provided when HDF file contains multiple datasets.
```
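Until this is fixed, passing the key explicitly sidesteps the inference entirely. A minimal sketch of that workaround (it needs the PyTables package installed, and it writes to a temporary directory instead of the working directory):

```python
import os
import tempfile

import pandas as pd

# Same data as the repro above, with the categorical column from the start.
df = pd.DataFrame({'col1': [11, 21, 31],
                   'col2': pd.Categorical(['a', 'b', 'a'])})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cat.hdf5')
    df.to_hdf(path, key='data', format='table')
    # Giving the key explicitly avoids the "key must be provided" ValueError.
    df3 = pd.read_hdf(path, 'data')
    assert df3.equals(df)
```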
It looks like this happens because pandas.read_hdf() doesn't ignore the metadata node that pandas itself creates to store the categorical codes:
```python
print(pd.HDFStore('cat.hdf5'))
```
```
<class 'pandas.io.pytables.HDFStore'>
File path: cat.hdf5
/data frame_table (typ->appendable,nrows->3,ncols->2,indexers->[index])
/data/meta/values_block_1/meta series_table (typ->appendable,nrows->2,ncols->1,indexers->[index],dc->[values])
```
It'd be nice if this feature worked even when some of the columns are categorical. It should be possible to ignore the metadata that pandas creates when checking whether only one dataset is stored, no?
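For illustration, the check could filter out pandas-internal metadata paths before counting datasets. A minimal sketch of that idea (the helper name and the `'/meta/'` heuristic are assumptions based on the store listing above, not pandas API):

```python
def user_visible_keys(keys):
    # Hypothetical helper: drop pandas-internal metadata groups, which
    # are stored under '<key>/meta/...' for categorical columns.
    return [k for k in keys if '/meta/' not in k]

# Keys as reported by the HDFStore listing above:
keys = ['/data', '/data/meta/values_block_1/meta']
print(user_visible_keys(keys))  # ['/data']
```

With the metadata node filtered out, only one user-visible dataset remains, so the existing single-dataset shortcut in read_hdf() could apply.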