Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the master branch of pandas.
Reproducible Example
import pandas as pd
import os
df = pd.read_csv('covid-variants.csv')
# saving the entire dataframe with a single key as hdf5 format
df.to_hdf('covid-variants-flat.h5',key='flat',complib='blosc',complevel=9)
# break down by country and save each country data with country name as key
d = list(df.groupby('location'))
for i in range(len(d)):
    dftmp = d[i][1].copy()
    keyvalue = d[i][0].replace(' ','_').replace('(','').replace(')','')
    # compressions tried: 'zlib', 'lzo', 'bzip2', 'blosc'
    dftmp.to_hdf('covid-variants-by-country.h5',key=keyvalue,complib='blosc',complevel=9)
flat_size = os.path.getsize('covid-variants-flat.h5')
multi_size = os.path.getsize('covid-variants-by-country.h5')
print('size of flat hdf5 =',flat_size )
print('size of multi key hdf5 =',multi_size )
print('size ratio =', str(int(multi_size/flat_size))+'x')
Issue Description
pandas to_hdf works as expected when saving one dataframe under a single key.
When the same data is saved as multiple dataframes under multiple keys in one file, the saved file does not appear to be compressed.
The demo code uses a public dataset from Kaggle:
covid-variants.csv
In the example, I first save the dataframe under a single key as a flat HDF5 file.
I then break it down by country and save each country's data under its own key.
The overall data is exactly the same in the two files.
The HDF5 file with multiple keys is 65x larger than the flat HDF5 file.
The attached zip file includes covid-variants.csv and the example in .py and .ipynb formats:
pandas-hdf-bug-multi-key-compression.zip
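For readers without the attached dataset, the same flat versus multi-key comparison can be sketched on synthetic data. This is only a rough sketch under several assumptions: the column names, the `loc_<id>` key scheme, the helper name `compare_flat_vs_multi_key`, and the numeric-only data are all made up for illustration, zlib is used instead of blosc, and PyTables must be installed. It is not expected to reproduce the 65x ratio; it only shows how the two files can be written, read back, and compared.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def compare_flat_vs_multi_key(n_groups=50, rows_per_group=200,
                              complib="zlib", complevel=9):
    """Write one synthetic frame as a single key and as one key per group.

    Returns the two file sizes, the number of keys in the multi-key file,
    and whether the multi-key file round-trips to the original frame.
    """
    rng = np.random.default_rng(0)
    n_rows = n_groups * rows_per_group
    # Hypothetical numeric stand-in for the real CSV columns.
    df = pd.DataFrame({
        "location_id": np.repeat(np.arange(n_groups), rows_per_group),
        "num_sequences": rng.integers(0, 1000, size=n_rows),
        "perc_sequences": rng.random(n_rows),
    })

    with tempfile.TemporaryDirectory() as tmp:
        flat_path = os.path.join(tmp, "flat.h5")
        multi_path = os.path.join(tmp, "multi.h5")

        # Single key holding the whole frame.
        df.to_hdf(flat_path, key="flat", mode="w",
                  complib=complib, complevel=complevel)

        # One key per group, appended to the same file.
        for loc_id, group in df.groupby("location_id"):
            group.to_hdf(multi_path, key=f"loc_{loc_id}", mode="a",
                         complib=complib, complevel=complevel)

        # Read every key back and rebuild the original row order.
        with pd.HDFStore(multi_path, mode="r") as store:
            keys = store.keys()
            restored = pd.concat([store[k] for k in keys]).sort_index()

        return {
            "flat_size": os.path.getsize(flat_path),
            "multi_size": os.path.getsize(multi_path),
            "n_keys": len(keys),
            "same_data": restored.equals(df),
        }


if __name__ == "__main__":
    result = compare_flat_vs_multi_key()
    print(result)
    print("size ratio =", round(result["multi_size"] / result["flat_size"], 1))
```

Varying `n_groups` and `rows_per_group` while holding the total row count fixed gives a rough idea of how much of the size difference scales with the number of keys rather than with the amount of data.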
Expected Behavior
When a dataframe is saved as multiple smaller dataframes, there is some overhead, so the file should be somewhat bigger than when the dataframe is stored under a single key.
However, in the above example, the file with multiple keys is 65x larger than the flat file.
There seems to be something very wrong with compression in to_hdf when there are multiple keys in the file.
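To illustrate the kind of overhead that is expected when data is split into many small pieces, here is a small standard-library sketch that compresses the same bytes once as a whole and once as many independent chunks. It is only an analogy: it uses zlib directly on made-up CSV-like text and says nothing about how PyTables lays out or compresses each key internally. The helper name `compressed_sizes` and the payload are assumptions for illustration.

```python
import zlib


def compressed_sizes(payload: bytes, n_chunks: int, level: int = 9):
    """Return (size compressed as one block, total size compressed per chunk)."""
    whole = len(zlib.compress(payload, level))
    # Ceiling division so every byte lands in exactly one chunk.
    step = -(-len(payload) // n_chunks)
    chunks = [payload[i:i + step] for i in range(0, len(payload), step)]
    split = sum(len(zlib.compress(chunk, level)) for chunk in chunks)
    return whole, split


# Made-up, highly repetitive CSV-like rows (not the real dataset).
payload = b"".join(
    f"country_{i % 50},variant_{i % 7},{i % 13}\n".encode()
    for i in range(20000)
)

whole, split = compressed_sizes(payload, n_chunks=100)
print("compressed as one block :", whole)
print("compressed as 100 chunks:", split)
```

Each independent chunk pays its own header cost and cannot reuse patterns seen in earlier chunks, so some growth is normal. Whether that alone could account for a 65x difference in an HDF5 file is exactly what this report questions.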
Installed Versions
Replace this line with the output of pd.show_versions()