BUG: pandas to_hdf has a problem with compressing multi key storage #45286

Open
@afshinmoshrefi

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd
import os

df = pd.read_csv('covid-variants.csv')

# saving the entire dataframe with a single key as hdf5 format
df.to_hdf('covid-variants-flat.h5',key='flat',complib='blosc',complevel=9)

# break down by country and save each country's data with the country name as key
for location, dftmp in df.groupby('location'):
    keyvalue = location.replace(' ','_').replace('(','').replace(')','')
    # compressions tried: 'zlib', 'lzo', 'bzip2', 'blosc'
    dftmp.to_hdf('covid-variants-by-country.h5',key=keyvalue,complib='blosc',complevel=9)

flat_size  = os.path.getsize('covid-variants-flat.h5')
multi_size = os.path.getsize('covid-variants-by-country.h5')
print('size of flat hdf5      =',flat_size  )
print('size of multi key hdf5 =',multi_size   )
print('size ratio             =', str(int(multi_size/flat_size))+'x')

Issue Description

pandas to_hdf works as expected when saving one dataframe under a single key.
When the same data is saved as multiple dataframes under separate keys in one file, the resulting file does not appear to be compressed.
The demo code uses a public dataset from Kaggle:
covid-variants.csv

In the example, I first save the dataframe under a single key (the flat HDF5 file).
I then split it by country and save each country's data under its own key.
The data is exactly the same in both files.
The HDF5 file with multiple keys is 65x larger than the flat HDF5 file.
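
The key names in the example are built from the country names by replacing spaces and dropping parentheses. A slightly more general sketch of that step is shown here, assuming keys should contain only letters, digits and underscores; the helper name `location_to_key` and the `loc_` prefix are hypothetical and not part of pandas.

```python
import re


def location_to_key(location):
    """Turn a free-form location name into an identifier-like HDF5 key.

    Hypothetical helper: every run of characters other than letters,
    digits and underscores becomes a single underscore.
    """
    key = re.sub(r"[^0-9A-Za-z_]+", "_", location.strip()).strip("_")
    # Identifier-like names cannot be empty or start with a digit.
    if not key or key[0].isdigit():
        key = "loc_" + key
    return key
```

For names that contain only spaces and parentheses as special characters, this should give the same keys as the chained `replace` calls in the example.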

The zip file includes covid-variants.csv and the example in .py and .ipynb formats:
pandas-hdf-bug-multi-key-compression.zip

Expected Behavior

When saving a dataframe as multiple smaller dataframes, there is some per-key overhead, so the file should be somewhat bigger than when the dataframe is stored under a single key.

However, in the above example, the file with multiple keys is 65x larger than the flat file.

Something seems to be very wrong with compression in to_hdf when there are multiple keys in the file.
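
To check whether the size difference depends on this particular dataset, the same comparison could be run on synthetic data. The following is only a sketch: it assumes PyTables is installed, and the function name `compare_hdf_sizes`, the column names and the row and group counts are made up for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def compare_hdf_sizes(n_rows=20_000, n_groups=50, complib="blosc", complevel=9):
    """Write the same synthetic rows once under a single key and once
    split across one key per group; return both file sizes in bytes."""
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "group_id": rng.integers(0, n_groups, size=n_rows),
        "value": rng.integers(0, 100, size=n_rows),
    })
    with tempfile.TemporaryDirectory() as tmp:
        flat_path = os.path.join(tmp, "flat.h5")
        multi_path = os.path.join(tmp, "multi.h5")
        # One key holding the whole dataframe.
        df.to_hdf(flat_path, key="flat", mode="w",
                  complib=complib, complevel=complevel)
        # One key per group, all appended to the same file.
        for group_id, group in df.groupby("group_id"):
            group.to_hdf(multi_path, key=f"group_{group_id}", mode="a",
                         complib=complib, complevel=complevel)
        return os.path.getsize(flat_path), os.path.getsize(multi_path)


if __name__ == "__main__":
    flat_size, multi_size = compare_hdf_sizes()
    print("size of flat hdf5      =", flat_size)
    print("size of multi key hdf5 =", multi_size)
```

Each run writes into a fresh temporary directory, so the sizes are not affected by files left over from an earlier run.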

Installed Versions

Replace this line with the output of pd.show_versions()
