
PERF: Saving many datasets in a single group slows down with each new addition #58248

Closed
@Ieremie

Description


Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

I found a strange behaviour that seems to appear only for PyTables (via pandas).

Saving many datasets within a single group becomes progressively slower.

import random
import string
import time

import matplotlib.pyplot as plt
import pandas as pd
import tqdm

# 1000 rows x 13 identical float columns
df = pd.DataFrame({'A': [1.0] * 1000})
df = pd.concat([df] * 13, axis=1, ignore_index=True)

size = 5000
timings = []
for i in tqdm.tqdm(range(size), total=size):
    # a fresh random 20-character key per write
    key = ''.join(random.choices(string.ascii_uppercase, k=20))

    start = time.time()
    df.to_hdf('test.h5', key=key, mode='a', complevel=9)
    timings.append(time.time() - start)

plt.plot(timings[10:])
plt.show()
[Plot: per-write timing rises steadily as more datasets are added.]
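To help narrow this down, here is a minimal variant (a sketch; test_store.h5 is a hypothetical filename) that keeps a single pd.HDFStore open for the whole loop and writes through store.put(), so the file is not reopened on every call. If the slowdown persists here too, the cost would be in per-write bookkeeping rather than file open/close:

import random
import string
import time

import pandas as pd

df = pd.DataFrame({'A': [1.0] * 1000})
df = pd.concat([df] * 13, axis=1, ignore_index=True)

timings = []
# one open store for all writes; to_hdf() would reopen the file per call
with pd.HDFStore('test_store.h5', mode='w', complevel=9) as store:
    for i in range(5000):
        key = ''.join(random.choices(string.ascii_uppercase, k=20))

        start = time.time()
        store.put(key, df)  # default fixed format, same path as to_hdf()
        timings.append(time.time() - start)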

This is not the case for h5py, which easily writes 100× as many datasets without slowing down.

import random
import string
import time

import h5py
import matplotlib.pyplot as plt
import tqdm

# uses the same df as in the pandas snippet above
size = 500000
timings = []
with h5py.File('test2.h5', 'w', libver='latest') as hf:
    group = hf.create_group('group')
    for i in tqdm.tqdm(range(size), total=size):
        key = ''.join(random.choices(string.ascii_uppercase, k=20))

        start = time.time()
        group.create_dataset(key, data=df.values,
                             compression="gzip", compression_opts=9)
        timings.append(time.time() - start)

plt.plot(timings[10:])
plt.show()
[Plot: per-write timing stays flat across 500,000 datasets.]
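It may or may not be related, but the two files are structured differently: pandas' fixed format stores each key as an HDF5 group containing several child arrays (the axes plus the block values), while the h5py snippet creates a single dataset per key, so the pandas file accumulates several nodes per write. A quick way to inspect the layout, assuming test.h5 was produced by the first snippet:

import tables

with tables.open_file('test.h5', mode='r') as h5:
    group = h5.root._f_list_nodes()[0]    # one pandas-written key
    print(group)                          # the group node itself
    for child in group._f_iter_nodes():   # its child arrays
        print('  ', child)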

Installed Versions

Replace this line with the output of pd.show_versions()

Prior Performance

I have raised this issue in the PyTables repo, and it seems it is actually an issue with pandas: https://github.com/PyTables/PyTables/issues/1155
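For anyone trying to narrow this down further, profiling a single late write into the already-populated file should show where the extra time goes (a sketch; assumes test.h5 has been filled by the reproducer above, and PROFILE_KEY is just a placeholder key name):

import cProfile

import pandas as pd

df = pd.DataFrame({'A': [1.0] * 1000})
df = pd.concat([df] * 13, axis=1, ignore_index=True)

# profile one write into the already-large file
cProfile.runctx(
    "df.to_hdf('test.h5', key='PROFILE_KEY', mode='a', complevel=9)",
    globals(), locals(),
    sort='cumulative',
)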
