PERF: Difference in using zipped pickle files

### Pandas version checks

- [X] I have checked that this issue has not already been reported.

- [X] I have confirmed this issue exists on the [latest version](https://pandas.pydata.org/docs/whatsnew/index.html) of pandas.

- [ ] I have confirmed this issue exists on the main branch of pandas.


### Reproducible Example

The following code has four functions. One of them creates a sample data frame and pickles it, zip-compressed, to the current working directory. The other three are different ways to unpickle that file again.

```python
#!/usr/bin/env python3
import io
import zipfile
from datetime import datetime
import pandas as pd
import numpy as np

FN = 'df.pickle.zip'


def create_and_zip_pickle_data():
    num_rows = 1_000_000
    num_cols = 10

    print('Create data frame')

    int_data = np.random.randint(0, 100, size=(num_rows, num_cols // 2))
    str_choices = np.array(['Troi', 'Crusher', 'Yar', 'Guinan'])
    str_data = np.random.choice(str_choices, size=(num_rows, num_cols // 2))
    columns = [f'col_{i}' for i in range(num_cols)]

    df = pd.DataFrame(np.hstack((int_data, str_data)), columns=columns)
    df_one = df.copy()

    for _ in range(20):
        df = pd.concat([df, df_one])

    df = df.reset_index()
    df['col_2'] = df['col_2'].astype('Int16')
    df['col_4'] = df['col_4'].astype('Int16')
    df['col_5'] = df['col_5'].astype('category')
    df['col_7'] = df['col_7'].astype('category')
    df['col_9'] = df['col_9'].astype('category')

    print(df.head())

    print(f'Pickle {len(df):n} rows')
    df.to_pickle(FN)


def unpickle_via_pandas():
    timestamp = datetime.now()
    print('Unpickle with pandas')

    df = pd.read_pickle(FN)
    duration = datetime.now() - timestamp

    print(f'{len(df):n} rows. Duration {duration}.')


def unpickle_from_memory():
    timestamp = datetime.now()
    print('Unpickle after unzipped into RAM')

    # Unzip into RAM
    print('Unzip into RAM')
    with zipfile.ZipFile(FN) as zf:
        stream = io.BytesIO(zf.read(zf.namelist()[0]))

    # Unpickle from RAM
    print('Unpickle from RAM')
    df = pd.read_pickle(stream)

    duration = datetime.now() - timestamp

    print(f'{len(df):n} rows. Duration {duration}.')

def unpickle_zip_filehandle():
    timestamp = datetime.now()
    print('Unpickle with zip filehandle')

    with zipfile.ZipFile(FN) as zf:
        with zf.open('df.pickle') as handle:
            print('Unpickle from filehandle')
            df = pd.read_pickle(handle)

    duration = datetime.now() - timestamp

    print(f'{len(df):n} rows. Duration {duration}.')

if __name__ == '__main__':
    print(f'{pd.__version__=}')
    # Uncomment on the first run to create the zipped pickle file.
    # create_and_zip_pickle_data()
    print('-'*20)
    unpickle_from_memory()
    print('-'*20)
    unpickle_via_pandas()
    print('-'*20)
    unpickle_zip_filehandle()
    print('-'*20)
    print('FIN')

```

### Installed Versions

<details>

INSTALLED VERSIONS
------------------
commit                : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python                : 3.11.5.final.0
python-bits           : 64
OS                    : Windows
OS-release            : 10
Version               : 10.0.19045
machine               : AMD64
processor             : Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
byteorder             : little
LC_ALL                : None
LANG                  : None
LOCALE                : de_DE.cp1252

pandas                : 2.2.2
numpy                 : 1.26.4
pytz                  : 2023.3.post1
dateutil              : 2.8.2
setuptools            : 65.5.0
pip                   : 24.0
Cython                : None
pytest                : 7.4.2
hypothesis            : None
sphinx                : None
blosc                 : None
feather               : None
xlsxwriter            : None
lxml.etree            : 4.9.3
html5lib              : None
pymysql               : None
psycopg2              : None
jinja2                : 3.1.2
IPython               : None
pandas_datareader     : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : None
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : None
gcsfs                 : None
matplotlib            : 3.8.0
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : 3.1.2
pandas_gbq            : None
pyarrow               : 15.0.0
pyreadstat            : None
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : 1.11.3
sqlalchemy            : None
tables                : None
tabulate              : 0.9.0
xarray                : 2023.10.1
xlrd                  : None
zstandard             : None
tzdata                : 2023.3
qtpy                  : None
pyqt5                 : None

</details>


### Prior Performance

I assume this is not a bug, because pandas is "old" in the best sense and well developed. There must be a good reason for this behavior.

Unpickling a data frame from a zip file is very slow (1 min 51 s in my example) compared to unzipping the pickle file into memory as an `io.BytesIO()` object and passing that to `pandas.read_pickle()` (6 s in my example).

In the example code above, the function `unpickle_from_memory()` demonstrates the fast way.
The slower ones are `unpickle_via_pandas()` and `unpickle_zip_filehandle()`. The latter might illustrate how pandas works with the zip file internally.

Here is the output from the script:
```
pd.__version__='2.2.2'
--------------------
Unpickle after unzipped into RAM
Unzip into RAM
Unpickle from RAM
21000000 rows. Duration 0:00:06.289123.
--------------------
Unpickle with pandas
21000000 rows. Duration 0:01:51.749488.
--------------------
Unpickle with zip filehandle
Unpickle from filehandle
21000000 rows. Duration 0:01:50.909909.
--------------------
FIN
```
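
The fast path from `unpickle_from_memory()` can be wrapped into a small reusable helper. The following is only a sketch using the standard library: `load_pickle_from_zip` is a name I made up (it is not a pandas API), and it calls `pickle.load()` where the script calls `pandas.read_pickle()`. I assume, but have not verified, that swapping in `pandas.read_pickle(buffer)` gives the same behavior for a data frame.

```python
import io
import pickle
import zipfile


def load_pickle_from_zip(path, member=None):
    """Decompress one archive member fully into RAM, then unpickle it.

    Hypothetical helper for illustration, not part of pandas.
    """
    with zipfile.ZipFile(path) as zf:
        # Default to the first member, as unpickle_from_memory() does.
        name = member if member is not None else zf.namelist()[0]
        buffer = io.BytesIO(zf.read(name))
    # For a data frame, pandas.read_pickle(buffer) could be used here instead.
    return pickle.load(buffer)
```

The trade-off is memory: the whole decompressed pickle sits in RAM next to the resulting object while it is being loaded.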

My question is: why is it this way? Wouldn't pandas be faster and more efficient if it used the method demonstrated in `unpickle_from_memory()`?
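
One way to narrow down where the time goes might be to take pandas out of the picture and time plain `pickle.load()` on both kinds of handle. The sketch below uses only the standard library. The payload, the member name `payload.pickle`, and all function names are made up for illustration, and it says nothing definitive about pandas internals. If a similar gap shows up here, the cost probably lies in reading from the zip file handle rather than in pandas itself.

```python
import io
import os
import pickle
import tempfile
import time
import zipfile

MEMBER = 'payload.pickle'  # made-up archive member name


def build_archive(path, payload):
    """Write `payload` as a single pickled, deflate-compressed member."""
    with zipfile.ZipFile(path, 'w', zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(MEMBER, pickle.dumps(payload))


def load_streaming(path):
    """Unpickle directly from the zip member's file handle."""
    with zipfile.ZipFile(path) as zf:
        with zf.open(MEMBER) as handle:
            return pickle.load(handle)


def load_buffered(path):
    """Decompress the member into RAM first, then unpickle."""
    with zipfile.ZipFile(path) as zf:
        return pickle.load(io.BytesIO(zf.read(MEMBER)))


def timed(func, *args):
    """Return (result, elapsed seconds) for one call of func."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == '__main__':
    payload = [str(i) for i in range(200_000)]  # made-up test data
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'payload.pickle.zip')
        build_archive(path, payload)
        for func in (load_streaming, load_buffered):
            result, seconds = timed(func, path)
            assert result == payload
            print(f'{func.__name__}: {seconds:.3f} s')
```

The payload here is far smaller and simpler than the 21 million row data frame, so the absolute numbers will not be comparable; only the ratio between the two variants is of interest.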

PERF: Difference in using zipped pickle files #59279