Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this issue exists on the latest version of pandas.
- I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
The following code has four functions. One of them creates a sample data frame and pickles it into a zip file in the current working directory. The other three are different ways to unpickle that file again.
#!/usr/bin/env python3
import io
import zipfile
from datetime import datetime
import pandas as pd
import numpy as np
FN = 'df.pickle.zip'
def create_and_zip_pickle_data():
    num_rows = 1_000_000
    num_cols = 10
    print('Create data frame')
    int_data = np.random.randint(0, 100, size=(num_rows, num_cols // 2))
    str_choices = np.array(['Troi', 'Crusher', 'Yar', 'Guinan'])
    str_data = np.random.choice(str_choices, size=(num_rows, num_cols // 2))
    columns = [f'col_{i}' for i in range(num_cols)]
    df = pd.DataFrame(np.hstack((int_data, str_data)), columns=columns)
    df_one = df.copy()
    for _ in range(20):
        df = pd.concat([df, df_one])
    df = df.reset_index()
    df['col_2'] = df['col_2'].astype('Int16')
    df['col_4'] = df['col_4'].astype('Int16')
    df['col_5'] = df['col_5'].astype('category')
    df['col_7'] = df['col_7'].astype('category')
    df['col_9'] = df['col_9'].astype('category')
    print(df.head())
    print(f'Pickle {len(df):n} rows')
    df.to_pickle(FN)
def unpickle_via_pandas():
    timestamp = datetime.now()
    print('Unpickle with pandas')
    df = pd.read_pickle(FN)
    duration = datetime.now() - timestamp
    print(f'{len(df):n} rows. Duration {duration}.')
def unpickle_from_memory():
    timestamp = datetime.now()
    print('Unpickle after unzipped into RAM')
    # Unzip into RAM
    print('Unzip into RAM')
    with zipfile.ZipFile(FN) as zf:
        stream = io.BytesIO(zf.read(zf.namelist()[0]))
    # Unpickle from RAM
    print('Unpickle from RAM')
    df = pd.read_pickle(stream)
    duration = datetime.now() - timestamp
    print(f'{len(df):n} rows. Duration {duration}.')
def unpickle_zip_filehandle():
    timestamp = datetime.now()
    print('Unpickle with zip filehandle')
    with zipfile.ZipFile(FN) as zf:
        with zf.open('df.pickle') as handle:
            print('Unpickle from filehandle')
            df = pd.read_pickle(handle)
    duration = datetime.now() - timestamp
    print(f'{len(df):n} rows. Duration {duration}.')
if __name__ == '__main__':
    print(f'{pd.__version__=}')
    # Uncomment on the first run to create the zipped pickle file.
    # create_and_zip_pickle_data()
    print('-'*20)
    unpickle_from_memory()
    print('-'*20)
    unpickle_via_pandas()
    print('-'*20)
    unpickle_zip_filehandle()
    print('-'*20)
    print('FIN')
Installed Versions
INSTALLED VERSIONS
commit : d9cdd2e
python : 3.11.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : de_DE.cp1252
pandas : 2.2.2
numpy : 1.26.4
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 24.0
Cython : None
pytest : 7.4.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.0
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 15.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.11.3
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : 2023.10.1
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None
Prior Performance
I assume this is not a bug, because pandas is "old" in the best sense of the word and well developed. There must be a good reason for this behavior.
Unpickling a data frame from a zip file is very slow (1 min 51 s in my example) compared to unzipping the pickle file into memory as an io.BytesIO() object and passing that object to pandas.read_pickle() (6 s in my example).
In the example code, the function unpickle_from_memory() demonstrates the fast way. The slow ones are unpickle_via_pandas() and unpickle_zip_filehandle(). The latter might illustrate how pandas works internally with that zip file.
Here is the output from the script:
pd.__version__='2.2.2'
--------------------
Unpickle after unzipped into RAM
Unzip into RAM
Unpickle from RAM
21000000 rows. Duration 0:00:06.289123.
--------------------
Unpickle with pandas
21000000 rows. Duration 0:01:51.749488.
--------------------
Unpickle with zip filehandle
Unpickle from filehandle
21000000 rows. Duration 0:01:50.909909.
--------------------
FIN
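To check whether the gap comes from the zip file handle rather than from pandas, the two read paths can also be timed with the standard library alone. This is only a minimal sketch: time_both_paths() is a hypothetical helper name, and that pandas.read_pickle() ends up calling pickle.load() on the handle is my assumption, not something I have verified. No timings are claimed here; they will depend on the payload and the machine.

```python
import io
import pickle
import time
import zipfile


def time_both_paths(zip_path, member):
    """Return (seconds_from_handle, seconds_from_memory) for one archive member.

    Hypothetical helper for illustration only; it is not part of pandas.
    """
    with zipfile.ZipFile(zip_path) as zf:
        # Path 1: unpickle directly from the zip file handle.
        start = time.perf_counter()
        with zf.open(member) as handle:
            pickle.load(handle)
        handle_seconds = time.perf_counter() - start

        # Path 2: decompress the member into RAM first, then unpickle.
        start = time.perf_counter()
        pickle.load(io.BytesIO(zf.read(member)))
        memory_seconds = time.perf_counter() - start

    return handle_seconds, memory_seconds


if __name__ == '__main__':
    import os
    import tempfile

    # Small stand-in payload; the data frame from the script is much larger.
    payload = {'values': list(range(100_000)), 'name': 'Guinan'}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'demo.pickle.zip')
        with zipfile.ZipFile(path, 'w', zipfile.ZIP_DEFLATED) as zf:
            zf.writestr('demo.pickle', pickle.dumps(payload))
        from_handle, from_memory = time_both_paths(path, 'demo.pickle')
        print(f'handle: {from_handle:.4f}s, memory: {from_memory:.4f}s')
```

Calling time_both_paths('df.pickle.zip', 'df.pickle') on the file from the script would need pandas installed so that the data frame can be unpickled.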
My question is: why is it that way? Wouldn't pandas be faster and more efficient if it used the method demonstrated in unpickle_from_memory()?
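As a workaround in the meantime, the fast path can be wrapped in a small helper. This is a sketch under assumptions: load_pickle_from_zip() and its parameters are hypothetical names, and the loader defaults to pickle.load so that the snippet runs without pandas. It holds the whole decompressed pickle in RAM, which may not be acceptable for very large files.

```python
import io
import pickle
import zipfile


def load_pickle_from_zip(path, member=None, loader=pickle.load):
    """Decompress one archive member fully into RAM, then unpickle it.

    Hypothetical helper, not a pandas API. `loader` is any callable that
    accepts a binary file-like object.
    """
    with zipfile.ZipFile(path) as zf:
        # Default to the first member, as unpickle_from_memory() does.
        name = member if member is not None else zf.namelist()[0]
        buffer = io.BytesIO(zf.read(name))
    return loader(buffer)
```

With pandas available, the call would be load_pickle_from_zip('df.pickle.zip', loader=pd.read_pickle), which does the same thing as unpickle_from_memory().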