Description
Pandas version checks

- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [x] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import os

import pandas as pd

if __name__ == "__main__":
    for n in [10, 1e2, 1e3, 1e4, 1e5]:
        for n_col in [1, 10, 100, 1000, 10000]:
            # Build a DataFrame with n rows of unique string labels.
            df = pd.DataFrame(
                [{"{i}": f"{i}_cat" for col in range(n_col)} for i in range(int(n))]
            )
            # Write the first 100 rows with plain object-dtype columns.
            df.iloc[0:100].to_parquet("a.parquet")
            # Convert every column to categorical and write the same 100 rows again.
            for col in df.columns:
                df[col] = df[col].astype("category")
            df.iloc[0:100].to_parquet("b.parquet")
            a_size_mb = os.stat("a.parquet").st_size / (1024 * 1024)
            b_size_mb = os.stat("b.parquet").st_size / (1024 * 1024)
            print(f"{n} {n_col} {a_size_mb} {b_size_mb} {100 * b_size_mb / a_size_mb:.2f}")
Issue Description
It seems that when saving a slice of a DataFrame with a categorical column to parquet, the file size can grow far beyond the size of the data actually written, roughly linearly with the number of categories in the original DataFrame (see the output below).
This seems to happen because when we save categorical data to parquet, we save the data plus all the categories existing in the original DataFrame. This happens even when those categories are not present in the rows being written.
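A minimal sketch of why this happens, using only standard pandas behavior: slicing a categorical Series keeps the full category index, and (per the observation above) that index is what ends up serialized as the parquet dictionary.

import pandas as pd

# Slicing a categorical keeps the full category index in memory.
s = pd.Series([f"{i}_cat" for i in range(100_000)], dtype="category")
head = s.iloc[0:100]
print(len(head))                 # 100 values in the slice
print(len(head.cat.categories))  # 100000 categories, all retained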
To reproduce the bug, it is enough to run the script above, which produces the following output (columns: n, n_col, a.parquet size in MB, b.parquet size in MB, and the b/a size ratio in %):
10 1 0.0015506744384765625 0.001689910888671875 108.98
10 10 0.0015506744384765625 0.001689910888671875 108.98
10 100 0.0015506744384765625 0.001689910888671875 108.98
10 1000 0.0015506744384765625 0.001689910888671875 108.98
10 10000 0.0015506744384765625 0.001689910888671875 108.98
100.0 1 0.0019960403442382812 0.0021104812622070312 105.73
100.0 10 0.0019960403442382812 0.0021104812622070312 105.73
100.0 100 0.0019960403442382812 0.0021104812622070312 105.73
100.0 1000 0.0019960403442382812 0.0021104812622070312 105.73
100.0 10000 0.0019960403442382812 0.0021104812622070312 105.73
1000.0 1 0.0019960403442382812 0.0053577423095703125 268.42
1000.0 10 0.0019960403442382812 0.0053577423095703125 268.42
1000.0 100 0.0019960403442382812 0.0053577423095703125 268.42
1000.0 1000 0.0019960403442382812 0.0053577423095703125 268.42
1000.0 10000 0.0019960403442382812 0.0053577423095703125 268.42
10000.0 1 0.0019960403442382812 0.042061805725097656 2107.26
10000.0 10 0.0019960403442382812 0.042061805725097656 2107.26
10000.0 100 0.0019960403442382812 0.042061805725097656 2107.26
10000.0 1000 0.0019960403442382812 0.042061805725097656 2107.26
10000.0 10000 0.0019960403442382812 0.042061805725097656 2107.26
100000.0 1 0.0019960403442382812 0.43596935272216797 21841.71
100000.0 10 0.0019960403442382812 0.43596935272216797 21841.71
100000.0 100 0.0019960403442382812 0.43596935272216797 21841.71
100000.0 1000 0.0019960403442382812 0.43596935272216797 21841.71
100000.0 10000 0.0019960403442382812 0.43596935272216797 21841.71
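Reading the categorical file back supports this explanation (a sketch against the last b.parquet written above, i.e. the n = 100000 case; with the pyarrow engine the categorical dtype round-trips):

import pandas as pd

# Sketch: inspect the categorical parquet file from the last iteration.
df2 = pd.read_parquet("b.parquet")
col = df2.columns[0]
print(len(df2))                      # 100 rows were written
print(len(df2[col].cat.categories))  # but the category index is the full set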
Expected Behavior
In my opinion, either:
- The two files should have (almost) the same size, or
- There should be a warning telling the user that such a difference in size is possible.
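A possible user-side workaround, sketched here as an assumption rather than a documented fix: drop unused categories before writing, so only the categories actually present in the slice are serialized (remove_unused_categories is standard pandas API; the helper name is hypothetical).

import pandas as pd

# Workaround sketch: trim unused categories before writing a slice to
# parquet, so the file only stores the categories that actually occur.
def to_parquet_trimmed(df: pd.DataFrame, path: str) -> None:
    out = df.copy()
    for col in out.columns:
        if isinstance(out[col].dtype, pd.CategoricalDtype):
            out[col] = out[col].cat.remove_unused_categories()
    out.to_parquet(path)

With this, writing the 100-row slice from the script above should produce a file close in size to a.parquet.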
Installed Versions
pandas : 2.1.0
numpy : 1.23.5
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 67.7.2
pip : 23.1.2
Cython : 3.0.4
pytest : 7.4.3
hypothesis : None
sphinx : 5.0.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.3
html5lib : 1.1
pymysql : None
psycopg2 : 2.9.9
jinja2 : 3.1.2
IPython : 7.34.0
pandas_datareader : 0.10.0
bs4 : 4.11.2
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : 2023.6.0
gcsfs : 2023.6.0
matplotlib : 3.7.1
numba : 0.56.4
numexpr : 2.8.7
odfpy : None
openpyxl : 3.1.2
pandas_gbq : 0.17.9
pyarrow : 9.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.3
sqlalchemy : 2.0.22
tables : 3.8.0
tabulate : 0.9.0
xarray : 2023.7.0
xlrd : 2.0.1
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None