BUG: Parquet size grows exponentially for categorical data  #55776

Closed
@aseganti

Description

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

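import os

import pandas as pd

if __name__ == "__main__":
    for n in [10, 1e2, 1e3, 1e4, 1e5]:
        for n_col in [1, 10, 100, 1000, 10000]:
            # Note: the key "{i}" is a plain string, not an f-string, so every row
            # has the same single column and n_col does not change the frame.
            df = pd.DataFrame([{"{i}": f"{i}_cat" for col in range(n_col)} for i in range(int(n))])
            # Baseline: at most the first 100 rows, stored as plain strings.
            df.iloc[0:100].to_parquet("a.parquet")
            # Convert every column to categorical (n distinct categories).
            for col in df.columns:
                df[col] = df[col].astype("category")
            # Same rows as the baseline, but the slice still carries all n categories.
            df.iloc[0:100].to_parquet("b.parquet")
            a_size_mb = os.stat("a.parquet").st_size / (1024 * 1024)
            b_size_mb = os.stat("b.parquet").st_size / (1024 * 1024)
            print(f"{n} {n_col} {a_size_mb} {b_size_mb} {100*b_size_mb/a_size_mb:.2f}")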

Issue Description

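It seems that when a data frame containing a categorical column is saved to Parquet, the file can grow far larger than the same data saved without the categorical dtype. In the output below, the file size grows roughly in proportion to the number of categories, even though at most 100 rows are written.

This seems to happen because, when we save categorical data to Parquet, we save the data plus all the categories that exist in the original data. This happens even when those categories do not appear in the rows actually being saved.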

To reproduce the bug, it is enough to run the script above.

That produces this output (columns: n, n_col, size of a.parquet in MB, size of b.parquet in MB, b/a as a percentage):

10 1 0.0015506744384765625 0.001689910888671875 108.98
10 10 0.0015506744384765625 0.001689910888671875 108.98
10 100 0.0015506744384765625 0.001689910888671875 108.98
10 1000 0.0015506744384765625 0.001689910888671875 108.98
10 10000 0.0015506744384765625 0.001689910888671875 108.98
100.0 1 0.0019960403442382812 0.0021104812622070312 105.73
100.0 10 0.0019960403442382812 0.0021104812622070312 105.73
100.0 100 0.0019960403442382812 0.0021104812622070312 105.73
100.0 1000 0.0019960403442382812 0.0021104812622070312 105.73
100.0 10000 0.0019960403442382812 0.0021104812622070312 105.73
1000.0 1 0.0019960403442382812 0.0053577423095703125 268.42
1000.0 10 0.0019960403442382812 0.0053577423095703125 268.42
1000.0 100 0.0019960403442382812 0.0053577423095703125 268.42
1000.0 1000 0.0019960403442382812 0.0053577423095703125 268.42
1000.0 10000 0.0019960403442382812 0.0053577423095703125 268.42
10000.0 1 0.0019960403442382812 0.042061805725097656 2107.26
10000.0 10 0.0019960403442382812 0.042061805725097656 2107.26
10000.0 100 0.0019960403442382812 0.042061805725097656 2107.26
10000.0 1000 0.0019960403442382812 0.042061805725097656 2107.26
10000.0 10000 0.0019960403442382812 0.042061805725097656 2107.26
100000.0 1 0.0019960403442382812 0.43596935272216797 21841.71
100000.0 10 0.0019960403442382812 0.43596935272216797 21841.71
100000.0 100 0.0019960403442382812 0.43596935272216797 21841.71
100000.0 1000 0.0019960403442382812 0.43596935272216797 21841.71
100000.0 10000 0.0019960403442382812 0.43596935272216797 21841.71
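
The explanation given in the issue description (a slice keeps every category of the original column) can be checked without writing any files. The snippet below is a minimal sketch using only pandas; the variable names are illustrative and not part of the original report.

```python
import pandas as pd

# Build a categorical column with 1,000 distinct values.
s = pd.Series([f"{i}_cat" for i in range(1000)], dtype="category")

# Slicing keeps the full category list, even though only 100 values remain.
head = s.iloc[:100]
n_categories_kept = len(head.cat.categories)
n_values_used = head.nunique()

# Dropping unused categories shrinks the category list to the values present.
trimmed = head.cat.remove_unused_categories()
n_categories_trimmed = len(trimmed.cat.categories)

print(n_categories_kept, n_values_used, n_categories_trimmed)
# prints: 1000 100 100
```

This only shows what the in-memory slice carries; how much of that ends up in the Parquet file depends on the writer (pyarrow 9.0.0 in this report).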

Expected Behavior

In my opinion either:

  1. The two files should have (almost) the same size
  2. There should be a warning telling the user that such a difference in size is possible
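
Until one of these is in place, a possible workaround is to drop unused categories before writing. The sketch below assumes that the extra size comes from the unused categories, as described above. `drop_unused_categories` is a hypothetical helper written for this example, not a pandas function; it relies only on the existing `Series.cat.remove_unused_categories()` accessor method.

```python
import pandas as pd


def drop_unused_categories(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy of df whose categorical
    columns keep only the categories that actually occur."""
    out = df.copy()
    for col in out.select_dtypes(include="category").columns:
        out[col] = out[col].cat.remove_unused_categories()
    return out


# Example frame: one categorical column with 10,000 categories.
df = pd.DataFrame(
    {"label": [f"{i}_cat" for i in range(10_000)], "value": range(10_000)}
)
df["label"] = df["label"].astype("category")

# The plain slice still references all 10,000 categories ...
raw_subset = df.iloc[:100]
# ... while the cleaned slice keeps only the 100 that occur in it.
subset = drop_unused_categories(raw_subset)

print(len(raw_subset["label"].cat.categories), len(subset["label"].cat.categories))
```

Calling `subset.to_parquet("b.parquet")` afterwards should then write only the observed categories, though the resulting file size was not measured here.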

Installed Versions

INSTALLED VERSIONS
------------------
commit : ba1cccd
python : 3.10.12.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.120+
Version : #1 SMP Wed Aug 30 11:19:59 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.0
numpy : 1.23.5
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 67.7.2
pip : 23.1.2
Cython : 3.0.4
pytest : 7.4.3
hypothesis : None
sphinx : 5.0.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.3
html5lib : 1.1
pymysql : None
psycopg2 : 2.9.9
jinja2 : 3.1.2
IPython : 7.34.0
pandas_datareader : 0.10.0
bs4 : 4.11.2
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : 2023.6.0
gcsfs : 2023.6.0
matplotlib : 3.7.1
numba : 0.56.4
numexpr : 2.8.7
odfpy : None
openpyxl : 3.1.2
pandas_gbq : 0.17.9
pyarrow : 9.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.3
sqlalchemy : 2.0.22
tables : 3.8.0
tabulate : 0.9.0
xarray : 2023.7.0
xlrd : 2.0.1
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None
