
BUG: Parquet roundtrip fails with numerical categorical dtype #60491

Open
@adrienpacifico


Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> df = df.astype({"A": "category"})
>>> print(df.dtypes)
A    category
B       int64
dtype: object
>>> df.to_parquet('test.parquet')
>>> df_roundtrip = pd.read_parquet('test.parquet')
>>> print(df_roundtrip.dtypes)
A    int64
B    int64
dtype: object
>>> assert df_roundtrip.dtypes.equals(df.dtypes)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError

Issue Description

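The categorical dtype is lost in a Parquet roundtrip when the categories are numeric. After to_parquet followed by read_parquet, column A comes back as int64 instead of category, so the dtypes of df_roundtrip no longer match those of df.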

Expected Behavior

df_roundtrip should have the same dtypes as df: column A should still be category after the roundtrip.

Hot-Fix

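import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
# Cast to str before converting to category, so the categories are strings rather than integers.
df = df.astype({"A": "str"}).astype({"A": "category"})
print(df.dtypes)
df.to_parquet("test.parquet")
df_roundtrip = pd.read_parquet("test.parquet")
print(df_roundtrip.dtypes)

# Passes: the string-valued categorical survives the roundtrip.
assert df_roundtrip.dtypes.equals(df.dtypes)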

Installed Versions

INSTALLED VERSIONS

commit : 0691c5c
python : 3.12.2
python-bits : 64
OS : Darwin
OS-release : 23.5.0
Version : Darwin Kernel Version 23.5.0: Wed May 1 20:16:51 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T8103
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : fr_FR.UTF-8
LOCALE : fr_FR.UTF-8

pandas : 2.2.3
numpy : 2.0.2
pytz : 2024.2
dateutil : 2.9.0.post0
pip : 24.0
Cython : None
sphinx : None
IPython : 8.24.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
blosc : None
bottleneck : 1.4.2
dataframe-api-compat : None
fastparquet : 2024.11.0
fsspec : 2024.10.0
html5lib : 1.1
hypothesis : 6.122.1
gcsfs : 2024.10.0
jinja2 : 3.1.4
lxml.etree : 5.3.0
matplotlib : 3.9.3
numba : 0.60.0
numexpr : 2.10.2
odfpy : None
openpyxl : 3.1.5
pandas_gbq : 0.24.0
psycopg2 : 2.9.10
pymysql : 1.4.6
pyarrow : 18.1.0
pyreadstat : 1.2.8
pytest : 8.3.4
python-calamine : None
pyxlsb : 1.0.10
s3fs : 2024.10.0
scipy : 1.14.1
sqlalchemy : 2.0.36
tables : 3.10.1
tabulate : 0.9.0
xarray : 2024.11.0
xlrd : 2.0.1
xlsxwriter : 3.2.0
zstandard : 0.23.0
tzdata : 2024.2
qtpy : 2.4.2
pyqt5 : None
