Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
df = pd.DataFrame([[1, 2], [4, 5]], columns=pd.MultiIndex.from_tuples([(True, 'B'), (False, 'C')]))
df.to_parquet('test.parquet', engine='pyarrow')
pd.read_parquet('test.parquet', engine='pyarrow') # fails
# now save out with multi-index on index instead of columns:
df.T.to_parquet('test.parquet', engine='pyarrow')
pd.read_parquet('test.parquet', engine='pyarrow') # succeeds
# now save out with int instead of bool index:
df = pd.DataFrame([[1, 2], [4, 5]], columns=pd.MultiIndex.from_tuples([(1, 'B'), (0, 'C')]))
df.to_parquet('test.parquet', engine='pyarrow')
pd.read_parquet('test.parquet', engine='pyarrow') # succeeds
Issue Description
Parquet IO with a multi-index on the index or on the columns is supported. However, if the multi-index contains a level of bools and that multi-index is on the columns, then the parquet file can be written with the pyarrow engine but cannot be read back in using pyarrow.
The traceback I get is below:
Traceback (most recent call last):
File "<python-input-0>", line 5, in <module>
pd.read_parquet('test.parquet', engine='pyarrow') # fails
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/zpincus/Documents/Research/Hexagon/pandas-test/.pixi/envs/default/lib/python3.13/site-packages/pandas/io/parquet.py", line 649, in read_parquet
return impl.read(
~~~~~~~~~^
path,
^^^^^
...<6 lines>...
**kwargs,
^^^^^^^^^
)
^
File "/Users/zpincus/Documents/Research/Hexagon/pandas-test/.pixi/envs/default/lib/python3.13/site-packages/pandas/io/parquet.py", line 270, in read
result = arrow_table_to_pandas(
pa_table,
dtype_backend=dtype_backend,
to_pandas_kwargs=to_pandas_kwargs,
)
File "/Users/zpincus/Documents/Research/Hexagon/pandas-test/.pixi/envs/default/lib/python3.13/site-packages/pandas/io/_util.py", line 86, in arrow_table_to_pandas
df = table.to_pandas(types_mapper=types_mapper, **to_pandas_kwargs)
File "pyarrow/array.pxi", line 887, in pyarrow.lib._PandasConvertible.to_pandas
File "pyarrow/table.pxi", line 5132, in pyarrow.lib.Table._to_pandas
File "/Users/zpincus/Documents/Research/Hexagon/pandas-test/.pixi/envs/default/lib/python3.13/site-packages/pyarrow/pandas_compat.py", line 790, in table_to_dataframe
columns = _deserialize_column_index(table, all_columns, column_indexes)
File "/Users/zpincus/Documents/Research/Hexagon/pandas-test/.pixi/envs/default/lib/python3.13/site-packages/pyarrow/pandas_compat.py", line 928, in _deserialize_column_index
columns = _reconstruct_columns_from_metadata(columns, column_indexes)
File "/Users/zpincus/Documents/Research/Hexagon/pandas-test/.pixi/envs/default/lib/python3.13/site-packages/pyarrow/pandas_compat.py", line 1145, in _reconstruct_columns_from_metadata
return pd.MultiIndex(new_levels, labels, names=columns.names)
~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/zpincus/Documents/Research/Hexagon/pandas-test/.pixi/envs/default/lib/python3.13/site-packages/pandas/core/indexes/multi.py", line 341, in __new__
new_codes = result._verify_integrity()
File "/Users/zpincus/Documents/Research/Hexagon/pandas-test/.pixi/envs/default/lib/python3.13/site-packages/pandas/core/indexes/multi.py", line 427, in _verify_integrity
raise ValueError(
f"Level values must be unique: {list(level)} on level {i}"
)
ValueError: Level values must be unique: [True, True] on level 0
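To make the last frame of the traceback easier to reason about, here is a minimal sketch that reproduces the same ValueError in pure pandas, without any parquet IO. The cause shown is an assumption on my part, not something I have confirmed in the pyarrow source: if the column labels are stored as the strings "True" and "False" and then cast back to bool, every non-empty string becomes True, which would produce exactly the duplicated level [True, True] reported above.

```python
import numpy as np
import pandas as pd

# ASSUMPTION: the bool labels come back as strings and are then cast to bool.
# This is a guess at the mechanism, not taken from the pyarrow source.
labels = np.array(["True", "False"], dtype=object)

# Casting object strings to bool tests truthiness, so "False" becomes True too.
as_bool = labels.astype(bool)

message = ""
try:
    # Building a MultiIndex from a level with duplicate values fails the
    # integrity check, as in the final frame of the traceback.
    pd.MultiIndex(levels=[as_bool, ["B", "C"]], codes=[[0, 1], [0, 1]])
except ValueError as exc:
    message = str(exc)

print(as_bool)
print(message)
```

If this guess is right, the int case succeeds because casting the strings "1" and "0" back to int keeps the two labels distinct.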
Further note that the fastparquet engine can neither read nor write such dataframes. There is a panoply of different errors on read/write of a multi-index with fastparquet, depending on whether the multi-index is on the index or on the columns, and whether the index has level names or not. I (or someone) should probably open separate bugs for those...
NB: the issue reproduces in a clean environment with only python, pip, pandas (dev), and pyarrow/fastparquet directly installed.
Expected Behavior
Parquet IO should support bool multi-index levels on columns.
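Until that works, a possible stopgap (a sketch only, relying on the observation above that int levels round-trip correctly) is to cast bool column levels to int before writing and back to bool after reading. The helper names encode_bool_levels and decode_bool_levels are hypothetical, not part of pandas or pyarrow.

```python
import pandas as pd


def encode_bool_levels(columns: pd.MultiIndex):
    """Hypothetical helper: cast bool levels to int64 and report their positions."""
    bool_positions = [
        i for i, level in enumerate(columns.levels)
        if level.inferred_type == "boolean"
    ]
    encoded = columns
    for i in bool_positions:
        encoded = encoded.set_levels(encoded.levels[i].astype("int64"), level=i)
    return encoded, bool_positions


def decode_bool_levels(columns: pd.MultiIndex, bool_positions):
    """Hypothetical helper: cast the recorded levels back to bool."""
    decoded = columns
    for i in bool_positions:
        decoded = decoded.set_levels(decoded.levels[i].astype(bool), level=i)
    return decoded


columns = pd.MultiIndex.from_tuples([(True, "B"), (False, "C")])
encoded, positions = encode_bool_levels(columns)
restored = decode_bool_levels(encoded, positions)
```

In use, one would assign encoded to df.columns before calling to_parquet, then apply decode_bool_levels to the columns of the frame returned by read_parquet. The positions list has to be kept by the caller, since this sketch does not store it in the file.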
Installed Versions
INSTALLED VERSIONS
------------------
commit : a36c44e129bd2f70c25d5dec89cb2893716bdbf6
python : 3.13.1
python-bits : 64
OS : Darwin
OS-release : 23.6.0
Version : Darwin Kernel Version 23.6.0: Wed Jul 31 20:50:00 PDT 2024; root:xnu-10063.141.1.700.5~1/RELEASE_ARM64_T6031
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 3.0.0.dev0+1757.ga36c44e129
numpy : 2.1.3
dateutil : 2.9.0.post0
pip : 24.3.1
Cython : None
sphinx : None
IPython : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
blosc : None
bottleneck : None
fastparquet : 2024.11.0
fsspec : 2024.10.0
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : None
lxml.etree : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
psycopg2 : None
pymysql : None
pyarrow : 18.1.0
pyreadstat : None
pytest : None
python-calamine : None
pytz : 2024.1
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2024.2
qtpy : None
pyqt5 : None