BUG: multi-index on columns with bool level values does not roundtrip through parquet #60508

Open
@zpincus

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame([[1, 2], [4, 5]], columns=pd.MultiIndex.from_tuples([(True, 'B'), (False, 'C')]))
df.to_parquet('test.parquet', engine='pyarrow')
pd.read_parquet('test.parquet', engine='pyarrow') # fails with ValueError

# now save out with multi-index on index instead of columns:
df.T.to_parquet('test.parquet', engine='pyarrow')
pd.read_parquet('test.parquet', engine='pyarrow') # succeeds

# now save out with int instead of bool level values on the columns:
df = pd.DataFrame([[1, 2], [4, 5]], columns=pd.MultiIndex.from_tuples([(1, 'B'), (0, 'C')]))
df.to_parquet('test.parquet', engine='pyarrow')
pd.read_parquet('test.parquet', engine='pyarrow') # succeeds

Issue Description

Parquet IO is supported for a multi-index on either the index or the columns. However, when a multi-index on the columns contains a level with bool values, the DataFrame can be written with the pyarrow engine but cannot be read back with it. The same multi-index on the index, or int values in place of the bools, round-trips without error.

The traceback I get is below:

Traceback (most recent call last):
  File "<python-input-0>", line 5, in <module>
    pd.read_parquet('test.parquet', engine='pyarrow') # fails
    ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zpincus/Documents/Research/Hexagon/pandas-test/.pixi/envs/default/lib/python3.13/site-packages/pandas/io/parquet.py", line 649, in read_parquet
    return impl.read(
           ~~~~~~~~~^
        path,
        ^^^^^
    ...<6 lines>...
        **kwargs,
        ^^^^^^^^^
    )
    ^
  File "/Users/zpincus/Documents/Research/Hexagon/pandas-test/.pixi/envs/default/lib/python3.13/site-packages/pandas/io/parquet.py", line 270, in read
    result = arrow_table_to_pandas(
        pa_table,
        dtype_backend=dtype_backend,
        to_pandas_kwargs=to_pandas_kwargs,
    )
  File "/Users/zpincus/Documents/Research/Hexagon/pandas-test/.pixi/envs/default/lib/python3.13/site-packages/pandas/io/_util.py", line 86, in arrow_table_to_pandas
    df = table.to_pandas(types_mapper=types_mapper, **to_pandas_kwargs)
  File "pyarrow/array.pxi", line 887, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 5132, in pyarrow.lib.Table._to_pandas
  File "/Users/zpincus/Documents/Research/Hexagon/pandas-test/.pixi/envs/default/lib/python3.13/site-packages/pyarrow/pandas_compat.py", line 790, in table_to_dataframe
    columns = _deserialize_column_index(table, all_columns, column_indexes)
  File "/Users/zpincus/Documents/Research/Hexagon/pandas-test/.pixi/envs/default/lib/python3.13/site-packages/pyarrow/pandas_compat.py", line 928, in _deserialize_column_index
    columns = _reconstruct_columns_from_metadata(columns, column_indexes)
  File "/Users/zpincus/Documents/Research/Hexagon/pandas-test/.pixi/envs/default/lib/python3.13/site-packages/pyarrow/pandas_compat.py", line 1145, in _reconstruct_columns_from_metadata
    return pd.MultiIndex(new_levels, labels, names=columns.names)
           ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zpincus/Documents/Research/Hexagon/pandas-test/.pixi/envs/default/lib/python3.13/site-packages/pandas/core/indexes/multi.py", line 341, in __new__
    new_codes = result._verify_integrity()
  File "/Users/zpincus/Documents/Research/Hexagon/pandas-test/.pixi/envs/default/lib/python3.13/site-packages/pandas/core/indexes/multi.py", line 427, in _verify_integrity
    raise ValueError(
        f"Level values must be unique: {list(level)} on level {i}"
    )
ValueError: Level values must be unique: [True, True] on level 0
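
The `[True, True]` in the error message is consistent with the bool level values being stored as text and then cast back by truthiness, since any non-empty string (including `'False'`) is truthy. This is only a hypothesis based on the traceback; I have not confirmed that this is what pyarrow does internally. A minimal pure-Python sketch of the suspected failure mode, with hypothetical variable names:

```python
# Hypothetical illustration only; this is not pyarrow's actual code path.
original_level = [True, False]

# Assumption: the level values are kept as text in the file metadata.
stored = [str(v) for v in original_level]   # ['True', 'False']

# A truthiness-based cast maps every non-empty string to True,
# so the two distinct values collapse into duplicates.
naive = [bool(s) for s in stored]           # [True, True]

# Parsing the text explicitly keeps the two values distinct.
parsed = [s == "True" for s in stored]      # [True, False]

print(naive, parsed)
```

If this is indeed the cause, duplicated level values like `naive` would trip the uniqueness check in `MultiIndex._verify_integrity` shown at the end of the traceback.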

Note also that fastparquet can neither read nor write such DataFrames. It raises a variety of different errors on read and write with a multi-index, depending on whether the multi-index is on the index or the columns, and whether the index has level names. I (or someone) should probably open separate issues for those.

NB: the issue reproduces in a clean environment with only python, pip, pandas (dev), and pyarrow/fastparquet directly installed.

Expected Behavior

Parquet IO should support bool multi-index levels on columns.
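
Until that works, one possible workaround follows from the example above, where int level values do round-trip: cast bool levels to int before writing and cast them back after reading. The sketch below shows only the level conversion. `encode_bool_levels` and `decode_bool_levels` are hypothetical helper names, not pandas API, and I am assuming the int level dtype survives the parquet round trip.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype


def encode_bool_levels(columns):
    """Cast bool levels of a MultiIndex to int.

    Returns the new MultiIndex and the positions of the levels that were cast.
    """
    positions = [i for i, level in enumerate(columns.levels) if is_bool_dtype(level)]
    for i in positions:
        columns = columns.set_levels(columns.levels[i].astype(int), level=i)
    return columns, positions


def decode_bool_levels(columns, positions):
    """Cast the levels at the given positions back to bool."""
    for i in positions:
        columns = columns.set_levels(columns.levels[i].astype(bool), level=i)
    return columns


df = pd.DataFrame(
    [[1, 2], [4, 5]],
    columns=pd.MultiIndex.from_tuples([(True, "B"), (False, "C")]),
)

encoded, positions = encode_bool_levels(df.columns)
restored = decode_bool_levels(encoded, positions)
```

In use, `df.columns = encoded` would be assigned before `df.to_parquet(...)`, and `decode_bool_levels` applied to the columns of the frame returned by `pd.read_parquet(...)`. The caller has to remember `positions`, since nothing in the file records which levels were originally bool.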

Installed Versions

INSTALLED VERSIONS
------------------
commit                : a36c44e129bd2f70c25d5dec89cb2893716bdbf6
python                : 3.13.1
python-bits           : 64
OS                    : Darwin
OS-release            : 23.6.0
Version               : Darwin Kernel Version 23.6.0: Wed Jul 31 20:50:00 PDT 2024; root:xnu-10063.141.1.700.5~1/RELEASE_ARM64_T6031
machine               : arm64
processor             : arm
byteorder             : little
LC_ALL                : None
LANG                  : en_US.UTF-8
LOCALE                : en_US.UTF-8

pandas                : 3.0.0.dev0+1757.ga36c44e129
numpy                 : 2.1.3
dateutil              : 2.9.0.post0
pip                   : 24.3.1
Cython                : None
sphinx                : None
IPython               : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : None
blosc                 : None
bottleneck            : None
fastparquet           : 2024.11.0
fsspec                : 2024.10.0
html5lib              : None
hypothesis            : None
gcsfs                 : None
jinja2                : None
lxml.etree            : None
matplotlib            : None
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : None
psycopg2              : None
pymysql               : None
pyarrow               : 18.1.0
pyreadstat            : None
pytest                : None
python-calamine       : None
pytz                  : 2024.1
pyxlsb                : None
s3fs                  : None
scipy                 : None
sqlalchemy            : None
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
xlsxwriter            : None
zstandard             : None
tzdata                : 2024.2
qtpy                  : None
pyqt5                 : None
