DataFrame.set_index when setting a duplicate name now raises #30965

Open
@TomAugspurger

As part of #30588, we now raise when trying to create a 2D index. This introduces a behavior change when you call DataFrame.set_index with a column label that appears more than once.

Code Sample, a copy-pastable example if possible

In [1]: import pandas as pd

In [2]: df = pd.DataFrame([[1, 2, 3]], columns=['a', 'a', 'b'])

In [3]: result = df.set_index('a')

On pandas 0.25.3, that gives back a DataFrame with a broken Index. Some DataFrame operations will work, but even things like printing the repr will fail.

# 0.25.3
In [17]: type(result)
Out[17]: pandas.core.frame.DataFrame

In [18]: result.shape
Out[18]: (1, 1)

With 1.0.0rc0, that raises:

~/sandbox/pandas/pandas/core/indexes/numeric.py in __new__(cls, data, dtype, copy, name)
     76         if subarr.ndim > 1:
     77             # GH#13601, GH#20285, GH#27125
---> 78             raise ValueError("Index data must be 1-dimensional")
     79
     80         name = maybe_extract_name(name, data, cls)

ValueError: Index data must be 1-dimensional
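
For context, here is a small sketch of why the index data ends up 2-dimensional: selecting a label that occurs more than once returns a DataFrame rather than a Series. The variable names (`dup_mask`, `duplicate_labels`, `selected`) are illustrative only, and this is not a claim about what `set_index` does internally in any particular release.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=["a", "a", "b"])

# Index.duplicated(keep=False) flags every occurrence of a repeated label.
dup_mask = df.columns.duplicated(keep=False)
duplicate_labels = df.columns[dup_mask].unique().tolist()
print(duplicate_labels)

# Selecting a duplicated label yields a 2D object (one column per occurrence),
# which is the kind of data an Index cannot be built from.
selected = df["a"]
print(type(selected).__name__, selected.shape)
```

Checking `df.columns.is_unique` before calling `set_index` is one way to catch this case early.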

Problem description

The old output is clearly broken, so I wouldn't consider this a (major) regression. And I don't think people should be doing this in the first place. But I wanted to ask, should DataFrame.set_index(scalar) return a MultiIndex when scalar is a duplicate label?
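
To make the question concrete, this is a rough sketch of what a MultiIndex result could look like if it were built by hand today. It is only an illustration, not proposed behavior: the level names `a_0` and `a_1` are hypothetical (the issue does not say how the levels would be named), and the columns are selected by position to sidestep the duplicate label.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=["a", "a", "b"])

# Positions of every column carrying the duplicated label "a".
positions = [i for i, label in enumerate(df.columns) if label == "a"]

# Hypothetical level names, one per occurrence of the label.
level_names = [f"a_{i}" for i in range(len(positions))]

# One index level per occurrence of the duplicated column.
index = pd.MultiIndex.from_arrays(
    [df.iloc[:, pos].to_numpy() for pos in positions],
    names=level_names,
)

# Keep the remaining columns by position and attach the new index.
keep = [i for i in range(df.shape[1]) if i not in positions]
result = df.iloc[:, keep].copy()
result.index = index
print(result)
```

The resulting frame has the same (1, 1) shape reported for 0.25.3, but with a valid two-level index.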
