Closed
Description
Opening a new issue so this isn't lost.
In #18882 banned duplicate names in a MultiIndex. I think this is a good change since allowing duplicates hit a lot of edge cases when you went to actually do something. I want to make sure we understand all the cases that actually produce duplicate names in the MI though, specifically groupby.apply.
In [1]: import dask.dataframe as dd
In [2]: import pandas as pd
In [3]: pdf = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
...: 'b': [4, 5, 6, 3, 2, 1, 0, 0, 0]},
...: index=[0, 1, 3, 5, 6, 8, 9, 9, 9]).set_index("a")
...:
...:
In [4]: pdf.groupby(pdf.index).apply(lambda x: x.b)
Another, more realistic example: groupwise drop_duplicates:
In [18]: df = pd.DataFrame({"B": [0, 0, 0, 1, 1, 1, 2, 2, 2]}, index=pd.Index([0, 1, 1, 2, 2, 2, 0, 0, 1], name='a'))
In [19]: df
Out[19]:
B
a
0 0
1 0
1 0
2 1
2 1
2 1
0 2
0 2
1 2
In [20]: df.groupby('a').apply(pd.DataFrame.drop_duplicates)
Out[20]:
B
a a
0 0 0
0 2
1 1 0
1 2
2 2 1
Is it possible to throw a warning on this for now, in case duplicate names are more common than we thought?