Description
As far as I can tell, both methods of reading an Excel sheet (through ExcelFile and read_excel) do not have the option to avoid mangling duplicate columns, which exists for read_csv (#3468).
I have an Excel file that looks like this:
foo foo bar bar baz baz
A B A B A B
123 456 789 12 345 678
901 234 567 890 123 456
789 12 345 678 901 234
I initially encountered this problem while trying to find a workaround for not being able to specify a multirow-header (#4679). Reading in this Excel file with header = 0 (default) mangles the column names:
>>> df = pd.read_excel("dupe_cols.xlsx", header = 0)
>>> print df
foo foo.1 bar bar.1 baz baz.1
0 A B A B A B
1 123 456 789 12 345 678
2 901 234 567 890 123 456
3 789 12 345 678 901 234
Unlike read_csv
, read_excel
does not have the option to avoid mangling duplicate columns (using ExcelFile.parse works the same as far as I can see). Specifying header = None in read_excel
and then assigning the column names to the first row will effectively allow you to avoid mangling the column names. You could also read in the Excel sheet with header = None, save a .csv with header = False and index = False, then read that csv and specify mangle_dupe_cols = False to get the dataframe you want. (Incidentally, this also allows you to specify multiple rows as the header, which is the behavior I was originally trying to emulate.)