Skip to content

Reading from Excel mangles columns #10523

Closed
@atheyjohnc

Description

@atheyjohnc

As far as I can tell, both methods of reading an Excel sheet (through ExcelFile and read_excel) do not have the option to avoid mangling duplicate columns, which exists for read_csv (#3468).

I have an Excel file that looks like this:

  foo  foo  bar  bar  baz  baz
    A    B    A    B    A    B
  123  456  789   12  345  678
  901  234  567  890  123  456
  789   12  345  678  901  234

I initially encountered this problem while trying to find a workaround for not being able to specify a multirow-header (#4679). Reading in this Excel file with header = 0 (default) mangles the column names:

>>> df = pd.read_excel("dupe_cols.xlsx", header = 0)
>>> print df
   foo foo.1  bar bar.1  baz baz.1
0    A     B    A     B    A     B
1  123   456  789    12  345   678
2  901   234  567   890  123   456
3  789    12  345   678  901   234

Unlike read_csv, read_excel does not have the option to avoid mangling duplicate columns (using ExcelFile.parse works the same as far as I can see). Specifying header = None in read_excel and then assigning the column names to the first row will effectively allow you to avoid mangling the column names. You could also read in the Excel sheet with header = None, save a .csv with header = False and index = False, then read that csv and specify mangle_dupe_cols = False to get the dataframe you want. (Incidentally, this also allows you to specify multiple rows as the header, which is the behavior I was originally trying to emulate.)

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions