Skip to content

BUG: ExcelFile.parse() skiprows arg doesn't play nice #4903

Closed
@nehalecky

Description

@nehalecky

related/dup #4340

Was building on this example on SO, and found a bug in ExcelFile().parse option skiprows, when passed an index.

In [2]: xls = pd.ExcelFile('example.xlsx')
In [3]: df = xls.parse(xls.sheet_names[0])
In [4]: print df
              Bill Date  Meter #  Type  SIC Code  Billing Days  Rate Code
0   2011-10-01 00:00:00  1892213     E      8111            29      ALTOU
1   2011-11-01 00:00:00  1892213     E      8111            29      ALTOU
2                   NaN      NaN   NaN       NaN           NaN        NaN
3                   NaN      NaN   NaN       NaN           NaN        NaN
4             Bill Date  Meter #  Type  SIC Code  Billing Days  Rate Code
5   2011-10-01 00:00:00   553961     G      8111            29        GN3
6   2011-11-01 00:00:00   553961     G      8111            29        GN3
7                   NaN      NaN   NaN       NaN           NaN        NaN
8             Bill Date  Meter #  Type  SIC Code  Billing Days  Rate Code
9   2011-10-01 00:00:00  6322158     E                      29          A
10  2011-11-01 00:00:00  6322158     E                      29          A

Wanting to avoid the repeated headers and empty rows, I parse with skiprows, but no dice (no difference with previously parsed sheet):

In [5]: skip_idx = np.array([2,3,4,7,8])
In [6]: df = xls.parse(xls.sheet_names[0], skiprows=skip_idx+1)
In [8]: df
Out[8]: 
              Bill Date  Meter #  Type  SIC Code  Billing Days  Rate Code
0   2011-10-01 00:00:00  1892213     E      8111            29      ALTOU
1   2011-11-01 00:00:00  1892213     E      8111            29      ALTOU
2                   NaN      NaN   NaN       NaN           NaN        NaN
3                   NaN      NaN   NaN       NaN           NaN        NaN
4             Bill Date  Meter #  Type  SIC Code  Billing Days  Rate Code
5   2011-10-01 00:00:00   553961     G      8111            29        GN3
6   2011-11-01 00:00:00   553961     G      8111            29        GN3
7                   NaN      NaN   NaN       NaN           NaN        NaN
8             Bill Date  Meter #  Type  SIC Code  Billing Days  Rate Code
9   2011-10-01 00:00:00  6322158     E                      29          A
10  2011-11-01 00:00:00  6322158     E                      29          A

Meanwhile, our little friend pd.read_csv(), doesn't seem to have the same hangups. Creating a csv directly from the original .xlsx and performing the same operations:

In [9]: df = pd.read_csv('example.csv')
In [10]: df
Out[10]: 
    Bill Date  Meter #  Type  SIC Code  Billing Days  Rate Code
0      Oct-11  1892213     E      8111            29      ALTOU
1      Nov-11  1892213     E      8111            29      ALTOU
2         NaN      NaN   NaN       NaN           NaN        NaN
3         NaN      NaN   NaN       NaN           NaN        NaN
4   Bill Date  Meter #  Type  SIC Code  Billing Days  Rate Code
5      Oct-11   553961     G      8111            29        GN3
6      Nov-11   553961     G      8111            29        GN3
7         NaN      NaN   NaN       NaN           NaN        NaN
8   Bill Date  Meter #  Type  SIC Code  Billing Days  Rate Code
9      Oct-11  6322158     E                      29          A
10     Nov-11  6322158     E                      29          A

Looks fine, and now passing the skiprows option, and I get a correctly parsed df with all the guilty lines missing:

In [11]: df = pd.read_csv('example.csv', skiprows=skip_id+1)
In [12]: df
Out[12]: 
  Bill Date  Meter # Type SIC Code  Billing Days Rate Code
0    Oct-11  1892213    E     8111            29     ALTOU
1    Nov-11  1892213    E     8111            29     ALTOU
2    Oct-11   553961    G     8111            29       GN3
3    Nov-11   553961    G     8111            29       GN3
4    Oct-11  6322158    E                     29         A
5    Nov-11  6322158    E                     29         A

Started tracing for troubleshooting, but after the 5th handoff of the skiprows parameter all over in .io, I gave up. :(

Thoughts?

Metadata

Metadata

Assignees

No one assigned

    Labels

    BugIO DataIO issues that don't fit into a more specific labelIO Excelread_excel, to_excel

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions