Closed
Description
related/dup #4340
Was building on this example on SO, and found a bug in ExcelFile().parse
option skiprows
, when passed an index.
In [2]: xls = pd.ExcelFile('example.xlsx')
In [3]: df = xls.parse(xls.sheet_names[0])
In [4]: print df
Bill Date Meter # Type SIC Code Billing Days Rate Code
0 2011-10-01 00:00:00 1892213 E 8111 29 ALTOU
1 2011-11-01 00:00:00 1892213 E 8111 29 ALTOU
2 NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN
4 Bill Date Meter # Type SIC Code Billing Days Rate Code
5 2011-10-01 00:00:00 553961 G 8111 29 GN3
6 2011-11-01 00:00:00 553961 G 8111 29 GN3
7 NaN NaN NaN NaN NaN NaN
8 Bill Date Meter # Type SIC Code Billing Days Rate Code
9 2011-10-01 00:00:00 6322158 E 29 A
10 2011-11-01 00:00:00 6322158 E 29 A
Wanting to avoid the repeated headers and empty rows, I parse with skiprows
, but no dice (no difference with previously parsed sheet):
In [5]: skip_idx = np.array([2,3,4,7,8])
In [6]: df = xls.parse(xls.sheet_names[0], skiprows=skip_idx+1)
In [8]: df
Out[8]:
Bill Date Meter # Type SIC Code Billing Days Rate Code
0 2011-10-01 00:00:00 1892213 E 8111 29 ALTOU
1 2011-11-01 00:00:00 1892213 E 8111 29 ALTOU
2 NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN
4 Bill Date Meter # Type SIC Code Billing Days Rate Code
5 2011-10-01 00:00:00 553961 G 8111 29 GN3
6 2011-11-01 00:00:00 553961 G 8111 29 GN3
7 NaN NaN NaN NaN NaN NaN
8 Bill Date Meter # Type SIC Code Billing Days Rate Code
9 2011-10-01 00:00:00 6322158 E 29 A
10 2011-11-01 00:00:00 6322158 E 29 A
Meanwhile, our little friend pd.read_csv()
, doesn't seem to have the same hangups. Creating a csv directly from the original .xlsx
and performing the same operations:
In [9]: df = pd.read_csv('example.csv')
In [10]: df
Out[10]:
Bill Date Meter # Type SIC Code Billing Days Rate Code
0 Oct-11 1892213 E 8111 29 ALTOU
1 Nov-11 1892213 E 8111 29 ALTOU
2 NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN
4 Bill Date Meter # Type SIC Code Billing Days Rate Code
5 Oct-11 553961 G 8111 29 GN3
6 Nov-11 553961 G 8111 29 GN3
7 NaN NaN NaN NaN NaN NaN
8 Bill Date Meter # Type SIC Code Billing Days Rate Code
9 Oct-11 6322158 E 29 A
10 Nov-11 6322158 E 29 A
Looks fine, and now passing the skiprows
option, and I get a correctly parsed df
with all the guilty lines missing:
In [11]: df = pd.read_csv('example.csv', skiprows=skip_id+1)
In [12]: df
Out[12]:
Bill Date Meter # Type SIC Code Billing Days Rate Code
0 Oct-11 1892213 E 8111 29 ALTOU
1 Nov-11 1892213 E 8111 29 ALTOU
2 Oct-11 553961 G 8111 29 GN3
3 Nov-11 553961 G 8111 29 GN3
4 Oct-11 6322158 E 29 A
5 Nov-11 6322158 E 29 A
Started tracing for troubleshooting, but after the 5th handoff of the skiprows
parameter all over in .io
, I gave up. :(
Thoughts?