Description
It seems to me we should either fix ExcelFile.parse or deprecate it entirely, and I lean toward the latter. pandas originally started out with just ExcelFile but now has the top-level read_excel. The signatures started the same, but read_excel has since gained and modified parameters that have not been added or changed in ExcelFile.parse. For example:
- ExcelFile.parse lacks a dtype parameter
- ExcelFile.parse has a **kwds argument that is passed on to pandas internals with no documentation on what can be included. Invalid arguments are just ignored (e.g. BUG: xl.parse index_col ignoring skiprows #50953)
It appears to me that pd.ExcelFile(...).parse(...) offers no advantage over pd.read_excel(pd.ExcelFile(...)), and so rather than fixing parse we can deprecate it and make it internal.
Edit: I no longer think deprecating ExcelFile entirely, as mentioned below, is a good option. See #58247 (comment).
Another option is to deprecate ExcelFile entirely. The one thing ExcelFile still provides that isn't available elsewhere is a way to get the underlying book or sheet_names without reading the entire file.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((100, 100)))
with pd.ExcelWriter("test.xlsx") as writer:
    for e in range(10):
        df.to_excel(writer, sheet_name=str(e))
%timeit pd.ExcelFile("test.xlsx").sheet_names
# 14.1 ms ± 76 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit pd.read_excel("test.xlsx", sheet_name=None)
# 411 ms ± 2.07 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
One can somewhat work around this by using nrows, but it's clunky.
%timeit pd.read_excel("test.xlsx", sheet_name=None, nrows=0).keys()
# 57.3 ms ± 257 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
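For context, a rough sketch of the pattern that sheet_names makes possible: inspect the sheet names first, then read only the sheets of interest. The sheet names and the filter below are hypothetical, the workbook is built in memory for the example, and an Excel engine such as openpyxl is assumed to be installed.

```python
import io

import pandas as pd

# Write a three-sheet workbook to an in-memory buffer (illustrative data only).
buffer = io.BytesIO()
with pd.ExcelWriter(buffer) as writer:
    for name in ["first", "second", "third"]:
        pd.DataFrame({"x": [1, 2]}).to_excel(writer, sheet_name=name, index=False)
buffer.seek(0)

with pd.ExcelFile(buffer) as xl:
    # Sheet names are available without parsing any sheet into a DataFrame.
    names = xl.sheet_names
    # Hypothetical filter: skip one sheet, read the rest.
    wanted = [n for n in names if n != "second"]
    # A list of sheet names returns a dict mapping sheet name to DataFrame.
    frames = pd.read_excel(xl, sheet_name=wanted)

print(names)
print(list(frames))
```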