Description
Is your feature request related to a problem?
I wish pandas read_excel could be faster.
Describe the solution you'd like
pandas.read_excel should get faster if it uses the engines' iterators. For these tests I added a new parameter, offset, which is passed all the way down to the engines' iterator functions.
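To make the idea concrete, here is a minimal sketch of an iterator-based window read. It is not the code from the proof of concept: `read_window` and `fake_sheet` are hypothetical names, and the generator only stands in for an engine's lazy row iterator. It assumes that offset counts data rows after the header.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Row = List[object]


def read_window(rows: Iterable[Row], offset: int, nrows: int) -> Tuple[Row, List[Row]]:
    """Return (header, data) for a window of `nrows` rows starting at `offset`.

    `rows` stands in for an engine's lazy row iterator (an assumption,
    not the actual pandas internals).
    """
    it = iter(rows)
    header = next(it)  # the first row holds the column names
    # islice steps over `offset` data rows without storing them,
    # then yields at most `nrows` rows and stops pulling from `it`.
    data = list(islice(it, offset, offset + nrows))
    return header, data


def fake_sheet(n_rows: int, counter: List[int]) -> Iterator[Row]:
    """Hypothetical sheet: yields a header, then counts each data row produced."""
    yield ["id", "value"]
    for i in range(n_rows):
        counter[0] += 1
        yield [i, i * 10]


produced = [0]
header, data = read_window(fake_sheet(5000, produced), offset=100, nrows=3)
print(header)       # ['id', 'value']
print(data)         # [[100, 1000], [101, 1010], [102, 1020]]
print(produced[0])  # 103 data rows produced out of 5000
```

The skipped rows are still stepped over, but they are never collected, and nothing after the window is read at all. That is where the saving would come from if the engines expose their rows lazily.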
API breaking implications
It should not break the existing API.
Describe alternatives you've considered
For now, this is the only way I have found.
Additional context
Basically, offset keeps the column names and can be used like this:
LIVE DEMO
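For comparison, the existing API can approximate the same window while keeping the header row, by passing a list of row indices to skiprows instead of an integer. The sketch below only builds the keyword arguments. `window_kwargs` is a hypothetical helper, and it assumes offset counts data rows after the header, which the proposal does not spell out.

```python
from typing import Any, Dict


def window_kwargs(offset: int, nrows: int) -> Dict[str, Any]:
    """Build read_excel keyword arguments for a header-preserving window.

    Row 0 is the header, so skipping rows 1..offset keeps the column
    names and drops the first `offset` data rows.
    """
    return {"skiprows": list(range(1, offset + 1)), "nrows": nrows}


kwargs = window_kwargs(offset=3, nrows=50)
print(kwargs)  # {'skiprows': [1, 2, 3], 'nrows': 50}
```

These would be passed as `pd.read_excel("benchmark_5000.xlsx", **window_kwargs(100, 50))`. An integer skiprows, as used in the benchmark, also skips the header row, so the two calls are not equivalent.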
BENCHMARKS
The benchmarks were run with:
import pandas as pd
from timeit import default_timer


def bench_mark_func():
    for ext in ["xls", "xlsx"]:
        # Baseline: read the whole sheet on every iteration.
        print(f"\n[{ext}] old way, no nrows, nor skiprows :")
        start = default_timer()
        for i in range(100):
            df_xls = pd.read_excel(f"benchmark_5000.{ext}")
        print(f"[{ext}] done in {default_timer() - start}")
        print("*" * 30)

        # Existing API: limit the rows with nrows and skip rows with skiprows.
        # The original wrote 100 * (++i); in Python ++i is unary plus applied
        # twice, not an increment, so the value is simply 100 * i.
        print(f"\n[{ext}] actual way, with nrows and skiprows :")
        start = default_timer()
        for i in range(100):
            df_xls = pd.read_excel(
                f"benchmark_5000.{ext}", nrows=50 + i, skiprows=100 * i
            )
        print(f"[{ext}] done in {default_timer() - start}")
        print("*" * 30)

        # Proposed API: offset only exists in the proof-of-concept branch.
        print(f"\n[{ext}] new way, with nrows and offset :")
        start = default_timer()
        for i in range(100):
            df_xls = pd.read_excel(
                f"benchmark_5000.{ext}", nrows=50 + i, offset=100 * i
            )
        print(f"[{ext}] done in {default_timer() - start}")
        print("*" * 30)
        print("==" * 50)


if __name__ == "__main__":
    bench_mark_func()
OUTPUT
===============================================================================================
[xls] old way, no nrows, nor skiprows :
[xls] done in 7.315936979022808
******************************
[xls] actual way, with nrows and skiprows :
[xls] done in 6.850192359997891
******************************
[xls] new way, with nrows and offset :
[xls] done in 5.909526361967437
******************************
===============================================================================================
[xlsx] old way, no nrows, nor skiprows :
[xlsx] done in 39.742386338009965
******************************
[xlsx] actual way, with nrows and skiprows :
[xlsx] done in 31.731780430010986
******************************
[xlsx] new way, with nrows and offset :
[xlsx] done in 25.8184904170339
******************************
==============================================================================================
I made a proof of concept for this change, available here: https://github.com/Sanix-Darker/pandas/pull/1