ENH: Having pandas.read_excel FASTER (with an available proof of concept) #47290

Closed
@Sanix-Darker

Is your feature request related to a problem?

I wish pandas read_excel could be faster.

Describe the solution you'd like

pandas.read_excel should get faster if it uses the engines' row iterators. For these tests, I added a new parameter, offset, that is passed all the way down to the engines' iterator functions.
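To make the idea concrete, here is a minimal sketch of what an iterator-level offset could look like. It is not the code from the proof of concept: `read_window` and `fake_engine_rows` are hypothetical names, and a plain generator stands in for an engine's row iterator. The point is only that the header is read first, the next `offset` rows are stepped over lazily, and iteration stops once `nrows` rows have been collected.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def read_window(rows: Iterable[List[object]], offset: int, nrows: int) -> List[List[object]]:
    """Return the header row plus `nrows` data rows starting `offset` rows after it.

    Hypothetical helper: it only illustrates slicing a row iterator lazily.
    """
    it = iter(rows)
    header = next(it)  # keep the column names
    # islice steps over `offset` rows and stops after `nrows` more,
    # so the remaining rows are never pulled from the iterator.
    body = list(islice(it, offset, offset + nrows))
    return [header] + body


def fake_engine_rows(total: int) -> Iterator[List[object]]:
    """Stand-in for an engine's row iterator (assumption, not a pandas API)."""
    yield ["id", "value"]
    for i in range(total):
        yield [i, i * i]


window = read_window(fake_engine_rows(5000), offset=100, nrows=3)
print(window)
```

With a real engine the rows before the offset still have to be walked, but they would not need to be materialized or parsed into a DataFrame.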

API breaking implications

It should not break the current API.

Describe alternatives you've considered

For now, this is the only way I have found.

Additional context

Basically, the offset keeps the column names and can be used like this:
[Screenshot from 2022-06-08 17-37-43]

LIVE DEMO

[Screen recording: Peek 2022-06-08 02-57]

BENCHMARKS

The benchmarks were run with:

import pandas as pd
from timeit import default_timer


def bench_mark_func():
    for ext in ["xls", "xlsx"]:
        print(f"\n[{ext}] old way, no nrows, nor skiprows :")
        start = default_timer()
        for i in range(100):
            df_xls = pd.read_excel(f"benchmark_5000.{ext}")
        print(f"[{ext}] done in {default_timer() - start}")
        print("*" * 30)

        print(f"\n[{ext}] actual way, with nrows and skiprows :")
        start = default_timer()
        for i in range(100):
            df_xls = pd.read_excel(
                # Python has no ++ operator: ++i is unary plus twice, i.e. just i.
                f"benchmark_5000.{ext}", nrows=50 + i, skiprows=100 * i
            )
        print(f"[{ext}] done in {default_timer() - start}")
        print("*" * 30)

        print(f"\n[{ext}] new way, with nrows and offset :")
        start = default_timer()
        for i in range(100):
            df_xls = pd.read_excel(
                # offset is the new parameter from the proof of concept; ++i is just i.
                f"benchmark_5000.{ext}", nrows=50 + i, offset=100 * i
            )
        print(f"[{ext}] done in {default_timer() - start}")
        print("*" * 30)

        print("==" * 50)


if __name__ == "__main__":
    bench_mark_func()
OUTPUT
===============================================================================================
[xls] old way, no nrows, nor skiprows :
[xls] done in 7.315936979022808
******************************

[xls] actual way, with nrows and skiprows :
[xls] done in 6.850192359997891
******************************

[xls] new way, with nrows and offset :
[xls] done in 5.909526361967437
******************************
===============================================================================================

[xlsx] old way, no nrows, nor skiprows :
[xlsx] done in 39.742386338009965
******************************

[xlsx] actual way, with nrows and skiprows :
[xlsx] done in 31.731780430010986
******************************

[xlsx] new way, with nrows and offset :
[xlsx] done in 25.8184904170339
******************************
==============================================================================================
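To put the timings above in relative terms, the short script below computes how much time the offset variant saves compared with the two other runs. The numbers are copied from the output above; the helper name `relative_saving` is only for illustration. Note that the "old way" run reads the whole file, so it is not the same workload as the other two.

```python
# Timings in seconds for 100 calls each, copied from the benchmark output above.
timings = {
    "xls": {
        "full": 7.315936979022808,
        "skiprows": 6.850192359997891,
        "offset": 5.909526361967437,
    },
    "xlsx": {
        "full": 39.742386338009965,
        "skiprows": 31.731780430010986,
        "offset": 25.8184904170339,
    },
}


def relative_saving(baseline: float, candidate: float) -> float:
    """Fraction of the baseline time saved by the candidate."""
    return (baseline - candidate) / baseline


savings = {
    ext: {
        name: relative_saving(t[name], t["offset"])
        for name in ("full", "skiprows")
    }
    for ext, t in timings.items()
}

for ext, per_baseline in savings.items():
    for name, saving in per_baseline.items():
        print(f"[{ext}] offset vs {name}: {saving:.1%} less time")
```

On these figures the offset run takes roughly 14% less time than the nrows + skiprows run for xls and roughly 19% less for xlsx.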

I made a proof of concept for this change, available here: https://github.com/Sanix-Darker/pandas/pull/1
