Skip to content

PERF: lazify IO imports #52421

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 7 commits into from
Closed

Conversation

jbrockmendel
Copy link
Member

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

@MarcoGorelli
Copy link
Member

generally in favour - do you have an estimate of how much this helps?

@jbrockmendel
Copy link
Member Author

do you have an estimate of how much this helps?

The measurements with python -X importtime -c "import pandas as pd" are extremely noisy. I eyeball it at roughly 511ms in main 472ms in this branch.

The next big chunk available would be to delay core.groupby import, which is 86ms. Doing that requires some yak-shaving bc core.window imports it, so that needs to be made lazy (and docstrings moved xref #51823). Also dask references pd.core.groupby.groupby.something_i_forget_what at import-time to do a docstring pinning, so we'd need to do something about that.

Following that is some smaller-but-easier stuff: stdlib imports like shutil and stuff in pd.compat.compressors that are relatively expensive, dateutil (xref #52659).

The big ones are pyarrow (87ms) and numpy (86ms). making numpy import lazy is probably not worth the effort (though asking numpy to lazify parts of its own imports might be worthwhile). I'd like to see the pyarrow import made lazy, especially if we don't make it required.

# error: Cannot determine type of 'index_col'
kwds["allow_leading_cols"] = (
self.index_col is not False # type: ignore[has-type]
)
kwds["allow_leading_cols"] = self.index_col is not False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does lazifying imports change how things are type-checked here?

is it that it than mypy can't follow imports, and so we end up with a lot more Anys?

If so, then there is a tradeoff to consider...slightly faster import pandas, but fewer type hints whilst developing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does lazifying imports change how things are type-checked here?

hmm i have no idea. cc @simonjayhawkins ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any idea @Dr-Irv?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is some mypy funkyness.

If you insert the line reveal_type(self.index_col) in main, you get:

pandas\io\parsers\c_parser_wrapper.py:68: error: Cannot determine type of "index_col"  [has-type]
pandas\io\parsers\c_parser_wrapper.py:68: note: Revealed type is "Any"

With this PR, the first message about "Cannot determine type" goes away. But it still sees the type of self.index_col as Any. So I guess this is a good thing

@jbrockmendel
Copy link
Member Author

No clear interest here, closing.

@jbrockmendel jbrockmendel deleted the perf-import-2 branch May 3, 2023 17:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants