ENH: option efficient memory use by downcasting #275

Closed
@dkapitan

Description

We use pandas-gbq a lot for our daily analyses. Memory consumption is a known pain point when working with larger datasets in pandas, see e.g. https://www.dataquest.io/blog/pandas-big-data/

I have started writing a patch that could be integrated as an enhancement for read_gbq (rough idea, details TBD):

  • Provide a boolean optimize_memory option
  • If True, the source table is inspected with a query to obtain the min, max and presence of nulls for INTEGER columns, and the percentage of unique values for STRING columns
  • When calling to_dataframe, this information is passed via the dtypes option, downcasting integers to the narrowest suitable numpy (u)int type and converting string columns to the pandas category type below some uniqueness threshold (e.g. fewer than 50% unique values); see the sketch after this list
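
Roughly, the inspection step could look like the sketch below. This is a minimal, hypothetical example using the google-cloud-bigquery client directly; build_downcast_dtypes, the table id and the 50% category threshold are illustrative and not existing pandas-gbq API.

```python
import numpy as np
from google.cloud import bigquery


def build_downcast_dtypes(client, table_id, category_threshold=0.5):
    """Return a {column: dtype} mapping for INTEGER and STRING columns (hypothetical helper)."""
    table = client.get_table(table_id)
    dtypes = {}
    for field in table.schema:
        if field.field_type == "INTEGER":
            # Check value range and null presence to pick the narrowest int type.
            row = list(client.query(
                f"SELECT MIN({field.name}) AS lo, MAX({field.name}) AS hi, "
                f"COUNTIF({field.name} IS NULL) AS nulls FROM `{table_id}`"
            ).result())[0]
            if row.nulls or row.lo is None:
                continue  # nulls force float64 with plain numpy dtypes; leave column as-is
            for candidate in (np.int8, np.int16, np.int32, np.int64):
                info = np.iinfo(candidate)
                if info.min <= row.lo and row.hi <= info.max:
                    dtypes[field.name] = candidate
                    break
        elif field.field_type == "STRING":
            # Convert to category when the share of unique values is below the threshold.
            row = list(client.query(
                f"SELECT APPROX_COUNT_DISTINCT({field.name}) AS uniq, COUNT(*) AS total "
                f"FROM `{table_id}`"
            ).result())[0]
            if row.total and row.uniq / row.total < category_threshold:
                dtypes[field.name] = "category"
    return dtypes


# Usage sketch: build the mapping, then hand it to to_dataframe.
client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"  # illustrative table id
dtypes = build_downcast_dtypes(client, table_id)
df = client.list_rows(table_id).to_dataframe(dtypes=dtypes)
```

In an actual integration the per-column stats would presumably be collected in a single query and the resulting mapping applied inside read_gbq, but the idea is the same.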

I already have a working monkey-patch, though it is still a bit rough. If there is enough interest I'd be happy to make it more robust and submit a PR. This would be my first significant contribution to an open source project, so some help and feedback would be appreciated.

Curious to hear your views on this.

Labels: accepting pull requests, api: bigquery, type: feature request
