We use pandas-gbq a lot for our daily analyses. It is well known that pandas' memory consumption can be a pain; see e.g. https://www.dataquest.io/blog/pandas-big-data/

I have started to write a patch, which could be integrated as an enhancement to `read_gbq` (rough idea, details TBD):
- Provide a boolean `optimize_memory` option.
- If `True`, the source table is first inspected with a query that collects the min, max, and presence of nulls for INTEGER columns, and the percentage of unique values for STRING columns.
- When calling `to_dataframe`, this information is passed to the `dtypes` option, downcasting integers to the appropriate numpy (u)int type and converting strings to the pandas `category` type below some uniqueness threshold (e.g. fewer than 50% unique values).
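To make the idea concrete, here is a minimal sketch of the dtype-selection logic described above. The helper names (`pick_int_dtype`, `pick_string_dtype`) are hypothetical, not part of pandas-gbq; the resulting mapping would then be passed as the `dtypes` argument to `to_dataframe`:

```python
import numpy as np

def pick_int_dtype(col_min, col_max, has_nulls):
    """Pick the smallest numpy integer dtype that can hold [col_min, col_max].

    Columns containing nulls fall back to float64, since plain numpy
    integers cannot represent NaN. (Hypothetical helper, not pandas-gbq API.)
    """
    if has_nulls:
        return np.float64
    if col_min >= 0:
        # Non-negative range: try unsigned types, smallest first.
        for dt in (np.uint8, np.uint16, np.uint32, np.uint64):
            if col_max <= np.iinfo(dt).max:
                return dt
    for dt in (np.int8, np.int16, np.int32, np.int64):
        if np.iinfo(dt).min <= col_min and col_max <= np.iinfo(dt).max:
            return dt
    return np.int64

def pick_string_dtype(unique_fraction, threshold=0.5):
    """Use pandas 'category' when fewer than `threshold` of values are unique."""
    return "category" if unique_fraction < threshold else object
```

For example, an INTEGER column with values in [0, 200] and no nulls would be stored as `uint8` (1 byte per value instead of 8), and a STRING column where only 10% of values are unique would become a `category` column.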
I already have a working monkey-patch, though it is still a bit rough. If there is enough interest, I'd happily make it more robust and submit a PR. This would be my first significant contribution to an open-source project, so some help and feedback would be appreciated.

Curious to hear your views on this.