We use pandas-gbq a lot for our daily analyses. It is well known that pandas' memory consumption can be a pain; see e.g. https://www.dataquest.io/blog/pandas-big-data/

I have started to write a patch, which could be integrated as an enhancement to `read_gbq` (rough idea, details TBD):
- Provide a boolean `optimize_memory` option.
- If `True`, the source table is first inspected with a query that collects the min, max, and presence of nulls for INTEGER columns, and the percentage of unique values for STRING columns.
- When calling `to_dataframe`, this information is passed to the `dtypes` option, downcasting integers to the appropriate numpy (u)int type and converting strings to the pandas `category` type below some uniqueness threshold (e.g. fewer than 50% unique values).
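To make the idea concrete, here is a minimal sketch of the dtype-selection logic described above. The helper names (`pick_int_dtype`, `pick_string_dtype`) are hypothetical, not part of pandas-gbq; the resulting mapping would then be passed as the `dtypes` argument to `to_dataframe`:

```python
import numpy as np

def pick_int_dtype(col_min, col_max, has_nulls):
    """Pick the smallest numpy integer dtype that can hold [col_min, col_max].

    Columns containing nulls fall back to float64, since plain numpy
    integers cannot represent NaN. (Hypothetical helper, not pandas-gbq API.)
    """
    if has_nulls:
        return np.float64
    if col_min >= 0:
        # Non-negative range: try unsigned types, smallest first.
        for dt in (np.uint8, np.uint16, np.uint32, np.uint64):
            if col_max <= np.iinfo(dt).max:
                return dt
    for dt in (np.int8, np.int16, np.int32, np.int64):
        if np.iinfo(dt).min <= col_min and col_max <= np.iinfo(dt).max:
            return dt
    return np.int64

def pick_string_dtype(unique_fraction, threshold=0.5):
    """Use pandas 'category' when fewer than `threshold` of values are unique."""
    return "category" if unique_fraction < threshold else object
```

For example, an INTEGER column with values in [0, 200] and no nulls would be stored as `uint8` (1 byte per value instead of 8), and a STRING column where only 10% of values are unique would become a `category` column.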
I already have a working monkey-patch, though it is still a bit rough. If there is enough interest, I'd happily make it more robust and submit a PR. This would be my first significant contribution to an open-source project, so some help and feedback would be appreciated.

Curious to hear your views on this.