Slow read_gbq compared to QueryJob.to_dataframe #924

Open
@tulinski

Description

There is a substantial difference in execution time between the pandas_gbq.read_gbq and google.cloud.bigquery.job.query.QueryJob.to_dataframe functions.

The reason is that pandas_gbq.read_gbq calls rows_iter.to_dataframe here with dtypes=conversion_dtypes, which triggers per-column dtype conversions inside RowIterator.to_dataframe.

conversion_dtypes is created this way:
conversion_dtypes = _bqschema_to_nullsafe_dtypes(schema_fields)
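
For context, here is a minimal sketch of what such a null-safe mapping plausibly does (this is not the library's actual code, and the dict-based schema format is a simplification): BigQuery integer columns are mapped to pandas' nullable Int64 extension dtype so that NULL values do not force a lossy float64 upcast.

```python
# Hypothetical sketch in the spirit of pandas-gbq's
# _bqschema_to_nullsafe_dtypes (not the actual implementation).
def bqschema_to_nullsafe_dtypes(schema_fields):
    dtypes = {}
    for field in schema_fields:
        # Repeated (array) columns are left untouched in this sketch.
        if field.get("mode", "NULLABLE") == "REPEATED":
            continue
        # Map BigQuery integer columns to pandas' nullable Int64 dtype,
        # so NULLs need not upcast the column to float64.
        if field["type"] in ("INTEGER", "INT64"):
            dtypes[field["name"]] = "Int64"
    return dtypes

schema_fields = [
    {"name": "user_id", "type": "INTEGER", "mode": "NULLABLE"},
    {"name": "name", "type": "STRING", "mode": "NULLABLE"},
]
print(bqschema_to_nullsafe_dtypes(schema_fields))  # {'user_id': 'Int64'}
```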

Why is this the default behavior?
In our case we don't need these costly column conversions.
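
One workaround, consistent with the title of this issue, is to bypass read_gbq and call QueryJob.to_dataframe via google-cloud-bigquery directly, since that path passes no dtypes= mapping. A hedged sketch; the helper takes the client as a parameter only to make the call shape visible (in real use, pass google.cloud.bigquery.Client()):

```python
# Workaround sketch: fetch results through QueryJob.to_dataframe directly,
# skipping pandas-gbq's null-safe dtype post-processing. `client` is
# expected to be a google.cloud.bigquery.Client instance.
def fetch_dataframe(client, sql):
    # No dtypes= argument here, so columns keep the types produced by
    # the Arrow conversion instead of being converted column by column.
    return client.query(sql).to_dataframe()

# In real use (requires credentials):
#   from google.cloud import bigquery
#   df = fetch_dataframe(bigquery.Client(), "SELECT 1 AS x")
```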

Profiling results

117.538    RowIterator.to_dataframe        google/cloud/bigquery/table.py:2261
    65.551      DataFrame.__setitem__      pandas/core/frame.py:3630
    48.537      RowIterator.to_arrow       google/cloud/bigquery/table.py:2058
    2.302       table_to_dataframe

Column transformations take 55% of RowIterator.to_dataframe's time.
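
A minimal, self-contained illustration (not pandas-gbq's actual code) of why this cost lands in DataFrame.__setitem__: each converted column is materialized as a new array and then assigned back onto the frame.

```python
import numpy as np
import pandas as pd

# Build a frame the way to_dataframe would hand it over, then apply the
# kind of per-column conversion that dtypes=conversion_dtypes requests.
# The assignment back into the frame is the DataFrame.__setitem__ cost
# visible in the profile above.
df = pd.DataFrame({"user_id": np.arange(1_000_000, dtype="int64")})

conversion_dtypes = {"user_id": "Int64"}  # nullable extension dtype
for column, dtype in conversion_dtypes.items():
    # astype materializes a new masked array; df[column] = ... copies it in.
    df[column] = df[column].astype(dtype)

print(df.dtypes["user_id"])  # Int64
```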

Environment details

  • Python version: 3.10.12
  • pandas-gbq version: 0.28.0

Metadata

Labels

  • api: bigquery — Issues related to the googleapis/python-bigquery-pandas API.
  • priority: p3 — Desirable enhancement or fix. May not be included in next release.
