@@ -3400,83 +3400,79 @@ Google BigQuery (Experimental)
The :mod:`pandas.io.gbq` module provides a wrapper for Google's BigQuery
analytics web service to simplify retrieving results from BigQuery tables
using SQL-like queries. Result sets are parsed into a pandas
- DataFrame with a shape derived from the source table. Additionally,
- DataFrames can be uploaded into BigQuery datasets as tables
- if the source datatypes are compatible with BigQuery ones.
+ DataFrame with a shape and data types derived from the source table.
+ Additionally, DataFrames can be appended to existing BigQuery tables if
+ the destination table is the same shape as the DataFrame.

For specifics on the service itself, see `here <https://developers.google.com/bigquery/>`__

- As an example, suppose you want to load all data from an existing table
- :`test_dataset.test_table`
- into BigQuery and pull it into a DataFrame.
+ As an example, suppose you want to load all data from an existing BigQuery
+ table :`test_dataset.test_table` into a DataFrame using the :func:`~pandas.io.read_gbq`
+ function.

.. code-block:: python
-
-    from pandas.io import gbq
-
    # Insert your BigQuery Project ID Here
-    # Can be found in the web console, or
-    # using the command line tool `bq ls`
+    # Can be found in the Google web console
    projectid = "xxxxxxxx"

-    data_frame = gbq.read_gbq('SELECT * FROM test_dataset.test_table', project_id = projectid)
+    data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table', project_id = projectid)

- The user will then be authenticated by the `bq` command line client -
- this usually involves the default browser opening to a login page,
- though the process can be done entirely from command line if necessary.
- Datasets and additional parameters can be either configured with `bq`,
- passed in as options to `read_gbq`, or set using Google's gflags (this
- is not officially supported by this module, though care was taken
- to ensure that they should be followed regardless of how you call the
- method).
+ You will then be authenticated to the specified BigQuery account
+ via Google's OAuth2 mechanism. In general, this is as simple as following the
+ prompts in a browser window which will be opened for you. Should the browser not
+ be available, or fail to launch, a code will be provided to complete the process
+ manually. Additional information on the authentication mechanism can be found
+ `here <https://developers.google.com/accounts/docs/OAuth2#clientside/>`__

- Additionally, you can define which column to use as an index as well as a preferred column order as follows:
+ You can define which column from BigQuery to use as an index in the
+ destination DataFrame as well as a preferred column order as follows:

.. code-block:: python

-    data_frame = gbq.read_gbq('SELECT * FROM test_dataset.test_table',
+    data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table',
                               index_col = 'index_column_name',
-                              col_order = '[col1, col2, col3,...]', project_id = projectid)
-
- Finally, if you would like to create a BigQuery table, `my_dataset.my_table`, from the rows of DataFrame, `df`:
+                              col_order = ['col1', 'col2', 'col3'], project_id = projectid)
+
+ Finally, you can append data to a BigQuery table from a pandas DataFrame
+ using the :func:`~pandas.io.to_gbq` function. This function uses the
+ Google streaming API which requires that your destination table exists in
+ BigQuery. Given the BigQuery table already exists, your DataFrame should
+ match the destination table in column order, structure, and data types.
+ DataFrame indexes are not supported. By default, rows are streamed to
+ BigQuery in chunks of 10,000 rows, but you can pass other chunk values
+ via the ``chunksize`` argument. You can also see the progress of your
+ post via the ``verbose`` flag which defaults to ``True``. The HTTP
+ response code of Google BigQuery can be successful (200) even if the
+ append failed. For this reason, if there is a failure to append to the
+ table, the complete error response from BigQuery is returned which
+ can be quite long given it provides a status for each row. You may want
+ to start with smaller chunks to test that the size and types of your
+ DataFrame match your destination table to make debugging simpler.

.. code-block:: python

    df = pandas.DataFrame({'string_col_name': ['hello'],
                           'integer_col_name': [1],
                           'boolean_col_name': [True]})
-    schema = ['STRING', 'INTEGER', 'BOOLEAN']
-    data_frame = gbq.to_gbq(df, 'my_dataset.my_table',
-                            if_exists = 'fail', schema = schema, project_id = projectid)
-
- To add more rows to this, simply:
-
- .. code-block:: python
-
-    df2 = pandas.DataFrame({'string_col_name': ['hello2'],
-                            'integer_col_name': [2],
-                            'boolean_col_name': [False]})
-    data_frame = gbq.to_gbq(df2, 'my_dataset.my_table', if_exists = 'append', project_id = projectid)
+    df.to_gbq('my_dataset.my_table', project_id = projectid)
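+
+ For example, to stream smaller batches and silence the per-chunk progress
+ messages, you might pass the ``chunksize`` and ``verbose`` arguments
+ described above (an illustrative sketch, not part of the original example;
+ the table name and project id reuse the placeholders from above):
+
+ .. code-block:: python
+
+    # stream 500 rows per request instead of the default 10,000,
+    # and suppress progress output while posting
+    df.to_gbq('my_dataset.my_table', project_id = projectid,
+              chunksize = 500, verbose = False)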

- .. note::
+ The BigQuery SQL query language has some oddities, see `here <https://developers.google.com/bigquery/query-reference>`__

-    A default project id can be set using the command line:
-    `bq init`.
+ While BigQuery uses SQL-like syntax, it has some important differences
+ from traditional databases both in functionality, API limitations (size and
+ quantity of queries or uploads), and how Google charges for use of the service.
+ You should refer to Google documentation often as the service seems to
+ be changing and evolving. BigQuery is best for analyzing large sets of
+ data quickly, but it is not a direct replacement for a transactional database.

-    There is a hard cap on BigQuery result sets, at 128MB compressed. Also, the BigQuery SQL query language has some oddities,
-    see `here <https://developers.google.com/bigquery/query-reference>`__
-
-    You can access the management console to determine project id's by:
-    <https://code.google.com/apis/console/b/0/?noredirect>
+ You can access the management console to determine project id's by:
+ <https://code.google.com/apis/console/b/0/?noredirect>

.. warning::

-    To use this module, you will need a BigQuery account. See
-    <https://cloud.google.com/products/big-query> for details.
-
-    As of 1/28/14, a known bug is present that could possibly cause data duplication in the resultant dataframe. A fix is imminent,
-    but any client changes will not make it into 0.13.1. See:
-    http://stackoverflow.com/questions/20984592/bigquery-results-not-including-page-token/21009144?noredirect=1#comment32090677_21009144
+    To use this module, you will need a valid BigQuery account. See
+    <https://cloud.google.com/products/big-query> for details on the
+    service.

.. _io.stata: