
Use general float format when writing to CSV buffer to prevent numerical overflow #193


Merged: 6 commits merged into googleapis:master on Jul 26, 2018

Conversation

@anthonydelage (Contributor) commented Jul 23, 2018

Summary

Based on issue #192.

The load.py module's encode_chunk() function writes to a local CSV buffer using pandas' to_csv() function, which has a known issue of adding spurious significant figures on some operating systems. This can cause numerical overflow when the data is loaded into BigQuery.

This PR adds the float_format='%.15g' parameter to the to_csv() call.
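
For illustration, here is a minimal sketch of the resulting call (csv_buffer, dataframe, and the column values are placeholders, not the exact code in load.py):

    from io import StringIO

    import pandas as pd

    dataframe = pd.DataFrame({"open": [1.05148], "close": [1.05153]})
    csv_buffer = StringIO()
    # Cap floats at 15 significant digits so the spurious digits that
    # to_csv() adds on some platforms never reach BigQuery.
    dataframe.to_csv(csv_buffer, index=False, header=False, float_format="%.15g")
    print(csv_buffer.getvalue())  # 1.05148,1.05153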

Details

This section provides a bit more detail behind the %.15g float format.

As per the Python 3 string format documentation, the g (general) format behaves as follows:

For a given precision p >= 1, this rounds the number to p significant digits and then formats the result in either fixed-point format or in scientific notation, depending on its magnitude.

BigQuery stores floats as IEEE-754 doubles. 15 is the maximum number of significant decimal digits that is guaranteed to survive the round trip through a double unchanged (according to Wikipedia):

If a decimal string with at most 15 significant digits is converted to IEEE 754 double-precision representation, and then converted back to a decimal string with the same number of digits, the final result should match the original string.

Hence, we choose the general format with at most 15 significant decimal digits. This keeps rounding loss from pandas to BigQuery to a minimum and does not increase storage requirements, since BigQuery stores floats as IEEE-754 doubles anyway.
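
As a quick illustration of that guarantee (a plain interpreter session, not part of the PR):

    # A decimal string with at most 15 significant digits survives the
    # string -> float64 -> string round trip when formatted with %.15g.
    s = "1.05153"
    assert "%.15g" % float(s) == s

    # Asking for more significant digits exposes the binary approximation,
    # which is the kind of noise to_csv() was emitting on some platforms.
    print("%.17g" % float(s))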

@max-sixty (Contributor)

Thanks a lot @anthonydelage

Could you add a test? You could use the case you put in the issue

The error in the current test is unrelated.

@anthonydelage (Contributor Author)

@max-sixty I added a test based on an example provided here.

One issue I noticed while writing the test, and that I'm not sure how to handle, is numbers that will overflow in BigQuery, e.g. 1e20 or 1e-20. With or without the proposed float format update, such values are written to the CSV buffer in exponential notation, which fails on load to BigQuery because it is interpreted as a string. These are values that should either fail or be transformed to +Inf in BigQuery anyway, but users won't be able to debug them easily from their load logs, because the failure shows up as a data type mismatch instead of a numeric overflow.
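
For reference, here is the notation in question (plain Python, nothing specific to this PR):

    # Magnitudes outside the fixed-point range of %g come out in
    # exponential notation, which the CSV load path then treats as text.
    print("%.15g" % 1e20)   # 1e+20
    print("%.15g" % 1e-20)  # 1e-20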


An inline review comment was left on this snippet from the new test:

    See: https://github.com/pydata/pandas-gbq/issues/192
    """
    input_csv = StringIO('01/01/17 23:00,1.05148,1.05153,1.05148,1.05153,4')

For Py2 to pass, you need StringIO(u'...') so that unicode is passed into StringIO (you can see the test failure in Travis).
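
That is, the quoted line would become something like:

    input_csv = StringIO(u'01/01/17 23:00,1.05148,1.05153,1.05148,1.05153,4')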

@max-sixty (Contributor)

Yes, good point. Do we get an overflow in BQ but not pandas because pandas can handle larger numbers?

(though we can deal with that separately, no need to delay this PR I think)

@anthonydelage (Contributor Author)

I think both systems theoretically use IEEE-754 (pandas uses numpy's float64 under the hood), but somehow the binary -> decimal -> binary conversion flow is causing issues.

Ideally, we'd have one of the following:

  • Pandas send the data to BigQuery without the CSV intermediary, maintaining its binary format
  • BigQuery learns how to interpret numbers in exponential notation as floats

@tswast (Collaborator) commented Jul 24, 2018

Pandas send the data to BigQuery without the CSV intermediary, maintaining its binary format

In google-cloud-bigquery we chose to serialize to Parquet. Eventually I'd like to move this library over to using load_table_from_dataframe() but I've hesitated because it means (a) an additional dependency on pyarrow and (b) there might be some subtle behavior differences (I know there are for read_gbq(), but haven't tested with to_gbq()).
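
For context, a rough sketch of that alternative with a recent google-cloud-bigquery (the project and table names are placeholders, dataframe is assumed to be an existing pandas.DataFrame, and pandas-gbq does not work this way today):

    # Parquet-based load via google-cloud-bigquery; requires pyarrow.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # placeholder project
    table_ref = bigquery.TableReference.from_string("my-project.my_dataset.my_table")
    job = client.load_table_from_dataframe(dataframe, table_ref)
    job.result()  # block until the load job completes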

@max-sixty (Contributor)

This looks good!

@anthonydelage do you want to add a note & your name to the whatsnew?

@anthonydelage (Contributor Author)

@max-sixty are you referring to the changelog.rst file?

@max-sixty (Contributor)

Yes!

And you can add "by @anthonydelage" if you like. You can start a 0.5.1 (Unreleased) section.

@anthonydelage (Contributor Author)

Done!

@max-sixty merged commit 993fe55 into googleapis:master on Jul 26, 2018
@max-sixty (Contributor)

Thanks @anthonydelage !
