
Use general float format when writing to CSV buffer to prevent numerical overflow #193


Merged: 6 commits merged into googleapis:master on Jul 26, 2018

Conversation

@anthonydelage (Contributor) commented Jul 23, 2018

Summary

Based on issue #192.

The load.py module's encode_chunk() function writes to a local CSV buffer using pandas' to_csv() function, which has a known issue of adding spurious significant figures on some operating systems. This can cause numerical overflow when the data is loaded into BigQuery.

This PR adds the float_format='%.15g' parameter to the to_csv() call.
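
For illustration, here is a minimal sketch of the resulting call (csv_buffer, dataframe, and the column values are placeholders, not the exact code in load.py):

    from io import StringIO

    import pandas as pd

    dataframe = pd.DataFrame({"open": [1.05148], "close": [1.05153]})
    csv_buffer = StringIO()
    # Cap floats at 15 significant digits so the spurious digits that
    # to_csv() adds on some platforms never reach BigQuery.
    dataframe.to_csv(csv_buffer, index=False, header=False, float_format="%.15g")
    print(csv_buffer.getvalue())  # 1.05148,1.05153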

Details

This section provides a bit more detail behind the %.15g float format.

As per the Python 3 string format documentation, the g (general) format behaves as follows:

For a given precision p >= 1, this rounds the number to p significant digits and then formats the result in either fixed-point format or in scientific notation, depending on its magnitude.

BigQuery stores floats as IEEE-754 doubles. 15 is the maximum number of significant decimal digits that is guaranteed to survive the round trip through a double unchanged (according to Wikipedia):

If a decimal string with at most 15 significant digits is converted to IEEE 754 double-precision representation, and then converted back to a decimal string with the same number of digits, the final result should match the original string.

Hence, we choose the general format with at most 15 significant decimal digits. This keeps rounding loss from pandas to BigQuery to a minimum and does not increase storage requirements, since BigQuery stores floats as IEEE-754 doubles anyway.
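
As a quick illustration of that guarantee (a plain interpreter session, not part of the PR):

    # A decimal string with at most 15 significant digits survives the
    # string -> float64 -> string round trip when formatted with %.15g.
    s = "1.05153"
    assert "%.15g" % float(s) == s

    # Asking for more significant digits exposes the binary approximation,
    # which is the kind of noise to_csv() was emitting on some platforms.
    print("%.17g" % float(s))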

@max-sixty (Contributor)

Thanks a lot @anthonydelage

Could you add a test? You could use the case you put in the issue

The error in the current test is unrelated.

@anthonydelage (Contributor Author)

@max-sixty I added a test based on an example provided here.

One issue I noticed while writing the test, and that I'm not sure how to handle, is numbers that will overflow in BigQuery, e.g. 1e20 or 1e-20. With or without the proposed float format update, such values are written to the CSV buffer in exponential notation, which fails on load to BigQuery because it is interpreted as a string. These are values that should either fail or be transformed to +Inf in BigQuery anyway, but users won't be able to debug them easily from their load logs, because the failure shows up as a data type mismatch instead of a numeric overflow.
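
For reference, here is the notation in question (plain Python, nothing specific to this PR):

    # Magnitudes outside the fixed-point range of %g come out in
    # exponential notation, which the CSV load path then treats as text.
    print("%.15g" % 1e20)   # 1e+20
    print("%.15g" % 1e-20)  # 1e-20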


An inline review comment was left on this snippet from the new test:

    See: https://github.com/pydata/pandas-gbq/issues/192
    """
    input_csv = StringIO('01/01/17 23:00,1.05148,1.05153,1.05148,1.05153,4')

For Py2 to pass, you need StringIO(u'...') so that unicode is passed into StringIO (you can see the test failure in Travis).
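
That is, the quoted line would become something like:

    input_csv = StringIO(u'01/01/17 23:00,1.05148,1.05153,1.05148,1.05153,4')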

@max-sixty (Contributor)

Yes, good point. Do we get an overflow in BQ but not pandas because pandas can handle larger numbers?

(though we can deal with that separately, no need to delay this PR I think)

@anthonydelage (Contributor Author)

I think both systems theoretically use IEEE-754 (pandas uses numpy's float64 under the hood), but somehow the binary -> decimal -> binary conversion flow is causing issues.

Ideally, we'd have one of the following:

  • Pandas send the data to BigQuery without the CSV intermediary, maintaining its binary format
  • BigQuery learns how to interpret numbers in exponential notation as floats

@tswast (Collaborator) commented Jul 24, 2018

Pandas send the data to BigQuery without the CSV intermediary, maintaining its binary format

In google-cloud-bigquery we chose to serialize to Parquet. Eventually I'd like to move this library over to using load_table_from_dataframe() but I've hesitated because it means (a) an additional dependency on pyarrow and (b) there might be some subtle behavior differences (I know there are for read_gbq(), but haven't tested with to_gbq()).
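
For context, a rough sketch of that alternative with a recent google-cloud-bigquery (the project and table names are placeholders, dataframe is assumed to be an existing pandas.DataFrame, and pandas-gbq does not work this way today):

    # Parquet-based load via google-cloud-bigquery; requires pyarrow.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # placeholder project
    table_ref = bigquery.TableReference.from_string("my-project.my_dataset.my_table")
    job = client.load_table_from_dataframe(dataframe, table_ref)
    job.result()  # block until the load job completes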

@max-sixty (Contributor)

This looks good!

@anthonydelage do you want to add a note & your name to the whatsnew?

@anthonydelage (Contributor Author)

@max-sixty are you referring to the changelog.rst file?

@max-sixty (Contributor)

Yes!

And you can add "by @anthonydelage" if you like. You can start a 0.5.1 (Unreleased) section.

@anthonydelage (Contributor Author)

Done!

@max-sixty merged commit 993fe55 into googleapis:master on Jul 26, 2018
@max-sixty (Contributor)

Thanks @anthonydelage !
