Use general float format when writing to CSV buffer to prevent numerical overload #193
Conversation
Thanks a lot @anthonydelage! Could you add a test? You could use the case you put in the issue. The error in the current test is unrelated.
@max-imlian I added a test based on an example provided here. One issue I did notice while writing the test, and that I'm not sure how to handle, is numbers that will overflow in BigQuery, e.g.
tests/unit/test_load.py (Outdated)

```python
    See: https://github.com/pydata/pandas-gbq/issues/192
    """
    input_csv = StringIO('01/01/17 23:00,1.05148,1.05153,1.05148,1.05153,4')
```
For Py2 to pass, you need to use StringIO(u'...') to pass unicode into StringIO (you can see the test failure in Travis).
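A minimal sketch of the suggested fix, reusing the sample row from the test snippet above:

```python
from io import StringIO

# On Python 2, io.StringIO accepts only unicode text, so the literal
# needs a u'' prefix; on Python 3 the prefix is a harmless no-op.
input_csv = StringIO(u'01/01/17 23:00,1.05148,1.05153,1.05148,1.05153,4')
print(input_csv.read())
```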
Yes, good point. Do we get an overflow in BQ but not pandas because pandas can handle larger numbers? (Though we can deal with that separately; no need to delay this PR, I think.)
I think both systems theoretically use IEEE-754 (pandas uses numpy's float64 under the hood), but somehow the binary -> decimal -> binary conversion flow is causing issues. Ideally, we'd have one of the following:
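Both sides do use IEEE-754 doubles; the relevant guarantee is directional. A decimal string with up to 15 significant digits always survives the decimal -> float64 -> decimal round trip, whereas recovering an arbitrary double exactly needs 17 digits. A quick illustration (the first two values come from the test row above; the third is made up for illustration):

```python
# Any decimal literal with <= 15 significant digits survives the
# decimal -> float64 -> decimal round trip unchanged; 17 digits
# would be needed to round-trip in the opposite direction.
for s in ['1.05148', '1.05153', '0.12345678901235']:
    assert '%.15g' % float(s) == s
print('round trip preserved')
```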
This looks good! @anthonydelage do you want to add a note & your name to the whatsnew?
@max-imlian are you referring to the changelog.rst file?
Yes! And you can add "by @anthonydelage" if you like. You can start a 0.5.1 (Unreleased) section.
Done!
Thanks @anthonydelage!
Summary
Based on issue #192.
The `load.py` module's `encode_chunk()` function writes to a local CSV buffer using pandas' `to_csv()` function, which has a known issue of adding extra significant figures on some operating systems. This can cause numerical overflow when loading the data into BigQuery. This PR adds the `float_format='%.15g'` parameter to the `to_csv()` call.

Details
This section provides a bit more detail behind the `%.15g` float format. As per the Python 3 string format documentation, the `g` (general) format rounds the value to the given number of significant digits, strips trailing zeros, and switches to scientific notation when the exponent is large or small enough. BigQuery stores floats as IEEE-754 doubles, and 15 is the maximum number of significant decimal digits guaranteed to survive the round trip through a double unchanged (according to Wikipedia), so values formatted this way cannot overflow when loaded into BigQuery.
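A short demonstration of the general format's behavior (the values here are chosen purely for illustration):

```python
# '%.15g' keeps at most 15 significant digits, strips trailing
# zeros, and falls back to scientific notation for extreme exponents.
print('%.15g' % 1.05153)     # -> 1.05153
print('%.15g' % 100.0)       # -> 100
print('%.15g' % 1e20)        # -> 1e+20
print('%.15g' % 0.0000001)   # -> 1e-07
```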
Hence, we choose the general format with at most 15 significant digits of precision. It minimizes rounding loss from pandas to BigQuery and does not increase storage requirements, since floats are stored as IEEE-754 doubles anyway.
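The effect of the change can be sketched directly with pandas (the column names and values below are hypothetical; the PR itself passes `float_format` inside `encode_chunk()`):

```python
import pandas as pd
from io import StringIO

# Hypothetical one-row frame; the PR's actual change is passing
# float_format='%.15g' to DataFrame.to_csv when writing the buffer.
df = pd.DataFrame({'bid': [1.05148], 'ask': [1.05153]})
buf = StringIO()
df.to_csv(buf, index=False, float_format='%.15g')
print(buf.getvalue())   # -> bid,ask / 1.05148,1.05153
```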