-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: Make sure that sas7bdat parsers memory is initialized to 0 (#21616) #22651
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello @troels! Thanks for submitting the PR.
|
Codecov Report
@@ Coverage Diff @@
## master #22651 +/- ##
=======================================
Coverage 92.17% 92.17%
=======================================
Files 169 169
Lines 50715 50715
=======================================
Hits 46747 46747
Misses 3968 3968
Continue to review full report at Codecov.
|
@@ -183,6 +183,18 @@ def test_date_time(datapath): | |||
tm.assert_frame_equal(df, df0) | |||
|
|||
|
|||
def test_compact_numerical_values(datapath): | |||
# Regression test for #21616 | |||
fname = datapath("io", "sas", "data", "cars.sas7bdat") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using the fixture in this fashion will generate warnings. We just went through an exercise to clean these up in #22515 - can you take a look at that and adjust here accordingly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understood the discussion on the pytest-tracker properly, datapath used this way is in fact not something that will be deprecated.
datapath is a "factory as a fixture" as described here:
https://docs.pytest.org/en/latest/fixture.html#factories-as-fixtures
Which is something distinct from what is beeing talked about here:
pytest-dev/pytest#3661
Have I misunderstood?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And it produces no warnings,as far as I can tell.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This usage is fine I think.
But, do you need to include a binary file in the git repo to reproduce the error on master? I'd like to avoid adding new ones if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am adding a sas7bdat file for testing the parsing of sas7bdat files. How can that be done more sensibly?
0783757
to
c2219c7
Compare
pandas/tests/io/sas/test_sas7bdat.py
Outdated
# The two columns CYL and WGT in cars.sas7bdat have column | ||
# width < 8 and only contains integral values. Test | ||
# that pandas doesn't corrupt the less significant bits. | ||
tm.assert_series_equal(df['WGT'], df['WGT'].round(), check_exact=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm not sure comparison to self as part of the test is the best option here - do we have any way of strengthening the assertion(s) being made?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We aren't comparing to self. We are making sure, that floats with no decimal part in the sas7bdat-file doesn't get a decimal part because of a bug in pandas. That is exactly what this bug is about, after all.
Before:
- Memory was not initialized.
- The decimal part of the float was taken from random uninitialized memory.
- Test fails most of the time.
Now:
- Memory is zeroed out.
- The decimal part of the float which is not read (because it's implicit in the file format, that it should be zero) is therefore still zero in pandas.
- Test succeeds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use
result =
expected =
to make this easier to read
why do we need to .round(), rather than just using check_less_precision
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can also do:
msg = "Expected df['WGT'] to be full of integers"
assert df['WGT'] == df['WGT'].astype('int'), msg
tm.assert_series_equal(df['WGT'], df['WGT'].astype('int'), check_exact=True)
assert all(f.is_integer() for f in df['WGT']), msg
assert all(f - int(f) == 0 for f in df['WGT']), msg
assert all(f == int(f) for f in df['WGT']), msg
or any other way of saying the same thing you would prefer...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomAugspurger any thoughts on this one?
Not sure. Do any of the test files already in the repo have this problem.
We can add it if necessary, just trying to avoid bloating the git
repository more than it already is.
…On Tue, Sep 11, 2018 at 12:49 PM Troels Nielsen ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In pandas/tests/io/sas/test_sas7bdat.py
<#22651 (comment)>:
> @@ -183,6 +183,18 @@ def test_date_time(datapath):
tm.assert_frame_equal(df, df0)
+def test_compact_numerical_values(datapath):
+ # Regression test for #21616
+ fname = datapath("io", "sas", "data", "cars.sas7bdat")
Hi @TomAugspurger <https://github.com/TomAugspurger>
I am adding a sas7bdat file for testing the parsing of sas7bdat files. How
can that be done more sensibly?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#22651 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIiuIKiA5BprHocrjYhmRSSWSt8Wxks5uZ_esgaJpZM4WgZxE>
.
|
No, none of the test-files has the problem. If any of the already existing test-files had numeric columns less than 8 bytes, the tests would probably have been unstable. Usually one wants to avoid binary files in git repositories for two reasons:
A small file like cars.sas7bdat, which will never be edited and to top it off is highly compressible is not going to take very much space at all, neither now nor in the future. |
pandas/tests/io/sas/test_sas7bdat.py
Outdated
# The two columns CYL and WGT in cars.sas7bdat have column | ||
# width < 8 and only contains integral values. Test | ||
# that pandas doesn't corrupt the less significant bits. | ||
tm.assert_series_equal(df['WGT'], df['WGT'].round(), check_exact=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use
result =
expected =
to make this easier to read
why do we need to .round(), rather than just using check_less_precision
?
…as-dev#21616) Memory for numbers in sas7bdat-parsing was not initialized properly to 0. For sas7bdat files with numbers smaller than 8 bytes this made the least significant part of the numbers essentially random. Fix it by initializing memory correctly.
c2219c7
to
796fc43
Compare
Hi @jreback Sure thing, I've added variables and extended the comment a bit. We are comparing the result that the sas7bdat-parser reads from the file with the closest integers. As we know that the numbers are integral they should be exactly equal to their closest integer. When the bug was active the decimal part of the floating point numbers were not read from the file but simply taken from uninitialized memory, and so the WGT and CYL were not actually whole numbers in the pandas dataframe. I originally had a CSV-file containing the correct numbers, but was compelled to remove it because of @WillAyd 's comments. I'll gladly add it back if that's preferable. |
thanks @troels |
Memory for numbers in sas7bdat-parsing was not initialized properly to 0.
For sas7bdat files with numbers smaller than 8 bytes this made the
least significant part of the numbers essentially random.
Fix it by initializing memory correctly.
git diff upstream/master -u -- "*.py" | flake8 --diff