Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
BUG: Fix numpy boolean subtraction error in Series.diff #28251
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Fix numpy boolean subtraction error in Series.diff #28251
Changes from 38 commits
8be97f8
7c982d2
45f9c60
83907a2
6c3875e
ecf1c04
dd8ff98
e123486
269d148
3993206
b610dd9
92075fb
1b90657
182cbc2
3ba0958
47feb30
d425cdf
050b8cb
e028aef
9d947db
b115c9e
865e913
6f8986b
081c2c8
96bb1b4
91ebae5
74a474f
40df511
18a2e35
716a70e
e879dd7
850e315
8d2547a
e6f1890
3231d8f
3e0dbc7
5f101fe
f98cbf2
2127651
a174239
b83135d
b14f7c6
d92d6c7
b96bde8
6a72f2c
030d305
649042f
d1c1755
1e4a7ae
951556e
d1c6b65
8323e71
bbd2550
21defe6
6227564
88cea85
acc8f87
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't agree that this should differ from the previous case. IMO, both of these are boolean dtype. Pandas just doesn't have a nullable boolean dtype (yet) but we generally try to treat a mix of bool and NA the same as bool.
So I think that this should also be object dtype, but the values should be booleans instead of integers. Is that difficult to support? I suppose it'll require a scan of the values for the
object
-dtype case, to ensure that everything is boolean or NA?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, it should not differ. However, when I previously tested my code I did notice a difference, hence why I added the tests as such. If you are sure that pandas does not treat it any different and testing that assumption is not required now and in the future, I will gladly remove the first test.
In regards to your second point, I could definitely change the code to create only a boolean series instead, but why? The same could be done with logic instead of diff. I think the point of diff is to see the difference between the current value and the previous one (e.g. True - False = 1 and False - True = -1). This is also in line with the subtraction done by diff on other non-boolean series and what happens if you subtract booleans in Python. I found this bug looking for ways to detect these changes. Do you not agree or did you mean something different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm having a hard time following, but IMO these two should have the same output
I think both of those should be
Out[3]
, the same result asnp.diff
with the missing values handled appropriately.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I seem to be missing something, when I run np.diff on the Series it results in this:
Which is pretty close to the current pandas implementation. Do we want to remove the first item in the Series (always np.nan) to make it fit better with np.diff? That would mean changing the other code in the diff function as well. Ignoring that, the results are the same as my code. As the dtypes are the problem, we could either convert the Series to always be of the object dtype or do the same numpy does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to make sure, I also tested np.diff on a numpy array with the same results:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we want the same behavior as
np.diff
on a boolean-dtype ndarray.Because of the missing values, we'll always return an object-dtype Series where the values are either NaN, True, or False.
cc @jbrockmendel if you have thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It sounds like this would be a lot easier if we had a BoolNA in place. Is that accurate? Since bool-dtype is a thin wrapper over uint8, I wonder how hard it would be to adapt IntegerNA to this purpose (@WillAyd you had a PR about this a while back, right?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Long time ago. @jreback had something more modern in #25415
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this would be easier with BoolNA.
I think we want to achieve the BoolNA behavior, but with object dtype today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
am -1 on doing anything with object dtype as it makes the code much too complex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this routine is way too long can you parameterize
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback Shall I rewrite all other tests in pandas as well? I did not write the tests the way they are, I only added a few lines... I would rather do this in a refactor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a particular reason why you moved this test? I think would be easier just to create another test specific to bool here and leave the time series stuff where it is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ WillAyd Done.