-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: Enable Series.equals to compare numpy arrays to scalars #36161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 6 commits
9bd42bf
c84a39d
8b26aec
6367f9a
e94db6d
dc67138
61a58a9
993cd93
98c580f
3b1d788
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -581,6 +581,13 @@ def array_equivalent_object(left: object[:], right: object[:]) -> bool: | |
return False | ||
elif (x is C_NA) ^ (y is C_NA): | ||
return False | ||
elif cnp.PyArray_IsAnyScalar(x) != cnp.PyArray_IsAnyScalar(y): | ||
# Only compare scalars to scalars and non-scalars to non-scalars | ||
return False | ||
elif (not (cnp.PyArray_IsPythonScalar(x) or cnp.PyArray_IsPythonScalar(y)) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. are you using There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't believe I am. My reasoning for the code is as follows: The first condition checks if both values are of different types, i.e. scalar vs non-scalar comparison. If they are different then fail. If both sides are scalars then the current logic in Pandas can correctly handle the comparison. The code still fails though if both sides have a different non-scalar type (e.g. dict vs list comparison). Hence the check to see if we are dealing with non scalars ( There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can we not just wrap a try/except around the RichCompareBool? and handle there if there are odd types this will avoid the perf hit. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Put check in a try/except clause. frame_methods.Equals never has a significant perf decrease now (<1.1). Some test from index_cached_properties.IndexCache are sometimes a little bit higher than 1.1 (e.g. 1.11-1.16) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ok let's go with this (add a comment as well) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Great. I've added the comment to the try/except clause. |
||
and not (isinstance(x, type(y)) or isinstance(y, type(x)))): | ||
# Check if non-scalars have the same type | ||
return False | ||
elif not (PyObject_RichCompareBool(x, y, Py_EQ) or | ||
(x is None or is_nan(x)) and (y is None or is_nan(y))): | ||
return False | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -24,16 +24,18 @@ def test_equals(arr, idx): | |
assert not s1.equals(s2) | ||
|
||
|
||
def test_equals_list_array(): | ||
@pytest.mark.parametrize( | ||
"val", [1, 1.1, 1 + 1j, True, "abc", [1, 2], (1, 2), {1, 2}, {"a": 1}, None] | ||
) | ||
def test_equals_list_array(val): | ||
# GH20676 Verify equals operator for list of Numpy arrays | ||
arr = np.array([1, 2]) | ||
s1 = Series([arr, arr]) | ||
s2 = s1.copy() | ||
assert s1.equals(s2) | ||
|
||
# TODO: Series equals should also work between single value and list | ||
# s1[1] = 9 | ||
# assert not s1.equals(s2) | ||
s1[1] = val | ||
assert not s1.equals(s2) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you add all of the missing cases also we have a bunch of tests for array_equivalent, these need a battery of tests There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. parameterized values in test_equals and added tests for array_equivalent with a Series |
||
|
||
|
||
def test_equals_false_negative(): | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you put these comments indented inside the conditions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think id rather use our
is_scalar
thancnp.PyArray_IsAnyScalar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I change it to
is_scalar
one unit test fails and the perf decreases even more. It is almost twice as slow as usingcnp.PyArray_IsAnyScalar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put comments indented inside the conditons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
im pre-caffeine so bear with me: what does PyArray_IsAnyScalar include? or more specifically, what does it not include that is_scalar does?
what breaks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PyArray_IsAnyScalar
does not include None, datetime.datetime, datetime.timedelta, Period, decimal.Decimal, Fraction, numbers.Number, Interval DateOffset, whileis_scalar
does.If I see this I think it would be good to use
is_scalar
for consistency. But the drawback is the worse perf, because under the hoodis_scalar
makes use ofPyArray_IsAnyScalar
and some other functions.Nvm the failed unit test, I see that it is a broken test on master that appeared after I merged master (
pandas/tests/io/test_parquet.y::test_s3_roundtrip_for_dir)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel how do we want to continue with this? Shall I replace
PyArray_IsAnyScalar
withis_scalar
, or leave it as it is?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC
not is_listlike(obj)
is more general and more performant thanis_scalar(obj)
. would that work in this context?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I replace
PyArray_IsAnyScalar
withis_listlike
and update the code like below, the perf becomes (much) worse. Will try the try/except option jreback proposed.