
BUG: Enable Series.equals to compare numpy arrays to scalars #36161


Merged · 10 commits · Sep 19, 2020
Changes from 6 commits
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.2.0.rst
@@ -247,6 +247,7 @@ Numeric
^^^^^^^
- Bug in :func:`to_numeric` where float precision was incorrect (:issue:`31364`)
- Bug in :meth:`DataFrame.any` with ``axis=1`` and ``bool_only=True`` ignoring the ``bool_only`` keyword (:issue:`32432`)
- Bug in :meth:`Series.equals` where a ``ValueError`` was raised when numpy arrays were compared to scalars (:issue:`35267`)
-

Conversion
7 changes: 7 additions & 0 deletions pandas/_libs/lib.pyx
@@ -581,6 +581,13 @@ def array_equivalent_object(left: object[:], right: object[:]) -> bool:
            return False
        elif (x is C_NA) ^ (y is C_NA):
            return False
        elif cnp.PyArray_IsAnyScalar(x) != cnp.PyArray_IsAnyScalar(y):
Member:

can you put these comments indented inside the conditions

Member:

i think id rather use our is_scalar than cnp.PyArray_IsAnyScalar

Contributor Author:

If I change it to is_scalar one unit test fails and the perf decreases even more. It is almost twice as slow as using cnp.PyArray_IsAnyScalar

       before           after         ratio
     [822dc6f9]       [eeabb266]
                      <gh35267>
  •    25.6±1ms         99.8±2ms     3.90  frame_methods.Equals.time_frame_nonunique_unequal
  •    28.2±3ms         97.3±1ms     3.44  frame_methods.Equals.time_frame_nonunique_equal
  •    31.2±2ms        106±0.9ms     3.40  frame_methods.Equals.time_frame_object_unequal
  •    43.3±3ms         112±0.7ms     2.59  frame_methods.Equals.time_frame_object_equal

Contributor Author:

Put comments indented inside the conditions.

Member:

im pre-caffeine so bear with me: what does PyArray_IsAnyScalar include? or more specifically, what does it not include that is_scalar does?

> is_scalar one unit test fails

what breaks?

Contributor Author:

PyArray_IsAnyScalar does not include None, datetime.datetime, datetime.timedelta, Period, decimal.Decimal, Fraction, numbers.Number, Interval, or DateOffset, while is_scalar does.

Given this, I think it would be good to use is_scalar for consistency. The drawback is the worse perf, because under the hood is_scalar uses PyArray_IsAnyScalar plus some additional checks.

Nvm the failed unit test; I see that it is a broken test on master that appeared after I merged master (pandas/tests/io/test_parquet.py::test_s3_roundtrip_for_dir)
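[Editor's note] The coverage difference described above can be illustrated at the Python level. Here np.isscalar is used as a loose stand-in for the C-level PyArray_IsAnyScalar macro (they are related but not identical), contrasted with the public pandas.api.types.is_scalar:

```python
import datetime

import numpy as np
from pandas.api.types import is_scalar

# Both agree on plain Python scalars and numpy array scalars
print(is_scalar(1), np.isscalar(1))                          # True True
print(is_scalar(np.float64(1)), np.isscalar(np.float64(1)))  # True True

# pandas' notion of "scalar" is broader: it also covers None and
# datetime objects, which numpy's Python-level check rejects
print(is_scalar(None), np.isscalar(None))                    # True False
d = datetime.datetime(2020, 9, 19)
print(is_scalar(d), np.isscalar(d))                          # True False
```

This is only a sketch of the semantic gap; the perf numbers above come from the Cython implementations, not these Python-level calls.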

Contributor Author:

@jbrockmendel how do we want to continue with this? Shall I replace PyArray_IsAnyScalar with is_scalar, or leave it as it is?

Member:

IIRC not is_list_like(obj) is more general and more performant than is_scalar(obj). Would that work in this context?

Contributor Author:

If I replace PyArray_IsAnyScalar with is_listlike and update the code like below, the perf becomes (much) worse. Will try the try/except option jreback proposed.

elif is_list_like(x) != is_list_like(y):
    # Only compare scalars to scalars and non-scalars to non-scalars
    return False
elif (is_list_like(x) and is_list_like(y)
        and not (isinstance(x, type(y)) or isinstance(y, type(x)))):
    # Check if non-scalars have the same type
    return False
       before           after         ratio
     [822dc6f9]       [98cea4b3]
     <master~31>      <gh35267_clean>
  •  25.1±0.6ms          695±2ms    27.65  frame_methods.Equals.time_frame_nonunique_equal
  •  25.4±0.6ms          702±3ms    27.65  frame_methods.Equals.time_frame_nonunique_unequal
  •  34.0±0.3ms          711±9ms    20.92  frame_methods.Equals.time_frame_object_unequal
  •  44.5±0.5ms          718±3ms    16.14  frame_methods.Equals.time_frame_object_equal

            # Only compare scalars to scalars and non-scalars to non-scalars
            return False
        elif (not (cnp.PyArray_IsPythonScalar(x) or cnp.PyArray_IsPythonScalar(y))
Member:

are you using not cnp.PyArray_IsPythonScalar(x) as a standin for isinstance(x, np.ndarray)?

Contributor Author:

I don't believe I am. My reasoning for the code is as follows:

The first condition checks if both values are of different types, i.e. scalar vs non-scalar comparison. If they are different then fail.

If both sides are scalars then the current logic in pandas can correctly handle the comparison. The code still fails, though, if the two sides have different non-scalar types (e.g. a dict vs list comparison). Hence the check to see if we are dealing with non-scalars (not cnp.PyArray_IsPythonScalar(x)), followed by the check that they are of the same type.
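[Editor's note] The two conditions described here can be sketched in pure Python. The helper names are invented for illustration, and np.isscalar stands in for the C-level scalar macros used in the real code:

```python
import numpy as np

def _is_scalar(obj):
    # illustrative stand-in for cnp.PyArray_IsAnyScalar
    return np.isscalar(obj)

def compatible_for_compare(x, y):
    # Condition 1: never compare a scalar against a non-scalar
    if _is_scalar(x) != _is_scalar(y):
        return False
    # Condition 2: two non-scalars must share a type
    # (e.g. a dict vs list comparison fails here)
    if not _is_scalar(x) and not (isinstance(x, type(y)) or isinstance(y, type(x))):
        return False
    return True

print(compatible_for_compare(np.array([1, 2]), 1))  # False: array vs scalar
print(compatible_for_compare({1, 2}, [1, 2]))       # False: set vs list
print(compatible_for_compare([1, 2], [1, 3]))       # True: same type
```

Only compatibility is sketched; the actual element-wise equality check happens afterwards in the Cython loop.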

Contributor:

can we not just wrap a try/except around the RichCompareBool? and handle there if there are odd types

this will avoid the perf hit.
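[Editor's note] A minimal pure-Python sketch of this suggestion (illustrative only; the actual change wraps the C-level PyObject_RichCompareBool call inside the Cython loop):

```python
import numpy as np

def safe_elementwise_equal(x, y):
    try:
        return bool(x == y)
    except ValueError:
        # e.g. np.array([1, 2]) == 1 returns an array, and bool() on an
        # array of length > 1 raises "truth value ... is ambiguous" --
        # exactly the ValueError reported in GH35267
        return False

print(safe_elementwise_equal(1, 1))                 # True
print(safe_elementwise_equal(np.array([1, 2]), 1))  # False instead of raising
```

Since the try/except only pays a cost when the comparison actually raises, the common path stays as fast as before, which is why this avoids the perf hit.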

Contributor Author:

Put the check in a try/except clause. frame_methods.Equals no longer shows a significant perf decrease (<1.1). Some tests from index_cached_properties.IndexCache are sometimes a little above 1.1 (e.g. 1.11-1.16).

Contributor:

ok let's go with this (add a comment as well)

Contributor Author:

Great. I've added the comment to the try/except clause.

              and not (isinstance(x, type(y)) or isinstance(y, type(x)))):
            # Check if non-scalars have the same type
            return False
        elif not (PyObject_RichCompareBool(x, y, Py_EQ) or
                  (x is None or is_nan(x)) and (y is None or is_nan(y))):
            return False
8 changes: 8 additions & 0 deletions pandas/tests/dtypes/test_missing.py
@@ -383,6 +383,14 @@ def test_array_equivalent(dtype_equal):
    assert not array_equivalent(DatetimeIndex([0, np.nan]), TimedeltaIndex([0, np.nan]))


@pytest.mark.parametrize(
    "val", [1, 1.1, 1 + 1j, True, "abc", [1, 2], (1, 2), {1, 2}, {"a": 1}, None]
)
def test_array_equivalent_series(val):
    arr = np.array([1, 2])
    assert not array_equivalent(Series([arr, arr]), Series([arr, val]))


def test_array_equivalent_different_dtype_but_equal():
    # Unclear if this is exposed anywhere in the public-facing API
    assert array_equivalent(np.array([1, 2]), np.array([1.0, 2.0]))
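[Editor's note] For context, array_equivalent is the internal pandas helper behind Series.equals that these tests exercise. A quick sketch of its semantics (internal API, so the import path may change between versions):

```python
import numpy as np
from pandas.core.dtypes.missing import array_equivalent

# NaNs in matching positions compare equal, unlike ordinary elementwise ==
left = np.array([1.0, np.nan])
right = np.array([1.0, np.nan])
assert array_equivalent(left, right)
assert not (left == right).all()  # nan != nan under plain comparison

# dtypes may differ as long as the values match (per the test above)
assert array_equivalent(np.array([1, 2]), np.array([1.0, 2.0]))
```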
10 changes: 6 additions & 4 deletions pandas/tests/series/methods/test_equals.py
@@ -24,16 +24,18 @@ def test_equals(arr, idx):
    assert not s1.equals(s2)


def test_equals_list_array():
@pytest.mark.parametrize(
    "val", [1, 1.1, 1 + 1j, True, "abc", [1, 2], (1, 2), {1, 2}, {"a": 1}, None]
)
def test_equals_list_array(val):
    # GH20676 Verify equals operator for list of Numpy arrays
    arr = np.array([1, 2])
    s1 = Series([arr, arr])
    s2 = s1.copy()
    assert s1.equals(s2)

    # TODO: Series equals should also work between single value and list
    # s1[1] = 9
    # assert not s1.equals(s2)
    s1[1] = val
    assert not s1.equals(s2)
Contributor:

can you add all of the missing cases
e.g. not s1.equals('foo')

also we have a bunch of tests for array_equivalent, these need a battery of tests

Contributor Author:

Parameterized the values in test_equals and added tests for array_equivalent with a Series.
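[Editor's note] Taken together, the user-visible effect of this PR can be demonstrated as follows (requires a pandas release containing the fix, i.e. 1.2.0 or later):

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2])
s1 = pd.Series([arr, arr])
s2 = s1.copy()
assert s1.equals(s2)

s2[1] = 9  # replace one of the array elements with a scalar
# Before this fix, comparing an ndarray element to a scalar raised
# ValueError ("truth value of an array ... is ambiguous"); now it is
# simply an unequal comparison.
assert not s1.equals(s2)
```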



def test_equals_false_negative():