Skip to content

PERF: fix assert_frame_equal can be very slow #38202

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Dec 24, 2020
37 changes: 25 additions & 12 deletions pandas/_testing.py
Original file line number Diff line number Diff line change
Expand Up @@ -1294,6 +1294,8 @@ def assert_series_equal(
rtol=1.0e-5,
atol=1.0e-8,
obj="Series",
*,
check_index=True,
):
"""
Check that left and right Series are equal.
Expand Down Expand Up @@ -1353,6 +1355,10 @@ def assert_series_equal(
obj : str, default 'Series'
Specify object name being compared, internally used to show appropriate
assertion message.
check_index : bool, default True
Whether to check index equivalence. If False, then compare only values.

.. versionadded:: 1.3.0

Examples
--------
Expand Down Expand Up @@ -1388,18 +1394,20 @@ def assert_series_equal(
if check_flags:
assert left.flags == right.flags, f"{repr(left.flags)} != {repr(right.flags)}"

# index comparison
assert_index_equal(
left.index,
right.index,
exact=check_index_type,
check_names=check_names,
check_exact=check_exact,
check_categorical=check_categorical,
rtol=rtol,
atol=atol,
obj=f"{obj}.index",
)
if check_index:
# GH #38183
assert_index_equal(
left.index,
right.index,
exact=check_index_type,
check_names=check_names,
check_exact=check_exact,
check_categorical=check_categorical,
rtol=rtol,
atol=atol,
obj=f"{obj}.index",
)

if check_freq and isinstance(left.index, (pd.DatetimeIndex, pd.TimedeltaIndex)):
lidx = left.index
ridx = right.index
Expand Down Expand Up @@ -1704,6 +1712,10 @@ def assert_frame_equal(
assert col in right
lcol = left.iloc[:, i]
rcol = right.iloc[:, i]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

marginally faster to do right._ixs(i, axis=1)

# GH #38183
# use check_index=False, because we do not want to run
# assert_index_equal for each column,
# as we already checked it for the whole dataframe before.
assert_series_equal(
lcol,
rcol,
Expand All @@ -1717,6 +1729,7 @@ def assert_frame_equal(
obj=f'{obj}.iloc[:, {i}] (column name="{col}")',
rtol=rtol,
atol=atol,
check_index=False,
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have a lot of kwargs already. would it be viable to call assert_(ea|numpy)_array_equal on lcol._values and rcol._values?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably.
But I looked inside assert_series_equal and noticed that there are a lot of cases covered, different dtypes, etc.
So, I thought that it would be safer to parametrize.
I will try to find another way, as you suggest.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel, I am afraid I can only suggest the following approach for now.

Split assert_series_equal into assert_index_equal and _assert_series_values_equal.
The latter one is the same as the original assert_series_equal, but with the assert_index_equal removed from there.

Then we can use _assert_series_values_equal in the loop inside assert_frame_equal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

works for me

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm i actually liked your prior implementation (check_index) but i guess this is fine.



Expand Down