ENH: Unhelpful output from assert_frame_equal when indexes differ and check_like=True #37478

Closed
@amilbourne

Problem:

Calling testing.assert_frame_equal with mismatched indexes and check_like=True generates unhelpful output.

If you run:

import pandas as pd
df1 = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]}, index=["a", "b", "c"])
df2 = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]}, index=["a", "b", "d"])
pd.testing.assert_frame_equal(df1, df2, check_like=True)

The output will be:

AssertionError: DataFrame.iloc[:, 0] (column name="A") are different

DataFrame.iloc[:, 0] (column name="A") values are different (33.33333 %)
[index]: [a, b, d]
[left]:  [1.0, 2.0, nan]
[right]: [1.0, 2.0, 3.0]

The input DataFrames do not actually contain different data (neither has a NaN). However, when check_like=True the code calls left.reindex_like(right) before comparing the indexes (and columns), so that both frames are ordered the same way.
If the indexes contain different values (rather than the same values in a different order),
reindex_like fills the rows or columns for the mismatched labels with NaN.
As a result, the subsequent index checks pass, but assert_frame_equal then fails
with a "values are different" error (as above).
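
To make the mechanism concrete, here is a minimal sketch that calls reindex_like directly on the two frames from the example above. It only illustrates the behaviour described in this issue; it is not a copy of the internal assert_frame_equal code.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]}, index=["a", "b", "c"])
df2 = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]}, index=["a", "b", "d"])

# Reindex df1 onto df2's labels. Label "c" is dropped, and label "d"
# has no source row in df1, so its values are filled with NaN.
reindexed = df1.reindex_like(df2)
print(reindexed)

# The reindexed index is now identical to df2's, which hides the mismatch.
assert reindexed.index.equals(df2.index)

# The "d" row is entirely NaN; this is what the value comparison trips over.
assert reindexed.loc["d"].isna().all()
```

After the reindex, nothing in the left frame records that it originally had a "c" row, so a later index comparison has nothing to report.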

Even more confusingly, if the values being compared are not floats (for example, integers), the NaN fill upcasts the reindexed column to float64 and you get a dtype mismatch error instead:

AssertionError: Attributes of DataFrame.iloc[:, 0] (column name="A") are different

Attribute "dtype" are different
[left]:  float64
[right]: int64
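
The dtype change can be reproduced in isolation. The sketch below assumes integer columns and the same mismatched indexes as before; the frame names left and right are chosen here for illustration only.

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2, 3]}, index=["a", "b", "c"])
right = pd.DataFrame({"A": [1, 2, 3]}, index=["a", "b", "d"])

# Label "d" does not exist in left, so reindexing inserts a NaN there.
# An integer column cannot hold NaN, so pandas upcasts it to float64.
reindexed = left.reindex_like(right)

print(left["A"].dtype)       # integer dtype before reindexing
print(reindexed["A"].dtype)  # float dtype after reindexing

assert left["A"].dtype.kind == "i"
assert reindexed["A"].dtype.kind == "f"
```

The dtype check then fails before the values are compared, which is why the message mentions dtype rather than the index.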

These messages are quite unhelpful, as the mismatch is in the index, and the error should logically be the same as you would get if you ran with check_like=False.

Applies to:

The code above was run against the latest code from master.

>>> print(pd.__version__)
1.2.0.dev0+950.gd321be6

Solution:

The message for the above assertion failure should be something like:

AssertionError: DataFrame.index are different

DataFrame.index values are different (33.33333 %)
[left]:  Index(['a', 'b', 'c'], dtype='object')
[right]: Index(['a', 'b', 'd'], dtype='object')

This is what you get if you run with check_like=False.
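
Until the message is improved, one possible user-side workaround is to compare the label sets before delegating to assert_frame_equal. The wrapper below is a hypothetical helper, not part of pandas, and its message format is made up for illustration. It assumes the labels are unique, since Index.difference ignores duplicates.

```python
import pandas as pd


def assert_frame_equal_like(left, right, **kwargs):
    """Hypothetical wrapper: report label mismatches before comparing data."""
    for axis_name in ("index", "columns"):
        left_labels = getattr(left, axis_name)
        right_labels = getattr(right, axis_name)
        # Labels present on one side only; order is deliberately ignored.
        only_left = left_labels.difference(right_labels)
        only_right = right_labels.difference(left_labels)
        if len(only_left) or len(only_right):
            raise AssertionError(
                f"DataFrame.{axis_name} are different\n"
                f"[only in left]:  {list(only_left)}\n"
                f"[only in right]: {list(only_right)}"
            )
    # Label sets match, so reordering with check_like=True is safe.
    pd.testing.assert_frame_equal(left, right, check_like=True, **kwargs)


df1 = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]}, index=["a", "b", "c"])
df2 = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]}, index=["a", "b", "d"])

try:
    assert_frame_equal_like(df1, df2)
except AssertionError as exc:
    message = str(exc)
    print(message)

# Same labels in a different order still compare equal.
df3 = df1.loc[["c", "a", "b"], ["B", "A"]]
assert_frame_equal_like(df1, df3)
```

With this guard, the mismatched frames fail with a message that names the index, while frames that differ only in row or column order still pass.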

Labels: Enhancement, Testing (pandas testing functions or related to the test suite)