Description
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
Our internal validation tool needs a comparison tolerance that depends on the metric being compared. For example, when fetching results from an analytical database with a query like
SELECT count(distinct device_id) as device_count, avg(score) as score GROUP BY ...
we expect device_count to always be exact, but score is expected to carry small floating-point inaccuracies that vary from run to run.
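To illustrate the kind of discrepancy involved, here is a minimal sketch with made-up values. Averaging the same numbers in a different summation order can change the last bits of a float, while an integer count stays identical. The frames and values are hypothetical; only `assert_frame_equal` and its `check_exact` and `rtol` parameters are real pandas API.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical scores: the same three values averaged in two summation orders,
# as might happen between two runs of the same aggregation query.
score_a = ((0.1 + 0.2) + 0.3) / 3
score_b = (0.1 + (0.2 + 0.3)) / 3

left = pd.DataFrame({"device_count": [3], "score": [score_a]})
right = pd.DataFrame({"device_count": [3], "score": [score_b]})

# The two averages are not bit-for-bit identical.
scores_identical = score_a == score_b

# An exact comparison of the whole frame therefore fails on the score column.
try:
    assert_frame_equal(left, right, check_exact=True)
    exact_passes = True
except AssertionError:
    exact_passes = False

# A tolerant comparison passes, but it applies the same tolerance to every column.
assert_frame_equal(left, right, check_exact=False, rtol=1e-6)
```

The single frame-wide tolerance in the last call is exactly what the request is about: it is too loose for device_count and needed for score.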
My old code ran assert_frame_equal several times on different subsets of columns, which is cumbersome and doesn't express the intent well.
I recently refactored it by extracting assert_frame_equal's implementation and just adding the extra arguments to support per-column customizable rtol and atol.
It would be nice if this ability were built into pandas.
Note that this overlaps a bit with feature request #54861.
Feature Description
One way is to add extra arguments to assert_frame_equal, usable like so:
assert_frame_equal(
    left,
    right,
    rtol=1e-5,
    atol=1e-8,
    rtols={'device_count': 0, 'score': 1e-6},
    atols={'device_count': 0},  # for unspecified columns, the rtol/atol argument is used as default
)
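Until something like this exists, the proposed behaviour can be approximated with public testing functions. The following is a minimal sketch, not the pandas implementation: the helper name `assert_frame_equal_per_column` is hypothetical, it assumes unique column labels, and it compares column by column with `assert_series_equal`, so it does not cover every option of `assert_frame_equal`.

```python
import pandas as pd
from pandas.testing import assert_index_equal, assert_series_equal


def assert_frame_equal_per_column(
    left, right, rtol=1e-5, atol=1e-8, rtols=None, atols=None
):
    """Hypothetical helper: compare two frames with per-column tolerances."""
    rtols = rtols or {}
    atols = atols or {}
    # Both frames must have the same column labels, in the same order.
    assert_index_equal(left.columns, right.columns)
    for name in left.columns:
        assert_series_equal(
            left[name],
            right[name],
            check_exact=False,
            # Fall back to the frame-wide tolerance for unspecified columns.
            rtol=rtols.get(name, rtol),
            atol=atols.get(name, atol),
            obj=f"column {name!r}",
        )


# Made-up data: the score differs by 1e-9, the counts are identical.
left = pd.DataFrame({"device_count": [10, 20], "score": [0.5, 0.25]})
right = pd.DataFrame({"device_count": [10, 20], "score": [0.5 + 1e-9, 0.25]})

assert_frame_equal_per_column(
    left,
    right,
    rtols={"device_count": 0, "score": 1e-6},
    atols={"device_count": 0},
)
```

Setting both tolerances to 0 for device_count makes that column's comparison effectively exact, while score still uses its own rtol and the default atol.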
Or the entire comparison configuration (check_exact, check_datetimelike_compat, etc.) could be overridden per-series, for example:
assert_frame_equal(
    left,
    right,
    overrides={
        'device_count': {'check_exact': True},
        'score': {'rtol': 1e-6},
    },
)
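The overrides variant can be sketched the same way. This is an illustration under assumptions, not a proposed implementation: `assert_frame_equal_with_overrides` is a hypothetical name, column labels are assumed unique, and every keyword is forwarded to `assert_series_equal`, so only options that function accepts can be used as defaults or overrides.

```python
import pandas as pd
from pandas.testing import assert_index_equal, assert_series_equal


def assert_frame_equal_with_overrides(left, right, overrides=None, **defaults):
    """Hypothetical helper: per-column keyword overrides for the comparison."""
    overrides = overrides or {}
    # Both frames must have the same column labels, in the same order.
    assert_index_equal(left.columns, right.columns)
    for name in left.columns:
        # Frame-wide settings first, then this column's overrides on top.
        kwargs = {
            "obj": f"column {name!r}",
            **defaults,
            **overrides.get(name, {}),
        }
        assert_series_equal(left[name], right[name], **kwargs)


# Made-up data: the score differs by 1e-9, the counts are identical.
left = pd.DataFrame({"device_count": [10, 20], "score": [0.5, 0.25]})
right = pd.DataFrame({"device_count": [10, 20], "score": [0.5 + 1e-9, 0.25]})

assert_frame_equal_with_overrides(
    left,
    right,
    overrides={
        "device_count": {"check_exact": True},
        "score": {"rtol": 1e-6},
    },
)
```

Compared with separate rtols and atols dictionaries, a single overrides mapping keeps all settings for a column in one place, at the cost of a less discoverable signature.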
Alternative Solutions
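With the current public API, the closest equivalent is something like:
for column_names, rtol in [(["device_count", ...], 0.0), (["score", ...], 1e-6), ...]:
    # Select this group's columns; the index is carried along with the selection.
    # Separate names are used so the full frames are not overwritten inside the loop.
    left_subset = left[column_names]
    right_subset = right[column_names]
    assert_frame_equal(left_subset, right_subset, rtol=rtol)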