Description
Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [x] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import numpy as np
import pandas as pd

non_null_df = pd.DataFrame({"id": ["a", "a"], "val1": [1.0, 2.0], "val2": [1.0, 2.0]})
print("DataFrame without null values:")
print(non_null_df)
#   id  val1  val2
# 0  a   1.0   1.0
# 1  a   2.0   2.0

# We normalize by N - ddof = 2 - 2 = 0, and thus get infinity from dividing by 0.0.
# This is the expected result:
print(non_null_df.cov(ddof=2))
#       val1  val2
# val1   inf   inf
# val2   inf   inf

# A groupby covariance behaves like the call above, as expected:
print(non_null_df.groupby("id").cov(ddof=2))
#          val1  val2
# id
# a  val1   inf   inf
#    val2   inf   inf

null_df = pd.DataFrame({"id": ["a", "a"], "val1": [1.0, 2.0], "val2": [np.nan, np.nan]})
print("DataFrame with null values:")
print(null_df)
#   id  val1  val2
# 0  a   1.0   NaN
# 1  a   2.0   NaN

# We expect to normalize by N - ddof = 2 - 2 = 0, but ddof is ignored because there are null values.
# The underlying problem is that libalgos.nancorr does not accept and use the provided ddof parameter.
# Instead, it returns 0.5 for the covariance of val1 with val1. This term of the covariance matrix
# should be infinity to match the behavior of the non-null DataFrame above, which handles the ddof
# argument.
# https://github.com/pandas-dev/pandas/blob/bb1f651536508cdfef8550f93ace7849b00046ee/pandas/core/frame.py#L9658-L9666
print(null_df.cov(ddof=2))
#       val1  val2
# val1   0.5   NaN
# val2   NaN   NaN

# The groupby covariance behaves like the call above, i.e. it also ignores ddof:
print(null_df.groupby("id").cov(ddof=2))
#          val1  val2
# id
# a  val1   0.5   NaN
#    val2   NaN   NaN
Issue Description
The problem: pandas and numpy have mismatching behavior when computing the covariance (.cov()) in the presence of missing/NaN values. This has also been highlighted in issue #16837.

When estimating covariance, the data is normalized by (N - ddof). Therefore, when the number of observations N is equal to the value passed in for ddof, dividing by zero results in infinity (inf).
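For reference, numpy itself follows this (N - ddof) convention; a minimal check with nothing pandas-specific (numpy emits a "Degrees of freedom <= 0" RuntimeWarning here, silenced for readability):

import warnings
import numpy as np

# Two observations with ddof=2: numpy normalizes by N - ddof = 0 and returns inf.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    print(np.cov([1.0, 2.0], ddof=2))  # inf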
I think it's specifically an issue in pandas code: pandas delegates to numpy for the calculation when no values are missing (NaN). On the flip side, the pandas-internal implementation libalgos.nancorr is used when nulls are present, and ddof is currently not used there to normalize the data before estimating the covariance.
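For illustration only, here is a minimal pure-numpy sketch of the behavior I would expect from the masked path. nancov_with_ddof is a hypothetical helper name, not pandas API or the actual nancorr implementation; it computes a pairwise-complete covariance normalized by (N - ddof) instead of a hard-coded N - 1:

import numpy as np

def nancov_with_ddof(mat, ddof=1):
    # Hypothetical ddof-aware pairwise-complete covariance (sketch, not pandas code).
    n_cols = mat.shape[1]
    out = np.full((n_cols, n_cols), np.nan)
    for i in range(n_cols):
        for j in range(i, n_cols):
            # Keep only rows where both columns are observed.
            valid = ~(np.isnan(mat[:, i]) | np.isnan(mat[:, j]))
            x, y = mat[valid, i], mat[valid, j]
            if len(x) == 0:
                continue  # no complete pairs -> leave NaN
            # Normalize by N - ddof; a zero denominator intentionally
            # yields inf, matching numpy's convention.
            with np.errstate(divide="ignore", invalid="ignore"):
                cov_ij = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - ddof)
            out[i, j] = out[j, i] = cov_ij
    return out

# Same data as null_df above: val1 = [1.0, 2.0], val2 = [NaN, NaN].
mat = np.array([[1.0, np.nan], [2.0, np.nan]])
print(nancov_with_ddof(mat, ddof=2))
# [[inf nan]
#  [nan nan]]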
Expected Behavior
Refer to the Reproducible Example above: null_df.cov(ddof=2) (and the equivalent groupby call) should honor ddof and return inf for the val1/val1 term, matching the non-null case.
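Concretely, assuming the same divide-by-zero convention as the non-null path, the expected output would be:

print(null_df.cov(ddof=2))
# Expected: val1/val1 normalized by N - ddof = 0, hence inf;
# the val2 entries stay NaN because there are no complete observations.
#       val1  val2
# val1   inf   NaN
# val2   NaN   NaN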
Installed Versions
In [5]: pd.show_versions()
INSTALLED VERSIONS
commit : 66e3805
python : 3.8.12.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-76-generic
Version : #86-Ubuntu SMP Fri Jan 17 17:24:28 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8
pandas : 1.3.5
numpy : 1.21.5
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 59.8.0
Cython : 0.29.26
pytest : 6.2.5
hypothesis : 6.36.0
sphinx : 4.4.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.7.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 8.0.1
pandas_datareader: None
bs4 : 4.10.0
bottleneck : None
fsspec : 2022.01.0
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 5.0.0
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : 0.55.0