Description
Sample Code
import pandas as pd
import numpy as np
df = pd.DataFrame(data={'a': [-0.04096, -0.04096, -0.04096, -0.04096, -0.04096],
                        'b': [1., 2., 3., 4., 5.],
                        'c': [0.053646, 0.053646, 0.053646, 0.053646, 0.053646]},
                  dtype=np.float64)
corr_df = df.corr()
s_a = pd.Series(data=[-0.04096, -0.04096, -0.04096, -0.04096, -0.04096],
                dtype=np.float64, name='a')
s_b = pd.Series(data=[1., 2., 3., 4., 5.], index=[1, 2, 3, 4, 5], dtype=np.float64, name='b')
s_c = pd.Series(data=[0.053646, 0.053646, 0.053646, 0.053646, 0.053646],
                dtype=np.float64, name='c')
# Trying to rebuild the correlation matrix from above with the pandas.Series version.
# np.nan is used because correlation with the same Series does not work.
corr_series_new = pd.DataFrame(
    {'a': [np.nan, s_a.corr(s_b), s_a.corr(s_c)],
     'b': [s_b.corr(s_a), np.nan, s_b.corr(s_c)],
     'c': [s_c.corr(s_a), s_c.corr(s_b), np.nan]}
)
corr_series_old = pd.DataFrame(
    {'a': [np.nan, df['a'].corr(df['b']), df['a'].corr(df['c'])],
     'b': [df['b'].corr(df['a']), np.nan, df['b'].corr(df['c'])],
     'c': [df['c'].corr(df['a']), df['c'].corr(df['b']), np.nan]}
)
Problem description
1. pandas.DataFrame.corr() and pandas.Series.corr(other) behave differently. In general, the correlation between two Series is undefined when one of them is constant, as s_a and s_c are: the denominator of the correlation formula evaluates to zero, which leads to a division by zero. Nevertheless, DataFrame.corr() returns numeric values for some of these pairs, as the following result shows:
>>> corr_df
    a   b   c
a NaN NaN NaN
b NaN 1.0 0.0
c NaN 0.0 1.0
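To make the zero-denominator argument concrete, here is a minimal sketch of the textbook Pearson formula. The helper name pearson and the sample values are hypothetical and chosen for illustration; this is not how pandas implements either corr method.

```python
import math

import numpy as np


def pearson(x, y):
    """Textbook Pearson correlation; returns NaN when either input has zero variance."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    dx = x - x.mean()
    dy = y - y.mean()
    denominator = math.sqrt((dx * dx).sum() * (dy * dy).sum())
    if denominator == 0.0:
        # Zero variance in x or y: the coefficient is undefined.
        return float("nan")
    return float((dx * dy).sum() / denominator)


# Hypothetical sample values; 2.0 is exactly representable, so every deviation is exactly 0.
constant = [2.0, 2.0, 2.0, 2.0, 2.0]
varying = [1.0, 2.0, 3.0, 4.0, 5.0]

r_constant = pearson(constant, varying)                      # undefined case
r_linear = pearson(varying, [2.0, 4.0, 6.0, 8.0, 10.0])      # perfectly linear case
print(r_constant, r_linear)
```

With values such as -0.04096 or 0.053646, the computed mean may not reproduce the value exactly in floating point, so the denominator could come out tiny but non-zero. That might explain numeric entries like 0.0 and 1.0 in the matrix above, but it is an assumption and not something verified here.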
2. These results also do not match the ones obtained from the Series version, although a match should be expected(?). Note that I explicitly put NaNs on the diagonal, since e.g. s_b.corr(s_b) raises an error.
>>> corr_series_new
    a   b   c
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3. Another problem is that using the columns of the existing DataFrame, instead of the newly created Series, gives yet another result.
>>> corr_series_old
    a   b   c
0 NaN NaN NaN
1 NaN NaN 0.0
2 NaN 0.0 NaN
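One detail of the sample code that may matter here: the newly created s_b is given index=[1, 2, 3, 4, 5], while s_a and s_c use the default index 0..4 and the DataFrame columns all share one index. Series.corr pairs values by index label, so the two variants do not necessarily correlate the same pairs of values. The sketch below only shows which labels overlap; whether this accounts for the 0.0 versus NaN difference is an assumption.

```python
import numpy as np
import pandas as pd

s_a = pd.Series([-0.04096] * 5, dtype=np.float64, name='a')        # default index 0..4
s_b = pd.Series([1., 2., 3., 4., 5.], index=[1, 2, 3, 4, 5],
                dtype=np.float64, name='b')                        # index 1..5

# Inner-join alignment keeps only the labels present in both Series.
a_aligned, b_aligned = s_a.align(s_b, join='inner')
common_labels = list(a_aligned.index)

print(common_labels)         # labels shared by s_a and s_b
print(b_aligned.tolist())    # values of s_b that take part in the pairing
```

Creating s_b without the explicit index argument would make the new Series directly comparable to the DataFrame columns.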
I hope I did not miss anything.
Expected Output
Both methods in Series and DataFrame should produce the same output.
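As a rough sketch of how this expectation could be checked, the snippet below rebuilds the matrix from pairwise Series.corr calls and compares it with DataFrame.corr(). The helper pairwise_series_corr and the demo data are hypothetical. The data is deliberately non-degenerate (no constant columns), and the diagonal is left as NaN as in the sample code above.

```python
import numpy as np
import pandas as pd


def pairwise_series_corr(df):
    """Build a correlation matrix from Series.corr calls, leaving the diagonal as NaN."""
    cols = list(df.columns)
    out = pd.DataFrame(np.nan, index=cols, columns=cols)
    for r in cols:
        for c in cols:
            if r != c:
                out.loc[r, c] = df[r].corr(df[c])
    return out


# Hypothetical demo data: every column has non-zero variance.
demo = pd.DataFrame({'x': [1., 2., 3., 4., 5.],
                     'y': [2., 1., 4., 3., 5.],
                     'z': [5., 3., 4., 1., 2.]})

from_series = pairwise_series_corr(demo)
from_frame = demo.corr()

# Compare off-diagonal entries only, since the diagonal was left as NaN.
off_diag = ~np.eye(len(demo.columns), dtype=bool)
consistent = np.allclose(from_series.values[off_diag], from_frame.values[off_diag])
print(consistent)
```

Running the same comparison on the df from the sample code is what exposes the mismatch described in this report.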
Output of pd.show_versions()
pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-39-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.4.0
Cython: None
numpy: 1.14.2
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: 6.3.1
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None