Correlation inconsistencies between Series and DataFrame #20954

Open
@BobMcFry

Sample Code

import pandas as pd
import numpy as np


df = pd.DataFrame(data={'a': [-0.04096, -0.04096, -0.04096, -0.04096, -0.04096],
                        'b': [1., 2., 3., 4., 5.],
                        'c': [0.053646, 0.053646, 0.053646, 0.053646, 0.053646]},
                  dtype=np.float64)
corr_df = df.corr()

s_a = pd.Series(data=[-0.04096, -0.04096, -0.04096, -0.04096, -0.04096],
                dtype=np.float64, name='a')
s_b = pd.Series(data=[1., 2., 3., 4., 5.], index=[1, 2, 3, 4, 5], dtype=np.float64, name='b')
s_c = pd.Series(data=[0.053646, 0.053646, 0.053646, 0.053646, 0.053646],
                dtype=np.float64, name='c')

# Trying to rebuild the correlation matrix from above with the pandas.Series version.
# np.nan is used because correlation with the same Series does not work.
corr_series_new = pd.DataFrame(
    {'a': [np.nan,        s_a.corr(s_b), s_a.corr(s_c)],
     'b': [s_b.corr(s_a), np.nan,        s_b.corr(s_c)],
     'c': [s_c.corr(s_a), s_c.corr(s_b), np.nan       ]}
)

corr_series_old = pd.DataFrame(
    {'a': [np.nan,                df['a'].corr(df['b']), df['a'].corr(df['c'])],
     'b': [df['b'].corr(df['a']), np.nan,                df['b'].corr(df['c'])],
     'c': [df['c'].corr(df['a']), df['c'].corr(df['b']), np.nan               ]}
)

Problem description

1

pandas.DataFrame.corr() and pandas.Series.corr(other) behave differently. In general, the correlation between two Series is undefined when one of them is constant, as s_a and s_c are: the denominator of the correlation formula evaluates to zero, which results in a division by zero. Nevertheless, DataFrame.corr() returns numeric values for some of these pairs, as the following result shows:

>>> corr_df
    a    b    c
a NaN  NaN  NaN
b NaN  1.0  0.0
c NaN  0.0  1.0
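
The zero denominator described above can be reproduced with the textbook Pearson formula. The following is a minimal sketch, not pandas' own implementation: `pearson_parts` is a hypothetical helper, and the constant column uses 2.0 instead of the reported values so that the arithmetic is exact in floating point.

```python
import numpy as np


def pearson_parts(x, y):
    """Return (numerator, denominator) of the textbook Pearson coefficient."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    dx = x - x.mean()  # deviations from the mean
    dy = y - y.mean()
    numerator = float((dx * dy).sum())
    denominator = float(np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))
    return numerator, denominator


constant = [2.0, 2.0, 2.0, 2.0, 2.0]  # hypothetical stand-in for s_a / s_c
varying = [1.0, 2.0, 3.0, 4.0, 5.0]   # same values as column 'b'

num, den = pearson_parts(constant, varying)
print(num, den)  # both are zero, so num / den is undefined
```

With the reported values (-0.04096 and 0.053646) the computed mean may not round back to exactly the constant, so the deviations can be tiny nonzero numbers rather than exact zeros. That is one plausible reason, not confirmed here, why some code paths return 0.0 or 1.0 instead of NaN.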

2

These results also do not match the ones obtained from Series, although a match should be expected(?). Note that I explicitly put NaN on the diagonal, since e.g. s_b.corr(s_b) raises an error.

>>> corr_series_new
    a   b   c
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN

3

Another problem is that using the columns of the existing DataFrame, instead of newly created Series, gives different results again.

>>> corr_series_old
    a    b    c
0 NaN  NaN  NaN
1 NaN  NaN  0.0
2 NaN  0.0  NaN
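
One difference between the two constructions is worth noting. s_b is created with index=[1, 2, 3, 4, 5], while s_a and s_c use the default index 0..4, and the DataFrame columns all share a single index. Series.corr aligns its operands on the index before computing, so the standalone Series are compared on only four common labels. Whether this fully explains the differing outputs is an assumption, not something established in this report. The sketch below uses Series.align with an inner join to show the overlap:

```python
import numpy as np
import pandas as pd

# Same construction as in the sample code above.
s_a = pd.Series([-0.04096] * 5, dtype=np.float64, name='a')  # default index 0..4
s_b = pd.Series([1., 2., 3., 4., 5.], index=[1, 2, 3, 4, 5],
                dtype=np.float64, name='b')                  # index 1..5

# Inner-join alignment keeps only the labels present in both Series.
left, right = s_a.align(s_b, join='inner')
print(list(left.index))   # labels shared by s_a and s_b
print(len(left))          # number of value pairs actually compared
```

Dropping the explicit index from s_b (or resetting it) would make the standalone Series comparable with the DataFrame columns row for row.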

I hope I did not miss anything.

Expected Output

Both methods in Series and DataFrame should produce the same output.

Output of pd.show_versions()

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-39-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.4.0
Cython: None
numpy: 1.14.2
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: 6.3.1
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
