corr return without duplicates and sorted by correlation strength #24728

Open
@chrisluedtke

Code Sample, a copy-pastable example if possible

A function-based example of what I'd like to implement in pandas, either as a corr argument or as a separate function:

import numpy as np
import pandas as pd

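def correlate_sort(df: pd.DataFrame, method: str = 'pearson') -> pd.DataFrame:
  """
  pd.DataFrame.corr() without redundancy and sorted by strength
  """
  df = df.corr(method=method)
  # Mask the diagonal and the lower triangle so each column pair appears once.
  df = df.mask(np.tril(np.ones(df.shape)).astype(bool))
  # One row per remaining (row, column) pair; drop the masked (NaN) entries.
  df = df.stack().dropna().reset_index()
  df = df.rename(columns={0: method})

  # Sort by absolute correlation, strongest first.
  df['sort'] = df[method].abs()
  df = df.sort_values('sort', ascending=False)

  return df.drop('sort', axis=1).reset_index(drop=True)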
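
For comparison, here is a minimal sketch of the same idea without the temporary 'sort' column. The name correlate_sort_sketch is hypothetical and not part of pandas. It assumes pandas >= 1.1, where sort_values accepts a key argument, and it keeps only the entries strictly above the diagonal:

```python
import numpy as np
import pandas as pd


def correlate_sort_sketch(df: pd.DataFrame, method: str = "pearson") -> pd.DataFrame:
    """Hypothetical variant: one row per unordered column pair, strongest first."""
    corr = df.corr(method=method)
    # True strictly above the diagonal (k=1 excludes the diagonal itself).
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    # where() blanks everything else to NaN; stack() + dropna() leaves the pairs.
    pairs = corr.where(upper).stack().dropna().rename(method).reset_index()
    # Sort by absolute value via `key` (assumes pandas >= 1.1).
    pairs = pairs.sort_values(method, key=lambda s: s.abs(), ascending=False)
    return pairs.reset_index(drop=True)


# Made-up data for illustration only: b is an exact multiple of a.
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
})
result = correlate_sort_sketch(demo)
```

With three columns the result has three rows, and the (a, b) pair comes first because its correlation is 1 in absolute value.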

Problem description

pd.DataFrame.corr() returns a symmetric matrix, so every pairwise correlation appears twice and the diagonal is always 1. I'm interested in implementing an enhancement (as an argument option, a separate function, etc.) that returns a DataFrame with one row per column pair, sorted by correlation strength (absolute value).

import pandas as pd

data_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data'
df = pd.read_csv(data_url, names=['Age', 'op_year', 'pos_nodes', '5_yr_outcome'])
df.corr()
                   Age   op_year  pos_nodes  5_yr_outcome
Age           1.000000  0.089529  -0.063176     -0.067950
op_year       0.089529  1.000000  -0.003764      0.004768
pos_nodes    -0.063176 -0.003764   1.000000     -0.286768
5_yr_outcome -0.067950  0.004768  -0.286768      1.000000
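
To make the redundancy concrete, the sketch below checks the symmetry and the unit diagonal on made-up random data (not the Haberman dataset) and counts how many entries are actually informative. For n columns that is n(n-1)/2, which for four columns is 6 of the 16 cells:

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only; column names are arbitrary.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(50, 4)), columns=["w", "x", "y", "z"])
corr = demo.corr().to_numpy()

n = corr.shape[0]
is_symmetric = np.allclose(corr, corr.T)          # corr[i, j] == corr[j, i]
unit_diagonal = np.allclose(np.diag(corr), 1.0)   # each column vs. itself
total_cells = n * n                               # what corr() returns
unique_pairs = n * (n - 1) // 2                   # what is actually informative

print(is_symmetric, unit_diagonal, total_cells, unique_pairs)
```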

Expected Output

     level_0       level_1   pearson
0  pos_nodes  5_yr_outcome -0.286768
1        Age       op_year  0.089529
2        Age  5_yr_outcome -0.067950
3        Age     pos_nodes -0.063176
4    op_year  5_yr_outcome  0.004768
5    op_year     pos_nodes -0.003764

Output of pd.show_versions()

/usr/local/lib/python3.6/dist-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: . """)

INSTALLED VERSIONS

commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Linux
OS-release: 4.14.79+
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: 3.10.1
pip: 18.1
setuptools: 40.6.3
Cython: 0.29.2
numpy: 1.14.6
scipy: 1.1.0
pyarrow: None
xarray: 0.11.2
IPython: 5.5.0
sphinx: 1.8.3
patsy: 0.5.1
dateutil: 2.5.3
pytz: 2018.9
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 2.1.2
openpyxl: 2.5.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 4.2.6
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.15
pymysql: None
psycopg2: 2.7.6.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: 0.2.0
fastparquet: None
pandas_gbq: 0.4.1
pandas_datareader: 0.7.0
