ENH: parallelize DataFrame.corr #40956

Open
@Vysybyl

Is your feature request related to a problem?

DataFrame.corr(method="spearman") is extremely slow, and method="pearson" is quite slow too.
My machine's resource monitor shows that the implementation is single-threaded. Is that a design choice? If so, there should at least be an optional argument to parallelize it (at the C++ level, of course).
I have not checked the actual code implementing this method.

Describe the solution you'd like

scipy.stats.spearmanr performs this computation on a NumPy array in 1/20 of the time on my 6-core machine.

API breaking implications

None.

Describe alternatives you've considered

Add an optional argument (e.g. parallelize=True/False) so that the user can choose.
The method should then either be reimplemented from scratch at the C++ level, or call the existing scipy.stats function on DataFrame.values and wrap the returned array in a new DataFrame.
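
The second alternative could look roughly like the sketch below. It is not how pandas implements anything: `spearman_corr_via_scipy` is a hypothetical helper name, and the sketch assumes an all-numeric DataFrame without missing values. Note that DataFrame.corr excludes missing values pairwise, while spearmanr by default propagates them, so the two would not agree on data containing NaN.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr


def spearman_corr_via_scipy(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: Spearman correlation matrix computed by SciPy.

    Assumes numeric columns and no missing values.
    """
    values = df.to_numpy(dtype=float)
    # axis=0: each column is a variable, each row is an observation.
    corr, _p_value = spearmanr(values, axis=0)
    corr = np.asarray(corr)
    if corr.ndim == 0:
        # With exactly two columns spearmanr returns a single coefficient,
        # so build the 2x2 matrix by hand.
        rho = float(corr)
        corr = np.array([[1.0, rho], [rho, 1.0]])
    # Re-attach the original column labels on both axes.
    return pd.DataFrame(corr, index=df.columns, columns=df.columns)
```

Calling `spearman_corr_via_scipy(df)` returns a DataFrame labelled like the output of `df.corr(method="spearman")`; the p-values that spearmanr also computes are discarded here.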

Additional context

IMPORTANT: DataFrame.corr and spearmanr give slightly different results (a small rounding error on the order of 1e-15).

import numpy as np
from scipy.stats import spearmanr
import pandas as pd

df = pd.DataFrame(np.random.rand(1000, 2000))
pd_corr = df.corr(method='spearman')  # a few seconds
scipy_corr, p_value = spearmanr(df.values)  # <1 sec

np.array_equal(pd_corr.values, scipy_corr)  # False: the results are not bit-for-bit identical
np.sum(np.abs(pd_corr.values - scipy_corr) > 1e-15)  # 0: no element differs by more than 1e-15
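
Since the two results appear to differ only by floating-point rounding, a tolerance-based comparison is more informative than exact equality. A minimal sketch follows; `max_abs_diff` is a hypothetical helper name, the tolerance of 1e-12 is an arbitrary assumption, and the input is smaller than in the example above only to keep the run short.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr


def max_abs_diff(a, b) -> float:
    """Hypothetical helper: largest element-wise absolute difference."""
    return float(np.max(np.abs(np.asarray(a) - np.asarray(b))))


rng = np.random.default_rng(0)  # seeded so the run is repeatable
df = pd.DataFrame(rng.random((200, 50)))

pd_corr = df.corr(method="spearman").to_numpy()
scipy_corr, _p_value = spearmanr(df.to_numpy())

# Size of the discrepancy between the two implementations.
print(max_abs_diff(pd_corr, scipy_corr))
# Tolerance-based check with an absolute tolerance only.
print(np.allclose(pd_corr, scipy_corr, rtol=0, atol=1e-12))
```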
