Description
Pandas version checks
-
I have checked that this issue has not already been reported.
-
I have confirmed this issue exists on the latest version of pandas.
-
I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
Calling df.dropna()
on a dataframe with pd.arrays.SparseArray
has a huge penalty that increases exponentially with the size of the dataframe.
Given the following script (pytest
, pandas
and pytest-benchmark
are needed. It can be run with pytest <script>
):
from typing import Final
import numpy as np
import pandas as pd
import pytest
from pytest_benchmark.fixture import BenchmarkFixture
N: Final = 1_000
def generate_dataframe(N, sparse_flag=False):
data = np.random.randn(N)
data[:N // 2] = np.nan
df = pd.DataFrame({'col': data})
if sparse_flag:
df['col'] = pd.arrays.SparseArray(df['col'])
return df
@pytest.fixture
def normal_df() -> pd.DataFrame:
return generate_dataframe(N, sparse_flag=False)
@pytest.fixture
def sparse_df() -> pd.DataFrame:
return generate_dataframe(N, sparse_flag=True)
def test_dropna_normal(benchmark: BenchmarkFixture, normal_df: pd.DataFrame) -> None:
benchmark(lambda: normal_df.dropna())
def test_dropna_sparse(benchmark: BenchmarkFixture, sparse_df: pd.DataFrame) -> None:
benchmark(lambda: sparse_df.dropna())
It can be seen that, for only 1000 samples, the sparse array is 120 times slower than the standard approach. This feature gets unusable for medium-large datasets with only >1_000 samples or 10_000 samples
Installed Versions
INSTALLED VERSIONS
commit : 0691c5c
python : 3.10.11
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 151 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : English_United Kingdom.1252
pandas : 2.2.3
numpy : 1.23.5
pytz : 2024.1
dateutil : 2.9.0.post0
pip : 23.0.1
Cython : 0.29.37
sphinx : None
IPython : 8.25.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : 3.1.4
lxml.etree : 5.2.2
matplotlib : None
numba : 0.60.0
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : None
pyreadstat : None
pytest : 8.2.2
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.14.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None
Prior Performance
No response