PERF: apply on boolean-dtype DataFrame is exceedingly slow #44172

Closed
@alexreg

Description

  • I have checked that this issue has not already been reported.
  • I have confirmed this issue exists on the latest version of pandas.
  • I have confirmed this issue exists on the master branch of pandas.

Reproducible Example

from timeit import timeit
import pandas as pd

# Create a dummy DataFrame of bools and null values.
df = pd.DataFrame([[True, False, pd.NA] * 200] * 200, dtype="object")

# Casting the whole frame and then applying row-wise runs very slowly!
print(timeit(lambda: df.astype("boolean").apply(lambda row: row.count(), axis=1), number=10))  # 112s

# The `astype` on the entire DataFrame accounts for part of that time, but most of it comes from the subsequent `apply`, which is particularly slow for a boolean-dtype DataFrame.
print(timeit(lambda: df.astype("boolean"), number=10))  # 3.98s

# This *equivalent* statement, which casts each row inside `apply`, runs fast. There seems to be no good reason why the call to `apply` in the first statement must be so slow.
print(timeit(lambda: df.apply(lambda row: row.astype("boolean").count(), axis=1), number=10))  # 0.95s

Casting each row inside apply, as in the last statement, works fine in some cases, but in other situations it is inconvenient to cast back and forth between object-dtype and boolean-dtype just to make apply efficient.
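As a sanity check on the claim that the two statements are equivalent, the sketch below compares their results on a smaller frame (20 x 60, chosen here only so it runs quickly). The variable names and the final `DataFrame.count(axis=1)` line are illustrative additions, not part of the original report. That last line is a vectorized alternative that happens to fit this particular reduction and does not address the general `apply` slowdown.

```python
import pandas as pd

# Smaller frame than in the report so the comparison finishes quickly.
df = pd.DataFrame([[True, False, pd.NA] * 20] * 20, dtype="object")

# Form 1: cast the whole frame, then apply row-wise (the slow path in the report).
whole_frame = df.astype("boolean").apply(lambda row: row.count(), axis=1)

# Form 2: apply row-wise on the object frame, casting each row (the fast path).
per_row = df.apply(lambda row: row.astype("boolean").count(), axis=1)

# Illustrative alternative (an assumption, not from the report): count
# non-null values per row without going through `apply` at all.
vectorized = df.count(axis=1)

# All three should agree: each row holds 40 non-null values out of 60.
print(whole_frame.equals(per_row))
print(per_row.tolist() == vectorized.tolist())
```

This only helps when the row-wise function has a built-in vectorized counterpart. For arbitrary functions passed to `apply`, the performance gap described above still applies.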

Installed Versions

commit : aced6ee
python : 3.9.7.final.0
python-bits : 64
OS : Darwin
OS-release : 20.6.0
Version : Darwin Kernel Version 20.6.0: Wed Jun 23 00:26:31 PDT 2021; root:xnu-7195.141.2~5/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.4.0.dev0+970.gaced6eedf9
numpy : 1.20.2
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 58.2.0
Cython : 0.29.24
pytest : 6.2.5
hypothesis : 6.23.4
sphinx : 4.2.0
blosc : 1.10.6
feather : None
xlsxwriter : None
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.2
IPython : None
pandas_datareader: 0.10.0
bs4 : 4.10.0
bottleneck : None
fsspec : 2021.10.1
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 5.0.0
pyxlsb : None
s3fs : 2021.10.1
scipy : 1.7.1
sqlalchemy : None
tables : None
tabulate : 0.8.9
xarray : 0.19.0
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.54.1

Prior Performance

Not applicable.

Labels: Apply, ExtensionArray, Performance
