Description
- I have checked that this issue has not already been reported.
- I have confirmed this issue exists on the latest version of pandas.
- I have confirmed this issue exists on the master branch of pandas.
Reproducible Example
from timeit import timeit
import pandas as pd
# Create a dummy DataFrame of bools and null values.
df = pd.DataFrame([[True, False, pd.NA] * 200] * 200, dtype="object")
# Casting the whole DataFrame and then applying row-wise runs very slowly!
print(timeit(lambda: df.astype("boolean").apply(lambda row: row.count(), axis=1), number=10))  # 112s
# Only a small part of that time is the `astype` on the entire DataFrame; most of it is the subsequent `apply`, which is particularly slow for a boolean-dtype DataFrame.
print(timeit(lambda: df.astype("boolean"), number=10))  # 3.98s
# This *equivalent* statement runs fast, so there seems to be no good reason why the call to `apply` in the first statement must be so slow.
print(timeit(lambda: df.apply(lambda row: row.astype("boolean").count(), axis=1), number=10))  # 0.95s
There is no problem using the first method in some cases, but in other situations it is inconvenient to cast back and forth between object dtype and boolean dtype just to make `apply` efficient.
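For anyone triaging this, the sketch below checks on a smaller frame that the slow and fast statements really do return the same counts. The reduced size (20 x 60) is an arbitrary choice to keep the check quick, and the `df.count(axis=1)` line is only a possible `apply`-free workaround for this particular counting example, not something taken from the report above.

```python
import pandas as pd

# Smaller frame than in the report so the check finishes quickly (size is arbitrary).
df = pd.DataFrame([[True, False, pd.NA] * 20] * 20, dtype="object")

# Slow path from the report: cast the whole frame, then apply row-wise.
slow = df.astype("boolean").apply(lambda row: row.count(), axis=1)

# Fast path from the report: apply row-wise on object dtype, casting each row.
fast = df.apply(lambda row: row.astype("boolean").count(), axis=1)

# Possible workaround for this specific case: count non-null values per row without `apply`.
vectorized = df.count(axis=1)

# Compare plain Python lists so that integer dtype differences do not matter.
assert slow.tolist() == fast.tolist() == vectorized.tolist()
```

This only confirms that the results agree; it says nothing about where the time goes inside `apply` for boolean-dtype frames.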
Installed Versions
commit : aced6ee
python : 3.9.7.final.0
python-bits : 64
OS : Darwin
OS-release : 20.6.0
Version : Darwin Kernel Version 20.6.0: Wed Jun 23 00:26:31 PDT 2021; root:xnu-7195.141.2~5/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8
pandas : 1.4.0.dev0+970.gaced6eedf9
numpy : 1.20.2
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 58.2.0
Cython : 0.29.24
pytest : 6.2.5
hypothesis : 6.23.4
sphinx : 4.2.0
blosc : 1.10.6
feather : None
xlsxwriter : None
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.2
IPython : None
pandas_datareader: 0.10.0
bs4 : 4.10.0
bottleneck : None
fsspec : 2021.10.1
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 5.0.0
pyxlsb : None
s3fs : 2021.10.1
scipy : 1.7.1
sqlalchemy : None
tables : None
tabulate : 0.8.9
xarray : 0.19.0
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.54.1
Prior Performance
Not applicable.