PERF: pd.BooleanDtype in row operations is 2000000 times slower #52016

Closed
@leaver2000

Description

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

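shape = 250_000, 100
# NOTE: randint's upper bound is exclusive, so randint(0, 1) yields only zeros
# and every mask value below is False; use randint(0, 2) for a mix of 0 and 1.
mask = pd.DataFrame(np.random.randint(0, 1, size=shape))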


np_mask = mask.astype(bool)
pd_mask = mask.astype(pd.BooleanDtype())

assert all(isinstance(dtype, pd.BooleanDtype) for dtype in pd_mask.dtypes)
assert all(isinstance(dtype, np.dtype) for dtype in np_mask.dtypes)
# column operations are not that much slower
%timeit pd_mask.any(axis=0) # 16.3 ms
%timeit np_mask.any(axis=0) # 5.86 ms
# using pandas.BooleanDtype back end for ROW operations is MUCH SLOWER
%timeit pd_mask.any(axis=1) # 14.1 s 
%timeit np_mask.any(axis=1) # 6.73 ms
16.3 ms ± 178 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.86 ms ± 467 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
14.1 s ± 329 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
6.73 ms ± 255 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
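
The `%timeit` lines above only run inside IPython or Jupyter. As a rough sketch for reproducing the same four measurements from a plain script, the standard-library `timeit` module can be used instead. The reduced shape, the seeded generator, and the `best_time` helper are assumptions made for illustration and are not part of the original report; absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Reduced shape (an assumption) so the script finishes quickly even when
# the row-wise path is slow.
shape = (2_000, 20)
rng = np.random.default_rng(0)
base = pd.DataFrame(rng.integers(0, 2, size=shape))  # mix of 0 and 1

np_mask = base.astype(bool)
pd_mask = base.astype(pd.BooleanDtype())


def best_time(func, repeat=3, number=1):
    """Hypothetical helper: best wall-clock seconds per call of func."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


timings = {
    "BooleanDtype, axis=0": best_time(lambda: pd_mask.any(axis=0)),
    "numpy bool, axis=0": best_time(lambda: np_mask.any(axis=0)),
    "BooleanDtype, axis=1": best_time(lambda: pd_mask.any(axis=1)),
    "numpy bool, axis=1": best_time(lambda: np_mask.any(axis=1)),
}

for label, seconds in timings.items():
    print(f"{label}: {seconds * 1e3:.3f} ms")
```

Taking the minimum over a few repeats reduces noise from other processes; it is not directly comparable to the mean ± std. dev. figures that `%timeit` reports.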

Edit: additional context and unexpected behavior

import pandas as pd
import numpy as np
import pyarrow as pa

# Columns WIND_SPEED and WIND_GUST are integers from 0 to 99, while the QC codes are 0-2
df = pd.DataFrame(
    {
        "WIND_SPEED": np.random.randint(0, 100, size=(100,)),
        "WIND_SPEED_QC": np.random.randint(0, 3, size=(100,)),
        "WIND_GUST": np.random.randint(0, 100, size=(100,)),
        "WIND_GUST_QC": np.random.randint(0, 3, size=(100,)),
    }
# I've been looking into the pyarrow dtypes and encountered an unexpected behavior
).astype(pd.ArrowDtype(pa.uint8()))
# the >= comparison returns a pd.BooleanDtype rather than a pd.ArrowDtype
mask = df[["WIND_SPEED_QC", "WIND_GUST_QC"]] >= 1
# this is not expected
assert all(isinstance(x, pd.BooleanDtype) for x in mask.dtypes)
# pd.BooleanDtype
%timeit mask.any(axis=1) # 5.35 ms
# pd.ArrowDtype
%timeit mask.astype(pd.ArrowDtype(pa.bool_())).any(axis=1) # 7.24 ms 
# np.bool_
%timeit mask.astype(bool).any(axis=1) # 197 µs

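It makes sense that pd.BooleanDtype is faster than pd.ArrowDtype here, but if the comparison is going to change the backend anyway, it would make more sense for it to return the np.bool_ dtype, which is about 27 times faster for this row-wise reduction (197 µs vs. 5.35 ms).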

5.35 ms ± 377 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.24 ms ± 278 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
197 µs ± 7.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
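
Until the row-wise path is faster, one possible workaround is to convert a nullable mask to NumPy `bool` before reducing, as in the `mask.astype(bool)` timing above. The sketch below adds explicit handling for missing values, since a plain `astype(bool)` on a `pd.BooleanDtype` column that contains `pd.NA` is expected to raise. The `row_any` helper is hypothetical, not a pandas API; it assumes that treating `pd.NA` as `False` is acceptable, which matches the default `skipna=True` behavior of `any`.

```python
import pandas as pd

# Small nullable-boolean frame, including missing values.
pd_mask = pd.DataFrame(
    {
        "a": pd.array([True, False, pd.NA, False], dtype="boolean"),
        "b": pd.array([False, False, True, pd.NA], dtype="boolean"),
    }
)


def row_any(mask: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: row-wise any() via a NumPy bool frame.

    Missing values are filled with False first (an assumption that
    mirrors skipna=True), so the cast to bool cannot fail on pd.NA.
    """
    dense = mask.fillna(False).astype(bool)
    return dense.any(axis=1)


result = row_any(pd_mask)
print(result.tolist())
```

The result is a plain `bool` Series rather than a nullable one, so this only fits cases where the distinction between `False` and missing does not matter after the reduction.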

Installed Versions

INSTALLED VERSIONS

commit : 1a2e300
python : 3.10.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.90.1-microsoft-standard-WSL2
Version : #1 SMP Fri Jan 27 02:56:13 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.0.0rc0
numpy : 1.24.2
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 59.6.0
pip : 22.0.2
Cython : None
pytest : 7.2.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.11.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2023.3.0
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : 2023.3.0
scipy : None
snappy : None
sqlalchemy : 2.0.4
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : None
qtpy : None
pyqt5 : None

Prior Performance

No response

Labels

  • NA - MaskedArrays: Related to pd.NA and nullable extension arrays
  • Performance: Memory or execution speed performance
  • Reduction Operations: sum, mean, min, max, etc.
  • Regression: Functionality that used to work in a prior pandas version
