Description
Pandas version checks
-
I have checked that this issue has not already been reported.
-
I have confirmed this issue exists on the latest version of pandas.
-
I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
import pandas as pd
import numpy as np
shape = 250_000, 100
mask = pd.DataFrame(np.random.randint(0, 1, size=shape))
np_mask = mask.astype(bool)
pd_mask = mask.astype(pd.BooleanDtype())
assert all(isinstance(dtype, pd.BooleanDtype) for dtype in pd_mask.dtypes)
assert all(isinstance(dtype, np.dtype) for dtype in np_mask.dtypes)
# column operations are not that much slower
%timeit pd_mask.any(axis=0) # 16.3 ms
%timeit np_mask.any(axis=0) # 5.86 ms
# using pandas.BooleanDtype back end for ROW operations is MUCH SLOWER
%timeit pd_mask.any(axis=1) # 14.1 s
%timeit np_mask.any(axis=1) # 6.73 ms
16.3 ms ± 178 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.86 ms ± 467 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
14.1 s ± 329 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
6.73 ms ± 255 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Edit: additional context and unexpected behavior
import pandas as pd
import numpy as np
import pyarrow as pa
# Columns WIND_SPEED and WIND_GUST might be any number between 0 and 100, while the QC codes are 0-3
df = pd.DataFrame(
{
"WIND_SPEED": np.random.randint(0, 100, size=(100,)),
"WIND_SPEED_QC": np.random.randint(0, 3, size=(100,)),
"WIND_GUST": np.random.randint(0, 100, size=(100,)),
"WIND_GUST_QC": np.random.randint(0, 3, size=(100,)),
}
# I've been looking into the pyarrow dtypes and encountered a unexpected behavior
).astype(pd.ArrowDtype(pa.uint8()))
# the equality comparison returns a pd.BooleanDtype rather than pd.ArrowDtype
mask = df[["WIND_SPEED_QC", "WIND_GUST_QC"]].__ge__(1)
# this is not expected
assert all(isinstance(x, pd.BooleanDtype) for x in mask.dtypes)
# pd.BooleanDtype
%timeit mask.any(axis=1) # 5.35 ms
# pd.ArrowDtype
%timeit mask.astype(pd.ArrowDtype(pa.bool_())).any(axis=1) # 7.24 ms
# np.bool_
%timeit mask.astype(bool).any(axis=1) # 197 µs
The pd.BooleanDtype is faster than the pd.ArrowDtype which makes sense, but if the backend is going to change
it would make sense to use the np.bool_ dtype.
5.35 ms ± 377 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.24 ms ± 278 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
197 µs ± 7.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Installed Versions
INSTALLED VERSIONS
commit : 1a2e300
python : 3.10.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.90.1-microsoft-standard-WSL2
Version : #1 SMP Fri Jan 27 02:56:13 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.0.0rc0
numpy : 1.24.2
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 59.6.0
pip : 22.0.2
Cython : None
pytest : 7.2.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.11.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2023.3.0
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : 2023.3.0
scipy : None
snappy : None
sqlalchemy : 2.0.4
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : None
qtpy : None
pyqt5 : None
Prior Performance
No response