Closed
Description
xref #15738 (comment)
Currently bool data is passed to the generic python object hashtable for the following methods, making them all very slow.
-
value_counts
-
rank
-
unique
Linked PR casts to int for factorize
, duplicated
, drop_duplicates
We could skip the hashing altogether and take advantage of the fixed set of values, e.g. below is a fastpath for unique.
%%cython
from numpy cimport uint8_t
cimport cython
import numpy as np
@cython.boundscheck(False)
@cython.wraparound(False)
def unique(uint8_t[:] data):
cdef:
bint seen_true = False
bint seen_false = False
Py_ssize_t i, N = len(data)
uint8_t val
with nogil:
for i in range(N):
val = data[i]
if val == 0:
seen_false = True
elif val == 1:
seen_true = True
if seen_true and seen_false:
break
if seen_true and seen_false:
return np.array([True, False])
elif seen_true:
return np.array([True])
elif seen_false:
return np.array([False])
else:
return np.array([], dtype=bool)
In [35]: bools = np.array([False, True] * 100000)
In [36]: %timeit pd.unique(bools.view('uint8')).astype(bool)
1000 loops, best of 3: 1.74 ms per loop
In [37]: %timeit unique(bools.view('uint8'))
100000 loops, best of 3: 3.47 µs per loop