PERF: BoolHashTable

xref https://github.com/pandas-dev/pandas/pull/15738#issuecomment-287843016

Currently bool data is passed to the generic python object hashtable for the following methods, making them all very slow.
- [ ]  `value_counts`
- [ ]  `rank`
- [ ]  `unique`

Linked PR casts to int for `factorize`, `duplicated`, `drop_duplicates`

We could skip the hashing altogether and take advantage of the fixed set of values, e.g. below is a fastpath for unique.

```python
%%cython
from numpy cimport uint8_t
cimport cython

import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def unique(uint8_t[:] data):
    cdef:
        bint seen_true = False
        bint seen_false = False
        Py_ssize_t i, N = len(data)
        uint8_t val
    with nogil:
        for i in range(N):
            val = data[i]
            if val == 0:
                seen_false = True
            elif val == 1:
                seen_true = True
            if seen_true and seen_false:
                break
    if seen_true and seen_false:
        return np.array([True, False])
    elif seen_true:
        return np.array([True])
    elif seen_false:
        return np.array([False])
    else:
        return np.array([], dtype=bool)

In [35]: bools = np.array([False, True] * 100000)

In [36]: %timeit pd.unique(bools.view('uint8')).astype(bool)
1000 loops, best of 3: 1.74 ms per loop

In [37]: %timeit unique(bools.view('uint8'))
100000 loops, best of 3: 3.47 µs per loop
```




Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PERF: BoolHashTable #15751

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

PERF: BoolHashTable #15751

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions