Skip to content

PERF: BoolHashTable #15751

Closed
Closed
@chris-b1

Description

@chris-b1

xref #15738 (comment)

Currently bool data is passed to the generic python object hashtable for the following methods, making them all very slow.

  • value_counts
  • rank
  • unique

Linked PR casts to int for factorize, duplicated, drop_duplicates

We could skip the hashing altogether and take advantage of the fixed set of values, e.g. below is a fastpath for unique.

%%cython
from numpy cimport uint8_t
cimport cython

import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def unique(uint8_t[:] data):
    cdef:
        bint seen_true = False
        bint seen_false = False
        Py_ssize_t i, N = len(data)
        uint8_t val
    with nogil:
        for i in range(N):
            val = data[i]
            if val == 0:
                seen_false = True
            elif val == 1:
                seen_true = True
            if seen_true and seen_false:
                break
    if seen_true and seen_false:
        return np.array([True, False])
    elif seen_true:
        return np.array([True])
    elif seen_false:
        return np.array([False])
    else:
        return np.array([], dtype=bool)

In [35]: bools = np.array([False, True] * 100000)

In [36]: %timeit pd.unique(bools.view('uint8')).astype(bool)
1000 loops, best of 3: 1.74 ms per loop

In [37]: %timeit unique(bools.view('uint8'))
100000 loops, best of 3: 3.47 µs per loop

Metadata

Metadata

Assignees

No one assigned

    Labels

    Dtype ConversionsUnexpected or buggy dtype conversionsInternalsRelated to non-user accessible pandas implementationPerformanceMemory or execution speed performance

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions