blosc2.jit support for pandas UDFs #383

Closed
@datapythonista

xref pandas-dev/pandas#61125

We discussed this informally in the past. I'm opening this issue to describe more clearly how blosc2.jit and pandas can interact.

I'm about to open a PR in pandas to support this:

import numpy as np
import pandas
import blosc2

def my_func(x):
    return np.sin(x * 2)

s = pandas.Series([1, 2, 3], index=list('abc'), name='sample')

# normal call executed by pandas
print(s.map(my_func))

# we let blosc2 handle this
print(s.map(my_func, engine=blosc2.jit))

To be able to do this, we would need blosc2 to implement a new interface. The implementation shouldn't be too complex, something like the code below. The example ignores skip_na, and it omits a second method, apply, for column-wise operations (where the function is called with the whole array, not with each scalar):

import numpy as np
import blosc2

# Reference base class: https://github.com/pandas-dev/pandas/blob/main/pandas/core/apply.py#L77
class Blosc2ExecutionEngine:
    @staticmethod
    def map(data, func, args, kwargs, decorator, skip_na):
        if not isinstance(data, np.ndarray):
            # we probably received a Series
            if hasattr(data, "values"):
                data = data.values
            else:
                # there is a chance that we call this with a pyarrow object in the future
                raise ValueError(f"blosc2.jit does not support {type(data).__name__}")

        func = decorator(func)
        result = func(data, *args, **kwargs)
        return result


blosc2.jit.__pandas_udf__ = Blosc2ExecutionEngine
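
The example above ignores skip_na. As a rough sketch of what honoring it could look like, the helper below masks missing values so the function never sees them and writes NaN back in their place. The name `map_skipping_na` is hypothetical, it assumes float data where missing values are NaN, and it is not part of blosc2 or pandas.

```python
import numpy as np


def map_skipping_na(data, func, skip_na):
    """Hypothetical helper: apply a vectorized func, optionally skipping NaN."""
    data = np.asarray(data, dtype="float64")
    if not skip_na:
        # func receives everything, including missing values
        return func(data)
    mask = np.isnan(data)
    # start from an all-NaN result and fill in only the valid positions
    result = np.full(data.shape, np.nan)
    result[~mask] = func(data[~mask])
    return result


def fill_then_double(x):
    # replaces NaN with 0.0 when it sees one, so the effect of skip_na is visible
    return np.nan_to_num(x, nan=0.0) * 2


values = np.array([1.0, np.nan, 3.0])
skipped = map_skipping_na(values, fill_then_double, skip_na=True)
not_skipped = map_skipping_na(values, fill_then_double, skip_na=False)
```

With skip_na=True the missing entry stays NaN, while with skip_na=False the function gets to decide what to do with it.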

The advantage of this approach over just decorating the function is that the whole execution loop can be jitted, not only the individual calls.
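
To make the difference concrete, here is a minimal sketch of how a caller might dispatch on the `__pandas_udf__` attribute. It uses only NumPy so it runs without pandas or blosc2: `fake_jit`, `SketchEngine`, and `dispatch_map` are hypothetical stand-ins, and the real dispatch logic in pandas may differ.

```python
import numpy as np


def fake_jit(func):
    """Hypothetical stand-in for blosc2.jit: returns func unchanged."""
    return func


class SketchEngine:
    """Same shape as the proposed engine, without any blosc2 dependency."""

    @staticmethod
    def map(data, func, args, kwargs, decorator, skip_na):
        if not isinstance(data, np.ndarray):
            if hasattr(data, "values"):
                data = data.values
            else:
                raise ValueError(f"unsupported input type: {type(data).__name__}")
        # one call on the whole array, so the decorator sees the full loop
        return decorator(func)(data, *args, **kwargs)


# the decorator advertises its engine through an attribute
fake_jit.__pandas_udf__ = SketchEngine


def dispatch_map(data, func, engine=None, args=(), kwargs=None):
    """Hypothetical caller-side dispatch (an assumption, not pandas code)."""
    kwargs = kwargs or {}
    if engine is None:
        # default path: Python-level loop, one call per element
        return np.array([func(x, *args, **kwargs) for x in data])
    if not hasattr(engine, "__pandas_udf__"):
        raise ValueError("engine must expose a __pandas_udf__ attribute")
    return engine.__pandas_udf__.map(
        data, func, args, kwargs, decorator=engine, skip_na=False
    )


def my_func(x):
    return np.sin(x * 2)


data = np.array([1.0, 2.0, 3.0])
looped = dispatch_map(data, my_func)
vectorized = dispatch_map(data, my_func, engine=fake_jit)
```

Both paths produce the same values here. The difference is that the engine path hands the decorator a single call over the whole array, which is where a real JIT could optimize the loop.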

What do you think? Is this something you'd like to implement? Any feedback? It's designed in a way that you don't need to add a dependency on pandas. We aim to have Numba and Bodo supporting this same interface, and possibly others.
