Description
We discussed this informally in the past; here I'm sharing more concretely how blosc2.jit and pandas can interact.
I'm about to open a PR in pandas to support this:
import numpy as np
import pandas
import blosc2

def my_func(x):
    return np.sin(x * 2)

s = pandas.Series([1, 2, 3], index=list('abc'), name='sample')

# normal call executed by pandas
print(s.map(my_func))

# we let blosc2 handle this
print(s.map(my_func, engine=blosc2.jit))
To be able to do this, blosc2 would need to implement a new interface. The implementation shouldn't be too complex, something like the following (the example ignores skip_na, and omits the other method, apply, which is used for column-wise operations, where the function is called with the whole array, not with each scalar):
import numpy as np
import blosc2

# Reference base class: https://github.com/pandas-dev/pandas/blob/main/pandas/core/apply.py#L77
class Blosc2ExecutionEngine:
    @staticmethod
    def map(data, func, args, kwargs, decorator, skip_na):
        if not isinstance(data, np.ndarray):
            # we probably received a Series
            if hasattr(data, "values"):
                data = data.values
            else:
                # there is a chance that we call this with a pyarrow object in the future
                raise ValueError(f"blosc2.jit does not support {type(data).__name__}")
        # decorator is the engine passed by the user (blosc2.jit here)
        func = decorator(func)
        # the jitted function receives the whole array, not each scalar
        result = func(data, *args, **kwargs)
        return result

# pandas looks up this attribute on the object passed as engine
blosc2.jit.__pandas_udf__ = Blosc2ExecutionEngine
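For the apply method that the example above leaves out, a rough sketch could look like the following. The signature (the same as map, with axis instead of skip_na) is my assumption about the interface, and ApplySketchEngine and identity are hypothetical names; identity stands in for blosc2.jit so the snippet runs without blosc2 installed.

```python
import numpy as np


class ApplySketchEngine:
    """Hypothetical engine showing only the column-wise apply method."""

    @staticmethod
    def apply(data, func, args, kwargs, decorator, axis):
        # Assumed signature: like map, with axis in place of skip_na
        values = np.asarray(data)
        func = decorator(func)
        if values.ndim == 1:
            # A single column: one call with the whole array
            return func(values, *args, **kwargs)
        # axis=0 calls func once per column, axis=1 once per row
        slices = values.T if axis == 0 else values
        out = np.asarray([func(s, *args, **kwargs) for s in slices])
        # Restore the original orientation for element-wise results
        return out.T if axis == 0 and out.ndim == 2 else out


def identity(func):
    """Stand-in for blosc2.jit so the sketch runs without blosc2."""
    return func


table = np.arange(6.0).reshape(3, 2)
doubled = ApplySketchEngine.apply(table, lambda x: x * 2, (), {}, identity, 0)
col_sums = ApplySketchEngine.apply(table, np.sum, (), {}, identity, 0)
```

A real engine would probably hand the whole 2D array to the jitted function instead of looping in Python; the loop here is only meant to make the axis semantics explicit.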
The advantage of this approach over just decorating the function is that the whole execution loop can be jitted, not only the individual calls.
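To make the hand-off concrete, here is a minimal pure-Python sketch of how a caller could dispatch on the __pandas_udf__ attribute. This is not pandas' actual code: dispatch_map, ToyEngine and toy_jit are hypothetical names, and toy_jit does no compilation. It only illustrates that the engine receives the whole data and owns the loop, and that the engine side never imports pandas.

```python
import math


class ToyEngine:
    """Hypothetical engine with the same map() signature as the example above."""

    @staticmethod
    def map(data, func, args, kwargs, decorator, skip_na):
        func = decorator(func)
        # The engine owns this loop, so it is free to compile or vectorize it
        return [func(x, *args, **kwargs) for x in data]


def toy_jit(func):
    """Stand-in for a decorator such as blosc2.jit; it does no compilation."""
    return func


# Attach the engine to the decorator, mirroring blosc2.jit.__pandas_udf__ = ...
toy_jit.__pandas_udf__ = ToyEngine


def dispatch_map(data, func, engine=None, args=(), kwargs=None, skip_na=False):
    """Hypothetical caller-side dispatch; an assumption, not pandas internals."""
    kwargs = kwargs or {}
    if engine is None:
        # Default path: the caller runs the loop itself
        return [func(x, *args, **kwargs) for x in data]
    udf_engine = getattr(engine, "__pandas_udf__", None)
    if udf_engine is None:
        raise ValueError(f"{engine!r} does not define __pandas_udf__")
    # The engine object doubles as the decorator argument
    return udf_engine.map(data, func, args, kwargs, engine, skip_na)


def double_sin(x):
    return math.sin(x * 2)


default_result = dispatch_map([1, 2, 3], double_sin)
engine_result = dispatch_map([1, 2, 3], double_sin, engine=toy_jit)
```

Because the lookup is plain duck typing on one attribute, a library only has to set that attribute on its decorator to opt in.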
What do you think? Is this something you'd like to implement? Any feedback? It's designed in a way that you don't need to add a dependency on pandas. We aim to have Numba and Bodo supporting this same interface, and possibly others.