ENH: Add numba engine to df.apply #55104
Conversation
This should be ready for a first pass now.
As of right now, there isn't support for things such as the Arrow dtypes and non-numeric Indexes, but those shouldn't be hard to add in the future; I just didn't think the maintenance burden would be worth it yet. (If you want, I can try to split up the PR, but because part 1 alone is still a lot of lines, it's still going to be pretty big to review, unfortunately.)

Performance is decent as of now. Most of the time is spent boxing/unboxing, i.e. converting between the numba representation and the pandas representation of the DataFrame. In the future, we could optimize this to potentially get something like 100x speedups. This should be fairly fixable with some effort; we just need to re-implement in numba the concat/results wrapping that apply does. I've held off on this because I've written a lot of code so far, and as before, I'd like to see people use this feature before I add to the maintenance burden of the numba code.

Sadly, multithreading doesn't work yet - numba has no thread-safe structures - but this won't be relevant until the unboxing/boxing overhead can be dealt with.
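For anyone who wants to try it out, usage looks roughly like this (a sketch; only the Series reductions/binops registered in the extension are supported inside the applied function, and the data/column names here are just illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100_000, 4), columns=["a", "b", "c", "d"])

def normalize(ser):
    # mean/std and the arithmetic binops are implemented in the numba
    # extension, so this function compiles in nopython mode
    return (ser - ser.mean()) / ser.std()

# existing Python engine
expected = df.apply(normalize)

# numba engine added by this PR; the function is jitted on first call
result = df.apply(normalize, engine="numba")
```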
Added some comments to make review easier.
# TODO: Range index support
# (not passing an index to series constructor doesn't work)
class IndexType(types.Type):
This block just defines the types for Index and Series; there isn't much to see here.
It's pretty boilerplate and standard.
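For context, the type definition is roughly (a sketch; the attributes are just how I've set things up, not anything numba requires):

```python
from numba.core import types

class IndexType(types.Type):
    """Numba-side type for a pandas Index backed by a 1-D numpy array."""

    def __init__(self, dtype, layout, pyclass):
        self.pyclass = pyclass  # the original Index subclass, used when boxing
        self.dtype = dtype      # numba dtype of the underlying values
        self.layout = layout    # numpy array layout, e.g. "C"
        super().__init__(name=f"index({dtype}, {layout}, {pyclass})")

    @property
    def key(self):
        # numba uses this to compare and cache type instances
        return self.pyclass, self.dtype, self.layout

    @property
    def as_array(self):
        # the array type the underlying Index data lowers to
        return types.Array(self.dtype, 1, self.layout)
```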
@typeof_impl.register(Index)
def typeof_index(val, c):
This block is where we teach numba to recognize our pandas objects as something that can be lowered into numba.
Again, pretty boilerplate and standard.
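Roughly, the registration looks like this (a sketch; it just reuses numba's own typing of the backing numpy array to pick the dtype/layout):

```python
from numba.extending import typeof_impl
from pandas import Index

@typeof_impl.register(Index)
def typeof_index(val, c):
    # `val` is the concrete Index instance seen at dispatch time; only
    # numpy-backed Indexes are supported, so ._data is a numpy array here
    arrty = typeof_impl(val._data, c)
    return IndexType(arrty.dtype, arrty.layout, type(val))
```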
@type_callable(Series)
def type_series_constructor(context):
These define the argument types that the Series/Index constructors accept when you call them inside numba code.
It's fairly uninteresting. Note that the actual implementations of the constructors are further down; think of these declarations as something like a C function prototype.
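Roughly (a sketch; the SeriesType constructor signature here is assumed for illustration):

```python
from numba.core import types
from numba.extending import type_callable
from pandas import Series

@type_callable(Series)
def type_series_constructor(context):
    def typer(data, index):
        # only 1-D numpy-backed data paired with one of our Index types
        if isinstance(data, types.Array) and isinstance(index, IndexType):
            assert data.ndim == 1
            return SeriesType(data.dtype, index)

    return typer
```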
# Backend extensions for Index and Series and Frame
@register_model(IndexType)
This defines the numba representations of Index/Series.
The only interesting thing here is that the Index keeps a pointer to the original Index object, so boxing can avoid calling the Index constructor and just return that object.
We also add a hashmap to the Index for indexing purposes.
Does the hashmap support duplicate values like a pandas Index would?
Good point, I will update and add some tests.
On second thought, it's probably easier for now to disallow duplicate indexes.
I don't think most frames have duplicate columns/indexes.
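For reference, the data model described above looks roughly like this (a sketch; member names are illustrative, and duplicates are disallowed as discussed):

```python
from numba.core import types
from numba.extending import models, register_model, make_attribute_wrapper

@register_model(IndexType)
class IndexModel(models.StructModel):
    def __init__(self, dmm, fe_type):
        members = [
            # the underlying numpy array of values
            ("data", fe_type.as_array),
            # value -> position hashmap used for indexing (get_loc)
            ("hashmap", types.DictType(fe_type.dtype, types.intp)),
            # pointer back to the original Index object, so boxing can just
            # return it instead of reconstructing an Index
            ("parent", types.pyobject),
        ]
        super().__init__(dmm, fe_type, members)

# expose the data member to jitted code, e.g. as index._data
make_attribute_wrapper(IndexType, "data", "_data")
```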
@lower_builtin(Series, types.Array, IndexType)
def pdseries_constructor(context, builder, sig, args):
Constructor implementations.
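The lowering itself is short; roughly (a sketch - the struct member names come from the assumed Series data model and are illustrative):

```python
from numba.core import cgutils, types
from numba.core.imputils import impl_ret_borrowed
from numba.extending import lower_builtin
from pandas import Series

@lower_builtin(Series, types.Array, IndexType)
def pdseries_constructor(context, builder, sig, args):
    data, index = args
    # allocate the Series struct for the return type and fill in its members
    series = cgutils.create_struct_proxy(sig.return_type)(context, builder)
    series.index = index
    series.values = data
    return impl_ret_borrowed(context, builder, sig.return_type, series._getvalue())
```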
@unbox(IndexType)
def unbox_index(typ, obj, c):
Code that transforms Series/Index -> numba representations and back.
There's a lot of C API stuff here; it's maybe worth a closer look if you want.
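To give a flavor of the boxing direction (a simplified sketch; the real code also has to handle an Index created inside the jitted function, where the parent pointer is null, plus error checking):

```python
from numba.core import cgutils
from numba.extending import box

@box(IndexType)
def box_index(typ, val, c):
    # the data model keeps a "parent" pyobject pointer to the original Index,
    # so boxing can usually just incref that object and hand it straight back
    # instead of calling the (slow) Index constructor
    index = cgutils.create_struct_proxy(typ)(c.context, c.builder, value=val)
    c.pyapi.incref(index.parent)
    return index.parent
```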
# and also add common binops (e.g. add, sub, mul, div)
def generate_series_reduction(ser_reduction, ser_method): |
Code to generate reductions (mean, std, var) and binops (addition, subtraction, multiplication).
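The generator is basically a small factory around overload_method; roughly (a sketch - assumes the SeriesType above and that the numpy equivalents are usable in nopython mode):

```python
import numpy as np
from numba.extending import overload_method

def generate_series_reduction(ser_reduction, ser_method):
    # ser_reduction: name of the Series method to expose ("mean", "std", ...)
    # ser_method: a numpy function numba supports in nopython mode
    @overload_method(SeriesType, ser_reduction)
    def series_reduction(series):
        def series_reduction_impl(series):
            return ser_method(series.values)

        return series_reduction_impl

    return series_reduction

# one registration per supported reduction
series_mean = generate_series_reduction("mean", np.mean)
series_std = generate_series_reduction("std", np.std)
series_var = generate_series_reduction("var", np.var)
```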
# get_loc on Index
@overload_method(IndexType, "get_loc")
The indexing code goes from here to the end of the file; it's maybe worth a closer look if you want to make sure it aligns with what we actually do.
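The core of it is an overload_method backed by the hashmap from the data model; roughly (a sketch - attribute names are illustrative, and the real code also deals with misses and string dtypes):

```python
from numba.extending import overload_method

@overload_method(IndexType, "get_loc")
def index_get_loc(index, item):
    def index_get_loc_impl(index, item):
        # build the value -> position hashmap lazily on first use
        if len(index.hashmap) == 0:
            for i in range(len(index._data)):
                index.hashmap[index._data[i]] = i
        # duplicates are disallowed for now, so each value maps to one position
        return index.hashmap[item]

    return index_get_loc_impl
```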
gentle ping @mroeschke
pandas/core/apply.py
results = {}
for j in range(values.shape[1]):
    # Create the series
    ser = Series(values[:, j], index=df_index, name=str(col_names[j]))
Why do we need to str-cast col_names[j]? Technically name could be any hashable value.
I restrict it to only allow string names (e.g. here: https://github.com/pandas-dev/pandas/pull/55104/files#diff-2257b34410aee27eb14e348b9545fef2e212ff93bd72af02d700ae8df43d97bbR1153-R1158).
The cast to string is a quirk of my hack: I convert the column names to a numpy string array, so each element is a numpy string that I then need to convert to a regular numba unicode value.
Nvm, I think I see what you mean.
I think I need to change it so it only casts to string when the index is str dtype.
pandas/core/apply.py
)
col_names_values = orig_values.astype("U")
# Remember to set this back!
self.columns._data = col_names_values
Why does this need to be assigned to ._data? Couldn't a copy of the columns just be passed to nb_func?
Oh is this needed due to how the numba extension is defined?
Yeah, in Index, we don't allow numpy string dtypes, but my hack uses numpy string dtypes, since those already have a native representation in numba.
Let me know if this is too hacky.
@numba.jit(nogil=nogil, nopython=nopython, parallel=parallel)
def numba_func(values, col_names, df_index):
    results = {}
    for j in range(values.shape[1]):
for j in range(values.shape[1]): -> for j in numba.prange(values.shape[1]):
? (and below)
Thanks for pointing this out!
I think it'll probably make more sense to disable parallel mode for now, since the dict in numba isn't thread-safe.
The overhead from the boxing/unboxing is also really high (99% of the time is spent there), so I doubt parallel would give a good speedup, at least for now.
OK makes sense. Would be good to put a TODO: comment explaining why we shouldn't use prange for now
Added a comment.
@mroeschke I've simplified the hacky parts a bit so we no longer clobber the columns' ._data. There's still some work to be done on catching and testing all the unsupported cases (e.g. EAs, Arrow arrays, object dtype, etc.), but I'll put that in a followup PR to keep this one small.
Final question: what error do we currently get when trying to use apply with numba with an unsupported type?
I added an error message and a test. It's basically a ValueError saying the dtype in this column is unsupported.
Great work on this @lithomas1. In a follow-up, could you add a whatsnew note in 2.2?
Thanks for the reviews - I think there should be a whatsnew entry from when I added support for raw apply. This definitely deserves a bigger blurb, though. I'll add that sometime next week.
Performance is a mixed bag; it is sometimes faster and sometimes slower than Python. This is because there is a huge cost, especially in boxing the numba representations back into Series/Index, since the constructors are slow.
Right now, in the no-op case (the applied function just returns its argument), we are around 5-10x slower - I'm hoping to bring that down to around 2x.
(There is an issue with the Index being repeatedly unboxed unnecessarily.)
EDIT: I think I've closed the gap; the difference isn't much now.
We are also ~10x faster in a normalization (subtract the mean, divide by the std. dev.) test I did.