Skip to content

ENH: Add numba engine to df.apply #55104

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 26 commits into from
Oct 22, 2023
Merged

Conversation

lithomas1
Copy link
Member

@lithomas1 lithomas1 commented Sep 12, 2023

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Performance is a mixed bag, it is sometimes faster and sometimes slower than Python.
This is because there is a huge cost (esp. in boxing the numba representations back into Series/Index since the constructors are slow).
Right now, in the no-op case(return self), we are around 5-10x slower - I'm hoping to bring that down to around 2x.
(There is an issue with the Index being repeatedly unboxed unnecessarily)

EDIT: I think I've closed the gap, the difference isn't much now.

We are also ~10x faster in a normalization (subtract mean, divide by std. dev) test I did.

@lithomas1 lithomas1 added Apply Apply, Aggregate, Transform, Map numba numba-accelerated operations labels Sep 12, 2023
@lithomas1
Copy link
Member Author

@mroeschke

This should be ready for a first pass now.
I know this is a lot of code, as of right now, this PR implements a pretty minimal subset of the pandas features, which are:

  1. Numba versions of DataFrame, Series, Index
  2. Common Series methods that have a numpy equivalent
    a. Think sum, var, mean, and co.
  3. Basic indexing support
    a. It mimics our khash stuff, just with a numba dictionary instead. We might look into re-using the khash code, it would require some Cython code to expose that to numba though.

As of right now, there isn't support for things such as the Arrow dtypes, and non-numeric Indexes, but those shouldn't be hard to add in the future - I just didn't think the maintenance burden would be worth it as of right now.

(If you want, I can try to split up the PR, but because 1 is still a lot of lines, its still gonna be pretty big to review unfortunately).

Performance is decent as of now.
We are generally anywhere between 2-10x faster when going row-wise, but can be slower in cases where there are few rows/columns.

Most of the time is spent boxing/unboxing (converting to/from) the numba representation, and the pandas representation of the DataFrame. In the future, we could optimize this to potentially get like 100x speedups.

This should be fairly fixable with some effort, when just need to re-implement the concat/results wrapping that apply does in numba. I've held off on this because I've written a lot of code so far, and like before, I'd like to see people use this feature before I add to the maintenance burden of the numba code.

Sadly, multithreading doesn't yet work - numba has no thread safe structures, but this won't be relevant until the unboxing/boxing overhead can be dealt with.

@lithomas1 lithomas1 marked this pull request as ready for review September 25, 2023 21:46
@lithomas1 lithomas1 requested a review from mroeschke September 25, 2023 21:46
Copy link
Member Author

@lithomas1 lithomas1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments to make review easier.


# TODO: Range index support
# (not passing an index to series constructor doesn't work)
class IndexType(types.Type):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block just defines the types for Index and Series, there isn't much to see here.

It is pretty boilerplate and standard.



@typeof_impl.register(Index)
def typeof_index(val, c):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block is where we teach numba to recognize our pandas objects as something that can be lowered into numba.

Again, pretty boilerplate and standard.



@type_callable(Series)
def type_series_constructor(context):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These define the types that the Series/Index constructors take in when you call the Series/Index constructors inside numba code.

It's fairly uninteresting. Note that the actual implementation of the constructors are further down - think of these declarations as kind of like a C prototype function.



# Backend extensions for Index and Series and Frame
@register_model(IndexType)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This defines the numba representations of index/series.

Only interesting thing here is that Index has a pointer to the original index object, so we can avoid calling the index constructor and then just return that object.

Also, we add a hashmap to the index for indexing purposes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the hashmap support duplicate values like a pandas Index would?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I will update and add some tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, it's probably easier for now to disallow duplicate indexes.

I don't most frames have duplicate columns/indexes.



@lower_builtin(Series, types.Array, IndexType)
def pdseries_constructor(context, builder, sig, args):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Constructor implementations.



@unbox(IndexType)
def unbox_index(typ, obj, c):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code that transforms Series/Index -> numba representations and back.

There's a lot of C API stuff here, it's maybe worth a closer look if you want.

# and also add common binops (e.g. add, sub, mul, div)


def generate_series_reduction(ser_reduction, ser_method):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code to generate reductions (mean, std, var) and binops (addition, subtraction, multiplication)



# get_loc on Index
@overload_method(IndexType, "get_loc")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The indexing code goes from here to the end of the file, its maybe worth a closer look if you want to make sure it aligns with what we actually do.

@lithomas1
Copy link
Member Author

gentle ping @mroeschke

results = {}
for j in range(values.shape[1]):
# Create the series
ser = Series(values[:, j], index=df_index, name=str(col_names[j]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to str cast col_names[j]? Technically name could be any hashable value

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I restrict it to only allow string names.
(e.g. here https://github.com/pandas-dev/pandas/pull/55104/files#diff-2257b34410aee27eb14e348b9545fef2e212ff93bd72af02d700ae8df43d97bbR1153-R1158)

The cast to string is a quirk of my hack.
(I convert the column names to a numpy string array, so each element is a numpy string that I then need to convert to a regular numba unicode value)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nvm, I think I see what you mean.

I think I need to change it so it only cassts to string when the index is str dtype.

)
col_names_values = orig_values.astype("U")
# Remember to set this back!
self.columns._data = col_names_values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this need to be assigned to ._data? Couldn't a copy of the columns just be passed to nb_func?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh is this needed due to how the numba extension is defined?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, in Index, we don't allow numpy string dtypes, but my hack uses numpy string dtypes, since those already have a native representation in numba.

Let me know if this is too hacky.

@numba.jit(nogil=nogil, nopython=nopython, parallel=parallel)
def numba_func(values, col_names, df_index):
results = {}
for j in range(values.shape[1]):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for j in range(values.shape[1]):
for j in numba.prange(values.shape[1]):

? (and below)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out!

I think for now it'll probably make better sense to disable parallel mode for now, since the dict in numba isn't thread-safe.

The overhead from the boxing/unboxing is also really high (99% of the time spent is there), so I doubt parallel will give a good speedup, at least for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK makes sense. Would be good to put a TODO: comment explaining why we shouldn't use prange for now

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a comment.

@lithomas1 lithomas1 requested a review from mroeschke October 9, 2023 22:59
@lithomas1
Copy link
Member Author

@mroeschke
Can you take another look at this?

I've simplified the hacky parts a bit so we no longer clobber the _data attribute of an Index, and added some more tests.

There's still some work to be done on catching and testing all the unsupported cases(e.g. EAs, Arrow arrays, object dtype, etc.), but I'll put that in a followup PR to keep this one small.

@lithomas1 lithomas1 requested a review from mroeschke October 16, 2023 20:57
@mroeschke
Copy link
Member

Final question. What error do we currently get when trying to use apply with numba with an unsupported type (like datetime64[ns]?

@lithomas1
Copy link
Member Author

Final question. What error do we currently get when trying to use apply with numba with an unsupported type (like datetime64[ns]?

I added an error message and a test.

It's basically a ValueError saying the dtype in this column is unsupported.

@mroeschke mroeschke added this to the 2.2 milestone Oct 22, 2023
@mroeschke mroeschke merged commit ac5587c into pandas-dev:main Oct 22, 2023
@mroeschke
Copy link
Member

Great work on this @lithomas1. In a follow up, could you add a whatsnew note in 2.2?

@lithomas1 lithomas1 deleted the numba-apply branch October 22, 2023 21:41
@lithomas1
Copy link
Member Author

Thanks for the reviews - I think there should be a whatnsew from when I added support with raw apply.

This definetely deserves a bigger blurb, though. I'll add that sometime next week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Apply Apply, Aggregate, Transform, Map numba numba-accelerated operations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants