ENH: Add numba engine to df.apply #55104

lithomas1 · 2023-09-12T01:24:35Z

closes #xxxx (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Performance is a mixed bag: it is sometimes faster and sometimes slower than Python.
This is because there is a huge cost, especially in boxing the numba representations back into Series/Index, since the constructors are slow.
Right now, in the no-op case (return self), we are around 5-10x slower; I'm hoping to bring that down to around 2x.
(There is an issue with the Index being repeatedly unboxed unnecessarily.)

EDIT: I think I've closed the gap, the difference isn't much now.

We are also ~10x faster in a normalization (subtract mean, divide by std. dev) test I did.
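
For anyone who wants to try a test along these lines, here is a minimal sketch of a normalization `apply`. The frame, the `normalize` function, and the fallback are illustrative assumptions, not the benchmark used in this PR. `engine="numba"` needs numba installed and a pandas version that ships this feature, and the engine may not support every Series method used here, so the sketch falls back to the default Python engine if the numba path fails.

```python
import numpy as np
import pandas as pd


def normalize(col):
    # Subtract the mean and divide by the standard deviation.
    return (col - col.mean()) / col.std()


# Illustrative data: float columns with string names and a default index.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((1_000, 4)), columns=["a", "b", "c", "d"])

try:
    # Assumes numba is installed and the engine supports what normalize uses.
    result = df.apply(normalize, engine="numba")
except Exception:
    # Fallback: the default Python engine.
    result = df.apply(normalize)
```

Timing both paths with `timeit` on frames of different shapes is the simplest way to see where the numba engine pays off.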

lithomas1 · 2023-09-25T21:46:18Z

@mroeschke

This should be ready for a first pass now.
I know this is a lot of code. As of right now, this PR implements a pretty minimal subset of the pandas features, which are:

1. Numba versions of DataFrame, Series, Index
2. Common Series methods that have a numpy equivalent
   a. Think sum, var, mean, and co.
3. Basic indexing support
   a. It mimics our khash stuff, just with a numba dictionary instead. We might look into re-using the khash code, but it would require some Cython code to expose that to numba.

As of right now, there isn't support for things such as the Arrow dtypes, and non-numeric Indexes, but those shouldn't be hard to add in the future - I just didn't think the maintenance burden would be worth it as of right now.

(If you want, I can try to split up the PR, but because 1 is still a lot of lines, it's still going to be pretty big to review, unfortunately.)

Performance is decent as of now.
We are generally anywhere between 2-10x faster when going row-wise, but can be slower in cases where there are few rows/columns.

Most of the time is spent boxing/unboxing (converting between the numba representation and the pandas representation of the DataFrame). In the future, we could optimize this to potentially get something like 100x speedups.

This should be fairly fixable with some effort; we just need to re-implement the concat/results wrapping that apply does in numba. I've held off on this because I've written a lot of code so far, and like before, I'd like to see people use this feature before I add to the maintenance burden of the numba code.
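
As a rough, pure-Python analogy for this overhead (it does not measure numba boxing itself), the sketch below compares reducing each row directly on the NumPy buffer with first wrapping each row in a Series. The array shape and column names are illustrative assumptions.

```python
import time

import numpy as np
import pandas as pd

values = np.random.default_rng(0).standard_normal((2_000, 5))
columns = pd.Index(["a", "b", "c", "d", "e"])
n_rows = values.shape[0]

# Work only: reduce each row directly on the NumPy buffer.
start = time.perf_counter()
raw_sums = [values[i].sum() for i in range(n_rows)]
raw_seconds = time.perf_counter() - start

# Work plus wrapping: build a Series per row first, then reduce.
start = time.perf_counter()
boxed_sums = [pd.Series(values[i], index=columns).sum() for i in range(n_rows)]
boxed_seconds = time.perf_counter() - start

print(f"raw: {raw_seconds:.4f}s, with Series construction: {boxed_seconds:.4f}s")
```

The gap between the two timings gives a feel for how constructor cost can dominate when the per-row work is small.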

Sadly, multithreading doesn't work yet, since numba has no thread-safe structures, but this won't be relevant until the unboxing/boxing overhead can be dealt with.

lithomas1

Added some comments to make review easier.

lithomas1 · 2023-09-25T21:47:51Z

pandas/core/_numba/extensions.py

+
+# TODO: Range index support
+# (not passing an index to series constructor doesn't work)
+class IndexType(types.Type):


This block just defines the types for Index and Series, there isn't much to see here.

It is pretty boilerplate and standard.

lithomas1 · 2023-09-25T21:48:24Z

pandas/core/_numba/extensions.py

+
+
+@typeof_impl.register(Index)
+def typeof_index(val, c):


This block is where we teach numba to recognize our pandas objects as something that can be lowered into numba.

Again, pretty boilerplate and standard.

lithomas1 · 2023-09-25T21:50:09Z

pandas/core/_numba/extensions.py

+
+
+@type_callable(Series)
+def type_series_constructor(context):


These define the argument types that the Series/Index constructors accept when you call them inside numba code.

It's fairly uninteresting. Note that the actual implementations of the constructors are further down - think of these declarations as kind of like a C function prototype.

lithomas1 · 2023-09-25T21:51:15Z

pandas/core/_numba/extensions.py

+
+
+# Backend extensions for Index and Series and Frame
+@register_model(IndexType)


This defines the numba representations of index/series.

Only interesting thing here is that Index has a pointer to the original index object, so we can avoid calling the index constructor and then just return that object.

Also, we add a hashmap to the index for indexing purposes.

Does the hashmap support duplicate values like a pandas Index would?

Good point, I will update and add some tests.

On second thought, it's probably easier for now to disallow duplicate indexes.

I don't think most frames have duplicate columns/indexes.
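
To make the trade-off in this thread concrete, here is a minimal sketch of a label-to-position hashmap that rejects duplicates. It uses a plain Python dict as a stand-in for the numba dictionary, and `build_position_map` is a hypothetical name, not a function from this PR. For comparison, `Index.get_loc` in pandas returns a slice or boolean mask for duplicated labels, which a one-to-one map like this cannot represent.

```python
def build_position_map(labels):
    """Map each label to its integer position, rejecting duplicates."""
    positions = {}
    for i, label in enumerate(labels):
        if label in positions:
            raise ValueError(f"duplicate label {label!r} is not supported")
        positions[label] = i
    return positions


def get_loc(positions, label):
    """Return the position of label; raises KeyError if it is missing."""
    return positions[label]


positions = build_position_map(["a", "b", "c"])
print(get_loc(positions, "b"))  # 1
```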

lithomas1 · 2023-09-25T21:51:47Z

pandas/core/_numba/extensions.py

+
+
+@lower_builtin(Series, types.Array, IndexType)
+def pdseries_constructor(context, builder, sig, args):


Constructor implementations.

lithomas1 · 2023-09-25T21:52:27Z

pandas/core/_numba/extensions.py

+
+
+@unbox(IndexType)
+def unbox_index(typ, obj, c):


Code that transforms Series/Index -> numba representations and back.

There's a lot of C API stuff here, it's maybe worth a closer look if you want.

lithomas1 · 2023-09-25T21:53:10Z

pandas/core/_numba/extensions.py

+# and also add common binops (e.g. add, sub, mul, div)
+
+
+def generate_series_reduction(ser_reduction, ser_method):


Code to generate reductions (mean, std, var) and binops (addition, subtraction, multiplication)
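
The generator pattern described here can be illustrated in plain Python. In this sketch only the `generate_series_reduction(ser_reduction, ser_method)` signature comes from the diff. The body and the `ToySeries` class are assumptions for illustration; the real code registers numba overloads rather than setting attributes on a class.

```python
import numpy as np


class ToySeries:
    # Stand-in for the numba-side Series: it just wraps a NumPy array.
    def __init__(self, values):
        self.values = np.asarray(values, dtype=np.float64)


def generate_series_reduction(ser_reduction, ser_method):
    # Attach a method named ser_reduction that applies ser_method to the values.
    def reduction(self):
        return ser_method(self.values)

    reduction.__name__ = ser_reduction
    setattr(ToySeries, ser_reduction, reduction)


# One call per reduction instead of one hand-written method per reduction.
for name, func in [("sum", np.sum), ("mean", np.mean), ("min", np.min), ("max", np.max)]:
    generate_series_reduction(name, func)

s = ToySeries([1.0, 2.0, 3.0])
print(s.sum(), s.mean())  # 6.0 2.0
```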

lithomas1 · 2023-09-25T21:53:52Z

pandas/core/_numba/extensions.py

+
+
+# get_loc on Index
+@overload_method(IndexType, "get_loc")


The indexing code goes from here to the end of the file. It's maybe worth a closer look if you want to make sure it aligns with what we actually do.

lithomas1 · 2023-10-02T18:44:56Z

gentle ping @mroeschke

mroeschke · 2023-10-03T22:05:47Z

pandas/core/apply.py

+            results = {}
+            for j in range(values.shape[1]):
+                # Create the series
+                ser = Series(values[:, j], index=df_index, name=str(col_names[j]))


Why do we need to str cast col_names[j]? Technically name could be any hashable value

I restrict it to only allow string names.
(e.g. here https://github.com/pandas-dev/pandas/pull/55104/files#diff-2257b34410aee27eb14e348b9545fef2e212ff93bd72af02d700ae8df43d97bbR1153-R1158)

The cast to string is a quirk of my hack.
(I convert the column names to a numpy string array, so each element is a numpy string that I then need to convert to a regular numba unicode value)

Nvm, I think I see what you mean.

I think I need to change it so it only casts to string when the index is str dtype.
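
A small sketch of why the cast exists and why it should be conditional. Elements of a NumPy unicode array are `np.str_` scalars, while non-string labels are changed by `str()`. The `to_name` helper is hypothetical and only illustrates the "cast only for str dtype" idea from this thread.

```python
import numpy as np

# Column names converted to a fixed-width NumPy unicode array, as in the diff.
col_names = np.array(["a", "b"], dtype=object).astype("U")
first = col_names[0]
print(type(first).__name__, type(str(first)).__name__)  # str_ str

# Names are not always strings: casting an integer label changes it.
int_label = 0
print(str(int_label) == int_label)  # False


def to_name(label, is_str_dtype):
    # Hypothetical helper: cast only when the labels are strings.
    return str(label) if is_str_dtype else label
```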

mroeschke · 2023-10-03T22:09:01Z

pandas/core/apply.py

+                )
+            col_names_values = orig_values.astype("U")
+            # Remember to set this back!
+            self.columns._data = col_names_values


Why does this need to be assigned to ._data? Couldn't a copy of the columns just be passed to nb_func?

Oh is this needed due to how the numba extension is defined?

Yeah, in Index, we don't allow numpy string dtypes, but my hack uses numpy string dtypes, since those already have a native representation in numba.

Let me know if this is too hacky.

mroeschke · 2023-10-03T22:10:21Z

pandas/core/apply.py

+        @numba.jit(nogil=nogil, nopython=nopython, parallel=parallel)
+        def numba_func(values, col_names, df_index):
+            results = {}
+            for j in range(values.shape[1]):


Suggested change

-            for j in range(values.shape[1]):
+            for j in numba.prange(values.shape[1]):

? (and below)

Thanks for pointing this out!

I think it'll probably make more sense to disable parallel mode for now, since the dict in numba isn't thread-safe.

The overhead from the boxing/unboxing is also really high (99% of the time spent is there), so I doubt parallel will give a good speedup, at least for now.

OK makes sense. Would be good to put a TODO: comment explaining why we shouldn't use prange for now

added a comment.
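
For context on the suggestion, this is a general numba pattern and not what this PR does: a `prange` loop is safe when each iteration writes only to its own slot of a preallocated array, whereas inserting into a shared dict from several threads is not. The sketch assumes a simple column-sum workload and falls back to a serial, uncompiled loop if numba is not installed.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Fallback so the sketch still runs without numba (serially, uncompiled).
    prange = range

    def njit(*args, **kwargs):
        return lambda func: func


@njit(parallel=True)
def column_sums(values):
    # One preallocated slot per column: each iteration writes only out[j].
    out = np.empty(values.shape[1], dtype=np.float64)
    for j in prange(values.shape[1]):
        out[j] = values[:, j].sum()
    return out


values = np.arange(12, dtype=np.float64).reshape(4, 3)
print(column_sums(values))  # [18. 22. 26.]
```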

lithomas1 · 2023-10-12T19:36:46Z

@mroeschke
Can you take another look at this?

I've simplified the hacky parts a bit so we no longer clobber the _data attribute of an Index, and added some more tests.

There's still some work to be done on catching and testing all the unsupported cases(e.g. EAs, Arrow arrays, object dtype, etc.), but I'll put that in a followup PR to keep this one small.

mroeschke · 2023-10-17T15:54:53Z

Final question. What error do we currently get when trying to use apply with numba with an unsupported type (like datetime64[ns])?

lithomas1 · 2023-10-19T01:52:13Z

> Final question. What error do we currently get when trying to use apply with numba with an unsupported type (like datetime64[ns])?

I added an error message and a test.

It's basically a ValueError saying the dtype in this column is unsupported.
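
A minimal sketch of what such a check could look like. The `check_numba_supported` helper, its message, and its notion of "supported" (plain NumPy bool, integer, float, or complex dtypes) are assumptions for illustration; the actual check and wording in the PR may differ.

```python
import numpy as np
import pandas as pd


def check_numba_supported(df):
    # Hypothetical helper: accept only plain NumPy numeric/bool dtypes.
    for name, dtype in df.dtypes.items():
        if not (isinstance(dtype, np.dtype) and dtype.kind in "biufc"):
            raise ValueError(f"Column {name!r} has unsupported dtype {dtype}")


df = pd.DataFrame(
    {"x": [1.0, 2.0], "when": pd.to_datetime(["2023-10-17", "2023-10-19"])}
)
try:
    check_numba_supported(df)
except ValueError as err:
    print(err)
```

Failing early with a clear message like this is friendlier than letting an unsupported column surface as a numba typing error.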

mroeschke · 2023-10-22T19:52:50Z

Great work on this @lithomas1. In a follow up, could you add a whatsnew note in 2.2?

lithomas1 · 2023-10-22T21:42:51Z

Thanks for the reviews - I think there should be a whatsnew from when I added support with raw apply.

This definitely deserves a bigger blurb, though. I'll add that sometime next week.

ENH: Add numba engine to df.apply

1fa802c

lithomas1 added the Apply (Apply, Aggregate, Transform, Map) and numba (numba-accelerated operations) labels Sep 12, 2023

lithomas1 added 8 commits September 12, 2023 11:25

Merge branch 'main' of github.com:pandas-dev/pandas into numba-apply

c6af7c9

complete?

0ac544d

wip: pass tests

31b9e20

Merge branch 'main' of github.com:pandas-dev/pandas into numba-apply

6190772

fix existing tests

55df7ad

go for green

3c89b0f

fix checks?

1418d3e

fix pyright

c143c67

lithomas1 marked this pull request as ready for review September 25, 2023 21:46

lithomas1 requested a review from mroeschke September 25, 2023 21:46

lithomas1 commented Sep 25, 2023

lithomas1 added 4 commits September 28, 2023 16:45

update docs

0d827c4

Merge branch 'main' of github.com:pandas-dev/pandas into numba-apply

7129ee8

Merge branch 'main' into numba-apply

b0ba283

eliminate a blank line

f4e80a6

mroeschke reviewed Oct 3, 2023

pandas/core/_numba/extensions.py (outdated)

lithomas1 added 2 commits October 6, 2023 21:02

update from code review + more tests

21e2186

Merge branch 'main' of github.com:pandas-dev/pandas into numba-apply

b60bef8

lithomas1 requested a review from mroeschke October 9, 2023 22:59

lithomas1 added 3 commits October 10, 2023 10:02

fix failing tests

ba1d0e0

Simplify w/ context manager

088d27f

skip if no numba

60539a1

simplify more

76538d6

lithomas1 added 2 commits October 12, 2023 18:40

specify dtypes

cca34f9

Merge branch 'main' of github.com:pandas-dev/pandas into numba-apply

8b423bf

mroeschke reviewed Oct 16, 2023

pandas/tests/apply/test_numba.py (outdated)

mroeschke reviewed Oct 16, 2023

pandas/tests/apply/test_numba.py (outdated)

mroeschke reviewed Oct 16, 2023

pandas/core/_numba/extensions.py

lithomas1 added 3 commits October 16, 2023 16:30

Merge branch 'main' into numba-apply

f135def

Merge branch 'numba-apply' of github.com:lithomas1/pandas into numba-apply

b2e50d2

address code review

f86024f

lithomas1 requested a review from mroeschke October 16, 2023 20:57

add errors for invalid columns

a15293d

adjust message

8fe5d89

mroeschke added this to the 2.2 milestone Oct 22, 2023

mroeschke approved these changes Oct 22, 2023

mroeschke merged commit ac5587c into pandas-dev:main Oct 22, 2023

lithomas1 deleted the numba-apply branch October 22, 2023 21:41

jorisvandenbossche mentioned this pull request Sep 20, 2024

String dtype: allow string dtype for non-raw apply with numba engine #59854

Merged
