A more low level implementation of vectorize in numba #92
Conversation
layout = aryty.layout
return (data, shape, strides, layout)

# TODO I think this is better than the noalias attribute
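For context, here is a minimal llvmlite sketch of attaching the `noalias` attribute to a pointer argument, the mechanism being discussed. The function name and body are illustrative, not code from this PR:

```python
import llvmlite.ir as ir

# Build a tiny module with one function taking a double* argument.
mod = ir.Module(name="noalias_demo")
double = ir.DoubleType()
fnty = ir.FunctionType(ir.VoidType(), [ir.PointerType(double)])
fn = ir.Function(mod, fnty, name="write_one")

# Mark the pointer as noalias; LLVM can then assume no other argument
# aliases this memory, which helps the vectorizer.
fn.args[0].add_attribute("noalias")

builder = ir.IRBuilder(fn.append_basic_block("entry"))
builder.store(ir.Constant(double, 1.0), fn.args[0])
builder.ret_void()

print(mod)  # the printed IR carries "noalias" on the argument
```

Per-call-site aliasing metadata (what llvmlite#895 would enable) is finer-grained than this per-argument attribute, which is what the thread is weighing.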
Maybe we can ask for a release?
Would need to be merged first :-)
numba/llvmlite#895
Also, I don't think this is a major issue. The current noalias attribute for the output arrays should cover most (if not all) cases as well.
I think I finally got there, tests might just pass this time :-)
I'll try to review soon. In the meantime do you want to add 1 or 2 benchmark tests? #139
@aseyboldt Any more timing results?
Codecov Report
Additional details and impacted files

@@            Coverage Diff             @@
##              main      #92      +/-  ##
==========================================
- Coverage    79.98%   79.95%   -0.03%
==========================================
  Files          169      170       +1
  Lines        44607    44848     +241
  Branches      9426     9493      +67
==========================================
+ Hits         35678    35859     +181
- Misses        6738     6778      +40
- Partials      2191     2211      +20
@twiecki The above benchmark hasn't really changed. @ricardoV94 added the code from the description as benchmark.
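The benchmark code itself is not reproduced in this thread. As a hedged sketch, an elemwise-plus-reduction workload of the kind this PR targets might look like the following; all names, shapes, and the expression are assumptions, not the PR's actual benchmark:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=(1000,))      # vector, broadcast against the matrix
y = rng.normal(size=(50, 1000))   # matrix

def logsumexp_elemwise(x, y):
    # An elemwise op (exp of a broadcast sum) followed by a reduction:
    # the fusion pattern the new lowering is meant to vectorize.
    return np.log(np.sum(np.exp(x + y), axis=1))

out = logsumexp_elemwise(x, y)
print(out.shape)  # one reduced value per row of y
```

With pytest-benchmark, wrapping the call in the `benchmark` fixture would give the timing numbers asked about above.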
@@ -0,0 +1,240 @@
from typing import Any, List, Optional, Tuple
I can guess the answer to this question... but do we need to go to this level? Can't we add some shape asserts in a vanilla numba function as a way to get the speedup, without us having to do all this?
Some shape asserts won't help I'm afraid. :-)
I think it is pretty clear that we want support for reductions during elemwise, and we also want to support multiple outputs. This means that the built-in numba vectorize won't do it.
So as far as I can see we have two options to get those:
- Build numba functions using string processing
- Use the llvm interface
I think working with strings is actually harder and more error-prone than working with the llvm code directly. (Have a look at the Reduction functions in this module; I think that's even more complicated.) Also, if we think about what happens if we do this, it would look like:
- We start with a properly typed graph
- We (conditionally) generate strings of numba code that have no type info
- The numba code gets evaluated and transformed to bytecode
- The bytecode is analysed by numba
- Numba tries to figure out types for all the variables (those types that we discarded earlier, hopefully)
- Numba uses code just like in the PR to build an llvm module, based on the bytecode
I don't see why we would want those in-between steps to happen? pytensor is a compiler, and I don't think we can reasonably expect to get around doing what compilers do: generate code. ;-)
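To make the string-processing option concrete, a minimal sketch of that round trip (the helper name and the generated source are hypothetical, not part of this PR): the types known at graph-build time are discarded, and numba would have to re-infer them from the generated function's bytecode before lowering to llvm anyway.

```python
def make_elemwise_source(op_expr: str) -> str:
    # Hypothetical codegen helper: emit untyped Python source for one
    # elemwise op. All type info from the pytensor graph is lost here.
    return f"def elemwise(a, b):\n    return {op_expr}\n"

src = make_elemwise_source("a + b")
namespace = {}
exec(compile(src, "<generated>", "exec"), namespace)
elemwise = namespace["elemwise"]

# numba.njit(elemwise) would now analyse the bytecode and re-infer the
# types we already knew, before finally building an llvm module much
# like this PR builds directly.
print(elemwise(2.0, 3.0))  # 5.0
```

Going through llvmlite directly skips the exec/bytecode/type-inference detour entirely, which is the argument made above.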
The long term problem I see here is maintenance. This is the backbone of the library, and I don't know how much time it would take me (or another dev) to sort out a bug in the future. I don't know how volatile these functions are, or how well documented they are.
Yeah, the move from "we build a stats library" to "let's build a compiler" is a bit sudden :-)
@@ -117,6 +117,25 @@ def test_Elemwise(inputs, input_vals, output_fn, exc):
    compare_numba_and_py(out_fg, input_vals)
def test_elemwise_speed(benchmark): |
Can you add a case with broadcasting?
That one is using broadcasting, isn't it?
Hmm I guess I wanted non-dimshuffled broadcasting because that's the one we've been thinking about, but should be the same?
Elemwise will always call dimshuffle in make_node anyway, so there shouldn't be a non-dimshuffled case?
By non-dimshuffled I mean matrix + row/col
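For reference, "matrix + row/col" broadcasting in NumPy terms (shapes chosen arbitrarily for illustration). As noted above, Elemwise's make_node inserts the DimShuffle that produces the explicit size-1 axes, so by the time the numba lowering runs, both operands already have the same number of dimensions:

```python
import numpy as np

matrix = np.arange(12.0).reshape(3, 4)
row = np.arange(4.0).reshape(1, 4)   # size-1 axis broadcast along axis 0
col = np.arange(3.0).reshape(3, 1)   # size-1 axis broadcast along axis 1

# Shapes are aligned from the right; size-1 axes are stretched.
print((matrix + row).shape)  # (3, 4)
print((matrix + col).shape)  # (3, 4)
```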
LGTM, though I'm not happy with having to write low-level code that seems as complex as, if not more complex than, our C implementation. Left some small questions/comments above.
Yeah, I'd also feel more comfortable if we demonstrated the benefit of adding this much complexity a bit more.
The docs failure looks like it is because of the Sphinx 6.0 release, which dropped py37 support: https://www.sphinx-doc.org/en/master/changes.html#id9
Does that post above convince you already?
So it's not just faster but actually allows us to do things we couldn't do before, i.e. reductions during elemwise. In that case I'm convinced.
@aseyboldt a rebase should fix the docs build |
Co-authored-by: Ricardo Vieira <[email protected]>
As discussed in #70, this rewrites the numba implementation using llvm intrinsics.
This way we can use broadcasting information, supply llvm with the aliasing information required for vectorization, support multiple outputs of elemwise, and prepare for ElemwiseSum-like optimizations (the llvm code already supports that).
Locally, there are still some test failures due to execution in python mode that I will have to work out.
This can lead to sizable performance improvements (with svml installed):