Inline constants in composite graphs #361
Conversation
Interesting: scalar constants were never inlined in the compiled thunks (the C functions are simply given a pointer to the Python list that will hold those constants). That reduces the number of compiled thunks, at the expense of compiler-level optimizations like the one mentioned in this PR: pytensor/pytensor/tensor/rewriting/math.py (Line 2083 in b9c4f20).
The optimization in this PR allows inlining some of those cases via the Composite graph.
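A quick way to see where a constant ends up is to inspect the compiled graph; a minimal sketch (the expression is illustrative, not taken from the PR's tests):

import numpy as np
import pytensor
import pytensor.tensor as pt

x = pt.vector("x")
out = x ** 2.5                      # 2.5 is a scalar constant feeding the fused Elemwise
fn = pytensor.function([x], out)

# The debug print shows whether the constant appears as a separate outer input
# of the Composite/Elemwise or has been inlined into its inner scalar graph.
pytensor.dprint(fn)
fn(np.ones(3))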
@aseyboldt I think the functionality is working. Would you be interested in doing a quick Numba benchmark?
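For reference, a rough sketch of the kind of quick benchmark being asked for here (the expression and array size are made up; FAST_RUN and NUMBA are the standard pytensor compilation modes):

import timeit
import numpy as np
import pytensor
import pytensor.tensor as pt

x = pt.vector("x")
out = pt.exp(x) * 2.5 + 1.0          # fused Elemwise carrying scalar constants
xv = np.random.standard_normal(100_000)

for mode in ("FAST_RUN", "NUMBA"):
    fn = pytensor.function([x], out, mode=mode)
    fn(xv)                           # warm-up call (triggers numba compilation)
    print(mode, timeit.timeit(lambda: fn(xv), number=100))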
Codecov Report
Additional details and impacted files

@@            Coverage Diff            @@
##             main     #361      +/-  ##
=========================================
+ Coverage   80.40%   80.44%   +0.03%
=========================================
  Files         156      156
  Lines       45401    45483      +82
  Branches    11106    11138      +32
=========================================
+ Hits        36505    36588      +83
+ Misses       6689     6687       -2
- Partials     2207     2208       +1
new_outputs = Elemwise(new_composite_op).make_node(*new_outer_inputs).outputs

# Some of the inlined constants were broadcasting the output shape
if node.outputs[0].type.broadcastable != new_outputs[0].type.broadcastable:
How could this happen if we only changed scalar inputs?
get_unique_constant_value works for homogeneous constants of any rank.
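A minimal sketch of the idea in plain NumPy (not pytensor's actual helper, whose exact signature may differ): a constant of any rank whose entries are all equal can be collapsed to a single scalar value.

import numpy as np

def unique_constant_value(data):
    # Return the repeated value if all entries of `data` are equal, else None.
    flat = np.asarray(data).ravel()
    if flat.size and np.all(flat == flat[0]):
        return flat[0]
    return None

assert unique_constant_value(np.ones((3, 4))) == 1.0   # rank-2 homogeneous constant
assert unique_constant_value(np.arange(3)) is None     # not homogeneous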
In cases where the constant influenced the final shape of the Elemwise operation,
we broadcast (via Alloc) the new Elemwise result:

Elemwise(row, ones((3, 4))) -> Alloc(Elemwise(row, ones((1, 1))), 3, 4)
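A hedged illustration of that shape handling in pytensor terms (the variables below are made up; this is not the rewrite's own code):

import pytensor.tensor as pt

row = pt.row("row")                                # broadcastable pattern (True, False)
before = row * pt.ones((3, 4))                     # the constant forces the (3, 4) output shape
after = pt.alloc(row * pt.ones((1, 1)), 3, 4)      # constant inlined, shape restored via Alloc

assert before.type.broadcastable == after.type.broadcastable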
In this case the post-rewrite version might actually be slower than the pre-rewrite version.
After the rewrite we add an extra allocation and copy operation. I guess which one is faster in practice depends on how expensive the scalar_op of the Elemwise is...
Maybe we just shouldn't do that rewrite if we'd have to broadcast the output?
We can make Alloc a view so it doesn't re-allocate (or, as things stand now, use a BroadcastTo instead of Alloc).
I think the solution to #367 will fix it. Everything is an Alloc, and then during the inplace optimizations we take care of finding out what actually needs to allocate an output and what doesn't.
Note that for the common cheap Elemwise-broadcasted cases 1*x or 0+y, we already rewrite it like this via the AlgebraicCanonizer. That's why I parametrized the tests with an odd pow(x, 2.5) as well.
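Roughly the shape of that parametrization (a sketch, not the PR's actual test):

import numpy as np
import pytensor
import pytensor.tensor as pt
import pytest

# 1 * x and 0 + x are already simplified by the algebraic canonicalization rewrites,
# so x ** 2.5 is the case that actually exercises the constant-inlining path.
@pytest.mark.parametrize("expr_fn", [lambda x: 1 * x, lambda x: 0 + x, lambda x: x ** 2.5])
def test_inlined_constant(expr_fn):
    x = pt.vector("x")
    fn = pytensor.function([x], expr_fn(x))
    xv = np.ones(3)
    np.testing.assert_allclose(fn(xv), expr_fn(xv))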
But do we really want to use broadcasting to go from a (1,) array to an (n,) array without copying? We can do that, but the resulting array would use the same storage location multiple times. And if a downstream op works inplace and doesn't take that possibility into account, then we'd end up with incorrect results.
That's why we have a daddy called Supervisor:

pytensor/pytensor/compile/function/types.py (Line 135 in 5c87d74)
class Supervisor(Feature):

And why we don't let users give us Functions with inplace Ops.
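For context, a conceptual sketch of what a Supervisor-style Feature guards against (the class name and error handling here are invented for illustration; see the linked types.py for the real implementation):

from pytensor.graph.features import Feature

class ProtectInputs(Feature):  # hypothetical stand-in for the real Supervisor
    # Reject graph states in which any Op would destroy a protected input in place.
    def __init__(self, protected):
        self.protected = set(protected)

    def validate(self, fgraph):
        for node in fgraph.apply_nodes:
            destroy_map = getattr(node.op, "destroy_map", None) or {}
            for destroyed_inputs in destroy_map.values():
                for idx in destroyed_inputs:
                    if node.inputs[idx] in self.protected:
                        raise Exception("rewrite would destroy a protected input in place")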
We shouldn't have to worry about that at the level of pre-inplace rewrites
My point was more that I think we have a whole bunch of logic that just silently assumes we don't use any weird arrays like broadcast(np.ones(1), (3,)) at all. NumPy will even return a read-only array for that, probably to make sure users don't shoot themselves in the foot:
import numpy as np

vals = np.ones(1)
bc = np.broadcast_to(vals, (3,))
# NumPy marks the broadcast view read-only because all entries share one storage location
assert not bc.flags.writeable
A later rewrite that would be totally valid with a normal array could turn into something that errors or even gives silently incorrect results.
Let's say we have a graph like

inc_subtensor(pow(ones(3), 1.)[0, 1], 1)
-> inc_subtensor(Alloc(pow(ones(1), 1), (3,), inplace=False)[0, 1], 1, inplace=False)
-> inc_subtensor(Alloc(pow(ones(1), 1), (3,), inplace=True)[0, 1], 1, inplace=False)
-> inc_subtensor(Alloc(pow(ones(1), 1), (3,), inplace=True)[0, 1], 1, inplace=True)

The two inplace rewrites seem perfectly innocent in isolation, but because the first inplace rewrite changed how the array is stored, the second one won't work. There is only one storage location for all three entries of the array, so it will return an array like [3, 3, 3] instead of [2, 2, 1].
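The same hazard can be shown with plain NumPy (using as_strided to build a writeable aliased view, since broadcast_to would hand back a read-only one):

import numpy as np

base = np.ones(1)
# All three "entries" of `aliased` point at the single element of `base`.
aliased = np.lib.stride_tricks.as_strided(base, shape=(3,), strides=(0,))
np.add.at(aliased, [0, 1], 1)   # intended result: [2., 2., 1.]
print(aliased)                  # actual result:   [3., 3., 3.]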
I think the most sensible solution to this is to just never ever create arrays that refer to the same memory location multiple times. But that means we can't implement Alloc(ones(1), (3,), inplace=True) at all.
So maybe that rewrite is still fine, and we think that there are more cases where it helps than cases where it hurts (I think that's probably true). But at least we should make sure we disallow Alloc(ones(1), 3, 4, inplace=True).
So maybe that rewrite is still fine, and we think that there are more cases where it helps than cases where it hurts (I think that's probably true). But at least we should make sure we disallow Alloc(ones(1), 3, 4, inplace=True).
I'll add this to the #367 issue
After some discussion: we agree that the rewrite is fine, but we shouldn't rewrite the Alloc to a broadcast, to avoid creating arrays that alias memory. Doing this might be fine (there are some checks in the blas wrappers already), but we thought it was not worth the trouble with the possible bugs it could introduce.
LGTM :-)
I removed the Elemwise rewrite for now, as it raises other questions that I have to think about for a bit longer.
This PR inlines constants in composite graphs, which reduces the number of variables that must be iterated over in the fused Elemwise graphs. I noticed some speedups when combined with openmp, but not really otherwise. It would be interesting to check whether numba benefits more from this.
Regardless, this seems like a reasonable rewrite to include, if for nothing else than that it also simplifies the debug_print of the final compiled graphs.
There is no further constant folding triggered in the inner Composite graph. This shouldn't be needed for automatically introduced Composites, as those happen after constant folding.
If users create their own Composite and pass constants, then they may have useless operations on constants inside the graph, but that's hopefully something compilers can optimize themselves?
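For illustration, this is roughly what a user-built Composite with a constant baked into its inner graph looks like (import paths and the Composite signature are from memory and may vary between versions):

import pytensor.tensor as pt
from pytensor.scalar import Composite, float64
from pytensor.tensor.elemwise import Elemwise

xs = float64("xs")                                       # scalar variable for the inner graph
comp_op = Composite(inputs=[xs], outputs=[xs ** 2.5])    # the 2.5 lives inside the Composite

x = pt.vector("x")
out = Elemwise(comp_op)(x)                               # no separate constant input to iterate over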
Related to #107