Inline constants in composite graphs #361


Merged: 5 commits, Jul 11, 2023

Conversation

ricardoV94
Member

@ricardoV94 ricardoV94 commented Jun 30, 2023

This PR inlines constants in composite graphs, which reduces the number of variables that must be iterated over in the fused Elemwise graphs. I noticed some speedups when combined with OpenMP, but not much otherwise.

Would be interesting to check if numba benefits more from this.

Regardless, this seems like a reasonable rewrite to include, if for nothing else than that it simplifies the dprint of the final compiled graphs.

```python
import pytensor
import pytensor.tensor as pt

x = pt.vector("x")
out = pt.exp(x + 2)

fn = pytensor.function([x], out)
pytensor.dprint(fn)
```

```
Composite{exp((2.0 + i0))} [id A] 0
 └─ x [id B]

Inner graphs:

Composite{exp((2.0 + i0))} [id A]
 ← exp [id C] 'o0'
    └─ add [id D]
       ├─ 2.0 [id E]
       └─ i0 [id F]
```

There is no further constant folding triggered in the inner Composite graph. This shouldn't be needed for automatically introduced Composites, as those happen after constant folding.

If users create their own Composite and pass constants, then they may have useless operations on constants inside the graph, but that's hopefully something compilers can optimize themselves?

Related to #107

@ricardoV94 ricardoV94 force-pushed the inline_constants_composite branch 4 times, most recently from 7d9dfe0 to ce2d4c3 Compare June 30, 2023 14:47
@ricardoV94 ricardoV94 added the bug Something isn't working label Jun 30, 2023
@ricardoV94 ricardoV94 force-pushed the inline_constants_composite branch from ce2d4c3 to 193b22a Compare June 30, 2023 17:11
@ricardoV94
Member Author

Interesting: Scalar Constants were never inlined in the compiled thunks (the C functions are simply given a pointer to the Python list that holds those constants). That reduces the number of compiled thunks, at the expense of compiler-level optimizations like the one mentioned in this PR:

def local_pow_specialize_device(fgraph, node):

The optimization in this PR allows inlining some of these cases via the Composite graph.

@ricardoV94 ricardoV94 force-pushed the inline_constants_composite branch from 193b22a to 8c0b6b1 Compare July 3, 2023 18:30
@ricardoV94 ricardoV94 marked this pull request as draft July 3, 2023 18:30
@ricardoV94 ricardoV94 force-pushed the inline_constants_composite branch from 8c0b6b1 to b8be9e8 Compare July 3, 2023 18:35
This was referenced Jul 3, 2023
@ricardoV94 ricardoV94 force-pushed the inline_constants_composite branch 4 times, most recently from 12ea5f6 to 1580034 Compare July 5, 2023 09:45
@ricardoV94 ricardoV94 force-pushed the inline_constants_composite branch 5 times, most recently from be115bb to 2351a1f Compare July 5, 2023 13:10
@ricardoV94 ricardoV94 marked this pull request as ready for review July 5, 2023 13:11
@ricardoV94
Member Author

@aseyboldt I think the functionality is working. Would you be interested in doing a quick Numba benchmark?

@ricardoV94 ricardoV94 removed the bug Something isn't working label Jul 5, 2023
@codecov-commenter

Codecov Report

Merging #361 (2351a1f) into main (5c87d74) will increase coverage by 0.03%.
The diff coverage is 92.39%.

@@            Coverage Diff             @@
##             main     #361      +/-   ##
==========================================
+ Coverage   80.40%   80.44%   +0.03%     
==========================================
  Files         156      156              
  Lines       45401    45483      +82     
  Branches    11106    11138      +32     
==========================================
+ Hits        36505    36588      +83     
+ Misses       6689     6687       -2     
- Partials     2207     2208       +1     
Impacted Files Coverage Δ
pytensor/scalar/basic.py 80.16% <66.66%> (-0.04%) ⬇️
pytensor/tensor/rewriting/shape.py 81.20% <83.33%> (+0.10%) ⬆️
pytensor/tensor/rewriting/elemwise.py 89.29% <98.14%> (+0.78%) ⬆️
pytensor/scan/rewriting.py 79.94% <100.00%> (ø)
pytensor/tensor/basic.py 90.77% <100.00%> (ø)
pytensor/tensor/rewriting/math.py 86.36% <100.00%> (+0.31%) ⬆️
pytensor/tensor/var.py 87.76% <100.00%> (+0.19%) ⬆️

... and 1 file with indirect coverage changes

new_outputs = Elemwise(new_composite_op).make_node(*new_outer_inputs).outputs

# Some of the inlined constants were broadcasting the output shape
if node.outputs[0].type.broadcastable != new_outputs[0].type.broadcastable:
Member

How could this happen if we only changed scalar inputs?

Member Author

@ricardoV94 ricardoV94 Jul 8, 2023

get_unique_constant_value works for homogeneous constants of any rank. In cases where the constant influenced the final shape of the Elemwise operation, we broadcast (via Alloc) the new Elemwise result:

Elemwise(row, ones((3, 4))) -> Alloc(Elemwise(row, ones((1, 1))), 3, 4)
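A NumPy sketch of the shape bookkeeping above (the names `row`, `before`, and `after` are illustrative, not PyTensor API): an elemwise op against a homogeneous (3, 4) constant is equivalent to the same op against its (1, 1) reduction, with the smaller result broadcast back to the full shape.

```python
import numpy as np

# Stand-in for the row input of Elemwise(row, ones((3, 4)))
row = np.arange(4.0).reshape(1, 4)

# Before the rewrite: the homogeneous constant participates at full shape.
before = row * np.ones((3, 4))

# After the rewrite: the constant is shrunk to a broadcastable (1, 1) array,
# and the smaller Elemwise result is broadcast (the Alloc step) to (3, 4).
after = np.broadcast_to(row * np.ones((1, 1)), (3, 4))

assert np.array_equal(before, after)
```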
Member


In this case the post-rewrite version might actually be slower than the pre-rewrite version.
After the rewrite we add an extra allocation and copy operation. I guess which one is faster in practice depends on how expensive the scalar_op of the Elemwise is...
Maybe we just shouldn't do that rewrite if we'd have to broadcast the output?

Member Author

@ricardoV94 ricardoV94 Jul 8, 2023

We can make Alloc become a view so it doesn't re-allocate (or as things stand now, use a BroadcastTo instead of Alloc)
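For context, NumPy's `broadcast_to` already has the view semantics being discussed here; this is a plain-NumPy sketch (not PyTensor code) showing that no data is copied and that all broadcast rows alias the same memory.

```python
import numpy as np

x = np.ones((1, 4))
view = np.broadcast_to(x, (3, 4))  # a view: no new allocation

# The broadcast axis gets stride 0, so every "row" reads the same memory.
assert np.shares_memory(view, x)
assert view.strides[0] == 0
```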

Member Author

@ricardoV94 ricardoV94 Jul 10, 2023

I think the solution to #367 will fix it. Everything is an Alloc and then during the inplace optimizations we take care of finding out what actually needs to allocate an output and what doesn't.

Note that for the common cheap Elemwise-broadcasted cases 1*x or 0+y, we actually already rewrite it like this via the AlgebraCanonizer. That's why I parametrized the tests with an odd pow(x, 2.5) as well.

Member


But do we really want to use broadcasting to go from a (1,) array to an (n,) array without copying? We can do that, but the resulting array would use the same storage location multiple times. And if a downstream op works inplace and doesn't take that possibility into account, then we'd end up with incorrect results.

Member Author


That's why we have a daddy called Supervisor:

class Supervisor(Feature):

And why we don't let users give us Functions with inplace Ops.

Member Author


We shouldn't have to worry about that at the level of pre-inplace rewrites

Member


My point was more that I think we have a whole bunch of logic that just silently assumes we don't use any weird arrays like broadcast(np.ones(1), (3,)) at all. NumPy will even just return a read-only array for that, probably to make sure users don't shoot themselves in the foot:

```python
import numpy as np

vals = np.ones(1)
bc = np.broadcast_to(vals, (3,))
assert not bc.flags.writeable
```

A later rewrite that would be totally valid with a normal array could turn into something that errors or even gives silently incorrect results.

Let's say we have a graph like

inc_subtensor(pow(ones(3), 1.)[0, 1], 1)
-> inc_subtensor(Alloc(pow(ones(1), 1.), (3,), inplace=False)[0, 1], 1, inplace=False)
-> inc_subtensor(Alloc(pow(ones(1), 1.), (3,), inplace=True)[0, 1], 1, inplace=False)
-> inc_subtensor(Alloc(pow(ones(1), 1.), (3,), inplace=True)[0, 1], 1, inplace=True)

The two inplace rewrites seem perfectly innocent in isolation, but because the first inplace rewrite changed how the array is stored, the second one won't work. There is only one storage location for all three entries of the array, so it will return an array like [3, 3, 3] instead of [2, 2, 1].
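A small NumPy sketch of this failure mode, using `as_strided` and `np.add.at` as stand-ins for an inplace Alloc and an inplace inc_subtensor (and flat indices [0, 1] for simplicity; neither is the PyTensor implementation):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

# Normal array: incrementing entries 0 and 1 behaves as expected.
normal = np.ones(3)
np.add.at(normal, [0, 1], 1)
assert normal.tolist() == [2.0, 2.0, 1.0]

# Aliased array: all three entries share one storage location,
# a writable stand-in for an inplace Alloc of a (1,) array.
base = np.ones(1)
aliased = as_strided(base, shape=(3,), strides=(0,))

# Both unbuffered increments hit the same memory: 1 -> 2 -> 3.
np.add.at(aliased, [0, 1], 1)
assert aliased.tolist() == [3.0, 3.0, 3.0]
```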

I think the most sensible solution to this is to just never ever create arrays that refer to the same memory location multiple times. But that means we can't implement Alloc(ones(1), (3,), inplace=True) at all.


So maybe that rewrite is still fine, and we think that there are more cases where it helps than cases where it hurts (I think that's probably true). But at least we should make sure we disallow Alloc(ones(1), 3, 4, inplace=True).

Member Author


> So maybe that rewrite is still fine, and we think that there are more cases where it helps than cases where it hurts (I think that's probably true). But at least we should make sure we disallow Alloc(ones(1), 3, 4, inplace=True).

I'll add this to the #367 issue

Member


After some discussion: We agree that the rewrite is fine, but we shouldn't rewrite the Alloc into a broadcast, to avoid creating arrays that alias memory. Doing that might be fine (there are already some checks in the BLAS wrappers), but we thought it is not worth the trouble of the possible bugs it could introduce.

@ricardoV94 ricardoV94 force-pushed the inline_constants_composite branch from 2351a1f to 71c018b Compare July 10, 2023 08:32
@ricardoV94 ricardoV94 requested a review from aseyboldt July 10, 2023 08:33
Member

@aseyboldt aseyboldt left a comment

LGTM :-)

@ricardoV94 ricardoV94 marked this pull request as draft July 10, 2023 19:03
@ricardoV94 ricardoV94 force-pushed the inline_constants_composite branch from 71c018b to b618a0f Compare July 11, 2023 08:06
@ricardoV94
Member Author

ricardoV94 commented Jul 11, 2023

I removed the Elemwise rewrite for now, as it raises other questions that I have to think about for a bit longer.

@ricardoV94 ricardoV94 marked this pull request as ready for review July 11, 2023 08:07
@ricardoV94 ricardoV94 force-pushed the inline_constants_composite branch from b618a0f to 033ad21 Compare July 11, 2023 08:35
@ricardoV94 ricardoV94 force-pushed the inline_constants_composite branch from 033ad21 to 451b18c Compare July 11, 2023 10:06
@ricardoV94 ricardoV94 merged commit 5e6b356 into pymc-devs:main Jul 11, 2023