
draw_values draws from marginal distributions #3210

Closed
@lucianopaz


Description of your problem

draw_values draws from the marginal distributions of the requested variables instead of the joint distribution.

Please provide a minimal, self-contained, and reproducible example.

import numpy as np
import pymc3 as pm
import seaborn
from matplotlib import pyplot as plt


with pm.Model():
    a = pm.Normal('a', mu=0, sd=100)
    # b and c are both centered on a with a tiny sd, so they should be
    # almost perfectly correlated with each other
    b = pm.Normal('b', mu=a, sd=1e-6)
    c = pm.Normal('c', mu=a, sd=1e-6)

params = [a, b, c]
N = 10000
np.random.seed(1)
# Draw N triplets (a, b, c) with draw_values
A, B, C = list(zip(*[pm.distributions.distribution.draw_values(params)
                     for i in range(N)]))
A = np.array(A)
B = np.array(B)
C = np.array(C)

# Reference samples drawn directly from the true joint distribution
np.random.seed(1)
eA = np.random.randn(N) * 100.
eB = eA + np.random.randn(N) * 1e-6
eC = eA + np.random.randn(N) * 1e-6

seaborn.jointplot(B, C)
plt.suptitle('draw_values')
plt.savefig('draw_values_output.png')
seaborn.jointplot(eB, eC)
plt.suptitle('Expected')
plt.savefig('expected_output.png')

This is a rather simple hierarchical model. The expected joint distribution of the variables b and c is:

[expected_output.png: jointplot of eB against eC, with the two variables almost perfectly correlated]

However, the values drawn with draw_values are independent of one another:

[draw_values_output.png: jointplot of B against C, with no correlation between the drawn values]

The marginals match but the joint distribution does not. Given that #2983 relies heavily on draw_values to make sample_prior_predictive work, this problem is quite serious.

I've given this issue some thought before submitting it here, and I think I've found the origin of the problem, along with a possible solution that I'll submit in a PR later. What seems to be happening is the following (a minimal sketch of the mechanism comes after this list):

  1. draw_values sees that it must sample from a, b and c.
  2. None of these variables is the output of an Apply node (they have no owner), so no dependency tree is searched and nothing happens to givens or point.
  3. draw_values arrives at its final stage, where it makes three separate calls to _draw_value, one for each variable.
  4. Inside _draw_value, it is decided that the param's random method must be called.
  5. random in turn issues a new call to draw_values to get the Normal's mu and sd. The values returned by this nested call to draw_values are never made available to the other calls from point 3. Thus, effectively, each variable ends up being drawn from its marginal distribution.
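
Here is a minimal sketch of that mechanism in plain NumPy (not PyMC3 internals): each of b and c ends up re-drawing its own private copy of a, so the shared parent value is lost.

import numpy as np

def draw_a(rng):
    return rng.randn() * 100.

def draw_b(rng):
    # the nested call draws a fresh, private copy of a ...
    return draw_a(rng) + rng.randn() * 1e-6

def draw_c(rng):
    # ... and so does c, so b and c never see the same a
    return draw_a(rng) + rng.randn() * 1e-6

rng = np.random.RandomState(1)
samples = np.array([(draw_b(rng), draw_c(rng)) for _ in range(10000)])
print(np.corrcoef(samples.T)[0, 1])  # ~0, but it should be ~1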

As it stands now, the only way that the nested draw_values calls from point 5 could be made aware of the value drawn for a is to write that value into the point dictionary before drawing the conditionally dependent b and c.
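
As a sketch of that workaround, continuing from the reproduction script above (this relies on draw_values' point keyword; draw_values is private API, so this may change):

point = {}
# draw the root variable first and expose it through point
point['a'] = pm.distributions.distribution.draw_values([a])[0]
# b and c now both look up a's value in point instead of re-drawing it
b_val, c_val = pm.distributions.distribution.draw_values([b, c], point=point)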

My proposal is as follows. Each distribution should store a list of the theano.Variables it is conditionally dependent on.

For example, Normal should have an attribute called conditional_on, which would be the list [self.mu, self.tau]. If this attribute is adopted as a convention across all distributions (every distribution already uses it implicitly when calling draw_values from its random method), we could build a dependency graph (I suppose it should be a DAG) between all the variables used by draw_values. I also imagine that this DAG could be built behind the scenes when random variables are added to a model, to save some unnecessary computation. The DAG should also take into account deterministic relations, given by theano owner input nodes, something that is already partly in place in draw_values. Knowing the full DAG, we can identify the nodes that do not depend on anything else, draw their values first, put them into the point dictionary, and then move on to the next nodes in the hierarchy. A sketch of this ordering follows.
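
This is a minimal sketch of the proposed ordering, assuming a hypothetical parents mapping built from each distribution's conditional_on attribute (the names here are illustrative, not existing PyMC3 API):

def topo_order(variables, parents):
    """Order variables so that every parent comes before its children."""
    order, seen = [], set()

    def visit(v):
        if v in seen:
            return
        seen.add(v)
        for parent in parents.get(v, []):
            visit(parent)
        order.append(v)

    for v in variables:
        visit(v)
    return order

# For the model above, parents = {'b': ['a'], 'c': ['a']}, so
# topo_order(['b', 'c'], parents) returns ['a', 'b', 'c']: a is drawn
# once, stored in point, and its value is reused by both b and c.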

Versions and main components

  • PyMC3 Version: 3.4
  • Theano Version: 1.0.3
  • Python Version: 3.6
  • Operating system: Ubuntu 16.04
  • How did you install PyMC3: pip install -e
