
Commit 1ed0478: Expand pm.Data capacities (#3925)

Authored by AlexAndorra and hottwaj
Commit message:

* Initial changes to allow pymc3.Data() to support both int and float input data (previously all input data was coerced to float); WIP for #3813
* Added exception for invalid dtype input to pandas_to_array
* Refined implementation
* Finished dtype conversion handling
* Added SharedVariable option to getattr_value
* Added dtype handling to set_data function
* Added tests for pm.Data used for index variables
* Added tests for using pm.Data as RV input
* Ran Black on data tests files
* Added release note
* Updated release notes
* Updated code in light of Luciano's comments
* Fixed implementation of integer checking
* Simplified implementation of type checking
* Corrected implementation for other uses of pandas_to_array

Co-authored-by: hottwaj <[email protected]>
Parent: 7f307b9

5 files changed: +142 additions, -72 deletions


RELEASE-NOTES.md

Lines changed: 3 additions & 1 deletion

@@ -3,7 +3,7 @@
 ## PyMC3 3.9 (On deck)

 ### New features
-- use [fastprogress](https://github.com/fastai/fastprogress) instead of tqdm [#3693](https://github.com/pymc-devs/pymc3/pull/3693)
+- Use [fastprogress](https://github.com/fastai/fastprogress) instead of tqdm [#3693](https://github.com/pymc-devs/pymc3/pull/3693).
 - `DEMetropolis` can now tune both `lambda` and `scaling` parameters, but by default neither of them are tuned. See [#3743](https://github.com/pymc-devs/pymc3/pull/3743) for more info.
 - `DEMetropolisZ`, an improved variant of `DEMetropolis`, brings better parallelization and higher efficiency with fewer chains, with a slower initial convergence. This implementation is experimental. See [#3784](https://github.com/pymc-devs/pymc3/pull/3784) for more info.
 - Notebooks that give insight into `DEMetropolis`, `DEMetropolisZ` and the `DifferentialEquation` interface are now located in the [Tutorials/Deep Dive](https://docs.pymc.io/nb_tutorials/index.html) section.
@@ -14,6 +14,8 @@
 - `pm.sample` now has support for adapting dense mass matrix using `QuadPotentialFullAdapt` (see [#3596](https://github.com/pymc-devs/pymc3/pull/3596), [#3705](https://github.com/pymc-devs/pymc3/pull/3705), [#3858](https://github.com/pymc-devs/pymc3/pull/3858), and [#3893](https://github.com/pymc-devs/pymc3/pull/3893)). Use `init="adapt_full"` or `init="jitter+adapt_full"` to use.
 - `Moyal` distribution added (see [#3870](https://github.com/pymc-devs/pymc3/pull/3870)).
 - `pm.LKJCholeskyCov` now automatically computes and returns the unpacked Cholesky decomposition, the correlations and the standard deviations of the covariance matrix (see [#3881](https://github.com/pymc-devs/pymc3/pull/3881)).
+- `pm.Data` container can now be used for index variables, i.e. with integer data and not only floats (issue [#3813](https://github.com/pymc-devs/pymc3/issues/3813), fixed by [#3925](https://github.com/pymc-devs/pymc3/pull/3925)).
+- `pm.Data` container can now be used as input for other random variables (issue [#3842](https://github.com/pymc-devs/pymc3/issues/3842), fixed by [#3925](https://github.com/pymc-devs/pymc3/pull/3925)).

 ### Maintenance
 - Tuning results no longer leak into sequentially sampled `Metropolis` chains (see #3733 and #3796).
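The two `pm.Data` additions above hinge on keeping an integer dtype intact for index data. A minimal NumPy-only sketch (no PyMC3 required; variable names are ours) of why a float-coerced index breaks the group-indexing pattern these release notes enable:

```python
import numpy as np

# alpha holds one value per group; index picks the group for each row,
# mimicking the `alpha[index]` pattern used in hierarchical models.
alpha = np.array([1.0, 2.0, 3.0])
index = np.asarray([2, 0, 1, 0, 2])   # integer dtype is preserved

selected = alpha[index]               # integer fancy indexing works
print(selected)

# Had the data container coerced the index to float (the old behavior),
# NumPy would reject it as an index:
try:
    alpha[index.astype("float64")]
    float_index_ok = True
except IndexError:
    float_index_ok = False
print("float index accepted:", float_index_ok)
```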

pymc3/data.py

Lines changed: 11 additions & 9 deletions

@@ -153,9 +153,9 @@ class Minibatch(tt.TensorVariable):
     Examples
     --------
     Consider we have `data` as follows:
-
+
     >>> data = np.random.rand(100, 100)
-
+
     if we want a 1d slice of size 10 we do

     >>> x = Minibatch(data, batch_size=10)
@@ -182,7 +182,7 @@ class Minibatch(tt.TensorVariable):

     >>> assert x.eval().shape == (10, 10)

-
+
     You can pass the Minibatch `x` to your desired model:

     >>> with pm.Model() as model:
@@ -192,7 +192,7 @@ class Minibatch(tt.TensorVariable):

     Then you can perform regular Variational Inference out of the box
-
+

     >>> with model:
     ...     approx = pm.fit()
@@ -478,16 +478,19 @@ class Data:
     For more information, take a look at this example notebook
     https://docs.pymc.io/notebooks/data_container.html
     """
+
     def __new__(self, name, value):
+        if isinstance(value, list):
+            value = np.array(value)

         # Add data container to the named variables of the model.
         try:
             model = pm.Model.get_context()
         except TypeError:
-            raise TypeError("No model on context stack, which is needed to "
-                            "instantiate a data container. Add variable "
-                            "inside a 'with model:' block.")
+            raise TypeError(
+                "No model on context stack, which is needed to instantiate a data container. "
+                "Add variable inside a 'with model:' block."
+            )
         name = model.name_for(name)

         # `pm.model.pandas_to_array` takes care of parameter `value` and
@@ -498,7 +501,6 @@ def __new__(self, name, value):
         # its shape.
         shared_object.dshape = tuple(shared_object.shape.eval())

-
         model.add_random_variable(shared_object)

         return shared_object
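The `Data.__new__` patch above starts by normalizing plain Python lists to NumPy arrays so that dtype inspection works downstream. A standalone sketch of just that step (the helper name is ours, not PyMC3's):

```python
import numpy as np

def normalize_value(value):
    # Mirrors the new first step of Data.__new__: plain Python lists are
    # converted to NumPy arrays; arrays and other inputs pass through.
    if isinstance(value, list):
        value = np.array(value)
    return value

print(normalize_value([1, 2, 3]))    # array with an integer dtype
print(normalize_value([1.0, 2.0]))   # array with a float dtype
```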

pymc3/distributions/distribution.py

Lines changed: 3 additions & 0 deletions

@@ -111,6 +111,9 @@ def getattr_value(self, val):
         if isinstance(val, tt.TensorVariable):
             return val.tag.test_value

+        if isinstance(val, tt.sharedvar.TensorSharedVariable):
+            return val.get_value()
+
         if isinstance(val, theano_constant):
             return val.value
pymc3/model.py

Lines changed: 15 additions & 3 deletions

@@ -1244,7 +1244,7 @@ def set_data(new_data, model=None):
     ----------
     new_data: dict
         New values for the data containers. The keys of the dictionary are
-        the variables names in the model and the values are the objects
+        the variables' names in the model and the values are the objects
         with which to update.
     model: Model (optional if in `with` context)
@@ -1266,7 +1266,7 @@ def set_data(new_data, model=None):
     .. code:: ipython

         >>> with model:
-        ...     pm.set_data({'x': [5,6,9]})
+        ...     pm.set_data({'x': [5., 6., 9.]})
         ...     y_test = pm.sample_posterior_predictive(trace)
         >>> y_test['obs'].mean(axis=0)
         array([4.6088569 , 5.54128318, 8.32953844])
@@ -1275,6 +1275,8 @@ def set_data(new_data, model=None):

     for variable_name, new_value in new_data.items():
         if isinstance(model[variable_name], SharedVariable):
+            if isinstance(new_value, list):
+                new_value = np.array(new_value)
             model[variable_name].set_value(pandas_to_array(new_value))
         else:
             message = 'The variable `{}` must be defined as `pymc3.' \
@@ -1501,7 +1503,17 @@ def pandas_to_array(data):
         ret = generator(data)
     else:
         ret = np.asarray(data)
-    return pm.floatX(ret)
+
+    # type handling to enable index variables when data is int:
+    if hasattr(data, "dtype"):
+        if "int" in str(data.dtype):
+            return pm.intX(ret)
+        # otherwise, assume float:
+        else:
+            return pm.floatX(ret)
+    # needed for uses of this function other than with pm.Data:
+    else:
+        return pm.floatX(ret)


 def as_tensor(data, name, model, distribution):
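The dtype branch added to `pandas_to_array` can be exercised without Theano. A sketch with NumPy stand-ins for `pm.intX` and `pm.floatX` (the real functions cast to the dtypes Theano is configured with; the concrete `int32`/`float64` choices here are ours):

```python
import numpy as np

def intX(x):
    return np.asarray(x, dtype="int32")     # stand-in for pm.intX

def floatX(x):
    return np.asarray(x, dtype="float64")   # stand-in for pm.floatX

def pandas_to_array_sketch(data):
    # Minimal sketch of the new branch: integer input keeps an integer
    # dtype (so it can serve as an index variable); everything else,
    # including inputs without a .dtype such as plain lists, is coerced
    # to float as before.
    ret = np.asarray(data)
    if hasattr(data, "dtype") and "int" in str(data.dtype):
        return intX(ret)
    return floatX(ret)

print(pandas_to_array_sketch(np.array([2, 0, 1])).dtype)
print(pandas_to_array_sketch(np.array([1.5, 2.5])).dtype)
print(pandas_to_array_sketch([1, 2]).dtype)  # list has no .dtype -> float
```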

pymc3/tests/test_data_container.py

Lines changed: 110 additions & 59 deletions

@@ -20,117 +20,168 @@

 class TestData(SeededTest):
     def test_deterministic(self):
-        data_values = np.array([.5, .4, 5, 2])
+        data_values = np.array([0.5, 0.4, 5, 2])
         with pm.Model() as model:
-            X = pm.Data('X', data_values)
-            pm.Normal('y', 0, 1, observed=X)
+            X = pm.Data("X", data_values)
+            pm.Normal("y", 0, 1, observed=X)
         model.logp(model.test_point)

     def test_sample(self):
         x = np.random.normal(size=100)
         y = x + np.random.normal(scale=1e-2, size=100)

-        x_pred = np.linspace(-3, 3, 200, dtype='float32')
+        x_pred = np.linspace(-3, 3, 200, dtype="float32")

         with pm.Model():
-            x_shared = pm.Data('x_shared', x)
-            b = pm.Normal('b', 0., 10.)
-            pm.Normal('obs', b * x_shared, np.sqrt(1e-2), observed=y)
-            prior_trace0 = pm.sample_prior_predictive(1000)
+            x_shared = pm.Data("x_shared", x)
+            b = pm.Normal("b", 0.0, 10.0)
+            pm.Normal("obs", b * x_shared, np.sqrt(1e-2), observed=y)

+            prior_trace0 = pm.sample_prior_predictive(1000)
             trace = pm.sample(1000, init=None, tune=1000, chains=1)
             pp_trace0 = pm.sample_posterior_predictive(trace, 1000)
             pp_trace01 = pm.fast_sample_posterior_predictive(trace, 1000)

             x_shared.set_value(x_pred)
+            prior_trace1 = pm.sample_prior_predictive(1000)
             pp_trace1 = pm.sample_posterior_predictive(trace, samples=1000)
             pp_trace11 = pm.fast_sample_posterior_predictive(trace, samples=1000)
-            prior_trace1 = pm.sample_prior_predictive(1000)

-        assert prior_trace0['b'].shape == (1000,)
-        assert prior_trace0['obs'].shape == (1000, 100)
-        assert prior_trace1['obs'].shape == (1000, 200)
+        assert prior_trace0["b"].shape == (1000,)
+        assert prior_trace0["obs"].shape == (1000, 100)
+        assert prior_trace1["obs"].shape == (1000, 200)

-        assert pp_trace0['obs'].shape == (1000, 100)
-        assert pp_trace01['obs'].shape == (1000, 100)
+        assert pp_trace0["obs"].shape == (1000, 100)
+        assert pp_trace01["obs"].shape == (1000, 100)

-        np.testing.assert_allclose(x, pp_trace0['obs'].mean(axis=0), atol=1e-1)
-        np.testing.assert_allclose(x, pp_trace01['obs'].mean(axis=0), atol=1e-1)
+        np.testing.assert_allclose(x, pp_trace0["obs"].mean(axis=0), atol=1e-1)
+        np.testing.assert_allclose(x, pp_trace01["obs"].mean(axis=0), atol=1e-1)

-        assert pp_trace1['obs'].shape == (1000, 200)
-        assert pp_trace11['obs'].shape == (1000, 200)
+        assert pp_trace1["obs"].shape == (1000, 200)
+        assert pp_trace11["obs"].shape == (1000, 200)

-        np.testing.assert_allclose(x_pred, pp_trace1['obs'].mean(axis=0),
-                                   atol=1e-1)
-        np.testing.assert_allclose(x_pred, pp_trace11['obs'].mean(axis=0),
-                                   atol=1e-1)
+        np.testing.assert_allclose(x_pred, pp_trace1["obs"].mean(axis=0), atol=1e-1)
+        np.testing.assert_allclose(x_pred, pp_trace11["obs"].mean(axis=0), atol=1e-1)

     def test_sample_posterior_predictive_after_set_data(self):
         with pm.Model() as model:
-            x = pm.Data('x', [1., 2., 3.])
-            y = pm.Data('y', [1., 2., 3.])
-            beta = pm.Normal('beta', 0, 10.)
-            pm.Normal('obs', beta * x, np.sqrt(1e-2), observed=y)
+            x = pm.Data("x", [1.0, 2.0, 3.0])
+            y = pm.Data("y", [1.0, 2.0, 3.0])
+            beta = pm.Normal("beta", 0, 10.0)
+            pm.Normal("obs", beta * x, np.sqrt(1e-2), observed=y)
             trace = pm.sample(1000, tune=1000, chains=1)
         # Predict on new data.
         with model:
             x_test = [5, 6, 9]
-            pm.set_data(new_data={'x': x_test})
+            pm.set_data(new_data={"x": x_test})
             y_test = pm.sample_posterior_predictive(trace)
             y_test1 = pm.fast_sample_posterior_predictive(trace)

-        assert y_test['obs'].shape == (1000, 3)
-        assert y_test1['obs'].shape == (1000, 3)
-        np.testing.assert_allclose(x_test, y_test['obs'].mean(axis=0),
-                                   atol=1e-1)
-        np.testing.assert_allclose(x_test, y_test1['obs'].mean(axis=0),
-                                   atol=1e-1)
+        assert y_test["obs"].shape == (1000, 3)
+        assert y_test1["obs"].shape == (1000, 3)
+        np.testing.assert_allclose(x_test, y_test["obs"].mean(axis=0), atol=1e-1)
+        np.testing.assert_allclose(x_test, y_test1["obs"].mean(axis=0), atol=1e-1)

     def test_sample_after_set_data(self):
         with pm.Model() as model:
-            x = pm.Data('x', [1., 2., 3.])
-            y = pm.Data('y', [1., 2., 3.])
-            beta = pm.Normal('beta', 0, 10.)
-            pm.Normal('obs', beta * x, np.sqrt(1e-2), observed=y)
+            x = pm.Data("x", [1.0, 2.0, 3.0])
+            y = pm.Data("y", [1.0, 2.0, 3.0])
+            beta = pm.Normal("beta", 0, 10.0)
+            pm.Normal("obs", beta * x, np.sqrt(1e-2), observed=y)
             pm.sample(1000, init=None, tune=1000, chains=1)
         # Predict on new data.
-        new_x = [5., 6., 9.]
-        new_y = [5., 6., 9.]
+        new_x = [5.0, 6.0, 9.0]
+        new_y = [5.0, 6.0, 9.0]
         with model:
-            pm.set_data(new_data={'x': new_x, 'y': new_y})
+            pm.set_data(new_data={"x": new_x, "y": new_y})
             new_trace = pm.sample(1000, init=None, tune=1000, chains=1)
             pp_trace = pm.sample_posterior_predictive(new_trace, 1000)
             pp_tracef = pm.fast_sample_posterior_predictive(new_trace, 1000)

-        assert pp_trace['obs'].shape == (1000, 3)
-        assert pp_tracef['obs'].shape == (1000, 3)
-        np.testing.assert_allclose(new_y, pp_trace['obs'].mean(axis=0),
-                                   atol=1e-1)
-        np.testing.assert_allclose(new_y, pp_tracef['obs'].mean(axis=0),
-                                   atol=1e-1)
+        assert pp_trace["obs"].shape == (1000, 3)
+        assert pp_tracef["obs"].shape == (1000, 3)
+        np.testing.assert_allclose(new_y, pp_trace["obs"].mean(axis=0), atol=1e-1)
+        np.testing.assert_allclose(new_y, pp_tracef["obs"].mean(axis=0), atol=1e-1)
+
+    def test_shared_data_as_index(self):
+        """
+        Allow pm.Data to be used for index variables, i.e. with integers as well as floats.
+        See https://github.com/pymc-devs/pymc3/issues/3813
+        """
+        with pm.Model() as model:
+            index = pm.Data("index", [2, 0, 1, 0, 2])
+            y = pm.Data("y", [1.0, 2.0, 3.0, 2.0, 1.0])
+            alpha = pm.Normal("alpha", 0, 1.5, shape=3)
+            pm.Normal("obs", alpha[index], np.sqrt(1e-2), observed=y)
+
+            prior_trace = pm.sample_prior_predictive(1000, var_names=["alpha"])
+            trace = pm.sample(1000, init=None, tune=1000, chains=1)
+
+        # Predict on new data
+        new_index = np.array([0, 1, 2])
+        new_y = [5.0, 6.0, 9.0]
+        with model:
+            pm.set_data(new_data={"index": new_index, "y": new_y})
+            pp_trace = pm.sample_posterior_predictive(
+                trace, 1000, var_names=["alpha", "obs"]
+            )
+            pp_tracef = pm.fast_sample_posterior_predictive(
+                trace, 1000, var_names=["alpha", "obs"]
+            )
+
+        assert prior_trace["alpha"].shape == (1000, 3)
+        assert trace["alpha"].shape == (1000, 3)
+        assert pp_trace["alpha"].shape == (1000, 3)
+        assert pp_trace["obs"].shape == (1000, 3)
+        assert pp_tracef["alpha"].shape == (1000, 3)
+        assert pp_tracef["obs"].shape == (1000, 3)
+
+    def test_shared_data_as_rv_input(self):
+        """
+        Allow pm.Data to be used as input for other RVs.
+        See https://github.com/pymc-devs/pymc3/issues/3842
+        """
+        with pm.Model() as m:
+            x = pm.Data("x", [1.0, 2.0, 3.0])
+            _ = pm.Normal("y", mu=x, shape=3)
+            trace = pm.sample(chains=1)
+
+        np.testing.assert_allclose(np.array([1.0, 2.0, 3.0]), x.get_value(), atol=1e-1)
+        np.testing.assert_allclose(
+            np.array([1.0, 2.0, 3.0]), trace["y"].mean(0), atol=1e-1
+        )
+
+        with m:
+            pm.set_data({"x": np.array([2.0, 4.0, 6.0])})
+            trace = pm.sample(chains=1)
+
+        np.testing.assert_allclose(np.array([2.0, 4.0, 6.0]), x.get_value(), atol=1e-1)
+        np.testing.assert_allclose(
+            np.array([2.0, 4.0, 6.0]), trace["y"].mean(0), atol=1e-1
+        )

     def test_creation_of_data_outside_model_context(self):
         with pytest.raises((IndexError, TypeError)) as error:
-            pm.Data('data', [1.1, 2.2, 3.3])
-        error.match('No model on context stack')
+            pm.Data("data", [1.1, 2.2, 3.3])
+        error.match("No model on context stack")

     def test_set_data_to_non_data_container_variables(self):
         with pm.Model() as model:
-            x = np.array([1., 2., 3.])
-            y = np.array([1., 2., 3.])
-            beta = pm.Normal('beta', 0, 10.)
-            pm.Normal('obs', beta * x, np.sqrt(1e-2), observed=y)
+            x = np.array([1.0, 2.0, 3.0])
+            y = np.array([1.0, 2.0, 3.0])
+            beta = pm.Normal("beta", 0, 10.0)
+            pm.Normal("obs", beta * x, np.sqrt(1e-2), observed=y)
             pm.sample(1000, init=None, tune=1000, chains=1)
         with pytest.raises(TypeError) as error:
-            pm.set_data({'beta': [1.1, 2.2, 3.3]}, model=model)
-        error.match('defined as `pymc3.Data` inside the model')
+            pm.set_data({"beta": [1.1, 2.2, 3.3]}, model=model)
+        error.match("defined as `pymc3.Data` inside the model")

     def test_model_to_graphviz_for_model_with_data_container(self):
         with pm.Model() as model:
-            x = pm.Data('x', [1., 2., 3.])
-            y = pm.Data('y', [1., 2., 3.])
-            beta = pm.Normal('beta', 0, 10.)
-            pm.Normal('obs', beta * x, np.sqrt(1e-2), observed=y)
+            x = pm.Data("x", [1.0, 2.0, 3.0])
+            y = pm.Data("y", [1.0, 2.0, 3.0])
+            beta = pm.Normal("beta", 0, 10.0)
+            pm.Normal("obs", beta * x, np.sqrt(1e-2), observed=y)
             pm.sample(1000, init=None, tune=1000, chains=1)

         g = pm.model_to_graphviz(model)
@@ -147,7 +198,7 @@ def test_model_to_graphviz_for_model_with_data_container(self):

 def test_data_naming():
     """
-    This is a test for issue #3793 -- `Data` objects in named models are
+    This is a test for issue #3793 -- `Data` objects in named models are
     not given model-relative names.
     """
     with pm.Model("named_model") as model: