ENH: Add checkpoints during sampling

### Before

_No response_

### After

```python
with pm.Model():
    ...
    pm.sample(..., checkpoint_file=some_path, checkpoint_freq=10)
```


### Context for the issue:

For models that take a very long time to sample, it would be great to have a way to store the state of the `steppers` in a checkpoint file, so that if something goes wrong and sampling stops, we can pick up where we left off. This is a very old feature request, related to #292, #143 and #3661.

Those issues discuss `iter_sample`, which works as a generator that one could simply pause and resume later. The problem is that it gives no access to the stepper's state. I think we need two things to warm start the samplers:

1. The trace that was collected so far
2. The step method's state
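To make the two pieces concrete, here is a minimal sketch of what a checkpoint file could hold. Everything in it is hypothetical: `Checkpoint`, `save_checkpoint` and `load_checkpoint` are names made up for illustration, not existing PyMC objects, and JSON is only one possible on-disk format.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class Checkpoint:
    """Hypothetical container; not an existing PyMC class."""

    # 1. The trace collected so far (one dict of values per draw).
    draws: list[dict[str, float]] = field(default_factory=list)
    # 2. The step method's state (tuned scaling, tuning flag, ...).
    step_state: dict[str, Any] = field(default_factory=dict)


def save_checkpoint(ckpt: Checkpoint, path: str) -> None:
    # Write to a temporary file in the same directory and then rename it,
    # so a crash mid-write does not leave a truncated checkpoint behind.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(asdict(ckpt), fh)
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> Checkpoint:
    with open(path) as fh:
        return Checkpoint(**json.load(fh))
```

A real implementation would probably store the draws in whatever trace backend is already in use and only add the step method's state next to it.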

Currently, most samplers and step methods provide some way to get 1, but we never have access to 2. The current pymc samplers have a number of `KeyboardInterrupt` handlers ([here](https://github.com/pymc-devs/pymc/blob/main/pymc/sampling/mcmc.py#L1177), [here](https://github.com/pymc-devs/pymc/blob/main/pymc/sampling/population.py#L438), [here](https://github.com/pymc-devs/pymc/blob/main/pymc/sampling/parallel.py#L169), and [here](https://github.com/pymc-devs/pymc/blob/main/pymc/sampling/parallel.py#L189)). We could add a call in those handlers that also stores the step method's state. `nutpie` has [non-blocking sampling](https://github.com/pymc-devs/nutpie/blob/main/python/nutpie/sample.py#L466) with an `abort` function that is called when a `KeyboardInterrupt` is raised. We could maybe add similar state recording there. `blackjax` has [its progress bar conditional steps](https://github.com/blackjax-devs/blackjax/blob/main/blackjax/progress_bar.py), which we could try to mimic to get the same effect. `numpyro` does something similar with its [progress bar](https://github.com/pyro-ppl/numpyro/blob/master/numpyro/util.py#L368), but it looks like it sits much deeper in the internals than in `blackjax`.
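As a rough illustration of recording the step method's state inside a `KeyboardInterrupt` handler, here is a toy sketch. It does not use PyMC: `ToyMetropolis`, `get_state`, `set_state` and `sample` are all made up for this example, and the target is a standard normal.

```python
import math
import random


class ToyMetropolis:
    """Toy random-walk Metropolis stepper (hypothetical, not a PyMC class)."""

    def __init__(self, scale=1.0, seed=0):
        self.scale = scale
        self.rng = random.Random(seed)

    def step(self, x):
        proposal = x + self.rng.gauss(0.0, self.scale)
        # Log acceptance ratio for a standard normal target.
        log_ratio = 0.5 * (x * x - proposal * proposal)
        # 1 - random() lies in (0, 1], so the log is always defined.
        if math.log(1.0 - self.rng.random()) < log_ratio:
            return proposal
        return x

    def get_state(self):
        return {"scale": self.scale, "rng": self.rng.getstate()}

    def set_state(self, state):
        self.scale = state["scale"]
        self.rng.setstate(state["rng"])


def sample(stepper, x0, draws, callback=None):
    """Return (trace, state); state is captured even if sampling is interrupted."""
    trace, x = [], x0
    try:
        for i in range(draws):
            x = stepper.step(x)
            trace.append(x)
            if callback is not None:
                callback(i)
    except KeyboardInterrupt:
        # The existing handlers would be the place to record the state.
        # A real interrupt can arrive mid-step, so real code would need
        # to snapshot at a step boundary.
        return trace, {"position": x, "step": stepper.get_state()}
    return trace, {"position": x, "step": stepper.get_state()}
```

With the returned state, a fresh stepper can call `set_state` and continue from `state["position"]`. In this toy case the resumed chain matches an uninterrupted run because the random number generator state is part of what gets stored.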

All of this is to say that I think we need to define a standard way for samplers to expose their state. Each sampler would then conform to the standard using whatever internals it needs. For `pymc` samplers it would be some way to recreate the step methods (maybe using `__getstate__` and `__setstate__`), for `nutpie` it would have to be some new datatype that could be sent into Rust, and for `blackjax` it could be the kernel and random keys. I think the important thing is to agree on the standard that samplers should conform to. Once we have it, we can build support for checkpoints and for restarting sampling from them later.
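One possible shape for such a standard is a small protocol that every sampler wrapper implements. This is only a sketch under assumptions: `SupportsSamplerState`, `get_sampler_state`, `set_sampler_state` and the two adapter classes are invented names, and the adapters are stand-ins rather than wrappers around real `pymc` or `blackjax` objects.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class SupportsSamplerState(Protocol):
    """Hypothetical interface; nothing like this exists in PyMC today."""

    def get_sampler_state(self) -> dict[str, Any]: ...

    def set_sampler_state(self, state: dict[str, Any]) -> None: ...


class StepMethodAdapter:
    """Stand-in for a step method that carries tuned parameters."""

    def __init__(self, scaling: float = 1.0, tune: bool = True):
        self.scaling = scaling
        self.tune = tune

    def get_sampler_state(self) -> dict[str, Any]:
        return {"scaling": self.scaling, "tune": self.tune}

    def set_sampler_state(self, state: dict[str, Any]) -> None:
        self.scaling = state["scaling"]
        self.tune = state["tune"]


class KeyedKernelAdapter:
    """Stand-in for a functional sampler: a position plus a random key."""

    def __init__(self, position: float = 0.0, key: int = 0):
        self.position = position
        self.key = key

    def get_sampler_state(self) -> dict[str, Any]:
        return {"position": self.position, "key": self.key}

    def set_sampler_state(self, state: dict[str, Any]) -> None:
        self.position = state["position"]
        self.key = state["key"]


def restore(sampler: Any, state: dict[str, Any]) -> Any:
    # Checkpointing code only relies on the protocol, not on the backend.
    if not isinstance(sampler, SupportsSamplerState):
        raise TypeError(f"{type(sampler).__name__} does not expose sampler state")
    sampler.set_sampler_state(state)
    return sampler
```

The point of the design is that the checkpoint and restart code would only ever talk to the two protocol methods, while each backend decides what goes into its state dictionary.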

ENH: Add checkpoints during sampling #7503
