BUG: PyMC 5.7.2 OOM - memory leak #6852

Closed
@danjenson

Describe the issue:

Process memory grows steadily until it consumes all available memory (and swap). Reproduced on Linux and on an M1 Mac. Note that with the default 'fork' start method for multiprocessing on Linux, the run fails immediately with Errno 12 (out of memory), before sampling even begins.
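For readers unfamiliar with the start methods mentioned above, here is a minimal standard-library sketch that reports which method a platform defaults to and requests a specific context by name. It is illustrative only and not part of the original report; the helper name `describe_start_methods` is hypothetical. The `mp_ctx="spawn"` argument in the script below selects the method by name in the same way.

```python
import multiprocessing as mp


def describe_start_methods():
    """Return (default, available) multiprocessing start methods here."""
    # Note: calling get_start_method() fixes the default for this process.
    default = mp.get_start_method()
    available = mp.get_all_start_methods()
    return default, available


if __name__ == "__main__":
    default, available = describe_start_methods()
    print(f"default start method: {default}")
    print(f"available start methods: {available}")

    # Request a context explicitly instead of relying on the default.
    ctx = mp.get_context("spawn")
    print(f"explicit context uses: {ctx.get_start_method()}")
```

With 'fork', the child starts as a copy of the parent process, whereas 'spawn' starts a fresh interpreter, which is why the two can behave differently under memory pressure.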

PyMC version: 5.7.2

Linux system:

  • Void Linux
  • Kernel 6.3.12_1
  • 64 GB DDR5 RAM
  • 24 GB RTX 4090 GPU
  • AMD Ryzen 9 7950X 16 core, 32 threads

Mac System:

  • 16 GB memory
  • 8 Cores

Dataset: ~161 MB total.

Reproducible code example:

#!/usr/bin/env python3
import numpy as np
import pandas as pd
import pymc as pm


def pymc_bayes(df: pd.DataFrame):
    a, b, c, i = df.a.values, df.b.values, df.c.values, df.i.values
    n_i = int(i.max() + 1)
    with pm.Model() as m:
        alpha = pm.Normal("alpha", 0, 1, shape=[n_i])
        beta_b = pm.HalfNormal("beta_b", 1)
        beta_c = pm.HalfNormal("beta_c", 1)
        beta_int = pm.Normal("beta_int", 0, 1)
        mu = pm.Deterministic(
            "mu", alpha[i] + beta_b * b + beta_c * c + beta_int * b * c
        )
        sigma = pm.Exponential("sigma", 1)
        a_hat = pm.Normal("a_hat", mu, sigma, observed=a)
        idata = pm.sample(mp_ctx="spawn")  # fork fails immediately with OOM
        idata.to_netcdf("pymc_bayes.nc")
    print("finished!")


if __name__ == "__main__":
    n, n_int = 2618018, 17  # to match the real dataset I care about
    df = pd.DataFrame(np.random.randn(n, 3), columns=['a', 'b', 'c'])
    df['i'] = np.random.randint(0, n_int, size=n)
    pymc_bayes(df)

Error message:

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha, beta_b, beta_c, beta_int, sigma]
███████████████████████---------------| 76.14% [6091/8000 18:42<05:51 Sampling 4 chains, 0 divergences]
Process worker_chain_2:
Process worker_chain_3:
Process worker_chain_0:
Process worker_chain_1:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/danj/.local/lib/python3.11/site-packages/pymc/sampling/parallel.py", line 122, in run
    self._start_loop()
  File "/home/danj/.local/lib/python3.11/site-packages/pymc/sampling/parallel.py", line 181, in _start_loop
    msg = self._recv_msg()
          ^^^^^^^^^^^^^^^^
  File "/home/danj/.local/lib/python3.11/site-packages/pymc/sampling/parallel.py", line 153, in _recv_msg
    return self._msg_pipe.recv()
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 249, in recv
    buf = self._recv_bytes()
          ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 413, in _recv_bytes
    buf = self._recv(4)
          ^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 378, in _recv
    chunk = read(handle, remaining)
            ^^^^^^^^^^^^^^^^^^^^^^^
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/danj/.local/lib/python3.11/site-packages/pymc/sampling/parallel.py", line 122, in run
    self._start_loop()
  File "/home/danj/.local/lib/python3.11/site-packages/pymc/sampling/parallel.py", line 181, in _start_loop
    msg = self._recv_msg()
          ^^^^^^^^^^^^^^^^
  File "/home/danj/.local/lib/python3.11/site-packages/pymc/sampling/parallel.py", line 153, in _recv_msg
    return self._msg_pipe.recv()
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 249, in recv
    buf = self._recv_bytes()
          ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 413, in _recv_bytes
    buf = self._recv(4)
          ^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 378, in _recv
    chunk = read(handle, remaining)
            ^^^^^^^^^^^^^^^^^^^^^^^
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/danj/.local/lib/python3.11/site-packages/pymc/sampling/parallel.py", line 122, in run
    self._start_loop()
  File "/home/danj/.local/lib/python3.11/site-packages/pymc/sampling/parallel.py", line 181, in _start_loop
    msg = self._recv_msg()
          ^^^^^^^^^^^^^^^^
  File "/home/danj/.local/lib/python3.11/site-packages/pymc/sampling/parallel.py", line 153, in _recv_msg
    return self._msg_pipe.recv()
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 249, in recv
    buf = self._recv_bytes()
          ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 413, in _recv_bytes
    buf = self._recv(4)
          ^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 378, in _recv
    chunk = read(handle, remaining)
            ^^^^^^^^^^^^^^^^^^^^^^^
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/danj/.local/lib/python3.11/site-packages/pymc/sampling/parallel.py", line 122, in run
    self._start_loop()
  File "/home/danj/.local/lib/python3.11/site-packages/pymc/sampling/parallel.py", line 181, in _start_loop
    msg = self._recv_msg()
          ^^^^^^^^^^^^^^^^
  File "/home/danj/.local/lib/python3.11/site-packages/pymc/sampling/parallel.py", line 153, in _recv_msg
    return self._msg_pipe.recv()
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 249, in recv
    buf = self._recv_bytes()
          ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 413, in _recv_bytes
    buf = self._recv(4)
          ^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 378, in _recv
    chunk = read(handle, remaining)
            ^^^^^^^^^^^^^^^^^^^^^^^
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/danj/.local/lib/python3.11/site-packages/pymc/sampling/parallel.py", line 194, in _run_process
    _Process(*args).run()
  File "/home/danj/.local/lib/python3.11/site-packages/pymc/sampling/parallel.py", line 129, in run
    self._msg_pipe.send(("error", e))
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 205, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 410, in _send_bytes
    self._send(header + buf)
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 367, in _send
    n = write(self._handle, buf)
        ^^^^^^^^^^^^^^^^^^^^^^^^
BrokenPipeError: [Errno 32] Broken pipe
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/danj/.local/lib/python3.11/site-packages/pymc/sampling/parallel.py", line 194, in _run_process
    _Process(*args).run()
  File "/home/danj/.local/lib/python3.11/site-packages/pymc/sampling/parallel.py", line 129, in run
    self._msg_pipe.send(("error", e))
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 205, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 410, in _send_bytes
    self._send(header + buf)
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 367, in _send
    n = write(self._handle, buf)
        ^^^^^^^^^^^^^^^^^^^^^^^^
BrokenPipeError: [Errno 32] Broken pipe
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/danj/.local/lib/python3.11/site-packages/pymc/sampling/parallel.py", line 194, in _run_process
    _Process(*args).run()
  File "/home/danj/.local/lib/python3.11/site-packages/pymc/sampling/parallel.py", line 129, in run
    self._msg_pipe.send(("error", e))
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 205, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 410, in _send_bytes
    self._send(header + buf)
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 367, in _send
    n = write(self._handle, buf)
        ^^^^^^^^^^^^^^^^^^^^^^^^
BrokenPipeError: [Errno 32] Broken pipe
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/danj/.local/lib/python3.11/site-packages/pymc/sampling/parallel.py", line 194, in _run_process
    _Process(*args).run()
  File "/home/danj/.local/lib/python3.11/site-packages/pymc/sampling/parallel.py", line 129, in run
    self._msg_pipe.send(("error", e))
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 205, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 410, in _send_bytes
    self._send(header + buf)
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 367, in _send
    n = write(self._handle, buf)
        ^^^^^^^^^^^^^^^^^^^^^^^^
BrokenPipeError: [Errno 32] Broken pipe

PyMC version information:

PYMC 5.7.2
Aesara 2.9.1
PyTensor 2.14.2

uname -a: Linux ghost 6.3.13_1 #1 SMP PREEMPT_DYNAMIC Tue Jul 25 00:19:40 UTC 2023 x86_64 GNU/Linux

Context for the issue:

This is a simple linear model with an interaction term, yet I could not get it to run without going OOM, even with only two covariates.
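One thing worth checking, offered as an unconfirmed hypothesis rather than a diagnosis: `mu` is wrapped in `pm.Deterministic`, so a vector with one entry per data row is recorded at every iteration. The rough arithmetic below assumes float64 values, a 1000/1000 split between tuning and draws (inferred from the 8000 total iterations over 4 chains in the progress bar), and that tuning iterations are held in memory while sampling. The helper name `trace_bytes` is hypothetical.

```python
def trace_bytes(n_values, draws, tune, chains, itemsize=8):
    """Bytes needed to hold one variable for every iteration of every chain.

    itemsize=8 assumes float64 storage (an assumption, not confirmed above).
    """
    return n_values * (draws + tune) * chains * itemsize


if __name__ == "__main__":
    n = 2_618_018  # rows in the reproduction script
    # 8000 total iterations over 4 chains -> 2000 per chain; the 1000/1000
    # draws/tune split is an assumption.
    total = trace_bytes(n, draws=1000, tune=1000, chains=4)
    print(f"estimated storage for mu: {total / 1e9:.1f} GB")
```

If this estimate is in the right ballpark, storing `mu` alone would exceed the 64 GB machine. Defining `mu` as a plain expression, without `pm.Deterministic`, would be one way to test the hypothesis.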
