
[Performance]: Stability Concerns with LLaMA-4 Models After Extended Uptime (llama-4 models stability on h100 gpus) #16473

Open
@nskpro-cmd

Description


Proposal to improve performance

Hi all,

I wanted to check if anyone else has encountered stability issues with the LLaMA-4 models over extended periods of time. In our setup, the model functions as expected immediately after deployment or a restart. However, after approximately 24 to 36 hours, it stops responding to inference requests.

I’ve verified that the underlying node conditions (GPU health, memory, and system resources) remain healthy during this time. The behavior is reproducible across restarts: after running for a day or more, the model becomes unresponsive again.

Is this a known issue with the current version of the model or vLLM backend? Has anyone else experienced similar behavior or found a workaround?

Appreciate any insights.
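
As a stop-gap, I am considering replacing the /health-based liveness probe with one that exercises the generation path, so Kubernetes restarts the pod automatically when inference hangs even if /health keeps answering. A rough, untested sketch is below (timings are placeholders; the embedded request just asks vLLM's OpenAI-compatible /v1/completions endpoint for a single token):

        livenessProbe:
          exec:
            command:
            - python3
            - -c
            - |
              # Hypothetical deep health check: request one token and exit
              # non-zero (probe failure) if the server does not answer in time.
              import json, urllib.request
              body = json.dumps({
                  "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
                  "prompt": "ping",
                  "max_tokens": 1,
              }).encode()
              req = urllib.request.Request(
                  "http://localhost:8000/v1/completions",
                  data=body,
                  headers={"Content-Type": "application/json"},
              )
              urllib.request.urlopen(req, timeout=110)
          periodSeconds: 300
          timeoutSeconds: 120
          failureThreshold: 3

This is not a fix, of course, just a way to keep the endpoint self-healing while the root cause is being tracked down.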

Report of performance regression

Here is the config for the first model I deployed, "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8":
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "4"
    meta.helm.sh/release-name: llama-4-maverick-instruct-fp8
    meta.helm.sh/release-namespace: llms
  creationTimestamp: "2025-04-06T11:41:41Z"
  generation: 8
  labels:
    app.kubernetes.io/instance: llama-4-maverick-instruct-fp8
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: llama-4-maverick-instruct-fp8
    app.kubernetes.io/version: 0.1.0
    helm.sh/chart: vllm-server-0.1.0
  name: llama-4-maverick-instruct-fp8
  namespace: llms
  resourceVersion: "1383024842"
  uid: d2ff200c-6000-42da-995e-ffd498f600bb
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: llama-4-maverick-instruct-fp8
      app.kubernetes.io/name: llama-4-maverick-instruct-fp8
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/restartedAt: "2025-04-09T18:21:07+05:30"
        prometheus.io/path: /metrics
        prometheus.io/port: "8000"
        prometheus.io/scrape: "true"
        rollme: YKuic
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: llama-4-maverick-instruct-fp8
        app.kubernetes.io/name: llama-4-maverick-instruct-fp8
    spec:
      containers:
      - args:
        - --host
        - 0.0.0.0
        - --port
        - "8000"
        - --model
        - meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
        - --swap-space
        - "16"
        - --disable-log-requests
        - --tensor-parallel-size
        - "8"
        - --gpu-memory-utilization
        - "0.98"
        - --max-model-len
        - "430000"
        - --trust-remote-code
        - --enable-auto-tool-choice
        - --enable-prefix-caching
        - --tool-call-parser
        - llama3_json
        env:
        - name: HF_HOME
          value: /huggingface
        - name: HUGGINGFACE_HUB_CACHE
          value: /huggingface/hub
        - name: HF_HUB_ENABLE_HF_TRANSFER
          value: "True"
        - name: HUGGING_FACE_HUB_TOKEN
          value: <-hf-token>
        image: vllm/vllm-openai:v0.8.3
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /health
            port: http
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: vllm-server
        ports:
        - containerPort: 8000
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: http
            scheme: HTTP
          periodSeconds: 20
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          limits:
            nvidia.com/gpu: "8"
          requests:
            cpu: "16"
            memory: 1000Gi
        securityContext: {}
        startupProbe:
          failureThreshold: 60
          httpGet:
            path: /health
            port: http
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /huggingface
          name: hf-volume
        - mountPath: /dev/shm
          name: dshm
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: gitlab-docker-credentials
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: default
      serviceAccountName: default
      terminationGracePeriodSeconds: 30
      volumes:
      - name: hf-volume
        persistentVolumeClaim:
          claimName: llama-4-maverick-instruct-fp8-cache
      - emptyDir:
          medium: Memory
          sizeLimit: 500Gi
        name: dshm
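
If more detail would help with debugging, I can redeploy both models with more verbose logging before the next occurrence. A minimal sketch of the extra env entries I have in mind (assuming VLLM_LOGGING_LEVEL is honored by the v0.8.3 image; NCCL_DEBUG is the standard NCCL variable and may be relevant given tensor-parallel-size 8), alongside dropping --disable-log-requests so individual requests show up in the logs:

        env:
        - name: VLLM_LOGGING_LEVEL    # vLLM log verbosity, default INFO
          value: DEBUG
        - name: NCCL_DEBUG            # NCCL diagnostics for the 8-way tensor-parallel group
          value: INFO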

And here is the config for the second model I deployed, "meta-llama/Llama-4-Scout-17B-16E-Instruct":

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "2"
    meta.helm.sh/release-name: llama-4-scout-instruct
    meta.helm.sh/release-namespace: llms
  creationTimestamp: "2025-04-06T12:56:38Z"
  generation: 4
  labels:
    app.kubernetes.io/instance: llama-4-scout-instruct
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: llama-4-scout-instruct
    app.kubernetes.io/version: 0.1.0
    helm.sh/chart: vllm-server-0.1.0
  name: llama-4-scout-instruct
  namespace: llms
  resourceVersion: "1383204273"
  uid: d1633e26-094d-44a0-8c88-e7ad7481b4d1
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: llama-4-scout-instruct
      app.kubernetes.io/name: llama-4-scout-instruct
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8000"
        prometheus.io/scrape: "true"
        rollme: VOJlB
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: llama-4-scout-instruct
        app.kubernetes.io/name: llama-4-scout-instruct
    spec:
      containers:
      - args:
        - --host
        - 0.0.0.0
        - --port
        - "8000"
        - --model
        - meta-llama/Llama-4-Scout-17B-16E-Instruct
        - --swap-space
        - "16"
        - --disable-log-requests
        - --tensor-parallel-size
        - "8"
        - --gpu-memory-utilization
        - "0.98"
        - --max-model-len
        - "1000000"
        - --trust-remote-code
        - --enable-auto-tool-choice
        - --enable-prefix-caching
        - --tool-call-parser
        - llama3_json
        env:
        - name: HF_HOME
          value: /huggingface
        - name: HUGGINGFACE_HUB_CACHE
          value: /huggingface/hub
        - name: HF_HUB_ENABLE_HF_TRANSFER
          value: "True"
        - name: HUGGING_FACE_HUB_TOKEN
          value: <-hf-token>
        image: vllm/vllm-openai:v0.8.3
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /health
            port: http
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: vllm-server
        ports:
        - containerPort: 8000
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: http
            scheme: HTTP
          periodSeconds: 20
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          limits:
            nvidia.com/gpu: "8"
          requests:
            cpu: "16"
            memory: 500Gi
        securityContext: {}
        startupProbe:
          failureThreshold: 60
          httpGet:
            path: /health
            port: http
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /huggingface
          name: hf-volume
        - mountPath: /dev/shm
          name: dshm
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: gitlab-docker-credentials
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: default
      serviceAccountName: default
      terminationGracePeriodSeconds: 30
      volumes:
      - name: hf-volume
        persistentVolumeClaim:
          claimName: llama-4-scout-instruct-cache
      - emptyDir:
          medium: Memory
          sizeLimit: 250Gi
        name: dshm
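
To rule out the fairly aggressive memory settings as a contributing factor, I am also considering running the Scout deployment for a day or two with more conservative values (the numbers below are just placeholders for that experiment, not recommendations):

        - --gpu-memory-utilization
        - "0.90"
        - --max-model-len
        - "262144"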

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Metadata

Assignees

Labels

performance (Performance-related issues)

Type

No type

Projects

Status

In Progress

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
