
[Performance]: Stability Concerns with LLaMA-4 Models After Extended Uptime (llama-4 models stability on h100 gpus) #16473

Open
@nskpro-cmd

Description


Proposal to improve performance

Hi all,

I wanted to check if anyone else has encountered stability issues with the LLaMA-4 models over extended periods of time. In our setup, the model functions as expected immediately after deployment or a restart. However, after approximately 24 to 36 hours, it stops responding to inference requests.

I’ve verified that the underlying node conditions (GPU health, memory, and system resources) remain healthy during this time. The behavior is reproducible across restarts: after running for a day or more, the model becomes unresponsive again.

Is this a known issue with the current version of the model or vLLM backend? Has anyone else experienced similar behavior or found a workaround?

Appreciate any insights.
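
As a stop-gap, I am considering replacing the /health-based liveness probe with one that exercises the generation path, so Kubernetes restarts the pod automatically when inference hangs even if /health keeps answering. A rough, untested sketch is below (timings are placeholders; the embedded request just asks vLLM's OpenAI-compatible /v1/completions endpoint for a single token):

        livenessProbe:
          exec:
            command:
            - python3
            - -c
            - |
              # Hypothetical deep health check: request one token and exit
              # non-zero (probe failure) if the server does not answer in time.
              import json, urllib.request
              body = json.dumps({
                  "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
                  "prompt": "ping",
                  "max_tokens": 1,
              }).encode()
              req = urllib.request.Request(
                  "http://localhost:8000/v1/completions",
                  data=body,
                  headers={"Content-Type": "application/json"},
              )
              urllib.request.urlopen(req, timeout=110)
          periodSeconds: 300
          timeoutSeconds: 120
          failureThreshold: 3

This is not a fix, of course, just a way to keep the endpoint self-healing while the root cause is being tracked down.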

Report of performance regression

Here is the config for the first model I deployed, "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8":
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "4"
    meta.helm.sh/release-name: llama-4-maverick-instruct-fp8
    meta.helm.sh/release-namespace: llms
  creationTimestamp: "2025-04-06T11:41:41Z"
  generation: 8
  labels:
    app.kubernetes.io/instance: llama-4-maverick-instruct-fp8
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: llama-4-maverick-instruct-fp8
    app.kubernetes.io/version: 0.1.0
    helm.sh/chart: vllm-server-0.1.0
  name: llama-4-maverick-instruct-fp8
  namespace: llms
  resourceVersion: "1383024842"
  uid: d2ff200c-6000-42da-995e-ffd498f600bb
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: llama-4-maverick-instruct-fp8
      app.kubernetes.io/name: llama-4-maverick-instruct-fp8
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/restartedAt: "2025-04-09T18:21:07+05:30"
        prometheus.io/path: /metrics
        prometheus.io/port: "8000"
        prometheus.io/scrape: "true"
        rollme: YKuic
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: llama-4-maverick-instruct-fp8
        app.kubernetes.io/name: llama-4-maverick-instruct-fp8
    spec:
      containers:
      - args:
        - --host
        - 0.0.0.0
        - --port
        - "8000"
        - --model
        - meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
        - --swap-space
        - "16"
        - --disable-log-requests
        - --tensor-parallel-size
        - "8"
        - --gpu-memory-utilization
        - "0.98"
        - --max-model-len
        - "430000"
        - --trust-remote-code
        - --enable-auto-tool-choice
        - --enable-prefix-caching
        - --tool-call-parser
        - llama3_json
        env:
        - name: HF_HOME
          value: /huggingface
        - name: HUGGINGFACE_HUB_CACHE
          value: /huggingface/hub
        - name: HF_HUB_ENABLE_HF_TRANSFER
          value: "True"
        - name: HUGGING_FACE_HUB_TOKEN
          value: <-hf-token>
        image: vllm/vllm-openai:v0.8.3
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /health
            port: http
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: vllm-server
        ports:
        - containerPort: 8000
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: http
            scheme: HTTP
          periodSeconds: 20
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          limits:
            nvidia.com/gpu: "8"
          requests:
            cpu: "16"
            memory: 1000Gi
        securityContext: {}
        startupProbe:
          failureThreshold: 60
          httpGet:
            path: /health
            port: http
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /huggingface
          name: hf-volume
        - mountPath: /dev/shm
          name: dshm
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: gitlab-docker-credentials
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: default
      serviceAccountName: default
      terminationGracePeriodSeconds: 30
      volumes:
      - name: hf-volume
        persistentVolumeClaim:
          claimName: llama-4-maverick-instruct-fp8-cache
      - emptyDir:
          medium: Memory
          sizeLimit: 500Gi
        name: dshm
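
If more detail would help with debugging, I can redeploy both models with more verbose logging before the next occurrence. A minimal sketch of the extra env entries I have in mind (assuming VLLM_LOGGING_LEVEL is honored by the v0.8.3 image; NCCL_DEBUG is the standard NCCL variable and may be relevant given tensor-parallel-size 8), alongside dropping --disable-log-requests so individual requests show up in the logs:

        env:
        - name: VLLM_LOGGING_LEVEL    # vLLM log verbosity, default INFO
          value: DEBUG
        - name: NCCL_DEBUG            # NCCL diagnostics for the 8-way tensor-parallel group
          value: INFO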

And here is the config for the second model I deployed, "meta-llama/Llama-4-Scout-17B-16E-Instruct":

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "2"
    meta.helm.sh/release-name: llama-4-scout-instruct
    meta.helm.sh/release-namespace: llms
  creationTimestamp: "2025-04-06T12:56:38Z"
  generation: 4
  labels:
    app.kubernetes.io/instance: llama-4-scout-instruct
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: llama-4-scout-instruct
    app.kubernetes.io/version: 0.1.0
    helm.sh/chart: vllm-server-0.1.0
  name: llama-4-scout-instruct
  namespace: llms
  resourceVersion: "1383204273"
  uid: d1633e26-094d-44a0-8c88-e7ad7481b4d1
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: llama-4-scout-instruct
      app.kubernetes.io/name: llama-4-scout-instruct
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8000"
        prometheus.io/scrape: "true"
        rollme: VOJlB
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: llama-4-scout-instruct
        app.kubernetes.io/name: llama-4-scout-instruct
    spec:
      containers:
      - args:
        - --host
        - 0.0.0.0
        - --port
        - "8000"
        - --model
        - meta-llama/Llama-4-Scout-17B-16E-Instruct
        - --swap-space
        - "16"
        - --disable-log-requests
        - --tensor-parallel-size
        - "8"
        - --gpu-memory-utilization
        - "0.98"
        - --max-model-len
        - "1000000"
        - --trust-remote-code
        - --enable-auto-tool-choice
        - --enable-prefix-caching
        - --tool-call-parser
        - llama3_json
        env:
        - name: HF_HOME
          value: /huggingface
        - name: HUGGINGFACE_HUB_CACHE
          value: /huggingface/hub
        - name: HF_HUB_ENABLE_HF_TRANSFER
          value: "True"
        - name: HUGGING_FACE_HUB_TOKEN
          value: <-hf-token>
        image: vllm/vllm-openai:v0.8.3
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /health
            port: http
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: vllm-server
        ports:
        - containerPort: 8000
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: http
            scheme: HTTP
          periodSeconds: 20
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          limits:
            nvidia.com/gpu: "8"
          requests:
            cpu: "16"
            memory: 500Gi
        securityContext: {}
        startupProbe:
          failureThreshold: 60
          httpGet:
            path: /health
            port: http
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /huggingface
          name: hf-volume
        - mountPath: /dev/shm
          name: dshm
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: gitlab-docker-credentials
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: default
      serviceAccountName: default
      terminationGracePeriodSeconds: 30
      volumes:
      - name: hf-volume
        persistentVolumeClaim:
          claimName: llama-4-scout-instruct-cache
      - emptyDir:
          medium: Memory
          sizeLimit: 250Gi
        name: dshm
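
To rule out the fairly aggressive memory settings as a contributing factor, I am also considering running the Scout deployment for a day or two with more conservative values (the numbers below are just placeholders for that experiment, not recommendations):

        - --gpu-memory-utilization
        - "0.90"
        - --max-model-len
        - "262144"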

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Metadata

Assignees

Labels

performance (Performance-related issues)

Type

No type

Projects

Status

In Progress

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
