Commit 0b43801

docs: add 405b disaggregated serving documentation (#496)
1 parent ce68339 commit 0b43801

File tree

3 files changed: +154 −54 lines changed

examples/llm/README.md (+4 −54)
````diff
@@ -151,65 +151,15 @@ curl localhost:8000/v1/chat/completions -H "Content-Type: application/json"
 
 ```
 
-### Multinode Examples
+### Multi-node deployment
 
-#### Single node sized models
-You can deploy our example architectures on multiple nodes via NATS/ETCD-based discovery and communication. Here is an example of deploying disaggregated serving on 2 nodes.
-
-##### Disaggregated Deployment with KV Routing
-Node 1: Frontend, Processor, Router, 8 Decode
-Node 2: 8 Prefill
-
-**Step 1**: Start NATS/ETCD on your head node. Ensure you have the correct firewall rules to allow communication between the nodes, as the NATS/ETCD endpoints must be accessible from node 2.
-```bash
-# node 1
-docker compose -f deploy/docker-compose.yml up -d
-```
-
-**Step 2**: Create the inference graph for this deployment. The easiest way to do this is to remove the `.link(PrefillWorker)` call from the `disagg_router.py` file.
-
-```python
-# graphs/disagg_router.py
-# imports...
-Frontend.link(Processor).link(Router).link(VllmWorker)
-```
-
-**Step 3**: Start the frontend, processor, router, and 8 VllmWorkers on node 1.
-```bash
-# node 1
-cd /workspace/examples/llm
-dynamo serve graphs.disagg_router:Frontend -f ./configs/disagg_router.yaml --VllmWorker.ServiceArgs.workers=8
-```
-
-**Step 4**: Start 8 PrefillWorkers on node 2.
-Since we only want to start the `PrefillWorker` on node 2, you can run just that component directly.
-
-```bash
-# node 2
-export NATS_SERVER='<your-nats-server-address>'  # note: this should start with nats://
-export ETCD_ENDPOINTS='<your-etcd-endpoints-address>'
-
-cd /workspace/examples/llm
-dynamo serve components.prefill_worker:PrefillWorker -f ./configs/disagg_router.yaml --PrefillWorker.ServiceArgs.workers=8
-```
-
-You can now use the same curl request from above to interact with your deployment!
+See [multinode-examples.md](multinode-examples.md) for more details.
 
 ### Close deployment
 
 Kill all dynamo processes managed by circusd.
 
 ```
-function kill_tree() {
-    local parent=$1
-    local children=$(ps -o pid= --ppid $parent)
-    for child in $children; do
-        kill_tree $child
-    done
-    echo "Killing process $parent"
-    kill -9 $parent
-}
-
-# kill process-tree of circusd
-kill_tree $(pgrep circusd)
+ctrl-c
+pkill python3
```
````
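The removed shell helper walked the process tree depth-first before killing; if `pkill python3` is too blunt for a shared node, the same idea can be sketched in Python. This is a hypothetical helper, not shipped with dynamo, and it assumes GNU `ps` (the `--ppid` flag):

```python
import os
import signal
import subprocess

def child_pids(parent: int) -> list[int]:
    """Return direct children of `parent`, using the same `ps` query as the old kill_tree."""
    out = subprocess.run(
        ["ps", "-o", "pid=", "--ppid", str(parent)],
        capture_output=True, text=True,
    ).stdout
    return [int(p) for p in out.split()]

def kill_tree(parent: int) -> None:
    """Depth-first SIGKILL of a process tree, children before the parent."""
    for child in child_pids(parent):
        kill_tree(child)
    print(f"Killing process {parent}")
    try:
        os.kill(parent, signal.SIGKILL)
    except ProcessLookupError:
        pass  # process already exited
```

Usage would mirror the old script: `kill_tree(pid)` where `pid` comes from something like `pgrep circusd`.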
+66
````diff
@@ -0,0 +1,66 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# This configuration file is used in the multinode-examples.md file
+# to start the 405B model on 3 nodes.
+
+Frontend:
+  served_model_name: nvidia/Llama-3.1-405B-Instruct-FP8
+  endpoint: dynamo.Processor.chat/completions
+  port: 8000
+
+Processor:
+  model: nvidia/Llama-3.1-405B-Instruct-FP8
+  block-size: 64
+  max-model-len: 8192
+  router: kv
+
+Router:
+  model-name: nvidia/Llama-3.1-405B-Instruct-FP8
+  min-workers: 1
+
+VllmWorker:
+  model: nvidia/Llama-3.1-405B-Instruct-FP8
+  kv-transfer-config: '{"kv_connector":"DynamoNixlConnector"}'
+  block-size: 64
+  max-model-len: 8192
+  max-num-seqs: 16
+  remote-prefill: true
+  conditional-disagg: true
+  max-local-prefill-length: 10
+  max-prefill-queue-size: 2
+  gpu-memory-utilization: 0.95
+  tensor-parallel-size: 8
+  router: kv
+  quantization: modelopt
+  enable-prefix-caching: true
+  ServiceArgs:
+    workers: 1
+    resources:
+      gpu: 8
+
+PrefillWorker:
+  model: nvidia/Llama-3.1-405B-Instruct-FP8
+  kv-transfer-config: '{"kv_connector":"DynamoNixlConnector"}'
+  block-size: 64
+  max-model-len: 8192
+  max-num-seqs: 16
+  gpu-memory-utilization: 0.95
+  tensor-parallel-size: 8
+  quantization: modelopt
+  ServiceArgs:
+    workers: 1
+    resources:
+      gpu: 8
````
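For readers of the config: the `conditional-disagg` knobs plausibly gate when a request's prefill is shipped to the remote prefill workers rather than done on the decode worker. The exact semantics are an assumption here, not taken from dynamo's source; this sketch only illustrates how `max-local-prefill-length` and `max-prefill-queue-size` could interact:

```python
def should_prefill_remotely(prompt_len: int, prefill_queue_size: int,
                            max_local_prefill_length: int = 10,
                            max_prefill_queue_size: int = 2) -> bool:
    """Assumed semantics of conditional disaggregation: short prompts are
    prefilled locally on the decode worker, and longer ones go to the remote
    prefill workers unless their queue is already saturated."""
    if prompt_len <= max_local_prefill_length:
        return False  # cheap enough to prefill locally
    if prefill_queue_size >= max_prefill_queue_size:
        return False  # remote queue full; fall back to local prefill
    return True
```

The defaults mirror the sample config's `max-local-prefill-length: 10` and `max-prefill-queue-size: 2`.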

examples/llm/multinode-examples.md (+84)
````diff
@@ -0,0 +1,84 @@
+# Multinode Examples
+
+Table of Contents
+- [Single node sized models](#single-node-sized-models)
+- [Multi-node sized models](#multi-node-sized-models)
+
+## Single node sized models
+You can deploy dynamo on multiple nodes via NATS/ETCD-based discovery and communication. Here is an example of deploying disaggregated serving on 3 nodes using `nvidia/Llama-3.1-405B-Instruct-FP8`. Each node must be properly configured with InfiniBand and/or RoCE for communication between decode and prefill workers.
+
+### Disaggregated Deployment with KV Routing
+- Node 1: Frontend, Processor, Router, Decode Worker
+- Node 2: Prefill Worker
+- Node 3: Prefill Worker
+
+Note that this setup extends easily to more nodes. You can also run the Frontend, Processor, and Router on a separate CPU-only node if you'd like, as long as all nodes have access to the NATS/ETCD endpoints!
+
+**Step 1**: Start NATS/ETCD on your head node. Ensure you have the correct firewall rules to allow communication between the nodes, as the NATS/ETCD endpoints must be accessible from all other nodes.
+```bash
+# node 1
+docker compose -f deploy/docker-compose.yml up -d
+```
````
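Before starting workers on the other nodes, it can save debugging time to confirm the head node's NATS/ETCD ports are reachable from each worker node. A minimal Python sketch, assuming the default ports (NATS 4222, etcd client 2379; adjust to whatever your docker-compose.yml exposes):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Assumed default ports: NATS 4222, etcd 2379. Run from each worker node:
# for svc, port in [("NATS", 4222), ("etcd", 2379)]:
#     print(svc, "reachable" if port_open("<head-node-ip>", port) else "NOT reachable")
```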
````diff
+**Step 2**: Create the inference graph for this node. Here we use the `agg_router.py` graph (even though we are doing disaggregated serving) because we want the `Frontend`, `Processor`, `Router`, and `VllmWorker` to spin up on node 1; the prefill workers are started separately on the other nodes later.
+
+```python
+# graphs/agg_router.py
+Frontend.link(Processor).link(Router).link(VllmWorker)
+```
+
+**Step 3**: Create a configuration file for this node. A sample for the 405B model is provided in `configs/multinode-405b.yaml`. Note that the file still includes the `PrefillWorker` component even though node 1 does not use it: this lets you reuse the same configuration file on every node and spin up only the individual workers each node needs.
+
````
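The sample config's GPU arithmetic can be sanity-checked per node: each worker replica shards the model across `tensor-parallel-size` GPUs, so a node must reserve at least `workers * tensor-parallel-size` GPUs. A hypothetical helper (not a dynamo API), fed the VllmWorker values from the sample file:

```python
def check_node_gpu_budget(service: dict) -> bool:
    """True if the node's GPU reservation covers all worker replicas.
    Illustrative check of the sample config's arithmetic, not a dynamo API."""
    needed = service["ServiceArgs"]["workers"] * service["tensor-parallel-size"]
    return service["ServiceArgs"]["resources"]["gpu"] >= needed

# Values from the sample multinode-405b.yaml VllmWorker section:
vllm_worker = {
    "tensor-parallel-size": 8,
    "ServiceArgs": {"workers": 1, "resources": {"gpu": 8}},
}
```

With `workers: 1` and `tensor-parallel-size: 8`, each VllmWorker or PrefillWorker node needs all 8 of its reserved GPUs.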
````diff
+**Step 4**: Start the frontend, processor, router, and VllmWorker on node 1.
+```bash
+# node 1
+cd $DYNAMO_HOME/examples/llm
+dynamo serve graphs.agg_router:Frontend -f ./configs/multinode-405b.yaml
+```
+
+**Step 5**: Start the first prefill worker on node 2.
+Since we only want to start the `PrefillWorker` on node 2, you can run just that component directly with the same configuration file.
+
+```bash
+# node 2
+export NATS_SERVER='<your-nats-server-address>'  # note: this should start with nats://
+export ETCD_ENDPOINTS='<your-etcd-endpoints-address>'
+
+cd $DYNAMO_HOME/examples/llm
+dynamo serve components.prefill_worker:PrefillWorker -f ./configs/multinode-405b.yaml
+```
+
+**Step 6**: Start the second prefill worker on node 3.
+```bash
+# node 3
+export NATS_SERVER='<your-nats-server-address>'  # note: this should start with nats://
+export ETCD_ENDPOINTS='<your-etcd-endpoints-address>'
+
+cd $DYNAMO_HOME/examples/llm
+dynamo serve components.prefill_worker:PrefillWorker -f ./configs/multinode-405b.yaml
+```
````
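A common failure mode on worker nodes is a malformed discovery environment, e.g. `NATS_SERVER` missing the `nats://` scheme. A small illustrative Python check (not part of dynamo) that could be run before `dynamo serve`:

```python
import os

def check_discovery_env() -> list[str]:
    """Return a list of problems with the NATS/ETCD discovery env vars.
    Illustrative helper; the scheme requirements are assumptions."""
    problems = []
    nats = os.environ.get("NATS_SERVER", "")
    etcd = os.environ.get("ETCD_ENDPOINTS", "")
    if not nats.startswith("nats://"):
        problems.append("NATS_SERVER must start with nats://")
    if not etcd:
        problems.append("ETCD_ENDPOINTS is not set")
    return problems
```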
````diff
+
+### Client
+
+In another terminal:
+```bash
+# this test request has an input sequence length (ISL) of around 200 tokens
+
+curl <node1-ip>:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -H "Accept: text/event-stream" \
+  -d '{
+    "model": "nvidia/Llama-3.1-405B-Instruct-FP8",
+    "messages": [
+    {
+      "role": "user",
+      "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
+    }
+    ],
+    "stream": true,
+    "max_tokens": 300
+  }'
+```
````
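The curl call above can also be issued from Python with only the standard library. This is an illustrative sketch of the same OpenAI-compatible streaming request (endpoint path, port, and headers copied from the curl command; host and prompt are parameters):

```python
import json
import urllib.request

def build_chat_request(host: str, prompt: str, model: str,
                       max_tokens: int = 300) -> urllib.request.Request:
    """Build the same streaming chat-completions request as the curl example."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
        "max_tokens": max_tokens,
    }).encode()
    return urllib.request.Request(
        f"http://{host}:8000/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json",
                 "Accept": "text/event-stream"},
    )

# req = build_chat_request("<node1-ip>", "Hello!", "nvidia/Llama-3.1-405B-Instruct-FP8")
# with urllib.request.urlopen(req) as resp:  # iterates over SSE lines
#     for line in resp:
#         print(line.decode().rstrip())
```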
````diff
+
+## Multi-node sized models
+
+Multinode model support is coming soon. You can track progress [here](https://github.com/ai-dynamo/dynamo/issues/513)!
````
