Commit 926370b

docs: Benchmarking guide updates (#678) (#699)
1 parent b7cd853 · commit 926370b

3 files changed (+46, -46 lines)

examples/llm/benchmarks/README.md

Lines changed: 11 additions & 7 deletions
@@ -26,6 +26,15 @@ This guide provides detailed steps on benchmarking Large Language Models (LLMs)
 
 H100 80GB x8 node(s) are required for benchmarking.
 
+> [!NOTE]
+> This guide was tested on node(s) with the following hardware configuration:
+> * **GPUs**: 8xH100 80GB HBM3 (GPU Memory Bandwidth 3.2 TB/s)
+> * **CPU**: 2x Intel Sapphire Rapids, Intel(R) Xeon(R) Platinum 8480CL E5, 112 cores (56 cores per CPU), 2.00 GHz (Base), 3.8 GHz (Max boost), PCIe Gen5
+> * **NVLink**: NVLink 4th Generation, 900 GB/s (GPU to GPU NVLink bidirectional bandwidth), 18 Links per GPU
+> * **InfiniBand**: 8x 400 Gbit/s (Compute Links), 2x 400 Gbit/s (Storage Links)
+>
+> Benchmarking with a different hardware configuration may yield suboptimal results.
+
 1\. Build benchmarking image
 ```bash
 ./container/build.sh
@@ -43,7 +52,7 @@ docker compose -f deploy/docker_compose.yml up -d
 
 ## Disaggregated Single Node Benchmarking
 
-*One H100 80GB x8 node is required for this setup.*
+One H100 80GB x8 node is required for this setup.
 
 In the following setup we compare Dynamo disaggregated vLLM performance to
 [native vLLM Aggregated Baseline](#vllm-aggregated-baseline-benchmarking) on a single node. These were chosen to optimize
@@ -72,12 +81,7 @@ Collect the performance numbers as shown on the [Collecting Performance Numbers]
 
 ## Disaggregated Multi Node Benchmarking
 
-*Two H100 80GB x8 nodes are required for this setup.*
-
-> [!Note]
-> Nodes used for benchmarking were part of a cluster connected via InfiniBand
-> NDR with 8 connections for compute and 2 for storage. Both fabrics were on
-> their own fat tree non-blocking topology.
+Two H100 80GB x8 nodes are required for this setup.
 
 In the following steps we compare Dynamo disaggregated vLLM performance to
 [native vLLM Aggregated Baseline](#vllm-aggregated-baseline-benchmarking) on two nodes. These were chosen to optimize

examples/llm/benchmarks/disagg.yaml

Lines changed: 14 additions & 16 deletions
@@ -13,6 +13,16 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
+Common:
+  model: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
+  kv-transfer-config: '{"kv_connector":"DynamoNixlConnector"}'
+  router: round-robin
+  # Number of tokens in a batch for more efficient chunked transfers to GPUs.
+  block-size: 128
+  max-model-len: 3500
+  max-num-batched-tokens: 3500
+  disable-log-requests: true
+
 Frontend:
   # This model was chosen for its 70B size and FP8 precision, which the TP and
   # DP configurations were tuned for its size, and its precision reduces model
@@ -22,38 +32,26 @@ Frontend:
   port: 8000
 
 Processor:
-  model: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
-  router: round-robin
+  common-configs: [model, router]
 
 # x1 process with 4 GPUs generating output tokens (the "decode" phase).
 VllmWorker:
-  model: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
-  kv-transfer-config: '{"kv_connector":"DynamoNixlConnector"}'
-  # Number of tokens in a batch for more efficient chunked transfers to GPUs.
-  block-size: 128
-  max-model-len: 3500
+  common-configs: [model, kv-transfer-config, router, block-size, max-model-len, disable-log-requests]
   # Enable prefill at different workers.
   remote-prefill: true
   # Disable local prefill so only disaggregated prefill is used.
   conditional-disagg: false
-  tensor-parallel-size: 4
   gpu-memory-utilization: 0.95
-  disable-log-requests: true
+  tensor-parallel-size: 4
   ServiceArgs:
     workers: 1
     resources:
       gpu: 4
 
 # x4 processes each with 1 GPU handling the initial prefill (context embedding) phase.
 PrefillWorker:
-  model: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
-  kv-transfer-config: '{"kv_connector":"DynamoNixlConnector"}'
-  block-size: 128
-  max-model-len: 3500
-  max-num-batched-tokens: 3500
+  common-configs: [model, kv-transfer-config, block-size, max-model-len, max-num-batched-tokens, gpu-memory-utilization, disable-log-requests]
   tensor-parallel-size: 1
-  gpu-memory-utilization: 0.95
-  disable-log-requests: true
   ServiceArgs:
     workers: 4
     resources:
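The net effect of the `disagg.yaml` change is that flags shared by several services are declared once under `Common`, and each service opts into the keys it needs through `common-configs`. A minimal sketch of the pattern, using values taken from the hunks above (not the complete file):

```yaml
# Shared flags, declared once.
Common:
  model: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
  kv-transfer-config: '{"kv_connector":"DynamoNixlConnector"}'
  block-size: 128

# A service lists only the Common keys it wants to inherit; anything else
# (e.g. tensor-parallel-size) is still set locally on the service.
PrefillWorker:
  common-configs: [model, kv-transfer-config, block-size]
  tensor-parallel-size: 1
```

Keys omitted from a service's `common-configs` list are not applied to that service, which is why `Processor` pulls only `[model, router]` while the workers pull the longer lists shown above.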

examples/llm/benchmarks/disagg_multinode.yaml

Lines changed: 21 additions & 23 deletions
@@ -13,15 +13,9 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-Frontend:
-  served_model_name: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
-  endpoint: dynamo.Processor.chat/completions
-  port: 8000
-
-Processor:
+Common:
   model: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
-  block-size: 128
-  max-model-len: 3500
+  kv-transfer-config: '{"kv_connector":"DynamoNixlConnector"}'
   # Routing policy determines how remote workers are selected for processing
   # prefill requests
   # 1. random: randomly select workers for prefill requests
@@ -31,39 +25,43 @@ Processor:
   # 3. kv: finding prefill workers by KV cache is not beneficial when caching is
   # disabled on this setup
   router: round-robin
+  # Number of tokens in a batch for more efficient chunked transfers to GPUs.
+  block-size: 128
+  max-model-len: 3500
+  max-num-batched-tokens: 3500
+  disable-log-requests: true
+
+Frontend:
+  served_model_name: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
+  endpoint: dynamo.Processor.chat/completions
+  port: 8000
+
+Processor:
+  common-configs: [model, block-size, max-model-len, router]
 
 Router:
-  model-name: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
+  common-configs: [model]
   min-workers: 1
 
 VllmWorker:
-  model: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
-  kv-transfer-config: '{"kv_connector":"DynamoNixlConnector"}'
-  block-size: 128
-  max-model-len: 3500
+  common-configs: [model, kv-transfer-config, router, block-size, max-model-len, disable-log-requests]
   # Enable prefill at different workers.
   remote-prefill: true
   # Disable local prefill so only disaggregated prefill is used.
   conditional-disagg: false
+  # The GPU memory utilization does not have to match between VllmWorker and PrefillWorker.
+  gpu-memory-utilization: 0.95
   # TP size is doubled from single node setup
   tensor-parallel-size: 8
-  gpu-memory-utilization: 0.95
-  disable-log-requests: true
-  router: round-robin
   ServiceArgs:
     workers: 1
     resources:
       gpu: 8
 
 PrefillWorker:
-  model: neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
-  kv-transfer-config: '{"kv_connector":"DynamoNixlConnector"}'
-  block-size: 128
-  max-model-len: 3500
-  max-num-batched-tokens: 3500
-  tensor-parallel-size: 1
+  common-configs: [model, kv-transfer-config, block-size, max-model-len, max-num-batched-tokens, disable-log-requests]
   gpu-memory-utilization: 0.95
-  disable-log-requests: true
+  tensor-parallel-size: 1
   ServiceArgs:
     # DP size is doubled from single node setup
     workers: 8
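The multi-node file follows the same `Common`/`common-configs` pattern, and it keeps values that may legitimately differ between services out of `Common`; per the comment added above, `gpu-memory-utilization` stays on each worker. A condensed sketch of the worker sections after this change, assembled from the hunks above (nested `ServiceArgs` settings elided):

```yaml
Router:
  common-configs: [model]   # only the model name is shared
  min-workers: 1

VllmWorker:
  common-configs: [model, kv-transfer-config, router, block-size, max-model-len, disable-log-requests]
  gpu-memory-utilization: 0.95   # set locally; need not match PrefillWorker
  tensor-parallel-size: 8        # TP size is doubled from the single node setup

PrefillWorker:
  common-configs: [model, kv-transfer-config, block-size, max-model-len, max-num-batched-tokens, disable-log-requests]
  gpu-memory-utilization: 0.95
  tensor-parallel-size: 1
```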

0 commit comments