
[Benchmark] Add single turn MTBench to Serving Bench #17202


Merged (6 commits) on Apr 28, 2025

Conversation

@ekagra-ranjan (Contributor) commented on Apr 25, 2025

This PR adds single-turn MTBench to the benchmark datasets.
We have been using single-turn MTBench for the EAGLE bench via examples/offline_inference/eagle.py. However, that script reports output tokens/s, which does not exclude the TTFT. To measure TPOT we have to use benchmark_serving.py. We already have a lot of MTBench results for different EAGLE settings, so this PR adds MTBench to the serving benchmark so that TPOT can be measured.
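
For context, a minimal sketch (not the PR's actual implementation) of how single-turn prompts can be derived from the HF dataset; the "turns" field and the split name are assumptions based on the original MT-Bench question format, so verify them against philschmid/mt-bench:

from datasets import load_dataset
from transformers import AutoTokenizer

# Assumed layout: each row has "turns" = [first question, follow-up question].
dataset = load_dataset("philschmid/mt-bench", split="train")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

prompts = []
for row in dataset:
    first_turn = row["turns"][0]  # single turn: drop the follow-up question
    prompts.append(
        tokenizer.apply_chat_template(
            [{"role": "user", "content": first_turn}],
            tokenize=False,
            add_generation_prompt=True,  # see the chat-template note further below
        )
    )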

bench cmd

python3 benchmarks/benchmark_serving.py --port 9001 --save-result \
  --backend vllm \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --endpoint /v1/completions \
  --dataset-name hf \
  --dataset-path philschmid/mt-bench \
  --num-prompts 80 \
  --max-concurrency 1

VANILLA
serve cmd

vllm serve meta-llama/Llama-3.1-8B-Instruct  --disable-log-requests --port 9001

Result

Starting initial single prompt test run...                         
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf                                                                                                              
Burstiness factor: 1.0 (Poisson process)                                                                                               
Maximum request concurrency: 1                                     
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [02:26<00:00,  1.83s/it]
============ Serving Benchmark Result ============ 
Successful requests:                     80        
Benchmark duration (s):                  146.25    
Total input tokens:                      5333      
Total generated tokens:                  19450     
Request throughput (req/s):              0.55      
Output token throughput (tok/s):         132.99    
Total Token throughput (tok/s):          169.45    
---------------Time to First Token----------------
Mean TTFT (ms):                          12.24     
Median TTFT (ms):                        12.22     
P99 TTFT (ms):                           13.79     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.50      
Median TPOT (ms):                        7.49      
P99 TPOT (ms):                           7.61      
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.50      
Median ITL (ms):                         7.49      
P99 ITL (ms):                            7.99      
==================================================

EAGLE-1

serve cmd

VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --disable-log-requests --port 9001 \
  --speculative_config '{"model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 2}'

Result

Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [01:28<00:00,  1.10s/it]
============ Serving Benchmark Result ============
Successful requests:                     80        
Benchmark duration (s):                  88.03     
Total input tokens:                      8133      
Total generated tokens:                  16943     
Request throughput (req/s):              0.91      
Output token throughput (tok/s):         192.46    
Total Token throughput (tok/s):          284.85    
---------------Time to First Token----------------
Mean TTFT (ms):                          14.77     
Median TTFT (ms):                        14.74     
P99 TTFT (ms):                           15.98     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          5.11      
Median TPOT (ms):                        5.05      
P99 TPOT (ms):                           6.59      
---------------Inter-token Latency----------------
Mean ITL (ms):                           10.46     
Median ITL (ms):                         10.57     
P99 ITL (ms):                            11.46     
==================================================

EAGLE-3

serve cmd

VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --disable-log-requests --port 9001 \
  --speculative_config '{"method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 2}'

Result

Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [01:32<00:00,  1.15s/it]
============ Serving Benchmark Result ============
Successful requests:                     80        
Benchmark duration (s):                  92.35     
Total input tokens:                      8133      
Total generated tokens:                  16908     
Request throughput (req/s):              0.87      
Output token throughput (tok/s):         183.09    
Total Token throughput (tok/s):          271.16    
---------------Time to First Token----------------
Mean TTFT (ms):                          15.87     
Median TTFT (ms):                        15.15     
P99 TTFT (ms):                           21.82     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          5.37      
Median TPOT (ms):                        5.38      
P99 TPOT (ms):                           6.92      
---------------Inter-token Latency----------------
Mean ITL (ms):                           10.64     
Median ITL (ms):                         10.76     
P99 ITL (ms):                            11.71     
==================================================

cc: @LiuXiaoxuanPKU

Automated bot comment:

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@ekagra-ranjan (Contributor, Author) commented on Apr 25, 2025

Here is how the serving bench compares with the offline bench.

The TPOT of Llama 3.1 is 7.5 ms, which matches the 133 tokens/s obtained from examples/offline_inference/eagle.py.
The TPOT of EAGLE-1 with K=2 is 5.11 ms, i.e. about 195 tokens/s, which is close to the offline value of 201 tokens/s.
The TPOT of EAGLE-3 with K=2 is 5.37 ms, i.e. about 186 tokens/s, but offline gives 220 tokens/s.
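
For reference, the tokens/s figures above are just the reciprocal of TPOT; a quick sanity-check snippet (not part of the benchmark script):

# tokens/s implied by TPOT (decode only; the first token / TTFT is excluded)
def tokens_per_second(tpot_ms: float) -> float:
    return 1000.0 / tpot_ms

print(round(tokens_per_second(7.50)))  # ~133, vanilla
print(round(tokens_per_second(5.11)))  # ~196, EAGLE-1 K=2
print(round(tokens_per_second(5.37)))  # ~186, EAGLE-3 K=2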

Online serving is slower than offline serving for EAGLE-3.

Earlier, the numbers for both EAGLE-1 and EAGLE-3 were quite low because apply_chat_template with add_generation_prompt=True was missing. Adding it improved the numbers for both, but EAGLE-3 is still slower online than offline.
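
As an illustration (a hedged sketch, not code from this PR), add_generation_prompt=True appends the assistant header so the model answers the question instead of continuing the user turn; the exact string depends on the model's chat template, and the sample question is only an example:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
msgs = [{"role": "user", "content": "Compose an engaging travel blog post about a recent trip to Hawaii."}]

without_gen = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False)
with_gen = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)

# For Llama 3.x the difference is expected to be the assistant header,
# e.g. "<|start_header_id|>assistant<|end_header_id|>\n\n" (verify locally).
print(with_gen[len(without_gen):])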

@WoosukKwon (Collaborator) commented on Apr 26, 2025

Hi @ekagra-ranjan, thanks for the PR! This is wonderful and so useful!

A few things to note:

  1. [V1][Spec Decode] Apply torch.compile & cudagraph to EAGLE #17211 can be critical for the e2e performance
  2. Currently, our implementation of eagle + prefix caching is not correct, perhaps leading to a slightly lower acceptance rate. [V1][Spec Decode] Make eagle compatible with prefix caching. #17137 will fix this.

We may need to benchmark the performance again once the two PRs land, which should be soon.

Review thread on these lines from the diff:

https://github.com/vllm-project/vllm/blob/9d98ab5ec/examples/offline_inference/eagle.py#L14-L18 # noqa: E501
"""

DEFAULT_OUTPUT_LEN = 256  # avg len used in SD bench in vLLM

Collaborator: QQ: Which SD bench do you mean?

Contributor (Author): I was referring to the offline EAGLE bench. Let me know if you would like me to clarify this in the code comment.

Collaborator: Oh, I think it's quite arbitrary then. What about using longer outputs, like 1K+?

Contributor (Author): 1K would make the MTBench run 4x longer. My experience so far has been that 256 is good enough to know the inference metrics, and they don't change much with a longer output length on MTBench.

@WoosukKwon (Collaborator) commented:

@ekagra-ranjan Please fix the lint errors. :)

@WoosukKwon (Collaborator) left a review:

LGTM. Thanks for the PR!

@WoosukKwon merged commit cfe4532 into vllm-project:main on Apr 28, 2025
18 of 21 checks passed
jikunshang pushed a commit to jikunshang/vllm that referenced this pull request Apr 29, 2025