
[Benchmark] Add single turn MTBench to Serving Bench #17202


Merged (6 commits) on Apr 28, 2025

Conversation

@ekagra-ranjan (Contributor) commented on Apr 25, 2025

This PR adds single-turn MTBench to the benchmark datasets.
We have been using single-turn MTBench for the EAGLE bench via examples/offline_inference/eagle.py. However, that script reports output tokens/s, which does not exclude the TTFT. To measure TPOT we have to use benchmark_serving.py. We already have a lot of MTBench results for different EAGLE settings, so this PR adds MTBench to the serving benchmark so that TPOT can be measured.
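
For context, a minimal sketch (not the PR's actual implementation) of how single-turn prompts can be derived from the HF dataset; the "turns" field and the split name are assumptions based on the original MT-Bench question format, so verify them against philschmid/mt-bench:

from datasets import load_dataset
from transformers import AutoTokenizer

# Assumed layout: each row has "turns" = [first question, follow-up question].
dataset = load_dataset("philschmid/mt-bench", split="train")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

prompts = []
for row in dataset:
    first_turn = row["turns"][0]  # single turn: drop the follow-up question
    prompts.append(
        tokenizer.apply_chat_template(
            [{"role": "user", "content": first_turn}],
            tokenize=False,
            add_generation_prompt=True,  # see the chat-template note further below
        )
    )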

bench cmd

python3 benchmarks/benchmark_serving.py --port 9001 --save-result \
  --backend vllm \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --endpoint /v1/completions \
  --dataset-name hf \
  --dataset-path philschmid/mt-bench \
  --num-prompts 80 \
  --max-concurrency 1

VANILLA
serve cmd

vllm serve meta-llama/Llama-3.1-8B-Instruct  --disable-log-requests --port 9001

Result

Starting initial single prompt test run...                         
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf                                                                                                              
Burstiness factor: 1.0 (Poisson process)                                                                                               
Maximum request concurrency: 1                                     
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [02:26<00:00,  1.83s/it]
============ Serving Benchmark Result ============ 
Successful requests:                     80        
Benchmark duration (s):                  146.25    
Total input tokens:                      5333      
Total generated tokens:                  19450     
Request throughput (req/s):              0.55      
Output token throughput (tok/s):         132.99    
Total Token throughput (tok/s):          169.45    
---------------Time to First Token----------------
Mean TTFT (ms):                          12.24     
Median TTFT (ms):                        12.22     
P99 TTFT (ms):                           13.79     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.50      
Median TPOT (ms):                        7.49      
P99 TPOT (ms):                           7.61      
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.50      
Median ITL (ms):                         7.49      
P99 ITL (ms):                            7.99      
==================================================

EAGLE-1

serve cmd

VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --disable-log-requests --port 9001 \
  --speculative_config '{"model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 2}'

Result

Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [01:28<00:00,  1.10s/it]
============ Serving Benchmark Result ============
Successful requests:                     80        
Benchmark duration (s):                  88.03     
Total input tokens:                      8133      
Total generated tokens:                  16943     
Request throughput (req/s):              0.91      
Output token throughput (tok/s):         192.46    
Total Token throughput (tok/s):          284.85    
---------------Time to First Token----------------
Mean TTFT (ms):                          14.77     
Median TTFT (ms):                        14.74     
P99 TTFT (ms):                           15.98     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          5.11      
Median TPOT (ms):                        5.05      
P99 TPOT (ms):                           6.59      
---------------Inter-token Latency----------------
Mean ITL (ms):                           10.46     
Median ITL (ms):                         10.57     
P99 ITL (ms):                            11.46     
==================================================

EAGLE-3

serve cmd

VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --disable-log-requests --port 9001 \
  --speculative_config '{"method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 2}'

Result

Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [01:32<00:00,  1.15s/it]
============ Serving Benchmark Result ============
Successful requests:                     80        
Benchmark duration (s):                  92.35     
Total input tokens:                      8133      
Total generated tokens:                  16908     
Request throughput (req/s):              0.87      
Output token throughput (tok/s):         183.09    
Total Token throughput (tok/s):          271.16    
---------------Time to First Token----------------
Mean TTFT (ms):                          15.87     
Median TTFT (ms):                        15.15     
P99 TTFT (ms):                           21.82     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          5.37      
Median TPOT (ms):                        5.38      
P99 TPOT (ms):                           6.92      
---------------Inter-token Latency----------------
Mean ITL (ms):                           10.64     
Median ITL (ms):                         10.76     
P99 ITL (ms):                            11.71     
==================================================

cc: @LiuXiaoxuanPKU

Automated bot comment:

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@ekagra-ranjan (Contributor, Author) commented on Apr 25, 2025

Here is how the serving bench compares with the offline bench.

The TPOT of Llama 3.1 is 7.5 ms, which matches the 133 tokens/s obtained from examples/offline_inference/eagle.py.
The TPOT of EAGLE-1 with K=2 is 5.11 ms, i.e. about 195 tokens/s, which is close to the offline value of 201 tokens/s.
The TPOT of EAGLE-3 with K=2 is 5.37 ms, i.e. about 186 tokens/s, but offline gives 220 tokens/s.
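
For reference, the tokens/s figures above are just the reciprocal of TPOT; a quick sanity-check snippet (not part of the benchmark script):

# tokens/s implied by TPOT (decode only; the first token / TTFT is excluded)
def tokens_per_second(tpot_ms: float) -> float:
    return 1000.0 / tpot_ms

print(round(tokens_per_second(7.50)))  # ~133, vanilla
print(round(tokens_per_second(5.11)))  # ~196, EAGLE-1 K=2
print(round(tokens_per_second(5.37)))  # ~186, EAGLE-3 K=2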

Online serving is slower than offline serving for EAGLE-3.

Earlier, the numbers for both EAGLE-1 and EAGLE-3 were quite low because apply_chat_template with add_generation_prompt=True was missing. Adding it improved the numbers for both, but EAGLE-3 is still slower online than offline.
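
As an illustration (a hedged sketch, not code from this PR), add_generation_prompt=True appends the assistant header so the model answers the question instead of continuing the user turn; the exact string depends on the model's chat template, and the sample question is only an example:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
msgs = [{"role": "user", "content": "Compose an engaging travel blog post about a recent trip to Hawaii."}]

without_gen = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False)
with_gen = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)

# For Llama 3.x the difference is expected to be the assistant header,
# e.g. "<|start_header_id|>assistant<|end_header_id|>\n\n" (verify locally).
print(with_gen[len(without_gen):])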

@WoosukKwon (Collaborator) commented on Apr 26, 2025

Hi @ekagra-ranjan, thanks for the PR! This is wonderful and so useful!

A few things to note:

  1. [V1][Spec Decode] Apply torch.compile & cudagraph to EAGLE #17211 can be critical for the e2e performance
  2. Currently, our implementation of eagle + prefix caching is not correct, perhaps leading to a slightly lower acceptance rate. [V1][Spec Decode] Make eagle compatible with prefix caching. #17137 will fix this.

We may need to benchmark the performance again once the two PRs land, which should be soon.

Review thread on these lines from the diff:

https://github.com/vllm-project/vllm/blob/9d98ab5ec/examples/offline_inference/eagle.py#L14-L18 # noqa: E501
"""

DEFAULT_OUTPUT_LEN = 256  # avg len used in SD bench in vLLM

Collaborator: QQ: Which SD bench do you mean?

Contributor (Author): I was referring to the offline EAGLE bench. Let me know if you would like me to clarify this in the code comment.

Collaborator: Oh, I think it's quite arbitrary then. What about using longer outputs, like 1K+?

Contributor (Author): 1K would make the MTBench run 4x longer. My experience so far has been that 256 is good enough to know the inference metrics, and they don't change much with a longer output length on MTBench.

@WoosukKwon (Collaborator) commented:

@ekagra-ranjan Please fix the lint errors. :)

@WoosukKwon (Collaborator) left a review:

LGTM. Thanks for the PR!

@WoosukKwon merged commit cfe4532 into vllm-project:main on Apr 28, 2025
18 of 21 checks passed
jikunshang pushed a commit to jikunshang/vllm that referenced this pull request Apr 29, 2025