[Performance]: Performance comparison for v1 engine and v0 engine #17540

### Proposal to improve performance

_No response_

### Report of performance regression

Hi, I ran a benchmark to compare the performance of the v1 engine and the v0 engine, using `benchmark_serving.py` on ShareGPT and an internal dataset.

The results for llama3-3-70b-instruct are shown below.
ShareGPT:

<img width="629" alt="Image" src="https://github.com/user-attachments/assets/01619df4-82c6-4129-b4dd-761a17a7f6ac" />

Our internal dataset:

<img width="629" alt="Image" src="https://github.com/user-attachments/assets/6b251c9b-5801-40e0-b402-cbb81f1ed52f" />


The average prompt length in our internal dataset is around 9k tokens.
The v1 engine performs much worse on this dataset. In particular, with the v1 engine TTFT becomes much larger for long prompts at high QPS.

<img width="628" alt="Image" src="https://github.com/user-attachments/assets/ce08896f-7a5a-431f-9929-6f6ba5b69962" />
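As a rough illustration of why long prompts interact with the per-step token budget, here is a minimal sketch. It assumes a prompt of exactly 9000 tokens (a stand-in for "around 9k"), a budget of 8192 tokens per scheduler step, and that the whole budget goes to a single request. `min_prefill_steps` is a hypothetical helper, not a vLLM function, and this is not a diagnosis of the regression.

```python
import math


def min_prefill_steps(prompt_tokens: int, max_num_batched_tokens: int) -> int:
    """Lower bound on scheduler steps needed to prefill one prompt.

    Assumes the entire per-step token budget is spent on this prompt,
    which is optimistic when other requests are running concurrently.
    """
    if prompt_tokens <= 0 or max_num_batched_tokens <= 0:
        raise ValueError("both arguments must be positive")
    return math.ceil(prompt_tokens / max_num_batched_tokens)


# Assumed values: ~9k-token prompt, 8192-token budget per step.
steps = min_prefill_steps(9000, 8192)
print(steps)  # 2
```

Under these assumptions, a prompt of this size cannot be prefilled in a single step. At high QPS several such prompts would share the same budget, so the real number of steps before the first token could be higher.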



Setup details: both deployments used 4 H100s and kserve, with vLLM 0.8.4 and `fp8-dynamic` quantization.
The launch params for the v0 engine:
```
--gpu-memory-utilization=0.90
--tensor-parallel-size=4
--enable-chunked-prefill
--max-num-batched-tokens=8192
--enable-auto-tool-choice
--tool-call-parser=llama3_json
```

The launch params for the v1 engine:
```
--gpu-memory-utilization=0.90
--tensor-parallel-size=4
--enable-auto-tool-choice
--tool-call-parser=llama3_json
```
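The only difference is that the v1 launch omits `--enable-chunked-prefill` and `--max-num-batched-tokens`: as far as I understand, the v1 engine does not need `--enable-chunked-prefill`, and its `max-num-batched-tokens` is already 8192.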
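To double-check that nothing else differs between the two launches, here is a small sketch that diffs the two argument lists. The lists are copied from the launch params above; `diff_args` is a hypothetical helper written for this comparison, and it only compares explicit flags, not engine defaults.

```python
# Launch arguments copied from the two deployments above.
V0_ARGS = [
    "--gpu-memory-utilization=0.90",
    "--tensor-parallel-size=4",
    "--enable-chunked-prefill",
    "--max-num-batched-tokens=8192",
    "--enable-auto-tool-choice",
    "--tool-call-parser=llama3_json",
]
V1_ARGS = [
    "--gpu-memory-utilization=0.90",
    "--tensor-parallel-size=4",
    "--enable-auto-tool-choice",
    "--tool-call-parser=llama3_json",
]


def diff_args(a, b):
    """Return (only_in_a, only_in_b), preserving the original order."""
    set_a, set_b = set(a), set(b)
    only_a = [arg for arg in a if arg not in set_b]
    only_b = [arg for arg in b if arg not in set_a]
    return only_a, only_b


only_v0, only_v1 = diff_args(V0_ARGS, V1_ARGS)
print("only in v0:", only_v0)
print("only in v1:", only_v1)
```

This confirms that the explicit flags differ only in the two chunked-prefill settings. Whether the effective v1 defaults match those values still has to be checked against the running server's configuration.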

Let me know whether it’s a fair comparison.

### Misc discussion on performance

_No response_

### Your current environment (if you think it is necessary)





### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
