[Performance]: first token latency during inference is longer when the number of input tokens is small

### Proposal to improve performance

**Describe the question**
In vLLM, I noticed that the first token latency (time to first token, TTFT) during inference is longer for short prompts (e.g., 30 input tokens) than for long prompts (e.g., 800 input tokens).

Is this expected behavior? Could you help explain why this happens?
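To help separate a real gap from warm-up or measurement noise, the comparison can be repeated with a small timing harness. The sketch below is an illustration only and is not part of vLLM: `make_stream` is a hypothetical callable that sends one streaming request and returns an iterator of output chunks, and `fake_stream` is a stand-in that only simulates a delay. The warm-up and run counts are arbitrary assumptions.

```python
import statistics
import time
from typing import Callable, Iterable, Iterator


def time_to_first_token(make_stream: Callable[[], Iterable[str]]) -> float:
    """Return the seconds from issuing the request until the first chunk arrives.

    `make_stream` is a hypothetical callable: it should send one streaming
    request and return an iterable of output chunks.
    """
    start = time.perf_counter()
    for _ in make_stream():
        # Stop at the first chunk; later chunks do not affect this metric.
        return time.perf_counter() - start
    raise ValueError("stream produced no tokens")


def median_ttft(
    make_stream: Callable[[], Iterable[str]],
    warmup: int = 2,
    runs: int = 5,
) -> float:
    """Median first-token latency over `runs` requests, after `warmup` discarded ones."""
    for _ in range(warmup):
        time_to_first_token(make_stream)
    return statistics.median(time_to_first_token(make_stream) for _ in range(runs))


def fake_stream(delay_s: float) -> Iterator[str]:
    """Stand-in for a real streaming response; it only simulates a delay."""
    time.sleep(delay_s)
    yield "first"
    yield "second"


if __name__ == "__main__":
    # Replace the lambdas with callables that send a ~30-token and an
    # ~800-token prompt to the server under test.
    short = median_ttft(lambda: fake_stream(0.02))
    long_ = median_ttft(lambda: fake_stream(0.01))
    print(f"short prompt: {short * 1000:.1f} ms, long prompt: {long_ * 1000:.1f} ms")
```

Discarding the warm-up requests and reporting a median keeps one-off costs on the first few requests from being attributed to prompt length.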

**Environment**

- vLLM version: v0.8.4
- Model: DeepSeek-R1
- GPU: 8 x H200

### Report of performance regression

_No response_

### Misc discussion on performance

_No response_

### Your current environment (if you think it is necessary)

```text
The output of `python collect_env.py`
```


### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

[Performance]: first token latency during inference is longer when the number of input tokens is small #17352
