
[Performance]: How long can the waiting queue be, and which startup parameters affect it? #17824

Open
@nvliajia

Description


Proposal to improve performance

I launched the ds32B model on eight RTX 4090 GPUs and then sent 1200 requests in a row, but the number of waiting reqs plus running reqs adds up to only about 1000. I'd like to know how vLLM controls the queue length, and which startup parameters the queue length depends on.

[Screenshot: vLLM log output showing the waiting and running request counts]

My launch command is: vllm serve llm_model/ds_32B/ --served-model-name deepseek --api-key 12345 --disable-log-requests --trust-remote-code --tensor-parallel-size 8 --max-model-len 25000 --gpu_memory_utilization 0.7 --max-num-seqs 96 --max-num-batched-tokens 18096
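
For reference, one way to observe the queue sizes described above is to poll vLLM's Prometheus `/metrics` endpoint, which exposes `vllm:num_requests_running` and `vllm:num_requests_waiting` gauges. A minimal sketch, assuming the server from the command above is listening on the default port 8000 and uses the API key 12345; treat it as illustrative rather than definitive:

```python
# Poll vLLM's /metrics endpoint and print the running/waiting request gauges
# while the 1200 requests are in flight. Port 8000 and the API key are
# assumptions taken from the serve command above.
import time
import requests

BASE_URL = "http://localhost:8000"  # assumed default port for `vllm serve`

def poll_queue_sizes(interval_s: float = 2.0, iterations: int = 30) -> None:
    for _ in range(iterations):
        # /metrics is usually served without authentication, but send the
        # API key header in case this deployment protects it.
        resp = requests.get(
            f"{BASE_URL}/metrics",
            headers={"Authorization": "Bearer 12345"},
            timeout=5,
        )
        resp.raise_for_status()
        # Keep only the two scheduler gauges of interest.
        for line in resp.text.splitlines():
            if line.startswith(("vllm:num_requests_running",
                                "vllm:num_requests_waiting")):
                print(line)
        print("---")
        time.sleep(interval_s)

if __name__ == "__main__":
    poll_queue_sizes()
```

Comparing these gauges against the client's in-flight request count can help tell whether the ~1000 ceiling comes from the scheduler (e.g. the running batch, which is capped at --max-num-seqs 96 here) or from client-side concurrency limits.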

Report of performance regression

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Metadata

Assignees: No one assigned
Labels: performance (Performance-related issues)
Projects: None
Milestone: None
Development: No branches or pull requests