Description
Proposal to improve performance
I launched the ds32B model on eight 4090 GPUs and then sent 1200 requests back-to-back, but the waiting reqs and running reqs counts only add up to about 1000. I'd like to understand how vLLM controls the queue length, and which launch parameters the queue length depends on.

My launch command is:

```shell
vllm serve llm_model/ds_32B/ --served-model-name deepseek --api-key 12345 --disable-log-requests --trust-remote-code --tensor-parallel-size 8 --max-model-len 25000 --gpu_memory_utilization 0.7 --max-num-seqs 96 --max-num-batched-tokens 18096
```
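For illustration only, here is a toy sketch (not vLLM's actual scheduler code) of the general idea behind a cap like `--max-num-seqs`: incoming requests land in a waiting queue, and each scheduling step promotes waiting requests into the running set only up to the cap. The numbers below reuse the values from the launch command; how the remaining requests are admitted or rejected server-side is a separate question.

```python
from collections import deque

# Toy model of a bounded scheduler (NOT vLLM's real implementation):
# running is capped at MAX_NUM_SEQS; everything else stays in waiting.
MAX_NUM_SEQS = 96            # from --max-num-seqs in the command above

waiting = deque(range(1200)) # 1200 incoming request ids
running = []

# One scheduling step: admit waiting requests until the cap is reached.
while waiting and len(running) < MAX_NUM_SEQS:
    running.append(waiting.popleft())

print(len(running), len(waiting))  # 96 running, 1104 still waiting
```

In this toy model, waiting + running always equals the number of requests the server has accepted, so if the client (or an HTTP-level limit) only keeps about 1000 requests in flight, the two queues would sum to about 1000 rather than 1200.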
Report of performance regression
No response
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
The output of `python collect_env.py`
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.