Description
Proposal to improve performance
vLLM version: 0.8.5.post1
Without YaRN:

```shell
vllm serve Qwen/Qwen3-32B \
    --trust-remote-code --gpu_memory_utilization 0.95 --tensor-parallel-size 2 \
    --quantization bitsandbytes --load_format bitsandbytes --enforce_eager \
    --max-model-len 32768
```
With YaRN:

```shell
vllm serve Qwen/Qwen3-32B \
    --trust-remote-code --gpu_memory_utilization 0.95 --tensor-parallel-size 2 \
    --quantization bitsandbytes --load_format bitsandbytes --enforce_eager \
    --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
    --max-model-len 131072
```
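For reference, the `--rope-scaling` config above extends the context window multiplicatively: 32768 × 4.0 = 131072, matching `--max-model-len`. Conceptually, YaRN leaves high-frequency RoPE dimensions untouched, divides low-frequency ones by the scaling factor, and linearly ramps between the two regimes. A minimal sketch of that frequency interpolation (the `dim`, `base`, and beta values here are illustrative defaults, not necessarily Qwen3-32B's actual configuration):

```python
import math

def yarn_inv_freq(dim=128, base=10000.0, factor=4.0,
                  orig_max_pos=32768, beta_fast=32, beta_slow=1):
    """Sketch of YaRN-style scaling of RoPE inverse frequencies."""
    # Base RoPE inverse frequency for each dimension pair.
    inv_freq = [base ** (-2.0 * i / dim) for i in range(dim // 2)]

    # Dimension index at which `num_rot` full rotations fit into the
    # original context window (YaRN's correction-range computation).
    def correction_dim(num_rot):
        return (dim * math.log(orig_max_pos / (num_rot * 2 * math.pi))
                / (2 * math.log(base)))

    low = max(math.floor(correction_dim(beta_fast)), 0)
    high = min(math.ceil(correction_dim(beta_slow)), dim // 2 - 1)

    scaled = []
    for i, f in enumerate(inv_freq):
        # ramp goes 0 -> 1 from `low` to `high`; ext = 1 means pure
        # extrapolation (frequency untouched), ext = 0 means pure
        # interpolation (frequency divided by the scaling factor).
        ramp = min(max((i - low) / max(high - low, 1), 0.0), 1.0)
        ext = 1.0 - ramp
        scaled.append((f / factor) * (1.0 - ext) + f * ext)
    return scaled
```

With these defaults, the highest-frequency dimension keeps its original frequency while the lowest-frequency one is divided by 4.0, i.e. stretched to cover the longer window.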
I have run some tests of Qwen3's agentic capabilities on my end, and I have solid findings that enabling YaRN to extend the context window does degrade performance, with around a 15-20% drop.
Do you also see the same behavior? Any suggestions about this drop?
Report of performance regression
No response
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
The output of `python collect_env.py`
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.