Description
Proposal to improve performance
Hi, I am new here.
I have a requirement and was hoping someone could guide me. Thank you very much!
MY SITUATION
- 8 * H200
- vLLM 0.8.5 base Docker image (I also have lmcache/vllm-openai:2025-05-17-v1)
- Qwen2.5-72B: I need to improve its TTFT, TPOT, and throughput through PD (prefill/decode) disaggregation, e.g. with Mooncake

I have successfully tried the LMCache connector v1 as shown below:
```bash
# Prefiller (GPUs 0,1)
UCX_TLS=cuda_ipc,cuda_copy,tcp \
LMCACHE_CONFIG_FILE=/workspace/examples/lmcache/disagg_prefill_lmcache_v1/configs/lmcache-prefiller-config.yaml \
LMCACHE_USE_EXPERIMENTAL=True \
VLLM_ENABLE_V1_MULTIPROCESSING=1 \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
CUDA_VISIBLE_DEVICES=0,1 \
VLLM_USE_V1=1 \
/opt/venv/bin/vllm serve /xxbucket/prod/models/QWEN-25-72B-INSTRUCT/Qwen2.5-72B-Instruct \
  --served-model-name llm \
  --port 8100 \
  --max-model-len 6400 \
  --max-num-batched-tokens 6400 \
  --max-num-seqs 48 \
  --block-size 128 \
  --gpu-memory-utilization 0.95 \
  --tensor-parallel-size 2 \
  --max-seq-len-to-capture 6400 \
  --enforce-eager \
  --trust-remote-code \
  --kv-transfer-config \
  '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_producer","kv_connector_extra_config": {"discard_partial_chunks": false, "lmcache_rpc_port": "producer1"}}' \
  > /prefiller.log 2>&1 &

# Decoder (GPUs 2,3)
UCX_TLS=cuda_ipc,cuda_copy,tcp \
LMCACHE_CONFIG_FILE=/workspace/examples/lmcache/disagg_prefill_lmcache_v1/configs/lmcache-decoder-config.yaml \
LMCACHE_USE_EXPERIMENTAL=True \
VLLM_ENABLE_V1_MULTIPROCESSING=1 \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
CUDA_VISIBLE_DEVICES=2,3 \
VLLM_USE_V1=1 \
/opt/venv/bin/vllm serve /xxbucket/prod/models/QWEN-25-72B-INSTRUCT/Qwen2.5-72B-Instruct \
  --served-model-name llm \
  --port 8200 \
  --max-model-len 6400 \
  --max-seq-len-to-capture 6400 \
  --tensor-parallel-size 2 \
  --max-num-batched-tokens 6400 \
  --max-num-seqs 48 \
  --block-size 128 \
  --gpu-memory-utilization 0.95 \
  --enforce-eager \
  --trust-remote-code \
  --kv-transfer-config \
  '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_consumer","kv_connector_extra_config": {"discard_partial_chunks": false, "lmcache_rpc_port": "consumer1"}}' \
  > /decoder.log 2>&1 &

# Proxy
/opt/venv/bin/python3 /workspace/examples/lmcache/disagg_prefill_lmcache_v1/disagg_proxy_server.py \
  --host 0.0.0.0 \
  --prefiller-host 0.0.0.0 \
  --decoder-host 0.0.0.0
```
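To smoke-test the proxy end-to-end, I send a request like the one below (a minimal sketch; the port 8000 and the `/v1/completions` route are assumptions about `disagg_proxy_server.py`'s defaults, so adjust them if your proxy is configured differently):

```bash
# Smoke test through the proxy. Port 8000 is an assumption; if the proxy was
# started with an explicit --port, use that value here instead.
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llm",
        "prompt": "San Francisco is a",
        "max_tokens": 50
      }'
```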
I benchmarked it with 500 requests, 3000-token input, 500-token output, and a concurrency of 20.
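That load pattern roughly corresponds to a benchmark invocation like the following (a sketch assuming vLLM's `benchmarks/benchmark_serving.py` from the 0.8.x source tree and the proxy on port 8000; flag names may vary between versions, so please check against your checkout):

```bash
# Sketch of the workload above: 500 random prompts, ~3000 input tokens,
# ~500 output tokens, 20 concurrent requests. The base URL (proxy port 8000)
# is an assumption. Note the servers expose the model as "llm"
# (--served-model-name); if your benchmark_serving.py supports
# --served-model-name, pass "llm" there so the request model name matches.
python3 benchmarks/benchmark_serving.py \
  --backend vllm \
  --base-url http://localhost:8000 \
  --model /xxbucket/prod/models/QWEN-25-72B-INSTRUCT/Qwen2.5-72B-Instruct \
  --dataset-name random \
  --random-input-len 3000 \
  --random-output-len 500 \
  --num-prompts 500 \
  --max-concurrency 20
```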
MY PROBLEM
Can I improve performance further with the LMCache setup above?
Or should I try Mooncake or Dynamo instead?
THANK YOU ALL, ANY HELP IS VERY MUCH APPRECIATED
Report of performance regression
No response
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
The output of `python collect_env.py`
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.