
[Performance]: How can I improve performance further with vLLM + LMCache PD disaggregation? Please help #18801

Open
@mugglew

Description

Proposal to improve performance

Hi everyone, I'm new here.

I have a requirement and would appreciate some guidance. Thank you very much.

MY SITUATION

8 × H200
vLLM 0.8.5 base Docker image (I also have lmcache/vllm-openai:2025-05-17-v1)
Qwen2.5-72B: I need to improve its TTFT, TPOT, and throughput via PD disaggregation (Mooncake)
I have successfully launched it with the LMCache connector v1 as shown below:
```bash
# Prefiller (kv_producer) on GPUs 0-1
UCX_TLS=cuda_ipc,cuda_copy,tcp \
LMCACHE_CONFIG_FILE=/workspace/examples/lmcache/disagg_prefill_lmcache_v1/configs/lmcache-prefiller-config.yaml \
LMCACHE_USE_EXPERIMENTAL=True \
VLLM_ENABLE_V1_MULTIPROCESSING=1 \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
CUDA_VISIBLE_DEVICES=0,1 \
VLLM_USE_V1=1 \
/opt/venv/bin/vllm serve /xxbucket/prod/models/QWEN-25-72B-INSTRUCT/Qwen2.5-72B-Instruct \
    --served-model-name llm \
    --port 8100 \
    --max-model-len 6400 \
    --max-num-batched-tokens 6400 \
    --max-num-seqs 48 \
    --block-size 128 \
    --gpu-memory-utilization 0.95 \
    --tensor-parallel-size 2 \
    --max-seq-len-to-capture 6400 \
    --enforce-eager \
    --trust-remote-code \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_producer","kv_connector_extra_config": {"discard_partial_chunks": false, "lmcache_rpc_port": "producer1"}}' \
    > /prefiller.log 2>&1 &

# Decoder (kv_consumer) on GPUs 2-3
UCX_TLS=cuda_ipc,cuda_copy,tcp \
LMCACHE_CONFIG_FILE=/workspace/examples/lmcache/disagg_prefill_lmcache_v1/configs/lmcache-decoder-config.yaml \
LMCACHE_USE_EXPERIMENTAL=True \
VLLM_ENABLE_V1_MULTIPROCESSING=1 \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
CUDA_VISIBLE_DEVICES=2,3 \
VLLM_USE_V1=1 \
/opt/venv/bin/vllm serve /xxbucket/prod/models/QWEN-25-72B-INSTRUCT/Qwen2.5-72B-Instruct \
    --served-model-name llm \
    --port 8200 \
    --max-model-len 6400 \
    --max-seq-len-to-capture 6400 \
    --tensor-parallel-size 2 \
    --max-num-batched-tokens 6400 \
    --max-num-seqs 48 \
    --block-size 128 \
    --gpu-memory-utilization 0.95 \
    --enforce-eager \
    --trust-remote-code \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_consumer","kv_connector_extra_config": {"discard_partial_chunks": false, "lmcache_rpc_port": "consumer1"}}' \
    > /decoder.log 2>&1 &

# Proxy that splits each request into a prefill pass and a decode pass
/opt/venv/bin/python3 /workspace/examples/lmcache/disagg_prefill_lmcache_v1/disagg_proxy_server.py \
    --host 0.0.0.0 \
    --prefiller-host 0.0.0.0 \
    --decoder-host 0.0.0.0
```
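Before benchmarking, it helps to confirm that a single request actually flows prefiller → decoder through the proxy. A minimal smoke test is sketched below; the port 8000 and the OpenAI-style /v1/completions route are assumptions based on the example proxy's defaults, and the model name matches `--served-model-name llm` above.

```bash
# Smoke test through the disaggregation proxy.
# Assumptions: the proxy listens on port 8000 and forwards OpenAI-compatible
# /v1/completions requests; adjust both if your setup differs.
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llm", "prompt": "San Francisco is a", "max_tokens": 32}'
```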

With 500 requests, 3000-token input, 500-token output, and 20-way concurrency, it performs as follows:

[Benchmark results attached as an image]
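For context, the load profile above (500 requests, 3000-token input, 500-token output, 20-way concurrency) roughly corresponds to the following invocation of vLLM's benchmark_serving.py pointed at the proxy. The script location and proxy port are assumptions to adapt to your environment, and the request model name may need to match what the proxy exposes (here `llm`).

```bash
# Rough sketch of the reported load profile using vLLM's serving benchmark.
# Assumptions: the script is run from a local vLLM checkout and the proxy
# listens on port 8000.
python3 benchmarks/benchmark_serving.py \
    --backend vllm \
    --host 0.0.0.0 --port 8000 \
    --model /xxbucket/prod/models/QWEN-25-72B-INSTRUCT/Qwen2.5-72B-Instruct \
    --dataset-name random \
    --random-input-len 3000 \
    --random-output-len 500 \
    --num-prompts 500 \
    --max-concurrency 20
```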

MY PROBLEM

Can I improve performance further with the LMCache setup above?
Should I try Mooncake or Dynamo instead?
THANK YOU ALL, AND ANY HELP IS VERY MUCH APPRECIATED

Report of performance regression

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
