Description
Proposal to improve performance
Hi, I am new here.
I have a requirement and was hoping someone could guide me. Thank you very much!
MY SITUATION
- 8 * H200
- vLLM 0.8.5 base Docker image (I also have lmcache/vllm-openai:2025-05-17-v1)
- Qwen2.5-72B: I need to improve its TTFT, TPOT, and throughput through PD (prefill/decode) disaggregation, e.g. with Mooncake

I have successfully tried the LMCache connector v1 as shown below:
```bash
# Prefiller (GPUs 0,1)
UCX_TLS=cuda_ipc,cuda_copy,tcp \
LMCACHE_CONFIG_FILE=/workspace/examples/lmcache/disagg_prefill_lmcache_v1/configs/lmcache-prefiller-config.yaml \
LMCACHE_USE_EXPERIMENTAL=True \
VLLM_ENABLE_V1_MULTIPROCESSING=1 \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
CUDA_VISIBLE_DEVICES=0,1 \
VLLM_USE_V1=1 \
/opt/venv/bin/vllm serve /xxbucket/prod/models/QWEN-25-72B-INSTRUCT/Qwen2.5-72B-Instruct \
  --served-model-name llm \
  --port 8100 \
  --max-model-len 6400 \
  --max-num-batched-tokens 6400 \
  --max-num-seqs 48 \
  --block-size 128 \
  --gpu-memory-utilization 0.95 \
  --tensor-parallel-size 2 \
  --max-seq-len-to-capture 6400 \
  --enforce-eager \
  --trust-remote-code \
  --kv-transfer-config \
  '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_producer","kv_connector_extra_config": {"discard_partial_chunks": false, "lmcache_rpc_port": "producer1"}}' \
  > /prefiller.log 2>&1 &

# Decoder (GPUs 2,3)
UCX_TLS=cuda_ipc,cuda_copy,tcp \
LMCACHE_CONFIG_FILE=/workspace/examples/lmcache/disagg_prefill_lmcache_v1/configs/lmcache-decoder-config.yaml \
LMCACHE_USE_EXPERIMENTAL=True \
VLLM_ENABLE_V1_MULTIPROCESSING=1 \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
CUDA_VISIBLE_DEVICES=2,3 \
VLLM_USE_V1=1 \
/opt/venv/bin/vllm serve /xxbucket/prod/models/QWEN-25-72B-INSTRUCT/Qwen2.5-72B-Instruct \
  --served-model-name llm \
  --port 8200 \
  --max-model-len 6400 \
  --max-seq-len-to-capture 6400 \
  --tensor-parallel-size 2 \
  --max-num-batched-tokens 6400 \
  --max-num-seqs 48 \
  --block-size 128 \
  --gpu-memory-utilization 0.95 \
  --enforce-eager \
  --trust-remote-code \
  --kv-transfer-config \
  '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_consumer","kv_connector_extra_config": {"discard_partial_chunks": false, "lmcache_rpc_port": "consumer1"}}' \
  > /decoder.log 2>&1 &

# Proxy
/opt/venv/bin/python3 /workspace/examples/lmcache/disagg_prefill_lmcache_v1/disagg_proxy_server.py \
  --host 0.0.0.0 \
  --prefiller-host 0.0.0.0 \
  --decoder-host 0.0.0.0
```
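To smoke-test the proxy end-to-end, I send a request like the one below (a minimal sketch; the port 8000 and the `/v1/completions` route are assumptions about `disagg_proxy_server.py`'s defaults, so adjust them if your proxy is configured differently):

```bash
# Smoke test through the proxy. Port 8000 is an assumption; if the proxy was
# started with an explicit --port, use that value here instead.
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llm",
        "prompt": "San Francisco is a",
        "max_tokens": 50
      }'
```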
I benchmarked it with 500 requests, 3000-token input, 500-token output, and a concurrency of 20.
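That load pattern roughly corresponds to a benchmark invocation like the following (a sketch assuming vLLM's `benchmarks/benchmark_serving.py` from the 0.8.x source tree and the proxy on port 8000; flag names may vary between versions, so please check against your checkout):

```bash
# Sketch of the workload above: 500 random prompts, ~3000 input tokens,
# ~500 output tokens, 20 concurrent requests. The base URL (proxy port 8000)
# is an assumption. Note the servers expose the model as "llm"
# (--served-model-name); if your benchmark_serving.py supports
# --served-model-name, pass "llm" there so the request model name matches.
python3 benchmarks/benchmark_serving.py \
  --backend vllm \
  --base-url http://localhost:8000 \
  --model /xxbucket/prod/models/QWEN-25-72B-INSTRUCT/Qwen2.5-72B-Instruct \
  --dataset-name random \
  --random-input-len 3000 \
  --random-output-len 500 \
  --num-prompts 500 \
  --max-concurrency 20
```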
MY PROBLEM
Can I improve performance further with the LMCache setup above?
Or should I try Mooncake or Dynamo instead?
THANK YOU ALL, ANY HELP IS VERY MUCH APPRECIATED
Report of performance regression
No response
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
The output of `python collect_env.py`
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.