InFlightBatching seems not working #442

Open

Description

@larme

System Info

  • CPU: amd64
  • OS: Debian 12
  • GPU: NVIDIA RTX 4000 Ada
  • GPU driver: 535.161
  • TensorRT-LLM version: 0.8
  • tensorrtllm_backend version: 0.8

Who can help?

@kaiyux

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I followed exactly the steps in https://developer.nvidia.com/blog/turbocharging-meta-llama-3-performance-with-nvidia-tensorrt-llm-and-nvidia-triton-inference-server/. The only change was setting kv_cache_free_gpu_mem_fraction=0.95.
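
Concretely, that parameter lives in the tensorrt_llm model's config.pbtxt. The relevant block should look roughly like this (a sketch based on the tensorrtllm_backend config template, not a copy of my exact file):

```
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: {
    string_value: "0.95"
  }
}
```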

Expected behavior

I then run two copies of https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/inflight_batcher_llm/client/end_to_end_grpc_client.py at the same time. The first script finishes within 30 seconds, and I expect the second one to finish around the same time (about 30 seconds).
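
Roughly how the two clients were launched, as a sketch (the client path, flags, and prompt here are assumptions for illustration, not copied from my exact run):

```python
import subprocess
import time

# Path assumed relative to a tensorrtllm_backend checkout; adjust as needed.
CLIENT = "tensorrtllm_backend/inflight_batcher_llm/client/end_to_end_grpc_client.py"

start = time.time()
# Launch two copies of the gRPC client at (effectively) the same time.
procs = [
    subprocess.Popen(
        ["python3", CLIENT, "-u", "localhost:8001",
         "-p", "Tell me a long story", "-o", "512"]  # flags assumed
    )
    for _ in range(2)
]
# With in-flight batching, both clients should finish at roughly the same time.
for i, p in enumerate(procs):
    p.wait()
    print(f"client {i} finished at t = {time.time() - start:.1f} s")
```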

Actual behavior

The second finishes only after about 60 seconds. It therefore seems that in-flight batching is not working and that each request blocks any requests that arrive after it.
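
One thing that may be worth checking when triaging (an assumption on my part, not something I have confirmed is the cause here): in-flight batching is only active when gpt_model_type in the tensorrt_llm model's config.pbtxt is set to an in-flight mode; with V1, requests are scheduled as static batches, so a new request can wait for the previous one to finish. The template block looks roughly like:

```
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_fused_batching"
  }
}
```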

Additional notes

Very similar to #189, but that user reported their issue was fixed as of 0.6.1.
