InFlightBatching seems not working #442

Open

Description

@larme

System Info

  • CPU: amd64
  • OS: Debian 12
  • GPU: NVIDIA RTX 4000 Ada
  • GPU driver: 535.161
  • TensorRT-LLM version: 0.8
  • tensorrtllm_backend version: 0.8

Who can help?

@kaiyux

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I followed exactly the steps in https://developer.nvidia.com/blog/turbocharging-meta-llama-3-performance-with-nvidia-tensorrt-llm-and-nvidia-triton-inference-server/. The only change was setting kv_cache_free_gpu_mem_fraction=0.95.
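
Concretely, that parameter lives in the tensorrt_llm model's config.pbtxt. The relevant block should look roughly like this (a sketch based on the tensorrtllm_backend config template, not a copy of my exact file):

```
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: {
    string_value: "0.95"
  }
}
```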

Expected behavior

I then run two copies of https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/inflight_batcher_llm/client/end_to_end_grpc_client.py at the same time. The first script finishes within 30 seconds, and I expect the second one to finish around the same time (about 30 seconds).
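
Roughly how the two clients were launched, as a sketch (the client path, flags, and prompt here are assumptions for illustration, not copied from my exact run):

```python
import subprocess
import time

# Path assumed relative to a tensorrtllm_backend checkout; adjust as needed.
CLIENT = "tensorrtllm_backend/inflight_batcher_llm/client/end_to_end_grpc_client.py"

start = time.time()
# Launch two copies of the gRPC client at (effectively) the same time.
procs = [
    subprocess.Popen(
        ["python3", CLIENT, "-u", "localhost:8001",
         "-p", "Tell me a long story", "-o", "512"]  # flags assumed
    )
    for _ in range(2)
]
# With in-flight batching, both clients should finish at roughly the same time.
for i, p in enumerate(procs):
    p.wait()
    print(f"client {i} finished at t = {time.time() - start:.1f} s")
```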

Actual behavior

The second finishes only after about 60 seconds. It therefore seems that in-flight batching is not working and that each request blocks any requests that arrive after it.
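
One thing that may be worth checking when triaging (an assumption on my part, not something I have confirmed is the cause here): in-flight batching is only active when gpt_model_type in the tensorrt_llm model's config.pbtxt is set to an in-flight mode; with V1, requests are scheduled as static batches, so a new request can wait for the previous one to finish. The template block looks roughly like:

```
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_fused_batching"
  }
}
```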

Additional notes

Very similar to #189, but that user reported their issue was fixed as of 0.6.1.
