Description
System Info
- CPU: Intel(R) Xeon(R) CPU @ 2.20GHz
- Architecture: x86_64
- GPU: NVIDIA A100-SXM4-40G
- OS: Ubuntu
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I followed the official Llama example: https://github.com/NVIDIA/TensorRT-LLM/tree/v0.8.0/examples/llama
I've been experiencing significant slowdowns when the return_context_logits flag is turned on. For context, I am using the llama example and have enabled the gather_context_logits flag during the TensorRT-LLM engine build.
In addition, I pass return_context_logits through the Triton client to retrieve the logits for the input (prompt) tokens. To accommodate this, I set request_output_len (output_len) to 1.
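For reference, this is roughly how I issue the request (a minimal sketch using tritonclient.grpc against the in-flight batcher tensorrt_llm model; the server URL and token IDs are placeholders, and the tensor names/dtypes follow my config.pbtxt, so they may need adjusting for other deployments):

```python
import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype

# Placeholder endpoint; assumes the tensorrtllm_backend model is named "tensorrt_llm".
client = grpcclient.InferenceServerClient(url="localhost:8001")

input_ids = np.array([[1, 15043, 3186]], dtype=np.int32)          # placeholder token IDs
input_lengths = np.array([[input_ids.shape[1]]], dtype=np.int32)
request_output_len = np.array([[1]], dtype=np.int32)              # only 1 generated token needed
return_context_logits = np.array([[True]], dtype=bool)            # ask for prompt logits

inputs = []
for name, data in [
    ("input_ids", input_ids),
    ("input_lengths", input_lengths),
    ("request_output_len", request_output_len),
    ("return_context_logits", return_context_logits),
]:
    tensor = grpcclient.InferInput(name, list(data.shape), np_to_triton_dtype(data.dtype))
    tensor.set_data_from_numpy(data)
    inputs.append(tensor)

result = client.infer(
    "tensorrt_llm",
    inputs,
    outputs=[grpcclient.InferRequestedOutput("context_logits")],
)
context_logits = result.as_numpy("context_logits")  # [batch, input_len, vocab_size]
```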
Expected behavior
Enabling return_context_logits should incur only a modest slowdown, ideally not deviating much from the throughput with the flag off. At the very least, performance should be on par with or better than the forward-pass speed of the equivalent HuggingFace implementation.
Actual behavior
With return_context_logits enabled and request_output_len set to 1, execution is almost 8x slower than with the flag off. Surprisingly, it is also slower than a plain forward pass of the comparable HuggingFace model.
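For the HuggingFace comparison, the baseline is just a single forward pass over the prompt to collect the same per-token logits. A rough sketch of that baseline (the checkpoint name is a placeholder for the actual Llama weights used):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; substitute the Llama weights actually used in the comparison.
model_name = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda").eval()

prompt = "..."  # ~2000-token input, as in the benchmark below
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    context_logits = model(**inputs).logits  # [batch, seq_len, vocab_size] from one forward pass
```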
Here's a comparative table of performance with and without the return_context_logits flag:

| Logit Status | max_gen_token | input_len | Execution Time (mm:ss) | Average Time per Example |
|---|---|---|---|---|
| On  | 1 | 2000 | 0:49 | 0.98 s |
| Off | 1 | 2000 | 0:06 | 0.12 s |
Additional notes
I executed trtllm-build with the following configuration:
```bash
trtllm-build --checkpoint_dir {model_dir}/tensorrt/{tp_size}-gpu \
    --remove_input_padding enable \
    --gpt_attention_plugin float16 \
    --context_fmha enable \
    --gemm_plugin float16 \
    --output_dir {model_dir}/tensorrt_llm/context_fmha \
    --paged_kv_cache disable \
    --enable_xqa disable \
    --multi_block_mode disable \
    --use_custom_all_reduce disable \
    --tp_size {tp_size} \
    --workers {tp_size} \
    --max_batch_size 1 \
    --max_input_len 8192 \
    --max_output_len 8192 \
    --max_num_tokens 8192 \
    --gather_context_logits
```
Any insights or assistance in addressing this unexpected slowdown would be greatly appreciated. If there are any further experiments or specific areas you would recommend investigating, please advise.