
Suspiciously low performance in batched inference compared to single token #3771

Closed
@Microflame

Description


Prerequisites

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Current Behavior

llama_decode takes about 4x longer to complete for 2 tokens than for 1 token. Specifically, when I feed a single token to llama_decode it takes ~12 ms to decode on average, while for 2 or more tokens llama_decode takes ~50 ms to complete. I would naturally expect at most a 2x increase (i.e. no more than ~24 ms) when processing twice as many tokens, yet processing 2 tokens takes roughly 4x longer than processing 1.
Naively, one could assume that the llama.cpp CUDA code could be tweaked so that llama_decode for 2 tokens completes in at most twice the time it takes to decode 1 token. This would bring the following benefits:

  • Up to a 2x reduction in prompt eval time for single-sequence inference
  • Up to a 2x reduction in next-token prediction time for multi-sequence inference (see the sketch after this list)
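
To make the second point concrete, here is a rough sketch of what I mean by multi-sequence inference: one llama_decode call that advances several independent sequences by one token each. This is only an illustration built on the llama_batch helpers from common.h; the function name and the seq_last_tokens / seq_n_past bookkeeping are made up for the example, and the batch is assumed to come from llama_batch_init with room for at least as many tokens as there are sequences.

#include <vector>

#include <llama.h>
#include <common.h>

// Advance several independent sequences by one token each with a single
// llama_decode call: sequence s contributes seq_last_tokens[s] at position
// seq_n_past[s], tagged with seq_id s.
static bool decode_step_multi_seq(llama_context* ctx, llama_batch& batch,
                                  const std::vector<llama_token>& seq_last_tokens,
                                  const std::vector<llama_pos>& seq_n_past) {
    llama_batch_clear(batch);
    for (size_t s = 0; s < seq_last_tokens.size(); ++s) {
        // logits = true so that every sequence can sample its next token
        llama_batch_add(batch, seq_last_tokens[s], seq_n_past[s], {(llama_seq_id) s}, true);
    }
    // ideally this call would cost about the same as a single-token decode
    return llama_decode(ctx, batch) == 0;
}

If a 2-token llama_decode cost roughly the same as a 1-token one, serving two such sequences in parallel would come almost for free.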

My question

So I was wondering if these are sane considerations and, if so, whether one of the CUDA experts could pull off such an optimization?

Some additional notes

Here are the results of my measurements:

n_tokens    llama_decode time, ms
       1    12
       2    50
       4    51
       8    51
      64    56
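
For reference, here is the same data divided by the number of tokens (my own arithmetic on the table above):

n_tokens    llama_decode time per token, ms
       1    12.0
       2    25.0
       4    ~12.8
       8    ~6.4
      64    ~0.9

So the per-token cost only drops below the single-token cost somewhere between 4 and 8 tokens per batch, even though the total llama_decode time is nearly flat from 2 to 64 tokens.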

Environment and Context

I am running an RTX 4070 under WSL2.
The model is LLaMA 7B quantized with Q4_0.

Steps to Reproduce

The code I used to collect stats:

#include <iostream>
#include <iomanip>
#include <vector>

#include <llama.h>
#include <common.h>


void exit_if_false(bool cond, const char* msg) {
    if (!cond) {
        std::cerr << msg << std::endl;
        exit(1);
    }
}

const int BATCH_SIZE = 2;
const bool GPU = true;

int main(int argc, char* argv[]) {
    std::cout << "Testing on " << (GPU ? "GPU" : "CPU") << '\n';
    llama_model_params model_params = llama_model_default_params();
    {
        model_params.n_gpu_layers = GPU ? 1000 : 0;
    }

    llama_context_params context_params = llama_context_default_params();
    {
        context_params.n_ctx = 1024;
        context_params.n_batch = BATCH_SIZE;
        context_params.n_threads = GPU ? 1 : 10;
    }

    exit_if_false(argc > 1, "Usage: <program> <path to model>");
    llama_model* model = llama_load_model_from_file(argv[1], model_params);
    exit_if_false(model, "Can not load model");

    llama_context* ctx = llama_new_context_with_model(model, context_params);
    exit_if_false(ctx, "Can not create context");

    std::string prompt = "In another moment down went Alice after it, never once considering how in the world she was to get out again.";
    std::vector<llama_token> tokens = llama_tokenize(ctx, prompt, true, false);
    std::cout << "Processing " << tokens.size() << " tokens\n";

    llama_batch batch = llama_batch_init(BATCH_SIZE, 0, 1);
    double total_dt_ms = 0;
    int num_calls = 0;
    for (size_t start = 0; start < tokens.size(); start += BATCH_SIZE) {
        size_t end = std::min(start + BATCH_SIZE, tokens.size());
        
        llama_batch_clear(batch);
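        // every token is assigned to sequence 0 and logits are not requested
        // (last argument = false), since only the decode time matters here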
        for (size_t i = start; i < end; ++i) {
            llama_batch_add(batch, tokens[i], i, {0}, false);
        }

        double tstart = ggml_time_us();
        llama_decode(ctx, batch);
        double tend = ggml_time_us();
        double dt_ms = (tend - tstart) / 1000;
        std::cout << "llama_decode: " << std::setw(7) << std::fixed << std::setprecision(3) << dt_ms
                  << " ms. for " << std::setw(3) << batch.n_tokens << " token(s)\n";
        total_dt_ms += dt_ms;
        num_calls += 1;
    }
    llama_batch_free(batch);

    std::cout << "Average:\n"
        << (total_dt_ms / num_calls) << " ms. per call\n"
        << (total_dt_ms / tokens.size()) << " ms. per token\n";
    llama_free(ctx);
    llama_free_model(model);
    return 0;
}
