
Suspiciously low performance in batched inference compared to single token #3771

Closed
@Microflame

Description


Prerequisites

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Current Behavior

llama_decode takes about 4x longer to complete for 2 tokens than for 1 token. Specifically, when I feed a single token to llama_decode it takes ~12 ms to decode on average, while for 2 or more tokens llama_decode takes ~50 ms to complete. I would naturally expect at most a 2x increase (i.e. no more than ~24 ms) when processing twice as many tokens, yet processing 2 tokens takes roughly 4x longer than processing 1.
Naively, one could assume that the llama.cpp CUDA code could be tweaked so that llama_decode for 2 tokens completes in at most twice the time it takes to decode 1 token. This would bring the following benefits:

  • Up to a 2x reduction in prompt eval time for single-sequence inference
  • Up to a 2x reduction in next-token prediction time for multi-sequence inference (see the sketch after this list)
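
To make the second point concrete, here is a rough sketch of what I mean by multi-sequence inference: one llama_decode call that advances several independent sequences by one token each. This is only an illustration built on the llama_batch helpers from common.h; the function name and the seq_last_tokens / seq_n_past bookkeeping are made up for the example, and the batch is assumed to come from llama_batch_init with room for at least as many tokens as there are sequences.

#include <vector>

#include <llama.h>
#include <common.h>

// Advance several independent sequences by one token each with a single
// llama_decode call: sequence s contributes seq_last_tokens[s] at position
// seq_n_past[s], tagged with seq_id s.
static bool decode_step_multi_seq(llama_context* ctx, llama_batch& batch,
                                  const std::vector<llama_token>& seq_last_tokens,
                                  const std::vector<llama_pos>& seq_n_past) {
    llama_batch_clear(batch);
    for (size_t s = 0; s < seq_last_tokens.size(); ++s) {
        // logits = true so that every sequence can sample its next token
        llama_batch_add(batch, seq_last_tokens[s], seq_n_past[s], {(llama_seq_id) s}, true);
    }
    // ideally this call would cost about the same as a single-token decode
    return llama_decode(ctx, batch) == 0;
}

If a 2-token llama_decode cost roughly the same as a 1-token one, serving two such sequences in parallel would come almost for free.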

My question

So I was wondering if these are sane considerations and, if so, whether one of the CUDA experts could pull off such an optimization?

Some additional notes

Here are the results of my measurements:

n_tokens    llama_decode time, ms
       1    12
       2    50
       4    51
       8    51
      64    56
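
For reference, here is the same data divided by the number of tokens (my own arithmetic on the table above):

n_tokens    llama_decode time per token, ms
       1    12.0
       2    25.0
       4    ~12.8
       8    ~6.4
      64    ~0.9

So the per-token cost only drops below the single-token cost somewhere between 4 and 8 tokens per batch, even though the total llama_decode time is nearly flat from 2 to 64 tokens.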

Environment and Context

I am running an RTX 4070 under WSL2.
The model is LLaMA 7B quantized with Q4_0.

Steps to Reproduce

The code I used to collect stats:

#include <iostream>
#include <iomanip>
#include <vector>

#include <llama.h>
#include <common.h>


void exit_if_false(bool cond, const char* msg) {
    if (!cond) {
        std::cerr << msg << std::endl;
        exit(1);
    }
}

const int BATCH_SIZE = 2;
const bool GPU = true;

int main(int argc, char* argv[]) {
    std::cout << "Testing on " << (GPU ? "GPU" : "CPU") << '\n';
    llama_model_params model_params = llama_model_default_params();
    {
        model_params.n_gpu_layers = GPU ? 1000 : 0;
    }

    llama_context_params context_params = llama_context_default_params();
    {
        context_params.n_ctx = 1024;
        context_params.n_batch = BATCH_SIZE;
        context_params.n_threads = GPU ? 1 : 10;
    }

    exit_if_false(argc > 1, "Usage: <program> <path to model>");
    llama_model* model = llama_load_model_from_file(argv[1], model_params);
    exit_if_false(model, "Can not load model");

    llama_context* ctx = llama_new_context_with_model(model, context_params);
    exit_if_false(ctx, "Can not create context");

    std::string prompt = "In another moment down went Alice after it, never once considering how in the world she was to get out again.";
    std::vector<llama_token> tokens = llama_tokenize(ctx, prompt, true, false);
    std::cout << "Processing " << tokens.size() << " tokens\n";

    llama_batch batch = llama_batch_init(BATCH_SIZE, 0, 1);
    double total_dt_ms = 0;
    int num_calls = 0;
    for (size_t start = 0; start < tokens.size(); start += BATCH_SIZE) {
        size_t end = std::min(start + BATCH_SIZE, tokens.size());
        
        llama_batch_clear(batch);
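        // every token is assigned to sequence 0 and logits are not requested
        // (last argument = false), since only the decode time matters here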
        for (size_t i = start; i < end; ++i) {
            llama_batch_add(batch, tokens[i], i, {0}, false);
        }

        double tstart = ggml_time_us();
        llama_decode(ctx, batch);
        double tend = ggml_time_us();
        double dt_ms = (tend - tstart) / 1000;
        std::cout << "llama_decode: " << std::setw(7) << std::fixed << std::setprecision(3) << dt_ms
                  << " ms. for " << std::setw(3) << batch.n_tokens << " token(s)\n";
        total_dt_ms += dt_ms;
        num_calls += 1;
    }
    llama_batch_free(batch);

    std::cout << "Average:\n"
        << (total_dt_ms / num_calls) << " ms. per call\n"
        << (total_dt_ms / tokens.size()) << " ms. per token\n";
    llama_free(ctx);
    llama_free_model(model);
    return 0;
}
