
llama : improve batched decoding performance #3479

Closed

@ggerganov

Description

Based on the following post, vLLM can achieve these speeds for parallel decoding on an A100 GPU:

https://docs.mystic.ai/docs/mistral-ai-7b-vllm-fast-inference-guide

| Batch size | Tokens/s |
| --- | --- |
| 1 | 46 |
| 10 | 400 |
| 60 | 1.8k |

(thanks to @wsxiaoys for bringing this to my attention)

Even though llama.cpp's single-batch inference is faster (~72 t/s), we currently don't seem to scale well with batch size. At batch size 60, for example, performance is roughly 5x slower than what is reported in the post above.

We should understand where the bottleneck is and try to optimize the performance.

```
# batch size 1
./parallel -m ~/f16.gguf -t 1 -ngl 100 -c 8192 -b 512 -s 1 -np 1 -ns 128 -n 100 -cb

# batch size 10
./parallel -m ~/f16.gguf -t 1 -ngl 100 -c 8192 -b 512 -s 1 -np 10 -ns 128 -n 100 -cb

# batch size 60
./parallel -m ~/f16.gguf -t 1 -ngl 100 -c 8192 -b 512 -s 1 -np 60 -ns 128 -n 100 -cb
```
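For context, here is a minimal sketch of what the parallel example does on each step: it packs one token from every active client into a single `llama_batch` and evaluates them with one `llama_decode` call. This assumes the `llama_batch` API from the batched decoding refactor; exact signatures (e.g. `llama_batch_init`) have changed between versions, and `next_token` / `n_past` are hypothetical per-client state, so treat this as a sketch rather than the exact `parallel` implementation:

```cpp
// Minimal sketch: pack one token from each of n_parallel clients into a
// single llama_batch so that one llama_decode call advances every sequence.
// API names are per llama.h of this period; exact signatures may differ
// between versions. next_token / n_past are hypothetical per-client state.
#include "llama.h"

#include <vector>

void decode_step(llama_context * ctx, int n_parallel,
                 std::vector<llama_token> & next_token, // last sampled token per client
                 std::vector<llama_pos>   & n_past) {   // current position per client
    // capacity for one token per client, no embeddings, 1 seq id per token
    llama_batch batch = llama_batch_init(n_parallel, 0, 1);

    batch.n_tokens = 0;
    for (int s = 0; s < n_parallel; ++s) {
        const int i = batch.n_tokens++;
        batch.token   [i]    = next_token[s];
        batch.pos     [i]    = n_past[s]++;  // each client advances independently
        batch.n_seq_id[i]    = 1;
        batch.seq_id  [i][0] = s;            // separate KV cache sequence per client
        batch.logits  [i]    = true;         // we sample from every client's logits
    }

    // all n_parallel tokens go through the model in one pass: the weights are
    // read from VRAM once per step instead of once per client
    if (llama_decode(ctx, batch) != 0) {
        // handle failure (e.g. not enough KV cache space)
    }

    // per-client logits would then be read with llama_get_logits_ith(ctx, i)
    llama_batch_free(batch);
}
```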

As discussed with @slaren, the discrepancy is likely due to the lack of Flash Attention and CUDA tensor core utilization in llama.cpp. Still, I wouldn't be surprised if there is some low-hanging fruit that would improve the performance, similar to #3412.
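For intuition, a back-of-envelope estimate (my assumed numbers, not from the post: ~14 GB of F16 weights, ~1.9 TB/s of A100 HBM bandwidth): at batch size 1, every generated token has to read all of the weights once, so memory bandwidth alone caps decoding speed, while batching amortizes that read across sequences until compute becomes the limit, which is exactly where tensor cores and Flash Attention would matter:

```cpp
// Back-of-envelope estimate, not a measurement. Assumed numbers:
// ~7e9 params * 2 bytes (F16) of weights, ~1.9 TB/s of A100 HBM bandwidth.
#include <cstdio>

int main() {
    const double weight_bytes = 7e9 * 2.0; // ~14 GB of F16 weights
    const double mem_bw       = 1.9e12;    // bytes/s, assumed HBM bandwidth

    // batch 1: every generated token reads all weights once
    std::printf("batch  1 ceiling: ~%.0f t/s\n", mem_bw / weight_bytes); // ~136

    // batch n: the same weight read is shared by n sequences, so the
    // bandwidth ceiling grows ~linearly until compute becomes the bottleneck
    const int ns[] = {8, 10, 16, 32, 60};
    for (int n : ns) {
        std::printf("batch %2d ceiling: ~%.0f t/s\n", n, n * mem_bw / weight_bytes);
    }
    return 0;
}
```

The measured numbers below sit far below these ceilings at the larger batch sizes, which is consistent with being compute-bound there rather than bandwidth-bound.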

At the very least, we should profile things and get a better understanding of where to focus in the future.
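A cheap first step on the llama.cpp side would be to time the decode calls directly and see how the cost grows with the number of batched tokens; a rough sketch (`timed_decode` is a hypothetical helper, and CPU-side wall-clock timing around `llama_decode` is only approximate for GPU work):

```cpp
// Rough sketch: wall-clock a decode of a prepared batch. This times the whole
// call from the CPU side, so for GPU backends the numbers are only indicative
// (newer API versions have explicit synchronization for precise timing).
#include "llama.h"

#include <chrono>
#include <cstdio>

double timed_decode(llama_context * ctx, llama_batch batch) {
    const auto t0 = std::chrono::high_resolution_clock::now();
    const int  rc = llama_decode(ctx, batch);
    const auto t1 = std::chrono::high_resolution_clock::now();

    const double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("n_tokens = %d, rc = %d, %.2f ms (%.1f t/s)\n",
                batch.n_tokens, rc, ms, 1e3 * batch.n_tokens / ms);
    return ms;
}
```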


Here are some results with llama.cpp on an A100 (commit 48edda3) using OpenLLaMA 7B F16:

To measure this, I've removed the system prompt from the parallel example to better match the vLLM test above.
We count both the prompt and the generated tokens.
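In other words, the reported AVG speed is simply (prompt tokens + generated tokens) divided by the total wall-clock time. A quick sanity check against the batch-1 timings below (numbers copied from the log):

```cpp
// Reproduce the reported AVG speed from the batch-1 run below.
#include <cstdio>

int main() {
    const int    n_prompt = 2011;     // Total prompt tokens
    const int    n_gen    = 2059;     // Total gen tokens
    const double total_s  = 37.58241; // llama_print_timings: total time

    // prompt and generated tokens are both counted against total wall-clock time
    std::printf("AVG: %.2f t/s\n", (n_prompt + n_gen) / total_s); // ~108.3
    return 0;
}
```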

| Batch size | Tokens/s |
| --- | --- |
| 1 | 108.29 |
| 8 | 247.30 |
| 10 | 296.58 |
| 16 | 368.59 |
| 32 | 422.33 |
| 60 | 489.99 |
| 64 | 481.83 |
```
# single batch
LLAMA_CUBLAS=1 make -j && CUDA_VISIBLE_DEVICES=5 ./parallel -m models/openllama-7b/ggml-model-f16.gguf -t 1 -ngl 100 -c 8192 -b 512 -s 1 -np 1 -ns 128 -n 100 -cb

Total prompt tokens:   2011, speed: 53.51 t/s
Total gen tokens:      2059, speed: 54.79 t/s
Total speed (AVG):           speed: 108.29 t/s
main: clearing the KV cache
Client   0, seq  126, started decoding ...
Client   0, seq  126, prompt   18 t, response   13 t, time  0.25 s, speed 126.04 t/s, cache miss 0  

Input:    If you could have any superpower, what would it be?
Response: If you could have any superpower, what would it be?

main: clearing the KV cache
Client   0, seq  127, started decoding ...
Client   0, seq  127, prompt   23 t, response   23 t, time  0.40 s, speed 113.95 t/s, cache miss 0  

Input:    Are you familiar with the Special Theory of Relativity and can you explain it to me?
Response: I have a question. Are you familiar with the Special Theory of Relativity and can you explain it to me?

main: clearing the KV cache


Total prompt tokens:   2011, speed: 53.51 t/s
Total gen tokens:      2059, speed: 54.79 t/s
Total speed (AVG):           speed: 108.29 t/s
Cache misses:             0



llama_print_timings:        load time =  3377.87 ms
llama_print_timings:      sample time =  1735.54 ms /  2187 runs   (    0.79 ms per token,  1260.13 tokens per second)
llama_print_timings: prompt eval time =  5227.17 ms /  2011 tokens (    2.60 ms per token,   384.72 tokens per second)
llama_print_timings:        eval time = 29932.81 ms /  2060 runs   (   14.53 ms per token,    68.82 tokens per second)
llama_print_timings:       total time = 37582.41 ms
```

```
# n_parallel = 8
LLAMA_CUBLAS=1 make -j && CUDA_VISIBLE_DEVICES=5 ./parallel -m models/openllama-7b/ggml-model-f16.gguf -t 1 -ngl 100 -c 8192 -b 512 -s 1 -np 8 -ns 128 -n 100 -cb

Total prompt tokens:   2011, speed: 124.95 t/s
Total gen tokens:      1969, speed: 122.34 t/s
Total speed (AVG):           speed: 247.30 t/s
Client   7, seq  119, prompt   12 t, response   38 t, time  2.34 s, speed 21.33 t/s, cache miss 0  

Input:    What is the meaning of life?
Response: Hello. This is the United States Army, and we need your help! You’ve been drafted to fight in a war against an army of zombies that have taken over the world.

Client   3, seq  117, prompt   15 t, response   46 t, time  2.82 s, speed 21.66 t/s, cache miss 0  

Input:    Tell me an interesting fact about llamas.
Response: I don't know of any interesting facts about llamas, so I searched for "interesting facts about llama" on the internet. (Search engine). I found a couple of websites and read some of them.

Client   6, seq  120, prompt   13 t, response   44 t, time  2.47 s, speed 23.06 t/s, cache miss 0  

Input:    How to get a job at Google?
Response: The job is to make sure that Google search works as intended by organizing and maintaining the database. They are also responsible for making sure that everything is running smoothly, updating the website and keeping it up-to-date.

main: clearing the KV cache


Total prompt tokens:   2011, speed: 124.95 t/s
Total gen tokens:      1969, speed: 122.34 t/s
Total speed (AVG):           speed: 247.30 t/s
Cache misses:             0



llama_print_timings:        load time =  3436.27 ms
llama_print_timings:      sample time =  1684.62 ms /  2097 runs   (    0.80 ms per token,  1244.79 tokens per second)
llama_print_timings: prompt eval time = 13690.16 ms /  3975 tokens (    3.44 ms per token,   290.35 tokens per second)
llama_print_timings:        eval time =    94.53 ms /     6 runs   (   15.75 ms per token,    63.47 tokens per second)
llama_print_timings:       total time = 16093.98 ms
```

```
# n_parallel = 10
LLAMA_CUBLAS=1 make -j && CUDA_VISIBLE_DEVICES=5 ./parallel -m models/openllama-7b/ggml-model-f16.gguf -t 1 -ngl 100 -c 8192 -b 512 -s 1 -np 10 -ns 128 -n 100 -cb

Total prompt tokens:   2011, speed: 153.91 t/s
Total gen tokens:      1864, speed: 142.66 t/s
Total speed (AVG):           speed: 296.58 t/s
Client   7, seq  127, prompt   23 t, response   19 t, time  1.06 s, speed 39.77 t/s, cache miss 0  

Input:    Are you familiar with the Special Theory of Relativity and can you explain it to me?
Response: We can try! If we go back in time, everything will be the same, right?

Client   5, seq  112, prompt   13 t, response   59 t, time  3.26 s, speed 22.08 t/s, cache miss 0  

Input:    How to get a job at Google?
Response: “I’ve been with Google for seven years. I started as a summer intern and have worked in a variety of roles, including Search Ads Product Marketing Manager and now Senior Manager of Product Management, Search Ads Strategy. For me, the most memorable aspect of working at Google is the people.

main: clearing the KV cache


Total prompt tokens:   2011, speed: 153.91 t/s
Total gen tokens:      1864, speed: 142.66 t/s
Total speed (AVG):           speed: 296.58 t/s
Cache misses:             0



llama_print_timings:        load time =  3420.25 ms
llama_print_timings:      sample time =  1693.70 ms /  1992 runs   (    0.85 ms per token,  1176.12 tokens per second)
llama_print_timings: prompt eval time = 10678.86 ms /  3870 tokens (    2.76 ms per token,   362.40 tokens per second)
llama_print_timings:        eval time =    96.14 ms /     6 runs   (   16.02 ms per token,    62.41 tokens per second)
llama_print_timings:       total time = 13064.91 ms
```

```
# n_parallel = 16
LLAMA_CUBLAS=1 make -j && CUDA_VISIBLE_DEVICES=5 ./parallel -m models/openllama-7b/ggml-model-f16.gguf -t 1 -ngl 100 -c 8192 -b 512 -s 1 -np 16 -ns 128 -n 100 -cb

Total prompt tokens:   2011, speed: 181.94 t/s
Total gen tokens:      2063, speed: 186.65 t/s
Total speed (AVG):           speed: 368.59 t/s

Input:    What is the best way to learn a new language?
Response: The easiest way to learn any language is to live with someone that speaks that language. However, if that isn’t an option, the best way to learn any language is to use a program that uses a combination of verbal learning and verbal reinforcement to help you learn. When I first started studying Russian, I used programs like Rosetta Stone (which is great for beginners), but what worked best for me was a method

Client   9, seq   90, prompt   15 t, response   71 t, time  4.76 s, speed 18.08 t/s, cache miss 0  

Input:    What is the best way to cook a steak?
Response: The best way to cook a steak is to first preheat your oven to 425 degrees. Then, lightly season both sides of the steak with salt and pepper. Put it on a baking sheet lined with aluminum foil, drizzle with olive oil, and bake it in the oven for 10 minutes, or until medium-rare.

Client  13, seq  111, prompt   15 t, response   58 t, time  3.22 s, speed 22.69 t/s, cache miss 0  

Input:    I want to learn how to play the piano.
Response: I think you are a good piano player and I can teach you all about the piano. You will learn how to play all the songs that you like on the piano in no time. I can teach you how to improve your piano playing so that you can become an even better piano player.

main: clearing the KV cache


Total prompt tokens:   2011, speed: 181.94 t/s
Total gen tokens:      2063, speed: 186.65 t/s
Total speed (AVG):           speed: 368.59 t/s
Cache misses:             0



llama_print_timings:        load time =  3391.46 ms
llama_print_timings:      sample time =  1843.20 ms /  2191 runs   (    0.84 ms per token,  1188.69 tokens per second)
llama_print_timings: prompt eval time =  8358.01 ms /  4063 tokens (    2.06 ms per token,   486.12 tokens per second)
llama_print_timings:        eval time =   200.03 ms /    12 runs   (   16.67 ms per token,    59.99 tokens per second)
llama_print_timings:       total time = 11052.24 ms
```

```
# n_parallel = 32
LLAMA_CUBLAS=1 make -j && CUDA_VISIBLE_DEVICES=5 ./parallel -m models/openllama-7b/ggml-model-f16.gguf -t 1 -ngl 100 -c 8192 -b 512 -s 1 -np 32 -ns 128 -n 100 -cb

Total prompt tokens:   2011, speed: 186.50 t/s
Total gen tokens:      2543, speed: 235.83 t/s
Total speed (AVG):           speed: 422.33 t/s
Input:    How to get a job at Google?
Response: Job Description. As an assistant, you will support the people who work at Google and our partners. This includes supporting some of the most senior leaders as they run their teams. You will have a wide variety of responsibilities, including scheduling meetings, booking travel and supporting senior leadership in planning events.

Client  19, seq   87, prompt   13 t, response   87 t, time  7.09 s, speed 14.11 t/s, cache miss 0  

Input:    How to get a job at Google?
Response: Google is a search engine for the Internet and one of the most visited sites on the Internet. However, it has not been easy to work at Google since its creation, as it has taken more than ten years to find it. At the beginning, Larry Page and Sergey Brin were looking for employees who were as intelligent as possible. They did not really understand how to work well or where to search for good workers. They simply thought

Client  25, seq  127, prompt   23 t, response   75 t, time  4.29 s, speed 22.83 t/s, cache miss 0  

Input:    Are you familiar with the Special Theory of Relativity and can you explain it to me?
Response: Yes. The Special Theory of Relativity (SR) is a theory that, in essence, says that the speed of light is constant for all observers. For example, if you have three observers at rest with respect to one another, who are moving towards each other and who have different speeds, the three will measure the same speed of light for any object that they view.

main: clearing the KV cache


Total prompt tokens:   2011, speed: 186.50 t/s
Total gen tokens:      2543, speed: 235.83 t/s
Total speed (AVG):           speed: 422.33 t/s
Cache misses:             0



llama_print_timings:        load time =  3420.38 ms
llama_print_timings:      sample time =  2267.36 ms /  2671 runs   (    0.85 ms per token,  1178.02 tokens per second)
llama_print_timings: prompt eval time =  7318.15 ms /  4535 tokens (    1.61 ms per token,   619.69 tokens per second)
llama_print_timings:        eval time =   412.22 ms /    20 runs   (   20.61 ms per token,    48.52 tokens per second)
llama_print_timings:       total time = 10782.65 ms
```

```
# n_parallel = 60
LLAMA_CUBLAS=1 make -j && CUDA_VISIBLE_DEVICES=5 ./parallel -m models/openllama-7b/ggml-model-f16.gguf -t 1 -ngl 100 -c 8192 -b 512 -s 1 -np 60 -ns 128 -n 100 -cb

Total prompt tokens:   2011, speed: 235.90 t/s
Total gen tokens:      2166, speed: 254.09 t/s
Total speed (AVG):           speed: 489.99 t/s
Client  33, seq   78, prompt   13 t, response   72 t, time  6.70 s, speed 12.69 t/s, cache miss 0  

Input:    How to get a job at Google?
Response: Assistant role at Google is one of the most important jobs in the organization. The job requires candidates who are passionate, enthusiastic and well-versed with the latest technology in the market. The candidates must be passionate and able to understand and solve problems on their own. They should also be able to collaborate with others, communicate effectively, and have a strong work ethic

Client  26, seq   77, prompt   23 t, response   77 t, time  7.00 s, speed 14.29 t/s, cache miss 0  

Input:    Are you familiar with the Special Theory of Relativity and can you explain it to me?
Response: “No, sir. It is not my specialty. I only know that the theory was first put forward by Einstein, it was quite an influential theory of his and it has been used in a lot of scientific experiments and measurements. There is a whole bunch of experiments that have been done to prove it, but I cannot explain them to you. You should speak to one of my colleagues.”

Client  29, seq  102, prompt   16 t, response   79 t, time  6.41 s, speed 14.83 t/s, cache miss 0  

Input:    What is the best way to learn a new language?
Response: Well I do know that you have to know the grammar, you have to know vocabulary, and you have to get a feel for the sounds and the way it is pronounced. You also have to know the culture of where the language is spoken. And you also have to have friends that are natives of the country to practice with, and that’s really the best way to do it.

main: clearing the KV cache


Total prompt tokens:   2011, speed: 235.90 t/s
Total gen tokens:      2166, speed: 254.09 t/s
Total speed (AVG):           speed: 489.99 t/s
Cache misses:             0



llama_print_timings:        load time =  3407.33 ms
llama_print_timings:      sample time =  1923.99 ms /  2294 runs   (    0.84 ms per token,  1192.31 tokens per second)
llama_print_timings: prompt eval time =  5760.76 ms /  4170 tokens (    1.38 ms per token,   723.86 tokens per second)
llama_print_timings:        eval time =   159.77 ms /     8 runs   (   19.97 ms per token,    50.07 tokens per second)
llama_print_timings:       total time =  8524.06 ms
```

```
# n_parallel = 64
LLAMA_CUBLAS=1 make -j && CUDA_VISIBLE_DEVICES=5 ./parallel -m models/openllama-7b/ggml-model-f16.gguf -t 1 -ngl 100 -c 8192 -b 512 -s 1 -np 64 -ns 128 -n 100 -cb

Total prompt tokens:   2011, speed: 228.04 t/s
Total gen tokens:      2238, speed: 253.78 t/s
Total speed (AVG):           speed: 481.83 t/s
Client  61, seq   61, prompt   23 t, response   77 t, time  8.09 s, speed 12.36 t/s, cache miss 0  

Input:    Are you familiar with the Special Theory of Relativity and can you explain it to me?
Response: Sure. The Special Theory of Relativity is very simply understood by the layman. It concerns the speed of light and how to measure distance. You can imagine a room with a large light bulb at one end, a meter stick on the floor and a tape measure, a ruler, etc. at the other end of the room. When we go to that far end of the room

Client  15, seq   82, prompt   23 t, response   74 t, time  7.03 s, speed 13.79 t/s, cache miss 0  

Input:    Are you familiar with the Special Theory of Relativity and can you explain it to me?
Response: Yes, you can ask me about the Special Theory of Relativity. This theory states that the speed of light in vacuum is constant and independent of the source or the observer in a coordinate system moving relative to the source. Einstein's relativity theory also states that gravity is not a force but that it can be described as the curvature of space-time.

Client  47, seq  127, prompt   23 t, response   77 t, time  5.48 s, speed 18.24 t/s, cache miss 0  

Input:    Are you familiar with the Special Theory of Relativity and can you explain it to me?
Response: I’m sure you have heard about the Special Theory of Relativity by now, although it is not very often brought up in the classroom. It is a theory developed by the famous physicist Albert Einstein that explains how space and time are interrelated. For example, if you travel fast enough across space, you would experience time as speeding up. On the other hand, in general rel

main: clearing the KV cache


Total prompt tokens:   2011, speed: 228.04 t/s
Total gen tokens:      2238, speed: 253.78 t/s
Total speed (AVG):           speed: 481.83 t/s
Cache misses:             0



llama_print_timings:        load time =  3401.75 ms
llama_print_timings:      sample time =  1976.50 ms /  2366 runs   (    0.84 ms per token,  1197.06 tokens per second)
llama_print_timings: prompt eval time =  5806.75 ms /  4234 tokens (    1.37 ms per token,   729.15 tokens per second)
llama_print_timings:        eval time =   335.70 ms /    16 runs   (   20.98 ms per token,    47.66 tokens per second)
llama_print_timings:       total time =  8817.67 ms
```


Labels: Nvidia GPU (Issues specific to Nvidia GPUs), performance (Speed related topics)
