llama : add llama_kv_cache_compress #5719


Draft · ggerganov wants to merge 1 commit into master

Conversation

@ggerganov (Member) commented on Feb 25, 2024

This is an experiment to see whether we can compress the KV cache data. It does not work at the moment, so this is mostly a demo and a setup for further experiments.

The idea is to apply self-extend and then "merge" the cells that end up with the same position, effectively reducing the memory usage of the KV cache by the self-extend factor N. For example, with N = 2 the grouped positions become 0, 0, 1, 1, 2, 2, …, so each pair of cells sharing a position is merged into one.

The merging of the cells can be done in different ways, for example (a sketch of one strategy follows the list):

  • simple sum of the embeddings
  • average of the embeddings
  • pick different heads from the different cells
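
A minimal sketch of the second strategy, averaging the embeddings of two cells, is shown below. It is illustrative only: plain float buffers stand in for the per-layer ggml tensors of the real KV cache, and the function name and layout are hypothetical.

```cpp
// Illustrative sketch only, not the actual llama.cpp KV cache API:
// merge cell `src` into cell `dst` by averaging their embeddings.
// `k` and `v` each hold n_cells rows of n_embd floats (one row per cell);
// the real cache stores per-layer ggml tensors instead of flat arrays.
#include <cstddef>

static void kv_cell_merge_avg(float * k, float * v, size_t n_embd, size_t dst, size_t src) {
    float       * kd = k + dst*n_embd;
    float       * vd = v + dst*n_embd;
    const float * ks = k + src*n_embd;
    const float * vs = v + src*n_embd;
    for (size_t i = 0; i < n_embd; ++i) {
        kd[i] = 0.5f*(kd[i] + ks[i]); // "average of the embeddings" strategy
        vd[i] = 0.5f*(vd[i] + vs[i]);
    }
    // cell `src` can now be marked free and its slot reused
}
```

A plain sum would simply drop the `0.5f` factors, and the head-picking variant would instead copy whole head-sized sub-ranges from one cell or the other.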

None of these strategies succeed in the basic passkey test, which is not very surprising. But you never know.

make -j && ./passkey ./models/llama-7b-v2/ggml-model-f16.gguf 250 2 90
main: n_len = 6083, n_ctx = 8192, n_kv_req = 8224, n_grp = 2, n_batch = 512, n_junk = 250, i_pos = 90

prefix tokens: 32
prompt tokens: 6067
main: processed: [     0,    512)
(tmp log) KV compress pairs: 256
(tmp log) KV compress time: 1135.537 ms
main: processed: [   512,   1024)
(tmp log) KV compress pairs: 256
(tmp log) KV compress time: 1015.598 ms
main: processed: [  1024,   1536)
(tmp log) KV compress pairs: 256
(tmp log) KV compress time: 1153.629 ms
main: processed: [  1536,   2048)
(tmp log) KV compress pairs: 256
(tmp log) KV compress time: 1270.727 ms
main: processed: [  2048,   2560)
(tmp log) KV compress pairs: 256
(tmp log) KV compress time: 1406.514 ms
main: processed: [  2560,   3072)
(tmp log) KV compress pairs: 256
(tmp log) KV compress time: 1525.955 ms
main: processed: [  3072,   3584)
(tmp log) KV compress pairs: 256
(tmp log) KV compress time: 1667.036 ms
main: processed: [  3584,   4096)
(tmp log) KV compress pairs: 256
(tmp log) KV compress time: 1810.568 ms
main: processed: [  4096,   4608)
(tmp log) KV compress pairs: 256
(tmp log) KV compress time: 2000.851 ms
main: processed: [  4608,   5120)
(tmp log) KV compress pairs: 256
(tmp log) KV compress time: 2044.352 ms
main: processed: [  5120,   5632)
(tmp log) KV compress pairs: 256
(tmp log) KV compress time: 2205.304 ms
main: processed: [  5632,   6067)

main: passkey = 27717, inserted at position 90 / 250 (token pos: ~2184)

 What is the pass key? The pass key is the one that is used to unlock the computer. The pass key is the

main: decoded 16 tokens in 0.47 s, speed: 33.93 t/s

llama_print_timings:        load time =    1312.07 ms
llama_print_timings:      sample time =       0.38 ms /    17 runs   (    0.02 ms per token, 44502.62 tokens per second)
llama_print_timings: prompt eval time =    8880.16 ms /  6067 tokens (    1.46 ms per token,   683.21 tokens per second)
llama_print_timings:        eval time =     469.59 ms /    16 runs   (   29.35 ms per token,    34.07 tokens per second)
llama_print_timings:       total time =   27188.08 ms /  6083 tokens

@ggerganov added the `demo` label (Demonstrate some concept or idea, not intended to be merged) on Feb 25, 2024
@ngxson (Collaborator) commented on Feb 25, 2024

Do you think it would work if we only combined a set of pre-defined pairs of tokens?

For example, in the phrase I wrote above, we could combine pairs such as `that [a-z]+`, `a [a-z]+`, `pair of`, etc.
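
To make that concrete, here is a hypothetical sketch (the pair list and function are invented for illustration) that scans decoded token strings for hard-coded adjacent pairs and returns candidate cell positions to merge:

```cpp
// Hypothetical sketch, not part of llama.cpp: scan decoded token strings
// for hand-picked adjacent pairs whose KV cells could be merged.
#include <set>
#include <string>
#include <utility>
#include <vector>

static std::vector<size_t> find_mergeable_pairs(const std::vector<std::string> & toks) {
    static const std::set<std::pair<std::string, std::string>> pairs = {
        {"pair", "of"}, {"that", "is"}, // hand-picked examples
    };
    std::vector<size_t> pos;
    for (size_t i = 0; i + 1 < toks.size(); ++i) {
        if (pairs.count({toks[i], toks[i + 1]})) {
            pos.push_back(i); // candidate: merge cells i and i+1
        }
    }
    return pos;
}
```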


Also, I suspect that the passkey test does not work because the tokens that correspond to numbers are very sensitive (i.e. the cosine distances between their embeddings are roughly the same), so even modifying them a bit can cause big changes. What about trying a text passkey, for example Chewy-Tiger-Amazon?

@ggerganov (Member, Author) commented

> Do you think it would work if we only combined a set of pre-defined pairs of tokens?

This approach seems too hand-crafted. We want to look for something more generally applicable.

Btw, here is a new semi-relevant paper: https://arxiv.org/pdf/2403.09636.pdf

Unfortunately, the approach requires extra fine-tuning.

@ngxson (Collaborator) commented on Mar 16, 2024

> Btw, here is a new semi-relevant paper: https://arxiv.org/pdf/2403.09636.pdf

Thanks for directing me to this paper. As I understand it, they also selectively merge KV cells:

> Based on α_t, a decision is made whether KV representations k_t and v_t are appended to the cache or accumulated with its last element (Figure 1).
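
In code, that decision rule might look roughly like the sketch below. This is just my reading of the paper, not its implementation: the α predictor is assumed to live elsewhere, and the running-mean accumulation with a per-cell weight is an assumption about how "accumulate" is realized.

```cpp
// Sketch of an append-or-accumulate rule (assumed, per the paper's Figure 1):
// alpha near 1 -> fold k_t/v_t into the last cell; alpha near 0 -> new cell.
#include <cstddef>
#include <vector>

struct KVCell { std::vector<float> k, v; float weight = 1.0f; };

static void kv_append_or_accumulate(std::vector<KVCell> & cache,
        const std::vector<float> & kt, const std::vector<float> & vt, float alpha) {
    if (alpha < 0.5f || cache.empty()) {
        cache.push_back({kt, vt, 1.0f}); // append a new cell
        return;
    }
    KVCell & last = cache.back(); // accumulate into the last cell (running mean)
    const float w = last.weight;
    for (size_t i = 0; i < kt.size(); ++i) {
        last.k[i] = (w*last.k[i] + kt[i])/(w + 1.0f);
        last.v[i] = (w*last.v[i] + vt[i])/(w + 1.0f);
    }
    last.weight += 1.0f;
}
```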

> This approach seems too hand-crafted. We want to look for something more generally applicable.

My idea is that maybe we can first try the hand-crafted version to see whether it really works. Then we can find a way to automate the process, maybe via n-gram analysis, or by training a very small model to decide append/accumulate, as the paper suggests. In any case, I think it would be interesting to have something "on top" of the model that is easy to produce, much like how we can produce an imatrix and use it during quantization.
