llama : add llama_kv_cache_compress #5719


Draft · ggerganov wants to merge 1 commit into master

Conversation

@ggerganov (Member) commented on Feb 25, 2024

This is an experiment to see whether we can compress the KV cache data. It does not work at the moment, so this is mostly a demo and a setup for further experiments.

The idea is to apply self-extend and then "merge" the cells that end up with the same position, effectively reducing the memory usage of the KV cache by the self-extend factor N. For example, with N = 2 the grouped positions become 0, 0, 1, 1, 2, 2, …, so each pair of cells sharing a position is merged into one.

The merging of the cells can be done in different ways, for example (a sketch of one strategy follows the list):

  • simple sum of the embeddings
  • average of the embeddings
  • pick different heads from the different cells
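
A minimal sketch of the second strategy, averaging the embeddings of two cells, is shown below. It is illustrative only: plain float buffers stand in for the per-layer ggml tensors of the real KV cache, and the function name and layout are hypothetical.

```cpp
// Illustrative sketch only, not the actual llama.cpp KV cache API:
// merge cell `src` into cell `dst` by averaging their embeddings.
// `k` and `v` each hold n_cells rows of n_embd floats (one row per cell);
// the real cache stores per-layer ggml tensors instead of flat arrays.
#include <cstddef>

static void kv_cell_merge_avg(float * k, float * v, size_t n_embd, size_t dst, size_t src) {
    float       * kd = k + dst*n_embd;
    float       * vd = v + dst*n_embd;
    const float * ks = k + src*n_embd;
    const float * vs = v + src*n_embd;
    for (size_t i = 0; i < n_embd; ++i) {
        kd[i] = 0.5f*(kd[i] + ks[i]); // "average of the embeddings" strategy
        vd[i] = 0.5f*(vd[i] + vs[i]);
    }
    // cell `src` can now be marked free and its slot reused
}
```

A plain sum would simply drop the `0.5f` factors, and the head-picking variant would instead copy whole head-sized sub-ranges from one cell or the other.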

None of these strategies succeed in the basic passkey test, which is not very surprising. But you never know.

make -j && ./passkey ./models/llama-7b-v2/ggml-model-f16.gguf 250 2 90
main: n_len = 6083, n_ctx = 8192, n_kv_req = 8224, n_grp = 2, n_batch = 512, n_junk = 250, i_pos = 90

prefix tokens: 32
prompt tokens: 6067
main: processed: [     0,    512)
(tmp log) KV compress pairs: 256
(tmp log) KV compress time: 1135.537 ms
main: processed: [   512,   1024)
(tmp log) KV compress pairs: 256
(tmp log) KV compress time: 1015.598 ms
main: processed: [  1024,   1536)
(tmp log) KV compress pairs: 256
(tmp log) KV compress time: 1153.629 ms
main: processed: [  1536,   2048)
(tmp log) KV compress pairs: 256
(tmp log) KV compress time: 1270.727 ms
main: processed: [  2048,   2560)
(tmp log) KV compress pairs: 256
(tmp log) KV compress time: 1406.514 ms
main: processed: [  2560,   3072)
(tmp log) KV compress pairs: 256
(tmp log) KV compress time: 1525.955 ms
main: processed: [  3072,   3584)
(tmp log) KV compress pairs: 256
(tmp log) KV compress time: 1667.036 ms
main: processed: [  3584,   4096)
(tmp log) KV compress pairs: 256
(tmp log) KV compress time: 1810.568 ms
main: processed: [  4096,   4608)
(tmp log) KV compress pairs: 256
(tmp log) KV compress time: 2000.851 ms
main: processed: [  4608,   5120)
(tmp log) KV compress pairs: 256
(tmp log) KV compress time: 2044.352 ms
main: processed: [  5120,   5632)
(tmp log) KV compress pairs: 256
(tmp log) KV compress time: 2205.304 ms
main: processed: [  5632,   6067)

main: passkey = 27717, inserted at position 90 / 250 (token pos: ~2184)

 What is the pass key? The pass key is the one that is used to unlock the computer. The pass key is the

main: decoded 16 tokens in 0.47 s, speed: 33.93 t/s

llama_print_timings:        load time =    1312.07 ms
llama_print_timings:      sample time =       0.38 ms /    17 runs   (    0.02 ms per token, 44502.62 tokens per second)
llama_print_timings: prompt eval time =    8880.16 ms /  6067 tokens (    1.46 ms per token,   683.21 tokens per second)
llama_print_timings:        eval time =     469.59 ms /    16 runs   (   29.35 ms per token,    34.07 tokens per second)
llama_print_timings:       total time =   27188.08 ms /  6083 tokens

@ggerganov added the `demo` label (Demonstrate some concept or idea, not intended to be merged) on Feb 25, 2024
@ngxson (Collaborator) commented on Feb 25, 2024

Do you think it would work if we only combined a set of pre-defined pairs of tokens?

For example, in the phrase I wrote above, we could combine pairs such as `that [a-z]+`, `a [a-z]+`, `pair of`, etc.
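
To make that concrete, here is a hypothetical sketch (the pair list and function are invented for illustration) that scans decoded token strings for hard-coded adjacent pairs and returns candidate cell positions to merge:

```cpp
// Hypothetical sketch, not part of llama.cpp: scan decoded token strings
// for hand-picked adjacent pairs whose KV cells could be merged.
#include <set>
#include <string>
#include <utility>
#include <vector>

static std::vector<size_t> find_mergeable_pairs(const std::vector<std::string> & toks) {
    static const std::set<std::pair<std::string, std::string>> pairs = {
        {"pair", "of"}, {"that", "is"}, // hand-picked examples
    };
    std::vector<size_t> pos;
    for (size_t i = 0; i + 1 < toks.size(); ++i) {
        if (pairs.count({toks[i], toks[i + 1]})) {
            pos.push_back(i); // candidate: merge cells i and i+1
        }
    }
    return pos;
}
```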


Also, I suspect that the passkey test does not work because the tokens that correspond to numbers are very sensitive (i.e. the cosine distances between their embeddings are roughly the same), so even modifying them a bit can cause big changes. What about trying a text passkey, for example Chewy-Tiger-Amazon?

@ggerganov (Member, Author) commented

> Do you think it would work if we only combined a set of pre-defined pairs of tokens?

This approach seems too hand-crafted. We want to look for something more generally applicable.

Btw, here is a new semi-relevant paper: https://arxiv.org/pdf/2403.09636.pdf

Unfortunately, the approach requires extra fine-tuning.

@ngxson (Collaborator) commented on Mar 16, 2024

> Btw, here is a new semi-relevant paper: https://arxiv.org/pdf/2403.09636.pdf

Thanks for directing me to this paper. As I understand it, they also selectively merge KV cells:

> Based on α_t, a decision is made whether KV representations k_t and v_t are appended to the cache or accumulated with its last element (Figure 1).
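
In code, that decision rule might look roughly like the sketch below. This is just my reading of the paper, not its implementation: the α predictor is assumed to live elsewhere, and the running-mean accumulation with a per-cell weight is an assumption about how "accumulate" is realized.

```cpp
// Sketch of an append-or-accumulate rule (assumed, per the paper's Figure 1):
// alpha near 1 -> fold k_t/v_t into the last cell; alpha near 0 -> new cell.
#include <cstddef>
#include <vector>

struct KVCell { std::vector<float> k, v; float weight = 1.0f; };

static void kv_append_or_accumulate(std::vector<KVCell> & cache,
        const std::vector<float> & kt, const std::vector<float> & vt, float alpha) {
    if (alpha < 0.5f || cache.empty()) {
        cache.push_back({kt, vt, 1.0f}); // append a new cell
        return;
    }
    KVCell & last = cache.back(); // accumulate into the last cell (running mean)
    const float w = last.weight;
    for (size_t i = 0; i < kt.size(); ++i) {
        last.k[i] = (w*last.k[i] + kt[i])/(w + 1.0f);
        last.v[i] = (w*last.v[i] + vt[i])/(w + 1.0f);
    }
    last.weight += 1.0f;
}
```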

> This approach seems too hand-crafted. We want to look for something more generally applicable.

My idea is that maybe we can first try the hand-crafted version to see whether it really works. Then we can find a way to automate the process, maybe via n-gram analysis, or by training a very small model to decide append/accumulate, as the paper suggests. In any case, I think it would be interesting to have something "on top" of the model that is easy to produce, much like how we can produce an imatrix and use it during quantization.
