Add support for Deepseek-R1 flash attention #11557
base: master
Conversation
This is not yet working, probably because the padding and slicing are not quite correct. I am investigating, but if you have a suggestion other than padding I can try that.
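For reference, a minimal sketch of the slicing step being described, assuming the flash attention output carries the padded head size in dim 0; the tensor names here are illustrative, not the PR's actual variables. One easy detail to get wrong is that the view's byte strides must come from the padded output tensor rather than from the sliced shape:

```c
// Hypothetical slice of a zero-padded flash attention result back down to the
// real V head size. kqv_padded is assumed to have shape
// [n_embd_head_k, n_head, n_tokens] after V was padded up to the K head size.
struct ggml_tensor * kqv = ggml_view_3d(ctx, kqv_padded,
        n_embd_head_v,          // keep only the first n_embd_head_v values per head
        n_head, n_tokens,
        kqv_padded->nb[1],      // strides must describe the padded layout,
        kqv_padded->nb[2],      // not a freshly computed contiguous one
        0);                     // offset 0: the zero padding sits at the end of each head
kqv = ggml_cont(ctx, kqv);      // make contiguous before the output projection
```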
This would really make a massive difference to running R1 locally. I don't have the ability to work on this, but is there anything that the community can do to help push this forward?
DeepSeek support for flash attention was just merged into ik_llama.cpp - ikawrakow/ik_llama.cpp#241
I took a stab at integrating ikawrakow's work with this PR (also updating to current master): https://github.com/fredlas/llama.cpp/tree/ik_r1_fa
It needed some additional changes to get it to compile and to not fail the dimension sanity-check asserts. Unfortunately it's generating gibberish, so I guess I'm plugging something in wrong. Likely culprits are ... I'm not going to be able to take it any further, but maybe someone who knows the transformer guts could skim it and see something obviously (to them) wrong.
@fredlas, I have added a comment in the code with a suggestion about pointer offsets.
P.S. You might also want to build my llama.cpp fork and test CPU-only inference with a model like DeepSeek-V2-Lite (V2, V3, and R1 have a similar attention mechanism); -fa -ctk q8_0 -ctv q8_0 seems to work.
UPD: The padding is definitely causing trouble somewhere, since with CUDA it turns the output into complete gibberish.
This PR adds support for flash attention with DeepSeek V3 / R1 models.
Since DeepSeek V3 has n_embd_head_v != n_embd_head_k, padding is needed for the flash attention operation. This change pads and slices tensors around the flash attention operation, in lieu of adding support for a different V head dimension to the flash attention kernel itself.
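As an illustration of the approach described above, here is a minimal sketch of the pad-and-slice pattern built from existing ggml ops. It assumes the usual llama.cpp flash attention tensor layout and DeepSeek's MLA head sizes (128 "nope" + 64 RoPE dims for Q/K, 128 for V); the variable names are placeholders, not code taken from this PR.

```c
// v is laid out head-size-first for the flash attention path:
//   v: [n_embd_head_v, n_kv, n_head_kv], with n_embd_head_v (128) < n_embd_head_k (192)
const int64_t n_pad = n_embd_head_k - n_embd_head_v;    // 64 for DeepSeek V3 / R1

// zero-pad the V head dimension so the K and V head sizes match
struct ggml_tensor * v_pad = ggml_pad(ctx, v, n_pad, 0, 0, 0);

// run flash attention with equal head sizes
struct ggml_tensor * kqv_pad = ggml_flash_attn_ext(ctx, q, k, v_pad, kq_mask,
        kq_scale, /*max_bias*/ 0.0f, /*logit_softcap*/ 0.0f);
// kqv_pad: [n_embd_head_k, n_head, n_tokens]; the last n_pad values of every head
// are zeros, since they are attention-weighted sums over the zero padding

// slice the padding back off and flatten for the output projection
struct ggml_tensor * kqv = ggml_view_3d(ctx, kqv_pad,
        n_embd_head_v, n_head, n_tokens,
        kqv_pad->nb[1], kqv_pad->nb[2], 0);
struct ggml_tensor * cur = ggml_cont_2d(ctx, kqv, n_embd_head_v*n_head, n_tokens);
```

Doing the fix at graph-construction level like this keeps every flash attention backend untouched, at the cost of pushing extra zero columns through the kernel.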