Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
While estimating KV cache usage at 128k context for the major models, both with the standard formula and by running llama.cpp empirically, I noticed that Gemma 3 has extremely high KV cache usage.
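For reference, here is the back-of-envelope formula I mean, as a minimal Python sketch. The config values are illustrative (roughly Gemma 3 27B: 62 layers, 16 KV heads, head_dim 128; double-check against the model's config.json):

```python
# KV cache bytes = 2 (K and V) * n_layers * n_kv_heads * head_dim * n_ctx * bytes/elem
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

# Illustrative Gemma 3 27B-style config; F16 cache (2 bytes per element).
full = kv_cache_bytes(62, 16, 128, 131072)
print(f"Full-context F16 KV cache at 128k: {full / 2**30:.1f} GiB")  # ~62 GiB
```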
When I read the Gemma 3 technical report, Figure 6 shows that it uses a 5:1 interleaved local/global sliding window attention pattern, which should reduce KV cache usage to roughly one sixth at long context lengths.
https://arxiv.org/html/2503.19786v1
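To illustrate the expected saving, reusing `kv_cache_bytes` from the sketch above (the exact layer split is an assumption; I am approximating the 5:1 pattern as ~1 in 6 layers global with a 1024-token window):

```python
n_layers, window, n_ctx = 62, 1024, 131072
n_global = n_layers // 6             # ~10 global layers
n_local = n_layers - n_global        # ~52 sliding-window layers
swa = (kv_cache_bytes(n_global, 16, 128, n_ctx)     # global layers keep 128k
       + kv_cache_bytes(n_local, 16, 128, window))  # local layers keep 1024
print(f"With 5:1 SWA: {swa / 2**30:.1f} GiB")  # ~10.4 GiB, close to 1/6
```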
I checked the llama.cpp code: while there is a kq_mask for sliding window attention to speed up inference, there seems to be no code that reduces the KV cache allocation accordingly.
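Purely as a conceptual sketch (not llama.cpp's actual data structures), the idea would be per-layer cache sizes: sliding-window layers allocate only `window` slots and overwrite them cyclically, while global layers keep the full context:

```python
class LayerKVCache:
    def __init__(self, n_ctx: int, window: int | None = None):
        # window=None -> global attention layer, keep all n_ctx positions
        self.size = window if window is not None else n_ctx
        self.keys = [None] * self.size
        self.values = [None] * self.size

    def store(self, pos: int, k, v):
        # Windowed layers overwrite slots cyclically: position p lands in
        # slot p % window, so only the last `window` tokens are retained.
        slot = pos % self.size
        self.keys[slot] = k
        self.values[slot] = v
```

This works because a sliding-window layer never attends beyond the last `window` positions, so older K/V entries in those layers can be safely discarded.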
Motivation
I think it would be great if llama.cpp could support this feature so that the Gemma models become practical in long-context situations.
Possible Implementation
No response