
Feature Request: Interleaved sliding window attention support for gemma 2 and 3 #12637

Open
@ymcki

Description

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

While estimating how much KV cache the major models use at 128k context, both with a formula and by running llama.cpp empirically, I noticed that gemma 3 has extremely high KV cache usage:

https://www.reddit.com/r/LocalLLaMA/comments/1jl33br/qwq32b_has_the_highest_kv_cachemodel_size_ratio/

When I read the gemma 3 technical report, its Figure 6 shows a 5:1 interleaved sliding window attention scheme (five local sliding-window layers for every global layer) that can reduce KV cache usage to roughly one sixth.

https://arxiv.org/html/2503.19786v1

I checked the llama.cpp code: while there is a kq_mask for sliding window attention that speeds up inference, there seems to be no code that actually shrinks the KV cache allocation for the sliding-window layers.
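To make the potential saving concrete, here is a rough back-of-envelope sketch comparing a uniform full-context KV cache with a per-layer allocation that respects the 5:1 interleaving. The hyperparameters (layer count, KV heads, head dim, window size, pattern) are assumptions loosely modelled on gemma-3-27b, not values read from llama.cpp:

```python
# Back-of-envelope KV cache estimate: uniform full-context allocation vs.
# per-layer allocation with a 5:1 interleaved sliding window (iSWA) scheme.
# Hyperparameters below are ASSUMPTIONS loosely based on gemma-3-27b;
# check the actual GGUF metadata for the real values.

def kv_bytes(ctx_per_layer: list[int], n_kv_heads: int, head_dim: int,
             bytes_per_elem: int = 2) -> int:
    """Total KV cache size: one K and one V tensor per layer."""
    return sum(2 * n_kv_heads * head_dim * c * bytes_per_elem
               for c in ctx_per_layer)

n_layers   = 62          # assumed
n_kv_heads = 16          # assumed
head_dim   = 128         # assumed
n_ctx      = 128 * 1024  # 128k context
window     = 1024        # sliding window size (assumed, from the report)
pattern    = 6           # 5 local layers followed by 1 global layer

# Current behaviour: every layer allocates KV for the full context.
full = kv_bytes([n_ctx] * n_layers, n_kv_heads, head_dim)

# iSWA-aware allocation: only every 6th (global) layer needs the full
# context; the local layers only need to keep `window` tokens of K/V.
per_layer = [n_ctx if (i + 1) % pattern == 0 else min(n_ctx, window)
             for i in range(n_layers)]
iswa = kv_bytes(per_layer, n_kv_heads, head_dim)

print(f"full-context KV cache: {full / 2**30:.1f} GiB")
print(f"iSWA-aware KV cache  : {iswa / 2**30:.1f} GiB")
print(f"reduction factor     : {full / iswa:.1f}x")
```

With these assumed numbers the uniform allocation comes out around 62 GiB while the iSWA-aware one is around 10 GiB, i.e. roughly the one-sixth figure from the technical report.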

Motivation

I think it would be great if llama.cpp supported this feature, so that the gemma models become practical for long-context use.

Possible Implementation

No response
