
Feature Request: Interleaved sliding window attention support for gemma 2 and 3 #12637

Open
@ymcki

Description

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

While estimating how much KV cache the major models use at 128k context, both with a formula and by running llama.cpp empirically, I noticed that gemma 3 has extremely high KV cache usage:

https://www.reddit.com/r/LocalLLaMA/comments/1jl33br/qwq32b_has_the_highest_kv_cachemodel_size_ratio/

When I read the gemma 3 technical report, its Figure 6 shows a 5:1 interleaved sliding window attention scheme (five local sliding-window layers for every global layer) that can reduce KV cache usage to roughly one sixth.

https://arxiv.org/html/2503.19786v1

I checked the llama.cpp code: while there is a kq_mask for sliding window attention that speeds up inference, there seems to be no code that actually shrinks the KV cache allocation for the sliding-window layers.
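To make the potential saving concrete, here is a rough back-of-envelope sketch comparing a uniform full-context KV cache with a per-layer allocation that respects the 5:1 interleaving. The hyperparameters (layer count, KV heads, head dim, window size, pattern) are assumptions loosely modelled on gemma-3-27b, not values read from llama.cpp:

```python
# Back-of-envelope KV cache estimate: uniform full-context allocation vs.
# per-layer allocation with a 5:1 interleaved sliding window (iSWA) scheme.
# Hyperparameters below are ASSUMPTIONS loosely based on gemma-3-27b;
# check the actual GGUF metadata for the real values.

def kv_bytes(ctx_per_layer: list[int], n_kv_heads: int, head_dim: int,
             bytes_per_elem: int = 2) -> int:
    """Total KV cache size: one K and one V tensor per layer."""
    return sum(2 * n_kv_heads * head_dim * c * bytes_per_elem
               for c in ctx_per_layer)

n_layers   = 62          # assumed
n_kv_heads = 16          # assumed
head_dim   = 128         # assumed
n_ctx      = 128 * 1024  # 128k context
window     = 1024        # sliding window size (assumed, from the report)
pattern    = 6           # 5 local layers followed by 1 global layer

# Current behaviour: every layer allocates KV for the full context.
full = kv_bytes([n_ctx] * n_layers, n_kv_heads, head_dim)

# iSWA-aware allocation: only every 6th (global) layer needs the full
# context; the local layers only need to keep `window` tokens of K/V.
per_layer = [n_ctx if (i + 1) % pattern == 0 else min(n_ctx, window)
             for i in range(n_layers)]
iswa = kv_bytes(per_layer, n_kv_heads, head_dim)

print(f"full-context KV cache: {full / 2**30:.1f} GiB")
print(f"iSWA-aware KV cache  : {iswa / 2**30:.1f} GiB")
print(f"reduction factor     : {full / iswa:.1f}x")
```

With these assumed numbers the uniform allocation comes out around 62 GiB while the iSWA-aware one is around 10 GiB, i.e. roughly the one-sixth figure from the technical report.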

Motivation

I think it would be great if llama.cpp supported this feature, so that the gemma models become practical for long-context use.

Possible Implementation

No response
