
Llama 2 70b quantizes in a way that's superior for GQA; Mistral 7b is missing that optimization #4111

Closed

Description

@kalomaze

This PR mentioned a while back that, since Llama 2 70b uses GQA, there is a specific k-quantization trick that allows it to be quantized with only a marginal increase in model size:

[screenshot: quoted note from the referenced PR]
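
To put a number on "marginal": under GQA the per-layer attn_v.weight shrinks from n_embd × n_embd to n_embd × (n_embd · n_head_kv / n_head). If I have the hyperparameters right, for Llama 2 70b (n_embd = 8192, 64 query heads, 8 KV heads, 80 layers) that is 8192 × 1024 × 80 ≈ 0.67B parameters, roughly 1% of the model, so promoting just those tensors by one k-quant type barely moves the file size.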

Mistral 7b, a very popular model released after that PR was made, also uses Grouped Query Attention.
Checking for GQA when the model is a 7b Mistral and applying the same treatment should theoretically provide similar gains, unless I am mistaken.

[screenshot]
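
Rather than special-casing models by name, the check could presumably be keyed off the hyperparameters: any model with fewer KV heads than query heads is using GQA, and the group size is n_head / n_head_kv. A minimal sketch of that idea (the enum and helper names below are illustrative placeholders, not the actual llama.cpp quantization path):

```cpp
// Rough sketch only: types and helpers here are illustrative, not the
// actual llama.cpp quantization code.
#include <cstdint>
#include <iostream>

enum class QuantType { Q3_K, Q4_K, Q5_K }; // stand-ins for the k-quant family

struct ModelHParams {
    uint32_t n_head;    // number of attention (query) heads
    uint32_t n_head_kv; // number of key/value heads; smaller than n_head under GQA
};

// A model uses grouped-query attention whenever it has fewer KV heads than
// query heads; n_head / n_head_kv is the group size (4 for Mistral 7b).
static bool uses_gqa(const ModelHParams & hp) {
    return hp.n_head_kv > 0 && hp.n_head_kv < hp.n_head;
}

// Under GQA the attn_v.weight tensors are already shrunk by the group factor,
// so promoting them one k-quant type costs little in total file size.
static QuantType pick_attn_v_type(const ModelHParams & hp, QuantType base) {
    if (base == QuantType::Q3_K && uses_gqa(hp)) {
        return QuantType::Q4_K;
    }
    return base;
}

int main() {
    const ModelHParams mistral7b{32, 8}; // 32 query heads, 8 KV heads -> GQA
    const ModelHParams llama7b{32, 32};  // classic Llama 7b: no GQA

    std::cout << "mistral 7b uses GQA: " << uses_gqa(mistral7b) << "\n"; // 1
    std::cout << "llama 7b uses GQA:   " << uses_gqa(llama7b)   << "\n"; // 0

    const QuantType t = pick_attn_v_type(mistral7b, QuantType::Q3_K);
    std::cout << "mistral attn_v gets Q4_K: " << (t == QuantType::Q4_K) << "\n"; // 1
}
```

By the same arithmetic as above, Mistral 7b's attn_v.weight tensors total 4096 × 1024 × 32 ≈ 0.13B parameters, under 2% of the model, so the size/quality trade-off should be very similar.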

I think quantization optimization is sorely overlooked in general; there is a lot of low-hanging fruit there, for sure.
