
Llama 2 70b quantizes in a way that's superior for GQA; Mistral 7b is missing that optimization #4111

Closed

Description

@kalomaze

This PR mentioned a while back that, since Llama 2 70b uses GQA, there is a specific k-quantization trick that allows it to be quantized with only a marginal increase in model size:

[screenshot: quoted note from the referenced PR]
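
To put a number on "marginal": under GQA the per-layer attn_v.weight shrinks from n_embd × n_embd to n_embd × (n_embd · n_head_kv / n_head). If I have the hyperparameters right, for Llama 2 70b (n_embd = 8192, 64 query heads, 8 KV heads, 80 layers) that is 8192 × 1024 × 80 ≈ 0.67B parameters, roughly 1% of the model, so promoting just those tensors by one k-quant type barely moves the file size.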

Mistral 7b, a very popular model released after that PR was made, also uses Grouped Query Attention.
Checking for GQA when the model is a 7b Mistral and applying the same treatment should theoretically provide similar gains, unless I am mistaken.

[screenshot]
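
Rather than special-casing models by name, the check could presumably be keyed off the hyperparameters: any model with fewer KV heads than query heads is using GQA, and the group size is n_head / n_head_kv. A minimal sketch of that idea (the enum and helper names below are illustrative placeholders, not the actual llama.cpp quantization path):

```cpp
// Rough sketch only: types and helpers here are illustrative, not the
// actual llama.cpp quantization code.
#include <cstdint>
#include <iostream>

enum class QuantType { Q3_K, Q4_K, Q5_K }; // stand-ins for the k-quant family

struct ModelHParams {
    uint32_t n_head;    // number of attention (query) heads
    uint32_t n_head_kv; // number of key/value heads; smaller than n_head under GQA
};

// A model uses grouped-query attention whenever it has fewer KV heads than
// query heads; n_head / n_head_kv is the group size (4 for Mistral 7b).
static bool uses_gqa(const ModelHParams & hp) {
    return hp.n_head_kv > 0 && hp.n_head_kv < hp.n_head;
}

// Under GQA the attn_v.weight tensors are already shrunk by the group factor,
// so promoting them one k-quant type costs little in total file size.
static QuantType pick_attn_v_type(const ModelHParams & hp, QuantType base) {
    if (base == QuantType::Q3_K && uses_gqa(hp)) {
        return QuantType::Q4_K;
    }
    return base;
}

int main() {
    const ModelHParams mistral7b{32, 8}; // 32 query heads, 8 KV heads -> GQA
    const ModelHParams llama7b{32, 32};  // classic Llama 7b: no GQA

    std::cout << "mistral 7b uses GQA: " << uses_gqa(mistral7b) << "\n"; // 1
    std::cout << "llama 7b uses GQA:   " << uses_gqa(llama7b)   << "\n"; // 0

    const QuantType t = pick_attn_v_type(mistral7b, QuantType::Q3_K);
    std::cout << "mistral attn_v gets Q4_K: " << (t == QuantType::Q4_K) << "\n"; // 1
}
```

By the same arithmetic as above, Mistral 7b's attn_v.weight tensors total 4096 × 1024 × 32 ≈ 0.13B parameters, under 2% of the model, so the size/quality trade-off should be very similar.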

I think quantization optimization is sorely overlooked in general; there is a lot of low-hanging fruit there, for sure.
