Description
Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
I ran into a problem running the nvidia_Llama-3_3-Nemotron-Super-49B-v1-GGUF model in LM Studio on a dual RTX 3090 setup. LM Studio splits the model evenly across GPUs (the default llama-cli behavior), but because Nemotron's first layers are much larger than the later ones, this leads to very unequal VRAM usage. The result is an OOM error when I try to increase the context size, even though there is plenty of free VRAM on the second GPU. I get exactly the same behavior when using llama-cli with the default even split.
Motivation
This is needed for any model with an unbalanced structure, e.g. where the first layers are much bigger than the later ones. Without this feature, a downstream application cannot distribute the model weights evenly or use VRAM efficiently in a multi-GPU setup, since it has no information about per-layer sizes.
Possible Implementation
Please add an equivalent of the --tensor-split option, or change its behavior, so that the split is computed from per-layer VRAM usage rather than the number of layers, making models with asymmetric layer sizes convenient to use. A possible implementation for two GPUs: solve for k, the number of layers offloaded to GPU0, such that the total size of the first k layers approximately equals the total size of the last n-k layers (see the sketch below).
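A minimal C++ sketch of that balancing step, assuming a vector of per-layer byte counts is available (in llama.cpp these would presumably come from the GGUF metadata); the names `layer_sizes` and `balanced_split` are hypothetical and not part of the existing llama.cpp API:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Pick the number of layers k to place on GPU0 so that the byte totals of the
// first k layers and the remaining n - k layers are as close as possible.
static size_t balanced_split(const std::vector<int64_t> & layer_sizes) {
    int64_t total = 0;
    for (int64_t s : layer_sizes) total += s;

    int64_t prefix   = 0;      // bytes assigned to GPU0 so far
    int64_t best_gap = total;  // smallest |GPU0 - GPU1| seen so far
    size_t  best_k   = 0;

    for (size_t k = 0; k <= layer_sizes.size(); ++k) {
        // |prefix - (total - prefix)| = |2 * prefix - total|
        int64_t gap = 2 * prefix - total;
        if (gap < 0) gap = -gap;
        if (gap < best_gap) {
            best_gap = gap;
            best_k   = k;
        }
        if (k < layer_sizes.size()) prefix += layer_sizes[k];
    }
    return best_k;
}

int main() {
    // Toy example: the first layers are much larger than the later ones,
    // mimicking the asymmetric structure described above (sizes in arbitrary units).
    std::vector<int64_t> layer_sizes = { 900, 900, 800, 300, 300, 300, 300, 300, 300, 300 };

    size_t k = balanced_split(layer_sizes);
    printf("Offload the first %zu layers to GPU0 and the remaining %zu to GPU1\n",
           k, layer_sizes.size() - k);
    return 0;
}
```

For this example the split is 3 / 7 layers rather than the even 5 / 5, which roughly equalizes the VRAM used on each GPU. The same idea generalizes to more than two GPUs by choosing split points that bring each device's share close to total / n_gpus.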