Description
Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
I ran into a problem running the nvidia_Llama-3_3-Nemotron-Super-49B-v1-GGUF model in LM Studio on a dual RTX 3090 setup. LM Studio splits the model evenly across GPUs (the default llama-cli behavior), but because Nemotron's first layers are much larger than the later ones, this leads to very unequal VRAM usage. The result is an OOM error when I try to increase the context size, even though there is plenty of free VRAM on the second GPU. I get exactly the same behavior when using llama-cli with the default even split.
Motivation
This is needed for any model with an unbalanced structure, e.g. where the first layers are much bigger than the later ones. Without this feature, a downstream application cannot distribute the model weights evenly or use VRAM efficiently in a multi-GPU setup, since it has no information about per-layer sizes.
Possible Implementation
Please add an equivalent of the --tensor-split option, or change its behavior, so that the split is computed from per-layer VRAM usage rather than the number of layers, making models with asymmetric layer sizes convenient to use. A possible implementation for two GPUs: solve for k, the number of layers offloaded to GPU0, such that the total size of the first k layers approximately equals the total size of the last n-k layers (see the sketch below).
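A minimal C++ sketch of that balancing step, assuming a vector of per-layer byte counts is available (in llama.cpp these would presumably come from the GGUF metadata); the names `layer_sizes` and `balanced_split` are hypothetical and not part of the existing llama.cpp API:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Pick the number of layers k to place on GPU0 so that the byte totals of the
// first k layers and the remaining n - k layers are as close as possible.
static size_t balanced_split(const std::vector<int64_t> & layer_sizes) {
    int64_t total = 0;
    for (int64_t s : layer_sizes) total += s;

    int64_t prefix   = 0;      // bytes assigned to GPU0 so far
    int64_t best_gap = total;  // smallest |GPU0 - GPU1| seen so far
    size_t  best_k   = 0;

    for (size_t k = 0; k <= layer_sizes.size(); ++k) {
        // |prefix - (total - prefix)| = |2 * prefix - total|
        int64_t gap = 2 * prefix - total;
        if (gap < 0) gap = -gap;
        if (gap < best_gap) {
            best_gap = gap;
            best_k   = k;
        }
        if (k < layer_sizes.size()) prefix += layer_sizes[k];
    }
    return best_k;
}

int main() {
    // Toy example: the first layers are much larger than the later ones,
    // mimicking the asymmetric structure described above (sizes in arbitrary units).
    std::vector<int64_t> layer_sizes = { 900, 900, 800, 300, 300, 300, 300, 300, 300, 300 };

    size_t k = balanced_split(layer_sizes);
    printf("Offload the first %zu layers to GPU0 and the remaining %zu to GPU1\n",
           k, layer_sizes.size() - k);
    return 0;
}
```

For this example the split is 3 / 7 layers rather than the even 5 / 5, which roughly equalizes the VRAM used on each GPU. The same idea generalizes to more than two GPUs by choosing split points that bring each device's share close to total / n_gpus.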