
Feature Request: Splitting layers according to VRAM usage on multi GPUs setups #12654

Open
@goodglitch

Description


Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

I ran into a problem running the nvidia_Llama-3_3-Nemotron-Super-49B-v1-GGUF model in LM Studio on a dual RTX 3090 setup. LM Studio splits the model evenly among the GPUs (the default llama-cli behavior), but in the case of Nemotron, whose first layers are much larger than the rest, this leads to very unequal VRAM usage. The result is an OOM when I try to increase the context size, even though there is plenty of free VRAM on the second GPU. I get exactly the same behavior when using llama-cli directly with the default even split.

Motivation

This is necessary for any model with an unbalanced structure, e.g. one whose first layers are much bigger than the later ones. Without this feature, a downstream application cannot load the model weights evenly and use VRAM efficiently on a multi-GPU setup, since it has no information about layer sizes.

Possible Implementation

Please add an equivalent of the --tensor-split option, or change its behavior, so that the split is made according to VRAM usage rather than the number of layers; this would make models with asymmetric layer sizes convenient to use. Possible implementation, in the case of two GPUs: solve for k, the number of layers offloaded to GPU0, such that the total size of the first k layers is approximately equal to the total size of the last n-k layers. A minimal sketch of that balancing step is shown below.
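
The sketch below is only an illustration of the proposed idea, not llama.cpp code: given per-layer sizes (which in practice would come from the GGUF tensor metadata), it picks the cut point k that minimizes the VRAM imbalance between the two GPUs. The layer sizes used in the example are hypothetical.

```python
def balanced_split(layer_sizes: list[int]) -> int:
    """Return k, the number of layers for GPU0, minimizing the size imbalance
    between the first k layers and the remaining n-k layers."""
    total = sum(layer_sizes)
    best_k, best_diff = 0, total
    prefix = 0
    for k, size in enumerate(layer_sizes, start=1):
        prefix += size                         # bytes on GPU0 if we cut after layer k
        diff = abs(prefix - (total - prefix))  # imbalance between GPU0 and GPU1
        if diff < best_diff:
            best_k, best_diff = k, diff
    return best_k

if __name__ == "__main__":
    # Hypothetical asymmetric model: first layers much larger than later ones (MiB).
    sizes = [4000, 4000, 3000, 1000, 1000, 1000, 1000, 1000]
    k = balanced_split(sizes)
    print(f"Offload first {k} layers to GPU0, remaining {len(sizes) - k} to GPU1")
```

As a workaround in the meantime, the resulting split can presumably be expressed manually through the existing --tensor-split proportions (e.g. passing the two layer counts as the per-GPU ratios), but that requires the user to know the per-layer sizes up front, which is exactly the information this request asks llama.cpp to use automatically.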
