### Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
### Feature Description
I would like to adapt the server (or create an alternate server) so that it is better suited to being reconfigured at runtime. My primary goal is to be able to switch models on the fly.
### Motivation
For local inference, I find that no single model is best for all tasks, so I switch between models frequently using TabbyAPI. I would like this functionality to be available directly in llama.cpp so that I can make use of GGUF files and the llama.cpp ecosystem.
### Possible Implementation
Specifically, I want to be able to do the following (a rough client-side sketch of these endpoints follows the list):
- Start the server without a model loaded and have it remain functional in that state
- Have multiple models loaded concurrently
- Offer a /models endpoint to:
  - List available models (GET:/models)
  - Get details about a specific model (GET:/models/{model_id})
  - List available draft models (GET:/models/draft_models)
  - Get details about a specific draft model (GET:/models/draft_models/{model_id})
  - Change model routing settings (POST:/models)
  - List loaded models and default model routing (GET:/models/status)
  - Load models (POST:/models/load)
  - Unload models (POST:/models/unload)
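
To make the proposal more concrete, here is a minimal client-side sketch of how these endpoints might be used, assuming the server listens on localhost:8080 and that the load/unload requests take a JSON body with a `model` path, an `alias`, and a `default` routing key. All of these field names, paths, and response shapes are placeholders for discussion, not an existing API.

```python
# Hypothetical client-side sketch of the proposed endpoints.
# The base URL, field names ("model", "alias", "default"), and the model
# paths below are assumptions for illustration only.
import requests

BASE = "http://localhost:8080"

# Server starts with no model loaded; listing available models should still work.
print(requests.get(f"{BASE}/models").json())

# Load two models concurrently, each under its own alias (assumed fields).
requests.post(f"{BASE}/models/load",
              json={"model": "models/llama-3.1-8b-q4_k_m.gguf", "alias": "general"})
requests.post(f"{BASE}/models/load",
              json={"model": "models/qwen2.5-coder-7b-q4_k_m.gguf", "alias": "code"})

# Inspect what is loaded and how requests are currently routed.
print(requests.get(f"{BASE}/models/status").json())

# Change the default routing so unqualified requests go to "code".
requests.post(f"{BASE}/models", json={"default": "code"})

# Unload a model that is no longer needed, freeing its VRAM.
requests.post(f"{BASE}/models/unload", json={"alias": "general"})
```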
This may also require similar endpoints for loading and unloading LoRA adapters, embedding models, and other resources. For example, I may want to monitor GPU status for available VRAM and use that as a check before loading a model; a rough sketch of such a check follows.
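
As an illustration of the VRAM check, here is a minimal sketch using NVML (via the `pynvml` package, NVIDIA GPUs only) to verify free memory before issuing a load request. The 12 GiB threshold, the model path, and the load endpoint are illustrative assumptions.

```python
# Sketch of a pre-load VRAM check using NVML.
# The threshold, model path, and load request are placeholders.
import pynvml
import requests

REQUIRED_BYTES = 12 * 1024**3  # assume the model needs ~12 GiB of VRAM

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    if mem.free >= REQUIRED_BYTES:
        # Enough headroom: ask the server to load the model (hypothetical endpoint).
        requests.post("http://localhost:8080/models/load",
                      json={"model": "models/llama-3.1-8b-q4_k_m.gguf"})
    else:
        print(f"Not enough free VRAM: {mem.free / 1024**3:.1f} GiB available")
finally:
    pynvml.nvmlShutdown()
```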
Before I get started, I wanted to solicit some feedback from the community on endpoint names and required features. Does anyone have thoughts on this proposal, the endpoint names, potential negative impacts, or anything else?