### Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
### Feature Description
I would like to adapt the server (or create an alternate server) so that it is better suited to being reconfigured at runtime. My primary goal is to be able to switch models on the fly.
### Motivation
For local inference, I find that no single model is best for all tasks, so I switch between models frequently using TabbyAPI. I would like this functionality to be available directly in llama.cpp so that I can make use of GGUF files and the llama.cpp ecosystem.
### Possible Implementation
Specifically, I want to be able to do the following (a rough client-side sketch of these endpoints follows the list):
- Start the server without a model loaded and have it remain functional in that state
- Have multiple models loaded concurrently
- Offer a /models endpoint to:
  - List available models (GET:/models)
  - Get details about a specific model (GET:/models/{model_id})
  - List available draft models (GET:/models/draft_models)
  - Get details about a specific draft model (GET:/models/draft_models/{model_id})
  - Change model routing settings (POST:/models)
  - List loaded models and default model routing (GET:/models/status)
  - Load models (POST:/models/load)
  - Unload models (POST:/models/unload)
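
To make the proposal more concrete, here is a minimal client-side sketch of how these endpoints might be used, assuming the server listens on localhost:8080 and that the load/unload requests take a JSON body with a `model` path, an `alias`, and a `default` routing key. All of these field names, paths, and response shapes are placeholders for discussion, not an existing API.

```python
# Hypothetical client-side sketch of the proposed endpoints.
# The base URL, field names ("model", "alias", "default"), and the model
# paths below are assumptions for illustration only.
import requests

BASE = "http://localhost:8080"

# Server starts with no model loaded; listing available models should still work.
print(requests.get(f"{BASE}/models").json())

# Load two models concurrently, each under its own alias (assumed fields).
requests.post(f"{BASE}/models/load",
              json={"model": "models/llama-3.1-8b-q4_k_m.gguf", "alias": "general"})
requests.post(f"{BASE}/models/load",
              json={"model": "models/qwen2.5-coder-7b-q4_k_m.gguf", "alias": "code"})

# Inspect what is loaded and how requests are currently routed.
print(requests.get(f"{BASE}/models/status").json())

# Change the default routing so unqualified requests go to "code".
requests.post(f"{BASE}/models", json={"default": "code"})

# Unload a model that is no longer needed, freeing its VRAM.
requests.post(f"{BASE}/models/unload", json={"alias": "general"})
```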
This may also require similar endpoints for loading and unloading LoRA adapters, embedding models, and other resources. For example, I may want to monitor GPU status for available VRAM and use that as a check before loading a model; a rough sketch of such a check follows.
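
As an illustration of the VRAM check, here is a minimal sketch using NVML (via the `pynvml` package, NVIDIA GPUs only) to verify free memory before issuing a load request. The 12 GiB threshold, the model path, and the load endpoint are illustrative assumptions.

```python
# Sketch of a pre-load VRAM check using NVML.
# The threshold, model path, and load request are placeholders.
import pynvml
import requests

REQUIRED_BYTES = 12 * 1024**3  # assume the model needs ~12 GiB of VRAM

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    if mem.free >= REQUIRED_BYTES:
        # Enough headroom: ask the server to load the model (hypothetical endpoint).
        requests.post("http://localhost:8080/models/load",
                      json={"model": "models/llama-3.1-8b-q4_k_m.gguf"})
    else:
        print(f"Not enough free VRAM: {mem.free / 1024**3:.1f} GiB available")
finally:
    pynvml.nvmlShutdown()
```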
Before I get started, I wanted to solicit some feedback from the community on endpoint names and required features. Does anyone have thoughts on this proposal, the endpoint names, potential negative impacts, or anything else?