Motivation.
Production LoRA serving
This RFC lays out the current limitations in online LoRA serving, potential solutions, and a proposal for implementation.
Context
We would like to offer SaaS products a way to serve a single, multi-replica deployment of an LLM where multiple tenants can each load or unload their own LoRA adapters for that LLM as needed, without downtime or redeployment.
However, the only "non-development" way to serve LoRA adapters for online inference with vLLM today is to tell vLLM about them ahead of time with the --lora-modules CLI argument. This presents a problem for products that want to adopt vLLM for multi-tenant LoRA serving, as the only way to load a new adapter is to redeploy the entire service.
There is a "development mode" method to dynamically load LoRA adapters: setting VLLM_ALLOW_RUNTIME_LORA_UPDATING=True enables the /v1/load_lora_adapter and /v1/unload_lora_adapter endpoints, which can be used to load or unload LoRA adapters at runtime. However, this is currently inappropriate for production use, because it neither:
- Ensures the adapter is loaded across all replicas of the deployment
- Guarantees that the adapter will be available on a new replica, or after a replica restart
Solving both of these problems is necessary to offer multi-tenant LoRA serving in production settings.
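For illustration, the existing development-mode flow looks roughly like this from a client's perspective (assuming the server was started with VLLM_ALLOW_RUNTIME_LORA_UPDATING=True; the URL and adapter path are placeholders):

```python
import requests

BASE_URL = "http://localhost:8000"  # placeholder: one specific replica

# Load an adapter at runtime -- this only affects the replica that receives the request.
resp = requests.post(
    f"{BASE_URL}/v1/load_lora_adapter",
    json={"lora_name": "customer-a-adapter", "lora_path": "/mnt/adapters/customer-a"},
)
resp.raise_for_status()

# Unload it again -- other replicas, and any replica started later, never knew about it.
resp = requests.post(
    f"{BASE_URL}/v1/unload_lora_adapter",
    json={"lora_name": "customer-a-adapter"},
)
resp.raise_for_status()
```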
The rest of this RFC makes the same assumptions as the /v1/load_lora_adapter endpoint, i.e. that the LoRA adapters in question are either:
- To be downloaded from HF Hub, or
- Available on disk to the vLLM process
The problem described here is tracking the metadata of which adapters should be loaded at any point in time across a deployment. Storing and loading the adapter artifacts themselves is a separate problem; other updates could be made to vLLM to address that, such as:
- Accepting generic URLs in /v1/load_lora_adapter payloads
- Accepting a tar archive upload in /v1/load_lora_adapter, etc.
Proposed Change.
General Solution Ideas
Option 1: Handle externally with smart routing
One option is to ignore the problem entirely at the vLLM level, and have an external routing component ensure that requests are only routed to replicas which have the adapter loaded. For example, kserve/modelmesh-serving provides a general purpose solution to this problem.
It would be possible to implement the internal APIs required for modelmesh in vLLM so that kserve could handle loading and routing for LoRA adapters without any extra state management in vLLM. There are probably other third-party components that could be used in the same way, or we could write our own routing component.
Pros:
- No extra state management required in vLLM
- Third-party model management systems already offer compliance-ready solutions, handling issues like data security, backup, and disaster recovery
- Addressing this at the routing layer would also let us manage which adapters, and how many, are loaded per replica. For large numbers of adapters, this may become necessary to avoid cache-thrashing issues in each replica.
Cons:
- Doesn't offer a vLLM-native solution to the problem
- Increases deployment complexity
- Introduces deployment dependency on a third party component
- Would collide with other routing strategies like:
  - session-aware routing
  - prefill/decode disaggregation
Option 2: Use external state management to track adapters
Another option is to have vLLM use an external data store like etcd directly to track loaded adapters. This would be a lighter-weight option than relying on a third party model management solution, but would still introduce extra deployment dependencies and overhead.
Pros:
- Distributed data stores like etcd are well-understood and production-tested
- Backup and restore operations are relatively easy for service operators
- Logic can be wrapped in atomic transactions to ensure consistency across a deployment
- "Watch" apis can be used to push updates immediately to all replicas in a deployment
- No routing changes needed, wouldn't collide with any other routing work
Cons:
- Requires writing state management logic into vLLM
- We have to maintain a database client, and any schema changes would need to be carefully considered
- Introduces extra deployment dependency
- Increases deployment complexity
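To make Option 2 concrete, here is a rough sketch of what the watch-based flow could look like, using the python-etcd3 client as an example; the key layout, endpoint, and helper names are assumptions for illustration, not an actual design:

```python
import json

import etcd3
from etcd3.events import DeleteEvent, PutEvent

# Hypothetical key layout: one key per adapter under a per-deployment prefix.
PREFIX = "/vllm/deployments/my-llm/adapters/"
client = etcd3.client(host="etcd.internal", port=2379)  # placeholder endpoint


def register_adapter(lora_name: str, lora_path: str) -> None:
    # Publish a successful local load so every replica learns about it.
    client.put(PREFIX + lora_name, json.dumps({"lora_path": lora_path}))


def watch_adapters(on_load, on_unload) -> None:
    # Each replica watches the prefix and reacts to puts/deletes immediately.
    events, cancel = client.watch_prefix(PREFIX)  # call cancel() on shutdown
    for event in events:
        name = event.key.decode()[len(PREFIX):]
        if isinstance(event, PutEvent):
            on_load(name, json.loads(event.value))
        elif isinstance(event, DeleteEvent):
            on_unload(name)
```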
Option 3: Use simple disk-based storage to track loaded adapters
Often, replicas of a deployment will mount a shared filesystem to access common artifacts. For example, in Kubernetes deployments an S3 bucket can be mounted as network-attached storage with a persistent volume claim using S3FS. This shared filesystem can be used to write simple files that track metadata for the adapters to be loaded for a given deployment.
Pros:
- Simplest option, no additional code or deployment dependencies
- Works anywhere you can mount a filesystem
- Easy to implement and test locally
Cons:
- Simple disk storage leaves encryption, backup and restore as exercises for service operators
- Requires file write permissions, which may be a security risk
- NAS systems are generally non-atomic: concurrent writes may appear to succeed but the last one will win
- Consistency can be an issue depending on the filesystem used; writes may not be visible to other replicas for some time
Proposal
These options aren't necessarily mutually exclusive, so we propose implementing Option 3 as a short-term solution.
The simplest implementation would be to store the payloads from /v1/load_lora_adapter as JSON files in a configurable directory. The name of each file should be the adapter name, so that whether an adapter is loaded can be determined without reading file contents. At load time, the file should be written only after the adapter successfully loads. When the /v1/models endpoint is called, these files should be used to determine the full set of available adapters so that responses from all replicas are consistent (within the constraints of the underlying filesystem used).
The entire implementation can be contained within the API layer of vLLM (vllm.entrypoints.openai.*); no changes are required to the LoRA caching mechanism in the engine.
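A minimal sketch of that API-layer behavior, assuming a configurable directory (the directory and helper names here are placeholders, not actual vLLM code):

```python
import json
from pathlib import Path


def record_adapter(adapter_metadata_dir: Path, lora_name: str, lora_path: str) -> None:
    # Written only after the adapter loads successfully on this replica.
    adapter_metadata_dir.mkdir(parents=True, exist_ok=True)
    payload = {"lora_name": lora_name, "lora_path": lora_path}
    (adapter_metadata_dir / f"{lora_name}.json").write_text(json.dumps(payload))


def list_recorded_adapters(adapter_metadata_dir: Path) -> list[str]:
    # Used when /v1/models is called: the file name alone identifies the adapter,
    # so no file contents need to be read.
    return sorted(p.stem for p in adapter_metadata_dir.glob("*.json"))
```

Because the file is written only after a successful load, the existence of a metadata file implies the adapter was loadable at least once, which the boot-time question below relies on.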
Assumptions:
- The consuming application is responsible for providing access control and guarding against misuse, e.g. not allowing one user to register 10,000 adapters at once
- We can handle some basic security checks at load time, like denying path traversal (a sketch follows this list)
- We won't be providing per-adapter authorization
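As an example of the kind of load-time check mentioned above, a hypothetical helper could refuse adapter names that would escape the metadata directory:

```python
from pathlib import Path


def safe_metadata_path(adapter_metadata_dir: Path, lora_name: str) -> Path:
    """Reject names like '../other' that would resolve outside the directory."""
    candidate = (adapter_metadata_dir / f"{lora_name}.json").resolve()
    if candidate.parent != adapter_metadata_dir.resolve():
        raise ValueError(f"invalid adapter name: {lora_name!r}")
    return candidate
```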
Open questions:
- Where should the adapter files be stored? One option is the existing configuration directory, e.g. "${VLLM_CONFIG_ROOT}/adapters". This seems appropriate for storing metadata files about loaded adapters, but may be inappropriate for later expansion to caching actual adapter artifacts, if we end up going that route. We could introduce a new VLLM_LORA_ADAPTER_CACHE environment variable for clarity about where this data is stored.
- How should we handle deleting adapters, i.e. /v1/unload_lora_adapter? Deleting the metadata files seems appropriate, but propagating the deletion across replicas seems tricky. Filesystem-watch APIs are both OS and filesystem dependent and have limitations on some network-backed storage (e.g. you can't use inotify with S3FS). We could check file existence on every inference API call for each adapter, but that would add overhead to the critical path. It may be sufficient to not attempt to unload an adapter from all replicas, instead allowing it to eventually be unloaded when:
  - It is evicted from the LRU cache
  - The /v1/models endpoint is accessed and we check all the loaded adapters (a reconciliation sketch follows these questions)
  - The process ends
  This would mean an adapter could remain available for inference on some replicas after unload; however, since we assume the consuming application provides access control, this may not be a problem.
- Should adapters be loaded at boot time? Currently we validate all LoRA adapters given statically by the --lora-modules CLI arg by loading them at boot time. With dynamically loaded adapters, there may be an unbounded number of adapters to potentially load. The LRU caching mechanism in the engine ensures only the ones in use stay loaded, but there could be far more adapters "loaded" than can fit in the cache. We can assume that if an adapter's metadata file exists, the adapter has been successfully loaded before, so we don't need to re-load it to verify at boot time and can instead load it lazily at inference time. But if the number of adapters is small (less than the cache size), we might still want to eagerly load at boot, trading a slightly longer boot time for lower latency on first inference.
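Regarding the unload question, here is a hedged sketch of the reconciliation idea (run when /v1/models is accessed, or periodically); the callbacks, set of locally registered adapters, and directory name are illustrative assumptions:

```python
from pathlib import Path


def reconcile_adapters(adapter_metadata_dir: Path,
                       locally_registered: set[str],
                       load_adapter,
                       unload_adapter) -> None:
    """Converge this replica's registered adapters with the shared metadata files."""
    on_disk = {p.stem for p in adapter_metadata_dir.glob("*.json")}

    # Metadata file is gone: the adapter was unloaded elsewhere, so drop it here too
    # (otherwise it would linger until LRU eviction or process exit).
    for name in locally_registered - on_disk:
        unload_adapter(name)

    # New metadata file: register the adapter so it shows up in /v1/models; the
    # engine only pulls the weights into its LRU cache when a request uses it.
    for name in on_disk - locally_registered:
        load_adapter(name)
```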
Feedback Period.
Through 2/1/25
CC List.
@njhill @wangchen615 @tjohnson31415 @maxdebayser
Any Other Things.
No response