[RFC]: Distribute LoRA adapters across deployment #12174

@joerunde

Motivation.

Production LoRA serving

This RFC lays out the current limitations in online LoRA serving, potential solutions, and a proposal for implementation.

Context

What we would like to offer SaaS products is the ability to serve a single, multi-replica deployment of an LLM where multiple tenants can each load or unload their own LoRA adapters for that LLM as needed, without downtime or redeployment.

However, the only "non-development" way to serve LoRA adapters for online inference with vLLM today is to tell vLLM about them ahead of time with the --lora-modules CLI argument. This presents a problem for products that want to adopt vLLM for multi-tenant LoRA serving, as the only way to load a new adapter is to redeploy the entire service.

There is a "development mode" method to dynamically load LoRA adapters: setting VLLM_ALLOW_RUNTIME_LORA_UPDATING=True enables the /v1/load_lora_adapter and /v1/unload_lora_adapter endpoints, which can be used to load or unload new LoRA adapters at runtime. However, this is currently inappropriate for production use, because it does not:

  • Ensure the adapter is loaded across all replicas of the deployment
  • Guarantee that the adapter will be available on a new replica, or after a replica restart
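
For reference, this is roughly how those development-mode endpoints are driven today (a minimal sketch; it assumes the lora_name/lora_path request fields and omits error handling):

```python
import requests

BASE_URL = "http://localhost:8000"  # assumed address of a single vLLM replica

# Requires the server to be started with VLLM_ALLOW_RUNTIME_LORA_UPDATING=True.
# Load an adapter that is already on disk (or resolvable from HF Hub).
requests.post(
    f"{BASE_URL}/v1/load_lora_adapter",
    json={"lora_name": "customer-a", "lora_path": "/mnt/adapters/customer-a"},
).raise_for_status()

# ... serve completions with model="customer-a" ...

# Unload the adapter when it is no longer needed on this replica.
requests.post(
    f"{BASE_URL}/v1/unload_lora_adapter",
    json={"lora_name": "customer-a"},
).raise_for_status()
```

Note that both calls affect only the single replica they reach, which is precisely the limitation described above.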

Solving both of these problems is necessary to offer multi-tenant LoRA serving in production settings.

The rest of this RFC makes the same assumptions as the /v1/load_lora_adapter endpoint, i.e. that the LoRA adapters in question are either:

  1. To be downloaded from HF Hub, or
  2. Available on disk to the vLLM process

The problem described here is tracking the metadata of which adapters should be loaded at any point in time across a deployment. Storing and loading the adapter artifacts themselves is a separate problem; other updates can be made to vLLM to address that, such as:

  • Accepting generic URLs in /v1/load_lora_adapter payloads
  • Accepting a tar archive upload in /v1/load_lora_adapter, etc.

Proposed Change.

General Solution Ideas

Option 1: Handle externally with smart routing

One option is to ignore the problem entirely at the vLLM level, and have an external routing component ensure that requests are only routed to replicas which have the adapter loaded. For example, kserve/modelmesh-serving provides a general purpose solution to this problem.

It would be possible to implement the internal APIs required for modelmesh in vLLM so that kserve could handle loading and routing for LoRA adapters without any extra state management in vLLM. There are probably some other third-party components that could be used in the same way, or we could write our own routing component.
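
As a rough illustration of the idea (not tied to any particular component), the routing layer would keep a mapping from adapter name to the replicas that currently have it loaded, and trigger a load on some replica when none does. A hypothetical sketch:

```python
import random

# Hypothetical routing table: adapter name -> replicas that have it loaded.
# A real component (e.g. modelmesh) also handles per-replica capacity limits,
# eviction, and health checking.
routing_table: dict[str, set[str]] = {}

def route(adapter_name: str, all_replicas: list[str]) -> str:
    """Pick a replica for a request targeting `adapter_name`."""
    candidates = routing_table.get(adapter_name)
    if candidates:
        return random.choice(sorted(candidates))
    # No replica has the adapter yet: pick one, ask it to load the adapter
    # (e.g. via its /v1/load_lora_adapter endpoint), and record the placement.
    target = random.choice(all_replicas)
    routing_table.setdefault(adapter_name, set()).add(target)
    return target
```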

Pros:

  • No extra state management required in vLLM
  • Third-party model management systems already offer compliance-ready solutions, handling concerns like data security, backup, and disaster recovery
  • Addressing this at the routing layer would let us manage which adapters, and how many, are loaded per replica. For large numbers of adapters, this could become necessary to avoid cache thrashing in each replica.

Cons:

  • Doesn't offer a vLLM-native solution to the problem
  • Increases deployment complexity
  • Introduces deployment dependency on a third party component
  • Would collide with other routing strategies such as:
    • session-aware routing
    • prefill/decode disaggregation

Option 2: Use external state management to track adapters

Another option is to have vLLM use an external data store like etcd directly to track loaded adapters. This would be a lighter-weight option than relying on a third party model management solution, but would still introduce extra deployment dependencies and overhead.
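
A sketch of what this could look like with the python-etcd3 client (the key layout and helper names here are hypothetical, and the load/unload wiring is elided):

```python
import json
import etcd3  # assumed extra dependency: python-etcd3

client = etcd3.client(host="etcd", port=2379)
PREFIX = "/vllm/deployments/my-llm/adapters/"  # hypothetical key layout

def register_adapter(lora_name: str, lora_path: str) -> None:
    # Written by whichever replica handled /v1/load_lora_adapter.
    client.put(PREFIX + lora_name,
               json.dumps({"lora_name": lora_name, "lora_path": lora_path}))

def watch_adapters() -> None:
    # Every replica watches the prefix so that registrations and deletions
    # propagate immediately, without any routing changes.
    events, _cancel = client.watch_prefix(PREFIX)
    for event in events:
        ...  # dispatch to the local load/unload logic based on the event type
```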

Pros:

  • Distributed data stores like etcd are well-understood and production-tested
  • Backup and restore operations are relatively easy for service operators
  • Logic can be wrapped in atomic transactions to ensure consistency across a deployment
  • "Watch" APIs can be used to push updates immediately to all replicas in a deployment
  • No routing changes needed, wouldn't collide with any other routing work

Cons:

  • Requires writing state management logic into vLLM
  • We have to maintain a database client, and any schema changes would need to be carefully considered
  • Introduces extra deployment dependency
  • Increases deployment complexity

Option 3: Use simple disk-based storage to track loaded adapters

Often, replicas of a deployment will mount a shared filesystem to access common artifacts. For example, in Kubernetes deployments an S3 bucket can be mounted as network-attached storage via a persistent volume claim using S3FS. This shared filesystem can be used to write simple files that track metadata for the adapters to be loaded for a given deployment.

Pros:

  • Simplest option, no additional code or deployment dependencies
  • Works anywhere you can mount a filesystem
  • Easy to implement and test locally

Cons:

  • Simple disk storage leaves encryption, backup, and restore as exercises for service operators
  • Requires file write permissions, which may be a security risk
  • NAS systems are generally non-atomic: concurrent writes may appear to succeed, but the last one wins
  • Consistency can be an issue depending on the filesystem used: writes may not be visible to other replicas for some time

Proposal

These options aren't necessarily mutually exclusive, so we propose implementing Option 3 as a short-term solution.

The simplest implementation would be to store the payloads from /v1/load_lora_adapter as JSON files in a configurable directory. The file name should be the adapter name, so that we can tell whether an adapter is loaded without reading file contents. At load time, the file should be written after the adapter successfully loads. When the /v1/models endpoint is called, these files should be used to determine the full set of available adapters, so that responses from all replicas are consistent (within the constraints of the underlying filesystem).

The entire implementation can be contained within the API layer of vLLM (vllm.entrypoints.openai.*); no changes would be required to the LoRA caching mechanism in the engine.
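
A minimal sketch of that API-layer logic (the VLLM_LORA_ADAPTER_CACHE variable is just one of the naming options discussed under the open questions below):

```python
import json
import os
from pathlib import Path

# Hypothetical location for the shared metadata files; see open question 1.
ADAPTER_DIR = Path(os.environ.get("VLLM_LORA_ADAPTER_CACHE", "/mnt/shared/adapters"))

def record_adapter(lora_name: str, lora_path: str) -> None:
    """Called after /v1/load_lora_adapter has successfully loaded the adapter."""
    ADAPTER_DIR.mkdir(parents=True, exist_ok=True)
    # The file name is the adapter name, so existence checks never need to
    # read file contents.
    (ADAPTER_DIR / f"{lora_name}.json").write_text(
        json.dumps({"lora_name": lora_name, "lora_path": lora_path})
    )

def registered_adapters() -> list[dict]:
    """Called from /v1/models so every replica reports the same adapter set."""
    if not ADAPTER_DIR.is_dir():
        return []
    return [json.loads(p.read_text()) for p in sorted(ADAPTER_DIR.glob("*.json"))]
```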

Assumptions:

  • The consuming application is responsible for providing access control and guarding against misuse, e.g. not allowing one user to register 10,000 adapters at once
    • We can handle some basic security checks at load time, like denying path traversal (see the sketch after this list)
  • We won't be providing per-adapter authorization
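
For example, a basic load-time check could reject adapter names that would escape the metadata directory (a sketch only; the exact validation rules are open for discussion):

```python
from pathlib import Path

def validate_adapter_name(lora_name: str, adapter_dir: Path) -> None:
    # Reject names like "../../etc/cron.d/job" or absolute paths that would
    # resolve to a file outside the configured adapter metadata directory.
    candidate = (adapter_dir / f"{lora_name}.json").resolve()
    if not candidate.is_relative_to(adapter_dir.resolve()):
        raise ValueError(f"invalid adapter name: {lora_name!r}")
```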

Open questions:

  1. Where should the adapter files be stored?

    One option is the existing configuration directory, e.g. "${VLLM_CONFIG_ROOT}/adapters". This seems appropriate for storing metadata files about loaded adapters, but may be inappropriate if we later expand to caching the actual adapter artifacts as well. We could introduce a new VLLM_LORA_ADAPTER_CACHE environment variable for clarity about where this data is stored.

  2. How should we handle deleting adapters - i.e. /v1/unload_lora_adapter?

    Deleting the metadata files seems appropriate, but propagating the deletion across replicas seems tricky. Filesystem-watch APIs are both OS- and filesystem-dependent and have limitations on some network-backed storage (e.g. you can't use inotify with S3FS). We could check file existence on every inference API call for each adapter, but that would add overhead to the critical path. It may be sufficient to not attempt to unload an adapter from all replicas, instead allowing adapters to eventually be unloaded when:

    • They are evicted from the LRU cache
    • The /v1/models endpoint is accessed and we check the set of loaded adapters
    • The process ends

    This would mean an adapter could remain available for inference on some replicas after unload; however, since we assume that the consuming application provides access control, this may not be a problem.

  3. Should adapters be loaded at boot time?

    Currently we validate all LoRA adapters given statically by the --lora-modules CLI arg by loading them at boot time. With dynamically loaded adapters, there may be an unbounded number of adapters to potentially load. The LRU caching mechanism in the engine ensures only the ones in use stay loaded, but there could be far more adapters registered than can fit in the cache. We can assume that if an adapter's metadata file exists, then the adapter was successfully loaded at least once before, so we don't need to re-load it for validation at boot time and can lazily load at inference time instead. But if the number of adapters is low (less than the cache size), we might still want to eagerly load at boot, trading a slightly longer boot time for lower latency on first inference (see the sketch below).
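
A sketch of that boot-time decision, reusing the hypothetical ADAPTER_DIR from the proposal sketch above and treating the engine's LoRA cache size as a given parameter:

```python
def maybe_eager_load_at_boot(lora_cache_size: int) -> None:
    """Eagerly load registered adapters only when they all fit in the LRU cache."""
    metadata_files = sorted(ADAPTER_DIR.glob("*.json")) if ADAPTER_DIR.is_dir() else []
    if len(metadata_files) > lora_cache_size:
        # Too many to keep resident: each metadata file implies a previously
        # successful load, so defer to lazy loading on first inference instead.
        return
    for path in metadata_files:
        ...  # call the existing load path with the recorded lora_name/lora_path
```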

Feedback Period.

Through 2/1/25

CC List.

@njhill @wangchen615 @tjohnson31415 @maxdebayser

Any Other Things.

No response
