Proposal to improve performance
Referencing the recent CPU-offloading implementation in #15354 (v0.8.4+).
@youkaichao, is there any specific reason to pick UVA (`cudaHostAlloc()`) over UVM (`cudaMallocManaged()`)?
- UVM goes further than UVA by managing data automatically, often using page-faulting hardware to migrate pages on demand. On systems like the GH200, this has potential additional benefits such as hardware-orchestrated, frequency-based migration.
- A key benefit of Unified Memory is simplifying the heterogeneous computing memory model by eliminating the need for deep copies when accessing structured data in GPU kernels. Source
- In several discussion threads, the larger access sizes involved in CPU offloading make UVM seem the better approach compared to UVA. Source
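To make the comparison concrete, here is a minimal sketch of the two allocation paths being discussed. The kernel, sizes, and variable names are illustrative only, not vLLM code:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= f;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // UVA path: pinned, mapped host memory. The GPU dereferences the
    // pointer directly over the interconnect on every access; no pages
    // migrate, so a profiler records no page-fault events.
    float *uva;
    cudaHostAlloc(&uva, bytes, cudaHostAllocMapped);
    scale<<<(n + 255) / 256, 256>>>(uva, n, 2.0f);
    cudaDeviceSynchronize();
    cudaFreeHost(uva);

    // UVM path: managed memory. First touch from the GPU page-faults and
    // migrates pages into device memory; subsequent accesses are local.
    float *uvm;
    cudaMallocManaged(&uvm, bytes);
    scale<<<(n + 255) / 256, 256>>>(uvm, n, 2.0f);
    cudaDeviceSynchronize();
    cudaFree(uvm);

    printf("done\n");
    return 0;
}
```

The trade-off in one line: UVA pays the interconnect cost on every remote access, while UVM pays a one-time migration cost (page faults) and then serves accesses from local device memory.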
Upon profiling vLLM v0.8.4 on a GH200 to assess the penalty of page migrations with CPU offloading, I noticed two things:
- `cudaHostAlloc()` calls were prevalent, but no page-fault data was collected. Going by UVA, these allocations were likely accessed directly from CPU memory, which could hurt utilization on the GPU.
- There is a high-utilization driver process called `UVM GPU1 BH`, whose behavior is unexplained on NVIDIA forums.

Going by this literature, if transparent offloading is desired, `cudaMallocManaged()` seems preferable for platforms such as the GH200.
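A further point in favor of the managed-memory route: migration does not have to be purely fault-driven. A hypothetical sketch (the buffer and hints are illustrative, not from the vLLM offloading code) of steering UVM placement explicitly:

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 26;  // illustrative 64 MiB weight buffer
    int dev = 0;
    cudaGetDevice(&dev);

    float *w;
    cudaMallocManaged(&w, bytes);

    // Hint that the offloaded weights are mostly read by the GPU, so the
    // driver can keep read-duplicated copies instead of bouncing pages.
    cudaMemAdvise(w, bytes, cudaMemAdviseSetReadMostly, dev);

    // Prefetch into device memory ahead of the forward pass, avoiding a
    // first-touch page-fault storm on the critical path.
    cudaMemPrefetchAsync(w, bytes, dev, /*stream=*/0);
    cudaDeviceSynchronize();

    cudaFree(w);
    return 0;
}
```

With hints like these, UVM can approximate the predictability of explicit copies while still falling back to transparent, on-demand migration for anything not prefetched.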

Report of performance regression
No response
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.