Proposal to improve performance
Referencing the recent CPU-offloading implementation in #15354 (v0.8.4+).
@youkaichao, is there any specific reason to pick UVA (`cudaHostAlloc()`) over UVM (`cudaMallocManaged()`)?
- UVM goes further than UVA by managing data automatically, often using page-faulting hardware to migrate pages on demand. On systems like the GH200, this has potential additional benefits such as hardware-orchestrated, frequency-based migration.
- A key benefit of Unified Memory is simplifying the heterogeneous computing memory model by eliminating the need for deep copies when accessing structured data in GPU kernels. Source
- In several discussion threads, the larger access sizes involved in CPU offloading make UVM seem the better approach compared to UVA. Source
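To make the comparison concrete, here is a minimal sketch of the two allocation paths being discussed. The kernel, sizes, and variable names are illustrative only, not vLLM code:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= f;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // UVA path: pinned, mapped host memory. The GPU dereferences the
    // pointer directly over the interconnect on every access; no pages
    // migrate, so a profiler records no page-fault events.
    float *uva;
    cudaHostAlloc(&uva, bytes, cudaHostAllocMapped);
    scale<<<(n + 255) / 256, 256>>>(uva, n, 2.0f);
    cudaDeviceSynchronize();
    cudaFreeHost(uva);

    // UVM path: managed memory. First touch from the GPU page-faults and
    // migrates pages into device memory; subsequent accesses are local.
    float *uvm;
    cudaMallocManaged(&uvm, bytes);
    scale<<<(n + 255) / 256, 256>>>(uvm, n, 2.0f);
    cudaDeviceSynchronize();
    cudaFree(uvm);

    printf("done\n");
    return 0;
}
```

The trade-off in one line: UVA pays the interconnect cost on every remote access, while UVM pays a one-time migration cost (page faults) and then serves accesses from local device memory.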
Upon profiling vLLM v0.8.4 on a GH200 to assess the penalty of page migrations with CPU offloading, I noticed two things:
- `cudaHostAlloc()` calls were prevalent, but no page-fault data was collected. Going by UVA, these allocations were likely accessed directly from CPU memory, which could hurt utilization on the GPU.
- There is a high-utilization driver process called `UVM GPU1 BH`, whose behavior is unexplained on NVIDIA forums.

Going by this literature, if transparent offloading is desired, `cudaMallocManaged()` seems preferable for platforms such as the GH200.
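A further point in favor of the managed-memory route: migration does not have to be purely fault-driven. A hypothetical sketch (the buffer and hints are illustrative, not from the vLLM offloading code) of steering UVM placement explicitly:

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 26;  // illustrative 64 MiB weight buffer
    int dev = 0;
    cudaGetDevice(&dev);

    float *w;
    cudaMallocManaged(&w, bytes);

    // Hint that the offloaded weights are mostly read by the GPU, so the
    // driver can keep read-duplicated copies instead of bouncing pages.
    cudaMemAdvise(w, bytes, cudaMemAdviseSetReadMostly, dev);

    // Prefetch into device memory ahead of the forward pass, avoiding a
    // first-touch page-fault storm on the critical path.
    cudaMemPrefetchAsync(w, bytes, dev, /*stream=*/0);
    cudaDeviceSynchronize();

    cudaFree(w);
    return 0;
}
```

With hints like these, UVM can approximate the predictability of explicit copies while still falling back to transparent, on-demand migration for anything not prefetched.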

Report of performance regression
No response
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.