Description
Problem:
I am aware everyone gets different results; in my case I am running llama.cpp with a 4090 as the primary and a 3090 as the secondary GPU, so both are quite capable cards for LLMs.
I am getting around 800% slowdowns when using both cards with the same model and settings (basically regardless of which model I tried): batch processing speed can drop from 2400 t/s to 200-300 t/s, i.e. 8-10 times slower than on a single GPU.
This happens as soon as even a tiny fraction of the processing is shifted to the 2nd card via -ts.
I assume it is a synchronization problem in the CUDA loops, and I also assume the issue does not affect every combination of GPUs, especially if one GPU is significantly slower than the other.
Suggestion:
My suggestion is to add a parameter like -layer-split: when it is used, the tensors are not split up; instead, whole layers are distributed across the cards (using -ls instead of -ts).
This means each layer's calculations can be computed entirely on a single GPU, without cross-GPU synchronization, at the highest possible performance of that GPU.
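To make the idea concrete, here is a minimal C++ sketch of how such a layer split could be computed. This is not llama.cpp code; the function name `layer_split` and the fraction-based interface are just assumptions for illustration, analogous to the proportions -ts takes today. Each GPU gets a contiguous block of whole layers, so every layer runs entirely on one device.

```cpp
// Sketch only: hypothetical layer-split assignment, not the actual llama.cpp API.
// Given per-GPU fractions, assign each GPU a contiguous block of whole layers
// so that each layer is computed on exactly one device.
#include <cstdio>
#include <vector>

std::vector<int> layer_split(int n_layers, const std::vector<float> & fractions) {
    float total = 0.0f;
    for (float f : fractions) total += f;

    std::vector<int> counts(fractions.size(), 0);
    int assigned = 0;
    for (size_t i = 0; i < fractions.size(); ++i) {
        counts[i] = (int)(n_layers * fractions[i] / total);
        assigned += counts[i];
    }
    // Give any leftover layers (from rounding down) to the first GPU,
    // which here stands in for the fastest card.
    counts[0] += n_layers - assigned;
    return counts;
}

int main() {
    // e.g. 80 layers, split 60/40 between a 4090 and a 3090
    std::vector<int> counts = layer_split(80, {0.6f, 0.4f});
    for (size_t i = 0; i < counts.size(); ++i) {
        printf("GPU %zu: %d layers\n", i, counts[i]);
    }
    return 0;
}
```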
Caveat:
In theory, tensor split should boost performance, since both cards can work on a split tensor at the same time, so it is the better solution. But that is currently so far from reality that the suggested layer split should significantly boost processing speed.
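A rough back-of-envelope to illustrate the caveat (all numbers below are made up, purely for illustration): if the per-layer synchronization/transfer cost is large compared with the per-layer compute time, splitting each tensor across two GPUs ends up slower than computing the whole layer on one GPU, even though the compute itself is halved.

```cpp
// Illustration only, with assumed numbers: why per-layer synchronization can
// erase the theoretical benefit of splitting tensors across two GPUs.
#include <cstdio>

int main() {
    const double layer_compute_us = 20.0;  // assumed compute time per layer on one GPU
    const double sync_us          = 100.0; // assumed per-layer cross-GPU sync/transfer cost

    // Single GPU: no synchronization needed.
    const double single_gpu = layer_compute_us;

    // Ideal tensor split across two equal GPUs: half the compute, plus the sync cost.
    const double tensor_split = layer_compute_us / 2.0 + sync_us;

    printf("single GPU per layer:   %.1f us\n", single_gpu);
    printf("tensor split per layer: %.1f us  (%.1fx slower)\n",
           tensor_split, tensor_split / single_gpu);
    return 0;
}
```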
@JohannesGaessler what do you think?