
Multi GPU CUDA - 8x performance degradation when splitting tensors -> let's split by layer as an option #4055

Closed
@cmp-nct

Description


Problem:
I am aware that everyone sees different results; in my case I am running llama.cpp with a 4090 as the primary GPU and a 3090 as the secondary, so both are quite capable cards for LLMs.
I am seeing roughly an 8x slowdown when using both cards on the same model and settings (basically regardless of which model I tried): batch processing speed drops from 2400 t/s to 200-300 t/s (8-10 times slower than on a single GPU).
This happens as soon as any tiny fraction of the processing (via `-ts`) is shifted to the second card.

I assume it is a synchronization problem in the CUDA loops; I also assume the issue does not affect every combination of GPUs, especially when one GPU is significantly slower than the other.

Suggestion:
My suggestion is to add a parameter such as `-layer-split`: when it is used, the tensors are not split up; instead, whole layers are distributed across the cards (using `-ls` instead of `-ts`).
This means each layer's calculations can be computed entirely on a single GPU, without cross-GPU synchronization, at that GPU's highest possible performance.
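The proposed layer split could be sketched roughly as follows. The function name and the split-fraction interface (mirroring the `-ts` convention) are hypothetical illustrations, not actual llama.cpp code:

```python
# Hypothetical sketch: assign contiguous layer ranges to GPUs in
# proportion to the requested split fractions (analogous to -ts values).
def layer_split(n_layers, fractions):
    total = sum(fractions)
    ranges = []
    start = 0
    for i, f in enumerate(fractions):
        if i == len(fractions) - 1:
            end = n_layers  # last GPU takes the remainder, avoiding rounding gaps
        else:
            end = start + round(n_layers * f / total)
        ranges.append((start, end))
        start = end
    return ranges

# e.g. 32 layers split 3:1 between a 4090 and a 3090
print(layer_split(32, [3, 1]))  # → [(0, 24), (24, 32)]
```

Each GPU then runs its contiguous range of layers end to end, and activations cross a device boundary only once per adjacent pair of GPUs.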

Caveat:
In theory a tensor split should boost performance, since both cards can process a split tensor at the same time, so it is the better solution in principle. But that is currently so far from reality that the suggested layer split should still significantly speed up processing.

@JohannesGaessler what do you think?

Labels: enhancement (New feature or request)