Description
I am trying to deploy a Baichuan2-7B model on a machine with 2 Tesla V100 GPUs. Unfortunately, each V100 has only 16 GB of memory.
I have applied INT8 weight-only quantization, so the engine I get is about 8 GB. I have also set `--world_size` to 2 to use 2-way tensor parallelism.
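For reference, I built the engine with something like the command below (a sketch from memory, based on the TensorRT-LLM Baichuan example; flag names can differ between TensorRT-LLM versions, and the directory paths are placeholders):

```bash
# Sketch of the engine build step. The checkpoint and output
# directories below are placeholders for my actual paths.
python build.py \
    --model_dir ./Baichuan2-7B-Chat \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8 \
    --world_size 2 \
    --output_dir ./engines/baichuan2-7b-tp2
```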
But when I try to start the Triton server, I always get an Out of Memory error. It seems that one instance is launched on each GPU, and neither GPU has enough memory for it on its own. I know that 32 GB of combined memory is enough to deploy the model, as I have done that on another machine, but I don't know how to deploy the model across 2 GPUs here.
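This is roughly how I start the server (again a sketch; I am using the helper script from the tensorrtllm_backend repository, which wraps tritonserver in mpirun so that each rank drives one GPU, and the model repository path is a placeholder):

```bash
# Sketch of the launch step via the tensorrtllm_backend helper script.
# My expectation was that each of the two ranks would load only its
# own tensor-parallel shard of the 8 GB engine.
python3 scripts/launch_triton_server.py \
    --world_size 2 \
    --model_repo ./triton_model_repo
```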
Can anyone help?