Description
I am trying to deploy a Baichuan2-7B model on a machine with 2 Tesla V100 GPUs. Unfortunately, each V100 has only 16 GB of memory.
I have applied INT8 weight-only quantization, so the engine I get is about 8 GB. I have also set `--world_size` to 2 to use 2-way tensor parallelism.
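For reference, I built the engine with something like the command below (a sketch from memory, based on the TensorRT-LLM Baichuan example; flag names can differ between TensorRT-LLM versions, and the directory paths are placeholders):

```bash
# Sketch of the engine build step. The checkpoint and output
# directories below are placeholders for my actual paths.
python build.py \
    --model_dir ./Baichuan2-7B-Chat \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8 \
    --world_size 2 \
    --output_dir ./engines/baichuan2-7b-tp2
```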
But when I try to start the Triton server, I always get an Out of Memory error. It seems that one instance is launched on each GPU, and neither GPU has enough memory for it on its own. I know that 32 GB of combined memory is enough to deploy the model, as I have done that on another machine, but I don't know how to deploy the model across 2 GPUs here.
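This is roughly how I start the server (again a sketch; I am using the helper script from the tensorrtllm_backend repository, which wraps tritonserver in mpirun so that each rank drives one GPU, and the model repository path is a placeholder):

```bash
# Sketch of the launch step via the tensorrtllm_backend helper script.
# My expectation was that each of the two ranks would load only its
# own tensor-parallel shard of the 8 GB engine.
python3 scripts/launch_triton_server.py \
    --world_size 2 \
    --model_repo ./triton_model_repo
```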
Can anyone help?