
Deploying Mixtral-8x7B-v0.1 with Triton 24.02 on A100 (160GB) raises "Cuda Runtime (out of memory)" exception #438

Open
@kelkarn

Description


System Info

Environment

CPU architecture: x86_64
CPU/Host memory size: 440 GiB

GPU properties

GPU name: A100
GPU memory size: 160GB
I am using the Azure offering of this GPU: Standard NC48ads A100 v4 (48 vcpus, 440 GiB memory)

Libraries

TensorRT-LLM branch or tag: v0.8.0
Container used: 24.02-trtllm-python-py3 (following the support matrix)

NVIDIA driver version: Driver Version: 535.161.07

OS: Ubuntu 22.04 (Jammy)

Who can help?

@byshiue @schetlur-nv

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Convert checkpoint:

# Run with tensor parallelism
python3 ../llama/convert_checkpoint.py --model_dir ./Mixtral-8x7B-v0.1 \
                             --output_dir ./tllm_checkpoint_mixtral_2gpu \
                             --dtype float16 \
                             --tp_size 2

  2. Build engine file:

trtllm-build --checkpoint_dir ./tllm_checkpoint_mixtral_2gpu \
                 --output_dir ./mixtral-engine-1 \
                 --gemm_plugin float16

  3. Copy engine into models directory:

rm /tensorrtllm_backend/models/mixtral56b/mixtral56b/1/* && cp mixtral-engine-0/* /tensorrtllm_backend/models/mixtral56b/mixtral56b/1/.

  4. Run Triton server 24.02-trtllm-python-py3 from the volume-mounted /tensorrtllm_backend folder:

python3 scripts/launch_triton_server.py --world_size=2 --model_repo=/tensorrtllm_backend/models/mixtral56b --tensorrt_llm_model_name=mixtral56b --log
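As a pre-launch sanity check (a minimal sketch, not part of the original run; the engine directory is taken from step 3 above, and the *.engine file naming plus the use of nvidia-smi are assumptions), something like the following confirms that both tensor-parallel rank engines are in the model repository and reports how much GPU memory is free:

# Sanity check (assumed paths/naming): confirm both tensor-parallel rank engines
# are in the model repository and report free GPU memory before launching Triton.
import glob
import subprocess

ENGINE_DIR = "/tensorrtllm_backend/models/mixtral56b/mixtral56b/1"  # from step 3 above

engines = sorted(glob.glob(f"{ENGINE_DIR}/*.engine"))
print(f"{len(engines)} engine file(s) in {ENGINE_DIR}:")
for path in engines:
    print("  ", path)
# A --tp_size 2 / --world_size=2 deployment needs one engine per rank here.

# Free memory per GPU, in MiB, via nvidia-smi.
out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=index,memory.free",
     "--format=csv,noheader,nounits"],
    text=True,
)
for line in out.strip().splitlines():
    idx, free_mib = (s.strip() for s in line.split(","))
    print(f"GPU {idx}: {free_mib} MiB free")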

Expected behavior

I expect the Triton server to start successfully, show the Mixtral model in the READY state, and listen on ports 8000 and 8001 for HTTP and gRPC requests respectively.
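A minimal readiness probe that corresponds to this expectation (a sketch, assuming the server is reachable on localhost:8000 and using Triton's standard /v2/health/ready and /v2/models/<name>/ready HTTP routes):

# Readiness probe against Triton's standard HTTP endpoints (assumes localhost:8000).
import urllib.error
import urllib.request

BASE = "http://localhost:8000"

def is_ready(path: str) -> bool:
    try:
        with urllib.request.urlopen(f"{BASE}{path}", timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

print("server ready:    ", is_ready("/v2/health/ready"))
print("mixtral56b ready:", is_ready("/v2/models/mixtral56b/ready"))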

Actual behavior

I get a CUDA out of memory error like so:

I0429 19:54:55.691380 689 cache_manager.cc:480] Create CacheManager with cache_dir: '/opt/tritonserver/caches'
I0429 19:54:56.279264 689 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7512ee000000' with size 268435456
I0429 19:54:57.497967 689 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0429 19:54:57.497991 689 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I0429 19:54:57.649604 689 model_config_utils.cc:680] Server side auto-completed config: name: "ensemble"
platform: "ensemble"
max_batch_size: 1024
input {
  name: "text_input"
  data_type: TYPE_STRING
  dims: -1
}
input {
  name: "max_tokens"
  data_type: TYPE_INT32
  dims: -1
}
output {
  name: "text_output"
  data_type: TYPE_STRING
  dims: -1
}
ensemble_scheduling {
  step {
    model_name: "preprocessing"
    model_version: -1
    input_map {
      key: "QUERY"
      value: "text_input"
    }
    input_map {
      key: "REQUEST_OUTPUT_LEN"
      value: "max_tokens"
    }
    output_map {
      key: "INPUT_ID"
      value: "_INPUT_ID"
    }
    output_map {
      key: "REQUEST_OUTPUT_LEN"
      value: "_REQUEST_OUTPUT_LEN"
    }
  }
  step {
    model_name: "mixtral56b"
    model_version: -1
    input_map {
      key: "input_ids"
      value: "_INPUT_ID"
    }
    input_map {
      key: "request_output_len"
      value: "_REQUEST_OUTPUT_LEN"
    }
    output_map {
      key: "output_ids"
      value: "_TOKENS_BATCH"
    }
  }
  step {
    model_name: "postprocessing"
    model_version: -1
    input_map {
      key: "TOKENS_BATCH"
      value: "_TOKENS_BATCH"
    }
    output_map {
      key: "OUTPUT"
      value: "text_output"
    }
  }
}

I0429 19:55:02.705961 689 model_config_utils.cc:680] Server side auto-completed config: name: "mixtral56b"
max_batch_size: 1024
input {
  name: "input_ids"
  data_type: TYPE_INT32
  dims: -1
  allow_ragged_batch: true
}
input {
  name: "request_output_len"
  data_type: TYPE_INT32
  dims: 1
}
output {
  name: "output_ids"
  data_type: TYPE_INT32
  dims: -1
  dims: -1
}
output {
  name: "sequence_length"
  data_type: TYPE_INT32
  dims: -1
}
output {
  name: "cum_log_probs"
  data_type: TYPE_FP32
  dims: -1
}
output {
  name: "output_log_probs"
  data_type: TYPE_FP32
  dims: -1
  dims: -1
}
output {
  name: "context_logits"
  data_type: TYPE_FP32
  dims: -1
  dims: -1
}
output {
  name: "generation_logits"
  data_type: TYPE_FP32
  dims: -1
  dims: -1
  dims: -1
}
instance_group {
  count: 1
  kind: KIND_GPU
}
parameters {
  key: "FORCE_CPU_ONLY_INPUT_TENSORS"
  value {
    string_value: "no"
  }
}
parameters {
  key: "batch_scheduler_policy"
  value {
    string_value: "guaranteed_no_evict"
  }
}
parameters {
  key: "enable_chunked_context"
  value {
    string_value: "false"
  }
}
parameters {
  key: "enable_kv_cache_reuse"
  value {
    string_value: "${enable_kv_cache_reuse}"
  }
}
parameters {
  key: "enable_trt_overlap"
  value {
    string_value: "false"
  }
}
parameters {
  key: "exclude_input_in_output"
  value {
    string_value: "true"
  }
}
parameters {
  key: "gpt_model_path"
  value {
    string_value: "/tensorrtllm_backend/models/mixtral56b/mixtral56b/1"
  }
}
parameters {
  key: "gpt_model_type"
  value {
    string_value: "V1"
  }
}
parameters {
  key: "gpu_device_ids"
  value {
    string_value: "${gpu_device_ids}"
  }
}
parameters {
  key: "kv_cache_free_gpu_mem_fraction"
  value {
    string_value: "${kv_cache_free_gpu_mem_fraction}"
  }
}
parameters {
  key: "max_attention_window_size"
  value {
    string_value: "${max_attention_window_size}"
  }
}
parameters {
  key: "max_beam_width"
  value {
    string_value: "${max_beam_width}"
  }
}
parameters {
  key: "max_tokens_in_paged_kv_cache"
  value {
    string_value: "34000"
  }
}
parameters {
  key: "normalize_log_probs"
  value {
    string_value: "true"
  }
}
backend: "tensorrtllm"
model_transaction_policy {
}

I0429 19:55:03.103345 689 model_config_utils.cc:680] Server side auto-completed config: name: "postprocessing"
max_batch_size: 1024
input {
  name: "TOKENS_BATCH"
  data_type: TYPE_INT32
  dims: -1
  dims: -1
}
output {
  name: "OUTPUT"
  data_type: TYPE_STRING
  dims: -1
}
instance_group {
  count: 1
  kind: KIND_CPU
}
default_model_filename: "model.py"
parameters {
  key: "skip_special_tokens"
  value {
    string_value: "True"
  }
}
parameters {
  key: "tokenizer_dir"
  value {
    string_value: "/tensorrtllm_backend/tensorrt_llm/examples/mixtral/Mixtral-8x7B-v0.1"
  }
}
parameters {
  key: "tokenizer_type"
  value {
    string_value: "auto"
  }
}
backend: "python"

I0429 19:55:03.104288 689 model_config_utils.cc:680] Server side auto-completed config: name: "preprocessing"
max_batch_size: 1024
input {
  name: "QUERY"
  data_type: TYPE_STRING
  dims: -1
}
input {
  name: "REQUEST_OUTPUT_LEN"
  data_type: TYPE_INT32
  dims: -1
}
input {
  name: "BAD_WORDS_DICT"
  data_type: TYPE_STRING
  dims: -1
  optional: true
}
input {
  name: "STOP_WORDS_DICT"
  data_type: TYPE_STRING
  dims: -1
  optional: true
}
input {
  name: "EMBEDDING_BIAS_WORDS"
  data_type: TYPE_STRING
  dims: -1
  optional: true
}
input {
  name: "EMBEDDING_BIAS_WEIGHTS"
  data_type: TYPE_FP32
  dims: -1
  optional: true
}
input {
  name: "END_ID"
  data_type: TYPE_INT32
  dims: -1
  optional: true
}
input {
  name: "PAD_ID"
  data_type: TYPE_INT32
  dims: -1
  optional: true
}
output {
  name: "INPUT_ID"
  data_type: TYPE_INT32
  dims: -1
}
output {
  name: "REQUEST_OUTPUT_LEN"
  data_type: TYPE_INT32
  dims: -1
}
instance_group {
  count: 1
  kind: KIND_CPU
}
default_model_filename: "model.py"
parameters {
  key: "add_special_tokens"
  value {
    string_value: "False"
  }
}
parameters {
  key: "tokenizer_dir"
  value {
    string_value: "/tensorrtllm_backend/tensorrt_llm/examples/mixtral/Mixtral-8x7B-v0.1"
  }
}
parameters {
  key: "tokenizer_type"
  value {
    string_value: "auto"
  }
}
backend: "python"

I0429 19:55:03.104390 689 model_lifecycle.cc:438] AsyncLoad() 'preprocessing'
I0429 19:55:03.104439 689 model_lifecycle.cc:469] loading: preprocessing:1
I0429 19:55:03.104463 689 model_lifecycle.cc:438] AsyncLoad() 'postprocessing'
I0429 19:55:03.104500 689 model_lifecycle.cc:469] loading: postprocessing:1
I0429 19:55:03.104514 689 model_lifecycle.cc:438] AsyncLoad() 'mixtral56b'
I0429 19:55:03.104554 689 model_lifecycle.cc:469] loading: mixtral56b:1
I0429 19:55:03.104566 689 model_lifecycle.cc:547] CreateModel() 'preprocessing' version 1
I0429 19:55:03.104643 689 model_lifecycle.cc:547] CreateModel() 'mixtral56b' version 1
I0429 19:55:03.104653 689 model_lifecycle.cc:547] CreateModel() 'postprocessing' version 1
I0429 19:55:03.104737 689 backend_model.cc:502] Adding default backend config setting: default-max-batch-size,4
I0429 19:55:03.104696 689 backend_model.cc:502] Adding default backend config setting: default-max-batch-size,4
I0429 19:55:03.104765 689 shared_library.cc:112] OpenLibraryHandle: /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so
I0429 19:55:03.104803 689 backend_model.cc:502] Adding default backend config setting: default-max-batch-size,4
I0429 19:55:03.159322 689 shared_library.cc:112] OpenLibraryHandle: /opt/tritonserver/backends/python/libtriton_python.so
I0429 19:55:03.165373 689 model_config_utils.cc:1902] ModelConfig 64-bit fields:
I0429 19:55:03.201148 689 model_config_utils.cc:1904] 	ModelConfig::dynamic_batching::default_priority_level
I0429 19:55:03.201158 689 model_config_utils.cc:1904] 	ModelConfig::dynamic_batching::default_queue_policy::default_timeout_microseconds
I0429 19:55:03.201166 689 model_config_utils.cc:1904] 	ModelConfig::dynamic_batching::max_queue_delay_microseconds
I0429 19:55:03.201172 689 model_config_utils.cc:1904] 	ModelConfig::dynamic_batching::priority_levels
I0429 19:55:03.201179 689 model_config_utils.cc:1904] 	ModelConfig::dynamic_batching::priority_queue_policy::key
I0429 19:55:03.201185 689 model_config_utils.cc:1904] 	ModelConfig::dynamic_batching::priority_queue_policy::value::default_timeout_microseconds
I0429 19:55:03.201191 689 model_config_utils.cc:1904] 	ModelConfig::ensemble_scheduling::step::model_version
I0429 19:55:03.201198 689 model_config_utils.cc:1904] 	ModelConfig::input::dims
I0429 19:55:03.201220 689 model_config_utils.cc:1904] 	ModelConfig::input::reshape::shape
I0429 19:55:03.201227 689 model_config_utils.cc:1904] 	ModelConfig::instance_group::secondary_devices::device_id
I0429 19:55:03.201233 689 model_config_utils.cc:1904] 	ModelConfig::model_warmup::inputs::value::dims
I0429 19:55:03.201239 689 model_config_utils.cc:1904] 	ModelConfig::optimization::cuda::graph_spec::graph_lower_bound::input::value::dim
I0429 19:55:03.201246 689 model_config_utils.cc:1904] 	ModelConfig::optimization::cuda::graph_spec::input::value::dim
I0429 19:55:03.201253 689 model_config_utils.cc:1904] 	ModelConfig::output::dims
I0429 19:55:03.201259 689 model_config_utils.cc:1904] 	ModelConfig::output::reshape::shape
I0429 19:55:03.201269 689 model_config_utils.cc:1904] 	ModelConfig::sequence_batching::direct::max_queue_delay_microseconds
I0429 19:55:03.201275 689 model_config_utils.cc:1904] 	ModelConfig::sequence_batching::max_sequence_idle_microseconds
I0429 19:55:03.201281 689 model_config_utils.cc:1904] 	ModelConfig::sequence_batching::oldest::max_queue_delay_microseconds
I0429 19:55:03.201287 689 model_config_utils.cc:1904] 	ModelConfig::sequence_batching::state::dims
I0429 19:55:03.201294 689 model_config_utils.cc:1904] 	ModelConfig::sequence_batching::state::initial_state::dims
I0429 19:55:03.201300 689 model_config_utils.cc:1904] 	ModelConfig::version_policy::specific::versions
I0429 19:55:03.202695 689 python_be.cc:2075] 'python' TRITONBACKEND API version: 1.18
I0429 19:55:03.202715 689 python_be.cc:2097] backend configuration:
{"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","shm-region-prefix-name":"prefix0_","default-max-batch-size":"4"}}
I0429 19:55:03.202749 689 python_be.cc:2236] Shared memory configuration is shm-default-byte-size=1048576,shm-growth-byte-size=1048576,stub-timeout-seconds=30
I0429 19:55:03.212009 689 python_be.cc:2559] TRITONBACKEND_GetBackendAttribute: setting attributes
I0429 19:55:03.221478 689 python_be.cc:2337] TRITONBACKEND_ModelInitialize: preprocessing (version 1)
I0429 19:55:03.221719 689 python_be.cc:2337] TRITONBACKEND_ModelInitialize: postprocessing (version 1)
I0429 19:55:03.221901 689 python_be.cc:2031] model configuration:
{
    "name": "preprocessing",
    "platform": "",
    "backend": "python",
    "runtime": "",
    "version_policy": {
        "latest": {
            "num_versions": 1
        }
    },
    "max_batch_size": 1024,
    "input": [
        {
            "name": "QUERY",
            "data_type": "TYPE_STRING",
            "format": "FORMAT_NONE",
            "dims": [
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        },
        {
            "name": "REQUEST_OUTPUT_LEN",
            "data_type": "TYPE_INT32",
            "format": "FORMAT_NONE",
            "dims": [
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        },
        {
            "name": "BAD_WORDS_DICT",
            "data_type": "TYPE_STRING",
            "format": "FORMAT_NONE",
            "dims": [
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        },
        {
            "name": "STOP_WORDS_DICT",
            "data_type": "TYPE_STRING",
            "format": "FORMAT_NONE",
            "dims": [
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        },
        {
            "name": "EMBEDDING_BIAS_WORDS",
            "data_type": "TYPE_STRING",
            "format": "FORMAT_NONE",
            "dims": [
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        },
        {
            "name": "EMBEDDING_BIAS_WEIGHTS",
            "data_type": "TYPE_FP32",
            "format": "FORMAT_NONE",
            "dims": [
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        },
        {
            "name": "END_ID",
            "data_type": "TYPE_INT32",
            "format": "FORMAT_NONE",
            "dims": [
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        },
        {
            "name": "PAD_ID",
            "data_type": "TYPE_INT32",
            "format": "FORMAT_NONE",
            "dims": [
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        }
    ],
    "output": [
        {
            "name": "INPUT_ID",
            "data_type": "TYPE_INT32",
            "dims": [
                -1
            ],
            "label_filename": "",
            "is_shape_tensor": false
        },
        {
            "name": "REQUEST_OUTPUT_LEN",
            "data_type": "TYPE_INT32",
            "dims": [
                -1
            ],
            "label_filename": "",
            "is_shape_tensor": false
        }
    ],
    "batch_input": [],
    "batch_output": [],
    "optimization": {
        "priority": "PRIORITY_DEFAULT",
        "input_pinned_memory": {
            "enable": true
        },
        "output_pinned_memory": {
            "enable": true
        },
        "gather_kernel_buffer_threshold": 0,
        "eager_batching": false
    },
    "instance_group": [
        {
            "name": "preprocessing_0",
            "kind": "KIND_CPU",
            "count": 1,
            "gpus": [],
            "secondary_devices": [],
            "profile": [],
            "passive": false,
            "host_policy": ""
        }
    ],
    "default_model_filename": "model.py",
    "cc_model_filenames": {},
    "metric_tags": {},
    "parameters": {
        "add_special_tokens": {
            "string_value": "False"
        },
        "tokenizer_dir": {
            "string_value": "/tensorrtllm_backend/tensorrt_llm/examples/mixtral/Mixtral-8x7B-v0.1"
        },
        "tokenizer_type": {
            "string_value": "auto"
        }
    },
    "model_warmup": []
}
I0429 19:55:03.222121 689 python_be.cc:2031] model configuration:
{
    "name": "postprocessing",
    "platform": "",
    "backend": "python",
    "runtime": "",
    "version_policy": {
        "latest": {
            "num_versions": 1
        }
    },
    "max_batch_size": 1024,
    "input": [
        {
            "name": "TOKENS_BATCH",
            "data_type": "TYPE_INT32",
            "format": "FORMAT_NONE",
            "dims": [
                -1,
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        }
    ],
    "output": [
        {
            "name": "OUTPUT",
            "data_type": "TYPE_STRING",
            "dims": [
                -1
            ],
            "label_filename": "",
            "is_shape_tensor": false
        }
    ],
    "batch_input": [],
    "batch_output": [],
    "optimization": {
        "priority": "PRIORITY_DEFAULT",
        "input_pinned_memory": {
            "enable": true
        },
        "output_pinned_memory": {
            "enable": true
        },
        "gather_kernel_buffer_threshold": 0,
        "eager_batching": false
    },
    "instance_group": [
        {
            "name": "postprocessing_0",
            "kind": "KIND_CPU",
            "count": 1,
            "gpus": [],
            "secondary_devices": [],
            "profile": [],
            "passive": false,
            "host_policy": ""
        }
    ],
    "default_model_filename": "model.py",
    "cc_model_filenames": {},
    "metric_tags": {},
    "parameters": {
        "tokenizer_type": {
            "string_value": "auto"
        },
        "tokenizer_dir": {
            "string_value": "/tensorrtllm_backend/tensorrt_llm/examples/mixtral/Mixtral-8x7B-v0.1"
        },
        "skip_special_tokens": {
            "string_value": "True"
        }
    },
    "model_warmup": []
}
I0429 19:55:03.222144 689 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
I0429 19:55:03.222198 689 backend_model_instance.cc:69] Creating instance preprocessing_0_0 on CPU using artifact 'model.py'
I0429 19:55:03.222474 689 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
I0429 19:55:03.222494 689 backend_model_instance.cc:69] Creating instance postprocessing_0_0 on CPU using artifact 'model.py'
I0429 19:55:03.235963 689 stub_launcher.cc:388] Starting Python backend stub:  exec /opt/tritonserver/backends/python/triton_python_backend_stub /tensorrtllm_backend/models/mixtral56b/preprocessing/1/model.py prefix0_1 1048576 1048576 689 /opt/tritonserver/backends/python 336 preprocessing_0_0 DEFAULT
I0429 19:55:03.236004 689 stub_launcher.cc:388] Starting Python backend stub:  exec /opt/tritonserver/backends/python/triton_python_backend_stub /tensorrtllm_backend/models/mixtral56b/postprocessing/1/model.py prefix0_2 1048576 1048576 689 /opt/tritonserver/backends/python 336 postprocessing_0_0 DEFAULT
I0429 19:55:19.751221 689 python_be.cc:2402] TRITONBACKEND_ModelInstanceInitialize: instance initialization successful postprocessing_0_0 (device 0)
I0429 19:55:19.751472 689 backend_model_instance.cc:772] Starting backend thread for postprocessing_0_0 at nice 0 on device 0...
I0429 19:55:19.751599 689 backend_model.cc:674] Created model instance named 'postprocessing_0_0' with device id '0'
I0429 19:55:19.751910 689 model_lifecycle.cc:692] OnLoadComplete() 'postprocessing' version 1
I0429 19:55:19.751964 689 model_lifecycle.cc:730] OnLoadFinal() 'postprocessing' for all version(s)
I0429 19:55:19.751978 689 model_lifecycle.cc:835] successfully loaded 'postprocessing'
I0429 19:55:19.754814 689 python_be.cc:2402] TRITONBACKEND_ModelInstanceInitialize: instance initialization successful preprocessing_0_0 (device 0)
I0429 19:55:19.755006 689 backend_model_instance.cc:772] Starting backend thread for preprocessing_0_0 at nice 0 on device 0...
I0429 19:55:19.755118 689 backend_model.cc:674] Created model instance named 'preprocessing_0_0' with device id '0'
I0429 19:55:19.755267 689 model_lifecycle.cc:692] OnLoadComplete() 'preprocessing' version 1
I0429 19:55:19.755305 689 model_lifecycle.cc:730] OnLoadFinal() 'preprocessing' for all version(s)
I0429 19:55:19.755317 689 model_lifecycle.cc:835] successfully loaded 'preprocessing'
I0429 19:55:19.755449 689 model_lifecycle.cc:294] VersionStates() 'preprocessing'
I0429 19:55:19.755517 689 model_lifecycle.cc:294] VersionStates() 'postprocessing'
I0429 20:01:30.442184 689 backend_model_instance.cc:772] Starting backend thread for mixtral56b_0_0 at nice 0 on device 0...
I0429 20:01:30.442460 689 backend_model.cc:674] Created model instance named 'mixtral56b_0_0' with device id '0'
E0429 20:01:53.524615 689 backend_model.cc:691] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:72)
1       0x75124c2614ba tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2       0x75124c2850a0 /opt/tritonserver/backends/tensorrtllm/libtensorrt_llm.so(+0x79c0a0) [0x75124c2850a0]
3       0x75124e0cb572 tensorrt_llm::runtime::GptSession::GptSession(tensorrt_llm::runtime::GptSession::Config const&, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, void const*, unsigned long, std::shared_ptr<nvinfer1::ILogger>) + 946
4       0x75124e15731d tensorrt_llm::batch_manager::TrtGptModelV1::TrtGptModelV1(int, std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::GptModelConfig, tensorrt_llm::runtime::WorldConfig, std::vector<unsigned char, std::allocator<unsigned char> > const&, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 701
5       0x75124e125dd4 tensorrt_llm::batch_manager::TrtGptModelFactory::create(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 2804
6       0x75124e11ce00 tensorrt_llm::batch_manager::GptManager::GptManager(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, std::function<std::list<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest> > > (int)>, std::function<void (unsigned long, std::list<tensorrt_llm::batch_manager::NamedTensor, std::allocator<tensorrt_llm::batch_manager::NamedTensor> > const&, bool, std::string const&)>, std::function<std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > ()>, std::function<void (std::string const&)>, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&, std::optional<unsigned long>, std::optional<int>, bool) + 336
7       0x75134403cb62 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x18b62) [0x75134403cb62]
8       0x75134403d3f2 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x193f2) [0x75134403d3f2]
9       0x75134402ffd5 TRITONBACKEND_ModelInstanceInitialize + 101
10      0x75134a932296 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1ad296) [0x75134a932296]
11      0x75134a9334d6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1ae4d6) [0x75134a9334d6]
12      0x75134a916045 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191045) [0x75134a916045]
13      0x75134a916686 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191686) [0x75134a916686]
14      0x75134a922efd /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19defd) [0x75134a922efd]
15      0x751349f86ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x751349f86ee8]
16      0x75134a90cf0b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x187f0b) [0x75134a90cf0b]
17      0x75134a91dc65 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x198c65) [0x75134a91dc65]
18      0x75134a92231e /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19d31e) [0x75134a92231e]
19      0x75134aa140c8 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x28f0c8) [0x75134aa140c8]
20      0x75134aa179ac /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2929ac) [0x75134aa179ac]
21      0x75134ab6b6c2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3e66c2) [0x75134ab6b6c2]
22      0x75134a1f2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x75134a1f2253]
23      0x751349f81ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x751349f81ac3]
24      0x75134a012a04 clone + 68
I0429 20:01:53.524820 689 backend_model_instance.cc:795] Stopping backend thread for mixtral56b_0_0...

On the command-line I see:

[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2048
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.
[TensorRT-LLM][INFO] Loaded engine size: 44668 MiB
[TensorRT-LLM][ERROR] 1: [defaultAllocator.cpp::allocate::20] Error Code 1: Cuda Runtime (out of memory)
[TensorRT-LLM][WARNING] Requested amount of GPU memory (46835179520 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[TensorRT-LLM][ERROR] 2: [safeDeserialize.cpp::load::269] Error Code 2: OutOfMemory (no further information)
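For scale, a quick back-of-the-envelope on the two numbers in the log (a sketch, assuming each A100 in this 2-GPU VM has 80 GiB): the failed 46,835,179,520-byte allocation works out to about 43.6 GiB, essentially the reported 44,668 MiB engine size, so deserializing the engine needs roughly that much free device memory on GPU 0 on top of whatever is already allocated there.

# Back-of-the-envelope on the numbers reported in the log above.
requested_bytes = 46_835_179_520      # failed allocation from the error message
engine_mib = 44_668                   # "Loaded engine size" reported by TensorRT-LLM
gpu_mib = 80 * 1024                   # one A100 80GB in this 2-GPU VM (assumed)

print(f"requested : {requested_bytes / 1024**3:6.2f} GiB")   # ~43.62 GiB
print(f"engine    : {engine_mib / 1024:6.2f} GiB")           # ~43.62 GiB
print(f"per GPU   : {gpu_mib / 1024:6.2f} GiB")              # 80.00 GiB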

Additional notes

I followed the process documented here (using v0.8.0 of TRT-LLM) for the --tp_size=2 case: https://github.com/NVIDIA/TensorRT-LLM/blob/v0.8.0/examples/mixtral/README.md

Metadata

Labels: bug (Something isn't working), triaged (Issue has been triaged by maintainers)
