Tritonserver won't start up running Smaug 34b #459

Description

System Info

CPU architecture: AMD EPYC 7V13 64-Core Processor
CPU/host memory size: 440 GB
GPU name: NVIDIA A100 80GB x2
GPU memory size: 80 GB x 2
Libraries
TensorRT-LLM branch or tag: main
TensorRT-LLM commit: ae52bce
Versions of TensorRT, CUDA: TensorRT 10.0.1, CUDA 12.4
container used: Built container from tensorrtllm_backend main branch using dockerfile/Dockerfile.trt_llm_backend
nvidia driver version: 535.161.07
OS: Ubuntu 22.04.4 LTS
docker image version: custom built from main branch

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. I built the TRT-LLM container by running:
DOCKER_BUILDKIT=1 TORCH_CUDA_ARCH_LIST= docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .
  2. I launched the container with this command:
sudo docker run -it --net host --shm-size=20g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /home/keith:/home triton_trt_llm:latest /bin/bash
  3. I built the Smaug checkpoint and TRT engine following the guide here, except that I used tp_size 2 instead of tp_size 8 since I'm running with 2 GPUs (a sketch of these commands follows this list).
  4. I launched Triton server with:
python3 scripts/launch_triton_server.py --world_size=2 --model_repo=/home/smaug_triton_model_repo --tensorrt_llm_model_name=smaug34b --log
  5. I then tried running the sample script from the above link with:
mpirun -n 2 -allow-run-as-root python examples/summarize.py \
    --hf_model_dir /home/Smaug-34B-v0.1 \
    --engine_dir /home/smaug_triton_model_repo/smaug34b/1/ \
    --data_type fp16 \
    --test_hf \
    --hf_device_map_auto \
    --test_trt_llm
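
For reference, step 3 amounted to roughly the following. This is a sketch based on TensorRT-LLM's LLaMA example (Smaug-34B is a LLaMA-architecture model); the paths are from my setup, the intermediate checkpoint directory name is illustrative, and I've omitted the remaining build flags:

# convert the HF checkpoint into a tp_size=2 TensorRT-LLM checkpoint
python3 examples/llama/convert_checkpoint.py \
    --model_dir /home/Smaug-34B-v0.1 \
    --output_dir /home/smaug_ckpt_tp2 \
    --dtype float16 \
    --tp_size 2

# build the tensor-parallel engines into the Triton model repo
trtllm-build \
    --checkpoint_dir /home/smaug_ckpt_tp2 \
    --output_dir /home/smaug_triton_model_repo/smaug34b/1 \
    --gemm_plugin float16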

Expected behavior

I expect tritonserver to start up successfully.

Actual behavior

tritonserver stops printing logs and the processes just run indefinitely. I'm unable to reach tritonserver via gRPC or HTTP, so I don't believe it is actually serving; a readiness check is shown after the log below. The output I get is:

root@keith-a100-dev4:/home/tensorrtllm_backend# I0515 21:22:16.209472 7394 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x75e75e000000' with size 268435456
I0515 21:22:16.214944 7394 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0515 21:22:16.214957 7394 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I0515 21:22:16.497971 7394 model_lifecycle.cc:469] loading: smaug34b:1
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] kv_cache_host_memory_bytes not set, defaulting to 0
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead
[TensorRT-LLM][INFO] Engine version 0.11.0.dev2024051400 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'cross_attention' not found
[TensorRT-LLM][WARNING] Optional value for parameter cross_attention will not be set.
[TensorRT-LLM][WARNING] Parameter layer_types cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'layer_types' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] kv_cache_host_memory_bytes not set, defaulting to 0
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead
[TensorRT-LLM][INFO] Engine version 0.11.0.dev2024051400 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'cross_attention' not found
[TensorRT-LLM][WARNING] Optional value for parameter cross_attention will not be set.
[TensorRT-LLM][WARNING] Parameter layer_types cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'layer_types' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[TensorRT-LLM][INFO] MPI size: 2, rank: 1
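
Nothing further is printed after this point. Polling Triton's readiness endpoint (assuming the default HTTP port 8000, which I did not override) never succeeds:

# a healthy server returns HTTP 200; here the connection is never ready
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready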

When running the summarize.py script, I get the following output:

[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024051400
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024051400
[05/15/2024-21:14:21] [TRT-LLM] [I] Load tokenizer takes: 0.08764123916625977 sec
[05/15/2024-21:14:21] [TRT-LLM] [I] Load tokenizer takes: 0.08883118629455566 sec
[TensorRT-LLM][INFO] Engine version 0.11.0.dev2024051400 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'cross_attention' not found
[TensorRT-LLM][WARNING] Optional value for parameter cross_attention will not be set.
[TensorRT-LLM][WARNING] Parameter layer_types cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'layer_types' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] Engine version 0.11.0.dev2024051400 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'cross_attention' not found
[TensorRT-LLM][WARNING] Optional value for parameter cross_attention will not be set.
[TensorRT-LLM][WARNING] Parameter layer_types cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'layer_types' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[TensorRT-LLM][INFO] Engine version 0.11.0.dev2024051400 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'cross_attention' not found
[TensorRT-LLM][WARNING] Optional value for parameter cross_attention will not be set.
[TensorRT-LLM][WARNING] Parameter layer_types cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'layer_types' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] Engine version 0.11.0.dev2024051400 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'cross_attention' not found
[TensorRT-LLM][WARNING] Optional value for parameter cross_attention will not be set.
[TensorRT-LLM][WARNING] Parameter layer_types cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'layer_types' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] Engine version 0.11.0.dev2024051400 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'cross_attention' not found
[TensorRT-LLM][WARNING] Optional value for parameter cross_attention will not be set.
[TensorRT-LLM][WARNING] Parameter layer_types cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'layer_types' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 2, rank: 1
[TensorRT-LLM][INFO] Engine version 0.11.0.dev2024051400 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'cross_attention' not found
[TensorRT-LLM][WARNING] Optional value for parameter cross_attention will not be set.
[TensorRT-LLM][WARNING] Parameter layer_types cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'layer_types' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 2, rank: 1
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 32
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 32
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2048
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 33338 MiB
[TensorRT-LLM][INFO] Rank 1 is using GPU 1
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 32
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 32
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2048
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 33338 MiB
keith-a100-dev4:7179:7179 [0] NCCL INFO Bootstrap : Using eth0:10.5.0.15<0>
keith-a100-dev4:7179:7179 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.19.3+cuda12.3
keith-a100-dev4:7179:7179 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
keith-a100-dev4:7179:7179 [0] NCCL INFO P2P plugin IBext_v7
keith-a100-dev4:7179:7179 [0] NCCL INFO NET/IB : No device found.
keith-a100-dev4:7179:7179 [0] NCCL INFO NET/IB : No device found.
keith-a100-dev4:7179:7179 [0] NCCL INFO NET/Socket : Using [0]eth0:10.5.0.15<0> [1]enP4801s1:fe80::7e1e:52ff:fe22:721d%enP4801s1<0>
keith-a100-dev4:7179:7179 [0] NCCL INFO Using non-device net plugin version 0
keith-a100-dev4:7179:7179 [0] NCCL INFO Using network Socket
keith-a100-dev4:7180:7180 [1] NCCL INFO cudaDriverVersion 12040
keith-a100-dev4:7180:7180 [1] NCCL INFO Bootstrap : Using eth0:10.5.0.15<0>
keith-a100-dev4:7180:7180 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
keith-a100-dev4:7180:7180 [1] NCCL INFO P2P plugin IBext_v7
keith-a100-dev4:7180:7180 [1] NCCL INFO NET/IB : No device found.
keith-a100-dev4:7180:7180 [1] NCCL INFO NET/IB : No device found.
keith-a100-dev4:7180:7180 [1] NCCL INFO NET/Socket : Using [0]eth0:10.5.0.15<0> [1]enP4801s1:fe80::7e1e:52ff:fe22:721d%enP4801s1<0>
keith-a100-dev4:7180:7180 [1] NCCL INFO Using non-device net plugin version 0
keith-a100-dev4:7180:7180 [1] NCCL INFO Using network Socket
keith-a100-dev4:7180:7180 [1] NCCL INFO comm 0x646782dbed30 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 200000 commId 0xe418f9626149e33e - Init START
keith-a100-dev4:7179:7179 [0] NCCL INFO comm 0x559af8995ca0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 100000 commId 0xe418f9626149e33e - Init START

keith-a100-dev4:7180:7180 [1] graph/xml.h:85 NCCL WARN Attribute busid of node nic not found
keith-a100-dev4:7180:7180 [1] NCCL INFO graph/xml.cc:589 -> 3
keith-a100-dev4:7180:7180 [1] NCCL INFO graph/xml.cc:806 -> 3
keith-a100-dev4:7180:7180 [1] NCCL INFO graph/topo.cc:689 -> 3
keith-a100-dev4:7180:7180 [1] NCCL INFO init.cc:881 -> 3
keith-a100-dev4:7180:7180 [1] NCCL INFO init.cc:1396 -> 3
keith-a100-dev4:7180:7180 [1] NCCL INFO init.cc:1641 -> 3

keith-a100-dev4:7179:7179 [0] graph/xml.h:85 NCCL WARN Attribute busid of node nic not found
keith-a100-dev4:7179:7179 [0] NCCL INFO graph/xml.cc:589 -> 3
keith-a100-dev4:7179:7179 [0] NCCL INFO graph/xml.cc:806 -> 3
keith-a100-dev4:7179:7179 [0] NCCL INFO graph/topo.cc:689 -> 3
keith-a100-dev4:7179:7179 [0] NCCL INFO init.cc:881 -> 3
keith-a100-dev4:7179:7179 [0] NCCL INFO init.cc:1396 -> 3
keith-a100-dev4:7179:7179 [0] NCCL INFO init.cc:1641 -> 3
keith-a100-dev4:7180:7180 [1] NCCL INFO init.cc:1679 -> 3
Failed, NCCL error /app/tensorrt_llm/cpp/tensorrt_llm/plugins/common/plugin.cpp:86 'internal error - please report this issue to the NCCL developers'
keith-a100-dev4:7179:7179 [0] NCCL INFO init.cc:1679 -> 3
Failed, NCCL error /app/tensorrt_llm/cpp/tensorrt_llm/plugins/common/plugin.cpp:86 'internal error - please report this issue to the NCCL developers'
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[12525,1],0]
  Exit code:    1
--------------------------------------------------------------------------

Additional notes

I tried the same with Gemma 7B (built with world_size=2) and the same thing happens, so I wonder whether it's related to running tritonserver with 2 A100s on a single machine (an NCCL-only sanity check is sketched below).
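
The NCCL warning right before the failure ("Attribute busid of node nic not found") suggests NCCL's topology detection is choking on the NIC, so it may help to reproduce this with NCCL alone, independent of TensorRT-LLM. A minimal sanity check, assuming nccl-tests is cloned and built inside the container (it is not included by default, and MPI_HOME may differ from the HPC-X path shown here):

# build the NCCL tests against the container's MPI
git clone https://github.com/NVIDIA/nccl-tests.git && cd nccl-tests
make MPI=1 MPI_HOME=/opt/hpcx/ompi

# run an all-reduce across both GPUs; if this hangs or dies with the same
# topology error, the problem is in NCCL on this VM, not in TensorRT-LLM
mpirun -n 2 --allow-run-as-root ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1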

Labels

bug (Something isn't working), triaged (Issue has been triaged by maintainers)
