[Kernel] Add ExLlamaV2 Weight Quantization Support #11348

Draft
AlpinDale wants to merge 6 commits into base: main

Conversation

AlpinDale (Contributor) commented Dec 19, 2024

Work in Progress.

FIX #3203
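
For reference, a minimal offline-inference sketch of how an exl2 checkpoint could be loaded once this support lands, based on the --quantization=exl2 and --dtype=float16 flags exercised in the comments below; this is a hedged sketch, not a documented API of this branch:

# Hedged sketch: assumes this PR's "exl2" quantization method is registered in vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="cgus/Qwen2.5-0.5B-Instruct-exl2",  # exl2 repo also used in the review below
    quantization="exl2",                      # quantization method added by this PR
    dtype="float16",
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)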

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

mgoin (Member) left a comment

Tried loading a Qwen exl2 model and ran into issues with uninitialized parameter usage:

vllm serve cgus/Qwen2.5-0.5B-Instruct-exl2 --port 9000 --dtype float16
...
Traceback (most recent call last):
  File "/home/mgoin/.local/share/uv/python/cpython-3.12.4-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/mgoin/.local/share/uv/python/cpython-3.12.4-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/mgoin/code/alpin-vllm/vllm/engine/multiprocessing/engine.py", line 368, in run_mp_engine
    raise e
  File "/home/mgoin/code/alpin-vllm/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args
    return cls(ipc_path=ipc_path,
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/engine/multiprocessing/engine.py", line 71, in __init__
    self.engine = LLMEngine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/engine/llm_engine.py", line 273, in __init__
    self.model_executor = executor_class(vllm_config=vllm_config, )
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/executor/executor_base.py", line 36, in __init__
    self._init_executor()
  File "/home/mgoin/code/alpin-vllm/vllm/executor/gpu_executor.py", line 35, in _init_executor
    self.driver_worker.load_model()
  File "/home/mgoin/code/alpin-vllm/vllm/worker/worker.py", line 155, in load_model
    self.model_runner.load_model()
  File "/home/mgoin/code/alpin-vllm/vllm/worker/model_runner.py", line 1094, in load_model
    self.model = get_model(vllm_config=self.vllm_config)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/model_executor/model_loader/__init__.py", line 12, in get_model
    return loader.load_model(vllm_config=vllm_config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/model_executor/model_loader/loader.py", line 364, in load_model
    loaded_weights = model.load_weights(
                     ^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/model_executor/models/qwen2.py", line 506, in load_weights
    return loader.load_weights(weights)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/model_executor/models/utils.py", line 237, in load_weights
    autoloaded_weights = set(self._load_module("", self.module, weights))
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/model_executor/models/utils.py", line 198, in _load_module
    yield from self._load_module(prefix,
  File "/home/mgoin/code/alpin-vllm/vllm/model_executor/models/utils.py", line 175, in _load_module
    loaded_params = module_load_weights(weights)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/alpin-vllm/vllm/model_executor/models/qwen2.py", line 396, in load_weights
    weight_loader(param, loaded_weight)
  File "/home/mgoin/code/alpin-vllm/vllm/model_executor/model_loader/weight_utils.py", line 531, in default_weight_loader
    if param.numel() == 1 and loaded_weight.numel() == 1:
       ^^^^^^^^^^^^^
  File "/home/mgoin/venvs/alpin/lib/python3.12/site-packages/torch/nn/parameter.py", line 168, in __torch_function__
    raise ValueError(
ValueError: Attempted to use an uninitialized parameter in <method 'numel' of 'torch._C.TensorBase' objects>. This error happens when you are using a `LazyModule` or explicitly manipulating `torch.nn.parameter.UninitializedParameter` objects. When using LazyModules Call `forward` with a dummy batch to initialize the parameters before calling torch functions
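
The traceback bottoms out in default_weight_loader calling param.numel() on a parameter that is still a torch.nn.parameter.UninitializedParameter, presumably because exl2 weight shapes are only known at load time. A minimal sketch of the kind of guard a weight loader could use, assuming materializing to the incoming shape is acceptable (this is not the fix taken in this PR):

import torch
from torch.nn.parameter import UninitializedParameter

def exl2_safe_weight_loader(param: torch.nn.Parameter,
                            loaded_weight: torch.Tensor) -> None:
    """Copy loaded_weight into param, materializing lazy parameters first."""
    if isinstance(param, UninitializedParameter):
        # numel()/size() on an UninitializedParameter raise ValueError, so
        # materialize it to the incoming shape before any shape checks.
        param.materialize(loaded_weight.shape, dtype=loaded_weight.dtype)

    # Scalar fast path, mirroring vLLM's default_weight_loader.
    if param.numel() == 1 and loaded_weight.numel() == 1:
        param.data.fill_(loaded_weight.item())
        return

    assert param.size() == loaded_weight.size(), (
        f"shape mismatch: {tuple(param.size())} vs {tuple(loaded_weight.size())}")
    param.data.copy_(loaded_weight)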

@alllexx88

Thank you for this WIP PR! A quick question: I just built vLLM HEAD with this PR applied, and I can't get it to work. I'd just like to know whether I built it correctly and this is expected, or whether I messed something up 😅 It downloads the weights fine, but then I get this error:

$ CUDA_DEVICE_ORDER=PCI_BUS_ID python -m vllm.entrypoints.openai.api_server --port=5003 --host=0.0.0.0 --model LoneStriker/Llama-3.3-70B-Instruct-3.5bpw-h6-exl2 --quantization=exl2 --dtype=float16 --tensor-parallel-size=2 --pipeline-parallel-size=3 --max-model-len=32768
INFO 01-15 13:09:38 __init__.py:179] Automatically detected platform cuda.                                                                                                                                                            
INFO 01-15 13:09:40 api_server.py:768] vLLM API server version 0.6.6.post2.dev226+gad388d25.d20250115                                                                                                                                 
INFO 01-15 13:09:40 api_server.py:769] args: Namespace(host='0.0.0.0', port=5003, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='LoneStriker/Llama-3.3-70B-Instruct-3.5bpw-h6-exl2', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='float16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=32768, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=3, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization='exl2', rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
WARNING 01-15 13:09:41 config.py:2302] Casting torch.bfloat16 to torch.float16.                                                                                                                                                       
INFO 01-15 13:09:51 config.py:516] This model supports multiple tasks: {'score', 'embed', 'reward', 'classify', 'generate'}. Defaulting to 'generate'.                                                                                
WARNING 01-15 13:09:52 config.py:595] exl2 quantization is not fully optimized yet. The speed can be slower than non-quantized models.                                                                                                
INFO 01-15 13:09:52 config.py:1318] Defaulting to use mp for distributed inference                                                                                                                                                    
WARNING 01-15 13:09:52 config.py:641] Async output processing can not be enabled with pipeline parallel                                                                                                                               
INFO 01-15 13:09:52 llm_engine.py:232] Initializing an LLM engine (v0.6.6.post2.dev226+gad388d25.d20250115) with config: model='LoneStriker/Llama-3.3-70B-Instruct-3.5bpw-h6-exl2', speculative_config=None, tokenizer='LoneStriker/Llama-3.3-70B-Instruct-3.5bpw-h6-exl2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=3, disable_custom_all_reduce=False, quantization=exl2, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=LoneStriker/Llama-3.3-70B-Instruct-3.5bpw-h6-exl2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
WARNING 01-15 13:09:54 multiproc_worker_utils.py:298] Reducing Torch parallelism from 60 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.              
INFO 01-15 13:09:54 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager                                                                                            
(VllmWorkerProcess pid=3222323) INFO 01-15 13:09:54 multiproc_worker_utils.py:226] Worker ready; awaiting tasks                                                                                                                       
(VllmWorkerProcess pid=3222324) INFO 01-15 13:09:54 multiproc_worker_utils.py:226] Worker ready; awaiting tasks                                                                                                                       
(VllmWorkerProcess pid=3222325) INFO 01-15 13:09:54 multiproc_worker_utils.py:226] Worker ready; awaiting tasks                                                                                                                       
(VllmWorkerProcess pid=3222326) INFO 01-15 13:09:54 multiproc_worker_utils.py:226] Worker ready; awaiting tasks                                                                                                                       
(VllmWorkerProcess pid=3222327) INFO 01-15 13:09:54 multiproc_worker_utils.py:226] Worker ready; awaiting tasks                                                                                                                       
INFO 01-15 13:09:54 cuda.py:247] Using Flash Attention backend.                                                                                                                                                                       
(VllmWorkerProcess pid=3222324) INFO 01-15 13:09:54 cuda.py:247] Using Flash Attention backend.                                                                                                                                       
(VllmWorkerProcess pid=3222323) INFO 01-15 13:09:54 cuda.py:247] Using Flash Attention backend.                                                                                                                                       
(VllmWorkerProcess pid=3222325) INFO 01-15 13:09:54 cuda.py:247] Using Flash Attention backend.                                                                                                                                       
(VllmWorkerProcess pid=3222326) INFO 01-15 13:09:54 cuda.py:247] Using Flash Attention backend.                                                                                                                                       
(VllmWorkerProcess pid=3222327) INFO 01-15 13:09:54 cuda.py:247] Using Flash Attention backend.                                                                                                                                       
INFO 01-15 13:09:57 utils.py:937] Found nccl from library libnccl.so.2                                                                                                                                                                
(VllmWorkerProcess pid=3222323) INFO 01-15 13:09:57 utils.py:937] Found nccl from library libnccl.so.2                                                                                                                                
INFO 01-15 13:09:57 pynccl.py:67] vLLM is using nccl==2.21.5                                                                                                                                                                          
(VllmWorkerProcess pid=3222323) INFO 01-15 13:09:57 pynccl.py:67] vLLM is using nccl==2.21.5                                                                                                                                          
(VllmWorkerProcess pid=3222325) INFO 01-15 13:09:57 utils.py:937] Found nccl from library libnccl.so.2                                                                                                                                
(VllmWorkerProcess pid=3222324) INFO 01-15 13:09:57 utils.py:937] Found nccl from library libnccl.so.2                                                                                                                                
(VllmWorkerProcess pid=3222325) INFO 01-15 13:09:57 pynccl.py:67] vLLM is using nccl==2.21.5                                                                                                                                          
(VllmWorkerProcess pid=3222324) INFO 01-15 13:09:57 pynccl.py:67] vLLM is using nccl==2.21.5                                                                                                                                          
(VllmWorkerProcess pid=3222326) INFO 01-15 13:09:57 utils.py:937] Found nccl from library libnccl.so.2                                                                                                                                
(VllmWorkerProcess pid=3222326) INFO 01-15 13:09:57 pynccl.py:67] vLLM is using nccl==2.21.5                                                                                                                                          
(VllmWorkerProcess pid=3222327) INFO 01-15 13:09:57 utils.py:937] Found nccl from library libnccl.so.2                                                                                                                                
(VllmWorkerProcess pid=3222327) INFO 01-15 13:09:57 pynccl.py:67] vLLM is using nccl==2.21.5                                                                                                                                          
(VllmWorkerProcess pid=3222324) INFO 01-15 13:09:58 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /home/mlguru/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5.json                                              
(VllmWorkerProcess pid=3222326) INFO 01-15 13:09:58 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /home/mlguru/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5.json 
INFO 01-15 13:09:58 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /home/mlguru/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5.json                                                                              
(VllmWorkerProcess pid=3222325) INFO 01-15 13:09:58 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /home/mlguru/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5.json                                              
(VllmWorkerProcess pid=3222327) INFO 01-15 13:09:58 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /home/mlguru/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5.json                                              
(VllmWorkerProcess pid=3222323) INFO 01-15 13:09:58 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /home/mlguru/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5.json                                              
(VllmWorkerProcess pid=3222324) WARNING 01-15 13:09:58 custom_all_reduce.py:143] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=3222326) WARNING 01-15 13:09:58 custom_all_reduce.py:143] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 01-15 13:09:58 custom_all_reduce.py:143] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=3222325) WARNING 01-15 13:09:58 custom_all_reduce.py:143] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=3222323) WARNING 01-15 13:09:58 custom_all_reduce.py:143] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=3222327) WARNING 01-15 13:09:58 custom_all_reduce.py:143] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 01-15 13:09:58 shm_broadcast.py:256] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_b25f3ea1'), local_subscribe_port=41215, remote_subscribe_port=None)
(VllmWorkerProcess pid=3222324) INFO 01-15 13:09:58 shm_broadcast.py:256] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_33fa8fd7'), local_subscribe_port=59655, remote_subscribe_port=None)
(VllmWorkerProcess pid=3222326) INFO 01-15 13:09:58 shm_broadcast.py:256] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_7e28ad79'), local_subscribe_port=35951, remote_subscribe_port=None)
INFO 01-15 13:09:58 utils.py:937] Found nccl from library libnccl.so.2                                                                                                                                                                
INFO 01-15 13:09:58 pynccl.py:67] vLLM is using nccl==2.21.5                                                                                                                                                                          
(VllmWorkerProcess pid=3222325) INFO 01-15 13:09:58 utils.py:937] Found nccl from library libnccl.so.2                                                                                                                                
(VllmWorkerProcess pid=3222325) INFO 01-15 13:09:58 pynccl.py:67] vLLM is using nccl==2.21.5                                                                                                                                          
(VllmWorkerProcess pid=3222324) INFO 01-15 13:09:58 utils.py:937] Found nccl from library libnccl.so.2                                                                                                                                
(VllmWorkerProcess pid=3222324) INFO 01-15 13:09:58 pynccl.py:67] vLLM is using nccl==2.21.5                                                                                                                                          
(VllmWorkerProcess pid=3222326) INFO 01-15 13:09:58 utils.py:937] Found nccl from library libnccl.so.2                                                                                                                                
(VllmWorkerProcess pid=3222326) INFO 01-15 13:09:58 pynccl.py:67] vLLM is using nccl==2.21.5                                                                                                                                          
(VllmWorkerProcess pid=3222323) INFO 01-15 13:09:58 utils.py:937] Found nccl from library libnccl.so.2                                                                                                                                
(VllmWorkerProcess pid=3222327) INFO 01-15 13:09:58 utils.py:937] Found nccl from library libnccl.so.2                                                                                                                                
(VllmWorkerProcess pid=3222323) INFO 01-15 13:09:58 pynccl.py:67] vLLM is using nccl==2.21.5                                                                                                                                          
(VllmWorkerProcess pid=3222327) INFO 01-15 13:09:58 pynccl.py:67] vLLM is using nccl==2.21.5                                                                                                                                          
(VllmWorkerProcess pid=3222325) INFO 01-15 13:09:58 model_runner.py:1097] Starting to load model LoneStriker/Llama-3.3-70B-Instruct-3.5bpw-h6-exl2...                                                                                 
INFO 01-15 13:09:58 model_runner.py:1097] Starting to load model LoneStriker/Llama-3.3-70B-Instruct-3.5bpw-h6-exl2...                                                                                                                 
(VllmWorkerProcess pid=3222326) INFO 01-15 13:09:58 model_runner.py:1097] Starting to load model LoneStriker/Llama-3.3-70B-Instruct-3.5bpw-h6-exl2...                                                                                 
(VllmWorkerProcess pid=3222323) INFO 01-15 13:09:58 model_runner.py:1097] Starting to load model LoneStriker/Llama-3.3-70B-Instruct-3.5bpw-h6-exl2...                                                                                 
(VllmWorkerProcess pid=3222324) INFO 01-15 13:09:58 model_runner.py:1097] Starting to load model LoneStriker/Llama-3.3-70B-Instruct-3.5bpw-h6-exl2...                                                                                 
(VllmWorkerProcess pid=3222327) INFO 01-15 13:09:58 model_runner.py:1097] Starting to load model LoneStriker/Llama-3.3-70B-Instruct-3.5bpw-h6-exl2...                                                                                 
INFO 01-15 13:09:59 weight_utils.py:253] Using model weights format ['*.safetensors']                                                                                                                                                 
(VllmWorkerProcess pid=3222324) INFO 01-15 13:09:59 weight_utils.py:253] Using model weights format ['*.safetensors']                                                                                                                 
(VllmWorkerProcess pid=3222326) INFO 01-15 13:09:59 weight_utils.py:253] Using model weights format ['*.safetensors']                                                                                                                 
(VllmWorkerProcess pid=3222327) INFO 01-15 13:09:59 weight_utils.py:253] Using model weights format ['*.safetensors']                                                                                                                 
(VllmWorkerProcess pid=3222323) INFO 01-15 13:09:59 weight_utils.py:253] Using model weights format ['*.safetensors']                                                                                                                 
(VllmWorkerProcess pid=3222325) INFO 01-15 13:09:59 weight_utils.py:253] Using model weights format ['*.safetensors']
[rank0]: Traceback (most recent call last):                                                                                                                                                                                           
[rank0]:   File "/opt/vllm/.pixi/envs/default/lib/python3.10/runpy.py", line 196, in _run_module_as_main                                                                                                                              
[rank0]:     return _run_code(code, main_globals, None,                                                                                                                                                                               
[rank0]:   File "/opt/vllm/.pixi/envs/default/lib/python3.10/runpy.py", line 86, in _run_code                                                                                                                                         
[rank0]:     exec(code, run_globals)                                                                                                                                                                                                  
[rank0]:   File "/opt/vllm/.pixi/envs/default/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 832, in <module>                                                                                              
[rank0]:     uvloop.run(run_server(args))                                                                                                                                                                                             
[rank0]:   File "/opt/vllm/.pixi/envs/default/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run                                                                                                                       
[rank0]:     return loop.run_until_complete(wrapper())
[rank0]:   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete                                                                                                                                                  
[rank0]:   File "/opt/vllm/.pixi/envs/default/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper                                                                                                                   
[rank0]:     return await main                                                                                                                                                                                                        
[rank0]:   File "/opt/vllm/.pixi/envs/default/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 796, in run_server                                                                                            
[rank0]:     async with build_async_engine_client(args) as engine_client:                                                                                                                                                             
[rank0]:   File "/opt/vllm/.pixi/envs/default/lib/python3.10/contextlib.py", line 199, in __aenter__                                                                                                                                  
[rank0]:     return await anext(self.gen)                                                                                                                                                                                             
[rank0]:   File "/opt/vllm/.pixi/envs/default/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 125, in build_async_engine_client                                                                             
[rank0]:     async with build_async_engine_client_from_engine_args(                                                                                                                                                                   
[rank0]:   File "/opt/vllm/.pixi/envs/default/lib/python3.10/contextlib.py", line 199, in __aenter__                                                                                                                                  
[rank0]:     return await anext(self.gen)                                                                                                                                                                                             
[rank0]:   File "/opt/vllm/.pixi/envs/default/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 149, in build_async_engine_client_from_engine_args                                                            
[rank0]:     engine_client = AsyncLLMEngine.from_engine_args(                                                                                                                                                                         
[rank0]:   File "/opt/vllm/.pixi/envs/default/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 642, in from_engine_args                                                                                            
[rank0]:     engine = cls(                                                                                                                                                                                                            
[rank0]:   File "/opt/vllm/.pixi/envs/default/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 592, in __init__                                                                                                    
[rank0]:     self.engine = self._engine_class(*args, **kwargs)
[rank0]:   File "/opt/vllm/.pixi/envs/default/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 265, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/opt/vllm/.pixi/envs/default/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 271, in __init__
[rank0]:     self.model_executor = executor_class(vllm_config=vllm_config, )
[rank0]:   File "/opt/vllm/.pixi/envs/default/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 222, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/opt/vllm/.pixi/envs/default/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 42, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/opt/vllm/.pixi/envs/default/lib/python3.10/site-packages/vllm/executor/mp_distributed_executor.py", line 87, in _init_executor
[rank0]:     self._run_workers("load_model",
[rank0]:   File "/opt/vllm/.pixi/envs/default/lib/python3.10/site-packages/vllm/executor/mp_distributed_executor.py", line 143, in _run_workers
[rank0]:     driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]:   File "/opt/vllm/.pixi/envs/default/lib/python3.10/site-packages/vllm/worker/worker.py", line 155, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/opt/vllm/.pixi/envs/default/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1099, in load_model
[rank0]:     self.model = get_model(vllm_config=self.vllm_config)
[rank0]:   File "/opt/vllm/.pixi/envs/default/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 12, in get_model
[rank0]:     return loader.load_model(vllm_config=vllm_config)
[rank0]:   File "/opt/vllm/.pixi/envs/default/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 368, in load_model
[rank0]:     loaded_weights = model.load_weights(
[rank0]:   File "/opt/vllm/.pixi/envs/default/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 586, in load_weights
[rank0]:     return loader.load_weights(
[rank0]:   File "/opt/vllm/.pixi/envs/default/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 233, in load_weights
[rank0]:     autoloaded_weights = set(self._load_module("", self.module, weights))
[rank0]:   File "/opt/vllm/.pixi/envs/default/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 185, in _load_module
[rank0]:     for child_prefix, child_weights in self._groupby_prefix(weights):
[rank0]:   File "/opt/vllm/.pixi/envs/default/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 99, in _groupby_prefix
[rank0]:     for prefix, group in itertools.groupby(weights_by_parts,
[rank0]:   File "/opt/vllm/.pixi/envs/default/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 96, in <genexpr>
[rank0]:     weights_by_parts = ((weight_name.split(".", 1), weight_data)
[rank0]:   File "/opt/vllm/.pixi/envs/default/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 586, in <genexpr>
[rank0]:     return loader.load_weights(
[rank0]:   File "/opt/vllm/.pixi/envs/default/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 344, in _get_all_weights
[rank0]:     yield from self._get_weights_iterator(primary_weights)
[rank0]:   File "/opt/vllm/.pixi/envs/default/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 300, in _get_weights_iterator
[rank0]:     hf_folder, hf_weights_files, use_safetensors = self._prepare_weights(
[rank0]:   File "/opt/vllm/.pixi/envs/default/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 291, in _prepare_weights
[rank0]:     raise RuntimeError(
[rank0]: RuntimeError: Cannot find any model weights with `LoneStriker/Llama-3.3-70B-Instruct-3.5bpw-h6-exl2`
(VllmWorkerProcess pid=3222324) INFO 01-15 13:10:00 multiproc_worker_utils.py:251] Worker exiting
/opt/vllm/.pixi/envs/default/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
(VllmWorkerProcess pid=3222327) INFO 01-15 13:10:00 multiproc_worker_utils.py:251] Worker exiting
(VllmWorkerProcess pid=3222326) INFO 01-15 13:10:00 multiproc_worker_utils.py:251] Worker exiting
(VllmWorkerProcess pid=3222323) INFO 01-15 13:10:00 multiproc_worker_utils.py:251] Worker exiting
(VllmWorkerProcess pid=3222325) INFO 01-15 13:10:00 multiproc_worker_utils.py:251] Worker exiting
INFO 01-15 13:10:00 multiproc_worker_utils.py:126] Killing local vLLM worker processes
/opt/vllm/.pixi/envs/default/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
[rank0]:[W115 13:10:01.812740509 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
/opt/vllm/.pixi/envs/default/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Thanks 😃
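
The failure above comes from _prepare_weights raising because no weight files matched the loader's reported '*.safetensors' pattern for this repo. As a quick diagnostic sketch (assuming huggingface_hub is installed; this is not part of vLLM or this PR), one can list the repo's files and check which would match that pattern:

import fnmatch
from huggingface_hub import list_repo_files

# Repo from the command above; pattern is the one vLLM's loader reported using.
repo_id = "LoneStriker/Llama-3.3-70B-Instruct-3.5bpw-h6-exl2"
files = list_repo_files(repo_id)

matches = [f for f in files if fnmatch.fnmatch(f, "*.safetensors")]
print(f"{len(files)} files in repo, {len(matches)} match '*.safetensors':")
for name in matches:
    print("  ", name)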

mergify bot commented Mar 13, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @AlpinDale.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Successfully merging this pull request may close these issues.

ExLlamaV2: exl2 support