Description
Hello ExecuTorch team,
I've been exploring options to quantize and export either Llama3.2-1B or Llama3.2-3B using the ExecuTorch + QNN workflow. I have tried several approaches, but none of them has worked out. My question is: given the current state of the ExecuTorch QNN backend, is it possible to quantize a Llama3.2 1B/3B model so that it responds with correct answers? My application requires running the model on the NPU, so running on the CPU is not an option. Could you also take a look at what I've tried and let me know if you have any suggestions I can follow?
Thanks in advance.
This is what I have tried:
- Quantizing and exporting the model with ExecuTorch + QNN (`examples/qualcomm/oss_scripts/llama/llama.py --ptq 16a4w`) produces a model that returns invalid responses (symbols or non-words) during inference. Similar issues appear to have been raised already.
- Exporting a pre-quantized model such as `Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8`. For this I get the error "Only Tensors of floating point and complex dtype can require gradients". Can you confirm that this feature is not supported, as suggested by this ticket?

  ```
  python -m examples.models.llama.export_llama -t /workplace/models/Llama3.1-8B-Instruct/tokenizer.model -p /workplace/models/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8/params.json -c /workplace/models/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8/consolidated.00.pth --use_kv_cache --qnn --soc_model SM8550
  ```
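  For context, that message is PyTorch's standard error whenever `requires_grad` is requested on a non-floating-point tensor, which is presumably what happens when the export path touches the pre-quantized integer weights. A minimal sketch reproducing the underlying PyTorch behavior (not the export itself):

  ```python
  import torch

  # Autograd is only supported for floating-point and complex tensors;
  # requesting requires_grad on an integer tensor raises the same
  # RuntimeError seen during export of the INT4 checkpoint.
  try:
      torch.ones(4, dtype=torch.int32, requires_grad=True)
      message = None
  except RuntimeError as e:
      message = str(e)

  print(message)
  ```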
- I noticed there are no pre-generated SpinQuant rotation matrices for smaller models like 1B/3B, so I tried generating one myself following this guide. I was able to generate a matrix file (R.bin); however, when running the exported model on the target device I get the following failures, which I'm still investigating:

  ```
  python -m examples.models.llama.export_llama -t /workplace/models/Llama-3.2-1B/original/tokenizer.model -p /workplace/models/Llama-3.2-1B/original/params.json -c /workplace/models/Llama-3.2-1B/original/consolidated.00.pth --use_kv_cache --qnn --pt2e_quantize qnn_16a4w --disable_dynamic_shape --num_sharding 8 --calibration_tasks wikitext --calibration_limit 1 --calibration_seq_length 128 --soc_model SM8550 --optimized_rotation_path /workplace/models/matrix_llama3.2-1b_w16-a4/R.bin --calibration_data "<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
  ```
  ```
  [ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to create transport for device, error: 4000
  [ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to load skel, error: 4000
  [ERROR] [Qnn ExecuTorch]: QnnDsp <E> Transport layer setup failed: 14001
  [ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to parse default platform info: 14001
  [ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to load default platform info: 14001
  [ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to parse platform config: 14001
  [ERROR] [Qnn ExecuTorch]: Failed to create device_handle for Backend ID 6, error=14001
  E 00:00:01.580015 executorch:QnnManager.cpp:305] Fail to configure Qnn device
  E 00:00:01.580019 executorch:QnnExecuTorchBackend.cpp:95] Fail to initialize Qnn Manager
  [WARNING] [Qnn ExecuTorch]: QnnDsp <W> Backend 1 free cleanup called during process exit
  ```