
Quantization of Llama3.2-1b/3b with QNN backend #10993

Open

@moxprox

Description

Hello ExecuTorch team,

I've been exploring options for quantizing and exporting either Llama3.2-1B or 3B with the ExecuTorch + QNN workflow. I have tried several approaches, but none of them has worked. My question is: given the current state of the ExecuTorch QNN backend, is it possible to quantize a Llama3.2 1B/3B model so that it responds with correct answers? My application requires the model to run on the NPU, so running on the CPU is not an option. Could you also take a look at what I've tried and let me know if you have any suggestions I could follow?

Thanks in advance.

This is what I have tried:

  1. Quantizing and exporting the model with ExecuTorch + QNN (examples/qualcomm/oss_scripts/llama/llama.py --ptq 16a4w) produces a model that returns invalid responses (symbols or non-words) during inference. Similar issues seem to have been raised already. (A toy sketch of the weight error that 4-bit quantization introduces follows this item.)
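For intuition on why 16a4w (16-bit activations, 4-bit weights) can degrade a small model this badly, here is a toy sketch of my own (not the actual quantizer code) measuring the error of symmetric per-channel 4-bit weight quantization; outlier-heavy weight channels make this error much worse, which is exactly what the SpinQuant rotations in attempt 3 are meant to mitigate:

# Toy illustration, not executorch code: symmetric per-channel int4
# weight quantization, to show the error budget 16a4w works with.
import torch

torch.manual_seed(0)
w = torch.randn(256, 256)                      # stand-in for a linear weight
scale = w.abs().amax(dim=1, keepdim=True) / 7  # int4 symmetric range [-8, 7]
q = torch.clamp(torch.round(w / scale), -8, 7)
w_hat = q * scale
rel_err = ((w - w_hat).norm() / w.norm()).item()
print(f"relative weight error: {rel_err:.3f}")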

  2. Exporting a pre-quantized model such as Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8 fails with the error "Only Tensors of floating point and complex dtype can require gradients" (a minimal repro of the underlying PyTorch behavior follows the command below). Can you confirm that this is not supported, as suggested by this ticket?

python -m examples.models.llama.export_llama \
  -t /workplace/models/Llama3.1-8B-Instruct/tokenizer.model \
  -p /workplace/models/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8/params.json \
  -c /workplace/models/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8/consolidated.00.pth \
  --use_kv_cache --qnn --soc_model SM8550
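As far as I can tell, PyTorch raises that exact error whenever requires_grad is requested on a non-floating-point tensor, which presumably happens when the export path touches the checkpoint's INT4 weights. A minimal standalone repro (my own sketch, not the export code):

# Repro of the generic PyTorch check behind the error above.
import torch

w = torch.zeros(4, 4, dtype=torch.int8)  # stand-in for a pre-quantized weight
try:
    w.requires_grad_(True)
except RuntimeError as e:
    print(e)  # only Tensors of floating point and complex dtype can require gradients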

  3. I noticed there are no SpinQuant rotation matrices published for smaller models like 1B/3B, so I tried generating one following this guide. I was able to generate a matrix file (R.bin); however, when running the model on the target device I get the following failures, which I'm still investigating (a possible device-setup fix is sketched after the log):

python -m examples.models.llama.export_llama \
  -t /workplace/models/Llama-3.2-1B/original/tokenizer.model \
  -p /workplace/models/Llama-3.2-1B/original/params.json \
  -c /workplace/models/Llama-3.2-1B/original/consolidated.00.pth \
  --use_kv_cache --qnn --pt2e_quantize qnn_16a4w --disable_dynamic_shape \
  --num_sharding 8 --calibration_tasks wikitext --calibration_limit 1 \
  --calibration_seq_length 128 --soc_model SM8550 \
  --optimized_rotation_path /workplace/models/matrix_llama3.2-1b_w16-a4/R.bin \
  --calibration_data "<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to create transport for device, error: 4000
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to load skel, error: 4000
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Transport layer setup failed: 14001
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to parse default platform info: 14001
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to load default platform info: 14001
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to parse platform config: 14001
[ERROR] [Qnn ExecuTorch]: Failed to create device_handle for Backend ID 6, error=14001
E 00:00:01.580015 executorch:QnnManager.cpp:305] Fail to configure Qnn device
E 00:00:01.580019 executorch:QnnExecuTorchBackend.cpp:95] Fail to initialize Qnn Manager
[WARNING] [Qnn ExecuTorch]: QnnDsp <W> Backend 1 free cleanup called during process exit
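For what it's worth, the "Failed to load skel, error: 4000" lines look like a device-setup problem rather than a quantization problem: in my understanding the HTP backend cannot find its DSP-side skel library. A sketch of the setup I believe is expected (the Hexagon V73 architecture for SM8550, the on-device directory, and the runner invocation are my assumptions, not verified):

# Assumption: SM8550's HTP is Hexagon V73; adjust the version to your SoC.
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtp.so /data/local/tmp/llama
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtpV73Stub.so /data/local/tmp/llama
adb push ${QNN_SDK_ROOT}/lib/hexagon-v73/unsigned/libQnnHtpV73Skel.so /data/local/tmp/llama
# ADSP_LIBRARY_PATH must point at the directory holding the skel .so
adb shell "cd /data/local/tmp/llama && \
  LD_LIBRARY_PATH=/data/local/tmp/llama ADSP_LIBRARY_PATH=/data/local/tmp/llama \
  ./qnn_llama_runner ..."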

cc @cccclai @cbilgin

Metadata

Labels: module: qnn (Issues related to Qualcomm's QNN delegate and code under backends/qualcomm), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
