Description
Hello ExecuTorch team,
I've been exploring options to quantize and export either Llama3.2-1B or Llama3.2-3B using the ExecuTorch + QNN workflow. I have tried several approaches, but none of them has worked out. My question is: given the current state of the ExecuTorch QNN backend, is it possible to quantize a Llama3.2 1B/3B model so that it responds with correct answers? My application requires running the model on the NPU, so running on the CPU is not an option. Could you also take a look at what I've tried and let me know if you have any suggestions I can follow?
Thanks in advance.
This is what I have tried:
- Quantizing and exporting the model with ExecuTorch + QNN (`examples/qualcomm/oss_scripts/llama/llama.py --ptq 16a4w`) produces a model that returns invalid responses (symbols or non-words) during inference. Similar issues appear to have been raised already.
- Exporting a pre-quantized model such as `Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8`. For this I get the error "Only Tensors of floating point and complex dtype can require gradients". Can you confirm that this feature is not supported, as suggested by this ticket?

  ```
  python -m examples.models.llama.export_llama -t /workplace/models/Llama3.1-8B-Instruct/tokenizer.model -p /workplace/models/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8/params.json -c /workplace/models/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8/consolidated.00.pth --use_kv_cache --qnn --soc_model SM8550
  ```
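  For context, that message is PyTorch's standard error whenever `requires_grad` is requested on a non-floating-point tensor, which is presumably what happens when the export path touches the pre-quantized integer weights. A minimal sketch reproducing the underlying PyTorch behavior (not the export itself):

  ```python
  import torch

  # Autograd is only supported for floating-point and complex tensors;
  # requesting requires_grad on an integer tensor raises the same
  # RuntimeError seen during export of the INT4 checkpoint.
  try:
      torch.ones(4, dtype=torch.int32, requires_grad=True)
      message = None
  except RuntimeError as e:
      message = str(e)

  print(message)
  ```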
- I noticed there are no pre-generated SpinQuant rotation matrices for smaller models like 1B/3B, so I tried generating one myself following this guide. I was able to generate a matrix file (R.bin); however, when running the exported model on the target device I get the following failures, which I'm still investigating:

  ```
  python -m examples.models.llama.export_llama -t /workplace/models/Llama-3.2-1B/original/tokenizer.model -p /workplace/models/Llama-3.2-1B/original/params.json -c /workplace/models/Llama-3.2-1B/original/consolidated.00.pth --use_kv_cache --qnn --pt2e_quantize qnn_16a4w --disable_dynamic_shape --num_sharding 8 --calibration_tasks wikitext --calibration_limit 1 --calibration_seq_length 128 --soc_model SM8550 --optimized_rotation_path /workplace/models/matrix_llama3.2-1b_w16-a4/R.bin --calibration_data "<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
  ```
  ```
  [ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to create transport for device, error: 4000
  [ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to load skel, error: 4000
  [ERROR] [Qnn ExecuTorch]: QnnDsp <E> Transport layer setup failed: 14001
  [ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to parse default platform info: 14001
  [ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to load default platform info: 14001
  [ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to parse platform config: 14001
  [ERROR] [Qnn ExecuTorch]: Failed to create device_handle for Backend ID 6, error=14001
  E 00:00:01.580015 executorch:QnnManager.cpp:305] Fail to configure Qnn device
  E 00:00:01.580019 executorch:QnnExecuTorchBackend.cpp:95] Fail to initialize Qnn Manager
  [WARNING] [Qnn ExecuTorch]: QnnDsp <W> Backend 1 free cleanup called during process exit
  ```