Description
Prerequisites
Please answer the following questions for yourself before submitting an issue.
- I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behavior
I'm attempting to run llama.cpp (latest master) with TheBloke's Falcon 180B Q5/Q6 quantized GGUF models, but it errors out with "invalid character".
I'm unable to find any reports of this issue online.
Another system of mine exhibits the same problem, and a buddy's system does as well.
llama.cpp functions normally on other models, such as Llama2, WizardLM, etc.
The downloaded GGUF file works with "text-generation-webui", so the file itself is functional, and others in the community have verified it as a good copy.
Current Behavior
$ ./main -t 8 -m ../falcon-180b-chat.Q5_K_M.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "USER: Write a story about llamas. ASSISTANT:"
# ( OR any number of parameters, just -m <model> is enough )
...
< Many Tensors >
...
llama_model_loader: - tensor 640: blk.79.attn_norm.weight f32 [ 14848, 1, 1, 1 ]
llama_model_loader: - tensor 641: blk.79.ffn_down.weight q6_K [ 59392, 14848, 1, 1 ]
llama_model_loader: - tensor 642: output_norm.bias f32 [ 14848, 1, 1, 1 ]
llama_model_loader: - tensor 643: output_norm.weight f32 [ 14848, 1, 1, 1 ]
llama_model_loader: - kv 0: general.architecture str
llama_model_loader: - kv 1: general.name str
llama_model_loader: - kv 2: falcon.context_length u32
llama_model_loader: - kv 3: falcon.tensor_data_layout str
llama_model_loader: - kv 4: falcon.embedding_length u32
llama_model_loader: - kv 5: falcon.feed_forward_length u32
llama_model_loader: - kv 6: falcon.block_count u32
llama_model_loader: - kv 7: falcon.attention.head_count u32
llama_model_loader: - kv 8: falcon.attention.head_count_kv u32
llama_model_loader: - kv 9: falcon.attention.layer_norm_epsilon f32
llama_model_loader: - kv 10: general.file_type u32
llama_model_loader: - kv 11: tokenizer.ggml.model str
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr
llama_model_loader: - kv 13: tokenizer.ggml.scores arr
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr
llama_model_loader: - kv 15: tokenizer.ggml.merges arr
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv 17: general.quantization_version u32
llama_model_loader: - type f32: 322 tensors
llama_model_loader: - type q8_0: 1 tensors
llama_model_loader: - type q5_K: 201 tensors
llama_model_loader: - type q6_K: 120 tensors
error loading model: invalid character
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '../falcon-180b-chat.Q5_K_M.gguf'
main: error: unable to load model
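Since the file passed checks in other tools, a quick header inspection can help rule out plain download corruption before suspecting the loader. The sketch below is a hypothetical diagnostic helper, not part of llama.cpp; it only assumes that GGUF files begin with the 4-byte magic b"GGUF" followed by a little-endian uint32 format version.

```python
import struct

def gguf_header(path):
    """Read the GGUF magic and format version from the start of a file.

    A valid GGUF file starts with the 4-byte magic b"GGUF" followed by a
    little-endian uint32 version; anything else suggests corruption or a
    non-GGUF file.
    """
    with open(path, "rb") as f:
        magic = f.read(4)
        (version,) = struct.unpack("<I", f.read(4))
    return magic, version

# Example (path from the failing run above):
# magic, version = gguf_header("../falcon-180b-chat.Q5_K_M.gguf")
# A healthy file should report magic b"GGUF" and a small version number.
```

In my case the header looks sane, which is why I suspect the error happens later, while parsing the tokenizer metadata rather than the file framing.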
Happy to provide the full output, but everything ahead of the loader error showed standard model shapes/sizes.
Environment and Context
Dell R740xd, 640GB RAM, Intel Xeon Silver 4112 (Skylake) CPUs @ 2.60GHz, Ubuntu 20.04 (Focal)
$ git log | head -1
commit 019ba1dcd0c7775a5ac0f7442634a330eb0173cc
$ shasum -a 256 ../falcon-180b-chat.Q5_K_M.gguf
e49e65f34b807d7cdae33d91ce8bd7610f87cd534a2d17ef965c6cf6b03bf3d8 ../falcon-180b-chat.Q5_K_M.gguf
Please let me know if this is already known (I can't seem to find it), and/or if I can help reproduce it somehow. Thanks!