Description
Prerequisites
Observed on llama.cpp build b1557.
Expected Behavior
The model should generate output as normal, constrained by the grammar file. This appears to affect only DeepSeek; Llama variants and Yi run fine.
Current Behavior
terminate called after throwing an instance of 'std::out_of_range'
what(): _Map_base::at
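For context: '_Map_base::at' is the message libstdc++ attaches to the std::out_of_range exception thrown by std::unordered_map::at when the requested key is missing, which suggests an unguarded .at() lookup somewhere in the grammar/token handling path is being fed a key it does not contain. The minimal program below reproduces the exact message; it is a diagnostic illustration only, not llama.cpp code:

#include <iostream>
#include <stdexcept>
#include <unordered_map>

int main() {
    std::unordered_map<int, int> m;  // empty map, so any key is missing
    try {
        m.at(42);                    // .at() on a missing key throws
    } catch (const std::out_of_range &e) {
        std::cout << e.what() << '\n';  // prints "_Map_base::at" under libstdc++
    }
    return 0;
}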
Environment and Context
AMD Ryzen 7 3700X 8-Core Processor
0a:00.0 VGA compatible controller: NVIDIA Corporation GA104 [GeForce RTX 3070] (rev a1)
Linux 6.2.0-36-generic #37~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC
NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2
Failure Information (for bugs)
Steps to Reproduce
The issue reproduces with the following command:
./main -n -1 -c 8192 -ngl 0 --repeat_penalty 1.2 --color -i --mirostat 2 -m ../llama/gguf/deepseek-coder-6.7b-instruct.Q8_0.gguf --grammar-file grammar/any_text.gbnf --prompt Test
any_text.gbnf (matches one or more non-empty lines, each terminated by a newline):
root ::= ([^\n]+ "\n")+
Failure Logs
The error happens immediately:
...
llm_load_print_meta: n_yarn_orig_ctx = 16384
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = mostly Q8_0
llm_load_print_meta: model params = 6.74 B
llm_load_print_meta: model size = 6.67 GiB (8.50 BPW)
llm_load_print_meta: general.name = deepseek-ai_deepseek-coder-6.7b-instruct
llm_load_print_meta: BOS token = 32013 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token = 32021 '<|EOT|>'
llm_load_print_meta: PAD token = 32014 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token = 126 'Ä'
llm_load_tensors: ggml ctx size = 0.11 MiB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 6830.87 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/35 layers to GPU
llm_load_tensors: VRAM used: 0.00 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: freq_base = 100000.0
llama_new_context_with_model: freq_scale = 0.25
llama_new_context_with_model: kv self size = 4096.00 MiB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 555.07 MiB
llama_new_context_with_model: VRAM scratch buffer: 552.00 MiB
llama_new_context_with_model: total VRAM used: 552.00 MiB (model: 0.00 MiB, context: 552.00 MiB)
system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
main: interactive mode on.
sampling:
repeat_last_n = 64, repeat_penalty = 1.200, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 2, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 8192, n_batch = 512, n_predict = -1, n_keep = 0
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
Testterminate called after throwing an instance of 'std::out_of_range'
what(): _Map_base::at
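If the failing lookup can be located, a find()-based check would avoid the abort. The sketch below shows only the defensive pattern; token_to_piece and the token id are hypothetical names for illustration, not the actual llama.cpp structures:

#include <cstdio>
#include <string>
#include <unordered_map>

// Hypothetical token-id -> piece map; the real llama.cpp structure differs.
static const std::unordered_map<int, std::string> token_to_piece = {
    {1, "Test"},
};

int main() {
    const int token = 32013;  // an id absent from the map
    // token_to_piece.at(token) would throw std::out_of_range ("_Map_base::at");
    // checking with find() degrades gracefully instead:
    auto it = token_to_piece.find(token);
    if (it == token_to_piece.end()) {
        std::fprintf(stderr, "unknown token id %d\n", token);
    } else {
        std::printf("piece: %s\n", it->second.c_str());
    }
    return 0;
}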