
terminate called when running deepseek models with gbnf grammars #4206

Closed
@54rt1n

Description


Prerequisites

Running llama.cpp build b1557.

Expected Behavior

The model should generate output as normal, constrained by the grammar file. This appears to impact only deepseek; llama variants and yi run fine.

Current Behavior

terminate called after throwing an instance of 'std::out_of_range'
what(): _Map_base::at
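
For context (my own reading, not from the logs above): that exact what() string, _Map_base::at, is what libstdc++ produces when std::unordered_map::at is called with a key that is not in the map, so this looks like an unchecked lookup hitting a missing key somewhere while the grammar is active. A minimal standalone sketch, unrelated to llama.cpp itself, that reproduces the same message:

#include <iostream>
#include <stdexcept>
#include <string>
#include <unordered_map>

int main() {
    // A lookup table keyed by a string, standing in for the token/codepoint
    // tables used in llama.cpp's vocab and grammar handling.
    std::unordered_map<std::string, int> table = {{"known", 1}};

    try {
        // .at() on a key that was never inserted throws std::out_of_range,
        // and libstdc++ reports it as "_Map_base::at" -- the same what()
        // string shown in the crash above.
        int v = table.at("missing");
        (void) v;
    } catch (const std::out_of_range & e) {
        std::cerr << "what(): " << e.what() << "\n";
    }

    // Defensive alternative: probe with find() instead of at().
    if (table.find("missing") == table.end()) {
        std::cerr << "key not present, no exception\n";
    }
    return 0;
}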

Environment and Context

AMD Ryzen 7 3700X 8-Core Processor
0a:00.0 VGA compatible controller: NVIDIA Corporation GA104 [GeForce RTX 3070] (rev a1)
Linux 6.2.0-36-generic #37~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC
NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2

Failure Information (for bugs)


Steps to Reproduce


The failing invocation:

./main -n -1 -c 8192 -ngl 0 --repeat_penalty 1.2 --color -i --mirostat 2 -m ../llama/gguf/deepseek-coder-6.7b-instruct.Q8_0.gguf --grammar-file grammar/any_text.gbnf --prompt Test

any_text.gbnf:

root ::= ([^\n]+ "\n")+
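
For reference, this grammar is deliberately permissive: it only requires one or more non-empty lines, each terminated by a newline. A rough standalone equivalent of what it accepts, written with std::regex purely as an illustration (not part of the repro):

#include <cassert>
#include <regex>

int main() {
    // Rough C++ equivalent of the GBNF rule
    //   root ::= ([^\n]+ "\n")+
    // i.e. one or more non-empty lines, each ending in '\n'.
    const std::regex root(R"((?:[^\n]+\n)+)");

    assert( std::regex_match("hello\nworld\n", root)); // two complete lines: accepted
    assert(!std::regex_match("", root));               // empty input: rejected
    assert(!std::regex_match("no newline", root));     // missing trailing '\n': rejected
    return 0;
}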

Failure Logs

The error happens immediately:

...
llm_load_print_meta: n_yarn_orig_ctx  = 16384
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = mostly Q8_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 6.67 GiB (8.50 BPW)
llm_load_print_meta: general.name   = deepseek-ai_deepseek-coder-6.7b-instruct
llm_load_print_meta: BOS token = 32013 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token = 32021 '<|EOT|>'
llm_load_print_meta: PAD token = 32014 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token  = 126 'Ä'
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  = 6830.87 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/35 layers to GPU
llm_load_tensors: VRAM used: 0.00 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: freq_base  = 100000.0
llama_new_context_with_model: freq_scale = 0.25
llama_new_context_with_model: kv self size  = 4096.00 MiB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 555.07 MiB
llama_new_context_with_model: VRAM scratch buffer: 552.00 MiB
llama_new_context_with_model: total VRAM used: 552.00 MiB (model: 0.00 MiB, context: 552.00 MiB)

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
main: interactive mode on.
sampling:
        repeat_last_n = 64, repeat_penalty = 1.200, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 2, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 8192, n_batch = 512, n_predict = -1, n_keep = 0


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

Testterminate called after throwing an instance of 'std::out_of_range'
  what():  _Map_base::at
