llama.cpp BPE tokenization of wiki.test does not match the HF tokenization

I ran the following test to tokenize `wiki.test.raw` with both the llama.cpp tokenizer and the Python (HF) tokenizer.
The expectation is that the two outputs match:

```bash
# generate ggml-vocab-falcon.gguf
./convert-falcon-hf-to-gguf.py --vocab-only ~/development/huggingface/falcon-7b/ --outfile ./models/ggml-vocab-falcon.gguf

# tokenize using Python
python3 tests/test-tokenizer-0-falcon.py ~/development/huggingface/falcon-7b/ --fname-tok ./build/wikitext-2-raw/wiki.test.raw

# tokenize using llama.cpp
cd build
make -j
./bin/test-tokenizer-0-falcon ../models/ggml-vocab-falcon.gguf ./wikitext-2-raw/wiki.test.raw

# compare the results
cmp ./wikitext-2-raw/wiki.test.raw.tok ./wikitext-2-raw/wiki.test.raw.tokcpp
./wikitext-2-raw/wiki.test.raw.tok ./wikitext-2-raw/wiki.test.raw.tokcpp differ: char 1, line 1
```

The results are close but not identical. Any ideas why the test does not pass?
I thought that #3252 would resolve this.
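
Since `cmp` only reports the first differing byte, a small script along these lines might help narrow down where the two outputs diverge. This is a minimal sketch, not part of the repo: it assumes both output files are line-oriented UTF-8 text, and the helper names (`read_lines`, `first_mismatch`, `count_mismatches`) are hypothetical.

```python
import sys
from itertools import zip_longest


def read_lines(path):
    """Read a text file into a list of lines (assumes UTF-8, line-oriented output)."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.read().splitlines()


def first_mismatch(lines_a, lines_b):
    """Return (line_number, a, b) for the first differing line, or None if identical.

    A missing line (one file shorter than the other) shows up as None.
    """
    for i, (a, b) in enumerate(zip_longest(lines_a, lines_b), start=1):
        if a != b:
            return i, a, b
    return None


def count_mismatches(lines_a, lines_b):
    """Count positions at which the two line sequences differ."""
    return sum(a != b for a, b in zip_longest(lines_a, lines_b))


# Only run the comparison when two file paths are given on the command line.
if __name__ == "__main__" and len(sys.argv) == 3:
    a = read_lines(sys.argv[1])
    b = read_lines(sys.argv[2])
    hit = first_mismatch(a, b)
    if hit is None:
        print("outputs are identical")
    else:
        line_no, left, right = hit
        print(f"first difference at line {line_no}:")
        print(f"  {sys.argv[1]}: {left!r}")
        print(f"  {sys.argv[2]}: {right!r}")
        print(f"differing positions: {count_mismatches(a, b)} of {max(len(a), len(b))}")
```

Saved under a name such as `first_mismatch.py`, it would be called with the `.tok` and `.tokcpp` paths as its two arguments. Note that the positional count overstates the damage once one tokenizer emits an extra or missing token, because every later line is then shifted; the first mismatch is the more reliable signal.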

cc @goerch 
