Closed
Description
I did the following test to tokenize wiki.test.raw
using our tokenizer and the Python tokenizer.
The expectation is that the outputs will match:
# generate ggml-vocab-falcon.gguf
./convert-falcon-hf-to-gguf.py --vocab-only ~/development/huggingface/falcon-7b/ --outfile ./models/ggml-vocab-falcon.gguf
# tokenize using Python
python3 tests/test-tokenizer-0-falcon.py ~/development/huggingface/falcon-7b/ --fname-tok ./build/wikitext-2-raw/wiki.test.raw
# tokenize using llama.cpp
cd build
make -j
./bin/test-tokenizer-0-falcon ../models/ggml-vocab-falcon.gguf ./wikitext-2-raw/wiki.test.raw
# compare the results
cmp ./wikitext-2-raw/wiki.test.raw.tok ./wikitext-2-raw/wiki.test.raw.tokcpp
./wikitext-2-raw/wiki.test.raw.tok ./wikitext-2-raw/wiki.test.raw.tokcpp differ: char 1, line 1
The results are pretty close, but not exactly the same. Any ideas why the test does not pass?
I thought that #3252 would resolve this
cc @goerch