Closed
Description
When attempting to use the -dtw option, whisper.cpp/main crashes with a WHISPER_ASSERT error after printing the first few lines of the transcription.
Here is the command I'm running:
whisper.cpp/main -f audio.wav -m ggml-model-whisper-small.bin -ojf -of transcription -dtw small
Here is the output I get:
(I've redacted the transcription text, but the output I saw is accurate to what you hear in my input file)
whisper_init_from_file_with_params_no_state: loading model from 'ggml-model-whisper-small.bin'
whisper_init_with_params_no_state: use gpu = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw = 1
whisper_model_load: loading model
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 768
whisper_model_load: n_text_head = 12
whisper_model_load: n_text_layer = 12
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 3 (small)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs = 99
whisper_model_load: CPU total size = 487.01 MB
whisper_model_load: model size = 487.01 MB
whisper_mel_init: n_len = 6000, n_len_org = 6000, n_mel = 80
whisper_init_state: kv self size = 56.62 MB
whisper_init_state: kv cross size = 56.62 MB
whisper_init_state: kv pad size = 4.72 MB
whisper_init_state: alignment heads masks size = 480 B
whisper_init_state: compute buffer (conv) = 22.41 MB
whisper_init_state: compute buffer (encode) = 280.07 MB
whisper_init_state: compute buffer (cross) = 6.18 MB
whisper_init_state: compute buffer (decode) = 198.64 MB
system_info: n_threads = 4 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0
main: processing 'audio.wav' (1598648 samples, 99.9 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...
whisper_mel_init: n_len = 12991, n_len_org = 9991, n_mel = 80
[00:00:00.000 --> 00:00:08.000] <redacted>
[00:00:08.000 --> 00:00:13.440] <redacted>
[00:00:13.440 --> 00:00:20.480] <redacted>
[00:00:20.480 --> 00:00:25.680] <redacted>
WHISPER_ASSERT: src/whisper.cpp:7224: nth == 1
WHISPER_ASSERT: src/whisper.cpp:7224: nth == 1
WHISPER_ASSERT: src/whisper.cpp:7224: nth == 1
WHISPER_ASSERT: src/whisper.cpp:7224: nth == 1
Aborted
Removing -dtw small from the command allows it to run successfully for the full length of the input file (1:39.6). However, I'm trying to get the more accurate DTW timestamps.
I am running whisper.cpp in WSL, with inference happening on the CPU.
Pinging @denersc since he seems to have been the one who implemented DTW in whisper.cpp.