
WHISPER_ASSERT failure with -dtw option #2301

Closed
@NWalker1208

When attempting to use the -dtw option, whisper.cpp/main crashes with a WHISPER_ASSERT error after printing the first few lines of the transcription.

Here is the command I'm running:

whisper.cpp/main -f audio.wav -m ggml-model-whisper-small.bin -ojf -of transcription -dtw small

Here is the output I get:
(I've redacted the transcription text, but the text that was printed matched the audio in my input file.)

whisper_init_from_file_with_params_no_state: loading model from 'ggml-model-whisper-small.bin'
whisper_init_with_params_no_state: use gpu    = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw        = 1
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 3 (small)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load:      CPU total size =   487.01 MB
whisper_model_load: model size    =  487.01 MB
whisper_mel_init: n_len = 6000, n_len_org = 6000, n_mel = 80
whisper_init_state: kv self size  =   56.62 MB
whisper_init_state: kv cross size =   56.62 MB
whisper_init_state: kv pad  size  =    4.72 MB
whisper_init_state: alignment heads masks size = 480 B
whisper_init_state: compute buffer (conv)   =   22.41 MB
whisper_init_state: compute buffer (encode) =  280.07 MB
whisper_init_state: compute buffer (cross)  =    6.18 MB
whisper_init_state: compute buffer (decode) =  198.64 MB

system_info: n_threads = 4 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0

main: processing 'audio.wav' (1598648 samples, 99.9 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...

whisper_mel_init: n_len = 12991, n_len_org = 9991, n_mel = 80

[00:00:00.000 --> 00:00:08.000]   <redacted>
[00:00:08.000 --> 00:00:13.440]   <redacted>
[00:00:13.440 --> 00:00:20.480]   <redacted>
[00:00:20.480 --> 00:00:25.680]   <redacted>
WHISPER_ASSERT: src/whisper.cpp:7224: nth == 1
WHISPER_ASSERT: src/whisper.cpp:7224: nth == 1
WHISPER_ASSERT: src/whisper.cpp:7224: nth == 1
WHISPER_ASSERT: src/whisper.cpp:7224: nth == 1
Aborted

Removing -dtw small from the command allows it to run successfully for the full length of the input file (1:39.6). However, I'm trying to get the more accurate DTW timestamps.

I am running whisper.cpp in WSL, with inference happening on the CPU.
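To make the failure pattern in the log above easier to compare across runs, here is a minimal sketch that counts the `WHISPER_ASSERT` lines and reads the thread count from the `system_info` line. It assumes the log format shown in this report; the helper name `summarize_asserts` is hypothetical and not part of whisper.cpp.

```python
import re
from collections import Counter

# Matches lines such as "WHISPER_ASSERT: src/whisper.cpp:7224: nth == 1"
ASSERT_RE = re.compile(r"^WHISPER_ASSERT: (?P<file>[^:]+):(?P<line>\d+): (?P<expr>.+)$")
# Matches the thread count in the system_info line, e.g. "n_threads = 4 / 16"
THREADS_RE = re.compile(r"n_threads = (\d+)")


def summarize_asserts(log_text):
    """Return (n_threads or None, Counter keyed by (file, line, expr))."""
    n_threads = None
    hits = Counter()
    for raw in log_text.splitlines():
        line = raw.strip()
        m = ASSERT_RE.match(line)
        if m:
            hits[(m["file"], int(m["line"]), m["expr"])] += 1
            continue
        if n_threads is None:
            t = THREADS_RE.search(line)
            if t:
                n_threads = int(t.group(1))
    return n_threads, hits


if __name__ == "__main__":
    # Excerpt of the output from this report (hypothetical variable name).
    sample = """\
system_info: n_threads = 4 / 16 | AVX = 1 | AVX2 = 1
WHISPER_ASSERT: src/whisper.cpp:7224: nth == 1
WHISPER_ASSERT: src/whisper.cpp:7224: nth == 1
WHISPER_ASSERT: src/whisper.cpp:7224: nth == 1
WHISPER_ASSERT: src/whisper.cpp:7224: nth == 1
Aborted
"""
    n_threads, hits = summarize_asserts(sample)
    for (file, line, expr), count in hits.items():
        print(f"{file}:{line}: '{expr}' failed {count} time(s); n_threads = {n_threads}")
```

On the output above, the same assertion is reported four times and the run used 4 threads. That match may be a useful hint, but on its own it does not establish the cause.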

Pinging @denersc since he seems to have been the one who implemented DTW in whisper.cpp.
