whisper : add support for backends with multiple ggml_backend_buffer_type #2863

eddnjjn · 2025-03-04T14:04:11Z

This patch adds support in whisper.cpp for backends that expose more than one ggml_backend_buffer_type. On Arm devices, this enables the aarch64 and KleidiAI kernels to accelerate matmul operations.

Signed-off-by: Dan Johansson <[email protected]>

bradmurray-dt · 2025-03-11T15:15:38Z

Is anything additional needed to run this? Do you have any performance comparisons?

eddnjjn · 2025-03-12T14:14:11Z

> Is anything additional needed to run this? Do you have any performance comparisons?

Note that this patch mainly targets Arm devices. If you build on the same machine you run on (-march=native), you shouldn't need to add any compiler flags, as cmake takes care of this. If you cross-compile, you need to specify the target CPU architecture (e.g. -DCMAKE_C_FLAGS=-march=armv8.2-a+dotprod+i8mm+sve -DCMAKE_CXX_FLAGS=-march=armv8.2-a+dotprod+i8mm+sve).

If you want to run with Arm® KleidiAI™, add -DGGML_CPU_KLEIDIAI=ON to the cmake command line options.

Also, you must quantize the model to Q4_0, as this is the format supported by the aarch64 and KleidiAI kernels.

On a Pixel 8 device, this patch gives a 1.44-1.7x performance increase for whisper-bench using the medium.en model, based on the reported total time. Below is the whisper-bench output for 1-4 threads on the Pixel 8, without and with this patch.

Output from whisper-bench running on Pixel 8

main branch (`fc7b1ee`)

LD_LIBRARY_PATH=. ./whisper-bench -m medium-q4_0.bin -t 1
whisper_print_timings: load time = 311.55 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 34401.29 ms / 1 runs (34401.29 ms per run)
whisper_print_timings: decode time = 11087.40 ms / 256 runs ( 43.31 ms per run)
whisper_print_timings: batchd time = 13467.27 ms / 320 runs ( 42.09 ms per run)
whisper_print_timings: prompt time = 115861.49 ms / 4096 runs ( 28.29 ms per run)
whisper_print_timings: total time = 174820.05 ms

LD_LIBRARY_PATH=. ./whisper-bench -m medium-q4_0.bin -t 2
whisper_print_timings: load time = 278.75 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 23314.24 ms / 1 runs (23314.24 ms per run)
whisper_print_timings: decode time = 7970.02 ms / 256 runs ( 31.13 ms per run)
whisper_print_timings: batchd time = 8333.07 ms / 320 runs ( 26.04 ms per run)
whisper_print_timings: prompt time = 62768.42 ms / 4096 runs ( 15.32 ms per run)
whisper_print_timings: total time = 102388.12 ms

LD_LIBRARY_PATH=. ./whisper-bench -m medium-q4_0.bin -t 3
whisper_print_timings: load time = 279.81 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 15143.52 ms / 1 runs (15143.52 ms per run)
whisper_print_timings: decode time = 6830.06 ms / 256 runs ( 26.68 ms per run)
whisper_print_timings: batchd time = 6372.15 ms / 320 runs ( 19.91 ms per run)
whisper_print_timings: prompt time = 46688.25 ms / 4096 runs ( 11.40 ms per run)
whisper_print_timings: total time = 75036.06 ms

LD_LIBRARY_PATH=. ./whisper-bench -m medium-q4_0.bin -t 4
whisper_print_timings: load time = 275.25 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 11744.57 ms / 1 runs (11744.57 ms per run)
whisper_print_timings: decode time = 5819.78 ms / 256 runs ( 22.73 ms per run)
whisper_print_timings: batchd time = 5133.26 ms / 320 runs ( 16.04 ms per run)
whisper_print_timings: prompt time = 37920.15 ms / 4096 runs ( 9.26 ms per run)
whisper_print_timings: total time = 60619.95 ms

PR#2863 enabled (running with KleidiAI)

LD_LIBRARY_PATH=. ./whisper-bench -m medium-q4_0.bin -t 1
whisper_print_timings: load time = 397.04 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 19907.53 ms / 1 runs (19907.53 ms per run)
whisper_print_timings: decode time = 7629.50 ms / 256 runs ( 29.80 ms per run)
whisper_print_timings: batchd time = 7538.61 ms / 320 runs ( 23.56 ms per run)
whisper_print_timings: prompt time = 66405.38 ms / 4096 runs ( 16.21 ms per run)
whisper_print_timings: total time = 101483.28 ms

LD_LIBRARY_PATH=. ./whisper-bench -m medium-q4_0.bin -t 2
whisper_print_timings: load time = 393.19 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 14440.07 ms / 1 runs (14440.07 ms per run)
whisper_print_timings: decode time = 5991.29 ms / 256 runs ( 23.40 ms per run)
whisper_print_timings: batchd time = 5250.08 ms / 320 runs ( 16.41 ms per run)
whisper_print_timings: prompt time = 45144.96 ms / 4096 runs ( 11.02 ms per run)
whisper_print_timings: total time = 70828.62 ms

LD_LIBRARY_PATH=. ./whisper-bench -m medium-q4_0.bin -t 3
whisper_print_timings: load time = 400.55 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 9718.67 ms / 1 runs ( 9718.67 ms per run)
whisper_print_timings: decode time = 5209.60 ms / 256 runs ( 20.35 ms per run)
whisper_print_timings: batchd time = 3819.17 ms / 320 runs ( 11.93 ms per run)
whisper_print_timings: prompt time = 32099.34 ms / 4096 runs ( 7.84 ms per run)
whisper_print_timings: total time = 50848.71 ms

LD_LIBRARY_PATH=. ./whisper-bench -m medium-q4_0.bin -t 4
whisper_print_timings: load time = 378.32 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: encode time = 7698.60 ms / 1 runs ( 7698.60 ms per run)
whisper_print_timings: decode time = 4945.30 ms / 256 runs ( 19.32 ms per run)
whisper_print_timings: batchd time = 3219.06 ms / 320 runs ( 10.06 ms per run)
whisper_print_timings: prompt time = 26154.78 ms / 4096 runs ( 6.39 ms per run)
whisper_print_timings: total time = 42019.84 ms
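
The quoted 1.44-1.7x figure can be checked against the timings above. The following is a minimal sketch in Python, not part of whisper.cpp: the dictionary and function names are illustrative assumptions, and the numbers are the `total time` and `encode time` values copied from the two sets of runs.

```python
# Timings in ms, copied from the whisper_print_timings lines above.
# Keys are the -t thread counts; all names here are illustrative only.
baseline_total_ms = {1: 174820.05, 2: 102388.12, 3: 75036.06, 4: 60619.95}  # main branch (fc7b1ee)
patched_total_ms = {1: 101483.28, 2: 70828.62, 3: 50848.71, 4: 42019.84}    # PR#2863 with KleidiAI

baseline_encode_ms = {1: 34401.29, 2: 23314.24, 3: 15143.52, 4: 11744.57}
patched_encode_ms = {1: 19907.53, 2: 14440.07, 3: 9718.67, 4: 7698.60}


def speedups(before, after):
    """Return {threads: before / after} for thread counts present in both runs."""
    return {t: before[t] / after[t] for t in sorted(before) if t in after}


total_speedup = speedups(baseline_total_ms, patched_total_ms)
encode_speedup = speedups(baseline_encode_ms, patched_encode_ms)

for threads in total_speedup:
    print(f"-t {threads}: total {total_speedup[threads]:.2f}x, "
          f"encode {encode_speedup[threads]:.2f}x")
```

Dividing the main-branch time by the patched time gives a ratio above 1 whenever the patch is faster, so the smallest and largest values of `total_speedup` bound the overall improvement across thread counts.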

ggerganov · 2025-03-13T13:24:45Z

Looks like a good addition, but we need to do some testing and make sure everything works correctly. Testing is a bit tedious atm, because we don't have good CI, so any feedback from the community on whether this branch works as expected is very welcome.

eddnjjn · 2025-03-19T14:48:16Z

Just want to add that I’ve tested the patch using whisper-cli and whisper-bench in the following environments:

- Linux x86 (Ubuntu 20.04.6) – CPU (aarch64) backend
- Android Pixel 8 – CPU (aarch64, KleidiAI) backend
- macOS – Metal and CPU (with/without BLAS, aarch64, KleidiAI) backends

ggerganov · 2025-03-20T08:13:41Z

Thanks for the update. I'll do some testing soon on my devices and if everything looks OK, will merge.

ggerganov

I have done some testing on Mac and my Linux box and things appear to be functional. So I think we can proceed to merge this.

ggerganov · 2025-03-25T14:33:56Z

src/whisper-arch.h

+// SPDX-FileCopyrightText: Copyright 2025 Arm Limited and/or its affiliates <[email protected]>
+// SPDX-License-Identifier: MIT
+//
+


Please remove this copyright notice.

eddnjjn added 2 commits March 4, 2025 14:50

whisper : add support for ggml_backend_buffer_type

ee19e15

Signed-off-by: Dan Johansson <[email protected]>

fix compile error when building on Ubuntu

e32becc

Signed-off-by: Dan Johansson <[email protected]>

eddnjjn changed the title ~~whisper : add support for ggml_backend_buffer_type~~ whisper : add support for backends with multiple ggml_backend_buffer_type Mar 10, 2025

ggerganov approved these changes Mar 25, 2025

remove copyright header from include file

eb1357d

Signed-off-by: Dan Johansson <[email protected]>

ggerganov merged commit 21d890d into ggml-org:master Mar 26, 2025
48 checks passed
